The course will run for one week and include both theoretical and practical sessions.
Day 1: Introduction to foundation models and their applications. We will also review the fundamental deep learning concepts needed to pave the way for the rest of the week.
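For a taste of the fundamentals Day 1 revisits, they can be condensed into a few lines of PyTorch: a forward pass, a loss, backpropagation, and a gradient step. The model and data below are purely illustrative, not material from the course itself.

```python
import torch

# A toy one-layer "network" standing in for the models reviewed on Day 1.
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)  # dummy input batch
y = torch.randn(8, 1)  # dummy targets

pred = model(x)                               # forward pass
loss = torch.nn.functional.mse_loss(pred, y)  # loss computation
loss.backward()                               # backpropagation
optimizer.step()                              # gradient descent update
optimizer.zero_grad()                         # reset gradients for the next step
```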
Day 2: We will study in depth the Transformer architecture [1] that underpins all the recent advances in foundation models. We will also see how to efficiently fine-tune large Transformer models for both Natural Language Processing (NLP) and Computer Vision (CV) tasks. Theory lessons will be complemented by hands-on sessions on fine-tuning Transformers on small datasets.
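As a preview of the hands-on session, here is a minimal sketch of one efficient fine-tuning strategy: freezing a pretrained backbone and training only the task head. The checkpoint name, label count, and hyperparameters are illustrative, not the exact setup used in class.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; the course may use a different model.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Freeze every parameter except the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# One dummy training step on a tiny hand-made batch.
batch = tokenizer(["great movie", "terrible movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # forward pass computes the loss
outputs.loss.backward()                  # gradients flow only into the head
optimizer.step()
```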
Day 3: We will study how LLMs [2] are pre-trained on large datasets and the objective functions used, and then how they are fine-tuned on smaller curated datasets using techniques such as instruction tuning and Reinforcement Learning from Human Feedback (RLHF). We will have a hands-on session on using LLMs through prompt engineering.
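To make prompt engineering concrete, the sketch below builds a few-shot prompt and feeds it to a small open model. The tiny "gpt2" checkpoint stands in for a much larger LLM, and the prompt template is illustrative only.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot prompting: in-context examples steer the model toward the task
# without any gradient updates.
prompt = (
    "Translate English to French.\n"
    "English: cheese -> French: fromage\n"
    "English: bread -> French: pain\n"
    "English: apple -> French:"
)
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```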
Day 4: We will study another family of foundation models that can process multimodal data [3] (e.g., images and text) rather than a single modality. We will also cover generative multimodal models (e.g., diffusion models [4]) that can generate synthetic images from a text query. The hands-on session will walk through running inference with these multimodal foundation models.
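As a preview of the inference walk-through, a text-to-image query against a latent diffusion model [4] takes only a few lines with the Hugging Face diffusers library. The checkpoint name is illustrative (any compatible Stable Diffusion checkpoint would do), and a GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; availability on the Hub is assumed.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # a CUDA GPU is assumed here

# Generate one synthetic image from a text query and save it to disk.
image = pipe("a watercolor painting of a mountain lake").images[0]
image.save("sample.png")
```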
Day 5: On the final day we will cover a more advanced class of augmented foundation models that can query external knowledge bases, utilities, tools, and resources to make more informed predictions for various tasks. We will make the idea concrete by designing a small application using foundation models.
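To hint at the idea, the toy sketch below shows only the plumbing of tool augmentation: a dispatcher executes textual tool calls that, in a real augmented model, the LLM itself would generate. The tool names and call format are hypothetical.

```python
from datetime import date

def calculator(expression: str) -> str:
    # WARNING: eval on untrusted input is unsafe; acceptable only in a toy demo.
    return str(eval(expression, {"__builtins__": {}}))

def today(_: str = "") -> str:
    return date.today().isoformat()

# Registry of external tools the model is allowed to invoke.
TOOLS = {"calculator": calculator, "today": today}

def run_tool_call(call: str) -> str:
    """Parse a textual call like 'calculator: 12 * 7' and dispatch it."""
    name, _, arg = call.partition(":")
    return TOOLS[name.strip()](arg.strip())

# In a real augmented model, the LLM would emit these strings itself.
print(run_tool_call("calculator: 12 * 7"))  # -> 84
print(run_tool_call("today:"))              # -> current date
```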
Teaching methods
The course consists of lectures and hands-on practical sessions. Lectures will be delivered via interactive slides. Hands-on sessions will use Jupyter notebooks and the PyTorch framework. Basic knowledge of Python is beneficial for following along with the hands-on sessions. Students must bring a laptop for the hands-on sessions.
Assessment methods
Students will form teams of 3-4 (the exact size to be decided based on the number of enrolled participants) and choose one of two assessment modalities: (i) a detailed project presentation in which each team presents a well-known foundation model paper and shows how the model could be used for a real-world application of its choice, including a valid use-case analysis; (ii) a demo project that uses foundation model(s) for a downstream application of choice, presented as a very short talk or a live demo. Both modalities are graded pass/fail. Project presentations will be held on the final day of the course.
Bibliography
[1] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
[2] Brown, T. B., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[3] Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (pp. 12888-12900). PMLR.
[4] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
A minimum of 75% attendance is required.