Multimodal learning
General-purpose neural networks capable of handling diverse input modalities and output tasks
# Resources
- Multimodal Deep Learning
- https://paperswithcode.com/methods/category/vision-and-language-pre-trained-models
- Vision Language models: towards multi-modal deep learning
# Code
- #CODE PyKale - Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research
- #CODE UniLM - Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
# Books
- #BOOK Multimodal Deep Learning (Akkus 2023) - https://slds-lmu.github.io/seminar_multimodal_dl/index.html
# References
- #PAPER Multi-modal Transformer for Video Retrieval (Gabeur 2020)
- #PAPER #REVIEW Recent Advances and Trends in Multimodal Deep Learning: A Review (Summaira 2021)
- #PAPER Perceiver: General Perception with Iterative Attention (Jaegle 2021)
- https://www.zdnet.com/article/googles-supermodel-deepmind-perceiver-is-a-step-on-the-road-to-an-ai-machine-that-could-process-everything/
- Multi-modal model handling images, audio, video, and 3D point clouds; see the sketch below
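The core trick is a small latent array that repeatedly cross-attends to a large, modality-agnostic input array, keeping attention cost linear in input size. A minimal PyTorch sketch of that idea (positional/Fourier feature encoding and other details from the paper are omitted; all names here are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    """One iteration: latents cross-attend to the inputs, then self-attend."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, latents, inputs):
        # The small latent array queries the large, modality-agnostic input array
        attended, _ = self.cross_attn(latents, inputs, inputs)
        latents = self.norm1(latents + attended)
        refined, _ = self.self_attn(latents, latents, latents)
        return self.norm2(latents + refined)

class TinyPerceiver(nn.Module):
    def __init__(self, input_dim, dim=256, num_latents=64, depth=4):
        super().__init__()
        self.embed = nn.Linear(input_dim, dim)  # per-element embedding of the raw byte array
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.blocks = nn.ModuleList([PerceiverBlock(dim) for _ in range(depth)])

    def forward(self, inputs):                  # inputs: (batch, seq_len, input_dim)
        x = self.embed(inputs)
        lat = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        for block in self.blocks:               # iterative attention over the same inputs
            lat = block(lat, x)
        return lat.mean(dim=1)                  # pooled, modality-agnostic representation

# Any modality flattened to a sequence of feature vectors works,
# e.g. RGB pixels, audio samples, or point-cloud coordinates:
feats = TinyPerceiver(input_dim=3)(torch.randn(2, 10_000, 3))
print(feats.shape)  # torch.Size([2, 256])
```

Because the inputs are just a flat sequence of feature vectors, the same module ingests pixels, audio samples, or point clouds unchanged.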
- #PAPER PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python (Lu 2021) ^pykale
- #PAPER Perceiver IO: A General Architecture for Structured Inputs & Outputs (Jaegle 2021)
- #PAPER VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (Akbari 2021)
- #CODE https://paperswithcode.com/paper/vatt-transformers-for-multimodal-self
- VATT is trained to learn multimodal representations from unlabeled data using Transformer architectures and a contrastive objective (sketched below)
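The alignment objective is contrastive: embeddings of the same clip in different modalities are pulled together, while other pairs in the batch are pushed apart. A minimal sketch of a symmetric InfoNCE loss, assuming PyTorch (illustrative only; VATT's actual objective pairs NCE for video-audio with MIL-NCE for video-text):

```python
import torch
import torch.nn.functional as F

def infonce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of modality embeddings.

    a, b: (batch, dim) embeddings of the same clips in two modalities
    (e.g. video and audio). Matching rows are positives; all other
    pairs in the batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

video_emb = torch.randn(8, 512)   # stand-ins for Transformer outputs
audio_emb = torch.randn(8, 512)
print(infonce(video_emb, audio_emb).item())
```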
- #PAPER NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (Wu 2021)
- #CODE https://paperswithcode.com/paper/nuwa-visual-synthesis-pre-training-for-neural
- NÜWA consists of an adaptive encoder that takes either text or visual input, and a pre-trained decoder shared by 8 visual tasks
- A 3D Nearby Attention mechanism (3DNA) is proposed to reduce computational complexity and improve visual quality by exploiting the locality of visual data along both the spatial and temporal axes (see the mask sketch below)
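A rough sketch of the locality idea behind 3DNA (not the paper's implementation): each token at position (t, h, w) may only attend to tokens within a fixed 3D neighbourhood. The dense boolean mask below just illustrates which pairs survive; a real implementation computes the neighbourhood sparsely so cost scales with the window size rather than the full token count.

```python
import torch

def nearby_attention_mask(T, H, W, extent=1):
    """Boolean mask of shape (N, N), N = T*H*W: True where attention is allowed.

    Each (t, h, w) token may only attend to tokens within `extent` steps
    along the temporal and both spatial axes, i.e. the locality assumption
    behind 3D Nearby Attention.
    """
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W),
        indexing="ij"), dim=-1).reshape(-1, 3)            # (N, 3) positions
    diff = (coords[:, None, :] - coords[None, :, :]).abs()  # (N, N, 3)
    return (diff <= extent).all(dim=-1)                   # (N, N) bool

mask = nearby_attention_mask(T=4, H=8, W=8, extent=1)
print(mask.shape, mask.float().mean())  # full attention would give mean 1.0
```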
- #PAPER data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language (Baevski 2022)
- #PAPER A Generalist Agent (Reed 2022)
- New approach, inspired by large-scale language models, that acts as a single generalist agent. The agent, called Gato, is built to work as a multi-modal, multi-task, multi-embodiment generalist policy; the serialization idea is sketched below
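A toy illustration of the shared-vocabulary idea, assuming NumPy: every modality is mapped to integer tokens from disjoint ranges and flattened into one sequence for a single autoregressive Transformer. Gato's real scheme differs in detail (it mu-law-compands continuous values before binning and embeds image patches directly); the vocabulary layout below is made up.

```python
import numpy as np

TEXT_VOCAB = 32_000                              # made-up layout: text ids come first,
OBS_BINS, OBS_OFFSET = 1024, TEXT_VOCAB          # then observation bins,
ACT_BINS, ACT_OFFSET = 1024, TEXT_VOCAB + 1024   # then action bins

def tokenize_continuous(x, bins, offset, low=-1.0, high=1.0):
    """Uniformly bin continuous values into a reserved integer range.
    (Gato itself mu-law-compands values before binning; uniform bins
    keep this sketch short.)"""
    t = np.clip((x - low) / (high - low), 0.0, 1.0)
    return (t * (bins - 1)).astype(int) + offset

# One timestep of an episode: observation tokens, then action tokens,
# all drawn from one shared vocabulary so a single autoregressive
# Transformer can model text, vision, and control as one sequence.
obs = tokenize_continuous(np.random.uniform(-1, 1, size=8), OBS_BINS, OBS_OFFSET)
act = tokenize_continuous(np.array([0.3, -0.7]), ACT_BINS, ACT_OFFSET)
print(np.concatenate([obs, act]))
```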
- #PAPER Language Models are General-Purpose Interfaces (Hao 2022)
- #PAPER NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (Wu 2022)
# Vision and language models
- #PAPER DALL-E - Creating Images from Text (Ramesh 2021)
- #PAPER Learning Transferable Visual Models From Natural Language Supervision (Radford 2021)
- #PAPER SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (Wang 2022)
- #PAPER OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (Wang 2022)
- #PAPER Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac 2022)
- #PAPER LViT: Language meets Vision Transformer in Medical Image Segmentation (Li 2022) ^lvit
- #PAPER Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Yu 2022)
- #CODE https://github.com/google-research/parti
- https://parti.research.google/
- Parti (Pathways Autoregressive Text-to-Image) is an autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge; see the decoding sketch below
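Generation in this family is plain next-token decoding: a text-conditioned Transformer emits discrete image tokens one by one, which an image tokenizer (a ViT-VQGAN in Parti's case) decodes back to pixels. A minimal sampling loop, with `model` as a hypothetical stand-in for the encoder-decoder:

```python
import torch

def sample_image_tokens(model, text_ids, num_tokens=256, temperature=1.0):
    """Autoregressively sample image tokens conditioned on text tokens.

    `model(text_ids, image_ids) -> logits over the image codebook` is a
    stand-in for a Parti-style encoder-decoder; the sampled ids would be
    decoded to pixels by an image tokenizer such as a ViT-VQGAN.
    """
    image_ids = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_tokens):
        logits = model(text_ids, image_ids)[:, -1]         # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)
        image_ids = torch.cat([image_ids, next_id], dim=1)
    return image_ids

# Dummy "model": uniform logits over an 8192-entry codebook, just to run the loop
dummy = lambda text, img: torch.zeros(1, img.size(1) + 1, 8192)
print(sample_image_tokens(dummy, torch.tensor([[1, 2, 3]]), num_tokens=8))
```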
- #PAPER CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (Ding 2022)
- #PAPER Imagen - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia 2022)
- https://imagen.research.google/
- Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. It builds on the power of large transformer language models for understanding text and on the strength of diffusion models for high-fidelity image generation; a guidance sketch follows below
- #CODE https://paperswithcode.com/paper/photorealistic-text-to-image-diffusion-models
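A key sampling-time ingredient is classifier-free guidance, which pushes the conditional noise prediction away from the unconditional one to strengthen text alignment (Imagen pairs large guidance weights with dynamic thresholding). A minimal sketch, with `eps_model` as a hypothetical text-conditioned noise predictor:

```python
import torch

def guided_noise(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.0):
    """Classifier-free guidance for one denoising step.

    eps_model is a hypothetical noise predictor eps_model(x_t, t, cond);
    null_emb is the embedding of an empty prompt, learned by randomly
    dropping the text conditioning during training.
    """
    eps_cond = eps_model(x_t, t, text_emb)
    eps_uncond = eps_model(x_t, t, null_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Dummy predictor just to exercise the function
eps_model = lambda x, t, c: x * 0.1 + c.mean()
x_t = torch.randn(1, 3, 64, 64)   # 64x64 base resolution, as in Imagen's cascade
eps = guided_noise(eps_model, x_t, t=torch.tensor([500]),
                   text_emb=torch.randn(1, 512), null_emb=torch.zeros(1, 512))
print(eps.shape)
```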
- #PAPER Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Lu 2022)
- #PAPER #REVIEW A Survey of Vision-Language Pre-Trained Models (Du 2022)
- #PAPER #REVIEW The Creativity of Text-to-Image Generation (Oppenlaender 2022)
- #PAPER MultiMAE: Multi-modal Multi-task Masked Autoencoders (Bachmann 2022)
- #PAPER GLIGEN: Open-Set Grounded Text-to-Image Generation (Li 2023)
- #PAPER Scaling up GANs for Text-to-Image Synthesis (Kang 2023)
- #PAPER OpenFlamingo (Awadalla 2023)
- An open-source framework for training vision-language models with in-context learning (like GPT-4!)
- Includes a Python framework to train Flamingo-style LMMs, a large-scale multimodal dataset with interleaved image and text sequences, an in-context learning evaluation benchmark for vision-language tasks, and trained models (e.g. an OpenFlamingo-9B model based on LLaMA); the prompt format is sketched below
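In-context prompts interleave images with text. A small sketch of a two-shot captioning prompt; the `<image>` and `<|endofchunk|>` special-token names follow the project README (treat them as an assumption), the file names are placeholders, and model loading/preprocessing are omitted:

```python
# Build a two-shot, interleaved image/text prompt for Flamingo-style
# in-context learning. Each demonstration is: <image> + caption +
# <|endofchunk|>; the final <image> is the query the model completes.
demonstrations = [
    ("cat.jpg", "An image of a cat sleeping on a sofa."),
    ("dog.jpg", "An image of a dog catching a frisbee."),
]
query_image = "bird.jpg"

images = [path for path, _ in demonstrations] + [query_image]
prompt = "".join(
    f"<image>{caption}<|endofchunk|>" for _, caption in demonstrations
) + "<image>An image of"

print(prompt)
# `images` would go through the model's vision preprocessor and `prompt`
# through its tokenizer before generation (loading code omitted).
```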
- #PAPER Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (Ham 2023)
- #PAPER 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Bachmann 2024)