
Multi-Modal (5)

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs fro.. (arxiv.org) 1. Introduction: Current investigations focus on un.. 2025. 3. 5.
[Survey] VQA. 1. VQA: Visual Question Answering (2015): the image is passed through a CNN encoder and the question through an LSTM encoder, and the resulting vectors are combined; uses a pretrained VGG-16 and an LSTM. 2. Hierarchical Question-Image Co-Attention for Visual Question Answering (2016): uses attention to model the relationship between the image and the question; image feature extraction is largely unchanged; the LSTM structure is made hierarchical to extract more semantic information from the question; the image and question are attended over to produce a unified context vector.. 2024. 3. 24.
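
A minimal sketch of the 2015 VQA baseline described in that entry, under stated assumptions (module and variable names are illustrative, and element-wise product is just one common fusion choice, not necessarily the paper's): a pretrained VGG-16 encodes the image, an LSTM encodes the question, and the two vectors are fused before an answer classifier.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        cnn = vgg16(weights="IMAGENET1K_V1")
        # keep everything up to the 4096-d fc7 layer as a frozen image encoder
        self.cnn = nn.Sequential(*list(cnn.features), cnn.avgpool, nn.Flatten(),
                                 *list(cnn.classifier[:-1]))
        for p in self.cnn.parameters():
            p.requires_grad = False
        self.img_proj = nn.Linear(4096, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = torch.tanh(self.img_proj(self.cnn(image)))    # (B, H)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))     # final hidden state
        q_feat = h_n[-1]                                          # (B, H)
        fused = img_feat * q_feat        # combine the image and question vectors
        return self.classifier(fused)    # answer logits

# usage: logits = VQABaseline(vocab_size=10000, num_answers=1000)(
#     torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 14)))
```
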
🦩 Flamingo: a Visual Language Model for Few-Shot Learning. Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propo.. (arxiv.org) 0. Abstract: Flamingo's key architectural advances: (1) bridging a powerful pretrained vision-only model and a pretrained language-only model, (2) .. 2024. 3. 20.
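
A hedged sketch of the bridging idea in point (1) above: new, trainable gated cross-attention blocks are inserted between frozen language-model layers so that text tokens can attend to visual features; with tanh gates initialized to zero, the frozen LM's behavior is unchanged at the start of training. Names and dimensions are assumptions, not Flamingo's actual implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # gates start at zero so the pretrained language model is initially untouched
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text queries attend to visual keys/values produced by the frozen vision model
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x  # passed on to the next frozen language-model block
```
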
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations t.. (arxiv.org) 1. Abstract: VATT uses the Transformer architecture to learn from unlabeled.. 2024. 3. 4.
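
A minimal, hedged sketch of the VATT idea as summarized above: raw video, audio, and text are linearly projected into token sequences (no convolutions), encoded by Transformer encoders, and projected into a common space where modality pairs can be aligned with a contrastive loss. Patch sizes, dimensions, and class names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def make_encoder(dim=512, depth=4, heads=8):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class VATTSketch(nn.Module):
    def __init__(self, dim=512, vocab_size=30000):
        super().__init__()
        # linear "tokenizers": flatten-and-project raw patches, convolution-free
        self.video_proj = nn.Linear(3 * 4 * 16 * 16, dim)   # 4x16x16 space-time patch
        self.audio_proj = nn.Linear(128, dim)                # 128-sample waveform segment
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.video_enc, self.audio_enc, self.text_enc = (make_encoder(dim) for _ in range(3))
        self.to_common = nn.Linear(dim, 256)                 # common-space projection head

    def encode(self, tokens, encoder):
        return self.to_common(encoder(tokens).mean(dim=1))   # mean-pool, then project

    def forward(self, video_patches, audio_segments, text_ids):
        v = self.encode(self.video_proj(video_patches), self.video_enc)
        a = self.encode(self.audio_proj(audio_segments), self.audio_enc)
        t = self.encode(self.text_embed(text_ids), self.text_enc)
        return v, a, t  # representations for a contrastive (e.g. NCE-style) objective
```
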