VLM Model Overview

Oct 15, 2024
Large Model

Mini-Gemini

InternVL

MiniCPM-V-2_6

Pre-training

image

Stage-1. We randomly initialize the compression layer and train only this module in stage-1, keeping all other parameters frozen. The visual encoder's resolution is set to 224×224, the same as its pre-training setting. To warm up the compression layer, we randomly select 200M examples from the image-captioning data.
Stage-2. We extend the image resolution from 224×224 to 448×448 and train the whole visual encoder, leaving all other parameters frozen. To extend the pre-trained resolution, we additionally select 200M examples from the image-captioning data.
Stage-3. After extending the primary input resolution of the visual encoder, we finally train the visual modules with the adaptive visual encoding strategy, which further accommodates high-resolution inputs of any aspect ratio. Both the compression layer and the visual encoder are trained to adapt to the language model's embedding space, while the LLM is kept frozen to avoid disruption from the relatively low-quality pre-training data. Unlike the previous stages, which use only image-captioning data, this high-resolution pre-training stage additionally introduces OCR data to strengthen the visual encoder's OCR capability.
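To make the stage-wise schedule concrete, here is a minimal PyTorch sketch of the freeze/unfreeze logic described above. The submodule names (vision_encoder, compression_layer, llm) are placeholders for illustration, not MiniCPM-V's actual module names.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    set_trainable(model, False)  # freeze everything first
    if stage == 1:
        # Stage-1: train only the randomly initialized compression layer (224x224 input).
        set_trainable(model.compression_layer, True)
    elif stage == 2:
        # Stage-2: train only the visual encoder while the resolution grows to 448x448.
        set_trainable(model.vision_encoder, True)
    elif stage == 3:
        # Stage-3: adaptive encoding; train the visual encoder and the compression layer,
        # keeping the LLM frozen to shield it from low-quality pre-training data.
        set_trainable(model.vision_encoder, True)
        set_trainable(model.compression_layer, True)

class DummyVLM(nn.Module):  # toy stand-in for the real model
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.compression_layer = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)

model = DummyVLM()
configure_stage(model, stage=3)
print([name for name, p in model.named_parameters() if p.requires_grad])
```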

Supervised Fine-tuning

image

We unlock all model parameters to better exploit the data and learn rich knowledge during the SFT phase.
Recent works show that data near the end of training plays a more important role in shaping the models’ capabilities and response styles. We categorize the SFT data into two parts. Part-1 focuses on bolstering the models’ basic recognition capabilities, while part-2 is tailored to enhance their capabilities in generating detailed responses and following human instructions. Specifically, part-1 data consists of the traditional QA/captioning datasets with relatively short response lengths, which helps enhance the model’s basic recognition capabilities. In comparison, part-2 encompasses datasets featuring long responses with complex interactions, either in text or multimodal context. During SFT, these two parts of data are concatenated and sequentially fed into the model. For MiniCPM-Llama3-V 2.5, we integrate 2M data from the recent Cauldron dataset for multimodal knowledge augmentation, and 90K multilingual data over 36 languages for boosting the multilingual conversation capability.
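As a small illustration of the part-1 → part-2 ordering, the sketch below concatenates two toy dataset parts and feeds them sequentially (no cross-part shuffling), so the instruction-heavy data lands near the end of training. The datasets are placeholders, not the actual MiniCPM-V SFT mixture.

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class ListDataset(Dataset):
    """Tiny in-memory dataset; stands in for the real SFT datasets."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

part1 = ListDataset([f"short-qa-{i}" for i in range(4)])    # basic recognition data
part2 = ListDataset([f"long-chat-{i}" for i in range(4)])   # detailed, instruction-heavy data

sft_data = ConcatDataset([part1, part2])
# shuffle=False keeps the part-1 -> part-2 order, so part-2 sits near the end of training.
loader = DataLoader(sft_data, batch_size=2, shuffle=False)
for batch in loader:
    print(batch)
```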

LLaVA

Let's start by laying out LLaVA-OneVision's overall training strategy.

image

Stage-1

Data: the 558K subset of the LAION-CC-SBU dataset with BLIP captions

https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain

Stage-1.5

According to the GitHub repo, the data used in this stage (besides instruct_azure_dc_zh_92K) includes:

https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC3M

https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-118K

https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-558K

https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data

https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC12M

Single-image stage

image

image

The tables above look a bit intimidating; the exact mixing recipe has not yet been released in the repo.

OneVision Stage

image

image

  • Around 800K higher-quality samples re-sampled from the previous stage (yes, it's data replay!).
  • M4-Instruct data (multi-image)
  • Video Data
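The list above amounts to a simple data-replay mixture. The following is a rough, hypothetical sketch of how such a mixture could be assembled; the 800K figure comes from the list, while the sampling and shuffling details are assumptions.

```python
import random

def build_onevision_mixture(single_image_pool, multi_image_data, video_data,
                            replay_size=800_000, seed=0):
    rng = random.Random(seed)
    n = min(replay_size, len(single_image_pool))
    replay = rng.sample(single_image_pool, n)   # data replay from the single-image stage
    mixture = replay + list(multi_image_data) + list(video_data)
    rng.shuffle(mixture)                        # assumption: the final mixture is shuffled
    return mixture

# toy usage with placeholder samples
mix = build_onevision_mixture([f"si-{i}" for i in range(10)],
                              ["m4-0", "m4-1"], ["vid-0"], replay_size=5)
print(len(mix), mix[:3])
```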

NVLM

🔖 https://research.nvidia.com/labs/adlr/NVLM-1/

Model Architecture

image

  • NVLM-D (Decoder-only Model): the lower branch in the figure above. Image and text tokens are concatenated and decoded by the LLM (the same layout as most common VLMs), with 1-D flattened tile tags, i.e., text markers such as <tile_1>, <tile_2>, …, inserted before the corresponding tile tokens.
  • NVLM-X (X-attention Model): the upper branch in the figure above. Image and text interact through gated cross-attention (similar to Flamingo), with the same 1-D flattened tile tags added.
  • NVLM-H (Hybrid Model): a blend of X and D. Only the thumbnail and the text are fed into the LLM; gated cross-attention is added as well, letting the post-self-attention features cross-attend to the remaining image tiles.
    image
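To make the tile-tag idea concrete, here is a small sketch of how 1-D flattened tile tags could be interleaved with tile token sequences on the decoder-only path. The exact tag strings and the thumbnail-first ordering are assumptions; see the NVLM paper for the actual format.

```python
from typing import List

def build_decoder_sequence(thumbnail_tokens: List[str],
                           tile_tokens: List[List[str]]) -> List[str]:
    """Interleave text tile tags with the (flattened) tokens of each image tile."""
    seq: List[str] = []
    seq.append("<tile_global_thumbnail>")   # global thumbnail first (ordering is an assumption)
    seq.extend(thumbnail_tokens)
    for k, tokens in enumerate(tile_tokens, start=1):
        seq.append(f"<tile_{k}>")           # 1-D flattened tile tag for the k-th tile
        seq.extend(tokens)
    return seq

# toy example: a thumbnail and two tiles, three "tokens" each
print(build_decoder_sequence(["g1", "g2", "g3"],
                             [["t1a", "t1b", "t1c"], ["t2a", "t2b", "t2c"]]))
```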

Model Training

Pre-training

We freeze both the LLM backbone and the vision encoder for all models, and train only the modality-alignment modules, i.e., the projector MLP or the X-attention layers.

image

SFT

We keep the vision encoder frozen while training both the LLM and the modality-alignment modules.

image
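Both phases reduce to choosing which parameter groups are trainable. The sketch below illustrates that selection on a generic toy VLM; the submodule names are placeholders, not NVLM's real module names.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy stand-in with the three parameter groups discussed above."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(4, 4)
        self.alignment = nn.Linear(4, 4)   # projector MLP / X-attention stand-in
        self.llm = nn.Linear(4, 4)

def trainable_params(model: TinyVLM, phase: str):
    # pretrain: LLM + vision encoder frozen, only the modality-alignment modules train.
    # sft:      vision encoder frozen, LLM + modality-alignment modules train.
    for p in model.parameters():
        p.requires_grad = False
    modules = [model.alignment] + ([model.llm] if phase == "sft" else [])
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

model = TinyVLM()
optimizer = torch.optim.AdamW(trainable_params(model, phase="pretrain"), lr=1e-4)
print(sum(p.numel() for p in trainable_params(model, phase="sft")))
```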

Ovis

🔖 https://github.com/AIDC-AI/Ovis/tree/main

Model Architecture

image

Previous methods face a structural mismatch when aligning text features with visual features: text features come from a discrete embedding lookup (a 1-D sequence of token embeddings), whereas visual features are a sequence of continuous patch embeddings from the vision encoder. Most methods therefore use a fairly simple alignment scheme; LLaVA, for example, passes the visual features through an MLP to obtain a 1-D sequence of visual tokens and simply concatenates them with the text tokens before feeding the LLM.
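For reference, the LLaVA-style alignment mentioned above is essentially an MLP projector followed by concatenation, as in the toy sketch below (dimensions are made up; LLaVA-1.5 uses a 2-layer GELU MLP).

```python
import torch
import torch.nn as nn

d_vit, d_llm = 32, 64
# LLaVA-1.5 uses a 2-layer GELU MLP as the projector; dimensions here are toy values.
projector = nn.Sequential(nn.Linear(d_vit, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

patch_feats = torch.randn(5, d_vit)    # continuous patch embeddings from the ViT
text_embeds = torch.randn(7, d_llm)    # text token embeddings looked up from the LLM's table

visual_tokens = projector(patch_feats)                      # [5, d_llm]
llm_input = torch.cat([visual_tokens, text_embeds], dim=0)  # concatenated sequence for the LLM
print(llm_input.shape)                                      # torch.Size([12, 64])
```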

Ovis instead discretizes the visual side so that it takes the same form as the text, i.e., both are first mapped into a unified, lookup-style embedding space before alignment. Concretely, given a visual token \(r_i \in \mathbb{R}^d\) (roughly, one patch feature extracted by the ViT), it is processed through the components below.

Visual Embedding Table

Each visual word \(w_i\) acts as a prototype stored in the embedding table, written \(\{w_i \in \mathbb{R}^d\}_{i=1}^K\), where \(K\) is the vocabulary size, i.e., the visual vocabulary contains \(K\) distinct visual words.

Matching Mechanism

To match a continuous visual token \(r_i\) against the \(K\) visual words in the embedding table, similarity is measured with an inner product: the larger the inner product, the more similar \(r_i\) is to that visual word. After softmax normalization, \(v_i\) is the normalized similarity distribution of \(r_i\) over all visual words, i.e.,

\[ v_i=\mathrm{softmax}(\boldsymbol{W}\mathbf{r}_i),\quad\boldsymbol{W}\in\mathbb{R}^{K\times d} \]

Representing the visual token \(r_i\) as the probability distribution \(v_i\) aligns the visual features with the discrete visual vocabulary.

Visual Embedding Vectors

Furthermore, each visual word corresponds to an embedding vector \(e_k \in \mathbb{R}^{d'}\) in the embedding table, where \(d'\) is the embedding dimension.

To make the visual-token and text-token embeddings shape-compatible, the dimension of the visual embedding table is set equal to that of the text embedding table.

Concretely, given a visual token \(v_i \in \Delta^K\) (a probability distribution), its embedding vector \(V_i\) is computed from the visual embedding table as:

\[ V_i = \sum_{k=1}^K v_{i,k} e_k, \quad V_i \in \mathbb{R}^{d'} \]

where \(v_{i,k}\) is the \(k\)-th component of \(v_i\), i.e., the weight of the visual token on the \(k\)-th visual word. Equivalently, the formula can be written as:

\[ V_i = \mathbb{E}_{e_k \sim v_i}[e_k] \]

This shows that \(V_i\) is the expectation of the visual-word embeddings \(e_k\) weighted by the probability distribution \(v_i\).

  • Handling polysemy: a visual token can relate to several visual words at once rather than being forced onto a single one.
  • Weighted combination: the final embedding \(V_i\) mixes the visual-word embeddings according to \(v_i\).
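Putting the matching step and the embedding expectation together, here is a minimal PyTorch sketch of the probabilistic visual tokens; the dimensions are toy values and this is not the official Ovis implementation.

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Probabilistic visual tokens: v_i = softmax(W r_i), V_i = E_{e_k ~ v_i}[e_k]."""
    def __init__(self, d: int, d_prime: int, K: int):
        super().__init__()
        self.head = nn.Linear(d, K, bias=False)   # rows of W act as the K prototypes w_k
        self.table = nn.Embedding(K, d_prime)     # visual-word embeddings e_k

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: [num_patches, d] continuous visual tokens from the ViT
        v = torch.softmax(self.head(r), dim=-1)   # [num_patches, K], probability distributions
        V = v @ self.table.weight                 # [num_patches, d'], expectation over e_k
        return V

vet = VisualEmbeddingTable(d=32, d_prime=64, K=1024)
r = torch.randn(5, 32)          # 5 patch features
print(vet(r).shape)             # torch.Size([5, 64])
```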

Training Pipeline and Data

  • Stage 1: visual-encoder fine-tuning and vision-text alignment
  • Stage 2: strengthening visual understanding
  • Stage 3: multimodal instruction learning
    This progressive strategy first teaches the model basic vision-text alignment, then strengthens its visual understanding, and finally teaches it to follow multimodal instructions, building a well-rounded multimodal system. The training data reported in the paper are listed below.

image

A notable difference from other models is that Ovis spends a large amount of stage-1 data on training this visual embedding table.

The overall training hyperparameters are as follows:

image

Ross (Reconstructive Visual Instruction Tuning)

🔖 https://haochen-wang409.github.io/ross/

Ross (Reconstructive Visual Instruction Tuning) is a training method for large multimodal models. Its core idea is to provide a visual supervision signal by reconstructing the input image, thereby strengthening the model's visual understanding.

The conditional causal distribution of a conventional multimodal model can be written as:

\[ p_\Theta(x) = \prod_{i=1}^T p_\Theta(x_i \mid x_{<i}, v), \quad v = H_\phi \circ G_\xi(I) \]

where:

  • \(x_i\) denotes the \(i\)-th text token
  • \(\Theta = \{\theta, \xi, \phi\}\) denotes the model parameters
  • \(v \in \mathbb{R}^{N\times D}\) denotes the projected visual tokens
  • \(N\) is the number of visual tokens
  • \(D\) is the number of feature channels
    Conventional visual instruction tuning supervises only the text output; the corresponding training objective is:
\[ L_{LMM}^{text}(\Theta = \{\theta, \xi, \phi\}, x, I) = -\frac{1}{T-N}\sum_{i=N+1}^T \log p_\Theta(x_i|x_{<i}, v) \]
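In code, this objective is just a cross-entropy that ignores the visual positions, as in the toy sketch below (not tied to any particular LMM implementation):

```python
import torch
import torch.nn.functional as F

T, N, vocab = 10, 4, 100                  # sequence length, number of visual tokens, vocab size
logits = torch.randn(T, vocab)            # per-position next-token logits
targets = torch.randint(0, vocab, (T,))   # ground-truth token ids

targets[:N] = -100                        # visual positions carry no text supervision
loss_text = F.cross_entropy(logits, targets, ignore_index=-100)  # averages over the T - N text terms
print(loss_text.item())
```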

Model Architecture

The overall idea of Ross is to build a reconstructive visual supervision signal on the visual outputs \(x_{i\leq N}\). The training objective consists of:

  • the ordinary next-token prediction on \(x_{i>N}\), shown on the right of the figure below;
  • an additional reconstruction term, shown on the left, so that \(L_{Ross} = L_{LMM}^{text} + L_{LMM}^{visual}\).
    Concretely, the visual term can be any custom measure between \(x_{i\leq N}\) and a chosen reconstruction target of the image \(I\):
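As a placeholder for that reconstruction term, the sketch below uses a plain MSE between a small reconstruction head on the visual hidden states and an arbitrary target derived from \(I\). Ross defines its own reconstruction target and measure, so this only illustrates the overall shape \(L_{Ross} = L_{LMM}^{text} + L_{LMM}^{visual}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, target_dim, N = 64, 32, 16
recon_head = nn.Linear(hidden, target_dim)   # small head mapping visual hidden states to targets

h_visual = torch.randn(N, hidden)            # LLM hidden states at the visual positions x_{i<=N}
recon_target = torch.randn(N, target_dim)    # reconstruction target derived from the image I
loss_text = torch.tensor(1.23)               # stand-in for the text loss defined earlier

loss_visual = F.mse_loss(recon_head(h_visual), recon_target)
loss_ross = loss_text + loss_visual          # L_Ross = L_text + L_visual
print(loss_ross.item())
```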