
The article introduces Mini-Gemini, a framework for improving multi-modality Vision Language Models (VLMs) at image understanding, reasoning, and generation. Using high-resolution visual tokens, high-quality data, and VLM-guided generation, Mini-Gemini narrows the performance gap with advanced models such as GPT-4 and Gemini, and supports a series of ...