
5 Multimodal AI Models That Are Actually Open Source

Multimodal AI is attracting a lot of attention, thanks to the tantalizing promise of AI systems that are designed to be jacks of all trades — capable of processing a combination of text, image, audio, and video.

But while a constellation of powerful, proprietary multimodal AI systems is already on the market, smaller multimodal models and open source alternatives are rapidly gaining ground as users seek out options that are more accessible and adaptable, and that prioritize transparency and collaboration. To get you up to speed on the latest open source multimodal AI systems, we’ll outline some of the more popular options, including their features and uses.

1. Aria

The recently introduced Aria AI model from Rhymes AI is touted as the world’s first open source, multimodal native mixture-of-experts (MoE) model that can process text, code, images, and video — all within one architecture.

This versatile model is relatively powerful compared to even larger models, yet is more efficient, as it selectively activates relevant subsets (or “mini-experts”) of its framework, depending on the task. Its architecture is designed for ease of scalability, as new “experts” can be added to address new tasks without straining the system. Aria also excels at understanding long multimodal inputs, meaning it can quickly and accurately parse long documents and videos.

Aria’s architecture.
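For readers who want a feel for the mixture-of-experts mechanism itself, here is a minimal, illustrative Python sketch of top-k expert routing. It uses made-up dimensions and random weights, and is not Aria’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy mixture-of-experts layer: each "expert" is a small feed-forward weight matrix.
d_model, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # learned in a real model

def moe_layer(tokens):
    """Route each token to its top-k experts and blend their outputs."""
    scores = softmax(tokens @ router)            # (n_tokens, n_experts)
    out = np.zeros_like(tokens)
    for i, (tok, score) in enumerate(zip(tokens, scores)):
        chosen = np.argsort(score)[-top_k:]      # only k experts run per token
        weights = score[chosen] / score[chosen].sum()
        for w, e in zip(weights, chosen):
            out[i] += w * (tok @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))  # e.g. text, code, image and video tokens
print(moe_layer(tokens).shape)              # (4, 16)
```

The key point is that only the selected experts run for each token, which is what keeps MoE models efficient even as more experts are added.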

2. Leopard

Developed by an interdisciplinary team of researchers from the University of Notre Dame, Tencent AI Seattle Lab, and the University of Illinois Urbana-Champaign (UIUC), Leopard is an open source multimodal model that is specifically designed for text-rich image tasks.

Leopard is intended to tackle two of the biggest challenges in the multimodal AI space: the scarcity of high-quality multi-image datasets, and the trade-off between image resolution and sequence length. To achieve this, the model is trained on a curated dataset of over 1 million high-quality, human-made and synthetic samples collected from real-world examples. The dataset is also openly available for use in other models.

“Leopard stands out with its novel adaptive high-resolution encoding module, which dynamically optimizes the allocation of visual sequence lengths based on the original aspect ratios and resolutions of the input images,” Wenhao Yu, a senior researcher at Tencent America and one of the creators of Leopard, explained to The New Stack. “Additionally, it uses pixel shuffling to losslessly compress long visual feature sequences into shorter ones. This design enables the model to handle multiple high-resolution images without sacrificing detail or clarity.”

These features make Leopard an excellent tool for multi-page document understanding (think slide decks, scientific and financial reports), data visualization, webpage comprehension, and for deploying multimodal AI agents capable of handling tasks in visually complex environments.

Leopard’s overall model pipeline.
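To make the pixel-shuffling idea concrete, here is a small, hypothetical sketch of how a grid of visual tokens can be losslessly folded into a sequence that is four times shorter by merging 2 × 2 spatial neighbors into the channel dimension. The shapes are arbitrary and this is not Leopard’s actual code:

```python
import numpy as np

def pixel_shuffle_compress(features, r=2):
    """Fold r x r spatial neighbors into the channel axis.

    features: (H, W, C) grid of visual tokens -> (H*W/r^2, C*r^2) token sequence.
    No information is discarded; the tokens are simply regrouped.
    """
    h, w, c = features.shape
    assert h % r == 0 and w % r == 0
    x = features.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/r, W/r, r, r, C)
    return x.reshape((h // r) * (w // r), r * r * c)

grid = np.random.default_rng(1).standard_normal((24, 24, 64))  # 576 visual tokens
seq = pixel_shuffle_compress(grid, r=2)
print(grid.shape[0] * grid.shape[1], "->", seq.shape)           # 576 -> (144, 256)
```

The shorter sequence is what lets a model attend over several high-resolution images without the context window ballooning.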

3. CogVLM

CogVLM, short for Cognitive Visual Language Model, is an open source, state-of-the-art visual language foundation model that uses deep fusion techniques to achieve high performance. It can be used for visual question answering (VQA) and image captioning.

CogVLM uses an attention-based fusion mechanism to fuse text and image embeddings, while freezing the pretrained language model’s layers to keep its performance high. It also employs an EVA2-CLIP-E visual encoder and a multi-layer perceptron (MLP) adapter to co-map visual and text features into the same space.
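As a rough illustration of the adapter idea, and not CogVLM’s actual code, the sketch below projects visual-encoder features into a language model’s embedding space with a small MLP so that image and text tokens can sit in one sequence. All dimensions and weights are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up dimensions: visual encoder output width vs. language embedding width.
d_visual, d_hidden, d_text = 1024, 2048, 4096

# Two-layer MLP adapter (weights would be learned; the visual encoder stays frozen).
w1 = rng.standard_normal((d_visual, d_hidden)) * 0.02
w2 = rng.standard_normal((d_hidden, d_text)) * 0.02

def mlp_adapter(visual_tokens):
    """Map frozen visual-encoder features into the text embedding space."""
    return np.maximum(visual_tokens @ w1, 0.0) @ w2   # ReLU MLP

image_tokens = rng.standard_normal((256, d_visual))   # e.g. patch features from the encoder
text_tokens = rng.standard_normal((12, d_text))       # prompt embeddings

# Once co-mapped, visual and text tokens can be concatenated into one input sequence.
sequence = np.concatenate([mlp_adapter(image_tokens), text_tokens], axis=0)
print(sequence.shape)   # (268, 4096)
```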

4. LLaVA

Large Language and Vision Assistant (LLaVA) is another open source, state-of-the-art option. It leverages Vicuna as its language decoder and CLIP as its visual encoder, and has been fine-tuned on instruction-following, text-based data generated by ChatGPT and GPT-4. LLaVA uses a trainable projection matrix to map visual representations into the language embedding space.

As a versatile visual assistant, LLaVA can be used to create more advanced chatbots that can handle text- and image-based queries.
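LLaVA checkpoints are distributed through the Hugging Face ecosystem. Assuming a community checkpoint such as llava-hf/llava-1.5-7b-hf, a recent version of the transformers library, and a placeholder image URL (class names and the prompt template can vary between releases), asking the model about an image looks roughly like this:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the community "llava-hf/llava-1.5-7b-hf" checkpoint; swap in any LLaVA variant.
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder URL; point this at a real image.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
prompt = "USER: <image>\nWhat does this chart show? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```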

5. xGen-MM

Also known as BLIP-3, this state-of-the-art, open source suite of multimodal models from Salesforce features a line of variants, including a base pretrained model, an instruction-tuned model, and a safety-tuned model that is intended to reduce harmful outputs.

One crucial development is that the systems were trained using a massive, open source, trillion-token dataset of “interleaved” image and text data, which the researchers characterize as “the most natural form of multimodal data.” That means the models are skilled at handling inputs that mix text with multiple images, which could be useful in a wide range of settings, such as autonomous vehicles, image analysis and disease diagnosis in healthcare, interactive educational tools, or promotional marketing materials.
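To illustrate what “interleaved” image and text data looks like in practice, here is a generic, hypothetical record layout; it is not the actual BLIP-3 dataset schema, and the file paths are invented:

```python
# A hypothetical interleaved multimodal sample: text segments and image references
# appear in their natural document order, rather than as isolated caption/image pairs.
interleaved_sample = [
    {"type": "text",  "value": "The new sensor layout is shown below."},
    {"type": "image", "value": "figures/sensor_layout.png"},
    {"type": "text",  "value": "Compared with last year's design in the next figure, "
                               "it halves power draw."},
    {"type": "image", "value": "figures/previous_design.png"},
    {"type": "text",  "value": "Table 2 summarizes the trade-offs."},
]

# A model trained on such sequences learns to reason across several images and the
# text surrounding them, instead of treating each image in isolation.
for segment in interleaved_sample:
    print(segment["type"], "->", segment["value"])
```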

Conclusion

There is still an ongoing, vigorous debate surrounding the actual definition of open source AI, peppered with accusations of large tech companies “open washing” their AI models in order to gain wider credibility and cachet.

Regardless of how the open source AI debate unfolds, it’s clear there is still a need for truly open source systems and datasets that emphasize transparency, collaboration, and accessibility, and that actually live up to the open source ethos.

