Feature Story
GitHub - unum-cloud/uform: Multi-Modal AI inference library for Multi-Lingual Text, Image, and Video Search, Recommendations, and other Vision-Language tasks, up to 5x faster than OpenAI CLIP
Aug 18, 2023 · github.com

UForm provides a range of models with different architectures and languages; the multilingual models were trained on a language-balanced dataset. The library also provides tools to calculate the semantic compatibility between an image and a text, namely Cosine Similarity and Matching Score. Cosine Similarity is computationally cheap and suitable for retrieval over large collections, while Matching Score captures fine-grained features and is suitable for re-ranking.
Key takeaways
- UForm is a Multi-Modal inference library designed to encode Multi-Lingual Texts, Images, and soon, Audio, Video, and Documents, into a shared vector space.
- It offers three types of multi-modal encoding: late-fusion models, early-fusion models, and mid-fusion models, each with different capabilities and use cases.
- The UForm library is efficient and can be run on various platforms, from large servers to mobile phones, and is available on HuggingFace.
- It also provides tools to calculate semantic compatibility between an image and a text, namely Cosine Similarity and Matching Score.
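To illustrate the cheap retrieval step, here is a minimal NumPy sketch of cosine-similarity search over precomputed embeddings. This is not UForm's API; the vectors are toy stand-ins for the text and image embeddings the library would produce, and the top-k selection mirrors the "retrieve with Cosine Similarity, then re-rank with Matching Score" pattern described above.

```python
import numpy as np

def cosine_similarity(query, corpus):
    """Cosine similarity between one query vector and each row of a corpus matrix."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query  # one score per corpus row

# Toy embeddings standing in for UForm text/image vectors (hypothetical data).
text_vec = np.array([1.0, 0.0, 1.0])
image_vecs = np.array([
    [1.0, 0.0, 1.0],   # same direction as the query -> similarity 1.0
    [0.0, 1.0, 0.0],   # orthogonal to the query    -> similarity 0.0
    [1.0, 0.0, 0.0],   # partial overlap
])

scores = cosine_similarity(text_vec, image_vecs)
top_k = np.argsort(-scores)[:2]  # indices of the 2 best candidates for re-ranking
```

In a real pipeline, only the `top_k` candidates would then be passed to the more expensive Matching Score model for fine-grained re-ranking.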