Microsoft Phi-3-Vision Powerful Small Multimodal Model Inference Walkthrough Colab Demo OCR

Опубликовано: 30 Март 2025
на канале: AI WITH Rithesh

386

If you like to support me financially, It is totally optional and voluntary. Buy me a coffee here: https://www.buymeacoffee.com/rithesh

Microsoft Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support.
The model is intended for broad commercial and research use in English. The model provides uses for general purpose AI systems and applications with visual and text input capabilities which require
1. memory/compute constrained environments;
2. latency bound scenarios;
3. general image understanding;
4. OCR;
5. chart and table understanding.
It takes as input images and text.

https://huggingface.co/microsoft/Phi-...
Colab notebook: https://colab.research.google.com/dri...

If you like such content please subscribe to the channel here:
https://www.youtube.com/c/RitheshSree...