๐Ÿ”— Short Links

July 3, 2024 ยท View on GitHub

Florence-2: Microsoft's Cutting-edge Vision Language Models

๐Ÿ•ธ LinkedIn โ€ข ๐Ÿ“™ Kaggle โ€ข ๐Ÿ’ป Medium Blog โ€ข ๐Ÿค— Hugging Face โ€ข


Open In

๐Ÿ”— Short Links

๐Ÿ“ƒ Model Description

Florence-2, released by Microsoft in June 2024, is an advanced, lightweight foundation vision-language model open-sourced under the MIT license. This model is very attractive because of its small size (0.2B and 0.7B) and strong performance on a variety of computer vision and vision-language tasks. Despite its small size, it achieves results comparable to those of much larger models, such as Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.

Florence-2 model series

ModelModel sizeModel Description
Florence-2-base [HF]0.23BPretrained model with FLD-5B
Florence-2-large [HF]0.77BPretrained model with FLD-5B
Florence-2-base-ft [HF]0.23BFinetuned model on a colletion of downstream tasks
Florence-2-large-ft [HF]0.77BFinetuned model on a colletion of downstream tasks

Tasks

Florence 2 supports many tasks out of the box:

  • Caption,
  • Detailed Caption,
  • More Detailed Caption,
  • Dense Region Caption,
  • Object Detection,
  • OCR,
  • Caption to Phrase Grounding,
  • segmentation,
  • Region proposal,
  • OCR,
  • OCR with Region.
    You can try out the model via HF Space.

๐Ÿ•ธ Unified Representation

Vision tasks are diverse and vary in terms of spatial hierarchy and semantic granularity. Instance segmentation provides detailed information about object locations within an image but lacks semantic information. On the other hand, image captioning allows for a deeper understanding of the relationships between objects, but without reference to their actual locations.

Open In
Figure 1. Illustration showing the level of spatial hierarchy and semantic granularity expressed by each task. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

The authors of Florence-2 decided that instead of training a series of separate models capable of executing individual tasks, they would unify their representation and train a single model capable of executing over 10 tasks. However, this requires a new dataset.

๐Ÿ’Ž Dataset

Florence-2's strength doesn't stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information - WIT only includes image/caption pairs, SA-1B only contains images and associated segmentation masks. Therefore, they decided to build a new FLD-5B dataset containing a wide range of information about each image - boxes, masks, captions, and grounding. The dataset creation process was largely automated. The authors used off-the-shelf task-specific models and a set of heuristics and quality checks to clean the obtained results. The result was a new dataset containing over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.

Open In An illustrative example of an image and its corresponding annotations in the FLD-5B dataset. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

FLD-5B is not yet publicly available, but the authors announced its upcoming release during CVPR 2024.

Open In Summary of size, spatial hierarchy, and semantic granularity of top datasets. Source: Florence-2 CVPR 2024 poster.

๐Ÿงฉ Architecture and Pre-training details

Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task. Florence-2 takes an image and text as inputs, and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer architecture, generating text and location tokens.

Open In Overview of Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer's vocabulary.

  • Box Representation (x0, y0, x1, y1): Location tokens correspond to the box coordinates, specifically the top-left and bottom-right corners.
  • Polygon Representation (x0, y0, ..., xn, yn): Location tokens represent the polygon's vertices in clockwise order.

๐Ÿฆพ Capabilities

Florence-2 is smaller and more accurate than its predecessors. The Florence-2 series consists of two models: Florence-2-base and Florence-2-large, with 0.23 billion and 0.77 billion parameters, respectively. This size allows for deployment on even mobile devices. Despite its small size, Florence-2 achieves better zero-shot results than Kosmos-2 across all benchmarks, even though Kosmos-2 has 1.6 billion parameters.

Examples

๐Ÿ‹๐Ÿพโ€โ™‚๏ธ Finetuning

Even if Florence-2 supports many tasks, maybe your task or domain might not be supported, or you may want to better control the model's output for your task. That's when you will need to fine-tune.

๐Ÿ—‚ Resources

TitleTypeBrief DescriptionLinks
Florence-2 DemoDemoHF SpaceLink
Florence-2 DocVQA DemoDemoHF SpaceLink
Florence-2 Finetuned DemoDemoHF SpaceLink
Florence-2 Inference NotebookNotebookNotebookLink
Florence-2 Finetuning NotebookNotebookNotebookLink
Vision Language Models ExplainedBlog articlearticleLink
Florence-2 Finetuning on DocVQAVideoVideoLink
Florence-2 Finetuning onVideoVidoLink

๐Ÿ”— Citations and References

  • @article{xiao2023florence, title={Florence-2: Advancing a unified representation for a variety of vision tasks}, author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu}, journal={arXiv preprint arXiv:2311.06242}, year={2023} }

  • Piotr Skalski. (Jun 20, 2024). Florence-2: Open Source Vision Foundation Model by Microsoft. Roboflow Blog

  • Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models