Add support for Idefics 2 #309

Open. Wants to merge 73 commits into base: master

EricLBuehler (Owner) commented May 15, 2024

This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!

Implementation TODOs:

  • VisionTransformer
    • Encoder
      • Attention
      • MLP
    • VisionEmbedding
  • Connector
    • MLP
    • PerceiverLayer
  • Model
    • Forward pass
      • Remove padding images
      • Generate the patch/pixel attention mask
      • Run vision submodel and connector submodel
        • Allow Mistral to run trained embedding head on any input tokens
        • Inputs merger to inject embeddings correctly
      • Pass input to Mistral model
        • Allow Mistral to take an embeddings vector instead of using the trained embedding head.
  • Image processor analogous to Idefics2ImageProcessor
    • Resizing
    • Rescaling
    • Normalization
    • Padding
      • Generate pixel attention mask for padded images
        • Pass and use in input injection
    • Create pixel values tensors
  • Vision Model Pipeline
    • Add a VisionModel trait similar to NormalModel
    • Add a ModelCategory: vision, text, embedding, etc.
    • Handle sequence scheduling with image dimensions
    • Abstract input preparation logic
      • Handle padding to the same resized shape across the batch dimension
  • Add proper handling of chat templates
    • Load preprocessor/processor config JSON files
    • Support configuration of inputs processor via preprocessor
  • Messages API generalization
    • Support OpenAI compatible method of specifying images
    • Update messages to optionally encode type (akin to examples here).
    • Use processor config to abstract the chat template application process
  • HTTP API
    • Handle decoding from base64
    • Support loading from HTTP.
  • Rust API
  • Python API
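
As a rough illustration of the image processor items above (rescaling, normalization, padding, and the pixel attention mask), here is a minimal sketch. It is not the code in this PR: the `Image` struct, the function names, the single-channel layout, and the bottom/right zero padding are all assumptions made for the example, and a real implementation would operate on tensors rather than plain vectors.

```rust
/// Hypothetical single-channel image stored row-major.
/// A stand-in for a real tensor type, used only for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Image {
    pub height: usize,
    pub width: usize,
    pub data: Vec<f32>,
}

/// Rescale raw 0..=255 pixels to 0.0..=1.0, then normalize with the
/// given mean and standard deviation (illustrative parameters).
pub fn rescale_and_normalize(
    raw: &[u8],
    height: usize,
    width: usize,
    mean: f32,
    std_dev: f32,
) -> Image {
    assert_eq!(raw.len(), height * width, "pixel count must match shape");
    let data = raw
        .iter()
        .map(|&p| (p as f32 / 255.0 - mean) / std_dev)
        .collect();
    Image { height, width, data }
}

/// Pad every image to the largest height and width in the batch
/// (assumed here: zeros added at the bottom and right).
/// Also returns one mask per image: 1 marks a real pixel, 0 marks padding.
pub fn pad_batch(images: &[Image]) -> (Vec<Image>, Vec<Vec<u8>>) {
    let max_h = images.iter().map(|i| i.height).max().unwrap_or(0);
    let max_w = images.iter().map(|i| i.width).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(images.len());
    let mut masks = Vec::with_capacity(images.len());
    for img in images {
        let mut data = vec![0.0f32; max_h * max_w];
        let mut mask = vec![0u8; max_h * max_w];
        for r in 0..img.height {
            for c in 0..img.width {
                // Copy the original pixel and mark it as real content.
                data[r * max_w + c] = img.data[r * img.width + c];
                mask[r * max_w + c] = 1;
            }
        }
        padded.push(Image { height: max_h, width: max_w, data });
        masks.push(mask);
    }
    (padded, masks)
}
```

The mask is what lets the input injection step tell real image content apart from padding once all images in a batch share one shape.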

Other TODOs:

  • Introduce model type enum to reject mixing of text/multimodal models in speculative decoding
    • Perhaps introduce VisionModel akin to NormalModel.
  • Ergonomic API support (OpenAI compatible on the HTTP side, but hopefully nicer on the Rust/Python side)
  • Support device mapping
  • Support ISQ
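
The model type enum mentioned in the first item could look roughly like the sketch below. The names `ModelCategory` and `check_speculative_pair` and the `String` error type are hypothetical and chosen for illustration only. The idea is simply that a speculative decoding setup compares the categories of the target and draft models and refuses a mixed pair.

```rust
/// Hypothetical model category tag; the variants here are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelCategory {
    Text,
    Vision,
}

/// Reject a speculative decoding pair whose target and draft models
/// belong to different categories (for example text mixed with multimodal).
pub fn check_speculative_pair(
    target: ModelCategory,
    draft: ModelCategory,
) -> Result<(), String> {
    if target == draft {
        Ok(())
    } else {
        Err(format!(
            "cannot mix {:?} target with {:?} draft in speculative decoding",
            target, draft
        ))
    }
}
```

Doing this check once at pipeline construction time would surface the mismatch before any request is scheduled.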

Pending issues:

EricLBuehler added the "new feature" (New feature or request) and "models" (Additions to model or architectures) labels May 15, 2024

github-actions bot commented May 15, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 21          741          622           21           98
 TOML                   15          393          355            1           37
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1054            0          781          273
 |- BASH                 6          203          190            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1563          472          790          301
-------------------------------------------------------------------------------
 Rust                   86        28468        26049          381         2038
 |- Markdown            42          440            0          428           12
 (Total)                          28908        26049          809         2050
===============================================================================
 Total                 151        31153        27441         1184         2528
===============================================================================
  

EricLBuehler marked this pull request as ready for review May 24, 2024 19:24