# NeMo Multimodal Collections

The NeMo Multimodal Collection supports a diverse range of multimodal models tailored for various tasks, including text-to-image generation, text-to-NeRF synthesis, multimodal language models, and foundation vision-language models. Wherever feasible, the collection reuses existing modules from other NeMo collections such as LLM and Vision, avoiding redundant implementations. The models currently supported within the multimodal collection are:

- Foundation Vision-Language Models:
  - CLIP
- Foundation Text-to-Image Generation:
  - Stable Diffusion
  - Imagen
- Customizable Text-to-Image Models:
  - SD-LoRA
  - SD-ControlNet
  - SD-Instruct pix2pix
- Multimodal Language Models:
  - NeVA
  - LLaVA
- Text-to-NeRF Synthesis:
  - DreamFusion++
- NSFW Detection Support

Our documentation provides detailed guidance for each supported model, making it straightforward to integrate and use them in your projects.
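
As a quick illustration of the usual NeMo workflow, the sketch below restores a pretrained multimodal checkpoint through NeMo's standard `restore_from` API. The import path for `MegatronCLIPModel` and the checkpoint filename are assumptions for illustration; consult the model-specific documentation for the exact entry points and checkpoint names.

```python
# Minimal sketch of loading a multimodal NeMo checkpoint.
# The import path and checkpoint filename below are illustrative assumptions;
# the exact class and .nemo file depend on the model you use.
from pytorch_lightning import Trainer

# Assumed import path for the CLIP foundation model in the multimodal collection.
from nemo.collections.multimodal.models.vision_language_foundation.clip.megatron_clip_models import (
    MegatronCLIPModel,
)

trainer = Trainer(accelerator="gpu", devices=1)

# restore_from is NeMo's standard API for loading .nemo checkpoint archives.
model = MegatronCLIPModel.restore_from(
    restore_path="megatron_clip.nemo",  # assumed checkpoint filename
    trainer=trainer,
)
model.eval()
```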