
Shibli-Nomani/Open-Source-Models-with-Hugging-Face


Open Source Models with Hugging Face

πŸ€— Hugging Face Overview: Hugging Face is a leading platform for natural language processing (NLP), offering a vast repository of pre-trained models, datasets, and tools, empowering developers and researchers to build innovative NLP applications with ease.


Github πŸ‘‡

Kaggle πŸ‘‡


😸 Jupyter Notebook Shortcuts

https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330

🐍 Python Environment Setup

  • Install VS Code

  • Install pyenv (Python version manager)

  • Install Python 3.10.8 using pyenv (pyenv cheat sheet added below)

  • video link to install pyenv and python

    https://www.youtube.com/watch?v=HTx18uyyHw8
    https://k0nze.dev/posts/install-pyenv-venv-vscode/

ref link: https://github.com/Shibli-Nomani/MLOps-Project-AirTicketPricePrediction

Activate Python Env in VSCODE
   pyenv shell 3.10.8
Create Virtual Env
  • #create project directory if required
mkdir nameoftheproject
  • #install virtualenv
pip install virtualenv
  • #create virtual env under the project directory
python -m venv hugging_env
  • #activate virtual env in powershell
   .\hugging_env\Scripts\activate
  • #install dependencies (python libraries)
pip install -r requirements.txt

πŸ’ Hugging Face Hub and Model Selection Process

👉 https://huggingface.co/

Model Memory Requirement

ref link: https://huggingface.co/openai/whisper-large-v3/tree/main

  • model weight

Memory Requirement: trained model weight × 1.2. Here, the model weight is 3.09 GB, so (3.09 GB × 1.2) = 3.708 GB of memory is required on your local PC.
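
The rule of thumb above can be expressed as a tiny helper (a minimal sketch; the 1.2 overhead factor is the heuristic used here, not an exact figure):

```python
def estimate_memory_gb(model_weight_gb: float, overhead_factor: float = 1.2) -> float:
    """Rough rule of thumb: model weights plus ~20% overhead for inference."""
    return model_weight_gb * overhead_factor

# openai/whisper-large-v3 weights are roughly 3.09 GB
print(f"{estimate_memory_gb(3.09):.3f} GB")  # ~3.708 GB
```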

Tasks Performed with Hugging Face


For Automatic Speech Recognition

Hugging Face >> Tasks >> Scroll Down >> Choose your task >> Automatic Speech Recognition >> openai/whisper-large-v3 [best recommendation as per Hugging Face]

  • Pipeline code snippets handle the complex data preprocessing and model integration steps in just a few lines.


🔀 Definition of NLP

NLP πŸ§ πŸ’¬ (Natural Language Processing) is like teaching computers to understand human language πŸ€–πŸ“. It helps them read, comprehend, extract πŸ“‘πŸ”, translate πŸŒπŸ”€, and even generate text πŸ“šπŸ”.


πŸ€– Transformers

A Transformer πŸ€–πŸ”„ is a deep learning model designed for sequential tasks, like language processing. It's used in NLP πŸ§ πŸ’¬ for its ability to handle long-range dependencies and capture context effectively, making it crucial for tasks such as machine translation πŸŒπŸ”€, text summarization πŸ“πŸ”, and sentiment analysis πŸ˜ŠπŸ”. It's considered state-of-the-art due to its efficiency in training large-scale models and achieving impressive performance across various language tasks. πŸš€πŸ“ˆ

πŸ”₯ PyTorch

PyTorch is an open-source machine learning framework developed by Facebook's AI Research lab (FAIR), featuring dynamic computational graphs and extensive deep learning support, empowering flexible and efficient model development.

πŸ€— Task-01: Chatbot

github link: kaggle link:

🌐 Libraries Installation

! pip install transformers

πŸ€– The "blenderbot-400M-distill" model, detailed in the paper "Recipes for Building an Open-Domain Chatbot," enhances chatbot performance by emphasizing conversational skills like engagement and empathy. Through large-scale models and appropriate training data, it outperforms existing approaches in multi-turn dialogue, with code and models available for public use.
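
A minimal pipeline sketch for this chatbot, assuming an older transformers release in which the conversational pipeline and the Conversation helper are still available (both were removed in recent versions):

```python
from transformers import pipeline, Conversation  # Conversation only exists in older transformers releases

chatbot = pipeline(task="conversational", model="facebook/blenderbot-400M-distill")

# wrap the user message in a Conversation object and let the model reply
conversation = Conversation("What are some fun activities I can do in the winter?")
conversation = chatbot(conversation)
print(conversation)
```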

✨ Find the Appropriate LLM Model for a Specific Task

πŸ€— Task-02: Text Translation and Summarization

🎭 Text Translation

🌐 Text Translation: Converting text from one language to another, facilitating cross-cultural communication and understanding.

🌐 Libraries Installation

!pip install transformers
!pip install torch

🌐 NLLB-200(No Language Left Behind), the distilled 600M variant, excels in machine translation research, offering single-sentence translations across 200 languages. Detailed in the accompanying paper, it's evaluated using BLEU, spBLEU, and chrF++ metrics, and trained on diverse multilingual data sources with ethical considerations in mind. While primarily for research, its application extends to improving access and education in low-resource language communities. Users should assess domain compatibility and acknowledge limitations regarding input lengths and certification.
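
A minimal sketch of the translation pipeline with this checkpoint (the example sentence and the torch_dtype choice are assumptions; NLLB uses FLORES-200 language codes such as eng_Latn and fra_Latn):

```python
import torch
from transformers import pipeline

translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16)  # half-precision weights to reduce memory

text = "My puppy is adorable. Your kitten is cute."
output = translator(text, src_lang="eng_Latn", tgt_lang="fra_Latn")
print(output[0]["translation_text"])
```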


### 🚩 To Clear Memory Allocation

Delete model and clear memory using **Garbage Collector**.

#garbage collector
import gc
#delete the translator pipeline object
del translator
#reclaim memory occupied by objects that are no longer in use by the program
gc.collect()

β›½ Text Summarization

πŸ“‘ Text Summarization: Condensing a piece of text while retaining its essential meaning, enabling efficient information retrieval and comprehension.

🌐 Libraries Installation

!pip install transformers
!pip install torch

πŸ€– BART (large-sized model), fine-tuned on CNN Daily Mail, excels in text summarization tasks. It employs a transformer architecture with a bidirectional encoder and an autoregressive decoder, initially introduced in the paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Lewis et al. This model variant, although lacking a specific model card from the original team, is particularly effective for generating summaries, demonstrated by its fine-tuning on CNN Daily Mail text-summary pairs.
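
A minimal summarization pipeline sketch with this checkpoint (the sample paragraph and length limits are assumptions):

```python
from transformers import pipeline

summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn")

text = ("Paris is the capital and most populous city of France. It is a major European "
        "centre of finance, diplomacy, commerce, culture, and the arts, and one of the "
        "most visited cities in the world.")
summary = summarizer(text, min_length=10, max_length=60)
print(summary[0]["summary_text"])
```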

πŸ€— Task-03: Sentence Embedding

  • πŸ” Sentence Embedding Overview: Sentence embedding represents a sentence as a dense vector in a high-dimensional space, capturing its semantic meaning.


  • πŸ”§ Encoder Function: During embedding, the encoder transforms the sentence into a fixed-length numerical representation by encoding its semantic information into a vector format.

  • πŸ“ Cosine Similarity/Distance: Cosine similarity measures the cosine of the angle between two vectors, indicating their similarity in orientation. It's vital for comparing the semantic similarity between sentences, irrespective of their magnitude.


  • 🎯 Importance of Cosine Similarity/Distance: Cosine similarity is crucial for tasks like information retrieval, document clustering, and recommendation systems, facilitating accurate assessment of semantic similarity while ignoring differences in magnitude.
🌐 Libraries Installation

!pip install transformers
!pip install sentence-transformers


πŸ” All-MiniLM-L6-v2 Overview: The All-MiniLM-L6-v2 sentence-transformers model efficiently maps sentences and paragraphs into a 384-dimensional dense vector space, facilitating tasks such as clustering or semantic search with ease.

πŸ€— Task-04: Audio Classification

πŸ“›Zero-Shot:

"Zero-shot" refers to the fact that the model makes predictions without direct training on specific classes. Instead, it utilizes its understanding of general patterns learned during training on diverse data to classify samples it hasn't seen before. Thus, "zero-shot" signifies that the model doesn't require any training data for the specific classes it predicts.

  • πŸ€– transformers: This library provides access to cutting-edge pre-trained models for natural language processing (NLP), enabling tasks like text classification, language generation, and sentiment analysis with ease. It streamlines model implementation and fine-tuning, fostering rapid development of NLP applications.

  • πŸ“Š datasets: Developed by Hugging Face, this library offers a comprehensive collection of datasets for NLP tasks, simplifying data acquisition, preprocessing, and evaluation. It enhances reproducibility and facilitates experimentation by providing access to diverse datasets in various languages and domains.

  • πŸ”Š soundfile: With functionalities for reading and writing audio files in Python, this library enables seamless audio processing for tasks such as speech recognition, sound classification, and acoustic modeling. It empowers users to handle audio data efficiently, facilitating feature extraction and analysis.

  • 🎡 librosa: Specializing in music and sound analysis, this library provides tools for audio feature extraction, spectrogram computation, and pitch estimation. It is widely used in applications like music information retrieval, sound classification, and audio-based machine learning tasks, offering essential functionalities for audio processing projects.

πŸ”Š Ashraq/ESC50 Dataset Overview:

The Ashraq/ESC50 dataset is a collection of 2000 environmental sound recordings, categorized into 50 classes, designed for sound classification tasks. Each audio clip is 5 seconds long and represents various real-world environmental sounds, including animal vocalizations, natural phenomena, and human activities.

🌐 Libraries Installation

!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa


πŸ› οΈ clap-htsat-unfused offers a pipeline for contrastive language-audio pretraining, leveraging large-scale audio-text pairs from LAION-Audio-630K dataset. The model incorporates feature fusion mechanisms and keyword-to-caption augmentation, enabling processing of variable-length audio inputs. Evaluation across text-to-audio retrieval and audio classification tasks showcases its superior performance and availability for public use.
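
A minimal zero-shot audio classification sketch with this checkpoint and the ESC-50 dataset (the candidate labels are assumptions; CLAP expects 48 kHz audio, hence the resampling):

```python
from datasets import load_dataset, Audio
from transformers import pipeline

# load a few ESC-50 clips and resample them to the 48 kHz rate CLAP was trained on
dataset = load_dataset("ashraq/esc50", split="train[0:10]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=48_000))
audio_sample = dataset[0]["audio"]["array"]

classifier = pipeline(task="zero-shot-audio-classification",
                      model="laion/clap-htsat-unfused")
predictions = classifier(audio_sample,
                         candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"])
print(predictions)
```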


πŸ‘€ Human Speech Recording: 16,000 Hz
πŸ“‘ Walkie Talkie/Telephone: 8,000 Hz
πŸ”Š High Resolution Audio: 192,000 Hz

πŸ“Œ key note for Audio Signal Processing:

For 5 seconds of audio sampled at 8,000 Hz, the SIGNAL VALUE is (5 x 8,000) = 40,000 samples.

In the case of transformers, the SIGNAL VALUE relies on the 🔆 input sequence and the 🔆 attention mechanism: 1 second of input can look like 60 seconds' worth of processing, as explained below.

In the case of transformers, particularly in natural language processing tasks, the SIGNAL VALUE is determined by the length of the πŸ”†input sequences and the πŸ”† attention mechanism employed. Unlike traditional video processing, where each frame corresponds to a fixed time interval, in transformers, the SIGNAL VALUE may appear to be elongated due to the attention mechanism considering sequences of tokens. For example, if the attention mechanism processes 60 tokens per second, the SIGNAL VALUE for 1 second of input may appear equivalent to 60 seconds in terms of processing complexity.

In natural language processing, the input sequence refers to a series of tokens representing words or characters in a text. The attention mechanism in transformers helps the model focus on relevant parts of the input sequence during processing by assigning weights to each token, allowing the model to prioritize important information. Think of it like giving more attention to key words in a sentence while understanding its context, aiding in tasks like translation and summarization.

🤗 Task-05: Automatic Speech Recognition (ASR)

🌐 Libraries Installation

!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
!pip install gradio

  • !pip install transformers: Access state-of-the-art natural language processing models and tools. πŸ€–
  • !pip install datasets: Simplify data acquisition and preprocessing for natural language processing tasks. πŸ“Š
  • !pip install soundfile: Handle audio data reading and writing tasks efficiently. πŸ”Š
  • !pip install librosa: Perform advanced audio processing and analysis tasks. 🎡
  • !pip install gradio: Develop interactive web-based user interfaces for machine learning models. 🌐

Librosa is a Python library designed for audio and music signal processing. It provides functionalities for tasks such as audio loading, feature extraction, spectrogram computation, pitch estimation, and more. Librosa is commonly used in applications such as music information retrieval, sound classification, speech recognition, and audio-based machine learning tasks.

πŸŽ™οΈ LibriSpeech ASR: A widely-used dataset for automatic speech recognition (ASR), containing a large collection of English speech recordings derived from audiobooks. With over 1,000 hours of labeled speech data, it facilitates training and evaluation of ASR models for transcription tasks.

πŸ‘‰ dataset: https://huggingface.co/datasets/librispeech_asr

πŸ‘‰ model: https://huggingface.co/distil-whisper

πŸ‘‰ model: https://github.com/huggingface/distil-whisper

πŸ” Distil-Whisper:

Distil-Whisper, a distilled variant of Whisper, is 6 times faster and 49% smaller, and performs within 1% word error rate (WER) of Whisper on out-of-distribution evaluation sets. With checkpoints ranging from distil-small.en to distil-large-v2, it caters to diverse latency and resource constraints, for example: 📈🔉

  • Virtual Assistants
  • Voice-Controlled Devices
  • Dictation Software
  • Mobile Devices
  • Edge Computing Platforms
  • Online Transcription Services
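
A minimal ASR pipeline sketch with a Distil-Whisper checkpoint and a LibriSpeech sample (the dataset-loading details are assumptions; streaming avoids downloading the full corpus):

```python
from datasets import load_dataset
from transformers import pipeline

asr = pipeline(task="automatic-speech-recognition",
               model="distil-whisper/distil-small.en")

# stream one validation clip from LibriSpeech instead of downloading everything
dataset = load_dataset("librispeech_asr", "clean", split="validation",
                       streaming=True, trust_remote_code=True)
example = next(iter(dataset))

audio = example["audio"]
result = asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
print(result["text"])
```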

✨ Gradio:

πŸ› οΈπŸš€ Build & Share Delightful Machine Learning Apps

Gradio offers the fastest way to showcase your machine learning model, providing a user-friendly web interface that enables anyone to utilize it from any location!

πŸ‘‰ Gradio Website: https://www.gradio.app/

πŸ‘‰ Gradio In Hugging Face: https://huggingface.co/gradio

πŸ‘‰ Gradio Github: https://github.com/gradio-app/gradio

πŸŒπŸ› οΈ Gradio: Develop Machine Learning Web Apps with Ease

Gradio, an open-source Python package, enables swift creation of demos or web apps for your ML models, APIs, or any Python function. Share your creations instantly using built-in sharing features, requiring no JavaScript, CSS, or web hosting expertise.


πŸ“Œ error: DuplicateBlockError: At least one block in this Blocks has already been rendered.

💉 solution: create a new gr.Blocks() instance (i.e., change the block name declared earlier) instead of re-rendering the old one.

demonstrations = gr.Blocks()

  • 🚦 note: The app will continue running unless you run demo.close()
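
A minimal Gradio sketch for a microphone-transcription demo, assuming the asr pipeline defined above; note that some argument names (sources, allow_flagging) differ between Gradio 3.x and 4.x:

```python
import gradio as gr

def transcribe_speech(filepath):
    if filepath is None:
        return "No audio found, please retry."
    return asr(filepath)["text"]  # asr = the automatic-speech-recognition pipeline from above

demo = gr.Blocks()

mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription", lines=3),
    allow_flagging="never",
)

with demo:
    gr.TabbedInterface([mic_transcribe], ["Transcribe Microphone"])

demo.launch(share=True)
# demo.close()  # stop the app when finished
```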

πŸ€— Task-06: Text to Speech

Libraries Installation

!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer

  • !pip install transformers: Installs the Transformers library, which provides state-of-the-art natural language processing models for various tasks such as text classification, translation, summarization, and question answering.

  • !pip install gradio: Installs Gradio, a Python library that simplifies the creation of interactive web-based user interfaces for machine learning models, allowing users to interact with models via a web browser.

  • !pip install timm: Installs Timm, a PyTorch library that offers a collection of pre-trained models and a simple interface to use them, primarily focused on computer vision tasks such as image classification and object detection.

  • !pip install inflect: Installs Inflect, a Python library used for converting numbers to words, pluralizing and singularizing nouns, and generating ordinals and cardinals.

  • !pip install phonemizer: Installs Phonemizer, a Python library for converting text into phonetic transcriptions, useful for tasks such as text-to-speech synthesis and linguistic analysis.

📌 Note: py-espeak-ng is only available on Linux operating systems.

To run locally in a Linux machine, follow these commands:

  sudo apt-get update
  sudo apt-get install espeak-ng
  pip install py-espeak-ng

πŸ“• APT stands for Advanced Package Tool. It is a package management system used by various Linux distributions, including Debian and Ubuntu. APT allows users to install, update, and remove software packages on their system from repositories. It also resolves dependencies automatically, ensuring that all required dependencies for a package are installed.

  • sudo apt-get update: Updates the package index of APT.
  • sudo apt-get install espeak-ng: Installs the espeak-ng text-to-speech synthesizer.
  • pip install py-espeak-ng: Installs the Python interface for espeak-ng.


πŸ” kakao-enterprise/vits-ljs:

πŸ”ŠπŸ“š VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

  • Overview:

VITS is an end-to-end model for speech synthesis, utilizing a conditional variational autoencoder (VAE) architecture. It predicts speech waveforms based on input text sequences, incorporating a flow-based module and a stochastic duration predictor to handle variations in speech rhythm.

  • Features:

πŸ”Ή The model generates spectrogram-based acoustic features using a Transformer-based text encoder and coupling layers, allowing it to capture complex linguistic patterns.

πŸ”Ή It includes a stochastic duration predictor, enabling it to synthesize speech with diverse rhythms from the same input text.

  • Training and Inference:

πŸ”Ή VITS is trained with a combination of variational lower bound and adversarial training losses.

πŸ”Ή Normalizing flows are applied to enhance model expressiveness.

πŸ”Ή During inference, text encodings are up-sampled based on duration predictions and mapped into waveforms using a flow module and HiFi-GAN decoder.

  • Variants and Datasets:

πŸ”Ή Two variants of VITS are trained on LJ Speech and VCTK datasets.

πŸ”Ή LJ Speech comprises 13,100 short audio clips (approx. 24 hours), while VCTK includes approximately 44,000 short audio clips from 109 native English speakers (approx. 44 hours).

πŸ‘‰ model: https://huggingface.co/kakao-enterprise/vits-ljs


πŸ“πŸ”Š Text to Audio Wave

Text-to-audio waveform array for speech generation is the process of converting textual input into a digital audio waveform representation. This involves synthesizing speech from text, where a machine learning model translates written words into spoken language. The model analyzes the text, generates corresponding speech signals, and outputs an audio waveform array that can be played back as human-like speech. The benefits include enabling natural language processing applications such as virtual assistants, audiobook narration, and automated customer service, enhancing accessibility for visually impaired individuals, and facilitating audio content creation in various industries.
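
A minimal text-to-speech pipeline sketch with the VITS checkpoint above (the sample sentence is an assumption); it returns exactly the kind of dictionary shown below:

```python
from transformers import pipeline

tts = pipeline(task="text-to-speech", model="kakao-enterprise/vits-ljs")

text = "Welcome to this short demonstration of text to speech with VITS."
narration = tts(text)
# narration is a dict: {"audio": <numpy waveform array>, "sampling_rate": 22050}
```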

{'audio': array([[ 0.00112925, 0.00134222, 0.00107496, ..., -0.00083117, -0.00077596, -0.00064528]], dtype=float32), 'sampling_rate': 22050}

πŸ“Œnote: This dictionary contains an audio waveform represented as a NumPy array, along with its corresponding sampling rate. 🎡 The audio array consists of amplitude values sampled at a rate of 22,050 Hz.


πŸ€— Task-07: Object Detection

πŸ™€ Image to Audio Generation
πŸ” What is Object Detection

πŸ“· Object detection is a computer vision task that involves identifying and locating objects within an image or video. The goal is to not only recognize what objects are present but also to precisely locate them by drawing bounding boxes around them.

It's crucial for automating tasks like surveillance, autonomous driving, and quality control, enhancing safety, efficiency, and user experiences across various industries.


### 🔔 To Find the State-of-the-Art Models for Object Detection on Hugging Face

👉 Hugging Face Models: https://huggingface.co/models?sort=trending

👉 Hugging Face SoTA Models for Object Detection: https://huggingface.co/models?pipeline_tag=object-detection&sort=trending

πŸ‘‰ Model: https://huggingface.co/facebook/detr-resnet-50


🎊 facebook/detr-resnet-50

DETR (End-to-End Object Detection) model with ResNet-50 backbone:

DETR (DEtection TRansformer) model, trained on COCO 2017 dataset, is an end-to-end object detection model with ResNet-50 backbone. Utilizing encoder-decoder transformer architecture, it employs object queries for detection and bipartite matching loss for optimization, achieving accurate object localization and classification.

πŸ“¦ COCO Dataset:

The COCO (Common Objects in Context) 2017 dataset πŸ“· is a widely used benchmark dataset for object detection, segmentation, and captioning tasks in computer vision. It consists of a large collection of images with complex scenes containing multiple objects in various contexts. The dataset is annotated with bounding boxes, segmentation masks, and captions for each object instance, providing rich and diverse training data for developing and evaluating object detection algorithms.
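
A minimal object detection pipeline sketch with the DETR checkpoint above (the image path is a hypothetical local file):

```python
from PIL import Image
from transformers import pipeline

od_pipe = pipeline(task="object-detection", model="facebook/detr-resnet-50")

raw_image = Image.open("street_scene.jpg")   # hypothetical local image
predictions = od_pipe(raw_image)

# each prediction is a dict like:
# {'score': 0.99, 'label': 'person', 'box': {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}}
for p in predictions:
    print(f"{p['label']}: {p['score']:.2f}")
```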


✨ Gradio Apps for Object Detection:

🛠️🚀 Build & Share Delightful Machine Learning Apps for Object Detection

Gradio offers the fastest way to showcase your machine learning model, providing a user-friendly web interface that enables anyone to utilize it from any location!

πŸ‘‰ Gradio Website: https://www.gradio.app/

πŸ‘‰ Gradio In Hugging Face: https://huggingface.co/gradio

πŸ‘‰ Gradio Github: https://github.com/gradio-app/gradio


✨ Make An AI powered Audio Assistant

  • by importing summarize_predictions_natural_language to convert the object detection pipeline's output into a natural-language description (see the sketch after this list)

✨ Generate Audio Narration Of An Image

  • using kakao-enterprise/vits-ljs, generate text to audio
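
A sketch of chaining the two models, with a simple stand-in for the course's summarize_predictions_natural_language helper (the helper body and the image path below are assumptions):

```python
from collections import Counter

from PIL import Image
from transformers import pipeline

od_pipe = pipeline(task="object-detection", model="facebook/detr-resnet-50")
tts_pipe = pipeline(task="text-to-speech", model="kakao-enterprise/vits-ljs")

def summarize_predictions_natural_language(predictions):
    # simple stand-in for the course helper of the same name
    counts = Counter(p["label"] for p in predictions)
    parts = [f"{n} {label}" + ("s" if n > 1 else "") for label, n in counts.items()]
    return "In this image, there are " + ", ".join(parts) + "."

raw_image = Image.open("living_room.jpg")      # hypothetical image
detections = od_pipe(raw_image)
text = summarize_predictions_natural_language(detections)
narration = tts_pipe(text)                     # {'audio': ..., 'sampling_rate': 22050}
```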

πŸ” kakao-enterprise/vits-ljs:

πŸ”ŠπŸ“š VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

πŸ‘‰ model: https://huggingface.co/kakao-enterprise/vits-ljs

  • Overview:

VITS is an end-to-end model for speech synthesis, utilizing a conditional variational autoencoder (VAE) architecture. It predicts speech waveforms based on input text sequences, incorporating a flow-based module and a stochastic duration predictor to handle variations in speech rhythm.


πŸ€— Task-08: Image Segmentation

Image segmentation involves dividing an image into multiple segments, each representing a distinct object or region. This process is crucial for various applications, as it simplifies image analysis and enhances understanding.

  • πŸ₯ Medical Imaging: Used to identify tumors or anomalies in MRI or CT scans, aiding diagnosis and treatment planning.

  • πŸš— Autonomous Vehicles: Enables object detection and obstacle avoidance, crucial for safe navigation on roads.

  • 🌍 Satellite Imagery: Facilitates land cover classification, assisting in urban planning, agriculture, and environmental monitoring.

  • In each of these examples, image segmentation plays a vital role in extracting meaningful information from complex visual data, contributing to advancements in healthcare, transportation, and environmental science.

Libraries Installation

!pip install transformers
!pip install gradio
!pip install timm
!pip install torchvision
!pip install torch

  • !pip install transformers: Installs the Transformers library, which provides state-of-the-art natural language processing models for various tasks such as text classification, translation, summarization, and question answering.

  • !pip install gradio: Installs Gradio, a Python library that simplifies the creation of interactive web-based user interfaces for machine learning models, allowing users to interact with models via a web browser.

  • !pip install timm: Installs Timm, a PyTorch library that offers a collection of pre-trained models and a simple interface to use them, primarily focused on computer vision tasks such as image classification and object detection.

  • !pip install torchvision: Installs the torchvision library, which provides datasets, model architectures, and image transformations for computer vision tasks.

πŸ‘‰ model: https://huggingface.co/Zigeng/SlimSAM-uniform-77

πŸ” Segmentation Anything Model (SAM) is a versatile deep learning architecture πŸ§ πŸ–ΌοΈ designed for pixel-wise segmentation tasks, capable of accurately delineating objects within images for various applications such as object detection, medical imaging, and autonomous driving.


πŸ” SlimSAM:

SlimSAM, a novel SAM compression method, efficiently reuses pre-trained SAMs through a unified pruning-distillation framework and employs masking for selective parameter retention. By integrating an innovative alternate slimming strategy and a label-free pruning criterion, SlimSAM reduces parameter counts to 0.9%, MACs to 0.8%, and requires only 0.1% of the training data compared to the original SAM-H. Extensive experiments demonstrate superior performance with over 10 times less training data usage compared to other SAM compression methods.


Masking in SlimSAM selectively retains crucial parameters, enabling efficient compression of pre-trained SAMs without sacrificing performance, by focusing on essential features and discarding redundancies.

🎨 Segmentation Mask Generation

Segmentation mask generation involves creating pixel-wise masks that delineate different objects or regions within an image. For example, in a photo of various fruits 🍎🍌, segmentation masks would outline each fruit separately, aiding in their identification and analysis.

πŸ’‘ Key Notes:

points_per_batch = 32 in image processing denotes the number of pixel points considered in each batch during model training or inference πŸ–ΌοΈ, aiding in efficient computation of gradients and optimization algorithms, thereby enhancing training speed and resource utilization.

📌 note: a smaller points_per_batch is less computationally expensive per pass, but gives lower accuracy.
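
A minimal mask-generation pipeline sketch with SlimSAM (the image path is a hypothetical local file):

```python
from PIL import Image
from transformers import pipeline

sam_pipe = pipeline(task="mask-generation", model="Zigeng/SlimSAM-uniform-77")

raw_image = Image.open("example.jpg")        # hypothetical image
output = sam_pipe(raw_image, points_per_batch=32)

masks = output["masks"]                      # one boolean mask per detected segment
print(len(masks))
```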

Raw image


Image after Segmentation:


πŸ” Use SlimSAM(Segment Anything Model) without Pipeline
  • 🧠 The model variable initializes a SlimSAM model instance loaded from pre-trained weights πŸ§ πŸ”— located at "./models/Zigeng/SlimSAM-uniform-77", enabling tasks like inference or fine-tuning.

  • πŸ”— The processor variable initializes a SamProcessor instance loaded with pre-trained settings πŸ› οΈπŸ”— located at "./models/Zigeng/SlimSAM-uniform-77", facilitating data preprocessing for compatibility with the SlimSAM model during inference or fine-tuning processes.

  • πŸ› οΈ Pretrained settings encompass pre-defined configurations or parameters obtained from training a model πŸ§ πŸ”—, facilitating effective performance in related tasks with minimal fine-tuning or adjustment.

Import Libraries

import torch

πŸ“Œ note:

no_grad runs the model inference without tracking operations for gradient computation, thereby conserving memory resources and speeding up the inference process.

with torch.no_grad(): 
  outputs = model(**inputs)

Gradient computation πŸ“ˆ refers to calculating the derivatives of a loss function with respect to the model parameters, crucial for updating weights during training. These gradients indicate the direction and magnitude of parameter updates needed to minimize the loss during training through optimization algorithms like gradient descent.
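
Putting the pieces above together, a minimal sketch of running SlimSAM without the pipeline (the image path and the prompt point coordinates are assumptions; the lesson loads the checkpoint from a local ./models/ directory, while the Hub ID is used here):

```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("Zigeng/SlimSAM-uniform-77")
processor = SamProcessor.from_pretrained("Zigeng/SlimSAM-uniform-77")

raw_image = Image.open("example.jpg")       # hypothetical image
input_points = [[[1600, 700]]]              # one (x, y) prompt point; coordinates are assumed

inputs = processor(raw_image, input_points=input_points, return_tensors="pt")

with torch.no_grad():                       # inference only, no gradient tracking
    outputs = model(**inputs)

# upscale the low-resolution predicted masks back to the original image size
predicted_masks = processor.image_processor.post_process_masks(
    outputs.pred_masks,
    inputs["original_sizes"],
    inputs["reshaped_input_sizes"],
)
print(predicted_masks[0].shape)
```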

πŸ€— DPT

DPT (Dense Pretrained Transformer) enhances dense prediction tasks using Vision Transformer (ViT) as its backbone. It provides finer-grained predictions compared to fully-convolutional networks, yielding substantial improvements in performance, especially with large training data. DPT achieves state-of-the-art results in tasks like monocular depth estimation and semantic segmentation on datasets like ADE20K, NYUv2, KITTI, and Pascal Context.

πŸ‘‰ model: https://huggingface.co/docs/transformers/model_doc/dpt

πŸ‘‰ model in Github: https://github.com/isl-org/DPT

πŸ‘‰ research paper of model : https://arxiv.org/abs/2103.13413


Intel/dpt-hybrid-midas:

DPT-Hybrid, also known as MiDaS 3.0, is a monocular depth estimation model based on the Dense Prediction Transformer (DPT) architecture, utilizing a Vision Transformer (ViT) backbone with additional components for enhanced performance. Trained on 1.4 million images, it offers accurate depth predictions for various applications such as autonomous navigation, augmented reality, and robotics, providing crucial depth perception for tasks like obstacle avoidance, scene understanding, and 3D reconstruction.

πŸ‘‰ model: https://huggingface.co/Intel/dpt-hybrid-midas
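
A minimal depth-estimation pipeline sketch with this checkpoint (the image path is a hypothetical local file):

```python
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(task="depth-estimation", model="Intel/dpt-hybrid-midas")

raw_image = Image.open("room.jpg")            # hypothetical image
output = depth_estimator(raw_image)

depth_map = output["depth"]                   # PIL image of the predicted depth
predicted_tensor = output["predicted_depth"]  # raw depth tensor
depth_map.show()
```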


πŸ€— Task-09: Image to Text Retrieval

  • πŸŒπŸ“ΈπŸ”Š Multimodal

Multimodal models πŸŒπŸ“ΈπŸ”Š are machine learning architectures designed to process and integrate information from multiple modalities, such as text, images, audio, and other data types, into a cohesive representation. These models utilize various techniques like fusion mechanisms, attention mechanisms, and cross-modal learning to capture rich interactions between different modalities, enabling them to perform tasks like image captioning, video understanding, and more, by leveraging the complementary information present across different modalities.

  • Fusion mechanisms πŸ”„: Techniques to combine information from different modalities, like averaging features from text and images to make a unified representation.

  • Attention mechanisms πŸ‘€: Mechanisms that focus on relevant parts of each modality's input, like attending to specific words in a sentence and regions in an image.

  • Cross-modal learning πŸ§ πŸ’‘: Learning strategies where information from one modality helps improve understanding in another, like using audio features to enhance image recognition accuracy.

🚡 Application: ChatGPT --> SEE, HEAR AND SPEAK


Bootstrapping Language-Image Pre-trainingπŸŒπŸ“ΈπŸ“

BLIP Model: Proposed in BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.

Tasks: BLIP excels in various multi-modal tasks such as:

  • Visual Question Answering πŸ€”βž‘οΈπŸ“Έ

  • Image-Text Retrieval (Image-text matching) πŸ”ŽπŸ–ΌοΈπŸ“

  • Image Captioning πŸ–ΌοΈπŸ“

  • Abstract: BLIP is a versatile VLP framework adept at both understanding and generation tasks. It effectively utilizes noisy web data through bootstrapping captions, resulting in state-of-the-art performance across vision-language tasks.

πŸ‘‰ model: https://huggingface.co/docs/transformers/model_doc/blip

πŸ‘‰ model: https://huggingface.co/Salesforce/blip-itm-base-coco


πŸŒπŸ’Ό About Salesforce AI

Salesforce AI Research is dedicated to pioneering AI advancements to revolutionize our company, customers, and global communities πŸš€. Their innovative products harness AI to enhance customer relationship management, optimize sales processes, and drive business intelligence, empowering organizations to thrive in the digital era πŸŒπŸ’Ό.

πŸ‘‰ model: https://huggingface.co/Salesforce/blip-itm-base-coco

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Model card for BLIP trained on image-text matching - base architecture (with ViT base backbone) trained on COCO dataset.


Libraries Installation

!pip install transformers

!pip install torch

  • AutoProcessor 📊🤖 is a Transformers convenience class that automatically loads the correct processor for a given checkpoint, bundling the tokenizer and image processor needed to turn raw text and images into model-ready inputs for models such as BLIP.

🛠️ return_tensors = "pt" tells the processor to return PyTorch tensors.

inputs = processor(images = raw_images,
                   text = text,
                   return_tensors = "pt"
        )

🚴 Code Snippets:

itm_scores = model(**inputs)[0]
  • model(**inputs): This calls the model with the provided inputs. The **inputs syntax in Python unpacks the dictionary inputs and passes its contents as keyword arguments to the model function.

  • [0]: This accesses the first element of the output returned by the model. The output is likely a tuple or a list containing various elements, and [0] retrieves the first element.

  • itm_scores: The retrieved logits are assigned to the variable itm_scores, which contains the predicted image-text matching scores.

📌 note: To open a raw image

  • images.jpg is the image file name (including its directory path)

from PIL import Image

raw_image = Image.open("images.jpg")
raw_image
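
Putting the snippets above together, a minimal end-to-end sketch for image-text matching with the blip-itm-base-coco checkpoint (the candidate caption string is an assumption):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForImageTextRetrieval

model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco")

raw_image = Image.open("images.jpg")                    # the raw image opened above
text = "an image of a woman and a dog on the beach"     # assumed candidate caption

inputs = processor(images=raw_image, text=text, return_tensors="pt")

with torch.no_grad():
    itm_scores = model(**inputs)[0]                     # logits of shape (1, 2)

probs = torch.nn.functional.softmax(itm_scores, dim=1)
print(f"The image and text are matched with a probability of {probs[0][1].item():.4f}")
```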

πŸ€— Task-10: Image Captioning

πŸ“ΈπŸ–‹οΈ Image Captioning: Generating descriptive textual descriptions for images, enhancing accessibility and understanding of visual content.

Real-life Use: Image captioning is employed in social media platforms like Instagram to provide accessibility for visually impaired users, in content management systems for organizing and indexing images, and in educational settings for creating inclusive learning materials.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  • πŸš€ BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Utilizes noisy web data by bootstrapping captions, achieving state-of-the-art results in tasks like πŸ“·πŸ“ image-text retrieval and image captioning. Accessible for both conditional and unconditional image captioning.

πŸ‘‰ model: https://huggingface.co/Salesforce/blip-image-captioning-base


  • Salesforce AI Research is dedicated to pioneering AI advancements to revolutionize our company, customers, and global communities πŸš€. Their innovative products harness AI to enhance customer relationship management, optimize sales processes, and drive business intelligence, empowering organizations to thrive in the digital era πŸŒπŸ’Ό.

  • πŸ€– AutoProcessor

πŸ€– from transformers import AutoProcessor: Imports the AutoProcessor module from Transformers, allowing automatic loading of data processors for NLP tasks with ease. They group processing objects for text, vision, and audio modalities, providing flexibility and ease of use for various NLP tasks.
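
A minimal image-captioning sketch with this checkpoint (the image path and the text prefix used for conditional captioning are assumptions):

```python
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

raw_image = Image.open("kittens.jpg")                 # hypothetical image of two kittens
# conditional captioning: the model continues the provided text prefix
inputs = processor(raw_image, text="a photography of", return_tensors="pt")
outputs = model.generate(**inputs)

# decoding text
print(processor.decode(outputs[0], skip_special_tokens=True))
```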


🎯 Image:


🎯 Text Caption after Decoding:

Output: two kittens in a basket with flowers

#decoding text
print(processor.decode(outputs[0], skip_special_tokens = True))


🤗 Task-11: Visual Question Answering

Visual question answering (VQA)πŸ‘οΈπŸ’¬ is a task in artificial intelligence where systems are designed to answer questions about images. It involves combining computer vision and natural language processing techniques.

  • Icons:

    • πŸ‘οΈ for computer vision
    • πŸ’¬ for natural language processing
  • Use Cases:

    • E-commerce: Users can ask about product details from images.
    • Healthcare: Clinicians can inquire about abnormalities in medical images for diagnosis.
  • Self-attention: Useful in tasks requiring capturing global dependencies within a sequence, such as sentiment analysis or document classification.

  • Cross-attention: Beneficial in tasks where relationships between elements from two different sequences are crucial, like in machine translation or summarization.

  • Bi-directional self-attention: Enhances contextual understanding in tasks like language understanding or sequence generation.

  • Causal self-attention: Ensures that the model generates outputs sequentially and is beneficial in tasks like autoregressive generation, where the order of elements matters.

😬 Distinguishing Factors:
  • Self-attention attends within a single sequence.
  • Cross-attention attends between two different sequences.
  • Bi-directional self-attention attends bidirectionally within a single sequence.
  • Causal self-attention attends in a unidirectional, causal manner within a single sequence.
Self-attention: πŸ”„

Example: In a VQA model, each region of an image attends to other regions and words in the question to understand the visual context and answer the question accurately. For instance, if the question is "What color is the car?" and the image contains multiple cars, self-attention helps the model focus on the relevant car region while considering the question.

Cross-attention: β†”οΈπŸ”

Example: In VQA, the question embeddings attend to different regions of the image, enabling the model to focus on relevant visual information corresponding to the words in the question. For example, if the question is "How many people are there?" the model's attention mechanism would focus on regions representing people in the image.

Bi-directional self-attention: β†”οΈπŸ”„

Example: In VQA, each region of an image attends to all other regions as well as words in the question, allowing the model to capture both intra-modal (visual) and inter-modal (visual-textual) relationships effectively. For instance, when asked "What is the person holding?" the model can attend to both the person's hand region and the corresponding words describing the action in the question.

Causal self-attention: πŸ”„πŸ”

Example: In VQA tasks with sequential processing, each word in the question attends only to preceding words and relevant regions of the image, ensuring that the model generates answers sequentially while maintaining the context of the question and the image. For instance, if the question is "What is the color of the sky?" the model would attend to preceding words and relevant sky regions to generate the answer progressively.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP (Bootstrapping Language-Image Pre-training) is a novel framework for Vision-Language Pre-training (VLP), enhancing performance across understanding and generation tasks. By intelligently utilizing noisy web data through bootstrapping captionsβ€”where a captioner generates synthetic captions filtered for noiseβ€”BLIP achieves state-of-the-art results in various tasks like image-text retrieval, image captioning, and VQA. It also exhibits strong generalization when applied to video-language tasks in a zero-shot manner. πŸš€πŸ–ΌοΈπŸ“

πŸ‘‰ model: https://huggingface.co/Salesforce/blip-vqa-base
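
A minimal visual question answering sketch with this checkpoint (the image path is a hypothetical local file; the question is the one used below):

```python
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

raw_image = Image.open("dog_and_woman.jpg")        # hypothetical image
question = "How many dogs are in picture?"

inputs = processor(raw_image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```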


🎯 Image:


🎯 Ask Question:

questions= "How many dogs are in picture?"

🎯 Output Result:

dog and woman

πŸ€— Task-12: Zero Shot Image Classification

πŸ˜‰ Zero Shot Image Classification

In a zero-shot image classification scenario with dogs and cats, imagine training a model on a dataset containing images of various dog breeds and cat breeds. During training, the model learns to associate visual features of these animals with their respective class labels (e.g., "Labrador Retriever," "Siamese Cat").

Now, suppose the model encounters images of new dog and cat breeds during inference that it has never seen before, such as "Pomeranian" or "Maine Coon." In a zero-shot setting, the model can still classify these images accurately without specific training examples of these breeds.

This is achieved by providing the model with additional information about the classes, such as textual descriptions or class attributes. For example, the model might learn that Pomeranians are small-sized dogs with fluffy coats, while Maine Coon cats are large-sized cats with tufted ears and bushy tails.

🎯note: Zero-shot learning relies on the model's ability to generalize from known classes to unseen classes based on semantic embeddings or attributes associated with those classes.


🌟 CLIP (Contrastive Language-Image Pre-Training)

CLIP, developed by OpenAI, is a model that learns from both images and text to understand the world. It's trained to recognize patterns in pictures and words together, helping it to classify images even when it hasn't seen them before. However, it's not meant for all kinds of tasks and needs to be carefully studied before using it in different situations. It can be helpful in tasks like image search, where you describe what you're looking for in words, and the model finds matching images.

πŸ‘‰ model: https://huggingface.co/openai/clip-vit-large-patch14


🎯 Image:


🎯 Labels:

labels = ["A Photo of Cat", "A Photo of Dog"]
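
Loading CLIP and running it on the image and labels above can be sketched as follows (the image path is an assumption); the resulting outputs object is then used in the snippet below:

```python
from PIL import Image
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

raw_image = Image.open("kittens.jpg")            # hypothetical image
inputs = processor(text=labels, images=raw_image, return_tensors="pt", padding=True)
outputs = model(**inputs)
```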

🎯 Outputs:

outputs.logits_per_image
probs = outputs.logits_per_image.softmax(dim=1)[0]

print(f"probs : {probs}")
probs = list(probs)
for i in range(len(labels)):
  print(f"label: {labels[i]} - probability of {probs[i].item():.4f}")
