Huggingface agent #2599

whiskyboy · 2024-05-05T16:24:30Z

Why are these changes needed?

Introducing a new agent named HuggingFaceAgent which can connect to models in HuggingFace Hub to achieve several multimodal capabilities.

This agent essentially consists of a pairing between an assistant and a user-proxy agent, both are registered with the huggingface-hub models capabilities. Users could seamlessly access this agent to leverage its multimodal capabilities, without the need for manual registration of toolkits for execution.

Some key changes:

added HuggingFaceClient class in autogen/agentchat/contrib/huggingface_utils.py: this class simplifies calling HuggingFace models locally or remotely.
added HuggingFaceAgent class in autogen/agentchat/contrib/huggingface_agent.py: this agent utilizes HuggingFaceClient to achieve multimodal capabilities.
added HuggingFaceImageGenerator class in autogen/agentchat/contrib/capabilities/generate_images.py: this class enables text-based LLMs to generate images using HuggingFaceClient.
added notebook samples to demostrate how these new classes work
fixed some bugs

Related issue number

The second approach mentioned in #2577

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

codecov-commenter · 2024-05-05T16:25:58Z

Codecov Report

Attention: Patch coverage is 0.89286% with 222 lines in your changes are missing coverage. Please review.

Project coverage is 44.73%. Comparing base (11d9336) to head (3282ae4).
Report is 34 commits behind head on main.

Files	Patch %	Lines
autogen/agentchat/contrib/huggingface_agent.py	0.00%	104 Missing and 1 partial ⚠️
autogen/agentchat/contrib/huggingface_utils.py	0.00%	95 Missing ⚠️
.../agentchat/contrib/capabilities/generate_images.py	0.00%	22 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2599       +/-   ##
===========================================
+ Coverage   33.60%   44.73%   +11.13%     
===========================================
  Files          87       89        +2     
  Lines        9336     9641      +305     
  Branches     1987     2211      +224     
===========================================
+ Hits         3137     4313     +1176     
+ Misses       5933     4959      -974     
- Partials      266      369      +103

Flag	Coverage Δ
unittest	`12.37% <0.00%> (?)`
unittests	`44.00% <0.89%> (+10.40%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

WaelKarkoub · 2024-05-05T22:15:19Z

@whiskyboy thanks for the PR! I had a couple of design questions and wanted your opinion on them.

Autogen has an image generation capability, which allows anyone to add text-to-image capabilities to any LLM.

autogen/autogen/agentchat/contrib/capabilities/generate_images.py

Line 112 in e878be5

class ImageGeneration(AgentCapability):

What do you think about implementing a new custom ImageGenerator that uses huggingface's apis, as opposed to creating a new agent type? We have dalle image generator implemented for reference.

autogen/autogen/agentchat/contrib/capabilities/generate_images.py

Line 22 in e878be5

class ImageGenerator(Protocol):

For image-to-text, we also have a capability called VisionCapability. @BeibinLi has more information on the design choices for that capability but I just wanted to bring it up for awareness.

autogen/autogen/agentchat/contrib/capabilities/vision_capability.py

Line 25 in e878be5

class VisionCapability(AgentCapability):

whiskyboy · 2024-05-06T01:47:04Z

@WaelKarkoub Thanks for your comment!
Yes, and in fact I have got inspired and learned a lot from the design of the two capabilities you mentioend above, and also from the MultimodalConversableAgent and LLaVAAgent, during development. Here are my thoughts:

Can we achieve the same functionality within the current multimodal capability implementations?
Certainly, we can implement a custom ImageGenerator or a custom custom_caption_func to realize the text-to-image and image-to-text capabilities using Huggingface's APIs. However, Huggingface provides the potential of many other multimodal capabilities, such as 'image-to-image', 'audio-to-audio', etc, which go beyond the current implementations. (A full list could be found here.) This draft PR serves as a PoC only now to show how a huggingface agent works. Once we align on the design, I'll proceed with implementing additional capabilities
Should we add a new agent type or should we add some new multimodal capabilities to leveraging Huggingface multimodal models?
Both designs make sense to me. Introducing a new agent type would allow for covering a diverse range of different multimodal capabilities for general purpose easily, while registering a new capability is more suitable for a specific task. (But we can also have a general capability or register multiple capabilities to one agent. So I'm flexible and open to either approach)
Do we really need a built-in support to Huggingface multimodal models?
I got the idea inspired from Transformers Agents and JARVIS . It's appealing (to me at least) to have a non-openai and out-of-box solution for adding multimodal capabilities to a text-only LLM in autogen. Huggingface stands out as a suitable choice due to its diverse range of multimodal models spanning from general-purpose to domain-specific areas. Additionally, it offers a cost-effective solution.

WaelKarkoub · 2024-05-06T02:31:28Z

@whiskyboy This is very cool and I appreciate your efforts! Your reasoning fits well with what I think now. Both approaches could be beneficial to the autogen community and could coexist. We can have standalone huggingface conversible agents as well as huggingface image generators, audio generators, etc.

I look at Autogen as a lego world where users can mix and match different useful tools (lego pieces), and the tools you've developed are valuable and versatile enough to be applicable across many areas (e.g., agent capabilities). For a concrete example, what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize? The HuggingFaceAgent wouldn't implement it as a capability but could directly use this newly decoupled logic. We could apply a similar strategy to other modalities as well.

One last question, is the image-to-image capability the same as image editing? If so, I'm considering improving the image generator capability to allow for this.

whiskyboy · 2024-05-06T12:31:36Z

@WaelKarkoub It's glad to know we are working towards the same goal!

what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize?

Sounds like a versatile lego block that could be utilized by both standalone agents and agent capabilities? I think it's a good idea! As it could enhance the function reusability, and make the code more readable and maintainable.

is the image-to-image capability the same as image editing?

Yes, some typical user scenarios include style transfer, image inpainting, etc. For instance, the timbrooks/instruct-pix2pix model could transform a dog in one image into a cat. These models are usually diffusion models that accept a souce image and a prompt text as input.

…4v format output

…method

whiskyboy · 2024-05-17T04:04:39Z

@WaelKarkoub @BeibinLi minding take a review of this PR? I'll add the documentation and tests once you approve the design.

WaelKarkoub · 2024-05-19T01:52:56Z

autogen/agentchat/contrib/huggingface_agent.py

+            @self._user_proxy.register_for_execution()
+            @self._assistant.register_for_llm(
+                name=HuggingFaceCapability.TEXT_TO_IMAGE.name,
+                description="Generates images from input text.",
+            )


What's the idea behind using function registration instead of using the text analyzer agent?

Basically I need the agent to identify which capability (text-to-image, image-to-text, etc.) should be called to complete the task, and extract the arguments for the call. Will text analyzer agent be better for this ask?

WaelKarkoub · 2024-05-19T02:01:23Z

autogen/agentchat/contrib/huggingface_agent.py

+        self._assistant = AssistantAgent(
+            self.name + "_inner_assistant",
+            system_message=system_message,
+            llm_config=inner_llm_config,
+            is_termination_msg=lambda x: False,
+        )


We may have to expose these two agents to the public by initializing them in the constructor for a couple of reasons:

Users can apply transform messages capability to limit token count by either truncation or compression.

Expose to the users that we'll be making extra API calls

Hum... It's a bit odd for me to explicitly pass two agents to the constructor here. Do you have an example code?
BTW, I was following the design pattern in WebSurferAgent and ImageGeneration, both have inner agents that are not exposed.

WaelKarkoub · 2024-05-19T02:05:21Z

autogen/agentchat/contrib/huggingface_utils.py

+from autogen.agentchat.contrib import img_utils
+
+
+class HuggingFaceClient:


Is this meant to be a model client?

autogen/autogen/oai/client.py

Line 64 in 19de99e

class ModelClient(Protocol):

It did not implement the model client protocol yet. But it's a good suggestion. I'll make the change.

gitguardian · 2024-05-27T05:55:17Z

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id	GitGuardian status	Secret	Commit	Filename
10493810	Triggered	Generic Password	`d422c63`	notebook/agentchat_pgvector_RetrieveChat.ipynb	View secret
10493810	Triggered	Generic Password	`d422c63`	notebook/agentchat_pgvector_RetrieveChat.ipynb	View secret
10493810	Triggered	Generic Password	`d422c63`	notebook/agentchat_pgvector_RetrieveChat.ipynb	View secret

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secrets safely. Learn here the best practices.
Revoke and rotate these secrets.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials
install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

^{🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.}

whiskyboy added 7 commits May 4, 2024 15:16

Add HuggingFaceAgent

feed2fd

Add a PoC notebook with text-to-image capability

9cd2eb1

Add image_to_text tool

0822be5

rename hf task to hf capability

218fc7d

Add image-to-image capability

705741f

update notebook

74191f0

update notebook

a758f21

whiskyboy mentioned this pull request May 5, 2024

[Feature Request]: Connect to the HuggingFace Hub to achieve a multimodal capability #2577

Open

sonichi requested a review from BeibinLi May 5, 2024 17:07

sonichi added multimodal language + vision, speech etc. integration software integration alt-models Pertains to using alternate, non-GPT, models (e.g., local models, llama, etc.) labels May 5, 2024

whiskyboy and others added 11 commits May 8, 2024 13:57

Merge branch 'microsoft:main' into huggingface_agent

0c96f60

Support multiple turns between inner proxy and assistant; support gpt…

45d36b9

…4v format output

Merge branch 'main' into huggingface_agent

9a17774

simplify arguments

1b5ddfb

add HuggingFaceImageGenerator to image generation capability

e53fdff

add HuggingFaceClient

0cf54fd

add HuggingFaceImageGenerator example

002ea7b

update model used in sample notebook

1c1c70c

add default model and inference_mode to HuggingFaceClient.__init__() …

c0c60aa

…method

use HuggingFaceClient() to execute task and add VQA capability

5d65d7d

bug fix and remove unused import

960b3f0

whiskyboy mentioned this pull request May 17, 2024

[Bug]: Dalle-Critic not working #2510

Closed

bugs fix

a469514

update notebook sample

c158916

whiskyboy marked this pull request as ready for review May 17, 2024 04:04

WaelKarkoub reviewed May 19, 2024

View reviewed changes

whiskyboy added 2 commits May 24, 2024 18:35

Merge branch 'microsoft:main' into huggingface_agent

89aebb8

Merge branch 'microsoft:main' into huggingface_agent

d422c63

refactor HuggingFaceAgent init

3282ae4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Huggingface agent #2599

Huggingface agent #2599

whiskyboy commented May 5, 2024 •

edited

codecov-commenter commented May 5, 2024 •

edited

WaelKarkoub commented May 5, 2024

whiskyboy commented May 6, 2024

WaelKarkoub commented May 6, 2024

whiskyboy commented May 6, 2024 •

edited

whiskyboy commented May 17, 2024

WaelKarkoub May 19, 2024

whiskyboy May 22, 2024

WaelKarkoub May 19, 2024

whiskyboy May 23, 2024 •

edited

WaelKarkoub May 19, 2024

whiskyboy May 22, 2024

gitguardian bot commented May 27, 2024 •

edited

		from autogen.agentchat.contrib import img_utils


		class HuggingFaceClient:

Huggingface agent #2599

Are you sure you want to change the base?

Huggingface agent #2599

Conversation

whiskyboy commented May 5, 2024 • edited

Why are these changes needed?

Related issue number

Checks

codecov-commenter commented May 5, 2024 • edited

Codecov Report

WaelKarkoub commented May 5, 2024

whiskyboy commented May 6, 2024

WaelKarkoub commented May 6, 2024

whiskyboy commented May 6, 2024 • edited

whiskyboy commented May 17, 2024

WaelKarkoub May 19, 2024

Choose a reason for hiding this comment

whiskyboy May 22, 2024

Choose a reason for hiding this comment

WaelKarkoub May 19, 2024

Choose a reason for hiding this comment

whiskyboy May 23, 2024 • edited

Choose a reason for hiding this comment

WaelKarkoub May 19, 2024

Choose a reason for hiding this comment

whiskyboy May 22, 2024

Choose a reason for hiding this comment

gitguardian bot commented May 27, 2024 • edited

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

whiskyboy commented May 5, 2024 •

edited

codecov-commenter commented May 5, 2024 •

edited

whiskyboy commented May 6, 2024 •

edited

whiskyboy May 23, 2024 •

edited

gitguardian bot commented May 27, 2024 •

edited