
Train or fine-tune models for computer automation agents #11

Open
James4Ever0 opened this issue Feb 24, 2024 · 4 comments

Comments

@James4Ever0

James4Ever0 commented Feb 24, 2024

Hello there, Microsoft UFO team! Excellent work on such a remarkable job, bringing AI closer to the Windows system. I am doing similar work, such as training custom GPT-2 models on computer automation datasets.

I have created two comprehensive datasets, covering terminal and GUI environments. My strategy is to generate data through random keyboard and mouse actions, and to collect the resulting observations mixed with other textual datasets.
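To make the random-action strategy concrete, here is a minimal, hypothetical sketch of such a generator. The function name, key set, and the 50/50 split between keyboard and mouse actions are assumptions for illustration, not the actual implementation; the output mirrors the event shape shown in the HID records later in this thread.

```python
import random

# Hypothetical key vocabulary: letters plus a few special keys,
# named in the same style as the recorded HID events.
KEYS = list("abcdefghijklmnopqrstuvwxyz") + ["Key.tab", "Key.enter", "Key.ctrl"]

def random_action(width=1280, height=768, rng=random):
    """Sample one random HID action: a key press/release pair or a mouse move."""
    if rng.random() < 0.5:
        key = rng.choice(KEYS)
        return [["key_press", key], ["key_release", key]]
    # Mouse coordinates are sampled uniformly over the screen resolution.
    return [["mouse_move", [rng.randrange(width), rng.randrange(height)]]]

print(random_action())
```

A recorder would replay such actions against the environment and log the observations that come back.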

This naive attempt reflects my strong interest in computer agents. I like the idea of GUI agent benchmarks such as WindowsBench, and I have thought about building a reward system based on program exit codes or VimGolf.

If you find my suggestions useful, I would love to hear your reply! Furthermore, if cooperation is possible, I would be thrilled to join your team to build better computer agents!


Update: Google has published Genie, an unsupervised action-space training method. I consider it highly applicable to the area of computer agents.

@vyokky
Contributor

vyokky commented Feb 26, 2024

Hi @James4Ever0, thanks for getting in touch. We are definitely interested in training a local model to enable faster inference. Would you mind sharing more context and perhaps a snippet of the dataset you created? We welcome cooperation and contribution if this is a good fit.

@James4Ever0
Author

James4Ever0 commented Mar 5, 2024

The terminal dataset comprises a unique trajectory identifier, observations of the terminal, and actions taken by the agent.

The observation can be either the full view of the terminal or only the updated lines, with line numbers enclosed in square brackets.

The actions taken by the agent are expressed in Godlang, a language that empowers an LLM to interface with TUIs and GUIs.

Preview of the terminal dataset:

====================JSON RESPONSES====================
identifier received from websocket 77bf0b60-056d-4a15-afa4-62431d6ba773
====================JSON RESPONSES====================
Cursur at: (0, 0)
Updated content:
[0 ]
[1 ]
[2 ]
[3 ]
[4 ]
[5 ]
[6 ]
[7 ]
[8 ]
[9 ]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
Updated lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Fullscreen:
====================JSON RESPONSES====================
Cursur at: (0, 4)
Updated content:
[0 ] / #
Updated lines: 0
Fullscreen:
/ #
VIEW
SPECIAL CTRL+C
SPECIAL TAB
VIEW
SPECIAL CTRL+6
Command list: ['VIEW', 'SPECIAL CTRL+C', 'SPECIAL TAB', 'VIEW', 'SPECIAL CTRL+6']
Regular sleep for 0.200000 seconds
Exiting reading action list because of 'VIEW' command
WAIT 0.548
TYPE n
REM Random actions
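Godlang's full grammar is not shown here, so the following is only a hypothetical sketch of how the opcodes visible in the preview (VIEW, SPECIAL, WAIT, TYPE, REM) might be parsed into structured commands:

```python
# Hypothetical parser for the Godlang-style action lines seen in the preview.
# The opcode set and (opcode, argument) representation are assumptions.
def parse_godlang(lines):
    actions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        opcode, _, arg = line.partition(" ")
        if opcode == "WAIT":
            actions.append((opcode, float(arg)))   # sleep duration in seconds
        elif opcode == "REM":
            continue                               # comments carry no action
        else:
            actions.append((opcode, arg or None))  # VIEW, SPECIAL, TYPE, ...
    return actions

script = ["VIEW", "SPECIAL CTRL+C", "WAIT 0.548", "TYPE n", "REM Random actions"]
print(parse_godlang(script))
# → [('VIEW', None), ('SPECIAL', 'CTRL+C'), ('WAIT', 0.548), ('TYPE', 'n')]
```

An executor would then dispatch each pair to the terminal, e.g. sending keystrokes for TYPE and sleeping for WAIT.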

@James4Ever0
Author

James4Ever0 commented Mar 5, 2024

After extracting the RAR file, you will find a number of folders named by timestamps, each containing these files:

hid_record.jsonl     video_record.mp4	     video_timestamps.json
hid_timestamps.json  video_record_script.sh

video_record.mp4 is a 30 fps video at 1280x768 resolution, in which each frame is a screenshot; frames were not captured at the video's playback speed.

In hid_record.jsonl you will find entries like:

{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.ctrl"], ["key_press", "Key.shift"], ["key_press", "Key.page_up"], ["key_release", "Key.page_up"], ["key_release", "Key.shift"], ["key_release", "Key.ctrl"]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["mouse_move", [782, 682]]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.alt"], ["key_press", "'l'"], ["key_release", "'l'"], ["key_release", "Key.alt"]]}
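Since hid_record.jsonl is plain JSON Lines, it can be loaded with the standard library alone. A minimal sketch (the helper name is an assumption):

```python
import json

def load_hid_events(path):
    """Read hid_record.jsonl: one JSON object per line, each holding the
    HID events captured in one interval (often an empty list)."""
    with open(path) as f:
        return [json.loads(line)["HIDEvents"] for line in f if line.strip()]

# Example on one line from the snippet above:
line = '{"HIDEvents": [["mouse_move", [782, 682]]]}'
events = json.loads(line)["HIDEvents"]
print(events)  # → [['mouse_move', [782, 682]]]
```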

video_timestamps.json contains the corresponding UNIX timestamps for every frame recorded:

[
    1685664003.6361628,
    1685664003.6745877,
    1685664003.6882446,
    1685664003.715868,
    1685664003.7464304,
    1685664003.7711987,
    1685664003.7833188,
    1685664003.8149195,
    ...
]

hid_timestamps.json is similar to video_timestamps.json: it contains a timestamp for every HID event entry in hid_record.jsonl, including the empty ones.
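Because both files carry UNIX timestamps, each HID entry can be aligned with its video frame. A sketch of one plausible pairing rule (nearest preceding frame; the pairing rule itself is an assumption, not part of the dataset spec):

```python
import bisect

def frame_index_for(hid_ts, video_timestamps):
    """Map a HID timestamp to the index of the nearest preceding video frame,
    assuming video_timestamps is sorted ascending."""
    i = bisect.bisect_right(video_timestamps, hid_ts) - 1
    return max(i, 0)  # clamp events that precede the first frame

video_ts = [1685664003.636, 1685664003.674, 1685664003.688, 1685664003.716]
print(frame_index_for(1685664003.700, video_ts))  # → 2
```

With this mapping, each screenshot frame can be labeled with the HID events that occurred since the previous frame, yielding (observation, action) pairs for training.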

@James4Ever0
Author

James4Ever0 commented Mar 10, 2024

Even though UFO can handle simple UI interfaces like Microsoft Word and Calculator, would it be possible to handle games like Cyberpunk 2077 or complex professional software like Premiere Pro and Photoshop? I doubt it, and I think it would require extensive training datasets, a complex training and evaluation regime, and advanced algorithms.
