
Train or fine-tune models for computer automation agents #11

Open
James4Ever0 opened this issue Feb 24, 2024 · 4 comments

Comments

@James4Ever0

James4Ever0 commented Feb 24, 2024

Hello there, Microsoft UFO team! Excellent work on such a remarkable job, bringing AI closer to the Windows system. I am doing similar work, such as training custom GPT-2 models on computer automation datasets.

I have created two comprehensive datasets, covering terminal and GUI environments. My strategy is to generate data through random keyboard and mouse actions, and to collect the resulting observations mixed with other textual datasets.
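To make the random-action strategy concrete, here is a minimal, hypothetical sketch of such a generator. The function name, key set, and the 50/50 split between keyboard and mouse actions are assumptions for illustration, not the actual implementation; the output mirrors the event shape shown in the HID records later in this thread.

```python
import random

# Hypothetical key vocabulary: letters plus a few special keys,
# named in the same style as the recorded HID events.
KEYS = list("abcdefghijklmnopqrstuvwxyz") + ["Key.tab", "Key.enter", "Key.ctrl"]

def random_action(width=1280, height=768, rng=random):
    """Sample one random HID action: a key press/release pair or a mouse move."""
    if rng.random() < 0.5:
        key = rng.choice(KEYS)
        return [["key_press", key], ["key_release", key]]
    # Mouse coordinates are sampled uniformly over the screen resolution.
    return [["mouse_move", [rng.randrange(width), rng.randrange(height)]]]

print(random_action())
```

A recorder would replay such actions against the environment and log the observations that come back.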

This naive attempt reflects my strong interest in computer agents. I like the idea of GUI agent benchmarks such as WindowsBench, and I have thought about building a reward system based on program exit codes or VimGolf.

If you find my suggestions useful, I would love to hear your reply! Furthermore, if cooperation is possible, I would be thrilled to join your team to build better computer agents!


Update: Google has published Genie, an unsupervised action-space training method. I consider it highly applicable to the area of computer agents.

@vyokky
Contributor

vyokky commented Feb 26, 2024

Hi @James4Ever0, thanks for getting in touch. We are definitely interested in training a local model to enable faster inference. Would you mind sharing more context and perhaps a snippet of the dataset you created? We welcome cooperation and contribution if this is a good fit.

@James4Ever0
Author

James4Ever0 commented Mar 5, 2024

The terminal dataset comprises a unique trajectory identifier, observations of the terminal, and actions taken by the agent.

The observation can be either the full view of the terminal or only the updated lines, with line numbers enclosed in square brackets.

The actions taken by the agent are expressed in Godlang, a language that empowers an LLM to interface with TUIs and GUIs.

Preview of the terminal dataset:

====================JSON RESPONSES====================
identifier received from websocket 77bf0b60-056d-4a15-afa4-62431d6ba773
====================JSON RESPONSES====================
Cursur at: (0, 0)
Updated content:
[0 ]
[1 ]
[2 ]
[3 ]
[4 ]
[5 ]
[6 ]
[7 ]
[8 ]
[9 ]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
Updated lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Fullscreen:
====================JSON RESPONSES====================
Cursur at: (0, 4)
Updated content:
[0 ] / #
Updated lines: 0
Fullscreen:
/ #
VIEW
SPECIAL CTRL+C
SPECIAL TAB
VIEW
SPECIAL CTRL+6
Command list: ['VIEW', 'SPECIAL CTRL+C', 'SPECIAL TAB', 'VIEW', 'SPECIAL CTRL+6']
Regular sleep for 0.200000 seconds
Exiting reading action list because of 'VIEW' command
WAIT 0.548
TYPE n
REM Random actions
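Godlang's full grammar is not shown here, so the following is only a hypothetical sketch of how the opcodes visible in the preview (VIEW, SPECIAL, WAIT, TYPE, REM) might be parsed into structured commands:

```python
# Hypothetical parser for the Godlang-style action lines seen in the preview.
# The opcode set and (opcode, argument) representation are assumptions.
def parse_godlang(lines):
    actions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        opcode, _, arg = line.partition(" ")
        if opcode == "WAIT":
            actions.append((opcode, float(arg)))   # sleep duration in seconds
        elif opcode == "REM":
            continue                               # comments carry no action
        else:
            actions.append((opcode, arg or None))  # VIEW, SPECIAL, TYPE, ...
    return actions

script = ["VIEW", "SPECIAL CTRL+C", "WAIT 0.548", "TYPE n", "REM Random actions"]
print(parse_godlang(script))
# → [('VIEW', None), ('SPECIAL', 'CTRL+C'), ('WAIT', 0.548), ('TYPE', 'n')]
```

An executor would then dispatch each pair to the terminal, e.g. sending keystrokes for TYPE and sleeping for WAIT.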

@James4Ever0
Author

James4Ever0 commented Mar 5, 2024

After extracting the RAR file, you will find a number of folders named by timestamps, each containing these files:

hid_record.jsonl     video_record.mp4	     video_timestamps.json
hid_timestamps.json  video_record_script.sh

video_record.mp4 is a 30 fps video at 1280x768 resolution, in which each frame is a screenshot; frames were not captured at the video's playback speed.

In hid_record.jsonl you will find entries like:

{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.ctrl"], ["key_press", "Key.shift"], ["key_press", "Key.page_up"], ["key_release", "Key.page_up"], ["key_release", "Key.shift"], ["key_release", "Key.ctrl"]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["mouse_move", [782, 682]]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.alt"], ["key_press", "'l'"], ["key_release", "'l'"], ["key_release", "Key.alt"]]}
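Since hid_record.jsonl is plain JSON Lines, it can be loaded with the standard library alone. A minimal sketch (the helper name is an assumption):

```python
import json

def load_hid_events(path):
    """Read hid_record.jsonl: one JSON object per line, each holding the
    HID events captured in one interval (often an empty list)."""
    with open(path) as f:
        return [json.loads(line)["HIDEvents"] for line in f if line.strip()]

# Example on one line from the snippet above:
line = '{"HIDEvents": [["mouse_move", [782, 682]]]}'
events = json.loads(line)["HIDEvents"]
print(events)  # → [['mouse_move', [782, 682]]]
```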

video_timestamps.json contains the corresponding UNIX timestamps for every frame recorded:

[
    1685664003.6361628,
    1685664003.6745877,
    1685664003.6882446,
    1685664003.715868,
    1685664003.7464304,
    1685664003.7711987,
    1685664003.7833188,
    1685664003.8149195,
    ...
]

hid_timestamps.json is similar to video_timestamps.json: it contains a timestamp for every HID event entry in hid_record.jsonl, including the empty ones.
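Because both files carry UNIX timestamps, each HID entry can be aligned with its video frame. A sketch of one plausible pairing rule (nearest preceding frame; the pairing rule itself is an assumption, not part of the dataset spec):

```python
import bisect

def frame_index_for(hid_ts, video_timestamps):
    """Map a HID timestamp to the index of the nearest preceding video frame,
    assuming video_timestamps is sorted ascending."""
    i = bisect.bisect_right(video_timestamps, hid_ts) - 1
    return max(i, 0)  # clamp events that precede the first frame

video_ts = [1685664003.636, 1685664003.674, 1685664003.688, 1685664003.716]
print(frame_index_for(1685664003.700, video_ts))  # → 2
```

With this mapping, each screenshot frame can be labeled with the HID events that occurred since the previous frame, yielding (observation, action) pairs for training.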

@James4Ever0
Author

James4Ever0 commented Mar 10, 2024

Even though UFO can handle simple UI interfaces like Microsoft Word and Calculator, would it be possible to handle games like Cyberpunk 2077 or complex professional software like Premiere Pro and Photoshop? I doubt it, and I think it would require extensive training datasets, a complex training and evaluation regime, and advanced algorithms.
