Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support return final task states for evaluation #1755

Merged
merged 19 commits into from
May 15, 2024

Conversation

xingyaoww
Copy link
Collaborator

This is part of the modification in #1468.

  • It allows main to return the final task state (with trajectory) to be saved by the eval script for further evaluation.
  • It allows the user to provide fake_user_response_fn to fake user response based on the State to the model during the main loop. (We probably need to figure out a way to extend our existing integration test to test this, cc @li-boxuan)

Comment on lines 160 to 163
except Exception as e:
logger.error('Error in loop', exc_info=True)
self.finish_state = self.get_state()
self.finish_state.error = str(e)
Copy link
Collaborator

@yufansong yufansong May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to get clear about our logic, do we need the state after every step or only need the one after the final step?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need one after the final step -- in this case when error occurs, it will break the loop, hence end the ._run function. But i guess you caught a bug - we shouldn't re-get the state at the end of the lopp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be hopefully fixed in the new commit

Comment on lines 94 to 97
if fake_user_response_fn is None:
message = input('Request user input >> ')
else:
message = fake_user_response_fn(controller.get_state())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think add fake_user_response_fn into main function signature will cause test code and normal logic get coupled. But I also didn't find a good way to decouple it and mock the test. I prefer to merge it first and optimize it in the future. cc @li-boxuan Do you have any good idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't for testing purpose - I believe it's for SWE-Bench evaluation, see here.

Looking at that giant PR, it looks like we could make this prompt as a built-in one?

'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'

The agent could have a full-autonomous mode so that it doesn't ask for user inputs, and instead, it uses the above prompt. @xingyaoww what do you think? I could help on that if you need a hand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@li-boxuan That's actually a great idea!! Fully support that -- I can help starts a PR to isolate that out - but the issue would be, do we want to implement this in the controller, or inside codeact agent like the count down?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh cool! The other agents have only "fully autonomous" mode currently, and need interactive mode, CodeAct the other way around. Maybe an option to set on the agent when run?

Like agent.mode=interactive or agent.mode=autonomous. It is the agent who picks the prompt to use, isn't it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think the prompt itself will be in the agent itself but there should be an interface/api call from controller such that any agent can leverage this feature.

Copy link
Collaborator Author

@xingyaoww xingyaoww May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well i decided to keep it a separate PR after we got the SWE-Bench in -- supporting this now probably takes more time :(

Added an issue to track this: #1798

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, impelement feature first and refactor it when we have time, we also can help. :)

Copy link
Collaborator

@yufansong yufansong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General looks good to me. Want to hear other maintainers opinion about the test change.

Copy link
Collaborator

@rbren rbren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just call controller.get_state() once it's finished?

Double-tracking final_state and current_state seems like an anti-patern

@xingyaoww
Copy link
Collaborator Author

xingyaoww commented May 13, 2024

@rbren The major issue of that is, when controller sets agent to other states, it completely reset the state for a new task (all histories are removed). We need to store the state before it got reseted, or we need to mess around with the set agent state which could be more complex.

@rbren
Copy link
Collaborator

rbren commented May 13, 2024

@xingyaoww I ran into this problem too recently. My recent PR refactored this a bit--I've got another one around that changes the reset_task behavior to avoid this problem.

Can you try simply removing the call to reset_task when AgentState is ERROR? I think that might do the trick

@li-boxuan
Copy link
Collaborator

@xingyaoww I ran into this problem too recently. My recent PR refactored this a bit--I've got another one around that changes the reset_task behavior to avoid this problem.

Can you try simply removing the call to reset_task when AgentState is ERROR? I think that might do the trick

I ran into this problem (sort of, but non-blocking) too, when I worked on #1735. Would be nice to see an elegant solution!

@xingyaoww
Copy link
Collaborator Author

xingyaoww commented May 14, 2024

Can you try simply removing the call to reset_task when AgentState is ERROR? I think that might do the trick

@rbren but when state is FINISHED, the state would still get removed by reset_task hence not accessible for return value? Can you share with me what you changed in that refractor?

Copy link
Collaborator Author

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbren @li-boxuan @yufansong feel free to take another look! This should be the final PR before the SWE-Bench integration!

tests/integration/test_agent.py Show resolved Hide resolved
Comment on lines 118 to 119
final_state = await controller.close()
return final_state
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

controller.close() doesn't seem to return a state, and I don't think we want it to. Maybe you just mean return controller.state?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! How about now!

Copy link
Collaborator

@rbren rbren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is much cleaner

opendevin/core/main.py Outdated Show resolved Hide resolved
@xingyaoww xingyaoww enabled auto-merge (squash) May 15, 2024 03:38
@xingyaoww xingyaoww merged commit d1fd277 into main May 15, 2024
25 checks passed
@xingyaoww xingyaoww deleted the xw/controller-return-final-state branch May 15, 2024 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants