Support return final task states for evaluation #1755
Conversation
except Exception as e:
    logger.error('Error in loop', exc_info=True)
    self.finish_state = self.get_state()
    self.finish_state.error = str(e)
I want to get clear on our logic: do we need the state after every step, or only the one after the final step?
We only need one after the final step -- in this case, when an error occurs, it will break the loop and hence end the `._run` function. But I guess you caught a bug: we shouldn't re-get the state at the end of the loop.
Should hopefully be fixed in the new commit.
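A minimal sketch of the fix being discussed: the final `State` is captured exactly once, either when an exception breaks the loop or after the loop ends normally, never on every step. The `DummyState`/`DummyController` classes and the `steps` parameter are hypothetical stand-ins for illustration, not the real controller API.

```python
class DummyState:
    """Stand-in for the controller State (history, error, ...)."""
    def __init__(self):
        self.error = None
        self.history = []

class DummyController:
    def __init__(self):
        self._state = DummyState()

    def get_state(self):
        return self._state

    def _run(self, steps):
        finish_state = None
        try:
            for step in steps:
                step()  # a step may raise and break the loop
        except Exception as e:
            # error path: snapshot the state and record the error
            finish_state = self.get_state()
            finish_state.error = str(e)
        if finish_state is None:
            # normal exit: take the state once here, not re-get it per step
            finish_state = self.get_state()
        return finish_state
```

Either way, callers receive one final state object, with `.error` set only on the error path.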
opendevin/core/main.py
Outdated
if fake_user_response_fn is None:
    message = input('Request user input >> ')
else:
    message = fake_user_response_fn(controller.get_state())
I think adding `fake_user_response_fn` to the `main` function signature will couple the test code with the normal logic. But I also didn't find a good way to decouple it and mock the test. I prefer to merge it first and optimize it in the future. cc @li-boxuan, do you have any good ideas?
This isn't for testing purposes - I believe it's for SWE-Bench evaluation, see here.
Looking at that giant PR, it looks like we could make this prompt a built-in one:
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
The agent could have a fully-autonomous mode so that it doesn't ask for user input; instead, it uses the above prompt. @xingyaoww what do you think? I could help on that if you need a hand.
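A hedged sketch of what `fake_user_response_fn` could look like for an evaluation run: instead of prompting a human, it returns the canned continuation prompt quoted above. The helper `get_user_message` mirrors the branching shown in the diff snippet; its name and the idea of inspecting `state` inside the callback are assumptions for illustration, not the actual API.

```python
# The canned prompt quoted in the review comment above.
CONTINUE_PROMPT = (
    'Please continue working on the task on whatever approach you think is suitable.\n'
    'If you think you have modified the code in a way that fixes the issue, please run '
    'the following command: <execute_bash> exit </execute_bash>.\n'
    'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)

def fake_user_response_fn(state):
    # `state` is the controller's current State; an eval harness could
    # inspect it (e.g. step count) to decide when to give up instead.
    return CONTINUE_PROMPT

def get_user_message(fake_user_response_fn=None, state=None):
    """Mirror of the branch in the diff: real input or faked response."""
    if fake_user_response_fn is None:
        return input('Request user input >> ')
    return fake_user_response_fn(state)
```

In an evaluation run the callback is always supplied, so `input()` is never reached.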
@li-boxuan That's actually a great idea!! Fully support that -- I can help start a PR to isolate that out - but the question is: do we want to implement this in the controller, or inside the CodeAct agent, like the countdown?
Ooh cool! The other agents currently have only a "fully autonomous" mode and need an interactive one; CodeAct is the other way around. Maybe an option to set on the agent when run? Like `agent.mode=interactive` or `agent.mode=autonomous`. It is the agent who picks the prompt to use, isn't it?
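The `agent.mode` idea could be sketched roughly like this: an enum flag set when the agent is run, with the agent itself picking the follow-up prompt. The `AgentMode`/`Agent` names and the `follow_up_prompt` method are hypothetical, purely to illustrate the proposal.

```python
from enum import Enum

class AgentMode(Enum):
    INTERACTIVE = 'interactive'
    AUTONOMOUS = 'autonomous'

class Agent:
    def __init__(self, mode=AgentMode.INTERACTIVE):
        self.mode = mode

    def follow_up_prompt(self):
        # The agent, not the controller, decides what to say next.
        if self.mode is AgentMode.AUTONOMOUS:
            return 'Please continue working on the task autonomously.'
        return None  # interactive: defer to the real user
```

A `None` return would signal the controller to ask the human, keeping the controller agnostic of any particular prompt.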
I tend to think the prompt itself will live in the agent, but there should be an interface/API call from the controller so that any agent can leverage this feature.
Well, I decided to keep this as a separate PR until after we get SWE-Bench in -- supporting it now would probably take more time :(
Added an issue to track this: #1798
Sure, implement the feature first and refactor it when we have time; we can also help. :)
Generally looks good to me. I'd like to hear other maintainers' opinions about the test change.
Why can't we just call `controller.get_state()` once it's finished? Double-tracking `final_state` and `current_state` seems like an anti-pattern.
@rbren The major issue with that is, when the controller sets the agent to other states, it completely resets the state for a new task (all histories are removed). We need to store the state before it gets reset, or we'd need to mess around with the agent-state setting, which could be more complex.
@xingyaoww I ran into this problem too recently. My recent PR refactored this a bit -- I've got another one around that changes the `reset_task` behavior to avoid this problem. Can you try simply removing the call to …?
I ran into this problem (sort of, but non-blocking) too when I worked on #1735. Would be nice to see an elegant solution!
@rbren but when the state is FINISHED, it would still get removed by …
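The reset problem described above can be sketched as follows: if `reset_task` wipes the history, the final state has to be snapshotted before the reset runs. The `State`/`Controller` classes and the `finish` method here are simplified stand-ins, not the real controller code.

```python
import copy

class State:
    def __init__(self):
        self.history = []

class Controller:
    def __init__(self):
        self.state = State()

    def reset_task(self):
        # resets the state for a new task: all histories are removed
        self.state = State()

    def finish(self):
        # snapshot the final state BEFORE the reset wipes it
        final_state = copy.deepcopy(self.state)
        self.reset_task()
        return final_state
```

Without the `deepcopy` snapshot, the caller would receive the freshly-reset, empty state.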
@rbren @li-boxuan @yufansong feel free to take another look! This should be the final PR before the SWE-Bench integration!
opendevin/core/main.py
Outdated
final_state = await controller.close()
return final_state
`controller.close()` doesn't seem to return a state, and I don't think we want it to. Maybe you just mean `return controller.state`?
Fixed! How about now?
Thanks! This is much cleaner
This is part of the modification in #1468.
- Modify `main` to return the final task state (with trajectory) to be saved by the eval script for further evaluation.
- Add `fake_user_response_fn` to fake the user response based on the `State` given to the model during the main loop. (We probably need to figure out a way to extend our existing integration test to test this, cc @li-boxuan)
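The two changes above could be exercised together roughly like this: an eval script calls `main` with a fake user response callback and receives the final state back. The `FinalState` class and the body of `main` are toy stand-ins for the real loop in `opendevin/core/main.py`; only the signature shape (`task`, `fake_user_response_fn`, returning the state) follows the PR description.

```python
import asyncio

class FinalState:
    """Toy stand-in for the State returned to the eval script."""
    def __init__(self):
        self.history = []
        self.error = None

async def main(task, fake_user_response_fn=None):
    # stand-in for the real agent loop: record the task and one faked
    # user turn, then return the final state with its trajectory
    state = FinalState()
    state.history.append(('user', task))
    if fake_user_response_fn is not None:
        state.history.append(('user', fake_user_response_fn(state)))
    return state

# an eval script would save final_state.history for later scoring
final_state = asyncio.run(main('fix the bug', lambda s: 'please continue'))
```

The key point is that `main` now hands the trajectory back to the caller instead of discarding it when the controller resets.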