New paper and code from Microsoft Research that experiments with giving LLMs access to the Python debugger. They found that the best models could indeed improve their results by running pdb as a tool.
They saw the best overall results on SWE-bench Lite with Claude 3.7 Sonnet, scoring 37.2% in rewrite mode without a debugger, 48.4% with the debugger tool, and 52.1% with debug(5), a mechanism in which the pdb tool only becomes available after the fifth rewrite attempt.
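The debug(5) gate is simple to picture in code. Here's a minimal sketch of the idea, assuming a plain per-task attempt counter; the names are mine, not the paper's:

```python
# Hypothetical sketch of the debug(5) gating mechanism: the pdb tool is
# withheld until the model has burned through a fixed number of rewrite
# attempts on its own.
REWRITE_THRESHOLD = 5  # the "5" in debug(5)

def available_tools(rewrite_attempts: int) -> list[str]:
    tools = ["rewrite"]
    if rewrite_attempts >= REWRITE_THRESHOLD:
        tools.append("pdb")  # debugger unlocked after the fifth rewrite attempt
    return tools
```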
Their code is available on GitHub. I found the implementation of the pdb tool, and tracked down the main system and user prompts in agents/debug_agent.py:
System prompt:
Your goal is to debug a Python program to make sure it can pass a set of test functions. You have access to the pdb debugger tools, you can use them to investigate the code, set breakpoints, and print necessary values to identify the bugs. Once you have gained enough information, propose a rewriting patch to fix the bugs. Avoid rewriting the entire code, focus on the bugs only.
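Driving pdb programmatically is straightforward. Here's a minimal sketch of what a pdb tool could look like (not the paper's implementation): it feeds a batch of commands to `python -m pdb` over stdin and returns the transcript. The script name `buggy.py`, the breakpoint line, and the variable `result` are invented for the example:

```python
import subprocess

def run_pdb_session(script: str, commands: list[str]) -> str:
    """Feed a batch of pdb commands to a script and return the transcript."""
    proc = subprocess.run(
        ["python", "-m", "pdb", script],
        input="\n".join(commands + ["q"]),  # always quit at the end
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout

# Example: break at line 12, continue to the breakpoint, print a local.
print(run_pdb_session("buggy.py", ["b 12", "c", "p result"]))
```

A real tool would keep one interactive session alive between model turns rather than restarting pdb for each batch, but the one-shot version shows the shape.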
User prompt (which they call an "action prompt"):
Based on the instruction, the current code, the last execution output, and the history information, continue your debugging process using pdb commands or to propose a patch using rewrite command. Output a single command, nothing else. Do not repeat your previous commands unless they can provide more information. You must be concise and avoid overthinking.
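The action prompt implies a tight agent loop: the model emits exactly one command per turn, which is routed either to the debugger or to the patch applier, and the output becomes the next turn's "last execution output". A hypothetical sketch of that loop, with `call_llm`, `run_pdb`, and `apply_patch` as stand-in callables and a made-up success signal:

```python
def debug_loop(call_llm, run_pdb, apply_patch, max_turns: int = 20) -> None:
    """Alternate between model commands and tool output, one command per turn."""
    history: list[tuple[str, str]] = []
    last_output = ""
    for _ in range(max_turns):
        # The model sees the current code, the last execution output, and the
        # history, and must reply with a single command and nothing else.
        command = call_llm(history, last_output).strip()
        if command.startswith("rewrite"):
            last_output = apply_patch(command)  # apply the patch, rerun the tests
            if "all tests passed" in last_output:  # hypothetical success signal
                break
        else:
            last_output = run_pdb(command)  # anything else goes to the debugger
        history.append((command, last_output))
```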
Via Import AI
Tags: prompt-engineering, llms, python, generative-ai, llm-tool-use, ai, microsoft, claude
Original post: https://simonwillison.net/2025/Mar/31/debug-gym/#atom-everything