A Brief Look at the DeepResearch Code
Overview
Code: DeepResearch
This post focuses on the ReAct inference flow under the inference directory.
- inference
- ├── eval_data
- │ ├── example_with_file.jsonl
- │ ├── example.jsonl
- │ └── file_corpus
- │ └── hello.txt
- ├── file_tools
- │ ├── __pycache__
- │ │ └── file_parser.cpython-313.pyc
- │ ├── file_parser.py
- │ ├── idp.py
- │ ├── utils.py
- │ ├── video_agent.py
- │ └── video_analysis.py
- ├── prompt.py
- ├── react_agent.py
- ├── run_multi_react.py
- ├── run_react_infer.sh
- ├── tool_file.py
- ├── tool_python.py
- ├── tool_scholar.py
- ├── tool_search.py
- └── tool_visit.py
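The entry point is run_multi_react.py, which is launched from run_react_infer.sh.
run_multi_react.py initializes the node environment, loads the dataset, loads the model configuration, and samples multiple rollouts.
react_agent is the ReAct-style agent; it generates output iteratively and calls tools.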
- from react_agent import MultiTurnReactAgent
- test_agent = MultiTurnReactAgent(
- llm=llm_cfg,
- function_list=["search", "visit", "google_scholar", "PythonInterpreter"]
- )
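The multi-rollout sampling that run_multi_react.py performs can be pictured with a minimal sketch. This is not the repository's code: `sample_rollouts`, `fake_agent`, and the result layout are hypothetical names chosen for illustration, and a thread pool is assumed only as one plausible way to run rollouts concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def sample_rollouts(
    questions: List[str],
    run_agent: Callable[[str, int], str],
    rollouts: int = 3,
    max_workers: int = 4,
) -> Dict[int, List[dict]]:
    """Run every question `rollouts` times and group results by rollout index."""
    results: Dict[int, List[dict]] = {r: [] for r in range(1, rollouts + 1)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all (rollout, question) pairs up front so they run concurrently.
        pending = [
            (r, q, pool.submit(run_agent, q, r))
            for r in range(1, rollouts + 1)
            for q in questions
        ]
        # Collect in submission order to keep the output deterministic.
        for r, q, future in pending:
            results[r].append({"question": q, "prediction": future.result()})
    return results


def fake_agent(question: str, rollout_idx: int) -> str:
    # Hypothetical stand-in for a full agent run (assumption, not the real agent).
    return f"answer to {question!r} (rollout {rollout_idx})"


results = sample_rollouts(["q1", "q2"], fake_agent, rollouts=2)
print(results[1][0]["prediction"])
```

Grouping by rollout index makes it easy to write one output file per rollout and compare predictions across samples.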
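react_agent
This is the main ReAct agent. It orchestrates the handling of model output: extracting and executing tool calls, then appending the tool responses to the conversation.
It runs the full ReAct flow, reports the final execution status, and handles exceptions raised along the way.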
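The loop described above can be sketched as follows. This is a simplified illustration, not the code in react_agent.py: `run_react`, `scripted_llm`, and `fake_search` are hypothetical, the `<tool_response>` wrapper and the returned status strings are assumptions, and the real agent also deals with context limits, timeouts, and the special PythonInterpreter call format.

```python
import json
import re
from typing import Callable, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_react(
    question: str,
    llm: Callable[[List[dict]], str],
    tools: Dict[str, Callable[..., str]],
    max_turns: int = 10,
) -> dict:
    """Alternate between model output and tool execution until an answer appears."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        answer = ANSWER_RE.search(reply)
        if answer:
            return {"status": "answer", "prediction": answer.group(1).strip(),
                    "messages": messages}
        call = TOOL_CALL_RE.search(reply)
        if call:
            try:
                spec = json.loads(call.group(1))
                result = tools[spec["name"]](**spec.get("arguments", {}))
            except Exception as exc:  # malformed JSON, unknown tool, tool failure
                result = f"Error: {exc}"
            # Feed the observation back to the model as the next user turn.
            messages.append({"role": "user",
                             "content": f"<tool_response>\n{result}\n</tool_response>"})
    return {"status": "max_turns_exceeded", "prediction": "", "messages": messages}


def scripted_llm(messages: List[dict]) -> str:
    # Hypothetical model: first asks for a search, then answers.
    if len(messages) == 1:
        return ('<tool_call>\n{"name": "search", "arguments": '
                '{"query": ["capital of France"]}}\n</tool_call>')
    return "<answer>Paris</answer>"


def fake_search(query: List[str]) -> str:
    return "Paris is the capital of France."


result = run_react("What is the capital of France?", scripted_llm, {"search": fake_search})
print(result["status"], result["prediction"])
```

Catching tool errors and returning them as observations lets the model retry with different arguments instead of aborting the whole run.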
Tool calls
Visit a specific URL and return a summary guided by the goal (this is the visit tool): \((url, goal; \pi) \rightarrow summary\)
- JINA_API_KEYS = os.getenv("JINA_API_KEYS", "")
- def readpage_jina(self, url: str, goal: str) -> str:
-     """
-     Attempt to read webpage content by alternating between jina and aidata services.
-     Args:
-         url: The URL to read
-         goal: The goal/purpose of reading the page
-     Returns:
-         str: The webpage content or error message
-     """
-
-     # call_server summarizes the page content according to the goal
-     summary_page_func = self.call_server
-     max_retries = int(os.getenv('VISIT_SERVER_MAX_RETRIES', 1))
-     # Use jina to convert the page at the URL into markdown
-     content = self.html_readpage_jina(url)
-     #############################################################
-     # Process the markdown page content
-     #############################################################
-     # Case 1: jina managed to extract the page
-     if content and not content.startswith("[visit] Failed to read page.") and content != "[visit] Empty content." and not content.startswith("[document_parser]"):
-         # Pre-process: cap the token length of content so the LLM context does not overflow
-         content = truncate_to_tokens(content, max_tokens=95000)
-         # Summary prompt
-         messages = [{"role":"user","content": EXTRACTOR_PROMPT.format(webpage_content=content, goal=goal)}]
-         parse_retry_times = 0
-         # raw is the summarized page information
-         raw = summary_page_func(messages, max_retries=max_retries)
-         summary_retries = 3
-         # If raw has fewer than 10 characters, treat the summary as failed: raw is expected to be
-         # a JSON object of the form {"rational": ..., "evidence": ..., "summary": ...}
-         while len(raw) < 10 and summary_retries >= 0:
-             # Cut off 30% of the content (keep 70%) while retries remain
-             truncate_length = int(0.7 * len(content)) if summary_retries > 0 else 25000
-             status_msg = (
-                 f"[visit] Summary url[{url}] "
-                 f"attempt {3 - summary_retries + 1}/3, "
-                 f"content length: {len(content)}, "
-                 f"truncating to {truncate_length} chars"
-             ) if summary_retries > 0 else (
-                 f"[visit] Summary url[{url}] failed after 3 attempts, "
-                 f"final truncation to 25000 chars"
-             )  # if cutting 30% did not help, keep only the first 25000 characters
-             print(status_msg)
-             content = content[:truncate_length]
-             extraction_prompt = EXTRACTOR_PROMPT.format(
-                 webpage_content=content,
-                 goal=goal
-             )
-             messages = [{"role": "user", "content": extraction_prompt}]
-             raw = summary_page_func(messages, max_retries=max_retries)
-             summary_retries -= 1
-         # Parse the summary format
-         parse_retry_times = 2
-         if isinstance(raw, str):
-             raw = raw.replace("```json", "").replace("```", "").strip()
-         while parse_retry_times < 3:
-             try:
-                 raw = json.loads(raw)
-                 break
-             except:
-                 # If parsing fails, regenerate the summary
-                 raw = summary_page_func(messages, max_retries=max_retries)
-                 parse_retry_times += 1
-         # Parsing failed
-         if parse_retry_times >= 3:
-             useful_information = "The useful information in {url} for user goal {goal} as follows: \n\n".format(url=url, goal=goal)
-             useful_information += "Evidence in page: \n" + "The provided webpage content could not be accessed. Please check the URL or file format." + "\n\n"
-             useful_information += "Summary: \n" + "The webpage content could not be processed, and therefore, no information is available." + "\n\n"
-         # Parsing succeeded: return evidence and summary together
-         else:
-             useful_information = "The useful information in {url} for user goal {goal} as follows: \n\n".format(url=url, goal=goal)
-             useful_information += "Evidence in page: \n" + str(raw["evidence"]) + "\n\n"
-             useful_information += "Summary: \n" + str(raw["summary"]) + "\n\n"
-         if len(useful_information) < 10 and summary_retries < 0:
-             print("[visit] Could not generate valid summary after maximum retries")
-             useful_information = "[visit] Failed to read page"
-         return useful_information
-     # If no valid content was obtained after all retries
-     # Case 2: the raw page itself is unusable and jina could not extract it, so return a failure message
-     else:
-         useful_information = "The useful information in {url} for user goal {goal} as follows: \n\n".format(url=url, goal=goal)
-         useful_information += "Evidence in page: \n" + "The provided webpage content could not be accessed. Please check the URL or file format." + "\n\n"
-         useful_information += "Summary: \n" + "The webpage content could not be processed, and therefore, no information is available." + "\n\n"
-         return useful_information
-
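The shrink-and-retry policy in readpage_jina can be isolated into a small sketch. The 70% cut, the 25000-character fallback, and the 10-character threshold come from the listing above; `summarize_with_shrinking` and `fake_summarize` are hypothetical names, and the fake summarizer only imitates a model that returns an empty reply on overly long input.

```python
from typing import Callable, List, Tuple


def summarize_with_shrinking(
    content: str,
    summarize: Callable[[str], str],
    min_len: int = 10,
    retries: int = 3,
    final_chars: int = 25000,
) -> Tuple[str, List[int]]:
    """Retry `summarize` on progressively shorter content.

    Returns the last raw reply and the content lengths that were tried.
    """
    tried = [len(content)]
    raw = summarize(content)
    while len(raw) < min_len and retries >= 0:
        # Keep 70% of the text while retries remain, then fall back to a hard cap.
        keep = int(0.7 * len(content)) if retries > 0 else final_chars
        content = content[:keep]
        tried.append(len(content))
        raw = summarize(content)
        retries -= 1
    return raw, tried


def fake_summarize(text: str) -> str:
    # Hypothetical summary model: "fails" (empty reply) above 40,000 characters.
    return "" if len(text) > 40000 else '{"summary": "ok"}'


raw, tried = summarize_with_shrinking("x" * 100000, fake_summarize)
print(tried)
```

With three retries available, the content is cut by 30% up to three times, and one last attempt uses at most the first 25000 characters, so a page gets at most five summary attempts in total.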
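Jina example
Input: https://r.jina.ai/ + {url}, here with url = https://www.axtonliu.ai/newsletters/ai-2/posts/jina-reader-api-four-usage-methods-guide
Original webpage: [screenshot not included]
The Jina output consists of three parts:
- title
- url
- markdown content (including image URLs, hyperlinks, and so on)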
- Title: Jina Reader API完全指南:4种实用集成方案详解 | AI开发教程
- URL Source: https://www.axtonliu.ai/newsletters/ai-2/posts/jina-reader-api-four-usage-methods-guide
- Markdown Content:
- 构建知识库,或者分析各种文章数据,是大家使用 AI 很重要的一个应用场景,
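A minimal sketch of how the three parts could be separated, assuming the output always carries the `Title:`, `URL Source:`, and `Markdown Content:` labels seen in the sample above. `build_reader_url` and `parse_reader_output` are hypothetical helpers, not functions from the repository or from Jina, and no network request is made here.

```python
from typing import Dict


def build_reader_url(url: str) -> str:
    """Prefix the target URL with the reader endpoint shown in the example."""
    return "https://r.jina.ai/" + url


def parse_reader_output(text: str) -> Dict[str, str]:
    """Split reader output into title, source URL, and markdown body."""
    header, _, body = text.partition("Markdown Content:")
    fields = {}
    for line in header.splitlines():
        # Split on the first colon only, so URLs keep their "https:" part.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "title": fields.get("Title", ""),
        "url": fields.get("URL Source", ""),
        "markdown": body.strip(),
    }


sample = (
    "Title: Example Page\n"
    "URL Source: https://example.com/a\n"
    "Markdown Content:\n"
    "# Heading\n"
    "Body text"
)
parsed = parse_reader_output(sample)
print(parsed["title"], parsed["url"])
```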
tool_file
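Given a file at a URL and a goal, this tool returns a summary, much like tool_visit. The difference is that it relies on file_tools to read the file at the given URL, whereas visit relies on Jina to read the webpage at the given URL.
- """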
- input:
- - query/goal: str
- - Docs: List[file]/List[url]
- - file type: 'pdf', 'docx', 'pptx', 'txt', 'html', 'csv', 'tsv', 'xlsx', 'xls', 'doc', 'zip', '.mp4', '.mov', '.avi', '.mkv', '.webm', '.mp3', '.wav', '.aac', '.ogg', '.flac'
- output:
- - answer: str
- - useful_information: str
- """
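One way to picture the file-type handling is a dispatch on the extension. The extensions below are copied from the docstring above, but the grouping into document, video, and audio and the `classify_file` helper are assumptions made for illustration; the actual routing inside file_tools may differ.

```python
import os

# Extensions taken from the docstring; the three groups are an assumed split.
DOCUMENT_EXTS = {"pdf", "docx", "pptx", "txt", "html", "csv", "tsv",
                 "xlsx", "xls", "doc", "zip"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm"}
AUDIO_EXTS = {"mp3", "wav", "aac", "ogg", "flac"}


def classify_file(path_or_url: str) -> str:
    """Return 'document', 'video', 'audio', or 'unsupported' for a file name or URL."""
    path = path_or_url.split("?", 1)[0]  # drop a URL query string, if any
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "document"
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    return "unsupported"


for name in ["report.PDF", "https://example.com/talk.mp4?x=1", "song.flac", "tool.exe"]:
    print(name, "->", classify_file(name))
```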
tool_search
Calls Google to run a search: \((q; Engine) \rightarrow docs\)
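The mapping from queries to documents can be sketched with a pluggable engine. Everything here is hypothetical: `search`, `fake_engine`, the result fields (`title`, `link`, `snippet`), and the output layout are illustrative choices, not the format produced by tool_search.py. The only facts taken from the tool description are that it accepts multiple queries and returns a single string.

```python
from typing import Callable, Dict, List


def search(queries: List[str],
           engine: Callable[[str], List[Dict[str, str]]],
           top_k: int = 3) -> str:
    """Run each query through `engine` and render the top results as one string."""
    blocks = []
    for q in queries:
        docs = engine(q)[:top_k]
        lines = [
            f"{i}. [{d['title']}]({d['link']})\n{d['snippet']}"
            for i, d in enumerate(docs, 1)
        ]
        blocks.append(f"Results for '{q}':\n" + "\n\n".join(lines))
    return "\n=======\n".join(blocks)


def fake_engine(query: str) -> List[Dict[str, str]]:
    # Hypothetical engine returning canned results instead of calling a search API.
    return [{"title": f"About {query}",
             "link": "https://example.com/" + query.replace(" ", "-"),
             "snippet": f"A page about {query}."}]


text = search(["react agent", "jina reader"], fake_engine)
print(text)
```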
tool_scholar
Similar to tool_search, except that tool_scholar searches for papers on Google Scholar.
Prompt
There are two prompts: the ReAct system prompt, and the extraction prompt that visit uses for summarization.
- SYSTEM_PROMPT = """You are a deep research assistant. Your core function is to conduct thorough, multi-source investigations into any topic. You must handle both broad, open-domain inquiries and queries within specialized academic fields. For every request, synthesize information from credible, diverse sources to deliver a comprehensive, accurate, and objective response. When you have gathered sufficient information and are ready to provide the definitive response, you must enclose the entire final answer within <answer></answer> tags.
- # Tools
- You may call one or more functions to assist with the user query.
- You are provided with function signatures within <tools></tools> XML tags:
- <tools>
- {"type": "function", "function": {"name": "search", "description": "Perform Google web searches then returns a string of the top search results. Accepts multiple queries.", "parameters": {"type": "object", "properties": {"query": {"type": "array", "items": {"type": "string", "description": "The search query."}, "minItems": 1, "description": "The list of search queries."}}, "required": ["query"]}}}
- {"type": "function", "function": {"name": "visit", "description": "Visit webpage(s) and return the summary of the content.", "parameters": {"type": "object", "properties": {"url": {"type": "array", "items": {"type": "string"}, "description": "The URL(s) of the webpage(s) to visit. Can be a single URL or an array of URLs."}, "goal": {"type": "string", "description": "The specific information goal for visiting webpage(s)."}}, "required": ["url", "goal"]}}}
- {"type": "function", "function": {"name": "PythonInterpreter", "description": "Executes Python code in a sandboxed environment. To use this tool, you must follow this format:
- 1. The 'arguments' JSON object must be empty: {}.
- 2. The Python code to be executed must be placed immediately after the JSON block, enclosed within <code> and </code> tags.
- IMPORTANT: Any output you want to see MUST be printed to standard output using the print() function.
- Example of a correct call:
- <tool_call>
- {"name": "PythonInterpreter", "arguments": {}}
- <code>
- import numpy as np
- # Your code here
- print(f"The result is: {np.mean([1,2,3])}")
- </code>
- </tool_call>", "parameters": {"type": "object", "properties": {}, "required": []}}}
- {"type": "function", "function": {"name": "google_scholar", "description": "Leverage Google Scholar to retrieve relevant information from academic publications. Accepts multiple queries. This tool will also return results from google search", "parameters": {"type": "object", "properties": {"query": {"type": "array", "items": {"type": "string", "description": "The search query."}, "minItems": 1, "description": "The list of search queries for Google Scholar."}}, "required": ["query"]}}}
- {"type": "function", "function": {"name": "parse_file", "description": "This is a tool that can be used to parse multiple user uploaded local files such as PDF, DOCX, PPTX, TXT, CSV, XLSX, DOC, ZIP, MP4, MP3.", "parameters": {"type": "object", "properties": {"files": {"type": "array", "items": {"type": "string"}, "description": "The file name of the user uploaded local files to be parsed."}}, "required": ["files"]}}}
- </tools>
- For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
- <tool_call>
- {"name": <function-name>, "arguments": <args-json-object>}
- </tool_call>
- Current date: """
- EXTRACTOR_PROMPT = """Please process the following webpage content and user goal to extract relevant information:
- ## **Webpage Content**
- {webpage_content}
- ## **User Goal**
- {goal}
- ## **Task Guidelines**
- 1. **Content Scanning for Rational**: Locate the **specific sections/data** directly related to the user's goal within the webpage content
- 2. **Key Extraction for Evidence**: Identify and extract the **most relevant information** from the content, you never miss any important information, output the **full original context** of the content as far as possible, it can be more than three paragraphs.
- 3. **Summary Output for Summary**: Organize into a concise paragraph with logical flow, prioritizing clarity and judge the contribution of the information to the goal.
- **Final Output Format using JSON format has "rational", "evidence", "summary" feilds**
- """
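The two prompts are filled in differently, which a short sketch makes concrete. SYSTEM_PROMPT contains literal JSON braces and ends with "Current date: ", so it is assumed here that the date is appended by concatenation rather than with str.format; EXTRACTOR_PROMPT has named placeholders and is filled with str.format, as in the visit listing. The `_STUB` constants are abbreviated stand-ins for the real prompts, and both builder functions are hypothetical.

```python
from datetime import date
from typing import Dict, List

# Abbreviated stand-ins for the real prompts (assumption: shortened for the example).
SYSTEM_PROMPT_STUB = (
    'You are a deep research assistant.\n'
    '{"name": <function-name>, "arguments": <args-json-object>}\n'
    'Current date: '
)
EXTRACTOR_PROMPT_STUB = (
    "## **Webpage Content**\n{webpage_content}\n\n"
    "## **User Goal**\n{goal}\n"
)


def build_react_messages(question: str, today: str) -> List[Dict[str, str]]:
    # Plain concatenation: str.format would choke on the JSON braces in the prompt.
    return [
        {"role": "system", "content": SYSTEM_PROMPT_STUB + today},
        {"role": "user", "content": question},
    ]


def build_extractor_messages(webpage_content: str, goal: str) -> List[Dict[str, str]]:
    # Only the template is parsed by format, so braces inside the page text are safe.
    prompt = EXTRACTOR_PROMPT_STUB.format(webpage_content=webpage_content, goal=goal)
    return [{"role": "user", "content": prompt}]


react_msgs = build_react_messages("Who wrote the ReAct paper?", date(2025, 1, 1).isoformat())
extract_msgs = build_extractor_messages("page {with} braces", "find the authors")
print(react_msgs[0]["content"].splitlines()[-1])
```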