如何通过迭代精炼来总结文本
大型语言模型可以从文本中总结和提炼所需 的信息,包括大量文本。在许多情况下,特别是当文本量相对于模型的上下文窗口大小较大时,将总结任务拆分为更小的组件可能是有帮助的(或必要的)。
迭代精炼代表了一种总结长文本的策略。该策略如下:
- 将文本拆分为更小的文档;
- 总结第一个文档;
- 根据下一个文档精炼或更新结果;
- 在文档序列中重复,直到完成。
请注意,这种策略不是并行化的。当对子文档的理解依赖于先前的上下文时,这种策略尤其有效——例如,在总结一本小说或具有内在顺序的文本时。
LangGraph,基于langchain-core
构建,非常适合这个问题:
- LangGraph允许单个步骤(例如连续的总结)进行流式处理,从而提供更大的执行控制;
- LangGraph的检查点支持错误恢复,扩展人机协作工作流,并更容易融入对话应用程序。
- 由于它是由模块化组件组装而成,因此也很容易扩展或修改(例如,融入工具调用或其他行为)。
下面,我们演示如何通过迭代优化来总结文本。
加载聊天模型
让我们首先加载一个聊天模型:
- OpenAI
- Anthropic
- Azure
- Cohere
- NVIDIA
- FireworksAI
- Groq
- MistralAI
- TogetherAI
pip install -qU langchain-openai
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
pip install -qU langchain-anthropic
import getpass
import os
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
pip install -qU langchain-openai
import getpass
import os
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
pip install -qU langchain-google-vertexai
import getpass
import os
os.environ["GOOGLE_API_KEY"] = getpass.getpass()
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="gemini-1.5-flash")
pip install -qU langchain-cohere
import getpass
import os
os.environ["COHERE_API_KEY"] = getpass.getpass()
from langchain_cohere import ChatCohere
llm = ChatCohere(model="command-r-plus")
pip install -qU langchain-nvidia-ai-endpoints
import getpass
import os
os.environ["NVIDIA_API_KEY"] = getpass.getpass()
from langchain import ChatNVIDIA
llm = ChatNVIDIA(model="meta/llama3-70b-instruct")
pip install -qU langchain-fireworks
import getpass
import os
os.environ["FIREWORKS_API_KEY"] = getpass.getpass()
from langchain_fireworks import ChatFireworks
llm = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-70b-instruct")
pip install -qU langchain-groq
import getpass
import os
os.environ["GROQ_API_KEY"] = getpass.getpass()
from langchain_groq import ChatGroq
llm = ChatGroq(model="llama3-8b-8192")
pip install -qU langchain-mistralai
import getpass
import os
os.environ["MISTRAL_API_KEY"] = getpass.getpass()
from langchain_mistralai import ChatMistralAI
llm = ChatMistralAI(model="mistral-large-latest")
pip install -qU langchain-openai
import getpass
import os
os.environ["TOGETHER_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="https://api.together.xyz/v1",
api_key=os.environ["TOGETHER_API_KEY"],
model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
加载文档
接下来,我们需要一些文档来进行总结。下面,我们生成一些玩具文档以作说明。请参阅文档加载器使用手册和集成页面以获取其他数据来源。总结教程还包括一个总结博客文章的示例。
<!--IMPORTS:[{"imported": "Document", "source": "langchain_core.documents", "docs": "https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html", "title": "How to summarize text through iterative refinement"}]-->
from langchain_core.documents import Document
documents = [
Document(page_content="Apples are red", metadata={"title": "apple_book"}),
Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
Document(page_content="Bananas are yelow", metadata={"title": "banana_book"}),
]
创建图
下面我们展示了这个过程的LangGraph实现:
- 我们生成一个简单的链,用于初始摘要,提取第一个文档,将其格式化为提示词并使用我们的LLM进行推理。
- 我们生成第二个
refine_summary_chain
,对每个后续文档进行操作,细化初始摘要。
我们需要安装langgraph
:
pip install -qU langgraph
<!--IMPORTS:[{"imported": "StrOutputParser", "source": "langchain_core.output_parsers", "docs": "https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html", "title": "How to summarize text through iterative refinement"}, {"imported": "ChatPromptTemplate", "source": "langchain_core.prompts", "docs": "https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html", "title": "How to summarize text through iterative refinement"}, {"imported": "RunnableConfig", "source": "langchain_core.runnables", "docs": "https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.config.RunnableConfig.html", "title": "How to summarize text through iterative refinement"}]-->
import operator
from typing import List, Literal, TypedDict
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph
# Initial summary
summarize_prompt = ChatPromptTemplate(
[
("human", "Write a concise summary of the following: {context}"),
]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()
# Refining the summary with new docs
refine_template = """
Produce a final summary.
Existing summary up to this point:
{existing_answer}
New context:
------------
{context}
------------
Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()
# We will define the state of the graph to hold the document
# contents and summary. We also include an index to keep track
# of our position in the sequence of documents.
class State(TypedDict):
contents: List[str]
index: int
summary: str
# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
summary = await initial_summary_chain.ainvoke(
state["contents"][0],
config,
)
return {"summary": summary, "index": 1}
# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
content = state["contents"][state["index"]]
summary = await refine_summary_chain.ainvoke(
{"existing_answer": state["summary"], "context": content},
config,
)
return {"summary": summary, "index": state["index"] + 1}
# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
if state["index"] >= len(state["contents"]):
return END
else:
return "refine_summary"
graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)
graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()
LangGraph允许绘制图结构,以帮助可视化其功能:
from IPython.display import Image
Image(app.get_graph().draw_mermaid_png())
调用图
我们可以按如下方式逐步执行,打印出细化后的摘要:
async for step in app.astream(
{"contents": [doc.page_content for doc in documents]},
stream_mode="values",
):
if summary := step.get("summary"):
print(summary)
Apples are characterized by their red color.
Apples are characterized by their red color, while blueberries are known for their blue hue.
Apples are characterized by their red color, blueberries are known for their blue hue, and bananas are recognized for their yellow color.
最终的step
包含从整个文档集综合得出的摘要。
下一步
查看摘要 使用手册 以获取更多摘要策略,包括针对较大文本量的策略。
有关摘要的更多详细信息,请参见 本教程。
有关使用 LangGraph 构建的详细信息,请参见 LangGraph 文档。