How to retrieve using multiple vectors per document
It can often be useful to store multiple vectors per document. This is beneficial in a number of use cases. For example, we can embed multiple chunks of a document and associate those embeddings with the parent document, allowing retriever hits on the chunks to return the larger document.
LangChain implements a base MultiVectorRetriever, which simplifies this process. Much of the complexity lies in how to create multiple vectors per document. This notebook covers some of the common ways to create those vectors and how to use the MultiVectorRetriever.
The methods to create multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks and embed those (this is what the ParentDocumentRetriever does).
- Summary: create a summary for each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those along with (or instead of) the document.
Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control. A short sketch of this appears at the end of this guide.
Below we walk through an example. First, we instantiate some documents. We will index them in an (in-memory) Chroma vector store using OpenAI embeddings, but any LangChain vector store or embeddings model will do.
%pip install --upgrade --quiet langchain-chroma langchain langchain-openai > /dev/null
<!--IMPORTS:[{"imported": "InMemoryByteStore", "source": "langchain.storage", "docs": "https://python.langchain.com/api_reference/core/stores/langchain_core.stores.InMemoryByteStore.html", "title": "How to retrieve using multiple vectors per document"}, {"imported": "Chroma", "source": "langchain_chroma", "docs": "https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Chroma.html", "title": "How to retrieve using multiple vectors per document"}, {"imported": "TextLoader", "source": "langchain_community.document_loaders", "docs": "https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.text.TextLoader.html", "title": "How to retrieve using multiple vectors per document"}, {"imported": "OpenAIEmbeddings", "source": "langchain_openai", "docs": "https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html", "title": "How to retrieve using multiple vectors per document"}, {"imported": "RecursiveCharacterTextSplitter", "source": "langchain_text_splitters", "docs": "https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html", "title": "How to retrieve using multiple vectors per document"}]-->
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
Smaller chunks
Often it can be useful to retrieve larger chunks of information but embed smaller chunks. This allows the embeddings to capture semantic meaning as closely as possible, while passing as much context as possible downstream. Note that this is exactly what the ParentDocumentRetriever does. Here we show what is going on under the hood.
We will make a distinction between the vector store, which indexes embeddings of the (sub-)documents, and the document store, which houses the "parent" documents and associates them with an identifier.
<!--IMPORTS:[{"imported": "MultiVectorRetriever", "source": "langchain.retrievers.multi_vector", "docs": "https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.multi_vector.MultiVectorRetriever.html", "title": "How to retrieve using multiple vectors per document"}]-->
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
Next, we generate the "child" documents by splitting the original documents. Note that we store the document identifier in the metadata of the corresponding Document object.
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
Finally, we index the documents in the vector store and the document store:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
The vector store alone will retrieve the small chunks:
retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})
Whereas the retriever will return the larger parent document:
len(retriever.invoke("justice breyer")[0].page_content)
9875
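Under the hood, the retriever roughly does the following: it searches the vector store for matching child chunks, collects the parent identifiers from their metadata, and then fetches the full parent documents from the document store. Here is a simplified sketch of that flow (the actual MultiVectorRetriever implementation additionally handles search kwargs, deduplication details, and missing keys):
# Roughly what retriever.invoke("justice breyer") does internally (simplified)
query = "justice breyer"
matched_chunks = retriever.vectorstore.similarity_search(query)
# Collect parent ids from the chunk metadata, deduplicating while preserving order
parent_ids = []
for chunk in matched_chunks:
    if chunk.metadata[id_key] not in parent_ids:
        parent_ids.append(chunk.metadata[id_key])
# Fetch the full parent documents from the document store
parent_docs = retriever.docstore.mget(parent_ids)
len(parent_docs[0].page_content)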
The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via Max Marginal Relevance, which can be controlled via the retriever's search_type parameter:
<!--IMPORTS:[{"imported": "SearchType", "source": "langchain.retrievers.multi_vector", "docs": "https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.multi_vector.SearchType.html", "title": "How to retrieve using multiple vectors per document"}]-->
from langchain.retrievers.multi_vector import SearchType
retriever.search_type = SearchType.mmr
len(retriever.invoke("justice breyer")[0].page_content)
9875
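As noted at the top of this guide, decoupling the vector store from the document store also lets you add embeddings manually, for example for queries that you explicitly want to route to a given parent document. A minimal sketch (the query text below is made up purely for illustration):
from langchain_core.documents import Document

# A hypothetical query that we explicitly want to map to the first parent document
extra_query = "an example query that should retrieve the first parent document"
retriever.vectorstore.add_documents(
    [Document(page_content=extra_query, metadata={id_key: doc_ids[0]})]
)
# Vector store hits on this query will now resolve to docs[0] via its doc_id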