Activeloop Deep Memory

Activeloop Deep Memory is a suite of tools that helps you optimize your vector store for your use case and achieve higher accuracy in your LLM applications.
Retrieval-Augmented Generation (RAG) has recently received a lot of attention, and advanced RAG techniques and agents have expanded what RAG systems can accomplish. However, several challenges can limit the integration of RAG into production. The main factors to consider when deploying RAG in production are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model combined with a naive similarity search can produce satisfactory results. For higher accuracy or search recall, however, advanced retrieval techniques may be required. These methods can involve varying chunk sizes, rewriting queries multiple times, and so on, which may increase latency and cost. Activeloop's Deep Memory, a feature available to Activeloop Deep Lake users, addresses these issues by introducing a tiny neural network layer trained to match user queries with relevant data from the corpus. While this add-on introduces minimal latency at search time, it can boost retrieval accuracy by up to 27%, and it remains cost-effective and simple to use, without requiring any additional advanced RAG techniques.
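Conceptually, the trained layer can be thought of as a small transformation applied to the query embedding before similarity search, while the corpus embeddings stay fixed. The following pure-Python sketch uses toy 3-dimensional embeddings and hypothetical weights (not the actual Deep Memory model) purely to illustrate the idea:

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def transform(query, weights):
    """Apply a tiny linear layer to the query embedding.
    `weights` is a hypothetical trained matrix; Deep Memory learns
    such a query-to-corpus mapping from (query, relevant chunk) pairs."""
    return [sum(w * q for w, q in zip(row, query)) for row in weights]


# Toy corpus embeddings (stand-ins for embedded doc chunks).
corpus = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
}

query = [0.9, 0.1, 0.0]
# Hypothetical weights "trained" to pull this query toward chunk_b.
weights = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

plain = max(corpus, key=lambda k: cosine(query, corpus[k]))
tuned = max(corpus, key=lambda k: cosine(transform(query, weights), corpus[k]))
print(plain, tuned)  # → chunk_a chunk_b: the transform changes which chunk ranks first
```

Because only the query side is transformed, the corpus does not need to be re-embedded, which is why the feature stays cheap at search time.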
In this tutorial, we will parse the DeepLake documentation and build a RAG system that can answer questions based on it.
1. Dataset Creation
We will parse Activeloop's documentation using the BeautifulSoup library and LangChain's document parsers, such as Html2TextTransformer and AsyncHtmlLoader. To do so, we need to install the following libraries:
```python
%pip install --upgrade --quiet tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
```
Additionally, you will need to create an Activeloop account.

```python
ORG_ID = "..."
```
```python
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")

# activeloop token is needed if you are not signed in using CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass(
        "Enter your ActiveLoop API token: "
    )  # Get your API token from https://app.activeloop.ai, click on your profile picture in the top right corner, and select "API Tokens"

token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()
```
```python
db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",  # org_id stands for your username or organization from activeloop
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    # overwrite=True,  # use the overwrite flag if you want to overwrite the full dataset
    read_only=False,
)
```
Parse all of the links on the page using BeautifulSoup:
```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []

    soup = BeautifulSoup(response.content, "html.parser")

    # Finding all 'a' tags which typically contain href attribute for links
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]

    return links


base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
```
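Note that get_all_links returns every anchor on the page, which can include duplicates, in-page anchors, and links to external sites. Before loading, it can be worth deduplicating the list and restricting it to the docs host itself. The helper below (a hypothetical filter_links, not part of the tutorial's code) sketches one way to do that:

```python
from urllib.parse import urlparse


def filter_links(links, base_url):
    """Keep only unique links on the same host as base_url,
    dropping #fragments; preserves first-seen order."""
    base_host = urlparse(base_url).netloc
    seen = set()
    filtered = []
    for link in links:
        # Strip the fragment so page.html and page.html#section dedupe together.
        url = link.split("#", 1)[0]
        if urlparse(url).netloc != base_host:
            continue  # skip external links
        if url in seen:
            continue  # skip duplicates
        seen.add(url)
        filtered.append(url)
    return filtered


links = [
    "https://docs.deeplake.ai/en/latest/",
    "https://docs.deeplake.ai/en/latest/#section",
    "https://docs.deeplake.ai/en/latest/api.html",
    "https://github.com/activeloopai/deeplake",
]
print(filter_links(links, "https://docs.deeplake.ai/en/latest/"))
# → ['https://docs.deeplake.ai/en/latest/', 'https://docs.deeplake.ai/en/latest/api.html']
```

Applying such a filter to all_links before loading would give the loader a smaller, cleaner list of pages to fetch.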
Load the data:
```python
from langchain_community.document_loaders.async_html import AsyncHtmlLoader

loader = AsyncHtmlLoader(all_links)
docs = loader.load()
```
Convert the data into a human-readable format:
```python
from langchain_community.document_transformers import Html2TextTransformer

html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
```
Now, let's chunk the documents further, as some of them contain too much text:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 4096
docs_new = []

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        docs = text_splitter.create_documents([doc.page_content])
        docs_new.extend(docs)
```
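The loop above keeps short documents intact and only splits the long ones. Stripped of the LangChain types, the routing logic amounts to the following sketch, which uses a naive fixed-width splitter purely for illustration (RecursiveCharacterTextSplitter instead prefers to break on separators such as paragraphs and sentences):

```python
def naive_split(text, chunk_size):
    """Fixed-width split; a stand-in for RecursiveCharacterTextSplitter,
    which breaks on paragraph/sentence boundaries where possible."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_documents(texts, chunk_size):
    chunks = []
    for text in texts:
        if len(text) < chunk_size:
            chunks.append(text)  # short doc: keep whole
        else:
            chunks.extend(naive_split(text, chunk_size))  # long doc: split
    return chunks


print(chunk_documents(["short", "x" * 10], chunk_size=8))
# → ['short', 'xxxxxxxx', 'xx']
```

Keeping short pages whole avoids fragmenting documents that already fit comfortably within the embedding model's context.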
Populate the vector store:
```python
docs = db.add_documents(docs_new)
```