scikit-learn
scikit-learn is an open-source collection of machine learning algorithms, including some implementations of the k nearest neighbors algorithm. SKLearnVectorStore wraps this implementation and adds the possibility of persisting the vector store in json, bson (binary json) or Apache Parquet format.
This notebook shows how to use the SKLearnVectorStore vector database.
You need to install langchain-community with pip install -qU langchain-community to use this integration.
%pip install --upgrade --quiet scikit-learn
# if you plan to use bson serialization, install also:
%pip install --upgrade --quiet bson
# if you plan to use parquet serialization, install also:
%pip install --upgrade --quiet pandas pyarrow
To use OpenAI embeddings, you will need an OpenAI key. You can get one at https://platform.openai.com/account/api-keys, or feel free to use any other embeddings.
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI key:")
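As a quick preview of the persistence formats mentioned above, the on-disk format is chosen through the serializer argument when the store is created. The following is a minimal sketch, not part of the original notebook: it assumes the OpenAI key from the previous step, and the file path and sample text are illustrative only.
# Minimal sketch (illustrative): select the serialization format via `serializer`.
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_openai import OpenAIEmbeddings

demo_store = SKLearnVectorStore.from_texts(
    texts=["hello vector store"],        # illustrative sample text
    embedding=OpenAIEmbeddings(),
    persist_path="./sklearn_demo.bson",  # hypothetical path; persist() writes here
    serializer="bson",                   # "json" (default), "bson", or "parquet"
)
demo_store.persist()  # write the store to persist_path
If persist_path is omitted, the store is kept purely in memory and nothing is written to disk.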
Basic usage
Load a sample document corpus
<!--IMPORTS:[{"imported": "TextLoader", "source": "langchain_community.document_loaders", "docs": "https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.text.TextLoader.html", "title": "scikit-learn"}, {"imported": "SKLearnVectorStore", "source": "langchain_community.vectorstores", "docs": "https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.sklearn.SKLearnVectorStore.html", "title": "scikit-learn"}, {"imported": "OpenAIEmbeddings", "source": "langchain_openai", "docs": "https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html", "title": "scikit-learn"}, {"imported": "CharacterTextSplitter", "source": "langchain_text_splitters", "docs": "https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html", "title": "scikit-learn"}]-->
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
# load the example document, split it into 1000-character chunks, and initialize the embedding model
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
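With the documents split and the embedding model ready, the store itself can be built from the chunks and queried. The following is a minimal sketch of that next step; the query string is illustrative.
# Minimal sketch (illustrative): build the vector store from the split docs and query it.
vector_store = SKLearnVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query, k=4)  # return the 4 most similar chunks
print(results[0].page_content)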