Model caches
This notebook covers how to cache results of individual LLM calls using different caches.
First, let's install some dependencies:
%pip install -qU langchain-openai langchain-community
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass()
<!--IMPORTS:[{"imported": "set_llm_cache", "source": "langchain.globals", "docs": "https://python.langchain.com/api_reference/langchain/globals/langchain.globals.set_llm_cache.html", "title": "Model caches"}, {"imported": "OpenAI", "source": "langchain_openai", "docs": "https://python.langchain.com/api_reference/openai/llms/langchain_openai.llms.base.OpenAI.html", "title": "Model caches"}]-->
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
# To make the caching really obvious, let's use a slower and older model.
# Caching supports newer chat models as well.
llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)
In Memory Cache
<!--IMPORTS:[{"imported": "InMemoryCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.InMemoryCache.html", "title": "Model caches"}]-->
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache())
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.57 ms, sys: 8.22 ms, total: 15.8 ms
Wall time: 649 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 551 µs, sys: 221 µs, total: 772 µs
Wall time: 1.23 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
SQLite Cache
!rm .langchain.db
<!--IMPORTS:[{"imported": "SQLiteCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.SQLiteCache.html", "title": "Model caches"}]-->
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 12.6 ms, sys: 3.51 ms, total: 16.1 ms
Wall time: 486 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.6 ms, sys: 57.7 ms, total: 110 ms
Wall time: 113 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
Upstash Redis Cache
Standard Cache
Use Upstash Redis to cache prompts and responses with a serverless HTTP API.
%pip install -qU upstash_redis
<!--IMPORTS:[{"imported": "UpstashRedisCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.UpstashRedisCache.html", "title": "Model caches"}]-->
from langchain_community.cache import UpstashRedisCache
from upstash_redis import Redis
URL = "<UPSTASH_REDIS_REST_URL>"
TOKEN = "<UPSTASH_REDIS_REST_TOKEN>"
set_llm_cache(UpstashRedisCache(redis_=Redis(url=URL, token=TOKEN)))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.56 ms, sys: 2.98 ms, total: 10.5 ms
Wall time: 1.14 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 2.78 ms, sys: 1.95 ms, total: 4.73 ms
Wall time: 82.9 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
Semantic Cache
Use Upstash Vector to do a semantic similarity search and cache the most similar response in the database. The vectorization is automatically done by the embedding model selected while creating the Upstash Vector database.
%pip install upstash-semantic-cache
<!--IMPORTS:[{"imported": "set_llm_cache", "source": "langchain.globals", "docs": "https://python.langchain.com/api_reference/langchain/globals/langchain.globals.set_llm_cache.html", "title": "Model caches"}]-->
from langchain.globals import set_llm_cache
from upstash_semantic_cache import SemanticCache
UPSTASH_VECTOR_REST_URL = "<UPSTASH_VECTOR_REST_URL>"
UPSTASH_VECTOR_REST_TOKEN = "<UPSTASH_VECTOR_REST_TOKEN>"
cache = SemanticCache(
    url=UPSTASH_VECTOR_REST_URL, token=UPSTASH_VECTOR_REST_TOKEN, min_proximity=0.7
)
set_llm_cache(cache)
%%time
llm.invoke("Which city is the most crowded city in the USA?")
CPU times: user 28.4 ms, sys: 3.93 ms, total: 32.3 ms
Wall time: 1.89 s
'\n\nNew York City is the most crowded city in the USA.'
%%time
llm.invoke("Which city has the highest population in the USA?")
CPU times: user 3.22 ms, sys: 940 μs, total: 4.16 ms
Wall time: 97.7 ms
'\n\nNew York City is the most crowded city in the USA.'
Redis Cache
See the main Redis cache docs for more detailed information.
Standard Cache
Use Redis to cache prompts and responses.
%pip install -qU redis
<!--IMPORTS:[{"imported": "RedisCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.RedisCache.html", "title": "Model caches"}]-->
# We can do the same thing with a Redis cache
# (make sure your local Redis instance is running first before running this example)
from langchain_community.cache import RedisCache
from redis import Redis
set_llm_cache(RedisCache(redis_=Redis()))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 6.88 ms, sys: 8.75 ms, total: 15.6 ms
Wall time: 1.04 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 1.59 ms, sys: 610 µs, total: 2.2 ms
Wall time: 5.58 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
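If your Redis instance is not running locally, you can build the client from a connection URL instead (a sketch; substitute your own host and credentials):
# a sketch: point RedisCache at a remote Redis instance via a connection URL
from redis import Redis
from langchain_community.cache import RedisCache
set_llm_cache(RedisCache(redis_=Redis.from_url("redis://:<PASSWORD>@<REDIS_HOST>:6379/0")))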
Semantic Cache
Use Redis to cache prompts and responses and evaluate hits based on semantic similarity.
%pip install -qU redis
<!--IMPORTS:[{"imported": "RedisSemanticCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.RedisSemanticCache.html", "title": "Model caches"}, {"imported": "OpenAIEmbeddings", "source": "langchain_openai", "docs": "https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html", "title": "Model caches"}]-->
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings
set_llm_cache(
    RedisSemanticCache(redis_url="redis://localhost:6379", embedding=OpenAIEmbeddings())
)
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 351 ms, sys: 156 ms, total: 507 ms
Wall time: 3.37 s
"\n\nWhy don't scientists trust atoms?\nBecause they make up everything."
%%time
# The second time, while not a direct hit, the question is semantically similar to the original question,
# so it uses the cached result!
llm.invoke("Tell me one joke")
CPU times: user 6.25 ms, sys: 2.72 ms, total: 8.97 ms
Wall time: 262 ms
"\n\nWhy don't scientists trust atoms?\nBecause they make up everything."
GPTCache
We can use GPTCache for exact match caching or to cache results based on semantic similarity.
Let's first start with an example of exact match.
%pip install -qU gptcache
<!--IMPORTS:[{"imported": "GPTCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.GPTCache.html", "title": "Model caches"}]-->
import hashlib
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from langchain_community.cache import GPTCache
def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()
def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )
set_llm_cache(GPTCache(init_gptcache))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 21.5 ms, sys: 21.3 ms, total: 42.8 ms
Wall time: 6.2 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 571 µs, sys: 43 µs, total: 614 µs
Wall time: 635 µs
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
Let's now show an example of similarity caching.
<!--IMPORTS:[{"imported": "GPTCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.GPTCache.html", "title": "Model caches"}]-->
import hashlib
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain_community.cache import GPTCache
def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()
def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")
set_llm_cache(GPTCache(init_gptcache))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 1.42 s, sys: 279 ms, total: 1.7 s
Wall time: 8.44 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
%%time
# This is an exact match, so it finds it in the cache
llm.invoke("Tell me a joke")
CPU times: user 866 ms, sys: 20 ms, total: 886 ms
Wall time: 226 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
%%time
# This is not an exact match, but semantically within distance so it hits!
llm.invoke("Tell me joke")
CPU times: user 853 ms, sys: 14.8 ms, total: 868 ms
Wall time: 224 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
MongoDB Atlas Cache
MongoDB Atlas is a fully-managed cloud database available on AWS, Azure, and GCP. It has native support for Vector Search on MongoDB document data. Use MongoDB Atlas Vector Search to semantically cache prompts and responses.
MongoDBCache
An abstraction to store a simple cache in MongoDB. This does not use semantic caching, nor does it require an index to be made on the collection before generation.
To import this cache, first install the required dependencies:
%pip install -qU langchain-mongodb
<!--IMPORTS:[{"imported": "MongoDBCache", "source": "langchain_mongodb.cache", "docs": "https://python.langchain.com/api_reference/mongodb/cache/langchain_mongodb.cache.MongoDBCache.html", "title": "Model caches"}]-->
from langchain_mongodb.cache import MongoDBCache
To use this cache with your LLMs:
<!--IMPORTS:[{"imported": "set_llm_cache", "source": "langchain_core.globals", "docs": "https://python.langchain.com/api_reference/core/globals/langchain_core.globals.set_llm_cache.html", "title": "Model caches"}]-->
from langchain_core.globals import set_llm_cache
mongodb_atlas_uri = "<YOUR_CONNECTION_STRING>"
COLLECTION_NAME = "<YOUR_CACHE_COLLECTION_NAME>"
DATABASE_NAME = "<YOUR_DATABASE_NAME>"
set_llm_cache(MongoDBCache(
    connection_string=mongodb_atlas_uri,
    collection_name=COLLECTION_NAME,
    database_name=DATABASE_NAME,
))
MongoDBAtlasSemanticCache
Semantic caching allows users to retrieve cached prompts based on semantic similarity between the user input and previously cached results. Under the hood it blends MongoDB Atlas as both a cache and a vector store.
The MongoDBAtlasSemanticCache inherits from MongoDBAtlasVectorSearch and needs an Atlas Vector Search index defined to work. Please look at the usage example on how to set up the index.
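For reference, the vector search index on the cache collection usually looks roughly like the definition below (a sketch: the path must match the collection's embedding field and numDimensions must match the dimensionality of your embedding model):
# a sketch of an Atlas Vector Search index definition, written as a Python dict
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",  # assumed embedding field name
            "numDimensions": 1536,  # must match your embedding model
            "similarity": "cosine",
        }
    ]
}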
To import this cache:
<!--IMPORTS:[{"imported": "MongoDBAtlasSemanticCache", "source": "langchain_mongodb.cache", "docs": "https://python.langchain.com/api_reference/mongodb/cache/langchain_mongodb.cache.MongoDBAtlasSemanticCache.html", "title": "Model caches"}]-->
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
To use this cache with your LLMs:
<!--IMPORTS:[{"imported": "set_llm_cache", "source": "langchain_core.globals", "docs": "https://python.langchain.com/api_reference/core/globals/langchain_core.globals.set_llm_cache.html", "title": "Model caches"}]-->
from langchain_core.globals import set_llm_cache
# use any embedding provider; FakeEmbeddings (from LangChain's test suite) is used here for illustration only
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
mongodb_atlas_uri = "<YOUR_CONNECTION_STRING>"
COLLECTION_NAME = "<YOUR_CACHE_COLLECTION_NAME>"
DATABASE_NAME = "<YOUR_DATABASE_NAME>"
set_llm_cache(MongoDBAtlasSemanticCache(
    embedding=FakeEmbeddings(),
    connection_string=mongodb_atlas_uri,
    collection_name=COLLECTION_NAME,
    database_name=DATABASE_NAME,
))
To find more resources about using MongoDBSemanticCache, visit here.
Momento Cache
Use Momento to cache prompts and responses.
This requires the momento package; install it with:
%pip install -qU momento
You'll need to get a Momento auth token to use this class. It can either be passed as the named parameter auth_token to momento.CacheClient if you'd like to instantiate that directly, or set as the environment variable MOMENTO_AUTH_TOKEN.
<!--IMPORTS:[{"imported": "MomentoCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.MomentoCache.html", "title": "Model caches"}]-->
from datetime import timedelta
from langchain_community.cache import MomentoCache
cache_name = "langchain"
ttl = timedelta(days=1)
set_llm_cache(MomentoCache.from_client_params(cache_name, ttl))
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 40.7 ms, sys: 16.5 ms, total: 57.2 ms
Wall time: 1.73 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
%%time
# The second time it is, so it goes faster
# When run in the same region as the cache, latencies are single digit ms
llm.invoke("Tell me a joke")
CPU times: user 3.16 ms, sys: 2.98 ms, total: 6.14 ms
Wall time: 57.9 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
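If you would rather construct the Momento client yourself instead of using from_client_params, the setup looks roughly like this (a sketch, assuming the v1 Python SDK, which takes a CredentialProvider rather than a raw auth_token; names may differ in your SDK version):
# a sketch: build the Momento CacheClient explicitly and hand it to MomentoCache
from datetime import timedelta
from momento import CacheClient, Configurations, CredentialProvider
from langchain_community.cache import MomentoCache
client = CacheClient(
    Configurations.Laptop.v1(),
    CredentialProvider.from_environment_variable("MOMENTO_AUTH_TOKEN"),
    default_ttl=timedelta(days=1),
)
set_llm_cache(MomentoCache(client, cache_name="langchain"))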
SQLAlchemy Cache
You can use SQLAlchemyCache to cache with any SQL database supported by SQLAlchemy.
<!--IMPORTS:[{"imported": "SQLAlchemyCache", "source": "langchain.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.SQLAlchemyCache.html", "title": "Model caches"}]-->
# from langchain.cache import SQLAlchemyCache
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
# set_llm_cache(SQLAlchemyCache(engine))
Custom SQLAlchemy Schemas
<!--IMPORTS:[{"imported": "SQLAlchemyCache", "source": "langchain_community.cache", "docs": "https://python.langchain.com/api_reference/community/cache/langchain_community.cache.SQLAlchemyCache.html", "title": "Model caches"}]-->
# You can define your own declarative SQLAlchemyCache child class to customize the schema used for caching. For example, to support high-speed fulltext prompt indexing with Postgres, use:
from langchain_community.cache import SQLAlchemyCache
from sqlalchemy import Column, Computed, Index, Integer, Sequence, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy_utils import TSVectorType
Base = declarative_base()
class FulltextLLMCache(Base):  # type: ignore
    """Postgres table for fulltext-indexed LLM Cache"""
    __tablename__ = "llm_cache_fulltext"
    id = Column(Integer, Sequence("cache_id"), primary_key=True)
    prompt = Column(String, nullable=False)
    llm = Column(String, nullable=False)
    idx = Column(Integer)
    response = Column(String)
    prompt_tsv = Column(
        TSVectorType(),
        Computed("to_tsvector('english', llm || ' ' || prompt)", persisted=True),
    )
    __table_args__ = (
        Index("idx_fulltext_prompt_tsv", prompt_tsv, postgresql_using="gin"),
    )
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
set_llm_cache(SQLAlchemyCache(engine, FulltextLLMCache))
Cassandra Cache
Apache Cassandra® is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with vector search capabilities.
You can use Cassandra to cache LLM responses, choosing either the exact-match CassandraCache or the vector-similarity-based CassandraSemanticCache.
Let's see both in action. The next cells guide you through the (little) required setup, and the following cells showcase the two available cache classes.
Required dependency
%pip install -qU "cassio>=0.1.4"
Connect to the database
The Cassandra caches shown in this page can be used with Cassandra as well as other derived databases, such as Astra DB, which use the CQL (Cassandra Query Language) protocol.
DataStax Astra DB is a managed serverless database built on Cassandra, offering the same interface and strengths.
Depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when instantiating the cache (through initialization of a CassIO connection).
Connecting to a Cassandra cluster
You first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. with network settings and authentication), but this might be something like:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
You can now set the session, along with your desired keyspace name, as a global CassIO parameter:
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
CASSANDRA_KEYSPACE = demo_keyspace