HTML

This page covers how to load HTML documents into a document format that can be used downstream.

from langchain.document_loaders import UnstructuredHTMLLoader

# Point the loader at the HTML file to parse.
loader = UnstructuredHTMLLoader("example_data/fake-content")

# load() returns a list of Document objects.
data = loader.load()

data

[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content'}, lookup_index=0)]
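The contents of the sample file are not shown on this page. To try the loader locally, a file along the following lines could be created first. This is only a guess reconstructed from the outputs shown here (a title, one heading, one paragraph); `SAMPLE_HTML` and `write_sample` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Assumed file contents, inferred from the loader outputs on this page.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample(path: str = "example_data/fake-content") -> Path:
    """Write the assumed sample HTML to `path`, creating parent folders."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(SAMPLE_HTML, encoding="utf-8")
    return target


sample_path = write_sample()
```

The path matches the one passed to the loader above, so the loader call can be run unchanged once the file exists.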
 

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents with the BSHTMLLoader. It extracts the text from the HTML into page_content, and stores the page title in metadata under the key title.

from langchain.document_loaders import BSHTMLLoader

# Same file as before, parsed with BeautifulSoup4 this time.
loader = BSHTMLLoader("example_data/fake-content")
data = loader.load()
data

[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', lookup_str='', metadata={'source': 'example_data/fake-content', 'title': 'Test Title'}, lookup_index=0)]
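To make the split between page_content and metadata more concrete, here is a minimal sketch of the same idea using only Python's standard-library `html.parser`. It is not how BSHTMLLoader is implemented, and `SimpleDocument`, `_TextAndTitleParser`, and `html_to_document` are hypothetical names; the sketch only illustrates collecting all text into page_content and the `<title>` text into metadata.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document object."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextAndTitleParser(HTMLParser):
    """Collects every text node, and separately the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Every text node goes into the page text, including the title.
        self.text_parts.append(data)
        if self._in_title:
            self.title_parts.append(data)


def html_to_document(html: str, source: str) -> SimpleDocument:
    """Build a SimpleDocument from an HTML string (illustrative only)."""
    parser = _TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return SimpleDocument(
        page_content="".join(parser.text_parts),
        metadata={
            "source": source,
            "title": "".join(parser.title_parts).strip(),
        },
    )


doc = html_to_document(
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>",
    source="example_data/fake-content",
)
print(doc.metadata["title"])
```

Unlike a real loader, this sketch does not skip script or style content and makes no attempt to normalize whitespace, so its page_content will differ from the BSHTMLLoader output shown above.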