How to load HTML

HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.

This guide covers how to load HTML documents into LangChain Document objects that we can use downstream.

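Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, both of which can be installed with pip. See the integrations page for integrations with other services, such as Azure AI Document Intelligence or FireCrawl.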

Loading HTML with Unstructured

%pip install unstructured

from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents with the BSHTMLLoader. This extracts the text from the HTML into page_content and the page title into metadata as title.

%pip install bs4

from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]
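
To make the page_content / metadata split more concrete, the sketch below reproduces the same idea using only Python's standard-library html.parser. It is an illustration under stated assumptions, not how BSHTMLLoader is implemented: the names TextAndTitleParser, html_to_record, and the "inline-sample" source label are hypothetical, the result is a plain dict rather than a LangChain Document, and whitespace handling is simpler than in the output shown above.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Hypothetical helper: collects text nodes and the <title> contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text inside <title> is recorded as the title and also kept in the body text.
        if self.in_title:
            self.title_parts.append(data)
        # Keep every non-empty text node (this sketch does not skip <script> or <style>).
        if data.strip():
            self.text_parts.append(data.strip())


def html_to_record(html, source):
    """Return a dict shaped like a Document: text plus source/title metadata."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.text_parts),
        "metadata": {
            "source": source,
            "title": "".join(parser.title_parts).strip(),
        },
    }


# Assumed sample markup, modeled on the title, heading, and paragraph in the output above.
sample_html = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
record = html_to_record(sample_html, source="inline-sample")
print(record["page_content"])
print(record["metadata"])
```

As in the loader output above, the title text appears both in the extracted body text and under the title key in the metadata.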
