URL


This covers how to load HTML documents from a list of URLs so that they can be used downstream.

from langchain.document_loaders import UnstructuredURLLoader
 
urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]
 
loader = UnstructuredURLLoader(urls=urls)
 
data = loader.load()
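
Before passing the loaded documents downstream, it can help to check what came back. The sketch below assumes each loaded item exposes `page_content` (a string) and `metadata` (a dict with a `source` key), as LangChain `Document` objects generally do. The `summarize` helper and the stand-in `sample` list are hypothetical names used only for illustration, so the example runs without network access.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars=80):
    """Return one (source, length, preview) tuple per loaded document."""
    rows = []
    for doc in docs:
        # Assumption: the loader records the originating URL under "source".
        source = doc.metadata.get("source", "<unknown>")
        text = doc.page_content
        # Collapse whitespace so the preview fits on a single line.
        preview = " ".join(text.split())[:preview_chars]
        rows.append((source, len(text), preview))
    return rows


# Stand-in for the `data` list returned by loader.load(), so the sketch runs offline.
sample = [
    SimpleNamespace(
        page_content="Example page text.\n\nSecond paragraph.",
        metadata={"source": "https://example.com/a"},
    )
]

for source, length, preview in summarize(sample):
    print(f"{source} ({length} chars): {preview}")
```

With real output, `summarize(data)` could be called in place of `summarize(sample)`, provided the assumption about `page_content` and `metadata` holds.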
 

Selenium URL Loader

This covers how to load HTML documents from a list of URLs using SeleniumURLLoader.

Using Selenium makes it possible to load pages that require JavaScript to render.
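
To see why a browser-backed loader is needed, consider what a purely static parse recovers from a page whose content is filled in by a script. The sketch below uses only the Python standard library. `JS_PAGE`, `VisibleTextParser`, and `static_text` are hypothetical names for illustration and are not part of LangChain or Selenium.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text a static parser can see, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text visible in the raw HTML, without executing any scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the main content is injected by JavaScript at runtime.
JS_PAGE = """
<html><body>
  <h1>Video</h1>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Title loaded by script";</script>
</body></html>
"""

print(static_text(JS_PAGE))
```

The static parse recovers only the heading, because the `div` stays empty until a browser executes the script. A browser driven by Selenium runs the script first, so the injected text is present when the page is read.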

Setup

To use SeleniumURLLoader, you need to install selenium and unstructured.
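
A quick way to confirm the setup is to check whether the packages can be imported before constructing the loader. This is a minimal sketch using the standard library. `missing_packages` is a hypothetical helper, and it assumes the import names match the package names mentioned above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumption: the import names are the same as the pip package names.
required = ["selenium", "unstructured"]
missing = missing_packages(required)
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```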

from langchain.document_loaders import SeleniumURLLoader
 
urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8"
]
 
loader = SeleniumURLLoader(urls=urls)
 
data = loader.load()
 

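Playwright URL Loader

This covers how to load HTML documents from a list of URLs using PlaywrightURLLoader.

As with Selenium, Playwright makes it possible to load pages that require JavaScript to render.

Setup

To use PlaywrightURLLoader, you need to install playwright and unstructured.

In addition, you need to install the Playwright Chromium browser: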

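# Install playwright and unstructured, then download the Playwright browsers.
# The leading "!" runs each line as a shell command from a notebook cell.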
!pip install "playwright"
!pip install "unstructured"
!playwright install
 
from langchain.document_loaders import PlaywrightURLLoader
 
urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8"
]
 
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
 
data = loader.load()
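
The `remove_selectors` argument names elements to drop from each page before its text is extracted, which keeps page chrome such as headers and footers out of the loaded documents. The sketch below illustrates that idea with the standard library only. It is not how PlaywrightURLLoader is implemented (the loader works against a live browser page and accepts CSS selectors); this simplified version handles bare tag names only, and `TagStrippingParser`, `text_without`, and `PAGE` are hypothetical names.

```python
from html.parser import HTMLParser


class TagStrippingParser(HTMLParser):
    """Collect text while skipping everything nested inside the given tags."""

    def __init__(self, remove_tags):
        super().__init__()
        self.remove_tags = set(remove_tags)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every removed element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_without(html, remove_tags):
    """Return the page text with the contents of `remove_tags` elements dropped."""
    parser = TagStrippingParser(remove_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page with navigation and a footer around the main content.
PAGE = """
<html><body>
  <header><a href="/">Home</a> | <a href="/about">About</a></header>
  <main><p>Main article text.</p></main>
  <footer>Copyright notice</footer>
</body></html>
"""

print(text_without(PAGE, ["header", "footer"]))  # main content only
print(text_without(PAGE, []))                    # header and footer text included
```

Comparing the two printed lines shows the effect: with `["header", "footer"]` only the article text remains, while an empty list leaves the navigation links and copyright notice mixed into the result.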