URLLoader

from svc_infra.loaders import URLLoader
Extends: BaseLoader

Load content from one or more URLs. Each URL is fetched over HTTP, and readable text is optionally extracted from HTML responses. Supports redirects, custom headers, and batch loading.

Args

urls: Single URL or list of URLs to load.
headers: Optional HTTP headers to send with requests.
extract_text: If True (default), extract readable text from HTML pages. Raw HTML is returned if False or if the content is not HTML.
follow_redirects: Follow HTTP redirects (default: True).
timeout: Request timeout in seconds (default: 30).
extra_metadata: Additional metadata to attach to all loaded content.
on_error: How to handle errors ("skip" or "raise"). Default: "skip".
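
With the default on_error="skip", URLs that fail to load are silently dropped from the results; on_error="raise" surfaces the first failure instead. A minimal sketch of strict loading under those semantics (the concrete exception type raised is not documented in this reference, so the broad except is an assumption):

from svc_infra.loaders import URLLoader

async def load_or_fail(urls: list[str]):
    # on_error="raise" aborts on the first URL that cannot be fetched;
    # timeout bounds each request at 10 seconds instead of the default 30.
    loader = URLLoader(urls, timeout=10.0, on_error="raise")
    try:
        return await loader.load()
    except Exception as exc:  # exact exception type not specified in this reference
        print(f"Loading failed: {exc}")
        raise

From async code, call await load_or_fail([...]); see the Methods section below for running it from a synchronous script.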

Example

>>> # Load single URL
>>> loader = URLLoader("https://example.com/docs/guide.md")
>>> contents = await loader.load()
>>> print(contents[0].content[:100])
>>>
>>> # Load multiple URLs
>>> loader = URLLoader([
...     "https://example.com/page1",
...     "https://example.com/page2",
... ])
>>> contents = await loader.load()
>>>
>>> # Disable HTML text extraction
>>> loader = URLLoader("https://example.com", extract_text=False)
>>> contents = await loader.load()  # Returns raw HTML
>>>
>>> # With custom headers (e.g., for APIs)
>>> loader = URLLoader(
...     "https://api.example.com/docs",
...     headers={"Authorization": "Bearer token123"},
... )
>>> contents = await loader.load()

Note

- HTML text extraction removes scripts, styles, nav, footer, etc.
- If BeautifulSoup is not installed, falls back to basic regex extraction
- Content type is detected from HTTP headers
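
The richer extraction path is only taken when BeautifulSoup can be imported. A quick way to check which path your environment will use (the PyPI package is beautifulsoup4, imported as bs4):

try:
    import bs4  # provided by the beautifulsoup4 package
    print("BeautifulSoup available: full HTML text extraction")
except ImportError:
    print("BeautifulSoup not installed: URLLoader falls back to basic regex extraction")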

Constructor
URLLoader(urls: str | list[str], headers: dict[str, str] | None = None, extract_text: bool = True, follow_redirects: bool = True, timeout: float = 30.0, extra_metadata: dict[str, Any] | None = None, on_error: ErrorStrategy = 'skip') -> None
Parameter          Type                    Default    Description
urls               str | list[str]         required   Single URL or list of URLs
headers            dict[str, str] | None   None       HTTP headers to send
extract_text       bool                    True       Extract text from HTML
follow_redirects   bool                    True       Follow redirects
timeout            float                   30.0       Request timeout in seconds
extra_metadata     dict[str, Any] | None   None       Additional metadata for all content
on_error           ErrorStrategy           'skip'     Error handling strategy
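
Putting the parameters together, a sketch of a fully configured loader; the URLs, header values, and metadata below are placeholders, not values from this reference:

from svc_infra.loaders import URLLoader

loader = URLLoader(
    ["https://example.com/a", "https://example.com/b"],   # batch of URLs
    headers={"User-Agent": "docs-crawler/1.0"},            # sent with every request
    extract_text=True,                                     # reduce HTML pages to readable text
    follow_redirects=True,
    timeout=15.0,                                          # per-request timeout in seconds
    extra_metadata={"source": "example-crawl"},            # attached to every loaded item
    on_error="skip",                                       # drop failing URLs instead of raising
)
contents = await loader.load()  # inside async code, as in the examples above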

Methods

load() (async): Fetch the configured URLs and return the loaded content.
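
load() is a coroutine, so calling it from synchronous code (a script or CLI entry point) requires an event loop. One standard way, using only the API shown above:

import asyncio

from svc_infra.loaders import URLLoader

def main() -> None:
    loader = URLLoader("https://example.com/docs/guide.md")
    # asyncio.run() starts an event loop, awaits load(), and shuts the loop down.
    contents = asyncio.run(loader.load())
    for item in contents:
        print(item.content[:100])  # .content is the attribute used in the examples above

if __name__ == "__main__":
    main()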