URLLoader

from svc_infra.loaders import URLLoader

svc_infra.loaders

Extends: BaseLoader

Loads content from one or more URLs, optionally extracting readable text from HTML pages. Supports redirects, custom headers, and batch loading.

Args

- urls: Single URL or list of URLs to load.
- headers: Optional HTTP headers to send with requests.
- extract_text: If True (default), extract readable text from HTML pages. If False, or if the content is not HTML, the raw content is returned.
- follow_redirects: Follow HTTP redirects (default: True).
- timeout: Request timeout in seconds (default: 30).
- extra_metadata: Additional metadata to attach to all loaded content.
- on_error: How to handle errors, either "skip" or "raise" (default: "skip").
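The `on_error` choice matters most in batch loads, where one failing URL can either be dropped or abort the whole call. The sketch below illustrates that difference with plain `asyncio`. It is not the library's implementation: `load_all` and `fake_fetch` are hypothetical names, and the `ErrorStrategy` alias is an assumed shape based only on the two values listed above.

```python
import asyncio
from typing import Awaitable, Callable, Literal, Union

# Assumed shape: the reference only names the two values "skip" and "raise".
ErrorStrategy = Literal["skip", "raise"]


async def load_all(
    urls: Union[str, list[str]],
    fetch: Callable[[str], Awaitable[str]],
    on_error: ErrorStrategy = "skip",
) -> list[str]:
    """Hypothetical helper: fetch each URL and apply the on_error policy."""
    # Accept a single URL or a list, mirroring the `urls` argument.
    url_list = [urls] if isinstance(urls, str) else list(urls)
    results: list[str] = []
    for url in url_list:
        try:
            results.append(await fetch(url))
        except Exception:
            if on_error == "raise":
                raise  # propagate the first failure to the caller
            # "skip": drop the failed URL and continue with the rest
    return results


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; fails for any URL containing 'bad'."""
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"content of {url}"
```

With "skip", a batch containing one unreachable URL returns results for the remaining URLs only, so the output list can be shorter than the input list. With "raise", the same batch raises on the first failure and returns nothing.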

Example

    >>> # Load single URL
    >>> loader = URLLoader("https://example.com/docs/guide.md")
    >>> contents = await loader.load()
    >>> print(contents[0].content[:100])
    >>>
    >>> # Load multiple URLs
    >>> loader = URLLoader([
    ...     "https://example.com/page1",
    ...     "https://example.com/page2",
    ... ])
    >>> contents = await loader.load()
    >>>
    >>> # Disable HTML text extraction
    >>> loader = URLLoader("https://example.com", extract_text=False)
    >>> contents = await loader.load()  # Returns raw HTML
    >>>
    >>> # With custom headers (e.g., for APIs)
    >>> loader = URLLoader(
    ...     "https://api.example.com/docs",
    ...     headers={"Authorization": "Bearer token123"},
    ... )
    >>> contents = await loader.load()

Note

- HTML text extraction removes scripts, styles, nav, footer, etc.
- If BeautifulSoup is not installed, falls back to basic regex extraction
- Content type is detected from HTTP headers
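To make the notes above concrete, here is a minimal standard-library sketch of what regex-based text extraction and header-based content-type detection could look like. It is an illustration under stated assumptions, not the loader's actual code: `is_html`, `extract_text_regex`, and the `_DROP_TAGS` list are hypothetical, and the set of removed tags is a guess at what "etc." covers.

```python
import html
import re
from email.message import Message

# Assumption: the note does not enumerate every removed tag, so this list is illustrative.
_DROP_TAGS = ("script", "style", "nav", "footer")


def is_html(content_type_header: str) -> bool:
    """Return True if a Content-Type header value denotes HTML."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases the type.
    return msg.get_content_type() in ("text/html", "application/xhtml+xml")


def extract_text_regex(markup: str) -> str:
    """Rough regex-only text extraction; not a substitute for a real HTML parser."""
    for tag in _DROP_TAGS:
        # Remove the element together with everything inside it.
        markup = re.sub(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>",
            " ",
            markup,
            flags=re.DOTALL | re.IGNORECASE,
        )
    markup = re.sub(r"<!--.*?-->", " ", markup, flags=re.DOTALL)  # HTML comments
    markup = re.sub(r"<[^>]+>", " ", markup)  # any remaining tags
    text = html.unescape(markup)  # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Regex extraction like this is fragile on malformed or deeply nested markup, which is a plausible reason to prefer a parser such as BeautifulSoup when it is available and keep the regex path as a fallback only.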

Constructor

URLLoader(urls: str | list[str], headers: dict[str, str] | None = None, extract_text: bool = True, follow_redirects: bool = True, timeout: float = 30.0, extra_metadata: dict[str, Any] | None = None, on_error: ErrorStrategy = 'skip') -> None

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `urls` (required) | `str \| list[str]` | — | Single URL or list of URLs |
| `headers` | `dict[str, str] \| None` | None | HTTP headers to send |
| `extract_text` | `bool` | True | Extract readable text from HTML |
| `follow_redirects` | `bool` | True | Follow HTTP redirects |
| `timeout` | `float` | 30.0 | Request timeout in seconds |
| `extra_metadata` | `dict[str, Any] \| None` | None | Additional metadata for all content |
| `on_error` | `ErrorStrategy` | 'skip' | Error handling strategy ("skip" or "raise") |

Methods
