from svc_infra.loaders import URLLoader

URLLoader
Bases: BaseLoader

Load content from one or more URLs.

Fetches content from URLs and optionally extracts readable text from HTML. Supports redirects, custom headers, and batch loading.
urls: Single URL or list of URLs to load.
headers: Optional HTTP headers to send with requests.
extract_text: If True (default), extract readable text from HTML pages. Raw HTML is returned if False or if content is not HTML.
follow_redirects: Follow HTTP redirects (default: True).
timeout: Request timeout in seconds (default: 30).
extra_metadata: Additional metadata to attach to all loaded content.
on_error: How to handle errors ("skip" or "raise"). Default: "skip".
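The difference between the two `on_error` modes can be illustrated with a small standalone sketch. This is not the library's implementation: `load_all` and `fake_fetch` are hypothetical names invented here, and the sketch assumes that "skip" drops failed URLs from the result while "raise" propagates the first failure.

```python
import asyncio
from typing import Awaitable, Callable


async def load_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    on_error: str = "skip",
) -> list[str]:
    """Fetch every URL; drop failures ("skip") or propagate the first one ("raise")."""
    if on_error not in ("skip", "raise"):
        raise ValueError(f"on_error must be 'skip' or 'raise', got {on_error!r}")
    results: list[str] = []
    for url in urls:
        try:
            results.append(await fetch(url))
        except Exception:
            if on_error == "raise":
                raise
            # "skip": ignore this URL and continue with the rest
    return results


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP request: URLs containing "bad" fail.
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"content of {url}"
```

With "skip", a batch that contains one unreachable URL still returns the content that did load; with "raise", the caller sees the underlying exception and can decide how to recover.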
>>> # Load single URL
>>> loader = URLLoader("https://example.com/docs/guide.md")
>>> contents = await loader.load()
>>> print(contents[0].content[:100])
>>>
>>> # Load multiple URLs
>>> loader = URLLoader([
...     "https://example.com/page1",
...     "https://example.com/page2",
... ])
>>> contents = await loader.load()
>>>
>>> # Disable HTML text extraction
>>> loader = URLLoader("https://example.com", extract_text=False)
>>> contents = await loader.load()  # Returns raw HTML
>>>
>>> # With custom headers (e.g., for APIs)
>>> loader = URLLoader(
...     "https://api.example.com/docs",
...     headers={"Authorization": "Bearer token123"},
... )
>>> contents = await loader.load()
- HTML text extraction removes scripts, styles, nav, footer, etc.
- If BeautifulSoup is not installed, falls back to basic regex extraction
- Content type is detected from HTTP headers