Status: ✅ Complete
Version: 1.0.0
Last Updated: 2024-01-15
Financial document management system with OCR extraction and AI-powered analysis for tax forms, bank statements, receipts, and other financial documents.
Table of Contents
- Overview
- Quick Start
- Use Cases
- Architecture
- API Reference
- Models
- Storage
- OCR Extraction
- AI Analysis
- FastAPI Integration
- Production Migration
- Testing
- Troubleshooting
Overview
Architecture: fin-infra documents is built as Layer 2 on top of svc-infra's generic document storage (Layer 1). This provides a clean separation: svc-infra handles base CRUD operations, while fin-infra adds financial-specific features like OCR for tax forms and AI-powered analysis. See Architecture section for details.
The documents module provides a complete solution for managing financial documents in fintech applications. It handles:
- Document Upload: Store tax forms, bank statements, receipts, invoices, contracts, insurance docs (Layer 1: svc-infra)
- OCR Extraction: Extract text from images/PDFs with Tesseract or AWS Textract (Layer 2: fin-infra)
- AI Analysis: Generate insights, recommendations, and summaries (Layer 2: fin-infra)
- Secure Storage: In-memory for testing, S3 + SQL for production (Layer 1: svc-infra)
- FastAPI Integration: One-liner to mount full REST API (6 endpoints total)
Key Features
✅ Multiple document types (7 supported: tax, bank_statement, receipt, invoice, contract, insurance, other)
✅ OCR text extraction with provider selection (Tesseract ~85% typical confidence, Textract ~96%)
✅ AI-powered document analysis with insights and recommendations
✅ Document filtering by type and year
✅ SHA-256 checksums for integrity verification
✅ MIME type detection
✅ Caching for OCR and analysis results
✅ Production-ready architecture with clear migration path
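For reference, the SHA-256 integrity checksum mentioned above can be reproduced with Python's standard library; `compute_checksum` is an illustrative helper, not part of the fin-infra API:

```python
import hashlib

def compute_checksum(file_bytes: bytes) -> str:
    """SHA-256 hex digest, as stored on each uploaded Document for integrity checks."""
    return hashlib.sha256(file_bytes).hexdigest()

# Verify a downloaded file against the checksum recorded at upload time
stored = compute_checksum(b"%PDF-1.4 example bytes")
assert compute_checksum(b"%PDF-1.4 example bytes") == stored
```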
Quick Start
Installation
# fin-infra includes documents module
poetry add fin-infra
# Or from local source
cd fin-infra
poetry install
Basic Usage (Programmatic)
from fin_infra.documents import easy_documents, DocumentType
# Create manager
docs = easy_documents(
storage_path="/data/documents",
default_ocr_provider="tesseract"
)
# Upload document
with open("w2_2024.pdf", "rb") as f:
    doc = docs.upload(
        user_id="user_123",
        file=f.read(),
        document_type=DocumentType.TAX,
        filename="w2_2024.pdf",
        metadata={"year": 2024, "form_type": "W-2"}
    )
print(f"Uploaded: {doc.id}")
# Extract text via OCR
ocr_result = docs.extract_text(doc.id, provider="tesseract")
print(f"OCR confidence: {ocr_result.confidence}")
# Analyze document with AI
analysis = docs.analyze(doc.id)
print(f"Summary: {analysis.summary}")
for finding in analysis.key_findings:
    print(f"  - {finding}")
FastAPI Integration
from fastapi import FastAPI
from fin_infra.documents import add_documents
app = FastAPI()
# One-liner: mount all document endpoints (Layer 1 + Layer 2)
manager = add_documents(
app,
storage_path="/data/documents",
default_ocr_provider="tesseract",
prefix="/documents"
)
# Endpoints now available:
# Layer 1 (svc-infra base - generic CRUD):
# POST /documents/upload - Upload document
# GET /documents/list - List documents with filters
# GET /documents/{document_id} - Get document details
# DELETE /documents/{document_id} - Delete document
#
# Layer 2 (fin-infra extensions - financial features):
# POST /documents/{document_id}/ocr - Extract text via OCR
# POST /documents/{document_id}/analyze - AI-powered analysis
Use Cases
1. Tax Document Management
Scenario: Personal finance app users upload tax forms for automated analysis.
# User uploads W-2
doc = docs.upload(
user_id="user_123",
file=w2_file_bytes,
document_type=DocumentType.TAX,
filename="w2_acme_corp_2024.pdf",
metadata={"year": 2024, "form_type": "W-2", "employer": "Acme Corp"}
)
# Extract text via OCR
ocr = docs.extract_text(doc.id, provider="textract") # Higher accuracy for tax forms
# Analyze for insights
analysis = docs.analyze(doc.id)
# Returns:
# - Summary: "W-2 for 2024 tax year showing $85,000 in wages"
# - Key findings: ["Total wages: $85,000", "Federal tax withheld: $12,750"]
# - Recommendations: ["Consider increasing 401(k) contributions", "Review W-4 withholding"]
2. Bank Statement Processing
Scenario: Fintech app aggregates bank statements for spending analysis.
# Upload statement
doc = docs.upload(
user_id="user_456",
file=statement_bytes,
document_type=DocumentType.BANK_STATEMENT,
filename="chase_jan_2024.pdf",
metadata={"year": 2024, "month": 1, "bank": "Chase"}
)
# Analyze spending patterns
analysis = docs.analyze(doc.id)
# Returns spending insights, recurring patterns, unusual transactions
3. Receipt Management
Scenario: Expense tracking app allows users to photograph receipts.
# User uploads receipt photo
doc = docs.upload(
user_id="user_789",
file=receipt_image_bytes,
document_type=DocumentType.RECEIPT,
filename="starbucks_receipt.jpg",
metadata={"merchant": "Starbucks", "date": "2024-01-15"}
)
# OCR extracts amount and items
ocr = docs.extract_text(doc.id)
# Analysis categorizes expense
analysis = docs.analyze(doc.id)
# Returns: Category suggestion, deductibility status
4. Document Organization
Scenario: User wants to view all tax documents from a specific year.
# List all tax documents from 2024
tax_docs_2024 = docs.list(
user_id="user_123",
type=DocumentType.TAX,
year=2024
)
for doc in tax_docs_2024:
    print(f"{doc.filename} - uploaded {doc.upload_date}")
Architecture
Layered Design (svc-infra Base + fin-infra Extensions)
fin-infra documents module is built as Layer 2 on top of svc-infra's generic document system (Layer 1):
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Application │
│ (add_documents) │
└───────────────────┬─────────────────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ FinancialDocumentManager │ ← Layer 2 (fin-infra)
│ extends BaseDocumentManager │ Financial features
└────────────┬───────────────────┘
│
▼
┌────────────────────────────────┐
│ BaseDocumentManager │ ← Layer 1 (svc-infra)
│ (generic CRUD) │ Generic storage
└────────────┬───────────────────┘
│
┌────────────────┼────────────────┬────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐ ┌──────────────┐
│ Storage │ │ OCR │ │ Analysis │ │ Models │
│ (Base) │ │(Fin Ext)│ │(Fin Ext) │ │ (Financial) │
└─────────┘ └─────────┘ └──────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│svc-infra│ │ Cache │ │ai-infra │
│ Storage │ │ Dict │ │ LLM │
│ Backend │ │(→Redis) │ │(→GenAI) │
└─────────┘ └─────────┘ └──────────┘
Layer 1 (svc-infra): Generic document infrastructure
- svc_infra.documents.Document - base document model
- svc_infra.documents.upload_document() - storage integration
- svc_infra.documents.add_documents() - 4 base endpoints (upload, list, get, delete)
- Works with ANY storage backend (S3, local, memory)
- Domain-agnostic - usable by ANY application
Layer 2 (fin-infra): Financial-specific extensions
- FinancialDocument extends Document with tax_year, form_type, DocumentType
- extract_text() - OCR for tax forms (W-2, 1099) with field parsing
- analyze() - AI-powered financial insights and recommendations
- 2 additional endpoints: POST /documents/{id}/ocr, POST /documents/{id}/analyze
- Backward compatible - same API surface as before
Why This Architecture?
✅ Separation of concerns: Generic file storage vs financial domain logic
✅ Reusability: Other domains (medical, legal) can use svc-infra base
✅ No duplication: fin-infra imports from svc-infra (not copy-paste)
✅ Clear extension pattern: Shows how to build domain features on generic base
✅ Backward compatible: fin-infra API unchanged, just refactored internally
Component Responsibilities
Storage (storage.py) - Layer 2 Wrapper:
- Delegates to svc_infra.documents for all CRUD operations
- Converts between Document (base) and FinancialDocument (extended)
- Injects/extracts financial metadata (document_type, tax_year, form_type)
- Maintains backward compatibility via function aliases
- Uses svc-infra storage backend (S3, local, memory)
OCR (ocr.py) - Financial Extension:
- Text extraction from images/PDFs
- Provider selection (Tesseract, AWS Textract)
- Tax form field parsing (W-2, 1099)
- Result caching (production: Redis with 7d TTL)
Analysis (analysis.py) - Financial Extension:
- Document analysis with financial insights and recommendations
- Type-specific analyzers (tax, bank statement, receipt, generic)
- Quality validation (confidence >= 0.7, non-empty fields)
- Result caching (production: Redis with 24h TTL)
- Production: Use ai-infra LLM for AI-powered analysis
- Extracts financial metrics (wages, taxes, spending patterns)
Models (models.py) - Layer 2 Extensions:
- FinancialDocument extends svc_infra.documents.Document with financial fields
- Document = alias for backward compatibility
- DocumentType enum (7 types: TAX, BANK_STATEMENT, RECEIPT, etc.)
- OCRResult model (text, confidence, fields for W-2/1099 forms)
- DocumentAnalysis model (summary, findings, recommendations)
- Pydantic v2 models for data validation
How fin-infra Extends svc-infra
Import Pattern:
# fin-infra imports base functionality from svc-infra
from svc_infra.documents import (
Document as BaseDocument,
upload_document as base_upload_document,
add_documents as add_base_documents
)
# Then extends with financial features
class FinancialDocument(BaseDocument):
    type: DocumentType              # Financial-specific enum
    tax_year: Optional[int] = None
    form_type: Optional[str] = None
add_documents() Pattern (Layer 2 calls Layer 1):
def add_documents(app, storage, prefix="/documents"):
    # Step 1: Mount base CRUD endpoints from svc-infra
    add_base_documents(app, storage_backend=storage, prefix=prefix)
    # This gives you: POST /upload, GET /list, GET /{id}, DELETE /{id}

    # Step 2: Add financial-specific endpoints
    @router.post("/{document_id}/ocr")
    async def extract_text_ocr(...):
        return await manager.extract_text(...)

    @router.post("/{document_id}/analyze")
    async def analyze_document_ai(...):
        return await manager.analyze(...)
For Other Domains: Follow the same pattern:
- Medical app: MedicalDocument extends BaseDocument with diagnosis_codes, treatment_dates
- Legal app: LegalDocument extends BaseDocument with case_number, court_jurisdiction
- E-commerce: ProductDocument extends BaseDocument with sku, category
Reference Implementation: See svc-infra's generic documents at docs/documents.md
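As a sketch of that extension pattern (using stdlib dataclasses as a stand-in for the Pydantic base models; the field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BaseDocument:
    # Stand-in for svc_infra.documents.Document (the real base is a Pydantic model)
    id: str
    user_id: str
    filename: str

@dataclass
class MedicalDocument(BaseDocument):
    # Domain-specific fields layered on the generic base
    diagnosis_codes: List[str] = field(default_factory=list)
    treatment_date: Optional[str] = None

record = MedicalDocument(id="doc_1", user_id="u1", filename="chart.pdf",
                         diagnosis_codes=["E11.9"])
```

The generic CRUD layer never needs to know about `diagnosis_codes`; the domain layer injects and extracts them via metadata, exactly as fin-infra does with tax_year and form_type.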
API Reference
DocumentManager Class
easy_documents(storage_path, default_ocr_provider) -> DocumentManager
Factory function to create configured DocumentManager.
Parameters:
- storage_path (str): Base path for document storage (default: "/tmp/documents")
- default_ocr_provider (str): Default OCR provider - "tesseract" or "textract" (default: "tesseract")
Returns: DocumentManager instance
Example:
from fin_infra.documents import easy_documents
docs = easy_documents(
storage_path="/data/documents",
default_ocr_provider="textract"
)
upload(user_id, file, document_type, filename, metadata=None) -> Document
Upload a financial document.
Parameters:
- user_id (str): User identifier
- file (bytes): File content
- document_type (DocumentType): Document type enum
- filename (str): Original filename
- metadata (Optional[dict]): Additional metadata (year, form_type, etc.)
Returns: Document model with generated ID, checksum, MIME type
Raises:
ValueError: If file is empty or filename invalid
Example:
with open("w2.pdf", "rb") as f:
    doc = docs.upload(
        user_id="user_123",
        file=f.read(),
        document_type=DocumentType.TAX,
        filename="w2.pdf",
        metadata={"year": 2024, "form_type": "W-2"}
    )
print(f"Document ID: {doc.id}")
print(f"Checksum: {doc.checksum}")
download(document_id) -> bytes
Download document file content.
Parameters:
- document_id (str): Document identifier
Returns: File bytes
Raises:
ValueError: If document not found
Example:
file_bytes = docs.download("doc_abc123")
with open("downloaded.pdf", "wb") as f:
    f.write(file_bytes)
delete(document_id) -> None
Delete document (hard delete in current implementation).
Parameters:
- document_id (str): Document identifier
Raises:
ValueError: If document not found
Example:
docs.delete("doc_abc123")
list(user_id, type=None, year=None) -> List[Document]
List user's documents with optional filters.
Parameters:
- user_id (str): User identifier
- type (Optional[DocumentType]): Filter by document type
- year (Optional[int]): Filter by year (extracted from metadata)
Returns: List of Document models, sorted by upload_date descending
Example:
# All documents for user
all_docs = docs.list(user_id="user_123")
# Only tax documents
tax_docs = docs.list(user_id="user_123", type=DocumentType.TAX)
# Tax documents from 2024
tax_2024 = docs.list(user_id="user_123", type=DocumentType.TAX, year=2024)
extract_text(document_id, provider=None, force_refresh=False) -> OCRResult
Extract text from document using OCR.
Parameters:
- document_id (str): Document identifier
- provider (Optional[str]): OCR provider - "tesseract" or "textract" (default: manager's default_ocr_provider)
- force_refresh (bool): Bypass cache and re-extract (default: False)
Returns: OCRResult with text, confidence, and extracted fields
Caching: Results cached in-memory (production: Redis 7d TTL)
Example:
# Use default provider
ocr = docs.extract_text("doc_abc123")
# Use specific provider
ocr = docs.extract_text("doc_abc123", provider="textract")
# Force re-extraction
ocr = docs.extract_text("doc_abc123", force_refresh=True)
print(f"Confidence: {ocr.confidence}")
print(f"Text: {ocr.text[:100]}...")
if ocr.fields:
    print(f"Extracted fields: {ocr.fields}")
analyze(document_id, force_refresh=False) -> DocumentAnalysis
Analyze document with AI for insights and recommendations.
Parameters:
- document_id (str): Document identifier
- force_refresh (bool): Bypass cache and re-analyze (default: False)
Returns: DocumentAnalysis with summary, key_findings, recommendations
Caching: Results cached in-memory (production: Redis 24h TTL)
Example:
analysis = docs.analyze("doc_abc123")
print(f"Summary: {analysis.summary}")
print("Key Findings:")
for finding in analysis.key_findings:
    print(f"  - {finding}")
print("Recommendations:")
for rec in analysis.recommendations:
    print(f"  - {rec}")
Models
DocumentType (Enum)
class DocumentType(str, Enum):
    TAX = "tax"                        # W-2, 1099, tax returns
    BANK_STATEMENT = "bank_statement"  # Monthly statements
    RECEIPT = "receipt"                # Purchase receipts
    INVOICE = "invoice"                # Bills, invoices
    CONTRACT = "contract"              # Agreements, contracts
    INSURANCE = "insurance"            # Insurance documents
    OTHER = "other"                    # Misc documents
Document
class Document(BaseModel):
    id: str                  # Format: doc_{uuid}
    user_id: str             # User identifier
    type: DocumentType       # Document type
    filename: str            # Original filename
    upload_date: datetime    # Upload timestamp
    size_bytes: int          # File size
    checksum: str            # SHA-256 checksum
    mime_type: str           # Detected MIME type
    metadata: dict           # Additional metadata
    model_config = ConfigDict(use_enum_values=True)
Example:
{
"id": "doc_abc123",
"user_id": "user_123",
"type": "tax",
"filename": "w2_2024.pdf",
"upload_date": "2024-01-15T10:30:00",
"size_bytes": 45678,
"checksum": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"mime_type": "application/pdf",
"metadata": {"year": 2024, "form_type": "W-2", "employer": "Acme Corp"}
}
OCRResult
class OCRResult(BaseModel):
    document_id: str          # Associated document
    text: str                 # Extracted text
    confidence: float         # OCR confidence (0.0-1.0)
    provider: str             # OCR provider used
    extracted_at: datetime    # Extraction timestamp
    fields: Optional[dict]    # Parsed fields (tax forms)
Example:
{
"document_id": "doc_abc123",
"text": "W-2 Wage and Tax Statement\\n2024\\nEmployer: Acme Corp\\nWages: $85,000\\n...",
"confidence": 0.96,
"provider": "textract",
"extracted_at": "2024-01-15T10:31:00",
"fields": {
"employer": "Acme Corp",
"wages": "85000",
"federal_tax": "12750",
"state_tax": "4250"
}
}
DocumentAnalysis
class DocumentAnalysis(BaseModel):
    document_id: str              # Associated document
    summary: str                  # Brief summary (max 250 chars)
    key_findings: List[str]       # Important findings (3-5 items)
    recommendations: List[str]    # Actionable recommendations (3-5)
    confidence: float             # Analysis confidence (0.0-1.0)
    analyzed_at: datetime         # Analysis timestamp
Example:
{
"document_id": "doc_abc123",
"summary": "W-2 for 2024 tax year showing $85,000 in wages with effective tax rate of 20%",
"key_findings": [
"Total wages: $85,000",
"Federal tax withheld: $12,750 (15% of wages)",
"Effective tax rate: 20%",
"No state disability insurance reported"
],
"recommendations": [
"Consider increasing 401(k) contributions to lower taxable income",
"Review W-4 withholding - may be over-withholding",
"Consult with tax professional about HSA eligibility",
"Note: This is not a substitute for advice from a certified financial advisor or tax professional"
],
"confidence": 0.92,
"analyzed_at": "2024-01-15T10:32:00"
}
Storage
Layered Storage Architecture
fin-infra delegates all storage operations to svc-infra's generic storage system (Layer 1).
How It Works:
# fin-infra storage.py delegates to svc-infra
from svc_infra.documents import (
upload_document as base_upload,
download_document as base_download
)
# Financial wrapper adds DocumentType, tax_year, form_type
async def upload_document(storage, user_id, file, document_type, ...):
    # Convert financial fields to metadata
    metadata = {"document_type": document_type.value, ...}
    # Call base layer
    base_doc = await base_upload(storage, user_id, file, metadata=metadata)
    # Convert to FinancialDocument
    return FinancialDocument(**base_doc.model_dump(), type=document_type)
Storage Backends (via svc-infra):
- MemoryBackend: Testing (fast, no persistence)
- LocalBackend: Railway volumes, filesystem storage
- S3Backend: AWS S3, DigitalOcean Spaces, Wasabi, Backblaze B2
See: svc-infra Storage Guide for backend documentation
Production Setup
Configuration (via environment variables):
# For S3 backend
export STORAGE_S3_BUCKET=user-documents
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# For Railway volumes
export RAILWAY_VOLUME_MOUNT_PATH=/data
Usage:
from fin_infra.documents import add_documents, easy_documents
from svc_infra.storage import easy_storage
# Create storage backend (auto-detects from environment)
storage = easy_storage() # → S3Backend or LocalBackend
# Option 1: FastAPI integration
add_documents(app, storage=storage)
# Option 2: Programmatic
docs = easy_documents(storage=storage)
doc = await docs.upload(user_id, file, DocumentType.TAX, ...)
Benefits:
- Layered design: Generic base + financial extensions
- Backend flexibility: S3, Railway, local, memory
- Battle-tested: Uses svc-infra infrastructure
- No vendor lock-in: Switch backends via environment variables
OCR Extraction
Providers
Tesseract (default):
- Open-source OCR engine
- 85% typical confidence
- Free, runs locally
- Good for clear scans
AWS Textract:
- Cloud-based OCR service
- 96% typical confidence
- Paid service (AWS pricing)
- Superior for complex layouts, handwriting
Usage
# Use Tesseract (default)
ocr = docs.extract_text("doc_abc123")
# Use Textract for higher accuracy
ocr = docs.extract_text("doc_abc123", provider="textract")
# Check confidence
if ocr.confidence < 0.9:
    print("Warning: Low OCR confidence")
Tax Form Field Extraction
The OCR module automatically extracts structured fields from W-2 and 1099 forms:
W-2 Fields:
- employer
- wages
- federal_tax
- state_tax
1099 Fields:
- payer
- income
Example:
ocr = docs.extract_text("doc_w2", provider="textract")
fields = ocr.fields
print(f"Employer: {fields['employer']}")
print(f"Wages: ${fields['wages']}")
print(f"Federal Tax: ${fields['federal_tax']}")
Caching
OCR results are cached to avoid redundant processing:
- Storage: In-memory dict (production: Redis)
- TTL: 7 days (production)
- Cache Key: {document_id}:{provider}
- Bypass: Use force_refresh=True
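A minimal sketch of this caching convention, assuming the in-memory dict default (the helper names are illustrative, not the module's actual internals):

```python
from typing import Dict

_ocr_cache: Dict[str, dict] = {}  # stand-in for the in-memory cache (Redis in production)

def cache_key(document_id: str, provider: str) -> str:
    # Key format from the docs: {document_id}:{provider}
    return f"{document_id}:{provider}"

def get_or_extract(document_id: str, provider: str, force_refresh: bool = False) -> dict:
    key = cache_key(document_id, provider)
    if not force_refresh and key in _ocr_cache:
        return _ocr_cache[key]  # cache hit: skip OCR entirely
    result = {"document_id": document_id, "provider": provider}  # placeholder for a real OCR call
    _ocr_cache[key] = result
    return result
```

Keying on both document ID and provider means a Tesseract result never shadows a Textract one for the same document.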
AI Analysis
Rule-Based Analysis (Current)
Purpose: Simulated AI for testing without external LLM dependencies
Tax Documents:
- Extracts wages from OCR text using regex: Wages:\s*\$?([\d,]+\.?\d*)
- Calculates effective tax rate from federal_tax
- Generates 3-5 recommendations
- Adds investment advice for wages > $100k
- Includes professional advisor disclaimer
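The wage-extraction step above can be sketched with the documented regex (`extract_wages` is an illustrative name, not the module's actual function):

```python
import re
from typing import Optional

# Regex from the rule-based analyzer description
WAGES_RE = re.compile(r"Wages:\s*\$?([\d,]+\.?\d*)")

def extract_wages(ocr_text: str) -> Optional[float]:
    """Pull the wages figure out of OCR text, mirroring the rule-based tax analyzer."""
    match = WAGES_RE.search(ocr_text)
    if not match:
        return None
    # Strip thousands separators before parsing
    return float(match.group(1).replace(",", ""))
```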
Bank Statements:
- Generic spending insights
Receipts:
- Amount extraction
- Categorization advice
Generic:
- Basic "extracted successfully" message
Production Migration (ai-infra LLM)
from ai_infra.llm import LLM
llm = LLM(provider="google_genai", model="gemini-2.0-flash-exp")
# Build financial context
prompt = f"""
Analyze this {document_type} document:
OCR Text:
{ocr_text}
Metadata:
{metadata}
Provide:
1. Brief summary (max 100 words)
2. 3-5 key findings
3. 3-5 actionable recommendations
Note: Always end with "This is not a substitute for advice from a certified financial advisor or tax professional."
"""
# Get analysis with structured output
analysis_schema = {
    "summary": str,
    "key_findings": List[str],
    "recommendations": List[str]
}
result = await llm.achat(
    messages=[{"role": "user", "content": prompt}],
    output_schema=analysis_schema,
    output_method="prompt"
)
Cost Management (MANDATORY):
- Track daily/monthly spend per user
- Enforce budget caps ($0.10/day, $2/month default)
- Use svc-infra cache for expensive operations (24h TTL)
- Target: <$0.10/user/month with caching
- Graceful degradation when budget exceeded (fall back to rule-based)
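The budget-cap-with-fallback policy above might be sketched like this (a minimal illustration; the function and variable names are assumptions, and a real system would reset and persist spend counters):

```python
from collections import defaultdict

DAILY_CAP_USD = 0.10              # default daily cap from the checklist
_spend_today = defaultdict(float) # per-user spend (reset daily in a real system)

def choose_analyzer(user_id: str, estimated_cost: float) -> str:
    """Route to the LLM while the user is under budget; otherwise degrade to rules."""
    if _spend_today[user_id] + estimated_cost > DAILY_CAP_USD:
        return "rule_based"       # graceful degradation, no LLM spend
    _spend_today[user_id] += estimated_cost
    return "llm"
```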
Quality Validation
All analyses must pass validation:
def _validate_analysis(analysis: DocumentAnalysis) -> bool:
    # Confidence threshold
    if analysis.confidence < 0.7:
        return False
    # Non-empty fields
    if not analysis.key_findings or not analysis.recommendations:
        return False
    # Summary length limit
    if len(analysis.summary) > 250:
        return False
    return True
FastAPI Integration
Basic Setup
from fastapi import FastAPI
from fin_infra.documents import add_documents
app = FastAPI()
# Mount document endpoints
manager = add_documents(
app,
storage_path="/data/documents",
default_ocr_provider="tesseract",
prefix="/documents"
)
# Manager available for programmatic access
doc = manager.upload(...)
Endpoints
POST /documents/upload
Upload a new document.
Request Body:
{
"user_id": "user_123",
"file": "base64_encoded_file_or_string",
"document_type": "tax",
"filename": "w2_2024.pdf",
"metadata": {"year": 2024, "form_type": "W-2"}
}
Response: Document model
GET /documents/list?user_id=...&type=...&year=...
List user's documents with optional filters.
Query Parameters:
- user_id (required): User identifier
- type (optional): Document type filter
- year (optional): Year filter
Response: List[Document]
GET /documents/{document_id}
Get document metadata.
Response: Document model
DELETE /documents/{document_id}
Delete a document.
Response:
{"message": "Document deleted successfully"}
POST /documents/{document_id}/ocr?provider=...&force_refresh=...
Extract text from document.
Query Parameters:
- provider (optional): OCR provider ("tesseract" or "textract")
- force_refresh (optional): Bypass cache (default: false)
Response: OCRResult model
POST /documents/{document_id}/analyze?force_refresh=...
Analyze document with AI.
Query Parameters:
- force_refresh (optional): Bypass cache (default: false)
Response: DocumentAnalysis model
Authentication
Production add.py uses svc-infra user_router for authentication:
from svc_infra.api.fastapi.dual.protected import user_router
router = user_router(prefix="/documents", tags=["Documents"])
@router.post("/upload")
async def upload_document(user: RequireUser, ...):
    # User authenticated via user_router
    ...
Documentation
The module automatically registers with svc-infra scoped docs:
from svc_infra.api.fastapi.docs.scoped import add_prefixed_docs
add_prefixed_docs(
app,
prefix="/documents",
title="Documents",
auto_exclude_from_root=True,
visible_envs=None # Show in all environments
)
Access:
- Landing page card: / → "Documents" card
- Scoped OpenAPI: /documents/openapi.json
- Scoped Swagger UI: /documents/docs
- Scoped ReDoc: /documents/redoc
Production Migration
Checklist
Storage Migration:
- Replace in-memory dicts with S3 for files
- Use svc-infra SQL for document metadata
- Add indexes on user_id, type, upload_date
- Implement soft-delete (add deleted_at column)
- Configure S3 lifecycle policies (archive old docs)
OCR Migration:
- Install pytesseract for Tesseract provider
- Configure AWS credentials for Textract
- Replace in-memory cache with Redis via svc-infra
- Set TTL to 7 days for OCR results
- Add error handling for provider failures
Analysis Migration:
- Replace rule-based logic with ai-infra LLM
- Implement cost tracking per user
- Enforce budget caps ($0.10/day, $2/month)
- Add graceful degradation to rule-based fallback
- Use svc-infra cache with 24h TTL
- Filter sensitive data before sending to LLM
Security:
- Enable authentication via user_router
- Add rate limiting (svc-infra middleware)
- Implement file size limits (10MB default)
- Validate MIME types on upload
- Scan uploads for malware (ClamAV)
- Encrypt sensitive documents at rest (S3 KMS)
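The size and MIME-type checks from this checklist could be enforced before a file ever reaches storage; this is a sketch under assumed limits (the allow-list and helper name are illustrative):

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB default from the checklist
ALLOWED_MIME_TYPES = {"application/pdf", "image/jpeg", "image/png"}  # illustrative allow-list

def validate_upload(file_bytes: bytes, mime_type: str) -> None:
    """Reject empty, oversized, or unexpected uploads before they reach storage."""
    if not file_bytes:
        raise ValueError("File cannot be empty")
    if len(file_bytes) > MAX_UPLOAD_BYTES:
        raise ValueError("File exceeds size limit")
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"Unsupported MIME type: {mime_type}")
```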
Monitoring:
- Add svc-infra observability (metrics, traces, logs)
- Track upload success/failure rates
- Monitor OCR confidence distributions
- Alert on low analysis confidence
- Track LLM costs per user
Testing
Unit Tests (42 tests, all passing)
Storage Tests (tests/unit/documents/test_storage.py - 16 tests):
- Upload with/without metadata
- Unique ID generation
- Checksum calculation
- Get/download/delete operations
- List filtering by type and year
- Date sorting
OCR Tests (tests/unit/documents/test_ocr.py - 11 tests):
- Basic text extraction
- W-2 and 1099 form parsing
- Provider confidence differences
- Caching and force_refresh
- Invalid provider errors
- Field extraction validation
Analysis Tests (tests/unit/documents/test_analysis.py - 15 tests):
- W-2, 1099, bank statement, receipt, generic analysis
- Caching and force_refresh
- High-wage W-2 investment recommendations
- Confidence validation
- Summary length limits
- Non-empty findings/recommendations
- Professional advisor disclaimer
Integration Tests (14 tests, all passing)
API Tests (tests/integration/test_documents_api.py):
- add_documents helper mounts all routes
- Upload with/without metadata
- Get document details
- List documents (all, by type, by year)
- Delete document with verification
- OCR extraction (default provider, specific provider)
- Analysis (basic, force refresh)
- Empty list for new user
- Manager stored on app.state
Running Tests
# Unit tests
poetry run pytest tests/unit/documents/ -v
# Integration tests
poetry run pytest tests/integration/test_documents_api.py -v
# All documents tests
poetry run pytest tests/unit/documents/ tests/integration/test_documents_api.py -v
# With coverage
poetry run pytest tests/unit/documents/ --cov=src/fin_infra/documents --cov-report=term-missing
Troubleshooting
Issue: Upload fails with "File is empty"
Cause: File bytes are empty
Solution:
# Ensure file is not empty
if not file_bytes:
    raise ValueError("File cannot be empty")
# Check file size
print(f"File size: {len(file_bytes)} bytes")
Issue: OCR confidence is low (<0.7)
Cause: Poor image quality, complex layout, or handwriting
Solutions:
- Try Textract provider: docs.extract_text(doc_id, provider="textract")
- Improve image quality (higher DPI, better lighting)
- Manually review extracted text
- Implement human-in-the-loop review for low confidence
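One simple way to wire up that human-in-the-loop review is confidence-based routing (a sketch; the threshold matches the quality-validation cutoff, and the function name is illustrative):

```python
CONFIDENCE_THRESHOLD = 0.7  # below this, extraction quality is considered unreliable

def route_ocr_result(confidence: float) -> str:
    """Auto-accept confident extractions; queue the rest for manual review."""
    return "auto_accept" if confidence >= CONFIDENCE_THRESHOLD else "human_review"
```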
Issue: Analysis returns generic insights
Cause: Rule-based analysis has limited intelligence
Solution: Migrate to ai-infra LLM for production:
from ai_infra.llm import LLM
llm = LLM(provider="google_genai", model="gemini-2.0-flash-exp")
# See "Production Migration" section for full implementation
Issue: Document not found after upload
Cause: Storage cleared (in-memory) or wrong document_id
Solutions:
- Check document ID: print(doc.id)
- Verify storage was not cleared: clear_storage() in tests
- Check user_id matches: docs.list(user_id="...")
Issue: Routes conflict (/documents/list matches /{document_id})
Cause: FastAPI route ordering - more specific routes must come first
Solution:
# List route MUST come before /{document_id}
@router.get("/list")
async def list_documents(...): ...

@router.get("/{document_id}")
async def get_document(...): ...
Issue: TestClient raises exceptions instead of returning error responses
Cause: Default TestClient raises server exceptions
Solution:
from fastapi.testclient import TestClient
client = TestClient(app, raise_server_exceptions=False)
Future Enhancements
Short Term:
- Multipart/form-data upload support for file uploads
- Download endpoint (GET /documents/{id}/download)
- Async processing with svc-infra webhooks
- Soft-delete with deleted_at column
- File size limits and validation
Medium Term:
- Batch OCR processing for multiple documents
- Document versioning (track updates)
- Sharing/permissions (public links, team access)
- Search across OCR text (Elasticsearch integration)
- Document tagging system
Long Term:
- AI-powered document classification (auto-detect type)
- Multi-language OCR support
- Handwriting recognition
- Form pre-filling (extract data to tax software)
- Compliance checks (GDPR, SOC 2)
Related Documentation
Layered Architecture:
- svc-infra Documents Guide - Generic Layer 1 base
- svc-infra Storage Guide - Backend abstraction (S3, local, memory)
Financial Features:
- ai-infra LLM Guide - AI-powered analysis
Infrastructure:
- svc-infra Caching Guide - OCR/analysis result caching
- fin-infra README - Package overview
Document Version: 1.0.0
Last Updated: 2024-01-15
Authors: fin-infra team
License: MIT