Serverless PDF Q&A System on AWS

High-Level Architecture

Architecture Components

1. Upload Layer

Amazon S3 - PDF Storage

Bucket 1: pdf-uploads (Standard storage)
- Upload via presigned URLs or direct POST
- Server-side encryption (SSE-S3 or SSE-KMS)
- Lifecycle policy: Move to IA after 30 days, Glacier after 90 days
- Event notifications to SQS on upload

API Gateway + Lambda (Upload Handler)

Validates file type (PDF only)
Checks file size limits (e.g., 50MB max)
Generates presigned URLs for direct S3 upload (reduces Lambda time)
Creates metadata entry in DynamoDB

2. Processing Pipeline

Amazon Step Functions (State Machine)

Orchestrates the entire processing workflow:

{
  "Comment": "PDF Processing Pipeline",
  "StartAt": "ExtractText",
  "States": {
    "ExtractText": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:ExtractText",
      "Next": "GenerateSummary",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure",
          "ResultPath": "$.error"
        }
      ]
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:GenerateSummary",
      "Next": "GenerateEmbeddings"
    },
    "GenerateEmbeddings": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:GenerateEmbeddings",
      "Next": "StoreResults"
    },
    "StoreResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:StoreResults",
      "End": true
    }
  }
}

Amazon Textract

Async document analysis API
Handles multi-page PDFs
Extracts text with layout preservation
Cost: ~$1.50 per 1000 pages

3. AI Processing Layer

Amazon Bedrock - Claude 3.5 Sonnet

Summarization: Use anthropic.claude-3-5-sonnet-20241022-v2:0
- Prompt engineering for structured summaries
- Token limits: Input up to 200K tokens, output 8K tokens
- Cost: ~$3/million input tokens, ~$15/million output tokens
Q&A with RAG:
- Retrieve relevant chunks via vector search
- Provide context to Claude for answer generation
- Streaming responses for better UX

Embedding Generation

Option 1: Amazon Titan Embeddings (amazon.titan-embed-text-v2)
- Cost-effective, managed service
- 1024-dimension vectors
Option 2: Cohere Embed (cohere.embed-english-v3)
- Higher quality, better semantic understanding
- 1024-dimension vectors

4. Storage Layer

DynamoDB - Document Metadata

Table Structure:

Partition Key: document_id (UUID)
Sort Key: None
Attributes:
  - user_id (GSI partition key)
  - uploaded_at (GSI sort key)
  - file_name
  - file_size
  - summary
  - status (processing, completed, failed)
  - s3_key
  - processing_metadata (JSON)
  - ttl (auto-delete old records)

GSIs:

user-id-index: Query documents by user
status-index: Monitor processing status

Cost Optimization:

On-demand pricing for variable workloads
TTL for automatic cleanup
Efficient indexing strategy

Amazon OpenSearch Serverless - Vector Store

Vector similarity search using k-NN plugin
Index structure:
- Document chunks (with metadata)
- Vector embeddings (1024 dimensions)
- Metadata filters (user_id, document_id, date_range)

Cost Optimization:

Serverless OpenSearch (pay per query)
Implement caching for frequent queries
Use filtering to reduce search space

S3 - Processed Documents

Store extracted text and summaries
Lifecycle policies for cost optimization:
- Standard → IA (after 30 days)
- IA → Glacier (after 90 days)
- Glacier → Delete (after 1 year)

5. Q&A Layer

RAG (Retrieval-Augmented Generation) Flow:

User submits question via API Gateway
Lambda generates embedding for the question
Vector similarity search in OpenSearch (top-k: 5-10 chunks)
Retrieve relevant document chunks
Construct prompt with context + question
Invoke Bedrock Claude 3.5 for answer generation
Return answer with source citations

Cost Optimization Strategies

1. Lambda Optimizations

Reserved Concurrency: Set limits per function to prevent cost spikes
Memory Allocation: Right-size based on actual usage (start with 256MB, optimize)
Provisioned Concurrency: Only for critical, low-latency paths (Q&A endpoint)
Timeouts: Set aggressive timeouts (30-60s) to prevent runaway costs
Dead Letter Queues: Monitor failures without retry costs

2. Bedrock Cost Control

Request Throttling: Implement rate limiting per user/tenant
Token Budgets: Set daily/monthly token limits per user
Caching: Cache summaries and embeddings for identical documents
Streaming: Use streaming for Q&A to show partial results (better UX)
Model Selection: Use Claude 3 Haiku for simple tasks, Sonnet for complex

3. S3 Cost Optimization

Lifecycle Policies: Automatic tier transitions
S3 Intelligent-Tiering: For unpredictable access patterns
Compression: Compress stored text (gzip) before storing
Presigned URLs: Direct upload to reduce Lambda processing time

4. OpenSearch Cost Control

Serverless Mode: Pay only for what you use
Query Caching: Cache frequent queries (e.g., common questions)
Batch Operations: Batch embedding storage operations
Index Lifecycle: Delete old vectors automatically

5. DynamoDB Optimization

On-Demand: For variable workloads (no capacity planning)
TTL: Auto-delete old records
Efficient GSIs: Only create necessary indexes
Batch Operations: Batch write/read operations

6. Monitoring & Alerts

CloudWatch Billing Alarms: Alert on cost thresholds
Cost Anomaly Detection: Detect unusual spending patterns
Service-Level Metrics: Track per-service costs
Usage Dashboards: Real-time cost visibility

Scalability Considerations

1. Horizontal Scaling

All components are serverless and auto-scale
No manual capacity planning required
Handles sudden traffic spikes automatically

2. Async Processing

Uploads trigger async processing pipeline
Users get immediate acknowledgment
Processing status tracked in DynamoDB
WebSocket/SSE for real-time status updates (optional)

3. Rate Limiting

API Gateway throttling (requests per second per API key)
Per-user limits via Cognito groups
SQS message throttling for processing queue

4. Caching Strategy

API Gateway Caching: Cache Q&A responses (short TTL: 5-10 min)
Lambda Layer Caching: Cache embeddings for common documents
CloudFront: CDN for static assets (if web UI is static)

5. Multi-Tenancy

Cognito User Pools for isolation
DynamoDB partition key includes tenant/user_id
OpenSearch index per tenant (or metadata filtering)

6. Error Handling & Resilience

Step Functions retry logic with exponential backoff
Dead Letter Queues for failed messages
Circuit breakers for external service calls
Graceful degradation (fallback to cached responses)

Estimated Costs (Example: 1000 documents/month, 500 Q&A queries/day)

Service	Usage	Monthly Cost (USD)
S3 Storage (100GB)	100GB Standard	~$2.30
S3 Requests	10K PUT, 50K GET	~$0.50
Lambda	10M invocations, 50GB-sec	~$20
Textract	5K pages	~$7.50
Bedrock Claude 3.5	100M input tokens, 5M output tokens	~$375
Bedrock Embeddings	100K embeddings	~$1
DynamoDB	1M read, 500K write	~$0.25
OpenSearch Serverless	100K queries	~$50
API Gateway	15K requests	~$1
Step Functions	1K executions	~$0.50
Total		~$458/month

Cost Reduction Tips:

Use Claude 3 Haiku for simple Q&A: -60% cost
Implement aggressive caching: -40% Bedrock costs
Optimize S3 lifecycle: -50% storage costs
Batch processing: -30% Lambda costs

Security Considerations

Authentication: Cognito User Pools with MFA
Authorization: IAM roles with least privilege
Encryption:
- S3: SSE-S3 or SSE-KMS
- DynamoDB: Encryption at rest
- API Gateway: TLS 1.2+
Data Isolation: Tenant-based data separation
Audit Logging: CloudTrail for all API calls
VPC: Optional VPC for Lambda (if needed for compliance)

Deployment Recommendations

Infrastructure as Code: AWS CDK or Terraform
CI/CD: GitHub Actions or AWS CodePipeline
Monitoring: CloudWatch Dashboards + X-Ray for tracing
Testing:
- Unit tests for Lambda functions
- Integration tests for Step Functions
- Load testing for cost validation
Multi-Region: Deploy in multiple regions for DR (optional)

Implementation Phases

Phase 1: MVP (Minimum Viable Product)

S3 upload → Textract → Bedrock summary → DynamoDB
Basic Q&A with simple keyword search

Phase 2: Vector Search

Implement embeddings and OpenSearch
RAG-based Q&A

Phase 3: Optimization

Cost optimization
Caching layer
Advanced monitoring

Phase 4: Enterprise Features

Multi-tenancy
Advanced analytics
Fine-tuning capabilities

Alternative Architectures

Option 1: Use Amazon Kendra Instead of OpenSearch

Pros: Fully managed, better semantic search, no infrastructure management
Cons: Higher cost, less customization
Use Case: Enterprise customers with budget

Option 2: Use Amazon Comprehend for Summarization

Pros: Lower cost for simple summaries
Cons: Less flexible, lower quality than Claude
Use Case: Cost-sensitive scenarios with simple documents

Option 3: Hybrid Approach

Use Bedrock for complex documents
Use Comprehend for simple documents
Route based on document complexity

Next Steps

Set up AWS account with appropriate service quotas
Create proof-of-concept with one document
Measure actual costs and optimize
Scale incrementally based on usage patterns

High-Level Architecture​

Architecture Components​

1. Upload Layer​

Amazon S3 - PDF Storage​

API Gateway + Lambda (Upload Handler)​

2. Processing Pipeline​

Amazon Step Functions (State Machine)​

Amazon Textract​

3. AI Processing Layer​

Amazon Bedrock - Claude 3.5 Sonnet​

Embedding Generation​

4. Storage Layer​

DynamoDB - Document Metadata​

Amazon OpenSearch Serverless - Vector Store​

S3 - Processed Documents​

5. Q&A Layer​

RAG (Retrieval-Augmented Generation) Flow:​

Cost Optimization Strategies​

1. Lambda Optimizations​

2. Bedrock Cost Control​

3. S3 Cost Optimization​

4. OpenSearch Cost Control​

5. DynamoDB Optimization​

6. Monitoring & Alerts​

Scalability Considerations​

1. Horizontal Scaling​

2. Async Processing​

3. Rate Limiting​

4. Caching Strategy​

5. Multi-Tenancy​

6. Error Handling & Resilience​

Estimated Costs (Example: 1000 documents/month, 500 Q&A queries/day)​

Security Considerations​

Deployment Recommendations​

Implementation Phases​

Phase 1: MVP (Minimum Viable Product)​

Phase 2: Vector Search​

Phase 3: Optimization​

Phase 4: Enterprise Features​

Alternative Architectures​

Option 1: Use Amazon Kendra Instead of OpenSearch​

Option 2: Use Amazon Comprehend for Summarization​

Option 3: Hybrid Approach​

Next Steps​

High-Level Architecture

Architecture Components

1. Upload Layer

Amazon S3 - PDF Storage

API Gateway + Lambda (Upload Handler)

2. Processing Pipeline

Amazon Step Functions (State Machine)

Amazon Textract

3. AI Processing Layer

Amazon Bedrock - Claude 3.5 Sonnet

Embedding Generation

4. Storage Layer

DynamoDB - Document Metadata

Amazon OpenSearch Serverless - Vector Store

S3 - Processed Documents

5. Q&A Layer

RAG (Retrieval-Augmented Generation) Flow:

Cost Optimization Strategies

1. Lambda Optimizations

2. Bedrock Cost Control

3. S3 Cost Optimization

4. OpenSearch Cost Control

5. DynamoDB Optimization

6. Monitoring & Alerts

Scalability Considerations

1. Horizontal Scaling

2. Async Processing

3. Rate Limiting

4. Caching Strategy

5. Multi-Tenancy

6. Error Handling & Resilience

Estimated Costs (Example: 1000 documents/month, 500 Q&A queries/day)

Security Considerations

Deployment Recommendations

Implementation Phases

Phase 1: MVP (Minimum Viable Product)

Phase 2: Vector Search

Phase 3: Optimization

Phase 4: Enterprise Features

Alternative Architectures

Option 1: Use Amazon Kendra Instead of OpenSearch

Option 2: Use Amazon Comprehend for Summarization

Option 3: Hybrid Approach

Next Steps