Document Processing Worker

October 8, 2025 · View on GitHub

High-performance, cloud-native document processing worker service engineered for enterprise-scale document processing workloads. The worker implements a distributed, event-driven architecture that seamlessly scales from single-instance development environments to multi-node production clusters handling thousands of concurrent document processing operations.

Core Worker Concepts

Distributed Processing Architecture

The doc-proc-worker is built on a message-driven, horizontally scalable architecture that separates concerns between task coordination, execution orchestration, and processing logic. This design enables:

Fault Isolation: Failed workers don't impact other processing units
Load Distribution: Work is automatically distributed across available worker instances
Resource Optimization: CPU, memory, and I/O resources are efficiently utilized across processing units

Event-Driven Processing Model

The worker operates on an asynchronous, event-driven processing model where:

Queue-Based Coordination: Azure Storage Queues provide durable, distributed task coordination
Message-Driven Execution: Each processing unit is triggered by queue messages containing batch execution requests
State Management: Azure Cosmos DB provides consistent, globally distributed state tracking
Pipeline Orchestration: Document processing flows through configurable, multi-stage pipelines

Scaling Dimensions

Horizontal Scaling (Scale-Out)

Instance Scaling: Deploy multiple worker containers across nodes
Queue Partitioning: Distribute messages across multiple queues for parallel processing
Regional Distribution: Deploy workers in multiple regions for global processing

Vertical Scaling (Scale-Up)

Multiprocessing Pools: Utilize multiple CPU cores within single instances
Memory Optimization: Scale worker memory for large document processing
I/O Parallelization: Concurrent file operations and network requests

Overview

The doc-proc-worker serves as the computational engine of the document processing solution accelerator, designed to process large-scale document workloads with enterprise-grade reliability, performance, and observability. It integrates seamlessly with Azure cloud services to provide a fully managed, serverless-ready processing platform.

Key Capabilities

Asynchronous Queue Processing: High-throughput polling and processing of Azure Storage Queue messages
Pipeline Execution: Executes configurable document processing pipelines with step-by-step orchestration
Batch Processing: Handles batch operations with progress tracking, error recovery, and results aggregation
Multiprocessing Support: Scales to multiple concurrent worker processes for CPU-intensive workloads
Fault Tolerance: Comprehensive error handling, retry mechanisms, and graceful degradation
Health Monitoring: Real-time worker health checks with automatic restart capabilities
Graceful Shutdown: Coordinated shutdown handling across all worker processes
Cloud-Native: Built for Azure with native integration to Azure services

Core Architecture Components

1. Queue Worker (`QueueWorker`)

Primary Message Processing Engine

The Queue Worker is the foundational processing unit that implements the core message-driven processing loop:

Asynchronous Message Polling: Continuously monitors Azure Storage Queues using efficient long-polling techniques
Message Deserialization: Converts JSON queue messages into strongly-typed Python objects with validation
Processing Coordination: Routes messages to appropriate handlers based on message type and priority
Error Recovery: Implements exponential backoff, retry logic, and dead letter handling
Graceful Shutdown: Supports coordinated shutdown across distributed worker instances

Key Responsibilities:

# Message processing lifecycle
async def process_message(message: QueueMessage) -> ProcessingResult:
    1. Validate message format and authentication
    2. Route to appropriate execution handler  
    3. Track processing progress and metrics
    4. Handle errors and implement retry logic
    5. Clean up resources and update message visibility

2. Execution Manager (`ExecutionManager`)

Batch Processing Orchestrator

The Execution Manager provides high-level orchestration for document processing batches:

Batch Lifecycle Management: Creates, tracks, and finalizes BatchExecution records in Cosmos DB
Pipeline Coordination: Loads pipeline definitions and coordinates multi-step processing workflows
Progress Tracking: Provides real-time progress updates and completion metrics
Result Aggregation: Collects and consolidates results from individual document processing operations

Processing Flow:

# Batch execution workflow
async def execute_batch(request: BatchExecutionRequest) -> BatchResult:
    1. Create BatchExecution record with SUBMITTED status
    2. Load and validate pipeline configuration
    3. Initialize pipeline with source documents
    4. Execute pipeline steps with progress tracking
    5. Aggregate results and update final status
    6. Store results and cleanup temporary resources

3. Pipeline Manager (`PipelineManager`)

Dynamic Pipeline Orchestration

The Pipeline Manager provides flexible, configurable document processing workflows:

Dynamic Pipeline Loading: Loads pipeline configurations from Cosmos DB with caching
Step Dependency Resolution: Manages complex dependencies between pipeline steps
Step Instance Management: Creates and configures individual step instances with parameters
Data Flow Management: Handles data passing between pipeline steps with validation

4. Worker Pool Manager (`WorkerPoolManager`)

Multiprocessing Coordination

The Worker Pool Manager enables horizontal scaling within single instances:

Process Pool Management: Creates, monitors, and manages multiple worker processes
Health Monitoring: Continuously monitors worker process health and performance
Automatic Recovery: Automatically restarts failed worker processes with configurable strategies
Load Balancing: Distributes work across available worker processes efficiently
Graceful Shutdown Coordination: Orchestrates coordinated shutdown across all worker processes
Resource Isolation: Provides process-level isolation for fault tolerance

Processing Flow

Message Processing Lifecycle

Queue Polling: Workers continuously poll Azure Storage Queue for new messages
Message Deserialization: JSON messages are parsed into typed request objects
Message Validation: Request validation and type checking
Batch Execution:
- Create BatchExecution record in Cosmos DB
- Load pipeline configuration and steps
- Process documents through pipeline stages
- Track progress and collect results
- Update batch status and completion metrics
Result Storage: Store execution results and statistics in Cosmos DB
Queue Cleanup: Delete successfully processed messages from queue

Supported Message Types

BATCH_EXECUTION_REQUEST: Primary message type for initiating document processing batches

Pipeline Execution

The worker integrates with the doc-proc-lib pipeline framework:

Loads pipeline configurations dynamically
Executes step-by-step document processing
Handles step dependencies and data flow
Provides error isolation and recovery
Aggregates step outputs into batch results

Scaling Architecture & Patterns

Processing Models & Scaling Strategies

Single Worker Mode

Ideal for: Development, Testing, Low-Volume Processing

Architecture: Single asyncio event loop with sequential message processing
Scaling Characteristics:
- Vertical scaling only (CPU/memory within single process)
- Predictable resource consumption patterns
- Linear performance scaling with queue depth
Benefits:
- Simplified deployment and debugging
- Lower memory footprint
- Deterministic execution order
- Easier troubleshooting and monitoring
Limitations:
- Cannot utilize multi-core processors effectively
- Single point of failure within instance
- Limited throughput ceiling
- CPU-bound operations block other processing

Multiprocessing Pool Mode

Ideal for: Production Workloads, High-Throughput, CPU-Intensive Processing

Architecture: Multiple independent worker processes with shared coordination
Scaling Characteristics:
- Horizontal scaling within single VM/container instance
- Linear performance scaling with CPU core count
- Independent failure domains per worker process
- Concurrent message processing across processes
Advanced Features:
- Dynamic Process Management: Automatic worker process creation/termination
- Load Balancing: Intelligent work distribution across available processes
- Health Monitoring: Per-process health tracking with automatic recovery
- Resource Isolation: Memory and CPU isolation prevents resource contention
- Graceful Degradation: Continues operating with reduced capacity during failures

Horizontal Scaling Patterns

Container-Level Scaling

Multi-Instance Deployment

Azure Container Apps: Auto-scale based on queue depth, CPU, memory metrics
Kubernetes: HPA (Horizontal Pod Autoscaler) with custom queue metrics
Benefits:
- Independent failure domains across instances
- Geographic distribution capabilities
- Resource isolation at infrastructure level
- Platform-managed scaling policies

Vertical Scaling Optimizations

CPU Scaling

Processor Utilization Strategies

CPU-Intensive Workloads: Scale worker processes to match CPU core count
I/O-Intensive Workloads: Over-subscribe CPUs (2x-4x core count) for better utilization
Mixed Workloads: Dynamic process adjustment based on workload characteristics

I/O Scaling

Network and Storage Optimization

Connection Pooling: Optimize Azure service connections per worker
Batch Operations: Aggregate multiple operations for improved throughput
Async I/O: Non-blocking I/O operations across all Azure service interactions
Caching Strategies: Multi-level caching for pipeline configurations and metadata

For detailed information about multiprocessing architecture and patterns, see MULTIPROCESSING.md.

Worker Infrastructure Deployment

Worker-Specific Bicep Template

The doc-proc-worker has its own dedicated Bicep template at infra/bicep/main.bicep that focuses exclusively on deploying the worker container application. This template is designed to be lightweight and assumes that the core infrastructure (storage, databases, identity) has already been deployed by the main solution template.

Worker Bicep Template Overview

The worker's main.bicep template deploys a single, focused component:

Worker Container Application:

Azure Container App - Serverless, scalable worker container that processes document batches from storage queues
Managed Identity Integration - Uses existing user-assigned identity for secure access to Azure services
Environment Variable Configuration - Connects to Azure App Configuration for runtime settings
Container Registry Integration - Pulls worker container images from the shared registry

Worker Deployment Parameters

The worker template accepts focused parameters for container-specific configuration:

Basic Configuration:

@description('Name prefix for worker resources')
param namePrefix string = 'docproc'

@description('Environment name (dev, staging, prod)')
param environment string = 'dev'

Container Infrastructure References:

@description('Container Apps Environment resource name where the container apps will be deployed')
param containerAppsEnvironment string

@description('Container Registry Server')
param containerRegistryServer string

@description('Container image for the backend app')
param containerImage string

Resource Allocation:

@description('CPU cores for the container')
param cpuCores string = '1.0'

@description('Memory in GB for the container')
param memoryInGB string = '2Gi'

Security and Configuration:

@description('User Assigned Identity Resource Name used as identity for the api app')
param userAssignedIdentityName string

@description('App Configuration Store resource endpoint')
param appConfigStoreEndpoint string

@description('Additional environment variables')
param additionalEnvironmentVariables array = []

Worker Container Configuration

The template configures the worker container with the following key characteristics:

Runtime Configuration:

No External Ingress - Worker operates as a background service without HTTP endpoints
Managed Identity Authentication - Uses user-assigned identity for passwordless Azure service access
Environment Variables - Configures Azure App Configuration connection and client identity

Container Settings:

containers: [
  {
    name: appName
    image: containerImage
    resources: {
      cpu: cpuCores      // Default: 1.0 cores
      memory: memoryInGB // Default: 2Gi RAM
    }
    env: environmentVariables
  }
]

Security Configuration:

System-Assigned Identity: Disabled (uses user-assigned identity instead)
User-Assigned Identity: References existing identity from infrastructure deployment
Container Registry Access: Authenticated using managed identity

Environment Variables

The template automatically configures essential environment variables:

var environmentVariables = [
  {
    name: 'AZURE_APP_CONFIG_CONNECTION_STRING'
    value: ''  // Empty - uses managed identity instead
  }
  {
    name: 'AZURE_APP_CONFIG_ENDPOINT'
    value: appConfigStoreEndpoint
  }
  {
    name: 'AZURE_CLIENT_ID'
    value: userAssignedIdentity.properties.clientId
  }
  // Plus any additionalEnvironmentVariables passed as parameters
]

Deployment Dependencies

The worker template assumes these resources already exist:

Container Apps Environment - Shared hosting environment for all container apps
User Assigned Identity - Pre-created identity with appropriate Azure service permissions
Container Registry - Repository containing the worker container image
App Configuration Store - Centralized configuration service with worker settings

Template Outputs

The worker template provides a single, focused output:

output containerAppName string = workerApp.outputs.name

This output can be used for:

Monitoring and logging integrations
Scaling policy configurations
Health check implementations
CI/CD pipeline validations

Scaling Considerations

The worker template supports both vertical and horizontal scaling:

Vertical Scaling:

Adjust cpuCores parameter (0.25 to 4.0 cores)
Modify memoryInGB parameter (0.5Gi to 8Gi)

Horizontal Scaling:

Container Apps Environment handles automatic scaling
Multiple worker instances can be deployed across availability zones
Each worker operates independently with shared queue processing