Chapter 3: Document AI and Content Processing

April 13, 2026 · View on GitHub

Welcome to Chapter 3: Document AI and Content Processing. In this part of n8n AI Tutorial: Workflow Automation with AI, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Extract information from PDFs, images, web pages, and documents using AI-powered processing.

Document AI Pipeline

flowchart LR
    SRC[Source: Email / S3 / URL] --> LOAD[Document Loader\nPDF, HTML, CSV]
    LOAD --> SPLIT[Text Splitter\nRecursiveCharacterTextSplitter]
    SPLIT --> AI[AI Node\nExtract / Summarize / Classify]
    AI --> OUT[Structured Output\nJSON fields]
    OUT --> STORE[Database / Spreadsheet]

Document Processing Nodes

n8n provides various nodes for processing different document types with AI assistance.

PDF Processing

PDF to Text Extraction

{
  "parameters": {
    "operation": "pdfToText",
    "binaryData": true,
    "dataPropertyName": "data",
    "options": {
      "mimeType": "application/pdf"
    }
  },
  "name": "Extract PDF Text",
  "type": "n8n-nodes-base.extractFromFile",
  "typeVersion": 1
}

AI-Powered PDF Analysis

{
  "nodes": [
    {
      "parameters": {
        "operation": "pdfToText",
        "binaryData": true,
        "dataPropertyName": "data"
      },
      "name": "PDF Extractor",
      "type": "n8n-nodes-base.extractFromFile"
    },
    {
      "parameters": {
        "model": "gpt-4o",
        "messages": [
          {
            "role": "system",
            "content": "You are an expert document analyzer. Extract key information and provide a structured summary."
          },
          {
            "role": "user",
            "content": "Analyze this document and extract:\n1. Main topic\n2. Key findings\n3. Important dates\n4. Contact information\n\nDocument text:\n{{ $json.text }}"
          }
        ],
        "responseFormat": "json"
      },
      "name": "AI Document Analyzer",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    }
  ],
  "connections": {
    "PDF Extractor": {
      "main": [
        [
          {
            "node": "AI Document Analyzer",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

Web Scraping with AI

Web Page Content Extraction

{
  "parameters": {
    "url": "={{ $json.website_url }}",
    "responseFormat": "html",
    "options": {
      "followRedirects": true,
      "timeout": 10000
    }
  },
  "name": "Web Scraper",
  "type": "n8n-nodes-base.httpRequest",
  "typeVersion": 1
}

AI-Powered Web Content Analysis

{
  "nodes": [
    {
      "parameters": {
        "url": "={{ $json.url }}",
        "responseFormat": "html"
      },
      "name": "Fetch Webpage",
      "type": "n8n-nodes-base.httpRequest"
    },
    {
      "parameters": {
        "dataPropertyName": "data",
        "extractionValues": {
          "values": [
            {
              "key": "title",
              "cssSelector": "title",
              "returnValue": "text"
            },
            {
              "key": "content",
              "cssSelector": "body",
              "returnValue": "html"
            }
          ]
        }
      },
      "name": "Extract Content",
      "type": "n8n-nodes-base.html"
    },
    {
      "parameters": {
        "model": "gpt-4o",
        "messages": [
          {
            "role": "system",
            "content": "You are a web content analyzer. Extract and summarize the key information from web pages."
          },
          {
            "role": "user",
            "content": "Summarize this webpage content in 3 key points:\n\nTitle: {{ $json.title }}\nContent: {{ $json.content }}"
          }
        ]
      },
      "name": "AI Content Summarizer",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    }
  ]
}

Image Processing with AI

Image Analysis

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in detail:"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "={{ $json.image_url }}"
            }
          }
        ]
      }
    ]
  },
  "name": "Image Analyzer",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

OCR with AI Enhancement

{
  "nodes": [
    {
      "parameters": {
        "operation": "ocr",
        "binaryData": true,
        "dataPropertyName": "data",
        "options": {
          "language": "eng",
          "tesseractOptions": {
            "psm": 3
          }
        }
      },
      "name": "OCR Extractor",
      "type": "n8n-nodes-base.extractFromFile"
    },
    {
      "parameters": {
        "model": "gpt-4o",
        "messages": [
          {
            "role": "system",
            "content": "Clean and correct OCR text. Fix any errors and improve formatting."
          },
          {
            "role": "user",
            "content": "Correct this OCR text:\n{{ $json.text }}"
          }
        ]
      },
      "name": "AI Text Corrector",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    }
  ]
}

Document Classification

Automatic Document Categorization

{
  "parameters": {
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a document classifier. Analyze the content and classify it into one of these categories: invoice, contract, report, email, legal, technical, marketing, financial, medical, other."
      },
      {
        "role": "user",
        "content": "Classify this document:\n\n{{ $json.document_text }}"
      }
    ],
    "responseFormat": "json"
  },
  "name": "Document Classifier",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Multi-Label Classification

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Analyze the document and assign multiple relevant tags from: urgent, confidential, legal, financial, technical, customer-related, internal, external, review-required, approved, rejected."
      },
      {
        "role": "user",
        "content": "Tag this document with relevant labels:\n{{ $json.content }}"
      }
    ],
    "responseFormat": "json"
  },
  "name": "Document Tagger",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Information Extraction

Structured Data Extraction

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Extract structured information from documents. Return valid JSON with the requested fields."
      },
      {
        "role": "user",
        "content": "Extract the following from this invoice:\n- Invoice number\n- Date\n- Vendor name\n- Total amount\n- Line items\n\nDocument: {{ $json.document_text }}"
      }
    ],
    "responseFormat": "json"
  },
  "name": "Invoice Extractor",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Entity Recognition

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Extract named entities from text. Return JSON with arrays for: persons, organizations, locations, dates, amounts."
      },
      {
        "role": "user",
        "content": "Extract entities from:\n{{ $json.text }}"
      }
    ],
    "responseFormat": "json"
  },
  "name": "Entity Extractor",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Document Summarization

Automatic Summarization

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are an expert document summarizer. Create concise, accurate summaries that capture the main points and key information."
      },
      {
        "role": "user",
        "content": "Summarize this document in 3-5 bullet points:\n\n{{ $json.document_text }}"
      }
    ],
    "maxTokens": 300
  },
  "name": "Document Summarizer",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Executive Summary Generation

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Create executive summaries for business documents. Focus on key decisions, actions, and outcomes."
      },
      {
        "role": "user",
        "content": "Create an executive summary for:\n{{ $json.document_text }}\n\nInclude: purpose, key findings, recommendations, next steps."
      }
    ],
    "responseFormat": "json"
  },
  "name": "Executive Summary",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Document Q&A System

Interactive Document Query

{
  "nodes": [
    {
      "parameters": {
        "model": "text-embedding-ada-002",
        "input": "={{ $json.document_chunks }}"
      },
      "name": "Create Embeddings",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    },
    {
      "parameters": {
        "operation": "upsert",
        "pineconeIndex": "documents",
        "items": "={{ $json.embeddings.map((emb, i) => ({ id: $json.chunk_ids[i], values: emb, metadata: { text: $json.chunks[i] } })) }}"
      },
      "name": "Store in Vector DB",
      "type": "@n8n/n8n-nodes-langchain.pinecone"
    },
    {
      "parameters": {
        "model": "text-embedding-ada-002",
        "input": "={{ $json.question }}"
      },
      "name": "Query Embedding",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    },
    {
      "parameters": {
        "operation": "getMany",
        "pineconeIndex": "documents",
        "query": "={{ $json.query_embedding[0] }}",
        "numberOfResults": 3
      },
      "name": "Retrieve Context",
      "type": "@n8n/n8n-nodes-langchain.pinecone"
    },
    {
      "parameters": {
        "model": "gpt-4o",
        "messages": [
          {
            "role": "system",
            "content": "Answer questions based on the provided context. If the answer isn't in the context, say so."
          },
          {
            "role": "user",
            "content": "Context:\n{{ $json.context_chunks.join('\\n---\\n') }}\n\nQuestion: {{ $json.question }}"
          }
        ]
      },
      "name": "Generate Answer",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    }
  ]
}

Automated Document Processing

Email Document Processing

{
  "nodes": [
    {
      "parameters": {
        "resource": "message",
        "operation": "getAll",
        "options": {
          "filter": "has:attachment filename:pdf"
        }
      },
      "name": "Gmail Trigger",
      "type": "n8n-nodes-base.gmail"
    },
    {
      "parameters": {
        "operation": "pdfToText",
        "binaryData": true,
        "dataPropertyName": "data"
      },
      "name": "Extract PDF",
      "type": "n8n-nodes-base.extractFromFile"
    },
    {
      "parameters": {
        "model": "gpt-4o",
        "messages": [
          {
            "role": "system",
            "content": "Analyze this document and determine: 1) Document type 2) Priority level (high/medium/low) 3) Key action items 4) Response needed (yes/no)"
          },
          {
            "role": "user",
            "content": "Analyze: {{ $json.text }}"
          }
        ],
        "responseFormat": "json"
      },
      "name": "AI Document Analysis",
      "type": "@n8n/n8n-nodes-langchain.openAi"
    },
    {
      "parameters": {
        "conditions": {
          "string": [
            {
              "value1": "={{ $json.priority }}",
              "operation": "equal",
              "value2": "high"
            }
          ]
        }
      },
      "name": "High Priority Check",
      "type": "n8n-nodes-base.if"
    }
  ]
}

Content Generation

Automated Report Generation

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are a professional report writer. Create well-structured, comprehensive reports."
      },
      {
        "role": "user",
        "content": "Generate a business report with these sections:\n1. Executive Summary\n2. Current Situation\n3. Analysis\n4. Recommendations\n\nData: {{ $json.business_data }}"
      }
    ],
    "maxTokens": 2000
  },
  "name": "Report Generator",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Content Enhancement

{
  "parameters": {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Improve content quality: fix grammar, enhance clarity, add structure, make more engaging."
      },
      {
        "role": "user",
        "content": "Enhance this content:\n{{ $json.original_text }}"
      }
    ]
  },
  "name": "Content Enhancer",
  "type": "@n8n/n8n-nodes-langchain.openAi"
}

Integration Patterns

API-Based Document Processing

import requests
import json

def process_document_with_n8n(document_url, webhook_url):
    """Send document to n8n workflow for processing."""

    payload = {
        "document_url": document_url,
        "processing_type": "analysis"
    }

    response = requests.post(webhook_url, json=payload)

    if response.status_code == 200:
        result = response.json()
        return {
            "summary": result.get("summary"),
            "entities": result.get("entities"),
            "sentiment": result.get("sentiment")
        }
    else:
        raise Exception(f"n8n processing failed: {response.text}")

# Usage
result = process_document_with_n8n(
    "https://example.com/document.pdf",
    "http://localhost:5678/webhook/document-processor"
)

Best Practices

Pre-processing: Clean and structure input documents before AI processing
Chunking: Split large documents into manageable chunks
Caching: Cache processed results to avoid reprocessing
Validation: Validate AI-extracted information
Error Handling: Handle document parsing failures gracefully
Rate Limiting: Respect API limits when processing batches
Monitoring: Track processing success rates and quality
Security: Sanitize document content before processing

Document AI transforms how organizations process and understand their content. The next chapter explores building autonomous AI agents with tool access.

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for content, json, nodes so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

coupling core logic too tightly to one implementation path
missing the handoff boundaries between setup, execution, and validation
shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 3: Document AI and Content Processing as an operating subsystem inside n8n AI Tutorial: Workflow Automation with AI, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around name, parameters, role as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 3: Document AI and Content Processing usually follows a repeatable control path:

Context bootstrap: initialize runtime config and prerequisites for content.
Input normalization: shape incoming data so json receives stable contracts.
Core execution: run the main logic branch and propagate intermediate state through nodes.
Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
Output composition: return canonical result payloads for downstream consumers.
Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Key source files in n8n-io/n8n:

packages/@n8n/nodes-langchain/nodes/document_loaders/ -- document loader nodes: PDF, URL, JSON, CSV, binary data loaders
packages/@n8n/nodes-langchain/nodes/text_splitters/ -- text splitter nodes wrapping LangChain's RecursiveCharacterTextSplitter, TokenTextSplitter
packages/@n8n/nodes-langchain/nodes/output_parser/ -- structured output parsers for extracting JSON from LLM responses

Suggested trace: follow a PDF document loader node's supplyData() to see how it returns a LangChain Document[] array for downstream vector store ingestion.