📂 MCP Data Fetch Server

November 15, 2025

MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.



🎯 Features

  • Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.
  • Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
  • Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
  • Built‑in caching – optional cache directory reduces repeated network calls.
  • Prompt‑injection detection – validates URLs and fetched content for malicious instructions.

📦 Installation & Quick Start

# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory

📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.


⚙️ Command‑Line Options

| Option | Description |
| --- | --- |
| -d, --working-dir | Path to the sandboxed working directory where all file operations are confined (default: ~/.mcp_datafetch). |
| -c, --cache-dir | Name of the cache subdirectory relative to the working directory (default: cache). |
| -h, --help | Show help message and exit. |
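
To make the relationship between the two directory options concrete, here is a minimal sketch of how they could be parsed and combined. It is not the server's actual code: `build_parser` and `resolve_dirs` are hypothetical names, and the check that the cache stays inside the working directory is an assumption based on the sandboxing described in this README.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Option-parsing sketch")
    parser.add_argument("-d", "--working-dir", default="~/.mcp_datafetch",
                        help="sandboxed working directory")
    parser.add_argument("-c", "--cache-dir", default="cache",
                        help="cache subdirectory, relative to the working directory")
    return parser


def resolve_dirs(argv):
    """Return (working_dir, cache_dir) as absolute paths."""
    args = build_parser().parse_args(argv)
    working = Path(args.working_dir).expanduser().resolve()
    cache = (working / args.cache_dir).resolve()
    # Assumed rule: the cache must live strictly inside the working directory.
    if working not in cache.parents:
        raise ValueError(f"cache directory escapes the sandbox: {cache}")
    return working, cache
```

With these assumptions, `resolve_dirs(["-d", "/srv/sandbox"])` yields a cache path of `/srv/sandbox/cache`, while a cache name such as `../elsewhere` is rejected.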

🤝 Integration with LM Studio (or any MCP‑compatible client)

Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.

{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}

📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.


📡 MCP API Overview

All communication follows JSON‑RPC 2.0 over stdin/stdout.

initialize

Request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}

Response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
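
For readers who want to drive the server without LM Studio, the following is a minimal client sketch. It assumes messages are framed as one JSON object per line, which is the usual convention for the MCP stdio transport; this README does not state the server's framing, so treat that as an assumption. The helper names (`make_request`, `encode_message`, `exchange`, `initialize`) are hypothetical.

```python
import json
import subprocess
from typing import Any, Optional


def make_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or {}}


def encode_message(message: dict) -> bytes:
    """Serialise one message as a single UTF-8 line (assumed framing)."""
    return (json.dumps(message, ensure_ascii=False) + "\n").encode("utf-8")


def exchange(proc: subprocess.Popen, request: dict) -> Any:
    """Write one request to the server's stdin and read one response line."""
    proc.stdin.write(encode_message(request))
    proc.stdin.flush()
    line = proc.stdout.readline()
    if not line:
        raise RuntimeError("server closed stdout before responding")
    return json.loads(line)


def initialize(command: list) -> Any:
    """Start the server process, send `initialize`, and return the response."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        return exchange(proc, make_request(1, "initialize"))
    finally:
        proc.terminate()
        proc.wait()
```

A call such as `initialize(["./run.sh", "-d", "/path/to/working/directory"])` would use the placeholder paths from the Quick Start section.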

tools/list

Request:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}

Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).
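
As a rough illustration of consuming that response, the sketch below maps each tool name to its required arguments. It assumes each definition stores its JSON Schema under the key `inputSchema`, as in the MCP specification; the README does not name the field. The sample response is hand-written for the example and is not captured server output.

```python
def required_arguments(response: dict) -> dict:
    """Map each tool name to the list of arguments its schema marks as required."""
    tools = response.get("result", {}).get("tools", [])
    summary = {}
    for tool in tools:
        schema = tool.get("inputSchema", {})  # assumed field name
        summary[tool["name"]] = list(schema.get("required", []))
    return summary


# Hand-written sample, shaped like a tools/list response.
sample_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "tools": [
            {
                "name": "fetch_webpage",
                "description": "Fetch a web page",
                "inputSchema": {
                    "type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"],
                },
            }
        ]
    },
}
```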

tools/call

Generic request shape (replace <tool_name> and arguments as needed):

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}

The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
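
The request shape above and the content blocks in the result can be handled with two small helpers. This is a sketch under assumptions: the text blocks are taken to look like `{"type": "text", "text": "..."}`, as in the MCP specification, and failures are taken to arrive as a standard JSON-RPC `error` object. The names `make_tool_call` and `extract_text` are hypothetical.

```python
def make_tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build a tools/call request for the given tool and arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def extract_text(response: dict) -> str:
    """Join the text of all text-type content blocks in a tool result."""
    if "error" in response:
        error = response["error"]
        raise RuntimeError(f"JSON-RPC error {error.get('code')}: {error.get('message')}")
    blocks = response.get("result", {}).get("content", [])
    return "\n".join(block.get("text", "") for block in blocks if block.get("type") == "text")
```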


🛠️ Available Tools

fetch_webpage

  • Securely fetches a web page and returns clean content in the requested format.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| url | string | ✅ (no default) | URL to fetch (http/https only). |
| format | string | ❌ (markdown) | Output format – one of markdown, text, or html. |
| include_links | boolean | ❌ (true) | Whether to append an extracted links list. |
| include_images | boolean | ❌ (false) | Whether to list image URLs in the output. |
| remove_banners | boolean | ❌ (true) | Attempt to strip cookie banners & pop‑ups. |

Example

{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}

Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
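
To give a feel for what stripping scripts, iframes and event handlers involves, here is a standard-library sketch. It is not the server's sanitiser, whose implementation is not shown in this README; the class name `Sanitizer` and the rule of dropping every attribute that starts with `on` are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

BLOCKED_TAGS = {"script", "iframe"}


class Sanitizer(HTMLParser):
    """Re-emit HTML while dropping blocked elements and on* attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a blocked element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        rendered = ""
        for name, value in attrs:
            if name.lower().startswith("on"):  # inline event handler
                continue
            rendered += f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def sanitize_html(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A production sanitiser would normally rely on an allow-list of tags and attributes rather than a block-list; the block-list here only mirrors the elements this README names.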


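extract_links

  • Extracts and categorises all hyperlinks from a page.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| url | string | ✅ (no default) | URL of the page to analyse. |
| filter | string | ❌ (all) | Which links to return – one of all, internal, external, or resources. |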

Example

{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}

Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with resources.
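
The classification described in this note can be sketched with `urllib.parse`. The extension list, the exact-hostname comparison for "same domain", and the function names are assumptions; the server's own rules may differ (for example in how it treats subdomains).

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumed list; the README only mentions "images, PDFs…".
RESOURCE_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".zip"}


def classify_link(page_url: str, href: str) -> dict:
    """Resolve href against the page URL and label it."""
    absolute = urljoin(page_url, href)
    target = urlparse(absolute)
    same_host = target.hostname == urlparse(page_url).hostname
    suffix = PurePosixPath(target.path).suffix.lower()
    return {
        "url": absolute,
        "scope": "internal" if same_host else "external",
        "is_resource": suffix in RESOURCE_EXTENSIONS,
    }


def filter_links(page_url: str, hrefs: list, mode: str = "all") -> list:
    """Return absolute URLs matching one of the documented filter values."""
    links = [classify_link(page_url, href) for href in hrefs]
    if mode == "all":
        selected = links
    elif mode == "resources":
        selected = [link for link in links if link["is_resource"]]
    elif mode in ("internal", "external"):
        selected = [link for link in links if link["scope"] == mode]
    else:
        raise ValueError(f"unknown filter: {mode}")
    return [link["url"] for link in selected]
```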


download_file

  • Safely downloads a remote file into the sandboxed cache directory.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| url | string | ✅ (no default) | Direct URL to the file. |
| filename | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |

Example

{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}

Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
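
Filename sanitisation and confinement to the cache directory can be illustrated with `pathlib`. This is a sketch, not the server's logic: the character allow-list, the fallback name `download`, and the reading of 100 MB as 100 × 1024 × 1024 bytes are all assumptions.

```python
import re
from pathlib import Path

MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024  # assumed interpretation of "100 MB"


def sanitize_filename(name: str) -> str:
    """Keep only the final path component and replace unusual characters."""
    base = name.replace("\\", "/").split("/")[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "download"


def cache_path(working_dir: str, cache_name: str, filename: str) -> Path:
    """Return an absolute path for filename that is guaranteed to sit in the cache."""
    cache = (Path(working_dir) / cache_name).resolve()
    candidate = (cache / sanitize_filename(filename)).resolve()
    if cache not in candidate.parents:
        raise ValueError(f"path escapes the cache directory: {candidate}")
    return candidate


def relative_to_working_dir(working_dir: str, path: Path) -> str:
    """Express path relative to the working directory, using forward slashes."""
    return path.relative_to(Path(working_dir).resolve()).as_posix()


def within_limit(size_in_bytes: int) -> bool:
    """Check a size against the assumed download limit."""
    return size_in_bytes <= MAX_DOWNLOAD_BYTES
```

Under these assumptions, a requested filename of `../../secret.txt` in a working directory with the default cache name ends up at the relative path `cache/secret.txt`.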


get_page_metadata

  • Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| url | string | ✅ (no default) | URL of the page to inspect. |

Example

{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}

Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
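
The kind of extraction this tool performs can be approximated with the standard-library HTML parser. The sketch below is an assumption-laden illustration rather than the server's code: it returns a flat dictionary keyed by the `name` or `property` attribute of each `<meta>` tag, whereas the real tool returns formatted text.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and name/property-keyed <meta> contents."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content is not None:
                # Keep the first occurrence if a key is repeated.
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    result = dict(parser.meta)
    title = "".join(parser.title_parts).strip()
    if title:
        result["title"] = title
    return result
```

Open Graph entries appear under keys such as `og:title`, and Twitter Card entries under keys such as `twitter:card`.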


check_url

  • Performs a lightweight HEAD request to report status code, headers and size without downloading the body.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| url | string | ✅ (no default) | URL to probe. |

Example

{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}

Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
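
A comparable probe can be written with `urllib.request`. This sketch only approximates the output described above: the function names, the rule that any status below 400 counts as OK, and the exact text layout are assumptions, not the server's behaviour.

```python
import urllib.error
import urllib.request


def summarize(status: int, final_url: str, headers) -> str:
    """Format a short status report from a status code and header mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    marker = "✅ OK" if 200 <= status < 400 else "⚠️ Error"  # assumed threshold
    lines = [f"{marker} ({status})", f"Final URL: {final_url}"]
    for name in ("Content-Type", "Content-Length"):
        if name.lower() in lowered:
            lines.append(f"{name}: {lowered[name.lower()]}")
    return "\n".join(lines)


def check_url(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request and summarise the response without reading a body."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return summarize(response.status, response.geturl(), response.headers)
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions but still carry headers.
        return summarize(error.code, error.geturl(), error.headers)
```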


🔐 Security Features

  • Path‑traversal protection – all file operations are confined to the sandboxed working directory.
  • Prompt‑injection detection in URLs, fetched HTML and generated content.
  • Blocked domains & extensions (localhost, private IP ranges, executable/script files).
  • Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.
  • HTML sanitisation – removes <script>, <iframe>, event handlers and other risky elements before processing.
  • Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
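
The blocked-domain and blocked-extension checks listed above can be sketched as a single URL validator. The extension list and the function name `is_url_allowed` are assumptions, and the check inspects only the literal host in the URL: it does not resolve DNS, so a public hostname that points at a private address would pass this sketch.

```python
import ipaddress
from pathlib import PurePosixPath
from urllib.parse import urlparse

BLOCKED_HOSTS = {"localhost"}
BLOCKED_EXTENSIONS = {".exe", ".bat", ".sh", ".ps1"}  # assumed list


def is_url_allowed(url: str) -> bool:
    """Reject non-HTTP schemes, local or private hosts, and blocked file types."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTS:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # host is a name, not a literal IP address
    if address is not None and (address.is_private or address.is_loopback or address.is_link_local):
        return False
    if PurePosixPath(parsed.path).suffix.lower() in BLOCKED_EXTENSIONS:
        return False
    return True
```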

© 2025 Undici77 – All rights reserved.