MCP Data Fetch Server
November 15, 2025
MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.
Table of Contents
- Features
- Installation & Quick Start
- Command-Line Options
- Integration with LM Studio
- MCP API Overview
- Available Tools
- Security Features
🎯 Features
- Secure web page fetching: strips scripts, iframes and cookie banners; no JavaScript execution.
- Rich data extraction: retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- Safe file downloads: size limits, filename sanitisation, and path-traversal protection within a sandboxed cache.
- Built-in caching: optional cache directory reduces repeated network calls.
- Prompt-injection detection: validates URLs and fetched content for malicious instructions.
📦 Installation & Quick Start
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer
# Make the startup script executable
chmod +x run.sh
# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
Three-step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON-RPC requests.
⚙️ Command-Line Options
| Option | Description |
|---|---|
| -d, --working-dir | Path to the sandboxed working directory where all file operations are confined (default: ~/.mcp_datafetch). |
| -c, --cache-dir | Name of the cache subdirectory relative to the working directory (default: cache). |
| -h, --help | Show help message and exit. |
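The option table above maps naturally onto an argument parser. The following is a minimal sketch of how such options could be parsed in Python; `build_parser` and `resolve_dirs` are hypothetical names, and the server's actual parsing code may differ.

```python
from __future__ import annotations

import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented options (not the server's code).
    parser = argparse.ArgumentParser(description="MCP Data Fetch Server")
    parser.add_argument("-d", "--working-dir", default="~/.mcp_datafetch",
                        help="sandboxed working directory")
    parser.add_argument("-c", "--cache-dir", default="cache",
                        help="cache subdirectory, relative to the working directory")
    return parser  # argparse adds -h/--help automatically


def resolve_dirs(argv: list[str]) -> tuple[Path, Path]:
    """Return (working_dir, cache_dir) as absolute paths."""
    args = build_parser().parse_args(argv)
    working = Path(args.working_dir).expanduser().resolve()
    return working, working / args.cache_dir
```

Calling `resolve_dirs(["-d", "/path/to/working/directory"])` would yield the working directory and a `cache` folder beneath it.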
🤖 Integration with LM Studio (or any MCP-compatible client)
Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.
{
"mcpServers": {
"datafetch": {
"command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
"args": [
"-d",
"/absolute/path/to/working/directory"
],
"env": { "WORKING_DIR": "." }
}
}
}
Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.
📡 MCP API Overview
All communication follows JSON-RPC 2.0 over stdin/stdout.
initialize
Request:
{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {}
}
The response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
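For readers writing their own client, the request shape above can be built and framed in a few lines of Python. This sketch assumes one JSON object per line on stdin/stdout (the usual framing for MCP over stdio); the helper names are hypothetical and not part of the server.

```python
import json
from itertools import count

_ids = count(1)  # simple source of unique request ids


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request object with a fresh integer id."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params or {}}


def encode_line(message):
    # Assumption: messages are newline-delimited, so the JSON itself
    # must not contain raw newlines (json.dumps guarantees that here).
    return json.dumps(message, separators=(",", ":")) + "\n"


def decode_line(line):
    """Parse one line from the server and check the protocol marker."""
    message = json.loads(line)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    return message
```

A client would write `encode_line(make_request("initialize"))` to the server's stdin and pass each line read from its stdout to `decode_line`.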
tools/list
Request:
{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}
Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).
tools/call
Generic request shape (replace <tool_name> and arguments as needed):
{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "<tool_name>",
"arguments": { … }
}
}
The server validates the request against the tool's schema, executes the operation, and returns a ToolResult containing one or more content blocks.
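On the client side, a tools/call round trip comes down to building the request and pulling the text out of the returned content blocks. The sketch below assumes the ToolResult follows the common MCP shape `{"content": [{"type": "text", "text": ...}]}`; that shape and the helper names are assumptions, not taken from the server's code.

```python
def tool_call(name, arguments, request_id):
    """Build a tools/call request for the given tool name and arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def result_text(response):
    """Join the text content blocks of a response, or raise on a JSON-RPC error."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    blocks = response.get("result", {}).get("content", [])
    # Assumed block shape: {"type": "text", "text": "..."}; other types are skipped.
    return "\n".join(b.get("text", "") for b in blocks if b.get("type") == "text")
```

Non-text blocks are ignored here; a fuller client would handle them according to their type.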
🛠️ Available Tools
fetch_webpage
- Securely fetches a web page and returns clean content in the requested format.
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes (no default) | URL to fetch (http/https only). |
| format | string | No (default: markdown) | Output format: one of markdown, text, or html. |
| include_links | boolean | No (default: true) | Whether to append an extracted links list. |
| include_images | boolean | No (default: false) | Whether to list image URLs in the output. |
| remove_banners | boolean | No (default: true) | Attempt to strip cookie banners & pop-ups. |
Example
{
"jsonrpc": "2.0",
"id": 10,
"method": "tools/call",
"params": {
"name": "fetch_webpage",
"arguments": {
"url": "https://example.com/article",
"format": "markdown",
"include_links": true,
"include_images": false,
"remove_banners": true
}
}
}
Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt-injection patterns before returning content.
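To illustrate what removing scripts and iframes means in practice, here is a minimal standard-library sketch that keeps only visible text. It is an illustration of the idea, not the server's sanitiser; `DROPPED_TAGS` and `visible_text` are hypothetical names, and prompt-injection checks are out of scope here.

```python
from html.parser import HTMLParser

# Illustrative set of elements whose content is discarded entirely.
DROPPED_TAGS = {"script", "iframe", "style", "noscript"}


class TextOnly(HTMLParser):
    """Collect text that is not nested inside a dropped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Because the parser never evaluates anything, script bodies are treated as inert text and dropped.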
extract_links
- Extracts and categorises all hyperlinks from a page.
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes (no default) | URL of the page to analyse. |
| filter | string | No (default: all) | Which links to return: all, internal, external, or resources. |
Example
{
"jsonrpc": "2.0",
"id": 11,
"method": "tools/call",
"params": {
"name": "extract_links",
"arguments": {
"url": "https://example.com/blog",
"filter": "internal"
}
}
}
Note: Links are classified as internal (same domain) or external; resource links (images, PDFs, …) can be filtered with resources.
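The classification described in the note can be sketched with `urllib.parse`. The extension list and the decision to test for resources before comparing hosts are assumptions made for illustration; the server may categorise links differently.

```python
from urllib.parse import urljoin, urlparse

# Illustrative list only; the server's notion of a "resource" is not documented here.
RESOURCE_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".svg")


def classify_link(page_url, href):
    """Return 'internal', 'external', 'resources', or 'other' for one href."""
    absolute = urljoin(page_url, href)  # resolve relative links against the page
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return "other"  # e.g. mailto: or javascript: links
    if parsed.path.lower().endswith(RESOURCE_EXTENSIONS):
        return "resources"
    same_host = parsed.hostname == urlparse(page_url).hostname
    return "internal" if same_host else "external"
```

With a page at `https://example.com/blog`, the href `/about` counts as internal and `https://other.org/x` as external.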
download_file
- Safely downloads a remote file into the sandboxed cache directory.
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes (no default) | Direct URL to the file. |
| filename | string | No (auto-generated) | Desired filename; it is sanitised and forced into the cache directory. |
Example
{
"jsonrpc": "2.0",
"id": 12,
"method": "tools/call",
"params": {
"name": "download_file",
"arguments": {
"url": "https://example.com/files/report.pdf",
"filename": "report_latest.pdf"
}
}
}
Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross-agent access.
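Filename sanitisation and confinement to the cache directory can be sketched as follows. The character whitelist, the fallback name, and both function names are assumptions for illustration; the server's actual rules are not shown in this document. `Path.is_relative_to` needs Python 3.9 or later.

```python
import re
from pathlib import Path


def sanitise_filename(name, fallback="download.bin"):
    """Reduce a user-supplied name to a safe single path component."""
    # Keep only the last component, whichever separator style was used.
    base = name.replace("\\", "/").split("/")[-1]
    # Whitelist conservative characters and drop leading dots (hidden files, "..").
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or fallback


def confined_path(cache_dir, name):
    """Return the absolute target path, refusing anything outside cache_dir."""
    root = Path(cache_dir).resolve()
    target = (root / sanitise_filename(name)).resolve()
    # Defence in depth: resolve() follows symlinks, so re-check containment.
    if not target.is_relative_to(root):
        raise ValueError("path escapes the cache directory")
    return target
```

For example, `sanitise_filename("../../etc/passwd")` yields `passwd`, so the file still lands inside the cache directory.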
get_page_metadata
- Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes (no default) | URL of the page to inspect. |
Example
{
"jsonrpc": "2.0",
"id": 13,
"method": "tools/call",
"params": {
"name": "get_page_metadata",
"arguments": { "url": "https://example.com/product/42" }
}
}
Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
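As a rough illustration of where these fields come from, the sketch below collects the `<title>` text and every `<meta>` tag keyed by its `property` or `name` attribute, which covers `og:*` and `twitter:*` entries. It is a standard-library sketch with hypothetical names, not the server's extractor.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Gather <title> text and <meta name|property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_metadata(html):
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return {"title": collector.title.strip(), **collector.meta}
```

The returned dictionary can then be formatted into a text block, grouping keys by their `og:` or `twitter:` prefix.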
check_url
- Performs a lightweight HEAD request to report status code, headers and size without downloading the body.
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes (no default) | URL to probe. |
Example
{
"jsonrpc": "2.0",
"id": 14,
"method": "tools/call",
"params": {
"name": "check_url",
"arguments": { "url": "https://example.com/resource.zip" }
}
}
Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content-Type and Content-Length.
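A probe of this kind can be approximated with `urllib.request`. In this sketch, `head` issues the request and `summarise` formats the outcome; both names and the summary format are hypothetical, and how redirects are re-issued depends on the Python version, so treat it as an outline, not as the server's behaviour.

```python
import urllib.error
import urllib.request


def head(url, timeout=10.0):
    """Send a HEAD request and return (status, final_url, headers)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # urlopen follows redirects; geturl() reports where it ended up.
            return response.status, response.geturl(), dict(response.headers.items())
    except urllib.error.HTTPError as err:
        headers = dict(err.headers.items()) if err.headers is not None else {}
        return err.code, url, headers


def summarise(status, final_url, headers):
    """Format a one-line status summary (hypothetical format)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    label = "OK" if 200 <= status < 400 else "Error"
    ctype = lowered.get("content-type", "unknown")
    size = lowered.get("content-length", "unknown")
    return f"{label} {status} {final_url} type={ctype} size={size}"
```

Header names are lower-cased before lookup because HTTP header names are case-insensitive.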
Security Features
- Path-traversal protection: all file operations are confined to the sandboxed working directory.
- Prompt-injection detection in URLs, fetched HTML and generated content.
- Blocked domains & extensions (localhost, private IP ranges, executable/script files).
- Content-size limits: max 50 MB for page fetches, max 100 MB for file downloads.
- HTML sanitisation: removes <script>, <iframe>, event handlers and other risky elements before processing.
- Cookie/banner handling: optional removal of consent banners and pop-ups during fetch.
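The blocked-domain and blocked-extension rule in the list above can be illustrated with the `ipaddress` module. The extension list and the function name are assumptions; a production check would also need to vet the addresses a hostname resolves to, which this sketch does not do.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative list only; the server's blocked extensions are not listed in this document.
BLOCKED_EXTENSIONS = (".exe", ".bat", ".sh", ".ps1")


def is_allowed_url(url):
    """Reject non-HTTP schemes, local or private hosts, and blocked file types."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # a DNS name, not a literal IP address
    if address is not None and (
        address.is_private or address.is_loopback or address.is_link_local
    ):
        return False
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)
```

Only literal IP addresses are checked against private ranges here, so a DNS name pointing at an internal host would still pass.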
© 2025 Undici77. All rights reserved.