NewsCrawler
March 4, 2026
Multi-Platform News & Content Crawler Suite
An open-source crawler toolkit for developers & researchers with CLI invocation, Web UI, unified JSON output, MCP support, and Claude Code Skills
Supports 12 mainstream platforms: WeChat, Toutiao, NetEase, Sohu, Tencent, Lenny's Newsletter, Naver, Detik, Quora, BBC, CNN, Twitter/X

Ready-to-use Web UI - Auto-detect platform, real-time progress, JSON/Markdown export
Why NewsCrawler?
| Multi-Platform | Dual Modes | Standardized | Fast Setup | Skills Support |
|---|---|---|---|---|
| 12 platforms (CN/EN/KR/ID) | Python API + Web UI | Unified JSON, easy integration | uv-managed, fast install | Portable Claude Code Skills |
Key Features:
- Multi-Platform Support - WeChat, Toutiao, NetEase, Sohu, Tencent, Lenny's Newsletter, Naver Blog, Detik News, Quora, BBC News, CNN News, Twitter/X
- Smart Extraction - Auto-detect platform type; extract title, content, images, videos
- Unified Output - Standardized JSON format suited to data analysis, storage, and downstream processing
- Flexible Usage - Python API (for automation) + Web UI (visual, no-code) + MCP Server (AI Agents) + Claude Code Skills
- One-Click Deployment - Docker Compose orchestrates all services (Backend + Frontend + MCP)
- AI Agent Integration - MCP (Model Context Protocol) support for Claude Desktop and AI tools
- Modular Design - Decoupled crawlers, easy to extend or optimize
- Lightweight & Efficient - uv-managed dependencies, fast installation, stable runtime
Quick Start
Method 1: Docker Compose (Recommended - One-Click Deployment)
# 1. Install Docker & Docker Compose
# Visit: https://docs.docker.com/get-docker/
# 2. Clone repository
git clone https://github.com/NanmiCoder/NewsCrawler.git
cd NewsCrawler
# 3. One-click start all services (Backend + Frontend + MCP)
docker compose up -d
# 4. Access services
# - Frontend UI: http://localhost:3000
# - Backend API: http://localhost:8000/docs
# - MCP Server: http://localhost:8765/mcp
What's included:
- Backend Service (FastAPI) - News extraction API
- Frontend Service (Vue 3 + Nginx) - Web UI interface
- MCP Service - AI Agent tools for Claude Desktop
- Auto Health Checks - Ensures all services are running
- Data Persistence - Extracted news saved in ./data/
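Once the containers are up, a quick reachability check can confirm that the frontend and backend are answering. The following is a minimal sketch using only the Python standard library; `is_up` and `check_services` are hypothetical helper names, not part of NewsCrawler, and the URLs are the ones listed in step 4. The MCP endpoint is left out because it expects MCP client requests rather than a plain GET.

```python
from urllib.error import URLError
from urllib.request import urlopen

# URLs taken from step 4 of the Docker quick start above.
SERVICES = {
    "frontend": "http://localhost:3000",
    "backend": "http://localhost:8000/docs",
}


def is_up(url, timeout=3.0, opener=urlopen):
    """Return True if a GET on `url` answers with a 2xx or 3xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_services(services=SERVICES, opener=urlopen):
    """Map each service name to True/False depending on reachability."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}
```

Calling `check_services()` after `docker compose up -d` returns a dict mapping each service name to a boolean; the `opener` parameter exists only so the check can be exercised without a running stack.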
Docker Management:
# View logs
docker compose logs -f
# Stop services
docker compose down
# Rebuild after code update
docker compose up -d --build
Full Documentation: DOCKER_DEPLOYMENT.md
Method 2: Web UI (Manual Setup)
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh # macOS/Linux
# or: pip install uv
# 2. Clone repository
git clone https://github.com/NanmiCoder/NewsCrawler.git
cd NewsCrawler
# 3. Install all dependencies (uv workspace mode)
uv sync
# 4. Start backend (from project root)
uv run news-extractor-backend --host 0.0.0.0 --port 8000
# 5. Start frontend (new terminal)
cd news-extractor-ui/frontend
npm install && npm run dev
# 6. Visit http://localhost:3000
Web UI Features:
- Paste URL, auto-detect platform type
- Real-time extraction progress
- JSON / Markdown dual-format export
- Content preview & one-click download
Detailed Deployment Guide: MANUAL_DEPLOYMENT.md
Method 3: Python API (For Automation)
from news_crawler.wechat_news import WeChatNewsCrawler
from news_crawler.toutiao_news import ToutiaoNewsCrawler
# WeChat Official Account
wechat_url = "https://mp.weixin.qq.com/s/xxxxxx"
crawler = WeChatNewsCrawler(wechat_url)
result = crawler.run() # Auto-save to data/ directory
# Toutiao
toutiao_url = "https://www.toutiao.com/article/xxxxxx"
crawler = ToutiaoNewsCrawler(toutiao_url)
result = crawler.run()
print(result) # Returns JSON format data
Run Examples:
uv run call_example.py # View complete examples
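For automation over several URLs, the pattern above extends naturally to a loop. The sketch below assumes only what the example shows, namely that a crawler is constructed with a URL and exposes `run()`; the `run_all` helper and its return shape are hypothetical and not part of the NewsCrawler API.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_all(jobs: List[Tuple[Callable[[str], Any], str]]) -> Dict[str, Dict[str, Any]]:
    """Run each (crawler_class, url) pair and collect results or errors by URL."""
    outcomes: Dict[str, Dict[str, Any]] = {}
    for crawler_class, url in jobs:
        try:
            result = crawler_class(url).run()
            outcomes[url] = {"ok": True, "result": result}
        except Exception as exc:  # one failing page should not stop the batch
            outcomes[url] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return outcomes


# Usage with the classes imported in the example above:
# outcomes = run_all([
#     (WeChatNewsCrawler, "https://mp.weixin.qq.com/s/xxxxxx"),
#     (ToutiaoNewsCrawler, "https://www.toutiao.com/article/xxxxxx"),
# ])
```

Keying the outcomes by URL makes it easy to retry only the entries where `ok` is false.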
Method 4: MCP Server (AI Agent Integration)
What is MCP? Model Context Protocol (MCP) is a standard for connecting AI assistants (like Claude Desktop) to external tools and data sources.
Use Cases:
- Let Claude extract news directly through conversation
- Batch process multiple URLs via AI commands
- AI-powered content analysis workflows
- Build custom AI agents with news extraction capabilities
Quick Setup:
# 1. Start MCP Server (Recommended: Docker)
docker compose up -d mcp
# 2. Or start manually (from project root)
# First install dependencies
uv sync
# Start MCP server
uv run news-extractor-mcp --host 0.0.0.0 --port 8765
# 3. MCP Server running at: http://localhost:8765/mcp
AI Tool Configuration (Streamable HTTP):
Cursor
Config file: ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-level)
{
  "mcpServers": {
    "newscrawler": {
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}
Windsurf
Config file: ~/.codeium/windsurf/mcp_server_config.json
{
  "mcpServers": {
    "newscrawler": {
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}
Trae
Settings → Tools → MCP Servers → Add Server
{
  "name": "newscrawler",
  "url": "http://127.0.0.1:8765/mcp"
}
Claude Desktop
Config file location:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "newscrawler": {
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}
Other MCP-Compatible Tools
All MCP clients supporting Streamable HTTP transport can use:
{
  "mcpServers": {
    "newscrawler": {
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}
Note: If using Docker and your AI tool runs outside Docker, replace 127.0.0.1 with host IP or host.docker.internal
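Most of the entries above share one shape, so the config can be generated rather than hand-edited, which helps when swapping the host as the note describes. This is a small sketch, not a NewsCrawler utility: `build_mcp_config` and `render_mcp_config` are hypothetical helpers, and the default port and `/mcp` path are taken from the Quick Setup above.

```python
import json


def build_mcp_config(host="127.0.0.1", port=8765, name="newscrawler"):
    """Return the mcpServers config dict for a Streamable HTTP endpoint."""
    return {"mcpServers": {name: {"url": f"http://{host}:{port}/mcp"}}}


def render_mcp_config(host="127.0.0.1", port=8765):
    """Serialize the config as indented JSON, ready to paste into a client config file."""
    return json.dumps(build_mcp_config(host, port), indent=2)


# Example: point the client at the Docker host instead of 127.0.0.1.
# print(render_mcp_config("host.docker.internal"))
```

Trae uses a flat `name`/`url` object instead, so this helper covers only the `mcpServers` layout.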
Available MCP Tools:
- extract_news - Extract a single news article (JSON or Markdown)
- batch_extract_news - Extract multiple URLs in batch
- detect_news_platform - Identify platform type from URL
- list_supported_platforms - Show all supported platforms
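MCP clients invoke tools like these with JSON-RPC 2.0 `tools/call` requests. The sketch below only builds such a request body to show its shape; it does not perform the session initialization a real client handles, `tool_call_request` is a hypothetical helper, and the `url` argument name is an assumption that should be checked against the tool's input schema.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request body as defined by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# ASSUMPTION: the argument name "url" is a guess; check the tool's input schema
# (returned by the MCP `tools/list` method) or news_extractor_mcp/README.md.
request = tool_call_request("extract_news", {"url": "https://mp.weixin.qq.com/s/xxxxx"})
print(json.dumps(request, indent=2))
```

In practice an MCP-aware client such as the ones configured above builds and sends these requests itself; the sketch is only meant to make the protocol less opaque.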
Example Conversation with Claude:
You: "Extract this WeChat article: https://mp.weixin.qq.com/s/xxxxx"
Claude: [Uses extract_news tool] "I've extracted the article..."
You: "Extract these 3 URLs in Markdown format: [url1, url2, url3]"
Claude: [Uses batch_extract_news] "Here's the combined Markdown..."
Full MCP Documentation: news_extractor_mcp/README.md
Method 5: Claude Code Skills (AI Coding Assistant Integration)
What are Claude Code Skills?
Claude Code is Anthropic's AI coding assistant. Skills are portable, self-contained modules that can be copied into any project, giving Claude Code the ability to extract news content automatically.
How is this different from MCP?
- MCP Server - Requires running a standalone service, ideal for long-running AI workflows
- Claude Code Skills - Copy to your project and use immediately, no service needed, ideal for developers who want quick news extraction during coding
Use Cases:
- Give Claude Code news extraction capabilities in any project
- Self-contained, no external services required: just copy and use
- Quickly extract news content during development for testing or analysis
Installation:
Copy the .claude/skills/news-extractor/ directory from this project to your target project and install dependencies:
# 1. Copy skill to your project
cp -r NewsCrawler/.claude/skills/news-extractor <your-project>/.claude/skills/news-extractor
# 2. Install dependencies
cd <your-project>/.claude/skills/news-extractor
uv sync
# 3. Use directly in Claude Code
# Claude Code will automatically read SKILL.md and gain news extraction capabilities
Supports 12 platforms: WeChat, Toutiao, NetEase, Sohu, Tencent, BBC News, CNN News, Twitter/X, Lenny's Newsletter, Naver Blog, Detik News, Quora
Full Installation Guide: INSTALL_SKILL.md
Supported Platforms
News / Content Platforms
| Platform | URL Example | Language | Features |
|---|---|---|---|
| WeChat Official Accounts | mp.weixin.qq.com | Chinese | Articles & videos |
| Toutiao | toutiao.com | Chinese | Rich media, videos |
| NetEase News | 163.com | Chinese | Image galleries |
| Sohu News | sohu.com | Chinese | Multimedia content |
| Tencent News | news.qq.com | Chinese | Video news |
| Lenny's Newsletter | lennysnewsletter.com | English | Long-form content |
| Naver Blog | blog.naver.com | Korean | Blog platform |
| Detik News | detik.com | Indonesian | Southeast Asia news |
| Quora | quora.com | English | Q&A content |
| Twitter/X | x.com, twitter.com | Multi-lang | Tweet extraction |
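Platform auto-detection comes down to mapping a URL's host to a platform. The sketch below illustrates one way to do that with the domains from the table; it is an assumption about the approach, not the project's actual detection code, and `detect_platform` and the platform keys are hypothetical names.

```python
from typing import Optional
from urllib.parse import urlparse

# Domain -> platform key, taken from the "URL Example" column above.
# The keys are illustrative labels, not identifiers used by NewsCrawler.
PLATFORM_DOMAINS = {
    "mp.weixin.qq.com": "wechat",
    "toutiao.com": "toutiao",
    "163.com": "netease",
    "sohu.com": "sohu",
    "news.qq.com": "tencent",
    "lennysnewsletter.com": "lenny",
    "blog.naver.com": "naver",
    "detik.com": "detik",
    "quora.com": "quora",
    "x.com": "twitter",
    "twitter.com": "twitter",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform key whose domain matches the URL host, else None."""
    host = (urlparse(url).hostname or "").lower()
    # Try longer domains first so the most specific entry wins.
    for domain in sorted(PLATFORM_DOMAINS, key=len, reverse=True):
        if host == domain or host.endswith("." + domain):
            return PLATFORM_DOMAINS[domain]
    return None
```

Matching on the exact host or a dot-prefixed suffix avoids false positives such as treating `netflix.com` as `x.com`.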
Stock Video Platforms
Pexels · Pixabay · Coverr · Mixkit - High-quality free video downloads
Use Cases
- Multi-source news aggregation / Public opinion monitoring
- Media content analysis, data mining, recommendation systems
- Academic research / Data science - Cross-platform extraction
- Educational projects / Personal learning - Crawler framework
- AI training data collection / Content quality analysis
Data Output Format
All crawlers output a unified JSON format, saved in the data/ directory:
{
  "title": "Article Title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author Name",
    "author_url": "Author Homepage",
    "publish_time": "2024-10-15 10:30:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://example.com/image.jpg", "desc": "Image desc"},
    {"type": "video", "content": "https://example.com/video.mp4", "desc": "Video desc"}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": ["Video URL 1"]
}
Field Descriptions:
- contents - Structured content preserving order and type (text/image/video)
- texts / images / videos - Flattened lists for quick access to specific content types
- meta_info - Article metadata (author, publish time, etc.)
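Because `contents` keeps blocks in reading order, converting a result to Markdown is a single pass over that list. Here is a minimal sketch under that assumption; `to_markdown` is a hypothetical helper and may differ from how the project's own Markdown export works.

```python
from typing import Any, Dict


def to_markdown(article: Dict[str, Any]) -> str:
    """Render an article dict in the unified format as a Markdown string."""
    meta = article.get("meta_info", {})
    lines = [f"# {article.get('title', '')}", ""]
    byline = " | ".join(
        part for part in (meta.get("author_name"), meta.get("publish_time")) if part
    )
    if byline:
        lines += [byline, ""]
    for block in article.get("contents", []):
        kind, content, desc = block["type"], block["content"], block.get("desc", "")
        if kind == "text":
            lines.append(content)
        elif kind == "image":
            lines.append(f"![{desc}]({content})")
        elif kind == "video":
            lines.append(f"[{desc or 'video'}]({content})")
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip() + "\n"
```

A saved result can be loaded with `json.load` and passed straight to `to_markdown`; the flattened `texts`, `images`, and `videos` lists are not needed for this because `contents` already carries the same items in order.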
Technology Stack
Backend
Python 3.8+ · FastAPI · Pydantic · curl_cffi · parsel · tenacity
Frontend
Vue 3 · TypeScript · Vite · Axios
Dev Tools
uv (package manager) · Playwright (browser automation, optional)
Project Structure
NewsCrawler/
├── news_crawler/               # Core crawler modules
│   ├── wechat_news/            # WeChat
│   ├── toutiao_news/           # Toutiao
│   ├── netease_news/           # NetEase
│   ├── sohu_news/              # Sohu
│   ├── tencent_news/           # Tencent
│   └── ...                     # Other platforms
│
├── news_extractor_core/        # Shared core library (uv workspace member)
│   ├── adapters/               # Platform adapters
│   ├── services/               # Business logic
│   └── models/                 # Data models
│
├── news_extractor_backend/     # FastAPI backend service (uv workspace member)
│   ├── api/                    # API routes
│   ├── main.py                 # Application entry
│   └── cli.py                  # CLI entry point
│
├── news_extractor_mcp/         # MCP server (uv workspace member)
│   ├── server.py               # MCP implementation
│   └── README.md               # MCP documentation
│
├── news-extractor-ui/          # Web UI application
│   └── frontend/               # Vue 3 frontend
│
├── video_crawler/              # Video downloaders
├── libs/                       # Utility libraries
├── data/                       # Output directory
│
├── pyproject.toml              # uv workspace root config
├── uv.lock                     # Dependency lock file
├── Dockerfile                  # Multi-stage Docker build
├── docker-compose.yml          # Service orchestration
├── DOCKER_DEPLOYMENT.md        # Docker deployment guide
└── MANUAL_DEPLOYMENT.md        # Manual deployment guide
Important Notice
This project is for educational and research purposes only. Commercial use is prohibited.
Usage Guidelines:
- Personal learning, research, educational purposes only
- Comply with target websites' robots.txt and terms of service
- Control request frequency to avoid server stress
- Do not use for illegal purposes or infringe on others' rights
- No large-scale commercial crawling
Technical Notes:
- Some platforms may have anti-scraping mechanisms; adjust strategies accordingly
- Default headers may expire; use Playwright to auto-fetch fresh cookies
- Web page structure changes may cause parsing failures; feel free to submit issues
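The guideline on controlling request frequency can be honored by enforcing a minimum gap between requests in any crawl loop. A minimal standard-library sketch follows; `Throttle` is a hypothetical helper, not something NewsCrawler ships, and the one-second interval in the usage comment is an arbitrary example value.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage: call throttle.wait() before each request.
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     ...  # run the crawler for this URL
```

The `clock` and `sleep` parameters default to the real timers and exist only so the behavior can be checked without actually waiting.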
Contributing
Issues and Pull Requests are welcome!
Contribution Areas:
- Fix bugs
- Add new platform support
- Improve documentation
- Optimize UI/UX
- Performance optimization
Submission Process:
- Fork this repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is for learning and research purposes only. By using this project, you agree to:
- Not use it for commercial purposes
- Not perform large-scale crawling
- Comply with relevant laws and target websites' terms of service
This project assumes no responsibility for any legal liability arising from its use.
Star History
If this project helps you, please give us a ⭐ Star!
Made with ❤️ by NanmiCoder