ArtSearch

May 17, 2025

A local search system that uses Elasticsearch to index and retrieve Wikipedia data.

Features ✨

  • Multi-language support for Wikipedia data
  • Elasticsearch-powered search backend
  • Command-line interface for index management and queries
  • Configurable search parameters

Prerequisites 🛠️

  • Python 3.12
  • Java 17+ (only needed if you do not use the JDK bundled with Elasticsearch 8.x)
  • 30GB+ free disk space (for data storage)

Installation ⚙️

1. Download Elasticsearch Engine

# Create data directory
mkdir -p data && cd data

# Download and extract Elasticsearch
wget -O elasticsearch-8.17.3.tar.gz \
  https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.17.3-linux-x86_64.tar.gz
tar zxvf elasticsearch-8.17.3.tar.gz
rm elasticsearch-8.17.3.tar.gz
cd ..

2. Download Wikipedia Data

Download the Wikipedia dataset for a specific language:

# Default English dataset (November 2023)
modelscope download --dataset wikimedia/wikipedia \
  --include 20231101.en/* \
  --local_dir ./data/wikipedia

Example data structure:

{
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month..."
}
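
As a rough sketch of how records with this shape might be prepared for indexing, the snippet below turns them into the action/source line pairs that Elasticsearch's bulk API accepts. The index name `wiki_en` and the helper `to_bulk_lines` are hypothetical and chosen only for illustration; the actual logic in es_wiki_build.py may differ.

```python
import json


def to_bulk_lines(records, index_name):
    """Yield NDJSON lines (action line, then source line) for each record."""
    for record in records:
        # Action line: index the document under its Wikipedia id.
        yield json.dumps({"index": {"_index": index_name, "_id": record["id"]}})
        # Source line: the fields to store and search.
        yield json.dumps(
            {"url": record["url"], "title": record["title"], "text": record["text"]},
            ensure_ascii=False,  # keep non-English text readable
        )


sample = [{
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}]

# "wiki_en" is an assumed index name, not one taken from the project.
bulk_lines = list(to_bulk_lines(sample, "wiki_en"))
bulk_body = "\n".join(bulk_lines) + "\n"  # the bulk API expects a trailing newline
```

Each record produces two lines, so a file of N records yields a request body of 2N lines.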

Usage 🚀

Folder Structure

├── data                      # Data folder
│   ├── elasticsearch-8.17.3  # Elasticsearch engine
│   └── wikipedia             # Wikipedia data folder
│       ├── 20231101.en       # English data
│       ├── 20231101.zh       # Chinese data
│       ...                   # more language data
│
├── es_wiki_build.py          # Script for building the wiki index
├── es_wiki_test.py           # Unit test for Elasticsearch
├── README.md
├── requirements.txt
└── wiki_searcher.py          # Search client for wiki data

Building Index

# Build index for default language (en)
python es_wiki_build.py

# Build index for specific language (e.g., French)
python es_wiki_build.py --language fr
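
One plausible way the `--language` flag could map to a data folder is sketched below. Only the flag name, its default (`en`), and the `20231101.<lang>` folder naming come from this README; `parse_args` and `data_dir_for` are hypothetical helpers, and the real script may be organized differently.

```python
import argparse
from pathlib import Path

SNAPSHOT = "20231101"  # snapshot date used in the folder names above


def parse_args(argv=None):
    """Parse the command line; --language defaults to English."""
    parser = argparse.ArgumentParser(description="Build a Wikipedia index (sketch)")
    parser.add_argument("--language", default="en", help="Wikipedia language code")
    return parser.parse_args(argv)


def data_dir_for(language, root="data/wikipedia"):
    """Return the folder expected to hold the dump for one language."""
    return Path(root) / f"{SNAPSHOT}.{language}"


args = parse_args(["--language", "fr"])
french_dir = data_dir_for(args.language)
```

Under these assumptions, the matching dataset has to be downloaded into that folder before the index for a new language can be built.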

Performing Searches

# Run a search with the default settings
python es_wiki_test.py

# Direct query execution
python es_wiki_test.py \
  --language en \
  --query "Paris 2024 Olympic Games" 
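
For orientation, a query like the one above could be expressed as an Elasticsearch request body along the following lines. This is a minimal sketch, not the query that wiki_searcher.py actually sends: the `build_query` helper, the `multi_match` query type, the title boost, and the default result count are all assumptions; only the `title` and `text` field names come from the example record.

```python
def build_query(query, size=5):
    """Build a search body that matches the query against title and text."""
    return {
        "size": size,  # number of hits to return (assumed default)
        "query": {
            "multi_match": {
                "query": query,
                # "^2" weights title matches twice as heavily as body matches.
                "fields": ["title^2", "text"],
            }
        },
    }


body = build_query("Paris 2024 Olympic Games")
```

Boosting the title is a common choice for encyclopedia-style data, where a title match usually signals that the article is about the query rather than merely mentioning it.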

Configuration ⚙️

Set the Elasticsearch password through an environment variable:

export ELASTIC_PASSWORD="changeme"
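
A client script might pick this variable up roughly as follows. The helper `elastic_credentials` is hypothetical, and pairing the password with the built-in `elastic` user is an assumption about how the project authenticates.

```python
import os


def elastic_credentials(env=None):
    """Return (username, password) for Elasticsearch basic authentication."""
    env = os.environ if env is None else env
    password = env.get("ELASTIC_PASSWORD")
    if not password:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError("ELASTIC_PASSWORD is not set")
    # "elastic" is the built-in superuser; using it here is an assumption.
    return ("elastic", password)


# Passing a dict keeps the example independent of the real environment.
user, password = elastic_credentials({"ELASTIC_PASSWORD": "changeme"})
```

Reading the password from the environment keeps it out of the source tree and the shell history of anyone who runs the scripts with explicit arguments.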

Development 🧑‍💻

# Install dependencies
pip install -r requirements.txt

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.