Chapter 1: Getting Started with ClickHouse
March 2, 2026 · View on GitHub
Welcome to ClickHouse! If you've ever needed to analyze massive datasets with lightning-fast performance, you're in the right place. ClickHouse is designed specifically for analytical workloads, providing unparalleled speed for complex queries on large datasets.
What Makes ClickHouse Special?
ClickHouse revolutionizes analytical databases by:
- Column-Oriented Storage - Optimized for analytical queries and aggregations
- Vectorized Query Execution - Processes data in CPU vector registers for maximum speed
- Massive Parallel Processing - Utilizes all available CPU cores simultaneously
- Efficient Compression - Reduces storage costs while maintaining query performance
- Real-Time Ingestion - Handles continuous data streams with minimal latency
- Horizontal Scaling - Distributes queries across multiple nodes seamlessly
Installation Options
Option 1: Docker Installation (Recommended)
# Pull and run ClickHouse server
docker run -d --name clickhouse-server \
--ulimit nofile=262144:262144 \
-p 8123:8123 -p 9000:9000 \
clickhouse/clickhouse-server:latest
# Access ClickHouse CLI
docker exec -it clickhouse-server clickhouse-client
Option 2: Native Linux Installation
# Ubuntu/Debian
sudo apt-get install clickhouse-server clickhouse-client
# CentOS/RHEL/Fedora
sudo yum install clickhouse-server clickhouse-client
# Start the service
sudo systemctl start clickhouse-server
sudo systemctl enable clickhouse-server
# Connect to ClickHouse
clickhouse-client
Option 3: macOS Installation
# Using Homebrew
brew install clickhouse
# Start ClickHouse server
clickhouse server
# In another terminal, connect
clickhouse client
Option 4: Windows Installation
# Using Docker (recommended for Windows)
docker run -d --name clickhouse-server \
-p 8123:8123 -p 9000:9000 \
clickhouse/clickhouse-server:latest
# Access via Docker
docker exec -it clickhouse-server clickhouse-client
Your First ClickHouse Database
Let's create your first analytical database:
Step 1: Connect to ClickHouse
# Connect using HTTP interface (port 8123)
curl "http://localhost:8123/" \
--data-binary "SELECT 1"
# Or use the native client
clickhouse-client
# In the client, you should see:
# ClickHouse client version 23.8.1.1.
# Connecting to localhost:9000 as user default.
# Connected to ClickHouse server version 23.8.1.
# :)
Step 2: Create Your First Database
-- Create a new database
CREATE DATABASE analytics;
-- Switch to the database
USE analytics;
-- Verify database creation
SHOW DATABASES;
Step 3: Create Your First Table
-- Create a table for user events
CREATE TABLE user_events (
user_id UInt64,
event_type String,
event_time DateTime,
event_data String,
session_id String,
user_agent String,
ip_address String
) ENGINE = MergeTree()
ORDER BY (user_id, event_time)
PARTITION BY toYYYYMM(event_time);
Step 4: Insert Sample Data
-- Insert some sample events
INSERT INTO user_events VALUES
(1, 'page_view', '2023-12-01 10:00:00', '{"page": "/home", "duration": 30}', 'session_123', 'Chrome/91.0', '192.168.1.1'),
(1, 'click', '2023-12-01 10:00:15', '{"element": "button", "page": "/home"}', 'session_123', 'Chrome/91.0', '192.168.1.1'),
(2, 'page_view', '2023-12-01 10:05:00', '{"page": "/products", "duration": 45}', 'session_456', 'Firefox/89.0', '192.168.1.2'),
(3, 'purchase', '2023-12-01 10:10:00', '{"product_id": 123, "amount": 99.99}', 'session_789', 'Safari/14.0', '192.168.1.3');
Step 5: Run Your First Analytical Query
-- Count total events
SELECT count() FROM user_events;
-- Count events by type
SELECT event_type, count() as event_count
FROM user_events
GROUP BY event_type
ORDER BY event_count DESC;
-- Find top users by activity
SELECT user_id, count() as total_events
FROM user_events
GROUP BY user_id
ORDER BY total_events DESC
LIMIT 10;
-- Analyze events over time
SELECT
toStartOfHour(event_time) as hour,
event_type,
count() as event_count
FROM user_events
GROUP BY hour, event_type
ORDER BY hour, event_type;
Understanding ClickHouse Architecture
Core Components
ClickHouse System
├── Server (clickhouse-server)
│ ├── Query Processor
│ ├── Storage Engine
│ └── Network Interface
├── Client Tools
│ ├── clickhouse-client (CLI)
│ ├── HTTP Interface (port 8123)
│ └── Native Protocol (port 9000)
├── Storage Layer
│ ├── MergeTree Engine
│ ├── Column Storage
│ └── Compression
└── Distributed Layer
├── Sharding
├── Replication
└── Load Balancing
Data Flow in ClickHouse
sequenceDiagram
participant Client
participant Server
participant Storage
Client->>Server: SQL Query
Server->>Server: Parse & Optimize
Server->>Storage: Read Column Data
Storage-->>Server: Compressed Columns
Server->>Server: Vectorized Processing
Server->>Server: Aggregation & Analytics
Server-->>Client: Query Results
Table Engines
ClickHouse uses different storage engines for different use cases:
-- MergeTree (most common for analytical data)
CREATE TABLE events (
id UInt64,
data String,
timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (id, timestamp);
-- ReplacingMergeTree (handles duplicate data)
CREATE TABLE user_profiles (
user_id UInt64,
name String,
email String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
-- SummingMergeTree (automatic aggregation)
CREATE TABLE metrics (
date Date,
metric_name String,
value UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, metric_name);
Data Types in ClickHouse
Numeric Types
-- Integers
CREATE TABLE numbers (
tiny TinyInt, -- 8-bit: -128 to 127
small SmallInt, -- 16-bit: -32,768 to 32,767
medium MediumInt, -- 24-bit: -8,388,608 to 8,388,607
normal Int32, -- 32-bit: -$2^{31}$ to $2^{31}$-1
big UInt64 -- 64-bit unsigned: 0 to $2^{64}$-1
);
-- Floating Point
CREATE TABLE measurements (
temperature Float32, -- 32-bit float
pressure Float64, -- 64-bit float (double)
accuracy Decimal(10,2) -- Fixed precision decimal
);
String Types
-- String types
CREATE TABLE text_data (
short_text FixedString(100), -- Fixed length string
long_text String, -- Variable length string
json_data String -- JSON as string
);
Date and Time Types
-- Temporal types
CREATE TABLE temporal_data (
event_date Date, -- Date only (YYYY-MM-DD)
event_time DateTime, -- Date and time
event_timestamp DateTime64(3), -- High precision timestamp
duration UInt32 -- Duration in seconds
);
Array and Nested Types
-- Complex data structures
CREATE TABLE complex_data (
tags Array(String), -- Array of strings
properties Map(String, String), -- Key-value pairs
coordinates Tuple(Float64, Float64), -- Latitude, longitude
metadata Nested( -- Nested structure
key String,
value String,
timestamp DateTime
)
);
Basic Query Patterns
Data Ingestion
-- Insert single row
INSERT INTO user_events VALUES (4, 'login', '2023-12-01 11:00:00', '{}', 'session_999', 'Chrome/91.0', '192.168.1.4');
-- Insert multiple rows
INSERT INTO user_events VALUES
(5, 'signup', '2023-12-01 11:05:00', '{}', 'session_1000', 'Firefox/89.0', '192.168.1.5'),
(6, 'page_view', '2023-12-01 11:10:00', '{"page": "/dashboard"}', 'session_1001', 'Safari/14.0', '192.168.1.6');
-- Bulk insert from file
INSERT INTO user_events FROM INFILE '/path/to/events.csv' FORMAT CSV;
Analytical Queries
-- Time-based aggregations
SELECT
toStartOfDay(event_time) as day,
event_type,
count() as events,
uniqExact(user_id) as unique_users
FROM user_events
WHERE event_time >= '2023-12-01'
GROUP BY day, event_type
ORDER BY day, events DESC;
-- User behavior analysis
SELECT
user_id,
countIf(event_type = 'page_view') as page_views,
countIf(event_type = 'click') as clicks,
countIf(event_type = 'purchase') as purchases,
max(event_time) as last_activity
FROM user_events
GROUP BY user_id
HAVING page_views > 0
ORDER BY purchases DESC, page_views DESC;
-- Session analysis
SELECT
session_id,
user_id,
min(event_time) as session_start,
max(event_time) as session_end,
count() as total_events,
session_end - session_start as session_duration
FROM user_events
GROUP BY session_id, user_id
ORDER BY session_duration DESC;
Advanced Analytics
-- Rolling window analysis
SELECT
user_id,
event_time,
event_type,
count() OVER (PARTITION BY user_id ORDER BY event_time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) as rolling_count_10
FROM user_events
ORDER BY user_id, event_time;
-- Cohort analysis
SELECT
toStartOfMonth(first_event) as cohort_month,
toStartOfMonth(event_time) as activity_month,
count(distinct user_id) as active_users,
cohort_month = activity_month as is_first_month
FROM (
SELECT
user_id,
min(event_time) as first_event,
event_time
FROM user_events
GROUP BY user_id, event_time
)
GROUP BY cohort_month, activity_month
ORDER BY cohort_month, activity_month;
Performance Characteristics
Why ClickHouse is Fast
- Column Storage: Only reads required columns
- Vectorized Execution: Processes data in CPU vectors
- Parallel Processing: Uses all CPU cores
- Efficient Compression: Reduces I/O operations
- Pre-aggregated Data: MergeTree engine pre-sorts data
- Memory Efficient: Minimizes memory allocations
Benchmarking Your Setup
-- Test basic performance
SELECT count() FROM user_events;
-- Test aggregation performance
SELECT
event_type,
count() as cnt
FROM user_events
GROUP BY event_type;
-- Test complex analytical query
SELECT
user_id,
toStartOfHour(event_time) as hour,
count() as events_per_hour
FROM user_events
WHERE event_time >= today()
GROUP BY user_id, hour
ORDER BY user_id, hour;
Data Import and Export
Importing Data
# Import from CSV
cat data.csv | clickhouse-client --query="INSERT INTO table_name FORMAT CSV"
# Import from TSV
clickhouse-client --query="INSERT INTO table_name FORMAT TabSeparated" < data.tsv
# Import from JSON
clickhouse-client --query="INSERT INTO table_name FORMAT JSONEachRow" < data.json
Exporting Data
-- Export to CSV
SELECT * FROM user_events INTO OUTFILE 'events.csv' FORMAT CSV;
-- Export query results
SELECT user_id, count() as events
FROM user_events
GROUP BY user_id
INTO OUTFILE 'user_stats.csv' FORMAT CSVWithNames;
Monitoring and Maintenance
System Tables
-- View system information
SELECT * FROM system.databases;
SELECT * FROM system.tables;
SELECT * FROM system.processes; -- Currently running queries
-- Monitor performance
SELECT * FROM system.metrics;
SELECT * FROM system.events;
Basic Maintenance
-- Optimize table (merge parts)
OPTIMIZE TABLE user_events;
-- Check table size
SELECT
database,
table,
sum(bytes) as size_bytes,
formatReadableSize(sum(bytes)) as size_readable
FROM system.parts
WHERE database = 'analytics' AND table = 'user_events'
GROUP BY database, table;
What We've Accomplished
Congratulations! 🎉 You've successfully:
- Installed ClickHouse using Docker or native installation
- Created your first analytical database with proper schema design
- Loaded sample data and performed basic analytical queries
- Understood ClickHouse's architecture and performance characteristics
- Learned essential data types and table engines
- Implemented basic analytical patterns for time-series and user behavior data
- Explored performance optimization techniques
- Set up monitoring and maintenance procedures
Next Steps
Now that you have ClickHouse running and understand the basics, let's dive deeper into data modeling and schema design. In Chapter 2: Data Modeling & Schemas, we'll explore ClickHouse's powerful table engines, partitioning strategies, and advanced schema design patterns.
Practice what you've learned:
- Load a larger dataset and experiment with different analytical queries
- Try different table engines (ReplacingMergeTree, SummingMergeTree)
- Implement time-based partitioning for better performance
- Create a real-time dashboard using ClickHouse's query capabilities
What kind of analytical workload are you most excited to implement with ClickHouse? 📊
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for clickhouse, SELECT, user_id so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 1: Getting Started with ClickHouse as an operating subsystem inside ClickHouse Tutorial: High-Performance Analytical Database, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around user_events, event_time, server as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, Chapter 1: Getting Started with ClickHouse usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for
clickhouse. - Input normalization: shape incoming data so
SELECTreceives stable contracts. - Core execution: run the main logic branch and propagate intermediate state through
user_id. - Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Source Walkthrough
Use the following upstream sources to verify implementation details while reading this chapter:
- View Repo
Why it matters: authoritative reference on
View Repo(github.com).
Suggested trace strategy:
- search upstream code for
clickhouseandSELECTto map concrete implementation paths - compare docs claims against actual runtime/config code before reusing patterns in production