Walk-Forward Validation
February 4, 2026
Proper validation is critical for trading strategies. This tutorial covers walk-forward methodology.
Learning Objectives
After this tutorial, you will understand:
- Why time-based splits are essential
- Walk-forward validation methodology
- How to implement it in TensorTrade
- Interpreting walk-forward results
The Problem with Random Splits
Machine Learning Default
# Standard ML: Random 80/20 split
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=True)
Why It Fails for Time Series
Data: [Jan] [Feb] [Mar] [Apr] [May] [Jun]
Random split might give:
Train: [Jan] [Mar] [May] (scattered)
Test: [Feb] [Apr] [Jun] (interleaved)
Problems:
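1. The agent trains on March and May, then is tested on February and April, so it has already seen what happened next
2. Prices in adjacent months are correlated, so March data "leaks" information about February
3. The test set is not truly unseen future data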
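The leakage is easy to see on a toy index. The sketch below uses only the standard library and hypothetical helper names (`shuffled_split`, `time_split`, `leaked_rows`) to count how many training rows sit later in time than the earliest test row under each scheme. Row position stands in for time order.

```python
import random

def shuffled_split(n_rows, test_fraction=0.2, seed=0):
    """Random split over row positions 0..n_rows-1 (position = time order)."""
    rng = random.Random(seed)
    positions = list(range(n_rows))
    rng.shuffle(positions)
    n_test = int(n_rows * test_fraction)
    return sorted(positions[n_test:]), sorted(positions[:n_test])

def time_split(n_rows, test_fraction=0.2):
    """Chronological split: the last rows form the test set."""
    n_test = int(n_rows * test_fraction)
    cut = n_rows - n_test
    return list(range(cut)), list(range(cut, n_rows))

def leaked_rows(train, test):
    """Count training rows that come after the earliest test row."""
    first_test = min(test)
    return sum(1 for pos in train if pos > first_test)

shuffled_train, shuffled_test = shuffled_split(1000)
ordered_train, ordered_test = time_split(1000)

print("shuffled leak:", leaked_rows(shuffled_train, shuffled_test))
print("time-based leak:", leaked_rows(ordered_train, ordered_test))
```

With a chronological split the count is zero by construction. With shuffling, most of the training set lies after the first test row.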
Time-Based Split (Basic)
The Correct Approach
Data: [Jan] [Feb] [Mar] [Apr] [May] [Jun]
Time-based split:
Train: [Jan] [Feb] [Mar] [Apr] (past)
Test: [May] [Jun] (future)
Agent only sees past data during training
Test data is truly "unseen future"
Implementation
# Time-based split (no shuffling!)
total_candles = len(data)
test_size = 30 * 24 # 30 days
val_size = 30 * 24 # 30 days
test_data = data.iloc[-test_size:]
val_data = data.iloc[-(test_size + val_size):-test_size]
train_data = data.iloc[:-(test_size + val_size)]
# Timeline:
# [=============== train ===============][= val =][= test =]
#                                        ^        ^
#                                val start        test start
Walk-Forward Validation
Walk-forward simulates how you'd actually use the model:
- Train on available data
- Test on next period
- Retrain with updated data
- Repeat
┌────────────────────────────────────────────────────────────────┐
│                    Walk-Forward Validation                     │
│                                                                │
│  Fold 1: [=== TRAIN ===][TEST]                                 │
│              Jan-Apr     May                                   │
│                                                                │
│  Fold 2:     [=== TRAIN ===][TEST]                             │
│                  Feb-May     Jun                               │
│                                                                │
│  Fold 3:         [=== TRAIN ===][TEST]                         │
│                      Mar-Jun     Jul                           │
│                                                                │
│  Fold 4:             [=== TRAIN ===][TEST]                     │
│                          Apr-Jul     Aug                       │
│                                                                │
│  Final Result: Average of all fold test results                │
│                                                                │
└────────────────────────────────────────────────────────────────┘
Why Walk-Forward is Better
- Tests on multiple periods - Not just one lucky/unlucky month
- Simulates real deployment - Train → deploy → retrain cycle
- Reveals distribution shift - How performance varies over time
- More robust results - Average is more reliable than single test
Implementation
Basic Walk-Forward
import pandas as pd
def walk_forward(data, train_months=4, test_months=1, stride_months=1):
    """Walk-forward validation for trading.
    Expects time-ordered hourly candles with a 'date' column. train(),
    evaluate() and buy_and_hold() are assumed to be defined elsewhere.
    """
    results = []
    candles_per_month = 30 * 24  # Hourly data, treating a month as 30 days
    train_size = train_months * candles_per_month
    test_size = test_months * candles_per_month
    stride = stride_months * candles_per_month
    total_candles = len(data)
    start = 0
    fold = 0
    # Keep adding folds while a full train + test window still fits
    while start + train_size + test_size <= total_candles:
        fold += 1
        # Extract train and test
        train_end = start + train_size
        test_end = train_end + test_size
        train_data = data.iloc[start:train_end].copy()
        test_data = data.iloc[train_end:test_end].copy()
        print(f"\nFold {fold}:")
        print(f" Train: {train_data['date'].iloc[0]} to {train_data['date'].iloc[-1]}")
        print(f" Test: {test_data['date'].iloc[0]} to {test_data['date'].iloc[-1]}")
        # Train model
        model = train(train_data)
        # Evaluate on test
        test_pnl = evaluate(model, test_data)
        bh_pnl = buy_and_hold(test_data)
        results.append({
            'fold': fold,
            'test_start': test_data['date'].iloc[0],
            'test_end': test_data['date'].iloc[-1],
            'agent_pnl': test_pnl,
            'bh_pnl': bh_pnl,
        })
        print(f" Agent P&L: ${test_pnl:+,.0f}, B&H: ${bh_pnl:+,.0f}")
        # Move window forward
        start += stride
    return pd.DataFrame(results)
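`walk_forward` calls three helpers that this tutorial does not define: `train`, `evaluate`, and `buy_and_hold`. As one possible baseline, the sketch below fills in `buy_and_hold` under stated assumptions: the frame has a `close` column, the whole starting balance is invested at the first close and sold at the last, and commission is charged once on each side. The `initial_cash` and `commission` parameters are illustrative and are not taken from TensorTrade.

```python
import pandas as pd

def buy_and_hold(test_data, initial_cash=10_000.0, commission=0.001):
    """P&L in dollars from buying at the first close and selling at the last.

    Assumes a 'close' column; initial_cash and commission are illustrative.
    """
    first_close = float(test_data["close"].iloc[0])
    last_close = float(test_data["close"].iloc[-1])
    units = initial_cash * (1 - commission) / first_close  # commission on entry
    final_cash = units * last_close * (1 - commission)     # and again on exit
    return final_cash - initial_cash

prices = pd.DataFrame({"close": [100.0, 105.0, 110.0]})
print(f"B&H P&L: ${buy_and_hold(prices, commission=0.0):+,.0f}")
```

Whatever baseline you use, give it the same starting balance and fee model as the agent so the two P&L figures are comparable.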
Full Example
def run_walk_forward():
    # Load data
    data = load_data()
    # Run walk-forward
    results = walk_forward(
        data,
        train_months=4,   # 4 months training
        test_months=1,    # 1 month test
        stride_months=1   # Move forward 1 month each fold
    )
    # Summary statistics
    print("\n" + "="*50)
    print("Walk-Forward Results Summary")
    print("="*50)
    avg_agent = results['agent_pnl'].mean()
    avg_bh = results['bh_pnl'].mean()
    win_rate = (results['agent_pnl'] > results['bh_pnl']).mean()
    print(f"Average Agent P&L: ${avg_agent:+,.0f}")
    print(f"Average B&H P&L: ${avg_bh:+,.0f}")
    print(f"Win Rate vs B&H: {win_rate:.1%}")
    print(f"Folds where Agent beats B&H: {sum(results['agent_pnl'] > results['bh_pnl'])}/{len(results)}")
    return results
Interpreting Results
Good Results
Walk-Forward Results Summary
============================
Fold 1: Agent $+150, B&H $-100 (Win)
Fold 2: Agent $+80, B&H $+200 (Loss)
Fold 3: Agent $+210, B&H $+50 (Win)
Fold 4: Agent $-50, B&H $-300 (Win)
Fold 5: Agent $+120, B&H $+100 (Win)
Average Agent: $+102
Average B&H: $-10
Win Rate: 80% (4/5 folds)
Interpretation: Agent consistently beats B&H across market conditions
Concerning Results
Walk-Forward Results Summary
============================
Fold 1: Agent $+500, B&H $+200 (Win) ← Bull market, agent wins big
Fold 2: Agent $-400, B&H $-100 (Loss) ← Bear market, agent loses more
Fold 3: Agent $+300, B&H $+250 (Win) ← Bull market, agent wins small
Fold 4: Agent $-600, B&H $-300 (Loss) ← Bear market, agent loses big
Fold 5: Agent $+100, B&H $+50 (Win)
Average Agent: $-20
Average B&H: $+20
Win Rate: 60%
Interpretation: Agent does well in bull markets but badly in bear markets
→ Not robust, probably overfit to bullish patterns
Red Flags
Warning signs in walk-forward results:
1. High variance across folds
Fold 1: +$500, Fold 2: -$800, Fold 3: +$600
→ Strategy is unstable
2. Performance degradation over time
Fold 1: +$300, Fold 2: +$200, Fold 3: +$50, Fold 4: -$100
→ Market regime changed, model didn't adapt
3. Wins only in specific market conditions
All wins during rising markets, all losses during falling
→ Not actually predicting, just biased long/short
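The three warning signs above can be turned into rough numbers. The sketch below is one way to do it with NumPy, not a standard diagnostic: `red_flag_report` is a hypothetical helper, and the fold values are made up for illustration. What counts as "too high" for each number is a judgment call.

```python
import numpy as np

def red_flag_report(agent_pnls, bh_pnls):
    """Summarize fold-level P&L with three simple diagnostics."""
    agent = np.asarray(agent_pnls, dtype=float)
    bh = np.asarray(bh_pnls, dtype=float)
    folds = np.arange(len(agent))
    return {
        # 1. Spread of fold P&L around its mean
        "pnl_std": float(agent.std(ddof=1)),
        # 2. Least-squares slope of P&L against fold number (negative = decaying)
        "pnl_slope_per_fold": float(np.polyfit(folds, agent, 1)[0]),
        # 3. Correlation with buy-and-hold (close to +1 may indicate a long bias)
        "corr_with_bh": float(np.corrcoef(agent, bh)[0, 1]),
    }

# Made-up fold results showing steady degradation
report = red_flag_report(
    agent_pnls=[300, 200, 50, -100],
    bh_pnls=[250, 150, 20, -150],
)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With only a handful of folds these numbers are noisy, so treat them as prompts for a closer look and not as pass/fail tests.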
Anchored vs Rolling Walk-Forward
Rolling (Standard)
Training window moves with test window:
Fold 1: [Jan-Apr] → [May]
Fold 2: [Feb-May] → [Jun]
Fold 3: [Mar-Jun] → [Jul]
Each fold uses same amount of training data
Older data "falls off" the training set
Anchored (Expanding)
Training window grows over time:
Fold 1: [Jan-Apr] → [May]
Fold 2: [Jan-May] → [Jun]
Fold 3: [Jan-Jun] → [Jul]
Each fold uses more training data
Never discards older data
Which to Use?
| Method | Pros | Cons |
|---|---|---|
| Rolling | Adapts to recent conditions | Loses old patterns |
| Anchored | More training data | May include stale patterns |
Recommendation: Start with rolling. Try anchored if you have limited data.
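The two schemes differ only in where each training window starts. A minimal sketch that yields row positions for either scheme is shown below. `walk_forward_windows` and its `anchored` flag are hypothetical names, and the sizes are in rows (months in the toy call, candles in practice).

```python
def walk_forward_windows(n_rows, train_size, test_size, stride, anchored=False):
    """Yield (train_start, train_end, test_end) row positions for each fold.

    Rolling: train_start advances with the stride, so the window length is fixed.
    Anchored: train_start stays at 0, so the window grows each fold.
    """
    train_end = train_size
    while train_end + test_size <= n_rows:
        train_start = 0 if anchored else train_end - train_size
        yield train_start, train_end, train_end + test_size
        train_end += stride

# Seven "months" of data, 4 to train and 1 to test, as in the diagrams above
rolling = list(walk_forward_windows(7, 4, 1, 1))
anchored = list(walk_forward_windows(7, 4, 1, 1, anchored=True))
print("rolling: ", rolling)
print("anchored:", anchored)
```

Each tuple maps onto slices as `data.iloc[train_start:train_end]` for training and `data.iloc[train_end:test_end]` for testing.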
Walk-Forward with Retraining
In production, you'd retrain periodically:
def walk_forward_with_retraining(data, n_folds):
    """Walk-forward with model retraining."""
    results = []
    for fold in range(n_folds):
        # get_train_data() and get_test_data() are placeholders that return
        # the train and test slices of `data` for this fold
        train_data = get_train_data(data, fold)
        test_data = get_test_data(data, fold)
        # Retrain from scratch each fold
        # (Or fine-tune from previous model)
        model = train(train_data)
        test_pnl = evaluate(model, test_data)
        results.append(test_pnl)
    return results
Fine-Tuning vs Fresh Training
# Option A: Train from scratch each fold
model = PPOConfig().build()  # Fresh model
# Option B: Fine-tune from previous fold
if previous_model is not None:
    model = previous_model
    model.train()  # Continue training with new data
else:
    model = PPOConfig().build()
Fine-tuning is faster but may accumulate errors. Fresh training is slower but more robust.
Statistical Significance
The Problem
5 folds: Agent wins 3, loses 2
Is this skill or luck?
- Could be 60% real win rate (skill)
- Could be 50% and we got lucky (noise)
Simple Significance Test
import numpy as np
from scipy import stats
def test_significance(agent_pnls, bh_pnls):
    """Test if agent significantly beats B&H."""
    differences = [a - b for a, b in zip(agent_pnls, bh_pnls)]
    # One-sample t-test on the paired differences (two-sided by default);
    # the direction is checked separately via the sign of the mean below
    t_stat, p_value = stats.ttest_1samp(differences, 0)
    print(f"Mean difference: ${np.mean(differences):+,.0f}")
    print(f"t-statistic: {t_stat:.2f}")
    print(f"p-value: {p_value:.4f}")
    if p_value < 0.05 and np.mean(differences) > 0:
        print("Result: Statistically significant outperformance")
    else:
        print("Result: Cannot conclude agent beats B&H")
Sample Size Requirements
For reliable results:
- Minimum 5-10 folds
- Preferably 20+ folds
- More folds = more confident results
With 5 folds:
Even 5/5 wins might not be significant
Could still be luck
With 20 folds:
15/20 wins is more convincing
Harder to achieve by chance
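To put numbers on these win counts, you can ask how often a coin-flipping agent would do at least as well. The sketch below is a simple one-sided sign test written with the standard library. It is a different test from the t-test above, it ignores how large each win or loss is, and it treats folds as independent, which is only approximately true when training windows overlap. `sign_test_p_value` is a hypothetical helper name.

```python
from math import comb

def sign_test_p_value(wins, folds):
    """One-sided probability of at least `wins` wins in `folds` fair coin flips."""
    return sum(comb(folds, k) for k in range(wins, folds + 1)) / 2 ** folds

for wins, folds in [(3, 5), (5, 5), (15, 20)]:
    print(f"{wins}/{folds} wins: p = {sign_test_p_value(wins, folds):.3f}")
# 3/5 wins: p = 0.500
# 5/5 wins: p = 0.031
# 15/20 wins: p = 0.021
```

A perfect 5/5 only just clears the 0.05 threshold one-sided, and doubling it for a two-sided test gives about 0.06, which does not.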
Common Mistakes
Mistake 1: Using Future Data in Features
# WRONG: Using .shift(-1) includes future
df['future_return'] = df['close'].shift(-1) / df['close'] - 1
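A shift with a negative period pulls the next row's value into the current row. That is fine for building a prediction target, but it leaks the answer if the column is fed to the agent as an observation. A minimal sketch of the distinction, assuming a pandas frame with a `close` column (the column names `past_return` and `target_return` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0]})

# Feature: return over the candle that just closed. Uses only rows up to t.
df["past_return"] = df["close"] / df["close"].shift(1) - 1

# Target: return over the next candle. Uses row t+1, so keep it out of the
# observation space and use it only as a label or for evaluation.
df["target_return"] = df["close"].shift(-1) / df["close"] - 1

feature_columns = ["close", "past_return"]  # what the agent may observe
print(df)
```

The first `past_return` and the last `target_return` are NaN because the neighbouring row does not exist. Drop or fill those rows before training.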
Mistake 2: Overlapping Train/Test
# WRONG: Validation overlaps with training
train_data = data.iloc[:1000]
val_data = data.iloc[900:1100] # 100 rows overlap!
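The fix is to start validation exactly where training ends. A small guard can catch the mistake early. The sketch below assumes a frame with a unique, sorted index; `check_no_overlap` is a hypothetical helper, not part of pandas or TensorTrade.

```python
import pandas as pd

def check_no_overlap(earlier, later):
    """Raise if two time-ordered slices share rows or are out of order."""
    shared = earlier.index.intersection(later.index)
    if len(shared) > 0:
        raise ValueError(f"{len(shared)} rows appear in both splits")
    if earlier.index.max() >= later.index.min():
        raise ValueError("the later split starts before the earlier one ends")

data = pd.DataFrame({"close": range(1200)})

train_data = data.iloc[:1000]
val_data = data.iloc[1000:1100]  # starts exactly where training ends
check_no_overlap(train_data, val_data)

try:
    check_no_overlap(data.iloc[:1000], data.iloc[900:1100])
except ValueError as err:
    print("caught:", err)
```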
Mistake 3: Testing on Same Period Multiple Times
# WRONG: Tuning hyperparameters on test set
for lr in [0.001, 0.0001, 0.00001]:
    model = train(train_data, lr=lr)
    test_pnl = evaluate(model, test_data)  # Peeking at test!
# Now test_pnl is optimistic because you tuned on it
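A safer pattern is to choose hyperparameters on the validation split and touch the test split once at the end. The sketch below shows that flow. `select_learning_rate` is a hypothetical helper, and `toy_train` and `toy_evaluate` are stand-ins so the example runs on its own; swap in your real training and evaluation functions.

```python
def select_learning_rate(train_data, val_data, test_data, train, evaluate, learning_rates):
    """Pick the learning rate on validation data, then score the test set once."""
    best_lr, best_model, best_val_pnl = None, None, float("-inf")
    for lr in learning_rates:
        model = train(train_data, lr=lr)
        val_pnl = evaluate(model, val_data)     # tuning only ever sees validation
        if val_pnl > best_val_pnl:
            best_lr, best_model, best_val_pnl = lr, model, val_pnl
    test_pnl = evaluate(best_model, test_data)  # single, final look at the test set
    return best_lr, best_val_pnl, test_pnl

# Toy stand-ins so the sketch runs; they do not train anything.
def toy_train(train_data, lr):
    return {"lr": lr}

def toy_evaluate(model, data):
    return sum(data) - abs(model["lr"] - 0.0001) * 1e6

best_lr, val_pnl, test_pnl = select_learning_rate(
    train_data=[1, 2, 3], val_data=[4, 5], test_data=[6, 7],
    train=toy_train, evaluate=toy_evaluate,
    learning_rates=[0.001, 0.0001, 0.00001],
)
print(best_lr, val_pnl, test_pnl)
```

In a walk-forward setting the same idea applies per fold: carve the validation slice out of the end of each training window and leave each fold's test period untouched until the final evaluation.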
Mistake 4: Ignoring Market Regimes
# WRONG: Not checking if results are regime-dependent
# Agent might only work in certain market conditions
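One rough way to check is to group fold results by market direction. The sketch below assumes the `agent_pnl` and `bh_pnl` columns returned by `walk_forward` and uses the sign of buy-and-hold P&L as a deliberately crude regime label. `regime_breakdown` is a hypothetical helper, and the fold values are made up.

```python
import pandas as pd

def regime_breakdown(results):
    """Split fold results by whether buy-and-hold made or lost money."""
    labelled = results.copy()
    labelled["regime"] = labelled["bh_pnl"].apply(
        lambda pnl: "rising" if pnl > 0 else "falling"
    )
    labelled["beat_bh"] = labelled["agent_pnl"] > labelled["bh_pnl"]
    return labelled.groupby("regime").agg(
        folds=("agent_pnl", "size"),
        avg_agent_pnl=("agent_pnl", "mean"),
        win_rate=("beat_bh", "mean"),
    )

# Made-up fold results for illustration only
results = pd.DataFrame({
    "agent_pnl": [400, -350, 250, -500],
    "bh_pnl": [300, -100, 200, -250],
})
summary = regime_breakdown(results)
print(summary)
```

A win rate near 100% in one regime and near 0% in the other suggests the agent is mostly carrying a directional bias.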
Key Takeaways
- Never shuffle time series data - Use time-based splits
- Walk-forward simulates real deployment - Train → test → retrain
- Multiple folds give robust results - Single test is unreliable
- Check for regime dependency - Does it work in all market conditions?
- Statistical significance matters - 3/5 wins might be luck
Checkpoint
After this tutorial, verify you understand:
- Why random splits fail for time series
- How walk-forward validation works
- The difference between rolling and anchored
- How to interpret walk-forward results
Final Words
You've completed the TensorTrade tutorial curriculum.
What we learned:
- RL agents CAN predict market direction
- Commission is the main challenge
- Overfitting is the default failure mode
- Proper validation is essential
What's next:
- Contribute to TensorTrade (reduce overtrading!)
- Experiment with your own strategies
- Join the community on Discord
Happy trading!