Polars Setup Guide
January 23, 2026
This guide covers how to use the Parallel Polars integration for DataFrame-native data enrichment.
Architecture
Polars DataFrame
│
▼
parallel_enrich(df, input_columns, output_columns)
│
▼
Parallel Task Group API (batch processing)
│
▼
Polars DataFrame with new columns
The integration processes all rows in a single batch for efficiency, then adds the enriched columns back to your DataFrame.
Prerequisites
- Python 3.12+
- Parallel API Key from platform.parallel.ai
Installation
pip install parallel-web-tools[polars]
Or with all dependencies:
pip install parallel-web-tools[all]
Quick Start
import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich
# Create a DataFrame
df = pl.DataFrame({
    "company": ["Google", "Microsoft", "Apple"],
    "website": ["google.com", "microsoft.com", "apple.com"],
})
# Enrich with company information
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "website",
    },
    output_columns=[
        "CEO name",
        "Founding year",
        "Headquarters city",
    ],
)
# Access the enriched DataFrame
print(result.result)
print(f"Success: {result.success_count}, Errors: {result.error_count}")
Output:
shape: (3, 5)
┌───────────┬───────────────┬───────────────┬───────────────┬───────────────────┐
│ company   ┆ website       ┆ ceo_name      ┆ founding_year ┆ headquarters_city │
│ ---       ┆ ---           ┆ ---           ┆ ---           ┆ ---               │
│ str       ┆ str           ┆ str           ┆ str           ┆ str               │
╞═══════════╪═══════════════╪═══════════════╪═══════════════╪═══════════════════╡
│ Google    ┆ google.com    ┆ Sundar Pichai ┆ 1998          ┆ Mountain View     │
│ Microsoft ┆ microsoft.com ┆ Satya Nadella ┆ 1975          ┆ Redmond           │
│ Apple     ┆ apple.com     ┆ Tim Cook      ┆ 1976          ┆ Cupertino         │
└───────────┴───────────────┴───────────────┴───────────────┴───────────────────┘
Success: 3, Errors: 0
Authentication
Set your API key via environment variable:
export PARALLEL_API_KEY="your-api-key"
Or pass it directly:
result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    api_key="your-api-key",
)
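The fallback described above (explicit argument first, then the environment variable) can be sketched as a small pre-flight check. `resolve_api_key` is a hypothetical helper written for illustration, not a function exported by parallel-web-tools; it only assumes the `PARALLEL_API_KEY` variable name given in this guide. Checking up front can surface a missing key before a long enrichment run starts.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else PARALLEL_API_KEY."""
    key = api_key or os.environ.get("PARALLEL_API_KEY")
    if not key:
        # Fail early with an actionable message instead of mid-run.
        raise ValueError(
            "No API key found: pass api_key=... or set PARALLEL_API_KEY"
        )
    return key
```

The returned value could then be passed as `api_key=` to `parallel_enrich`.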
API Reference
parallel_enrich()
def parallel_enrich(
    df: pl.DataFrame,
    input_columns: dict[str, str],
    output_columns: list[str],
    api_key: str | None = None,
    processor: str = "lite-fast",
    timeout: int = 600,
    include_basis: bool = False,
) -> EnrichmentResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to enrich |
| input_columns | dict[str, str] | required | Mapping of input descriptions to column names |
| output_columns | list[str] | required | List of output column descriptions |
| api_key | str \| None | None | API key (uses the PARALLEL_API_KEY env var if not provided) |
| processor | str | "lite-fast" | Parallel processor to use |
| timeout | int | 600 | Timeout in seconds |
| include_basis | bool | False | Include citations in results |
Returns: EnrichmentResult
EnrichmentResult
@dataclass
class EnrichmentResult:
    result: pl.DataFrame          # Enriched DataFrame
    success_count: int            # Number of successful rows
    error_count: int              # Number of failed rows
    errors: list[dict[str, Any]]  # Error details
    elapsed_time: float           # Processing time in seconds
parallel_enrich_lazy()
Same as parallel_enrich() but accepts a pl.LazyFrame. Collects the LazyFrame before processing.
Usage Examples
Basic Company Enrichment
import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich
df = pl.DataFrame({
    "name": ["Tesla", "SpaceX", "Neuralink"],
})
result = parallel_enrich(
    df,
    input_columns={"company_name": "name"},
    output_columns=[
        "CEO name",
        "Industry",
        "Year founded",
        "Headquarters",
    ],
)
print(result.result)
Multiple Input Columns
df = pl.DataFrame({
    "company": ["Acme Corp"],
    "domain": ["acme.com"],
    "location": ["San Francisco, CA"],
})
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "domain",
        "headquarters": "location",
    },
    output_columns=[
        "Number of employees",
        "Annual revenue (USD)",
        "Main products",
    ],
)
Using Different Processors
# Fast, basic metadata
result = parallel_enrich(df, ..., processor="lite-fast")
# Standard enrichments
result = parallel_enrich(df, ..., processor="base-fast")
# Deep research
result = parallel_enrich(df, ..., processor="pro-fast")
Including Citations
result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    include_basis=True,
)
# Access citations
for row in result.result.iter_rows(named=True):
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")
Error Handling
result = parallel_enrich(df, ...)
if result.error_count > 0:
    print(f"Failed rows: {result.error_count}")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")
# Filter successful rows only
successful_df = result.result.filter(
    pl.col("ceo_name").is_not_null()
)
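To retry or inspect only the failures, the row indices can be pulled out of `result.errors`. The sketch below assumes each error entry is a dict with the `row` and `error` keys used in the loop above, and that `row` holds an integer row index; both are assumptions about the error format. `failed_row_indices` is a hypothetical helper, and the sample entries are made up for illustration.

```python
from typing import Any


def failed_row_indices(errors: list[dict[str, Any]]) -> list[int]:
    """Return the distinct row indices found in the error list, sorted."""
    return sorted({int(e["row"]) for e in errors})


# Made-up error entries in the shape used by the loop above.
sample_errors = [
    {"row": 4, "error": "timeout"},
    {"row": 1, "error": "no result"},
    {"row": 4, "error": "timeout"},
]
print(failed_row_indices(sample_errors))  # [1, 4]
```

Duplicates collapse because the indices pass through a set before sorting, so a row that failed more than once is only listed once.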
With LazyFrames
import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich_lazy
# Read from CSV lazily
lf = pl.scan_csv("companies.csv")
# Filter and select
lf = lf.filter(pl.col("active") == True).select(["name", "website"])
# Enrich (will collect the LazyFrame)
result = parallel_enrich_lazy(
    lf,
    input_columns={"company_name": "name", "website": "website"},
    output_columns=["CEO name"],
)
Large Dataset Processing
For large datasets, consider processing in batches:
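import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich
def enrich_in_batches(df: pl.DataFrame, batch_size: int = 100) -> pl.DataFrame:
    """Process large DataFrames in batches."""
    results = []
    for i in range(0, len(df), batch_size):
        # slice() returns fewer rows for the final, shorter batch
        batch = df.slice(i, batch_size)
        result = parallel_enrich(
            batch,
            input_columns={"company_name": "company"},
            output_columns=["CEO name"],
        )
        # Only the enriched DataFrame is kept; per-batch error counts are discarded
        results.append(result.result)
    return pl.concat(results)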
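The slicing arithmetic in `enrich_in_batches` can be checked without calling the API. This sketch lists the effective `(offset, length)` pair for each batch the loop would create, which shows how many API batches a DataFrame of a given size produces and that the last batch is simply shorter. `batch_bounds` is a hypothetical helper for illustration, not part of the library.

```python
def batch_bounds(n_rows: int, batch_size: int = 100) -> list[tuple[int, int]]:
    """Return (offset, length) for each batch of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (offset, min(batch_size, n_rows - offset))
        for offset in range(0, n_rows, batch_size)
    ]


print(batch_bounds(250))  # [(0, 100), (100, 100), (200, 50)]
```

Here 250 rows with the default batch size result in three calls to `parallel_enrich`.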
Processor Options
| Processor | Speed | Cost | Best For |
|---|---|---|---|
| lite, lite-fast | Fastest | ~$0.005/row | Basic metadata, high volume |
| base, base-fast | Fast | ~$0.01/row | Standard enrichments |
| core, core-fast | Medium | ~$0.025/row | Cross-referenced data |
| pro, pro-fast | Slow | ~$0.10/row | Deep research |
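The per-row prices in the table translate into a rough budget: estimated cost is the row count times the per-row price. The sketch below hard-codes the approximate prices from the table, so its output is an estimate only. `estimate_cost` and `PRICE_PER_ROW_USD` are hypothetical names for illustration, and the code assumes the plain and `-fast` variants share a price, as the table groups them.

```python
# Approximate per-row prices copied from the table above (USD).
PRICE_PER_ROW_USD = {"lite": 0.005, "base": 0.01, "core": 0.025, "pro": 0.10}


def estimate_cost(n_rows: int, processor: str = "lite-fast") -> float:
    """Rough cost estimate in USD for enriching n_rows with a processor."""
    # Assumption: "lite" and "lite-fast" (and so on) are priced the same.
    tier = processor.removesuffix("-fast")
    if tier not in PRICE_PER_ROW_USD:
        raise ValueError(f"Unknown processor: {processor!r}")
    return n_rows * PRICE_PER_ROW_USD[tier]


for name in ("lite-fast", "base-fast", "core-fast", "pro-fast"):
    print(f"{name}: ~${estimate_cost(1000, name):.2f} per 1,000 rows")
```

At these prices the gap between tiers is a factor of 20 from `lite` to `pro`, which is worth weighing before running a large DataFrame through the slower processors.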
Column Name Mapping
Output columns are automatically converted to valid Python identifiers:
| Description | Column Name |
|---|---|
| "CEO name" | ceo_name |
| "Founding year (YYYY)" | founding_year |
| "Annual revenue [USD]" | annual_revenue |
| "2024 Revenue" | col_2024_revenue |
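One set of rules that reproduces every row of the table is sketched below: drop parenthesised or bracketed qualifiers, lowercase, turn runs of other characters into underscores, and prefix names that start with a digit. This is a guess inferred from the examples, not the library's actual implementation, and `to_column_name` is a hypothetical name. It can still help predict which column to reference after enrichment.

```python
import re


def to_column_name(description: str) -> str:
    """Approximate the description -> column name mapping shown above."""
    # Drop qualifiers such as "(YYYY)" or "[USD]".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", description)
    # Lowercase, then collapse each run of non-alphanumerics to one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    # Python identifiers cannot start with a digit, so add a prefix.
    if not name or name[0].isdigit():
        name = f"col_{name}"
    return name


for description in ["CEO name", "Founding year (YYYY)", "2024 Revenue"]:
    print(f"{description!r} -> {to_column_name(description)}")
```

If the real mapping differs for an unusual description, the column names on the returned DataFrame are the source of truth.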
Best Practices
1. Be Specific in Descriptions
# Good - specific descriptions
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]
# Less specific - may get inconsistent results
output_columns = ["CEO", "Year", "Revenue"]
2. Use Appropriate Processors
- High volume, basic data: Use lite-fast
- Standard company info: Use base-fast
- Research-quality data: Use pro-fast
3. Handle Errors Gracefully
import logging
logger = logging.getLogger(__name__)
result = parallel_enrich(df, ...)
# Check for errors before using results
if result.error_count > 0:
    logger.warning(f"{result.error_count} rows failed enrichment")
# Errors don't stop processing - partial results are returned
4. Consider Batch Sizes
The integration processes all rows in a single batch. For very large datasets (1000+ rows), consider:
- Processing in smaller batches
- Using the lite-fast processor for faster results
- Increasing timeout for large batches
Troubleshooting
"Column not found in DataFrame"
Ensure the column names in input_columns values match your DataFrame:
# Wrong - column name doesn't exist
input_columns={"company_name": "Company"} # Capital C
# Correct
input_columns={"company_name": "company"} # Lowercase
Timeout Errors
Increase the timeout for large batches:
result = parallel_enrich(
    df,
    ...,
    timeout=1200,  # 20 minutes
)
Authentication Errors
Check your API key:
# Verify env var is set
echo $PARALLEL_API_KEY
# Or pass directly
result = parallel_enrich(..., api_key="your-key")
Next Steps
- See the demo notebook for more examples
- Check Parallel Documentation for API details
- View parallel-web-tools on GitHub