Polars Setup Guide

January 23, 2026 · View on GitHub

This guide covers how to use the Parallel Polars integration for DataFrame-native data enrichment.

Architecture

Polars DataFrame
       │
       ▼
parallel_enrich(df, input_columns, output_columns)
       │
       ▼
Parallel Task Group API (batch processing)
       │
       ▼
Polars DataFrame with new columns

The integration processes all rows in a single batch for efficiency, then adds the enriched columns back to your DataFrame.

Prerequisites

Python 3.12+
Parallel API Key from platform.parallel.ai

Installation

pip install parallel-web-tools[polars]

Or with all dependencies:

pip install parallel-web-tools[all]

Quick Start

import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich

# Create a DataFrame
df = pl.DataFrame({
    "company": ["Google", "Microsoft", "Apple"],
    "website": ["google.com", "microsoft.com", "apple.com"],
})

# Enrich with company information
result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "website",
    },
    output_columns=[
        "CEO name",
        "Founding year",
        "Headquarters city",
    ],
)

# Access the enriched DataFrame
print(result.result)
print(f"Success: {result.success_count}, Errors: {result.error_count}")

Output:

shape: (3, 6)
┌───────────┬───────────────┬─────────────────┬──────────────┬──────────────────┐
│ company   ┆ website       ┆ ceo_name        ┆ founding_year┆ headquarters_city│
│ ---       ┆ ---           ┆ ---             ┆ ---          ┆ ---              │
│ str       ┆ str           ┆ str             ┆ str          ┆ str              │
╞═══════════╪═══════════════╪═════════════════╪══════════════╪══════════════════╡
│ Google    ┆ google.com    ┆ Sundar Pichai   ┆ 1998         ┆ Mountain View    │
│ Microsoft ┆ microsoft.com ┆ Satya Nadella   ┆ 1975         ┆ Redmond          │
│ Apple     ┆ apple.com     ┆ Tim Cook        ┆ 1976         ┆ Cupertino        │
└───────────┴───────────────┴─────────────────┴──────────────┴──────────────────┘
Success: 3, Errors: 0

Authentication

Set your API key via environment variable:

export PARALLEL_API_KEY="your-api-key"

Or pass it directly:

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    api_key="your-api-key",
)

API Reference

`parallel_enrich()`

def parallel_enrich(
    df: pl.DataFrame,
    input_columns: dict[str, str],
    output_columns: list[str],
    api_key: str | None = None,
    processor: str = "lite-fast",
    timeout: int = 600,
    include_basis: bool = False,
) -> EnrichmentResult

Parameters:

Parameter	Type	Default	Description
`df`	`pl.DataFrame`	required	DataFrame to enrich
`input_columns`	`dict[str, str]`	required	Mapping of input descriptions to column names
`output_columns`	`list[str]`	required	List of output column descriptions
`api_key`	`str \| None`	`None`	API key (uses env var if not provided)
`processor`	`str`	`"lite-fast"`	Parallel processor to use
`timeout`	`int`	`600`	Timeout in seconds
`include_basis`	`bool`	`False`	Include citations in results

Returns: EnrichmentResult

`EnrichmentResult`

@dataclass
class EnrichmentResult:
    dataframe: pl.DataFrame      # Enriched DataFrame
    success_count: int           # Number of successful rows
    error_count: int             # Number of failed rows
    errors: list[dict[str, Any]] # Error details
    elapsed_time: float          # Processing time in seconds

`parallel_enrich_lazy()`

Same as parallel_enrich() but accepts a pl.LazyFrame. Collects the LazyFrame before processing.

Usage Examples

Basic Company Enrichment

import polars as pl
from parallel_web_tools.integrations.polars import parallel_enrich

df = pl.DataFrame({
    "name": ["Tesla", "SpaceX", "Neuralink"],
})

result = parallel_enrich(
    df,
    input_columns={"company_name": "name"},
    output_columns=[
        "CEO name",
        "Industry",
        "Year founded",
        "Headquarters",
    ],
)

print(result.result)

Multiple Input Columns

df = pl.DataFrame({
    "company": ["Acme Corp"],
    "domain": ["acme.com"],
    "location": ["San Francisco, CA"],
})

result = parallel_enrich(
    df,
    input_columns={
        "company_name": "company",
        "website": "domain",
        "headquarters": "location",
    },
    output_columns=[
        "Number of employees",
        "Annual revenue (USD)",
        "Main products",
    ],
)

Using Different Processors

# Fast, basic metadata
result = parallel_enrich(df, ..., processor="lite-fast")

# Standard enrichments
result = parallel_enrich(df, ..., processor="base-fast")

# Deep research
result = parallel_enrich(df, ..., processor="pro-fast")

Including Citations

result = parallel_enrich(
    df,
    input_columns={"company_name": "company"},
    output_columns=["CEO name"],
    include_basis=True,
)

# Access citations
for row in result.result.iter_rows(named=True):
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")

Error Handling

result = parallel_enrich(df, ...)

if result.error_count > 0:
    print(f"Failed rows: {result.error_count}")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")

# Filter successful rows only
successful_df = result.result.filter(
    pl.col("ceo_name").is_not_null()
)

With LazyFrames

# Read from CSV lazily
lf = pl.scan_csv("companies.csv")

# Filter and select
lf = lf.filter(pl.col("active") == True).select(["name", "website"])

# Enrich (will collect the LazyFrame)
from parallel_web_tools.integrations.polars import parallel_enrich_lazy

result = parallel_enrich_lazy(
    lf,
    input_columns={"company_name": "name", "website": "website"},
    output_columns=["CEO name"],
)

Large Dataset Processing

For large datasets, consider processing in batches:

def enrich_in_batches(df: pl.DataFrame, batch_size: int = 100):
    """Process large DataFrames in batches."""
    results = []

    for i in range(0, len(df), batch_size):
        batch = df.slice(i, batch_size)
        result = parallel_enrich(
            batch,
            input_columns={"company_name": "company"},
            output_columns=["CEO name"],
        )
        results.append(result.result)

    return pl.concat(results)

Processor Options

Processor	Speed	Cost	Best For
`lite`, `lite-fast`	Fastest	~$0.005/row	Basic metadata, high volume
`base`, `base-fast`	Fast	~$0.01/row	Standard enrichments
`core`, `core-fast`	Medium	~$0.025/row	Cross-referenced data
`pro`, `pro-fast`	Slow	~$0.10/row	Deep research

Column Name Mapping

Output columns are automatically converted to valid Python identifiers:

Description	Column Name
`"CEO name"`	`ceo_name`
`"Founding year (YYYY)"`	`founding_year`
`"Annual revenue [USD]"`	`annual_revenue`
`"2024 Revenue"`	`col_2024_revenue`

Best Practices

1. Be Specific in Descriptions

# Good - specific descriptions
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]

# Less specific - may get inconsistent results
output_columns = ["CEO", "Year", "Revenue"]

2. Use Appropriate Processors

High volume, basic data: Use lite-fast
Standard company info: Use base-fast
Research-quality data: Use pro-fast

3. Handle Errors Gracefully

result = parallel_enrich(df, ...)

# Check for errors before using results
if result.error_count > 0:
    logger.warning(f"{result.error_count} rows failed enrichment")

# Errors don't stop processing - partial results are returned

4. Consider Batch Sizes

The integration processes all rows in a single batch. For very large datasets (1000+ rows), consider:

Processing in smaller batches
Using lite-fast processor for faster results
Increasing timeout for large batches

Troubleshooting

"Column not found in DataFrame"

Ensure the column names in input_columns values match your DataFrame:

# Wrong - column name doesn't exist
input_columns={"company_name": "Company"}  # Capital C

# Correct
input_columns={"company_name": "company"}  # Lowercase

Timeout Errors

Increase the timeout for large batches:

result = parallel_enrich(
    df,
    ...,
    timeout=1200,  # 20 minutes
)

Authentication Errors

Check your API key:

# Verify env var is set
echo $PARALLEL_API_KEY

# Or pass directly
result = parallel_enrich(..., api_key="your-key")

Next Steps

See the demo notebook for more examples
Check Parallel Documentation for API details
View parallel-web-tools on GitHub