Building Features¶
Once you've defined features, the next step is to build them and persist them to storage. This guide covers building features with both the CLI and the Python API.
Using the CLI¶
The recommended way to build features is via the mlforge build command.
Basic Usage¶
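From your project root (the directory containing `definitions.py`), run:

```bash
mlforge build
```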
This will:
- Load your `Definitions` object from `definitions.py`
- Materialize all registered features
- Write them to your configured offline store
- Display a preview of each feature
Build Specific Features¶
Build only selected features by name:
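Pass a comma-separated list to the `--features` flag (the names below are the ones used in this guide's examples):

```bash
mlforge build --features user_total_spend,user_avg_spend
```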
Or build features by tag:
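For example, using the `user_metrics` tag from this guide's examples:

```bash
mlforge build --tags user_metrics
```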
Mutually exclusive filters
The --features and --tags options cannot be used together. Choose one filtering approach per build command.
Force Rebuild¶
By default, mlforge skips features that already exist. Use --force to rebuild:
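For example:

```bash
mlforge build --force
```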
Disable Preview¶
Turn off the data preview:
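A sketch, assuming the flag is spelled `--no-preview` (check `mlforge build --help` for the exact name):

```bash
mlforge build --no-preview
```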
Control Preview Size¶
Adjust the number of rows shown:
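A sketch, assuming a `--preview-rows` option (the exact flag name may differ):

```bash
mlforge build --preview-rows 20
```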
Verbose Logging¶
Enable debug logging:
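Assuming the conventional `--verbose` flag:

```bash
mlforge build --verbose
```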
Versioning¶
Starting in v0.5.0, mlforge automatically versions your features using semantic versioning.
Automatic Version Detection¶
When you run mlforge build, mlforge:
- Checks if the feature already exists
- Compares the new build against the latest version
- Detects what changed (schema, config, or data)
- Automatically bumps the version according to semantic versioning rules
Version bump logic:
- MAJOR (2.0.0): Breaking changes
    - Columns removed
    - Data types changed
- MINOR (1.1.0): Additive changes
    - Columns added
    - Configuration changed (interval, metrics, etc.)
- PATCH (1.0.1): Non-breaking changes
    - Data refresh (same schema and config)
Override Automatic Versioning¶
Specify an explicit version using the --version flag:
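For example:

```bash
mlforge build --features user_spend --version 2.0.0
```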
Or in Python:
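A sketch, assuming `build()` accepts a `version` keyword argument mirroring the CLI flag (not confirmed here):

```python
defs.build(version="2.0.0")
```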
Warning
Overriding the version skips change detection. Only use this when you need precise control over versioning.
Version Examples¶
First build - Creates version 1.0.0:
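For example:

```bash
mlforge build --features user_spend
# → Created user_spend v1.0.0
```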
Data refresh - Same schema/config, new data → PATCH bump:
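Because unchanged features are skipped by default, a data-only refresh needs `--force`; per the rules above, only the PATCH component is bumped:

```bash
mlforge build --features user_spend --force
# → Created user_spend v1.0.1
```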
Add column - Additive change → MINOR bump:
```python
# Add a new column to the feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id", "total_spend", "avg_spend"])  # Added avg_spend

defs.build()
# → Created user_spend v1.1.0
```
Remove column - Breaking change → MAJOR bump:
```python
# Remove a column from the feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id"])  # Removed total_spend

defs.build()
# → Created user_spend v2.0.0
```
Team Collaboration via Git¶
mlforge enables teams to collaborate on feature definitions via Git while keeping large data files out of version control.
How It Works¶
- Metadata is committed: `.meta.json` and `_latest.json` files
- Data is ignored: `data.parquet` files are excluded via an auto-generated `.gitignore`
- Teammates rebuild locally: run `mlforge sync` to recreate data from metadata
Workflow¶
Developer 1 - Creates a new feature:
```bash
# Define and build the feature
mlforge build --features user_spend

# Commit metadata to Git
git add feature_store/user_spend/
git commit -m "feat: add user_spend feature v1.0.0"
git push
```
Developer 2 - Pulls changes and rebuilds data:
```bash
# Pull the latest changes
git pull

# Rebuild features from metadata
mlforge sync
# user_spend v1.0.0 is now available locally
```
Git Ignore¶
mlforge automatically creates .gitignore files in each feature directory:
```text
# Auto-generated by mlforge
# Data files are rebuilt from source; commit .meta.json and _latest.json only
*/data.parquet
```
This ensures:
- Metadata files are committed (.meta.json, _latest.json)
- Data files are ignored (data.parquet)
- Feature definitions can be shared via Git
- Data can be rebuilt from source using mlforge sync
Sync Command¶
The mlforge sync command rebuilds features from metadata:
```bash
# Preview what would be synced
mlforge sync --dry-run

# Sync all features with missing data
mlforge sync

# Sync specific features
mlforge sync --features user_spend,merchant_spend

# Force sync even if source data changed
mlforge sync --force
```
See the CLI Reference for complete sync command documentation.
LocalStore Only
The sync command only works with LocalStore. Cloud stores (S3Store) already share data between teammates, so syncing is not needed.
Using the Python API¶
You can also build features programmatically:
```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)

# Build all features
defs.build()
```
Build Specific Features¶
By feature name:
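Using the `feature_names` parameter:

```python
defs.build(feature_names=["user_total_spend"])
```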
By tag:
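Using the `tag_names` parameter:

```python
defs.build(tag_names=["user_metrics"])
```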
Mutually exclusive parameters
The feature_names and tag_names parameters cannot be used together.
Force Rebuild¶
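Mirroring the CLI's `--force` flag, assuming a `force` keyword argument:

```python
defs.build(force=True)
```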
Disable Preview¶
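A sketch, assuming a `preview` keyword argument (name not confirmed here):

```python
defs.build(preview=False)
```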
Custom Preview Size¶
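A sketch, assuming a `preview_rows` keyword argument (name not confirmed here):

```python
defs.build(preview_rows=20)
```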
Get Output Paths¶
The build() method returns a dictionary mapping feature names to their file paths:
```python
paths = defs.build()

for feature_name, path in paths.items():
    print(f"{feature_name}: {path}")

# Output:
# user_total_spend: feature_store/user_total_spend.parquet
# user_avg_spend: feature_store/user_avg_spend.parquet
```
Storage Backend¶
Features are stored in the configured offline store. mlforge supports both local and cloud storage backends.
LocalStore¶
Stores features as individual Parquet files on the local filesystem:
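The configuration used throughout this guide:

```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)
```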
Each feature is saved as feature_store/<feature_name>.parquet.
S3Store¶
Stores features in Amazon S3 for production deployments:
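A sketch, assuming `S3Store` takes `bucket` and `prefix` arguments (parameter names are illustrative; see the Storage Backends guide for the confirmed signature):

```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.S3Store(bucket="mlforge-features", prefix="prod")
)
```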
Features are stored at s3://mlforge-features/prod/features/<feature_name>.parquet.
AWS Credentials
S3Store uses standard AWS credential resolution (environment variables, ~/.aws/credentials, or IAM roles).
See the Storage Backends guide for detailed configuration and IAM policy examples.
Listing Features¶
View all registered features:
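For example:

```bash
mlforge list
```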
Filter by tags:
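Assuming `mlforge list` accepts the same `--tags` option as `mlforge build`:

```bash
mlforge list --tags user_metrics
```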
Output:
```text
┌──────────────────┬──────────────┬───────────────────────────┬──────────────┬─────────────────────┐
│ Name             │ Keys         │ Source                    │ Tags         │ Description         │
├──────────────────┼──────────────┼───────────────────────────┼──────────────┼─────────────────────┤
│ user_total_spend │ [user_id]    │ data/transactions.parquet │ user_metrics │ Total spend by user │
│ user_avg_spend   │ [user_id]    │ data/transactions.parquet │ user_metrics │ Avg spend by user   │
└──────────────────┴──────────────┴───────────────────────────┴──────────────┴─────────────────────┘
```
Or in Python:
```python
# List all features
features = defs.list_features()
for feature in features:
    print(f"{feature.name}: {feature.description}")

# List features by tag
user_features = defs.list_features(tags=["user_metrics"])
```
Error Handling¶
FeatureMaterializationError¶
Raised when a feature function fails or returns invalid data:
```python
from mlforge.errors import FeatureMaterializationError

try:
    defs.build()
except FeatureMaterializationError as e:
    print(f"Failed to materialize feature: {e}")
```
Common causes:
- Feature function returns `None`
- Feature function returns the wrong type
- Missing key columns in the output
Source File Errors¶
If the source file doesn't exist or has an unsupported format:
```python
@mlf.feature(keys=["user_id"], source="data/missing.parquet")
def my_feature(df): ...

# Raises: FileNotFoundError
defs.build()
```
Supported formats: `.parquet`, `.csv`
Workflow Example¶
A typical development workflow:
```bash
# 1. Define features
cat > features.py << 'EOF'
import mlforge as mlf
import polars as pl

@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_age(df):
    return df.select(["user_id", "age"])
EOF

# 2. Create definitions
cat > definitions.py << 'EOF'
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="user-features",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)
EOF

# 3. Build features
mlforge build

# 4. Verify
mlforge list

# 5. Rebuild specific features if needed
mlforge build --features user_age --force
```
Performance Tips¶
1. Use Parquet for Sources¶
Parquet is significantly faster than CSV for large datasets:
```python
# Convert CSV to Parquet once
import polars as pl

df = pl.read_csv("data/large_file.csv")
df.write_parquet("data/large_file.parquet")

# Then use Parquet in features
@mlf.feature(keys=["id"], source="data/large_file.parquet")
def my_feature(df): ...
```
2. Filter Early¶
If you don't need all source data, filter it early in your feature function:
```python
@mlf.feature(keys=["user_id"], source="data/all_events.parquet")
def recent_user_activity(df):
    return (
        df
        .filter(pl.col("event_date") >= "2024-01-01")  # Filter early
        .group_by("user_id")
        .agg(pl.col("event_id").count())
    )
```
3. Build Features Incrementally¶
During development, build one feature at a time:
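For example:

```bash
mlforge build --features user_age
```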
Once it works, build all features together.
Next Steps¶
- Retrieving Features - Use features in training pipelines
- Entity Keys - Work with surrogate keys
- Point-in-Time Correctness - Temporal feature joins