Building Features

Once you've defined features, the next step is to build them and persist the results to storage. This guide covers building features with both the CLI and the Python API.

Using the CLI

The recommended way to build features is via the mlforge build command.

Basic Usage

mlforge build

This will:

  1. Load your Definitions object from definitions.py
  2. Materialize all registered features
  3. Write them to your configured offline store
  4. Display a preview of each feature
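The four steps above can be sketched as a small build loop. This is an illustrative sketch only, not mlforge source: the dict-based `store` and `feat["fn"]` shapes are hypothetical stand-ins for the real store and feature objects.

```python
def build_all(features, store, force=False, preview=True, preview_rows=5):
    """Hypothetical sketch of the build loop: materialize, persist, preview."""
    paths = {}
    for feat in features:
        if feat["name"] in store and not force:
            continue                      # existing features are skipped by default
        rows = feat["fn"]()               # 2. materialize the feature
        store[feat["name"]] = rows        # 3. write to the offline store
        paths[feat["name"]] = f"feature_store/{feat['name']}.parquet"
        if preview:
            print(rows[:preview_rows])    # 4. show a preview
    return paths
```

Note that a second run over the same store returns an empty mapping unless `force=True`, which mirrors the skip-existing behavior described below.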

Build Specific Features

Build only selected features by name:

mlforge build --features user_total_spend,user_avg_spend

Or build features by tag:

mlforge build --tags user_metrics,demographics

Mutually exclusive filters

The --features and --tags options cannot be used together. Choose one filtering approach per build command.

Force Rebuild

By default, mlforge skips features that already exist. Use --force to rebuild:

mlforge build --force

Disable Preview

Turn off the data preview:

mlforge build --no-preview

Control Preview Size

Adjust the number of rows shown:

mlforge build --preview-rows 10

Verbose Logging

Enable debug logging:

mlforge build --verbose

Versioning

Starting in v0.5.0, mlforge automatically versions your features using semantic versioning.

Automatic Version Detection

When you run mlforge build, mlforge:

  1. Checks if the feature already exists
  2. Compares the new build against the latest version
  3. Detects what changed (schema, config, or data)
  4. Automatically bumps the version according to semantic versioning rules

Version bump logic:

  • MAJOR (2.0.0): Breaking changes
    • Columns removed
    • Data types changed
  • MINOR (1.1.0): Additive changes
    • Columns added
    • Configuration changed (interval, metrics, etc.)
  • PATCH (1.0.1): Non-breaking changes
    • Data refresh (same schema and config)
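The bump rules above can be expressed as a small decision function. This is an illustrative sketch, assuming schemas are represented as column-name-to-dtype dicts; it is not mlforge's actual change-detection code.

```python
def detect_bump(old_schema, new_schema, config_changed=False):
    """Return 'major', 'minor', or 'patch' per the semantic versioning rules above.

    Schemas are dicts mapping column name -> dtype string (an assumed representation).
    """
    common = set(old_schema) & set(new_schema)
    removed = set(old_schema) - set(new_schema)
    retyped = any(old_schema[c] != new_schema[c] for c in common)
    if removed or retyped:
        return "major"   # breaking: columns removed or data types changed
    if set(new_schema) - set(old_schema) or config_changed:
        return "minor"   # additive: columns added or configuration changed
    return "patch"       # data refresh: same schema and config

def bump_version(version, level):
    """Apply a semver bump, e.g. bump_version('1.2.3', 'major') -> '2.0.0'."""
    major, minor, patch = map(int, version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, removing `total_spend` from a schema yields `"major"`, while adding a new column yields `"minor"`.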

Override Automatic Versioning

Specify an explicit version using the --version flag:

mlforge build --version 2.0.0

Or in Python:

defs.build(version="2.0.0")

Warning

Overriding the version skips change detection. Only use this when you need precise control over versioning.

Version Examples

First build - Creates version 1.0.0:

mlforge build --features user_spend
# → Created user_spend v1.0.0

Data refresh - Same schema/config, new data → PATCH bump:

mlforge build --features user_spend --force
# → Created user_spend v1.0.1

Add column - Additive change → MINOR bump:

# Add a new column to the feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id", "total_spend", "avg_spend"])  # Added avg_spend

defs.build()
# → Created user_spend v1.1.0

Remove column - Breaking change → MAJOR bump:

# Remove a column from feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id"])  # Removed total_spend

defs.build()
# → Created user_spend v2.0.0

Team Collaboration via Git

mlforge enables teams to collaborate on feature definitions via Git while keeping large data files out of version control.

How It Works

  1. Metadata is committed: .meta.json and _latest.json files
  2. Data is ignored: data.parquet files are excluded via auto-generated .gitignore
  3. Teammates rebuild locally: Run mlforge sync to recreate data from metadata

Workflow

Developer 1 - Creates a new feature:

# Define and build feature
mlforge build --features user_spend

# Commit metadata to Git
git add feature_store/user_spend/
git commit -m "feat: add user_spend feature v1.0.0"
git push

Developer 2 - Pulls changes and rebuilds data:

# Pull latest changes
git pull

# Rebuild features from metadata
mlforge sync

# user_spend v1.0.0 is now available locally

Git Ignore

mlforge automatically creates .gitignore files in each feature directory:

# Auto-generated by mlforge
# Data files are rebuilt from source; commit .meta.json and _latest.json only
*/data.parquet

This ensures:

  • Metadata files are committed (.meta.json, _latest.json)
  • Data files are ignored (data.parquet)
  • Feature definitions can be shared via Git
  • Data can be rebuilt from source using mlforge sync

Sync Command

The mlforge sync command rebuilds features from metadata:

# Preview what would be synced
mlforge sync --dry-run

# Sync all features with missing data
mlforge sync

# Sync specific features
mlforge sync --features user_spend,merchant_spend

# Force sync even if source data changed
mlforge sync --force

See the CLI Reference for complete sync command documentation.

LocalStore Only

The sync command only works with LocalStore. Cloud stores (S3Store) already share data between teammates, so syncing is not needed.

Using the Python API

You can also build features programmatically:

import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)

# Build all features
defs.build()

Build Specific Features

By feature name:

defs.build(feature_names=["user_total_spend", "user_avg_spend"])

By tag:

defs.build(tag_names=["user_metrics", "demographics"])

Mutually exclusive parameters

The feature_names and tag_names parameters cannot be used together.

Force Rebuild

defs.build(force=True)

Disable Preview

defs.build(preview=False)

Custom Preview Size

defs.build(preview_rows=10)

Get Output Paths

The build() method returns a dictionary mapping feature names to their file paths:

from pathlib import Path

paths = defs.build()

for feature_name, path in paths.items():
    print(f"{feature_name}: {path}")

# Output:
# user_total_spend: feature_store/user_total_spend.parquet
# user_avg_spend: feature_store/user_avg_spend.parquet

Storage Backend

Features are stored in the configured offline store. mlforge supports both local and cloud storage backends.

LocalStore

Stores features as individual Parquet files on the local filesystem:

import mlforge as mlf

store = mlf.LocalStore(path="./feature_store")

Each feature is saved as feature_store/<feature_name>.parquet.

S3Store

Stores features in Amazon S3 for production deployments:

import mlforge as mlf

store = mlf.S3Store(
    bucket="mlforge-features",
    prefix="prod/features"
)

Features are stored at s3://mlforge-features/prod/features/<feature_name>.parquet.

AWS Credentials

S3Store uses standard AWS credential resolution (environment variables, ~/.aws/credentials, or IAM roles).
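For example, the environment-variable path of the standard resolution chain looks like this (the values shown are placeholders, not real credentials):

```shell
# Supply credentials via standard AWS environment variables
export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_REGION=us-east-1
```

On EC2 or ECS, an attached IAM role makes these variables unnecessary.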

See the Storage Backends guide for detailed configuration and IAM policy examples.

Listing Features

View all registered features:

mlforge list

Filter by tags:

mlforge list --tags user_metrics

Output:

┌──────────────────┬──────────────┬──────────────────────────┬──────────────┬─────────────────────┐
│ Name             │ Keys         │ Source                   │ Tags         │ Description         │
├──────────────────┼──────────────┼──────────────────────────┼──────────────┼─────────────────────┤
│ user_total_spend │ [user_id]    │ data/transactions.parquet│ user_metrics │ Total spend by user │
│ user_avg_spend   │ [user_id]    │ data/transactions.parquet│ user_metrics │ Avg spend by user   │
└──────────────────┴──────────────┴──────────────────────────┴──────────────┴─────────────────────┘

Or in Python:

# List all features
features = defs.list_features()

for feature in features:
    print(f"{feature.name}: {feature.description}")

# List features by tag
user_features = defs.list_features(tags=["user_metrics"])

Error Handling

FeatureMaterializationError

Raised when a feature function fails or returns invalid data:

from mlforge.errors import FeatureMaterializationError

try:
    defs.build()
except FeatureMaterializationError as e:
    print(f"Failed to materialize feature: {e}")

Common causes:

  1. Feature function returns None

    @mlf.feature(keys=["user_id"], source="data/users.parquet")
    def broken_feature(df):
        df.group_by("user_id").agg(...)
        # Missing return statement!
    

  2. Feature function returns wrong type

    @mlf.feature(keys=["user_id"], source="data/users.parquet")
    def broken_feature(df):
        return df.to_dict()  # Should return DataFrame
    

  3. Missing key columns in output

    @mlf.feature(keys=["user_id"], source="data/users.parquet")
    def broken_feature(df):
        return df.select("amount")  # user_id is missing!
    

Source File Errors

If the source file doesn't exist or has an unsupported format:

@mlf.feature(keys=["user_id"], source="data/missing.parquet")
def my_feature(df): ...

# Raises: FileNotFoundError
defs.build()

Supported formats: .parquet, .csv
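A hedged sketch of the kind of validation that produces these errors (`validate_source` is a hypothetical helper for illustration, not part of mlforge):

```python
from pathlib import Path

SUPPORTED_FORMATS = {".parquet", ".csv"}

def validate_source(path: str) -> Path:
    """Reject unsupported extensions, then missing files, before loading."""
    p = Path(path)
    if p.suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported source format: {p.suffix}")
    if not p.exists():
        raise FileNotFoundError(f"source file not found: {p}")
    return p
```

A `.json` source fails with `ValueError`, while a well-formed but absent path fails with `FileNotFoundError`, matching the behavior described above.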

Workflow Example

A typical development workflow:

# 1. Define features
cat > features.py << 'EOF'
import mlforge as mlf
import polars as pl

@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_age(df):
    return df.select(["user_id", "age"])
EOF

# 2. Create definitions
cat > definitions.py << 'EOF'
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="user-features",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)
EOF

# 3. Build features
mlforge build

# 4. Verify
mlforge list

# 5. Rebuild specific features if needed
mlforge build --features user_age --force

Performance Tips

1. Use Parquet for Sources

Parquet is significantly faster than CSV for large datasets:

# Convert CSV to Parquet once
import polars as pl

df = pl.read_csv("data/large_file.csv")
df.write_parquet("data/large_file.parquet")

# Then use Parquet in features
@mlf.feature(keys=["id"], source="data/large_file.parquet")
def my_feature(df): ...

2. Filter Early

If you don't need all source data, filter it early in your feature function:

@mlf.feature(keys=["user_id"], source="data/all_events.parquet")
def recent_user_activity(df):
    return (
        df
        .filter(pl.col("event_date") >= "2024-01-01")  # Filter early
        .group_by("user_id")
        .agg(pl.col("event_id").count())
    )

3. Build Features Incrementally

During development, build one feature at a time:

mlforge build --features new_feature

Once it works, build all features together.

Next Steps