Building Features¶
Once you've defined features, the next step is to build them and persist them to storage. This guide covers building features with both the CLI and the Python API.
Using the CLI¶
The recommended way to build features is via the mlforge build command.
Basic Usage¶
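From your project root (the directory containing `definitions.py`), run:

```bash
mlforge build
```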
This will:
- Load your `Definitions` object from `definitions.py`
- Materialize all registered features
- Write them to your configured offline store
- Display a preview of each feature
Build Specific Features¶
Build only selected features by name:
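Pass a comma-separated list to the `--features` flag (the names below are the ones used in this guide's examples):

```bash
mlforge build --features user_total_spend,user_avg_spend
```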
Or build features by tag:
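For example, using the `user_metrics` tag from this guide's examples:

```bash
mlforge build --tags user_metrics
```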
Mutually exclusive filters
The --features and --tags options cannot be used together. Choose one filtering approach per build command.
Force Rebuild¶
By default, mlforge skips features that already exist. Use --force to rebuild:
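For example:

```bash
mlforge build --force
```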
Disable Preview¶
Turn off the data preview:
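A sketch, assuming the flag is spelled `--no-preview` (check `mlforge build --help` for the exact name):

```bash
mlforge build --no-preview
```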
Control Preview Size¶
Adjust the number of rows shown:
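A sketch, assuming a `--preview-rows` option (the exact flag name may differ):

```bash
mlforge build --preview-rows 20
```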
Verbose Logging¶
Enable debug logging:
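Assuming the conventional `--verbose` flag:

```bash
mlforge build --verbose
```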
Versioning¶
Starting in v0.5.0, mlforge automatically versions your features using semantic versioning.
Automatic Version Detection¶
When you run mlforge build, mlforge:
- Checks if the feature already exists
- Compares the new build against the latest version
- Detects what changed (schema, config, or data)
- Automatically bumps the version according to semantic versioning rules
Version bump logic:
- MAJOR (2.0.0): Breaking changes
    - Columns removed
    - Data types changed
- MINOR (1.1.0): Additive changes
    - Columns added
    - Configuration changed (interval, metrics, etc.)
- PATCH (1.0.1): Non-breaking changes
    - Data refresh (same schema and config)
Override Automatic Versioning¶
Specify an explicit version using the --version flag:
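For example:

```bash
mlforge build --features user_spend --version 2.0.0
```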
Or in Python:
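A sketch, assuming `build()` accepts a `version` keyword argument mirroring the CLI flag (not confirmed here):

```python
defs.build(version="2.0.0")
```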
Warning
Overriding the version skips change detection. Only use this when you need precise control over versioning.
Version Examples¶
First build - Creates version 1.0.0:
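For example:

```bash
mlforge build --features user_spend
# → Created user_spend v1.0.0
```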
Data refresh - Same schema/config, new data → PATCH bump:
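Because unchanged features are skipped by default, a data-only refresh needs `--force`; per the rules above, only the PATCH component is bumped:

```bash
mlforge build --features user_spend --force
# → Created user_spend v1.0.1
```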
Add column - Additive change → MINOR bump:
```python
# Add a new column to the feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id", "total_spend", "avg_spend"])  # Added avg_spend

defs.build()
# → Created user_spend v1.1.0
```
Remove column - Breaking change → MAJOR bump:
```python
# Remove a column from the feature output
@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_spend(df):
    return df.select(["user_id"])  # Removed total_spend

defs.build()
# → Created user_spend v2.0.0
```
Team Collaboration via Git¶
mlforge enables teams to collaborate on feature definitions via Git while keeping large data files out of version control.
How It Works¶
- Metadata is committed: `.meta.json` and `_latest.json` files
- Data is ignored: `data.parquet` files are excluded via an auto-generated `.gitignore`
- Teammates rebuild locally: run `mlforge sync` to recreate data from metadata
Workflow¶
Developer 1 - Creates a new feature:
```bash
# Define and build the feature
mlforge build --features user_spend

# Commit metadata to Git
git add feature_store/user_spend/
git commit -m "feat: add user_spend feature v1.0.0"
git push
```
Developer 2 - Pulls changes and rebuilds data:
```bash
# Pull the latest changes
git pull

# Rebuild features from metadata
mlforge sync
# user_spend v1.0.0 is now available locally
```
Git Ignore¶
mlforge automatically creates .gitignore files in each feature directory:
```text
# Auto-generated by mlforge
# Data files are rebuilt from source; commit .meta.json and _latest.json only
*/data.parquet
```
This ensures:
- Metadata files are committed (.meta.json, _latest.json)
- Data files are ignored (data.parquet)
- Feature definitions can be shared via Git
- Data can be rebuilt from source using mlforge sync
Sync Command¶
The mlforge sync command rebuilds features from metadata:
```bash
# Preview what would be synced
mlforge sync --dry-run

# Sync all features with missing data
mlforge sync

# Sync specific features
mlforge sync --features user_spend,merchant_spend

# Force sync even if source data changed
mlforge sync --force
```
See the CLI Reference for complete sync command documentation.
LocalStore Only
The sync command only works with LocalStore. Cloud stores (S3Store) already share data between teammates, so syncing is not needed.
Using the Python API¶
You can also build features programmatically:
```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)

# Build all features
defs.build()
```
Build Specific Features¶
By feature name:
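Using the `feature_names` parameter:

```python
defs.build(feature_names=["user_total_spend"])
```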
By tag:
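Using the `tag_names` parameter:

```python
defs.build(tag_names=["user_metrics"])
```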
Mutually exclusive parameters
The feature_names and tag_names parameters cannot be used together.
Force Rebuild¶
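Mirroring the CLI's `--force` flag, assuming a `force` keyword argument:

```python
defs.build(force=True)
```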
Disable Preview¶
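A sketch, assuming a `preview` keyword argument (name not confirmed here):

```python
defs.build(preview=False)
```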
Custom Preview Size¶
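A sketch, assuming a `preview_rows` keyword argument (name not confirmed here):

```python
defs.build(preview_rows=20)
```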
Get Output Paths¶
The build() method returns a dictionary mapping feature names to their file paths:
```python
paths = defs.build()

for feature_name, path in paths.items():
    print(f"{feature_name}: {path}")

# Output:
# user_total_spend: feature_store/user_total_spend.parquet
# user_avg_spend: feature_store/user_avg_spend.parquet
```
Storage Backend¶
Features are stored in the configured offline store. mlforge supports both local and cloud storage backends.
LocalStore¶
Stores features as individual Parquet files on the local filesystem:
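The configuration used throughout this guide:

```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)
```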
Each feature is saved as feature_store/<feature_name>.parquet.
S3Store¶
Stores features in Amazon S3 for production deployments:
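A sketch, assuming `S3Store` takes `bucket` and `prefix` arguments (parameter names are illustrative; see the Storage Backends guide for the confirmed signature):

```python
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.S3Store(bucket="mlforge-features", prefix="prod")
)
```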
Features are stored at s3://mlforge-features/prod/features/<feature_name>.parquet.
AWS Credentials
S3Store uses standard AWS credential resolution (environment variables, ~/.aws/credentials, or IAM roles).
See the Storage Backends guide for detailed configuration and IAM policy examples.
Listing Features¶
View all registered features:
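For example:

```bash
mlforge list
```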
Filter by tags:
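Assuming `mlforge list` accepts the same `--tags` option as `mlforge build`:

```bash
mlforge list --tags user_metrics
```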
Output:
```text
┌──────────────────┬──────────────┬───────────────────────────┬──────────────┬─────────────────────┐
│ Name             │ Keys         │ Source                    │ Tags         │ Description         │
├──────────────────┼──────────────┼───────────────────────────┼──────────────┼─────────────────────┤
│ user_total_spend │ [user_id]    │ data/transactions.parquet │ user_metrics │ Total spend by user │
│ user_avg_spend   │ [user_id]    │ data/transactions.parquet │ user_metrics │ Avg spend by user   │
└──────────────────┴──────────────┴───────────────────────────┴──────────────┴─────────────────────┘
```
Or in Python:
```python
# List all features
features = defs.list_features()
for feature in features:
    print(f"{feature.name}: {feature.description}")

# List features by tag
user_features = defs.list_features(tags=["user_metrics"])
```
Error Handling¶
FeatureMaterializationError¶
Raised when a feature function fails or returns invalid data:
```python
from mlforge.errors import FeatureMaterializationError

try:
    defs.build()
except FeatureMaterializationError as e:
    print(f"Failed to materialize feature: {e}")
```
Common causes:
- Feature function returns `None`
- Feature function returns the wrong type
- Missing key columns in the output
Source File Errors¶
If the source file doesn't exist or has an unsupported format:
```python
@mlf.feature(keys=["user_id"], source="data/missing.parquet")
def my_feature(df): ...

# Raises: FileNotFoundError
defs.build()
```
Supported formats: `.parquet`, `.csv`
Workflow Example¶
A typical development workflow:
```bash
# 1. Define features
cat > features.py << 'EOF'
import mlforge as mlf
import polars as pl

@mlf.feature(keys=["user_id"], source="data/users.parquet")
def user_age(df):
    return df.select(["user_id", "age"])
EOF

# 2. Create definitions
cat > definitions.py << 'EOF'
import mlforge as mlf
import features

defs = mlf.Definitions(
    name="user-features",
    features=[features],
    offline_store=mlf.LocalStore("./feature_store")
)
EOF

# 3. Build features
mlforge build

# 4. Verify
mlforge list

# 5. Rebuild specific features if needed
mlforge build --features user_age --force
```
Performance Tips¶
1. Use Parquet for Sources¶
Parquet is significantly faster than CSV for large datasets:
```python
# Convert CSV to Parquet once
import polars as pl

df = pl.read_csv("data/large_file.csv")
df.write_parquet("data/large_file.parquet")

# Then use Parquet in features
@mlf.feature(keys=["id"], source="data/large_file.parquet")
def my_feature(df): ...
```
2. Filter Early¶
If you don't need all source data, filter it early in your feature function:
```python
@mlf.feature(keys=["user_id"], source="data/all_events.parquet")
def recent_user_activity(df):
    return (
        df
        .filter(pl.col("event_date") >= "2024-01-01")  # Filter early
        .group_by("user_id")
        .agg(pl.col("event_id").count())
    )
```
3. Build Features Incrementally¶
During development, build one feature at a time:
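For example:

```bash
mlforge build --features user_age
```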
Once it works, build all features together.
Next Steps¶
- Retrieving Features - Use features in training pipelines
- Entity Keys - Work with surrogate keys
- Point-in-Time Correctness - Temporal feature joins