Storage Backends¶
mlforge supports multiple storage backends for persisting built features. Choose the backend that best fits your deployment environment.
LocalStore¶
The default storage backend that writes features to the local filesystem as Parquet files.
Basic Usage¶
```python
import mlforge as mlf

import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore(path="./feature_store")
)
```
Configuration¶
```python
import mlforge as mlf
from pathlib import Path

# Using a string path
store = mlf.LocalStore(path="./feature_store")

# Using a Path object
store = mlf.LocalStore(path=Path("./feature_store"))

# Custom location
store = mlf.LocalStore(path=Path.home() / "ml_projects" / "features")
```
Storage Format¶
Features are stored in a versioned directory structure (since v0.5.0):
```
feature_store/
├── user_total_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet   # Feature data
│   │   └── .meta.json     # Version metadata
│   ├── 1.0.1/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json       # Pointer to latest version
│   └── .gitignore         # Auto-generated (ignores data.parquet)
├── user_avg_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json
│   └── .gitignore
└── product_popularity/
    ├── 1.0.0/
    │   ├── data.parquet
    │   └── .meta.json
    ├── _latest.json
    └── .gitignore
```
Key files:
- `data.parquet` - Materialized feature data
- `.meta.json` - Metadata for this version (schema, config, hashes, timestamps)
- `_latest.json` - Pointer to the latest version (e.g., `{"version": "1.0.1"}`)
- `.gitignore` - Auto-generated file that ignores `*/data.parquet`
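Given this layout, resolving the current data file for a feature is a matter of following the `_latest.json` pointer. A minimal sketch (the helper name `latest_data_path` is illustrative, not part of the mlforge API):

```python
import json
from pathlib import Path


def latest_data_path(store_root: Path, feature: str) -> Path:
    """Resolve the data.parquet for a feature's latest version."""
    feature_dir = store_root / feature
    # _latest.json holds a pointer like {"version": "1.0.1"}
    pointer = json.loads((feature_dir / "_latest.json").read_text())
    return feature_dir / pointer["version"] / "data.parquet"
```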
Versioning:
- Each build creates or updates a semantic version (e.g., `1.0.0`, `1.0.1`, `1.1.0`)
- Versions are automatically bumped based on detected changes:
    - MAJOR (2.0.0) - Columns removed or dtype changed
    - MINOR (1.1.0) - Columns added or config changed
    - PATCH (1.0.1) - Data refresh only
- Use `--version X.Y.Z` to override automatic versioning
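The bump rules above can be sketched as a small function (a hypothetical helper for illustration, not part of the mlforge API):

```python
def next_version(current: str, removed_or_retyped: bool,
                 added_or_config_changed: bool) -> str:
    """Apply the MAJOR/MINOR/PATCH bump rules described above."""
    major, minor, patch = (int(part) for part in current.split("."))
    if removed_or_retyped:           # columns removed or dtype changed
        return f"{major + 1}.0.0"
    if added_or_config_changed:      # columns added or config changed
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # data refresh only
```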
Git Integration:
LocalStore automatically creates .gitignore files to exclude data files:
```
# Auto-generated by mlforge
# Data files are rebuilt from source; commit .meta.json and _latest.json only
*/data.parquet
```
This allows you to commit feature metadata to Git while excluding large data files that can be rebuilt from source using mlforge sync.
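A typical workflow, then, commits only the metadata and rebuilds data elsewhere (paths are illustrative):

```shell
# Commit feature metadata; the generated .gitignore excludes */data.parquet
git add feature_store/
git commit -m "Update feature metadata"

# On another machine: rebuild data files from source
mlforge sync
```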
When to Use¶
- Local development - Fast iteration and debugging
- Small datasets - Features fit on local disk
- Single machine deployments - No distributed infrastructure needed
- CI/CD pipelines - Ephemeral feature stores for testing
S3Store¶
Cloud storage backend for Amazon S3, supporting distributed access and production deployments.
Basic Usage¶
```python
import mlforge as mlf

import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.S3Store(
        bucket="mlforge-features",
        prefix="prod/features"
    )
)
```
Configuration¶
```python
import mlforge as mlf

# With a prefix (recommended for organization)
store = mlf.S3Store(
    bucket="my-bucket",
    prefix="prod/features"  # Features stored at s3://my-bucket/prod/features/
)

# Without a prefix (bucket root)
store = mlf.S3Store(
    bucket="my-bucket",
    prefix=""  # Features stored at s3://my-bucket/
)

# With an explicit region
store = mlf.S3Store(
    bucket="my-bucket",
    prefix="prod/features",
    region="us-west-2"
)
```
AWS Credentials¶
S3Store uses the standard AWS credential resolution chain: environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), the shared credentials file (~/.aws/credentials), and IAM roles attached to the compute environment (EC2, ECS, Lambda).
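For local development, the quickest option is environment variables (the values below are placeholders, not real credentials):

```shell
# Picked up automatically via the standard AWS credential chain.
# Alternatives: `aws configure` (writes ~/.aws/credentials) or an IAM role.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLE"
export AWS_SECRET_ACCESS_KEY="example-secret"
export AWS_DEFAULT_REGION="us-west-2"
```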
Storage Format¶
Features are stored in S3 as Parquet objects under the configured prefix:

```
s3://mlforge-features/prod/features/
├── user_total_spend.parquet
├── user_avg_spend.parquet
└── product_popularity.parquet
```
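The layout implies object keys like `prod/features/user_total_spend.parquet`. Assembling such a key while tolerating empty or trailing-slash prefixes can be sketched as (`feature_key` is a hypothetical helper, not an mlforge function):

```python
def feature_key(prefix: str, feature: str) -> str:
    """Build the S3 object key for a feature's Parquet file."""
    filename = f"{feature}.parquet"
    # An empty prefix means objects live at the bucket root
    return f"{prefix.rstrip('/')}/{filename}" if prefix else filename
```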
IAM Policy¶
Your AWS credentials need appropriate S3 permissions. Here's a minimal IAM policy:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "mlforgeFeatureStoreAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::mlforge-features",
                "arn:aws:s3:::mlforge-features/*"
            ]
        }
    ]
}
```
Read-Only Access¶
For production environments where only feature retrieval is needed:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "mlforgeFeatureStoreReadOnly",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::mlforge-features",
                "arn:aws:s3:::mlforge-features/*"
            ]
        }
    ]
}
```
Write-Only Access¶
For feature build pipelines that only write features:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "mlforgeFeatureStoreWriteOnly",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::mlforge-features",
                "arn:aws:s3:::mlforge-features/*"
            ]
        }
    ]
}
```
Prefix-Based Access Control¶
Restrict access to specific prefixes (e.g., prod/ vs dev/). Note that the s3:prefix condition key only applies to ListBucket requests, so object access is scoped by the resource ARN instead:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "mlforgeProductionList",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::mlforge-features",
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["prod/*"]
                }
            }
        },
        {
            "Sid": "mlforgeProductionObjects",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::mlforge-features/prod/*"
        }
    ]
}
```
When to Use¶
- Production deployments - Centralized, durable storage
- Team collaboration - Shared feature store across multiple users
- Large datasets - Features too large for local disk
- Multi-environment workflows - Separate dev/staging/prod prefixes
- CI/CD pipelines - Build features in one environment, use in another
Error Handling¶
```python
import mlforge as mlf

try:
    store = mlf.S3Store(bucket="nonexistent-bucket", prefix="features")
except ValueError as e:
    print(f"Bucket error: {e}")
    # Bucket 'nonexistent-bucket' does not exist or is not accessible.
    # Ensure the bucket is created and credentials have appropriate permissions.
```
Common issues:
- Bucket doesn't exist - Create the bucket first using the AWS Console or CLI
- Missing permissions - Verify the IAM policy allows the required S3 actions
- Wrong region - Specify the `region` parameter if the bucket is in a non-default region
- Invalid credentials - Check AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
Complete Example¶
Local Development¶
```python
import mlforge as mlf
import polars as pl


@mlf.feature(keys=["user_id"], source="data/transactions.parquet")
def user_total_spend(df):
    return df.group_by("user_id").agg(
        pl.col("amount").sum().alias("total_spend")
    )


# Local development - fast iteration
defs = mlf.Definitions(
    name="user-features",
    features=[user_total_spend],
    offline_store=mlf.LocalStore("./dev_features")
)
defs.build()
```
Production Deployment¶
```python
import mlforge as mlf
import polars as pl
import os


@mlf.feature(keys=["user_id"], source="s3://my-bucket/data/transactions.parquet")
def user_total_spend(df):
    return df.group_by("user_id").agg(
        pl.col("amount").sum().alias("total_spend")
    )


# Production - S3 storage
environment = os.getenv("ENVIRONMENT", "dev")

defs = mlf.Definitions(
    name="user-features",
    features=[user_total_spend],
    offline_store=mlf.S3Store(
        bucket="mlforge-features",
        prefix=f"{environment}/features"  # dev/features or prod/features
    )
)
defs.build()
```
Multi-Environment Setup¶
```python
import mlforge as mlf
import os


def get_store():
    """Get the appropriate store based on environment."""
    env = os.getenv("ENVIRONMENT", "local")
    if env == "local":
        return mlf.LocalStore("./feature_store")
    elif env == "dev":
        return mlf.S3Store(bucket="mlforge-features", prefix="dev/features")
    elif env == "staging":
        return mlf.S3Store(bucket="mlforge-features", prefix="staging/features")
    elif env == "prod":
        return mlf.S3Store(bucket="mlforge-features-prod", prefix="features")
    else:
        raise ValueError(f"Unknown environment: {env}")


defs = mlf.Definitions(
    name="user-features",
    features=[...],
    offline_store=get_store()
)
```
Usage:
```shell
# Local development
ENVIRONMENT=local python build_features.py

# Dev environment
ENVIRONMENT=dev python build_features.py

# Production
ENVIRONMENT=prod python build_features.py
```
Performance Considerations¶
LocalStore¶
Pros:

- Fastest for small datasets (no network overhead)
- Simple setup (no credentials required)
- Works offline

Cons:

- Limited by local disk space
- Not suitable for distributed deployments
- No built-in backup/versioning
S3Store¶
Pros:

- Virtually unlimited storage
- High durability (99.999999999%)
- Built-in versioning (if enabled on the bucket)
- Accessible from anywhere

Cons:

- Network latency for read/write operations
- Data transfer costs for large datasets
- Requires AWS credentials
Optimization Tips¶
For S3Store:
- Use same region - Colocate compute and S3 bucket to minimize latency
- Enable S3 Transfer Acceleration - For cross-region access
- Batch operations - Materialize multiple features in one run
- Use IAM roles - Avoid credential management overhead
```python
# Good: batch materialization
defs.build()  # Builds all features

# Less efficient: individual feature builds
for feature_name in ["feature1", "feature2", "feature3"]:
    defs.build(feature_names=[feature_name])
```
Next Steps¶
- Building Features - Build features to storage
- Retrieving Features - Read features from storage
- Store API Reference - Detailed API documentation