Storage Backends

mlforge supports multiple storage backends for persisting built features. Choose the backend that best fits your deployment environment.

LocalStore

The default storage backend that writes features to the local filesystem as Parquet files.

Basic Usage

import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.LocalStore(path="./feature_store")
)

Configuration

import mlforge as mlf
from pathlib import Path

# Using string path
store = mlf.LocalStore(path="./feature_store")

# Using Path object
store = mlf.LocalStore(path=Path("./feature_store"))

# Custom location
store = mlf.LocalStore(path=Path.home() / "ml_projects" / "features")

Storage Format

Features are stored in a versioned directory structure (since v0.5.0):

feature_store/
├── user_total_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet          # Feature data
│   │   └── .meta.json             # Version metadata
│   ├── 1.0.1/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json               # Pointer to latest version
│   └── .gitignore                 # Auto-generated (ignores data.parquet)
├── user_avg_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json
│   └── .gitignore
└── product_popularity/
    ├── 1.0.0/
    │   ├── data.parquet
    │   └── .meta.json
    ├── _latest.json
    └── .gitignore

Key files:

  • data.parquet - Materialized feature data
  • .meta.json - Metadata for this version (schema, config, hashes, timestamps)
  • _latest.json - Pointer to the latest version (e.g., {"version": "1.0.1"})
  • .gitignore - Auto-generated file that ignores */data.parquet
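To make the pointer file concrete, here is a self-contained sketch (not the mlforge API; `resolve_latest` is a hypothetical helper) that builds a miniature store layout and resolves the latest version's data path by reading `_latest.json`:

```python
import json
import tempfile
from pathlib import Path

def resolve_latest(feature_dir: Path) -> Path:
    """Return the data.parquet path for the version named in _latest.json."""
    pointer = json.loads((feature_dir / "_latest.json").read_text())
    return feature_dir / pointer["version"] / "data.parquet"

# Build a miniature layout matching the structure above.
root = Path(tempfile.mkdtemp())
feature = root / "user_total_spend"
(feature / "1.0.1").mkdir(parents=True)
(feature / "_latest.json").write_text(json.dumps({"version": "1.0.1"}))

print(resolve_latest(feature).relative_to(root))
# user_total_spend/1.0.1/data.parquet
```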

Versioning:

  • Each build creates or updates a semantic version (e.g., 1.0.0, 1.0.1, 1.1.0)
  • Versions are automatically bumped based on detected changes:
    • MAJOR (2.0.0) - Columns removed or dtype changed
    • MINOR (1.1.0) - Columns added or config changed
    • PATCH (1.0.1) - Data refresh only
  • Use --version X.Y.Z to override automatic versioning
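The bump rules above can be sketched as a small pure-Python function (this illustrates the documented rules, not mlforge's actual change-detection implementation):

```python
def bump_version(current: str, *, columns_removed: bool = False,
                 dtype_changed: bool = False, columns_added: bool = False,
                 config_changed: bool = False) -> str:
    """Apply the documented bump rules to a semantic version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if columns_removed or dtype_changed:
        return f"{major + 1}.0.0"          # MAJOR: breaking schema change
    if columns_added or config_changed:
        return f"{major}.{minor + 1}.0"    # MINOR: additive change
    return f"{major}.{minor}.{patch + 1}"  # PATCH: data refresh only

print(bump_version("1.0.1", columns_removed=True))  # 2.0.0
print(bump_version("1.0.1", columns_added=True))    # 1.1.0
print(bump_version("1.0.1"))                        # 1.0.2
```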

Git Integration:

LocalStore automatically creates .gitignore files to exclude data files:

# Auto-generated by mlforge
# Data files are rebuilt from source; commit .meta.json and _latest.json only
*/data.parquet

This allows you to commit feature metadata to Git while excluding large data files that can be rebuilt from source using mlforge sync.
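The `*/data.parquet` pattern matches one directory level, so versioned data files are ignored while metadata files stay tracked. `pathlib` uses similar single-level `*` semantics, which makes the effect easy to demonstrate (a sketch only; full gitignore matching has additional rules not shown here):

```python
from pathlib import PurePath

pattern = "*/data.parquet"
print(PurePath("1.0.0/data.parquet").match(pattern))  # True: ignored by Git
print(PurePath("1.0.0/.meta.json").match(pattern))    # False: stays tracked
print(PurePath("_latest.json").match(pattern))        # False: stays tracked
```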

When to Use

  • Local development - Fast iteration and debugging
  • Small datasets - Features fit on local disk
  • Single machine deployments - No distributed infrastructure needed
  • CI/CD pipelines - Ephemeral feature stores for testing

S3Store

Cloud storage backend for Amazon S3, supporting distributed access and production deployments.

Basic Usage

import mlforge as mlf
import features

defs = mlf.Definitions(
    name="my-project",
    features=[features],
    offline_store=mlf.S3Store(
        bucket="mlforge-features",
        prefix="prod/features"
    )
)

Configuration

import mlforge as mlf

# With prefix (recommended for organization)
store = mlf.S3Store(
    bucket="my-bucket",
    prefix="prod/features"  # Features stored at s3://my-bucket/prod/features/
)

# Without prefix (bucket root)
store = mlf.S3Store(
    bucket="my-bucket",
    prefix=""  # Features stored at s3://my-bucket/
)

# With explicit region
store = mlf.S3Store(
    bucket="my-bucket",
    prefix="prod/features",
    region="us-west-2"
)
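The bucket and prefix combine to determine where objects land; a hypothetical helper (`object_uri` is not part of mlforge) shows the joining, including the empty-prefix case:

```python
def object_uri(bucket: str, prefix: str, key: str) -> str:
    """Join bucket, optional prefix, and object key into an s3:// URI."""
    parts = [part for part in (prefix.strip("/"), key) if part]
    return f"s3://{bucket}/" + "/".join(parts)

print(object_uri("my-bucket", "prod/features", "user_total_spend.parquet"))
# s3://my-bucket/prod/features/user_total_spend.parquet
print(object_uri("my-bucket", "", "user_total_spend.parquet"))
# s3://my-bucket/user_total_spend.parquet
```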

AWS Credentials

S3Store uses standard AWS credential resolution. Either export environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1

or configure a profile with the AWS CLI:

aws configure
# AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key: ****
# Default region name: us-east-1
# Default output format: json

When running on AWS services, use IAM roles instead of static credentials:

import mlforge as mlf

# No credentials needed - uses instance/task role
store = mlf.S3Store(bucket="mlforge-features", prefix="prod")

Storage Format

Features are written to S3 as Parquet files under the configured prefix:

s3://mlforge-features/prod/features/
├── user_total_spend.parquet
├── user_avg_spend.parquet
└── product_popularity.parquet

IAM Policy

Your AWS credentials need appropriate S3 permissions. Here's a minimal IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "mlforgeFeatureStoreAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::mlforge-features",
        "arn:aws:s3:::mlforge-features/*"
      ]
    }
  ]
}

Read-Only Access

For production environments where only feature retrieval is needed:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "mlforgeFeatureStoreReadOnly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::mlforge-features",
        "arn:aws:s3:::mlforge-features/*"
      ]
    }
  ]
}

Write-Only Access

For feature build pipelines that only write features:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "mlforgeFeatureStoreWriteOnly",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::mlforge-features",
        "arn:aws:s3:::mlforge-features/*"
      ]
    }
  ]
}

Prefix-Based Access Control

Restrict access to specific prefixes (e.g., prod/ vs dev/):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "mlforgeProductionListAccess",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::mlforge-features",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["prod/*"]
        }
      }
    },
    {
      "Sid": "mlforgeProductionObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::mlforge-features/prod/*"
    }
  ]
}

The policy uses two statements because the s3:prefix condition key only applies to s3:ListBucket; object-level actions are scoped by the /prod/* resource ARN instead. Attaching the prefix condition to GetObject/PutObject would cause those actions to never match, since object requests carry no s3:prefix key.

When to Use

  • Production deployments - Centralized, durable storage
  • Team collaboration - Shared feature store across multiple users
  • Large datasets - Features too large for local disk
  • Multi-environment workflows - Separate dev/staging/prod prefixes
  • CI/CD pipelines - Build features in one environment, use in another

Error Handling

import mlforge as mlf

try:
    store = mlf.S3Store(bucket="nonexistent-bucket", prefix="features")
except ValueError as e:
    print(f"Bucket error: {e}")
    # Bucket 'nonexistent-bucket' does not exist or is not accessible.
    # Ensure the bucket is created and credentials have appropriate permissions.

Common issues:

  1. Bucket doesn't exist - Create the bucket first using AWS Console or CLI
  2. Missing permissions - Verify IAM policy allows required S3 actions
  3. Wrong region - Specify region parameter if bucket is in non-default region
  4. Invalid credentials - Check AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
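These failure modes correspond to standard S3 error codes; a small hypothetical triage table (not part of mlforge) maps each code to the fix above:

```python
# Standard S3 error codes mapped to the remediation steps listed above.
REMEDIATION = {
    "NoSuchBucket": "Create the bucket first via the AWS Console or CLI.",
    "AccessDenied": "Verify the IAM policy allows the required S3 actions.",
    "PermanentRedirect": "Pass region= matching the bucket's actual region.",
    "InvalidAccessKeyId": "Check AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.",
}

def remediation_hint(error_code: str) -> str:
    """Return a remediation hint for a known S3 error code."""
    return REMEDIATION.get(error_code, "Unknown error; inspect the full response.")

print(remediation_hint("NoSuchBucket"))
```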

Complete Example

Local Development

import mlforge as mlf
import polars as pl

@mlf.feature(keys=["user_id"], source="data/transactions.parquet")
def user_total_spend(df):
    return df.group_by("user_id").agg(
        pl.col("amount").sum().alias("total_spend")
    )

# Local development - fast iteration
defs = mlf.Definitions(
    name="user-features",
    features=[user_total_spend],
    offline_store=mlf.LocalStore("./dev_features")
)

defs.build()

Production Deployment

import mlforge as mlf
import polars as pl
import os

@mlf.feature(keys=["user_id"], source="s3://my-bucket/data/transactions.parquet")
def user_total_spend(df):
    return df.group_by("user_id").agg(
        pl.col("amount").sum().alias("total_spend")
    )

# Production - S3 storage
environment = os.getenv("ENVIRONMENT", "dev")

defs = mlf.Definitions(
    name="user-features",
    features=[user_total_spend],
    offline_store=mlf.S3Store(
        bucket="mlforge-features",
        prefix=f"{environment}/features"  # dev/features or prod/features
    )
)

defs.build()

Multi-Environment Setup

import mlforge as mlf
import os

def get_store():
    """Get appropriate store based on environment."""
    env = os.getenv("ENVIRONMENT", "local")

    if env == "local":
        return mlf.LocalStore("./feature_store")
    elif env == "dev":
        return mlf.S3Store(bucket="mlforge-features", prefix="dev/features")
    elif env == "staging":
        return mlf.S3Store(bucket="mlforge-features", prefix="staging/features")
    elif env == "prod":
        return mlf.S3Store(bucket="mlforge-features-prod", prefix="features")
    else:
        raise ValueError(f"Unknown environment: {env}")

defs = mlf.Definitions(
    name="user-features",
    features=[...],
    offline_store=get_store()
)

Usage:

# Local development
ENVIRONMENT=local python build_features.py

# Dev environment
ENVIRONMENT=dev python build_features.py

# Production
ENVIRONMENT=prod python build_features.py

Performance Considerations

LocalStore

Pros:

  • Fastest for small datasets (no network overhead)
  • Simple setup (no credentials required)
  • Works offline

Cons:

  • Limited by local disk space
  • Not suitable for distributed deployments
  • No automatic backup or replication

S3Store

Pros:

  • Virtually unlimited storage
  • High durability (99.999999999%, "eleven nines")
  • Built-in object versioning (if enabled on the bucket)
  • Accessible from anywhere

Cons:

  • Network latency on every read and write
  • Data transfer costs for large datasets
  • Requires AWS credentials

Optimization Tips

For S3Store:

  1. Use same region - Colocate compute and S3 bucket to minimize latency
  2. Enable S3 Transfer Acceleration - For cross-region access
  3. Batch operations - Materialize multiple features in one run
  4. Use IAM roles - Avoid credential management overhead

# Good: Batch materialization
defs.build()  # Builds all features

# Less efficient: Individual feature builds
for feature_name in ["feature1", "feature2", "feature3"]:
    defs.build(feature_names=[feature_name])

Next Steps