Feature Metadata

mlforge automatically captures and stores metadata for every feature you build. This metadata provides visibility into your feature store, making it easier to understand what features exist, when they were last updated, and what columns they contain.

What is Captured

When you build a feature with mlforge build, the following metadata is automatically captured:

Feature Configuration

  • Name: Feature identifier
  • Entity: Primary entity key
  • Keys: All entity key columns
  • Timestamp: Temporal column (if applicable)
  • Interval: Time interval for rolling aggregations (if applicable)
  • Tags: Feature grouping tags
  • Description: Human-readable description from the @feature decorator

Versioning (v0.5.0+)

  • Version: Semantic version string (e.g., "1.0.0", "1.2.3")
  • Created At: ISO 8601 timestamp of when this version was first created
  • Source Hash: Hash of source data file for reproducibility verification
  • Schema Hash: Hash of column names and types for schema change detection
  • Config Hash: Hash of feature configuration for config change detection
  • Content Hash: Hash of materialized data for content change detection
  • Change Summary: Structured information about what changed:
    • bump_type: Type of version bump (initial/major/minor/patch)
    • reason: Why the version was bumped (e.g., "columns_added", "data_refresh")
    • details: Specific changes (e.g., list of added/removed columns)
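The schema hash and the change details can be pictured with a small sketch. This is not mlforge's implementation: the helper names `schema_hash` and `schema_changes`, the use of SHA-256, the 12-character truncation, and the order-independent canonical form are all assumptions made for illustration.

```python
import hashlib
import json


def schema_hash(columns: dict[str, str]) -> str:
    """Hash column names and dtypes (hypothetical helper, not mlforge's API)."""
    # Sort the (name, dtype) pairs so that reordering columns alone
    # does not change the hash. This is an assumption of the sketch.
    canonical = json.dumps(sorted(columns.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def schema_changes(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List columns that were added, removed, or retyped between two schemas."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c] != new[c]),
    }


v1 = {"merchant_id": "String", "amt": "Float64"}
v2 = {"merchant_id": "String", "amt": "Float64", "currency": "String"}

print("schema changed:", schema_hash(v1) != schema_hash(v2))
print("details:", schema_changes(v1, v2))
```

Comparing a stored hash with a freshly computed one is a cheap way to detect that something changed; a diff like `schema_changes` is then what would populate a field such as `details`.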

Storage Details

  • Path: Location of the materialized feature file (versioned: feature_store/feature_name/version/data.parquet)
  • Source: Path to the source data file
  • Row Count: Number of rows in the materialized feature
  • Last Updated: ISO 8601 timestamp of when this version was last built

Column Information

mlforge separates base columns from generated feature columns:

Base Columns

Base columns come from your feature function's output, before any metrics are applied:

  • Name: Column name
  • Type: Polars data type (e.g., String, Float64, Date)
  • Validators: Data quality validators applied to this column

Example base columns:

  • Entity keys (e.g., user_id, merchant_id)
  • Timestamp columns (e.g., transaction_date)
  • Input columns for metrics (e.g., amount, quantity)

Feature Columns

Feature columns are generated by metrics (e.g., rolling aggregations):

  • Name: Column name
  • Type: Polars data type
  • Input: Source column from base columns
  • Aggregation: Type of aggregation (sum, count, mean, etc.)
  • Window: Time window for rolling aggregation

Where Metadata is Stored

Metadata is stored in .meta.json files within each feature's version directory:

feature_store/
├── user_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json       # Metadata for v1.0.0
│   ├── 1.0.1/
│   │   ├── data.parquet
│   │   └── .meta.json       # Metadata for v1.0.1
│   ├── _latest.json         # Pointer: {"version": "1.0.1"}
│   └── .gitignore
├── merchant_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json
│   └── .gitignore
└── account_spend/
    ├── 2.0.0/
    │   ├── data.parquet
    │   └── .meta.json
    ├── _latest.json
    └── .gitignore

Each .meta.json file contains the complete metadata for one specific version of a feature in JSON format.
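Because the layout is plain files, the latest metadata can also be read without mlforge. The sketch below assumes only what the tree above shows: a `_latest.json` pointer of the form `{"version": "..."}` next to the version directories. The function name `read_latest_metadata` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_latest_metadata(store_root: str, feature_name: str) -> Optional[dict]:
    """Follow _latest.json to the current version's .meta.json (hypothetical helper)."""
    feature_dir = Path(store_root) / feature_name
    pointer = feature_dir / "_latest.json"
    if not pointer.exists():
        return None  # the feature has not been built in this store

    # The pointer file holds the latest version string, e.g. {"version": "1.0.1"}.
    version = json.loads(pointer.read_text())["version"]
    meta_path = feature_dir / version / ".meta.json"
    if not meta_path.exists():
        return None
    return json.loads(meta_path.read_text())


meta = read_latest_metadata("./feature_store", "user_spend")
if meta is not None:
    print(meta["name"], meta["version"], meta["row_count"])
```

For everyday use, the CLI and `LocalStore` described below are the supported route; reading the files directly is mainly useful from tools that cannot import mlforge.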

Viewing Metadata

Using the CLI

Inspect a Specific Feature

Use the inspect command to view detailed metadata for a feature:

mlforge inspect user_spend

This displays:

  • Feature configuration
  • Storage details
  • Column information in a formatted table
  • Tags and description

View All Features

Use the manifest command to see a summary of all features:

mlforge manifest

This shows a table with key metrics for each feature:

  • Feature name
  • Entity
  • Row count
  • Column count
  • Last updated timestamp

Programmatically

You can also read metadata in your Python code:

from mlforge import LocalStore

store = LocalStore("./feature_store")

# Read metadata for a specific feature
metadata = store.read_metadata("user_spend")

if metadata:
    print(f"Feature: {metadata.name}")
    print(f"Rows: {metadata.row_count:,}")
    print(f"Last updated: {metadata.last_updated}")

    # Inspect base columns
    print("\nBase Columns:")
    for col in metadata.columns:
        print(f"  {col.name}: {col.dtype}")
        if col.validators:
            for validator in col.validators:
                print(f"    - {validator}")

    # Inspect feature columns
    print("\nGenerated Features:")
    for col in metadata.features:
        print(f"  {col.name}: {col.agg}({col.input}) over {col.window}")

Example Metadata Structure

Here's an example of how metadata is structured for a feature with validators and rolling metrics:

from mlforge import feature
from mlforge.metrics import Rolling
from mlforge.validators import not_null, greater_than_or_equal

@feature(
    keys=["merchant_id"],
    source="data/transactions.parquet",
    timestamp="transaction_date",
    interval="1d",
    metrics=[
        Rolling(
            windows=["7d", "30d"],
            aggregations={"amt": ["sum", "count", "mean"]}
        )
    ],
    validators={
        "amt": [not_null(), greater_than_or_equal(0)]
    }
)
def merchant_spend(df):
    return df.select(["merchant_id", "transaction_date", "amt"])

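After building, the .meta.json file will contain metadata like the following (only the 7d window features are shown here; the 30d window would add corresponding entries):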

{
  "name": "merchant_spend",
  "version": "1.0.0",
  "path": "feature_store/merchant_spend/1.0.0/data.parquet",
  "entity": "merchant_id",
  "keys": ["merchant_id"],
  "source": "data/transactions.parquet",
  "row_count": 10000,
  "created_at": "2024-01-16T08:30:00Z",
  "last_updated": "2024-01-16T08:30:00Z",
  "timestamp": "transaction_date",
  "interval": "1d",
  "source_hash": "abc123def456",
  "schema_hash": "789abc012def",
  "config_hash": "456def789abc",
  "content_hash": "012def456abc",
  "change_summary": {
    "bump_type": "initial",
    "reason": "first_build",
    "details": []
  },
  "columns": [
    {
      "name": "merchant_id",
      "dtype": "String"
    },
    {
      "name": "transaction_date",
      "dtype": "Datetime(time_unit='us', time_zone=None)"
    },
    {
      "name": "amt",
      "dtype": "Float64",
      "validators": [
        {"validator": "not_null"},
        {"validator": "greater_than_or_equal", "value": 0}
      ]
    }
  ],
  "features": [
    {
      "name": "merchant_spend__amt__sum__1d__7d",
      "dtype": "Float64",
      "input": "amt",
      "agg": "sum",
      "window": "7d"
    },
    {
      "name": "merchant_spend__amt__count__1d__7d",
      "dtype": "UInt32",
      "input": "amt",
      "agg": "count",
      "window": "7d"
    },
    {
      "name": "merchant_spend__amt__mean__1d__7d",
      "dtype": "Float64",
      "input": "amt",
      "agg": "mean",
      "window": "7d"
    }
  ]
}

Notice how:

  • Base columns (keys, timestamp, input) are in the columns array with validator info
  • Generated features (rolling metrics) are in the features array with aggregation details
  • Validators include both the name and parameters (e.g., value: 0)
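The generated column names in this example appear to follow the pattern feature__input__aggregation__interval__window. That pattern is inferred from the example output only, not from a documented guarantee, so the parser below is a sketch; `FeatureColumnName` and `parse_feature_column` are hypothetical names. Where possible, prefer the `input`, `agg`, and `window` fields stored in the metadata over parsing names.

```python
from typing import NamedTuple


class FeatureColumnName(NamedTuple):
    feature: str
    input_column: str
    agg: str
    interval: str
    window: str


def parse_feature_column(name: str) -> FeatureColumnName:
    """Split a generated column name on '__' (pattern assumed from the example)."""
    # rsplit keeps any extra '__' inside the feature name itself intact.
    parts = name.rsplit("__", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected feature column name: {name!r}")
    return FeatureColumnName(*parts)


parsed = parse_feature_column("merchant_spend__amt__sum__1d__7d")
print(parsed.input_column, parsed.agg, parsed.window)  # amt sum 7d
```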

Consolidated Manifest

You can generate a consolidated manifest.json file that contains metadata for all features:

mlforge manifest --regenerate

This creates a single JSON file with all feature metadata, useful for:

  • Documentation generation
  • Feature catalog UIs
  • Integration with other tools
  • Version control tracking

The manifest file structure:

{
  "version": "1.0",
  "generated_at": "2024-01-16T08:30:00Z",
  "features": {
    "user_spend": { ... },
    "merchant_spend": { ... },
    "account_spend": { ... }
  }
}
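A downstream tool could summarize a loaded manifest along these lines. The sketch assumes each entry under "features" has the same shape as a .meta.json file (the entries are elided above, so this is a guess), and that "column count" means base columns plus generated features. `summarize_manifest` is a hypothetical helper, and the sample dictionary stands in for a manifest parsed with `json.load`.

```python
def summarize_manifest(manifest: dict) -> list[tuple[str, int, int]]:
    """Return (name, row_count, column_count) per feature, sorted by name."""
    rows = []
    for name, meta in sorted(manifest.get("features", {}).items()):
        # Assumption: total columns = base columns + generated feature columns.
        n_cols = len(meta.get("columns", [])) + len(meta.get("features", []))
        rows.append((name, meta.get("row_count", 0), n_cols))
    return rows


# In-memory stand-in for a parsed manifest.json (illustrative values only).
manifest = {
    "version": "1.0",
    "generated_at": "2024-01-16T08:30:00Z",
    "features": {
        "merchant_spend": {
            "row_count": 10000,
            "columns": [{"name": "merchant_id"}, {"name": "amt"}],
            "features": [{"name": "merchant_spend__amt__sum__1d__7d"}],
        }
    },
}

for name, row_count, n_cols in summarize_manifest(manifest):
    print(f"{name}: {row_count:,} rows, {n_cols} columns")
```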

Use Cases

Feature Discovery

Quickly understand what features exist in your store:

mlforge manifest

Debugging

Check when a feature was last built and how many rows it has:

mlforge inspect user_spend

Monitoring

Track feature freshness by comparing last_updated timestamps:

from mlforge import LocalStore
from datetime import datetime, timezone, timedelta

store = LocalStore("./feature_store")

# Find stale features (not updated in last 24 hours)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

for meta in store.list_metadata():
    last_updated = datetime.fromisoformat(meta.last_updated.replace('Z', '+00:00'))
    if last_updated < cutoff:
        print(f"Stale feature: {meta.name} (last updated {meta.last_updated})")

Documentation

Generate feature catalogs from metadata:

from mlforge import LocalStore

store = LocalStore("./feature_store")

print("# Feature Catalog\n")

for meta in store.list_metadata():
    print(f"## {meta.name}")
    if meta.description:
        print(f"\n{meta.description}\n")
    print(f"- **Entity**: {meta.entity}")
    print(f"- **Rows**: {meta.row_count:,}")
    print(f"- **Base Columns**: {len(meta.columns)}")
    print(f"- **Features**: {len(meta.features)}")
    if meta.tags:
        print(f"- **Tags**: {', '.join(meta.tags)}")
    print()

Metadata Schema

See the Manifest API Reference for detailed JSON schema documentation.

Next Steps