Feature Metadata¶
mlforge automatically captures and stores metadata for every feature you build. This metadata provides visibility into your feature store, making it easier to understand what features exist, when they were last updated, and what columns they contain.
What is Captured¶
When you build a feature with mlforge build, the following metadata is automatically captured:
Feature Configuration¶
- Name: Feature identifier
- Entity: Primary entity key
- Keys: All entity key columns
- Timestamp: Temporal column (if applicable)
- Interval: Time interval for rolling aggregations (if applicable)
- Tags: Feature grouping tags
- Description: Human-readable description from the @feature decorator
Versioning (v0.5.0+)¶
- Version: Semantic version string (e.g., "1.0.0", "1.2.3")
- Created At: ISO 8601 timestamp of when this version was first created
- Source Hash: Hash of source data file for reproducibility verification
- Schema Hash: Hash of column names and types for schema change detection
- Config Hash: Hash of feature configuration for config change detection
- Content Hash: Hash of materialized data for content change detection
- Change Summary: Structured information about what changed:
  - bump_type: Type of version bump (initial/major/minor/patch)
  - reason: Why the version was bumped (e.g., "columns_added", "data_refresh")
  - details: Specific changes (e.g., list of added/removed columns)
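To make the idea of a schema hash concrete, here is a minimal sketch of how hashing column names and types can detect schema changes. It is not mlforge's actual hashing code: the `schema_hash` helper, the use of SHA-256, the order-independent canonical form, and the 12-character truncation are all assumptions made for illustration.

```python
import hashlib
import json


def schema_hash(columns: dict[str, str]) -> str:
    """Hypothetical helper: hash column names and dtypes (not mlforge's real algorithm)."""
    # Sort by column name so the hash ignores column order (an assumption).
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = {"merchant_id": "String", "amt": "Float64"}
v1_reordered = {"amt": "Float64", "merchant_id": "String"}
v2 = {"merchant_id": "String", "amt": "Float64", "currency": "String"}

# Same columns in a different order hash identically; an added column does not.
same_schema = schema_hash(v1) == schema_hash(v1_reordered)
changed = schema_hash(v1) != schema_hash(v2)
print(same_schema, changed)
```

Comparing the stored hash against a freshly computed one is enough to tell that something changed; working out what changed (for example, which columns were added) still requires comparing the column lists themselves.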
Storage Details¶
- Path: Location of the materialized feature file (versioned: feature_store/feature_name/version/data.parquet)
- Source: Path to the source data file
- Row Count: Number of rows in the materialized feature
- Last Updated: ISO 8601 timestamp of when this version was last built
Column Information¶
mlforge separates base columns from generated feature columns:
Base Columns¶
Base columns come from your feature function's output, before any metrics are applied:
- Name: Column name
- Type: Polars data type (e.g., Utf8, Float64, Date)
- Validators: Data quality validators applied to this column
Example base columns:
- Entity keys (e.g., user_id, merchant_id)
- Timestamp columns (e.g., transaction_date)
- Input columns for metrics (e.g., amount, quantity)
Feature Columns¶
Feature columns are generated by metrics (e.g., rolling aggregations):
- Name: Column name
- Type: Polars data type
- Input: Source column from base columns
- Aggregation: Type of aggregation (sum, count, mean, etc.)
- Window: Time window for rolling aggregation
Where Metadata is Stored¶
Metadata is stored in .meta.json files within each feature's version directory:
feature_store/
├── user_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json    # Metadata for v1.0.0
│   ├── 1.0.1/
│   │   ├── data.parquet
│   │   └── .meta.json    # Metadata for v1.0.1
│   ├── _latest.json      # Pointer: {"version": "1.0.1"}
│   └── .gitignore
├── merchant_spend/
│   ├── 1.0.0/
│   │   ├── data.parquet
│   │   └── .meta.json
│   ├── _latest.json
│   └── .gitignore
└── account_spend/
    ├── 2.0.0/
    │   ├── data.parquet
    │   └── .meta.json
    ├── _latest.json
    └── .gitignore
Each .meta.json file contains the complete metadata for one specific version of a feature in JSON format.
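The directory layout above can also be read with nothing but the standard library. The following is a minimal sketch, assuming only the layout shown (a `_latest.json` pointer with a `version` key next to the version directories). The `read_latest_metadata` helper and the sample values are hypothetical; in real code the `LocalStore` API is the supported way to read metadata.

```python
import json
import tempfile
from pathlib import Path


def read_latest_metadata(store_root: Path, feature: str) -> dict:
    """Hypothetical helper: follow _latest.json to the newest .meta.json."""
    feature_dir = store_root / feature
    pointer = json.loads((feature_dir / "_latest.json").read_text())
    meta_path = feature_dir / pointer["version"] / ".meta.json"
    return json.loads(meta_path.read_text())


# Build a tiny throwaway store so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    version_dir = root / "user_spend" / "1.0.1"
    version_dir.mkdir(parents=True)
    (version_dir / ".meta.json").write_text(
        json.dumps({"name": "user_spend", "version": "1.0.1", "row_count": 42})
    )
    (root / "user_spend" / "_latest.json").write_text(
        json.dumps({"version": "1.0.1"})
    )

    meta = read_latest_metadata(root, "user_spend")
    print(meta["name"], meta["version"], meta["row_count"])
```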
Viewing Metadata¶
Using the CLI¶
Inspect a Specific Feature¶
Use the inspect command to view detailed metadata for a feature:
This displays:
- Feature configuration
- Storage details
- Column information in a formatted table
- Tags and description
View All Features¶
Use the manifest command to see a summary of all features:
This shows a table with key metrics for each feature:
- Feature name
- Entity
- Row count
- Column count
- Last updated timestamp
Programmatically¶
You can also read metadata in your Python code:
from mlforge import LocalStore

store = LocalStore("./feature_store")

# Read metadata for a specific feature
metadata = store.read_metadata("user_spend")

if metadata:
    print(f"Feature: {metadata.name}")
    print(f"Rows: {metadata.row_count:,}")
    print(f"Last updated: {metadata.last_updated}")

    # Inspect base columns
    print("\nBase Columns:")
    for col in metadata.columns:
        print(f"  {col.name}: {col.dtype}")
        if col.validators:
            for validator in col.validators:
                print(f"    - {validator}")

    # Inspect feature columns
    print("\nGenerated Features:")
    for col in metadata.features:
        print(f"  {col.name}: {col.agg}({col.input}) over {col.window}")
Example Metadata Structure¶
Here's an example of how metadata is structured for a feature with validators and rolling metrics:
from mlforge import feature
from mlforge.metrics import Rolling
from mlforge.validators import not_null, greater_than_or_equal


@feature(
    keys=["merchant_id"],
    source="data/transactions.parquet",
    timestamp="transaction_date",
    interval="1d",
    metrics=[
        Rolling(
            windows=["7d", "30d"],
            aggregations={"amt": ["sum", "count", "mean"]},
        )
    ],
    validators={
        "amt": [not_null(), greater_than_or_equal(0)],
    },
)
def merchant_spend(df):
    # Return only the base columns; rolling metrics are generated from them
    return df.select(["merchant_id", "transaction_date", "amt"])
After building, the .meta.json file will contain:
{
  "name": "merchant_spend",
  "version": "1.0.0",
  "path": "feature_store/merchant_spend/1.0.0/data.parquet",
  "entity": "merchant_id",
  "keys": ["merchant_id"],
  "source": "data/transactions.parquet",
  "row_count": 10000,
  "created_at": "2024-01-16T08:30:00Z",
  "last_updated": "2024-01-16T08:30:00Z",
  "timestamp": "transaction_date",
  "interval": "1d",
  "source_hash": "abc123def456",
  "schema_hash": "789abc012def",
  "config_hash": "456def789abc",
  "content_hash": "012def456abc",
  "change_summary": {
    "bump_type": "initial",
    "reason": "first_build",
    "details": []
  },
  "columns": [
    {
      "name": "merchant_id",
      "dtype": "String"
    },
    {
      "name": "transaction_date",
      "dtype": "Datetime(time_unit='us', time_zone=None)"
    },
    {
      "name": "amt",
      "dtype": "Float64",
      "validators": [
        {"validator": "not_null"},
        {"validator": "greater_than_or_equal", "value": 0}
      ]
    }
  ],
  "features": [
    {
      "name": "merchant_spend__amt__sum__1d__7d",
      "dtype": "Float64",
      "input": "amt",
      "agg": "sum",
      "window": "7d"
    },
    {
      "name": "merchant_spend__amt__count__1d__7d",
      "dtype": "UInt32",
      "input": "amt",
      "agg": "count",
      "window": "7d"
    },
    {
      "name": "merchant_spend__amt__mean__1d__7d",
      "dtype": "Float64",
      "input": "amt",
      "agg": "mean",
      "window": "7d"
    }
  ]
}
Notice how:
- Base columns (keys, timestamp, input) are in the columns array with validator info
- Generated features (rolling metrics) are in the features array with aggregation details
- Validators include both the name and parameters (e.g., value: 0)
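The generated column names in this example appear to follow the pattern feature__input__agg__interval__window. That pattern is inferred from the example output only, not from a documented naming contract, so treat the sketch below as illustrative. The `FeatureColumn` type and `parse_feature_column` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class FeatureColumn(NamedTuple):
    """Parts of a generated column name (pattern inferred from the example)."""
    feature: str
    input: str
    agg: str
    interval: str
    window: str


def parse_feature_column(name: str) -> FeatureColumn:
    """Hypothetical helper: split a generated column name on double underscores."""
    parts = name.split("__")
    if len(parts) != 5:
        raise ValueError(f"unexpected column name format: {name!r}")
    return FeatureColumn(*parts)


parsed = parse_feature_column("merchant_spend__amt__sum__1d__7d")
print(parsed.agg, parsed.input, parsed.window)
```

Parsing names is brittle (it breaks if a feature or column name itself contains a double underscore), so prefer the input, agg, and window fields stored in the features array whenever the metadata is available.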
Consolidated Manifest¶
You can generate a consolidated manifest.json file that contains metadata for all features:
This creates a single JSON file with all feature metadata, useful for:
- Documentation generation
- Feature catalog UIs
- Integration with other tools
- Version control tracking
The manifest file structure:
{
  "version": "1.0",
  "generated_at": "2024-01-16T08:30:00Z",
  "features": {
    "user_spend": { ... },
    "merchant_spend": { ... },
    "account_spend": { ... }
  }
}
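As a rough sketch of consuming a manifest from another tool, the snippet below parses a manifest-shaped JSON string and builds a one-line summary per feature. It assumes that each entry under "features" has the same shape as a .meta.json file; the inline sample data is abbreviated and made up for illustration.

```python
import json

# Abbreviated, made-up manifest; real entries would carry the full metadata.
manifest_text = """
{
  "version": "1.0",
  "generated_at": "2024-01-16T08:30:00Z",
  "features": {
    "merchant_spend": {
      "entity": "merchant_id",
      "row_count": 10000,
      "columns": [{"name": "merchant_id"}, {"name": "transaction_date"}, {"name": "amt"}],
      "features": [{"name": "merchant_spend__amt__sum__1d__7d"}]
    }
  }
}
"""

manifest = json.loads(manifest_text)

rows = []
for name, meta in sorted(manifest["features"].items()):
    # Count base columns and generated feature columns together.
    n_columns = len(meta.get("columns", [])) + len(meta.get("features", []))
    rows.append((name, meta.get("entity"), meta.get("row_count"), n_columns))

for name, entity, row_count, n_columns in rows:
    print(f"{name:<20} {entity:<15} {row_count:>10,} {n_columns:>4}")
```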
Use Cases¶
Feature Discovery¶
Quickly understand what features exist in your store:
Debugging¶
Check when a feature was last built and how many rows it has:
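One way to do this without the CLI is to read the .meta.json files directly. The sketch below lists the row count and last-updated timestamp for every version of a single feature, assuming the versioned directory layout described earlier. The `version_history` helper and the sample values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def version_history(feature_dir: Path) -> list[dict]:
    """Hypothetical helper: collect key metadata from each version directory."""
    history = []
    for version_dir in feature_dir.iterdir():
        meta_path = version_dir / ".meta.json"
        if version_dir.is_dir() and meta_path.is_file():
            meta = json.loads(meta_path.read_text())
            history.append(
                {k: meta.get(k) for k in ("version", "row_count", "last_updated")}
            )
    # Sort numerically so that 1.10.0 comes after 1.9.0.
    history.sort(key=lambda m: tuple(int(p) for p in m["version"].split(".")))
    return history


# Build a throwaway feature directory with two made-up versions.
with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp) / "user_spend"
    samples = [
        ("1.10.0", 10000, "2024-01-16T08:30:00Z"),
        ("1.9.0", 9500, "2024-01-15T08:30:00Z"),
    ]
    for version, rows, updated in samples:
        d = feature_dir / version
        d.mkdir(parents=True)
        (d / ".meta.json").write_text(
            json.dumps({"version": version, "row_count": rows, "last_updated": updated})
        )
    (feature_dir / "_latest.json").write_text(json.dumps({"version": "1.10.0"}))

    history = version_history(feature_dir)
    for entry in history:
        print(entry["version"], entry["row_count"], entry["last_updated"])
```

A sudden drop in row count between consecutive versions is often the first sign of an upstream data problem.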
Monitoring¶
Track feature freshness by comparing last_updated timestamps:
from datetime import datetime, timedelta, timezone

from mlforge import LocalStore

store = LocalStore("./feature_store")

# Find stale features (not updated in the last 24 hours)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

for meta in store.list_metadata():
    # Replace the trailing "Z" so fromisoformat() also works before Python 3.11
    last_updated = datetime.fromisoformat(meta.last_updated.replace("Z", "+00:00"))
    if last_updated < cutoff:
        print(f"Stale feature: {meta.name} (last updated {meta.last_updated})")
Documentation¶
Generate feature catalogs from metadata:
from mlforge import LocalStore

store = LocalStore("./feature_store")

print("# Feature Catalog\n")

for meta in store.list_metadata():
    print(f"## {meta.name}")
    if meta.description:
        print(f"\n{meta.description}\n")
    print(f"- **Entity**: {meta.entity}")
    print(f"- **Rows**: {meta.row_count:,}")
    print(f"- **Base Columns**: {len(meta.columns)}")
    print(f"- **Features**: {len(meta.features)}")
    if meta.tags:
        print(f"- **Tags**: {', '.join(meta.tags)}")
    print()
Metadata Schema¶
See the Manifest API Reference for detailed JSON schema documentation.
Next Steps¶
- CLI Reference - inspect and manifest commands
- Manifest API - Programmatic access to metadata
- Building Features - How to build features that generate metadata