Manifest API

The manifest module provides dataclasses and utilities for tracking feature metadata.

Overview

When features are built, mlforge automatically captures metadata about each feature, including:

  • Feature configuration (keys, timestamp, interval)
  • Storage details (path, row count)
  • Column information (names, types, aggregations)
  • Build timestamp and source data

This metadata is stored in .meta.json files alongside the feature parquet files and can be queried using the CLI or programmatically.

Dataclasses

mlforge.manifest.ColumnMetadata dataclass

Metadata for a single column in a feature.

For columns derived from Rolling metrics, captures the source column, aggregation type, and window size. For other columns, captures dtype. For base columns, captures validator information.

Attributes:

  • name (str): Column name in the output
  • dtype (str | None): Data type string (e.g., "Int64", "Float64")
  • input (str | None): Source column name for aggregations
  • agg (str | None): Aggregation type (count, mean, sum, etc.)
  • window (str | None): Time window for rolling aggregations (e.g., "7d")
  • validators (list[dict[str, Any]] | None): List of validator specifications applied to this column

Source code in src/mlforge/manifest.py
@dataclass
class ColumnMetadata:
    """
    Metadata for a single column in a feature.

    For columns derived from Rolling metrics, captures the source column,
    aggregation type, and window size. For other columns, captures dtype.
    For base columns, captures validator information.

    Attributes:
        name: Column name in the output
        dtype: Data type string (e.g., "Int64", "Float64")
        input: Source column name for aggregations
        agg: Aggregation type (count, mean, sum, etc.)
        window: Time window for rolling aggregations (e.g., "7d")
        validators: List of validator specifications applied to this column
    """

    name: str
    dtype: str | None = None
    input: str | None = None
    agg: str | None = None
    window: str | None = None
    validators: list[dict[str, Any]] | None = None

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary, excluding None values."""
        return {k: v for k, v in asdict(self).items() if v is not None}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> ColumnMetadata:
        """Create from dictionary."""
        return cls(
            name=data["name"],
            dtype=data.get("dtype"),
            input=data.get("input"),
            agg=data.get("agg"),
            window=data.get("window"),
            validators=data.get("validators"),
        )

from_dict classmethod

from_dict(data: dict[str, Any]) -> ColumnMetadata

Create from dictionary.

Source code in src/mlforge/manifest.py
@classmethod
def from_dict(cls, data: dict[str, Any]) -> ColumnMetadata:
    """Create from dictionary."""
    return cls(
        name=data["name"],
        dtype=data.get("dtype"),
        input=data.get("input"),
        agg=data.get("agg"),
        window=data.get("window"),
        validators=data.get("validators"),
    )

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary, excluding None values.

Source code in src/mlforge/manifest.py
def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary, excluding None values."""
    return {k: v for k, v in asdict(self).items() if v is not None}
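
A round trip through to_dict and from_dict can be sketched as follows. So that it runs without mlforge installed, the block uses a trimmed local copy of the dataclass shown above (with Optional in place of the `| None` syntax); the column values are illustrative only.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ColumnMetadata:
    """Trimmed local copy of the dataclass above, for illustration only."""

    name: str
    dtype: Optional[str] = None
    input: Optional[str] = None
    agg: Optional[str] = None
    window: Optional[str] = None
    validators: Optional[list] = None

    def to_dict(self) -> dict:
        # Fields left at None are omitted from the serialized form.
        return {k: v for k, v in asdict(self).items() if v is not None}

    @classmethod
    def from_dict(cls, data: dict) -> "ColumnMetadata":
        # Only "name" is required; every other key falls back to None.
        return cls(
            name=data["name"],
            dtype=data.get("dtype"),
            input=data.get("input"),
            agg=data.get("agg"),
            window=data.get("window"),
            validators=data.get("validators"),
        )


base = ColumnMetadata(name="merchant_id", dtype="Utf8")
rolling = ColumnMetadata(
    name="amt__sum__7d", dtype="Float64", input="amt", agg="sum", window="7d"
)

base_dict = base.to_dict()        # only name and dtype survive
rolling_dict = rolling.to_dict()  # validators is None, so it is dropped
restored = ColumnMetadata.from_dict(rolling_dict)
```

Because from_dict fills missing keys with None, dropping None values on the way out loses no information on the way back in.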

mlforge.manifest.FeatureMetadata dataclass

Metadata for a single materialized feature.

Captures all information about a feature from both its definition and the results of building it.

Attributes:

  • name (str): Feature identifier
  • path (str): Storage path for the parquet file
  • entity (str): Primary entity key (first key in keys list)
  • keys (list[str]): All entity key columns
  • source (str): Source data file path
  • row_count (int): Number of rows in materialized feature
  • updated_at (str): ISO 8601 timestamp of last build (renamed from last_updated in v0.5.0)
  • version (str): Semantic version string (added in v0.5.0)
  • created_at (str): ISO 8601 timestamp when version was first created (added in v0.5.0)
  • content_hash (str): Hash of data.parquet for integrity verification (added in v0.5.0)
  • schema_hash (str): Hash of column names + dtypes for change detection (added in v0.5.0)
  • config_hash (str): Hash of keys, timestamp, interval, metrics config (added in v0.5.0)
  • source_hash (str): Hash of source data file for reproducibility verification (added in v0.5.0)
  • timestamp (str | None): Timestamp column for temporal features
  • interval (str | None): Time interval for rolling aggregations
  • columns (list[ColumnMetadata]): Base column metadata (from feature function before metrics)
  • features (list[ColumnMetadata]): Generated feature column metadata (from metrics)
  • tags (list[str]): Feature grouping tags
  • description (str | None): Human-readable description
  • change_summary (dict[str, Any] | None): Documents why version was bumped (added in v0.5.0)
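
The hash fields exist so that a build can tell what changed between versions. This page does not state which hash function or serialization mlforge uses, so the block below is only a sketch of the idea behind schema_hash: it assumes SHA-256 over a key-sorted JSON encoding, and the helper name schema_fingerprint is hypothetical.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hypothetical helper: hash column names and dtypes in a stable order."""
    # Sorting keys makes the digest independent of dict insertion order.
    payload = json.dumps(schema, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


before = schema_fingerprint({"merchant_id": "Utf8", "amt__sum__7d": "Float64"})
reordered = schema_fingerprint({"amt__sum__7d": "Float64", "merchant_id": "Utf8"})
retyped = schema_fingerprint({"merchant_id": "Utf8", "amt__sum__7d": "Float32"})

# Same names and dtypes give the same digest; a dtype change gives a new one.
schema_changed = before != retyped
```

Comparing a stored digest with a freshly computed one is cheaper than diffing two schemas field by field, which is presumably why the metadata keeps one hash per concern.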

Source code in src/mlforge/manifest.py
@dataclass
class FeatureMetadata:
    """
    Metadata for a single materialized feature.

    Captures all information about a feature from both its definition
    and the results of building it.

    Attributes:
        name: Feature identifier
        path: Storage path for the parquet file
        entity: Primary entity key (first key in keys list)
        keys: All entity key columns
        source: Source data file path
        row_count: Number of rows in materialized feature
        updated_at: ISO 8601 timestamp of last build (renamed from last_updated in v0.5.0)
        version: Semantic version string (v0.5.0)
        created_at: ISO 8601 timestamp when version was first created (v0.5.0)
        content_hash: Hash of data.parquet for integrity verification (v0.5.0)
        schema_hash: Hash of column names + dtypes for change detection (v0.5.0)
        config_hash: Hash of keys, timestamp, interval, metrics config (v0.5.0)
        source_hash: Hash of source data file for reproducibility verification (v0.5.0)
        timestamp: Timestamp column for temporal features
        interval: Time interval for rolling aggregations
        columns: Base column metadata (from feature function before metrics)
        features: Generated feature column metadata (from metrics)
        tags: Feature grouping tags
        description: Human-readable description
        change_summary: Documents why version was bumped (v0.5.0)
    """

    name: str
    path: str
    entity: str
    keys: list[str]
    source: str
    row_count: int
    updated_at: str

    # v0.5.0: New required fields for versioning
    version: str = "1.0.0"
    created_at: str = ""
    content_hash: str = ""
    schema_hash: str = ""
    config_hash: str = ""
    source_hash: str = ""

    # Existing optional fields
    timestamp: str | None = None
    interval: str | None = None
    columns: list[ColumnMetadata] = field(default_factory=list)
    features: list[ColumnMetadata] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    description: str | None = None

    # v0.5.0: New optional field
    change_summary: dict[str, Any] | None = None

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        result: dict[str, Any] = {
            "name": self.name,
            "version": self.version,
            "path": self.path,
            "entity": self.entity,
            "keys": self.keys,
            "source": self.source,
            "row_count": self.row_count,
            "created_at": self.created_at,
            "updated_at": self.updated_at,
            "content_hash": self.content_hash,
            "schema_hash": self.schema_hash,
            "config_hash": self.config_hash,
            "source_hash": self.source_hash,
        }
        if self.timestamp:
            result["timestamp"] = self.timestamp
        if self.interval:
            result["interval"] = self.interval
        if self.columns:
            result["columns"] = [col.to_dict() for col in self.columns]
        if self.features:
            result["features"] = [col.to_dict() for col in self.features]
        if self.tags:
            result["tags"] = self.tags
        if self.description:
            result["description"] = self.description
        if self.change_summary:
            result["change_summary"] = self.change_summary
        return result

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> FeatureMetadata:
        """Create from dictionary."""
        columns = [ColumnMetadata.from_dict(c) for c in data.get("columns", [])]
        features = [
            ColumnMetadata.from_dict(c) for c in data.get("features", [])
        ]

        # Handle backward compatibility: last_updated → updated_at
        updated_at = data.get("updated_at") or data.get("last_updated", "")

        return cls(
            name=data["name"],
            version=data.get("version", "1.0.0"),
            path=data["path"],
            entity=data["entity"],
            keys=data["keys"],
            source=data["source"],
            row_count=data["row_count"],
            created_at=data.get("created_at", ""),
            updated_at=updated_at,
            content_hash=data.get("content_hash", ""),
            schema_hash=data.get("schema_hash", ""),
            config_hash=data.get("config_hash", ""),
            source_hash=data.get("source_hash", ""),
            timestamp=data.get("timestamp"),
            interval=data.get("interval"),
            columns=columns,
            features=features,
            tags=data.get("tags", []),
            description=data.get("description"),
            change_summary=data.get("change_summary"),
        )

from_dict classmethod

from_dict(data: dict[str, Any]) -> FeatureMetadata

Create from dictionary.

Source code in src/mlforge/manifest.py
@classmethod
def from_dict(cls, data: dict[str, Any]) -> FeatureMetadata:
    """Create from dictionary."""
    columns = [ColumnMetadata.from_dict(c) for c in data.get("columns", [])]
    features = [
        ColumnMetadata.from_dict(c) for c in data.get("features", [])
    ]

    # Handle backward compatibility: last_updated → updated_at
    updated_at = data.get("updated_at") or data.get("last_updated", "")

    return cls(
        name=data["name"],
        version=data.get("version", "1.0.0"),
        path=data["path"],
        entity=data["entity"],
        keys=data["keys"],
        source=data["source"],
        row_count=data["row_count"],
        created_at=data.get("created_at", ""),
        updated_at=updated_at,
        content_hash=data.get("content_hash", ""),
        schema_hash=data.get("schema_hash", ""),
        config_hash=data.get("config_hash", ""),
        source_hash=data.get("source_hash", ""),
        timestamp=data.get("timestamp"),
        interval=data.get("interval"),
        columns=columns,
        features=features,
        tags=data.get("tags", []),
        description=data.get("description"),
        change_summary=data.get("change_summary"),
    )
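
The backward-compatibility line in from_dict is easy to misread, so here is a small sketch that isolates it. The helper name resolve_updated_at is hypothetical; the expression inside is the one from the source above.

```python
from typing import Any


def resolve_updated_at(data: dict) -> Any:
    """Hypothetical helper isolating the fallback used in from_dict."""
    # Prefer the current key, then the pre-v0.5.0 key, then an empty string.
    return data.get("updated_at") or data.get("last_updated", "")


current = resolve_updated_at({"updated_at": "2024-01-16T08:30:00Z"})
legacy = resolve_updated_at({"last_updated": "2023-12-01T00:00:00Z"})
missing = resolve_updated_at({})

# `or` tests truthiness, so an empty updated_at also falls through to the legacy key.
both = resolve_updated_at(
    {"updated_at": "", "last_updated": "2023-12-01T00:00:00Z"}
)
```

Note that neither key is required: a file with no timestamp at all still loads, with updated_at set to an empty string.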

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/mlforge/manifest.py
def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    result: dict[str, Any] = {
        "name": self.name,
        "version": self.version,
        "path": self.path,
        "entity": self.entity,
        "keys": self.keys,
        "source": self.source,
        "row_count": self.row_count,
        "created_at": self.created_at,
        "updated_at": self.updated_at,
        "content_hash": self.content_hash,
        "schema_hash": self.schema_hash,
        "config_hash": self.config_hash,
        "source_hash": self.source_hash,
    }
    if self.timestamp:
        result["timestamp"] = self.timestamp
    if self.interval:
        result["interval"] = self.interval
    if self.columns:
        result["columns"] = [col.to_dict() for col in self.columns]
    if self.features:
        result["features"] = [col.to_dict() for col in self.features]
    if self.tags:
        result["tags"] = self.tags
    if self.description:
        result["description"] = self.description
    if self.change_summary:
        result["change_summary"] = self.change_summary
    return result

mlforge.manifest.Manifest dataclass

Consolidated manifest containing all feature metadata.

Aggregates individual feature metadata into a single view. Generated on demand from per-feature .meta.json files.

Attributes:

  • version (str): Schema version for compatibility
  • generated_at (str): ISO 8601 timestamp when manifest was generated
  • features (dict[str, FeatureMetadata]): Mapping of feature names to their metadata
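
The default for generated_at is built by replacing the "+00:00" UTC offset in an ISO 8601 string with "Z". A short sketch with a fixed datetime (chosen here only so the result is reproducible) shows the resulting format and how to parse it back:

```python
from datetime import datetime, timezone

# A fixed instant stands in for datetime.now(timezone.utc).
fixed = datetime(2024, 1, 16, 8, 30, tzinfo=timezone.utc)
stamp = fixed.isoformat().replace("+00:00", "Z")

# Swapping "Z" back to an explicit offset keeps parsing portable
# across Python versions whose fromisoformat() rejects the "Z" suffix.
parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
```

With datetime.now() the string will usually also carry microseconds, since isoformat() includes them whenever they are non-zero.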

Source code in src/mlforge/manifest.py
@dataclass
class Manifest:
    """
    Consolidated manifest containing all feature metadata.

    Aggregates individual feature metadata into a single view.
    Generated on demand from per-feature .meta.json files.

    Attributes:
        version: Schema version for compatibility
        generated_at: ISO 8601 timestamp when manifest was generated
        features: Mapping of feature names to their metadata
    """

    version: str = "1.0"
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc)
        .isoformat()
        .replace("+00:00", "Z")
    )
    features: dict[str, FeatureMetadata] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "version": self.version,
            "generated_at": self.generated_at,
            "features": {
                name: meta.to_dict() for name, meta in self.features.items()
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> Manifest:
        """Create from dictionary."""
        features = {
            name: FeatureMetadata.from_dict(meta)
            for name, meta in data.get("features", {}).items()
        }
        return cls(
            version=data.get("version", "1.0"),
            generated_at=data.get("generated_at", ""),
            features=features,
        )

    def add_feature(self, metadata: FeatureMetadata) -> None:
        """Add or update a feature in the manifest."""
        self.features[metadata.name] = metadata

    def remove_feature(self, name: str) -> None:
        """Remove a feature from the manifest."""
        self.features.pop(name, None)

    def get_feature(self, name: str) -> FeatureMetadata | None:
        """Get metadata for a specific feature."""
        return self.features.get(name)

add_feature

add_feature(metadata: FeatureMetadata) -> None

Add or update a feature in the manifest.

Source code in src/mlforge/manifest.py
def add_feature(self, metadata: FeatureMetadata) -> None:
    """Add or update a feature in the manifest."""
    self.features[metadata.name] = metadata

from_dict classmethod

from_dict(data: dict[str, Any]) -> Manifest

Create from dictionary.

Source code in src/mlforge/manifest.py
@classmethod
def from_dict(cls, data: dict[str, Any]) -> Manifest:
    """Create from dictionary."""
    features = {
        name: FeatureMetadata.from_dict(meta)
        for name, meta in data.get("features", {}).items()
    }
    return cls(
        version=data.get("version", "1.0"),
        generated_at=data.get("generated_at", ""),
        features=features,
    )

get_feature

get_feature(name: str) -> FeatureMetadata | None

Get metadata for a specific feature.

Source code in src/mlforge/manifest.py
def get_feature(self, name: str) -> FeatureMetadata | None:
    """Get metadata for a specific feature."""
    return self.features.get(name)

remove_feature

remove_feature(name: str) -> None

Remove a feature from the manifest.

Source code in src/mlforge/manifest.py
def remove_feature(self, name: str) -> None:
    """Remove a feature from the manifest."""
    self.features.pop(name, None)

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/mlforge/manifest.py
def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "version": self.version,
        "generated_at": self.generated_at,
        "features": {
            name: meta.to_dict() for name, meta in self.features.items()
        },
    }
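
The three feature-handling methods are thin wrappers over a dict, which gives them some behavior worth knowing. The sketch below copies their bodies into a stand-in class so it runs without mlforge; FeatureStub and ManifestSketch are hypothetical names, and FeatureStub carries only the fields the example needs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FeatureStub:
    """Stand-in for FeatureMetadata with just enough fields for the sketch."""

    name: str
    row_count: int


@dataclass
class ManifestSketch:
    """Trimmed copy of Manifest's feature-handling methods."""

    features: dict = field(default_factory=dict)

    def add_feature(self, metadata: FeatureStub) -> None:
        self.features[metadata.name] = metadata

    def remove_feature(self, name: str) -> None:
        self.features.pop(name, None)

    def get_feature(self, name: str) -> Optional[FeatureStub]:
        return self.features.get(name)


manifest = ManifestSketch()
manifest.add_feature(FeatureStub("user_spend", 100))
manifest.add_feature(FeatureStub("user_spend", 250))  # same name: entry is replaced

manifest.remove_feature("not_there")             # unknown name: silently ignored
lookup = manifest.get_feature("merchant_spend")  # unknown name: returns None
```

Keying on metadata.name means the manifest holds at most one entry per feature, and neither removal nor lookup raises for an unknown name.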

Functions

mlforge.manifest.derive_column_metadata

derive_column_metadata(
    feature: Feature,
    schema: dict[str, str],
    base_schema: dict[str, str] | None = None,
    schema_source: str = "polars",
) -> tuple[list[ColumnMetadata], list[ColumnMetadata]]

Derive column metadata from feature definition and result schema.

Separates base columns (keys, timestamp, other non-metric columns) from generated feature columns (rolling metrics). Uses base_schema when available for accurate separation; otherwise falls back to regex parsing of column names for backward compatibility.

Parameters:

  • feature (Feature, required): The Feature definition object
  • schema (dict[str, str], required): Dictionary mapping column names to dtype strings (final schema after metrics)
  • base_schema (dict[str, str] | None, default None): Dictionary mapping column names to dtype strings (before metrics). When provided, enables accurate column separation.
  • schema_source (str, default 'polars'): Engine source for type normalization ("polars" or "duckdb")

Returns:

  • tuple[list[ColumnMetadata], list[ColumnMetadata]]: Tuple of (base_columns, feature_columns), where base_columns holds keys, timestamp, and other non-metric columns with validators, and feature_columns holds rolling metric columns with aggregation metadata

Source code in src/mlforge/manifest.py
def derive_column_metadata(
    feature: Feature,
    schema: dict[str, str],
    base_schema: dict[str, str] | None = None,
    schema_source: str = "polars",
) -> tuple[list[ColumnMetadata], list[ColumnMetadata]]:
    """
    Derive column metadata from feature definition and result schema.

    Separates base columns (keys, timestamp, other non-metric columns) from
    generated feature columns (rolling metrics). Uses base_schema when available
    for accurate separation, falls back to regex parsing for backward compatibility.

    Args:
        feature: The Feature definition object
        schema: Dictionary mapping column names to dtype strings (final schema after metrics)
        base_schema: Dictionary mapping column names to dtype strings (before metrics).
            When provided, enables accurate column separation. Defaults to None.
        schema_source: Engine source for type normalization ("polars" or "duckdb")

    Returns:
        Tuple of (base_columns, feature_columns) where:
            - base_columns: Keys, timestamp, and other non-metric columns with validators
            - feature_columns: Rolling metric columns with aggregation metadata
    """
    if base_schema:
        return _derive_with_base_schema(
            feature, schema, base_schema, schema_source
        )
    return _derive_legacy(feature, schema, schema_source)
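
The body of _derive_with_base_schema is not shown on this page. As a rough guess at why a base schema makes the separation accurate, the sketch below treats any column already present before metrics as a base column and everything else as generated. The helper name split_by_base_schema is hypothetical and this is not mlforge's implementation.

```python
def split_by_base_schema(schema: dict, base_schema: dict) -> tuple:
    """Hypothetical helper: partition final columns by presence in the base schema."""
    # Columns that existed before metrics were applied are base columns.
    base = [name for name in schema if name in base_schema]
    # Everything else must have been produced by a metric.
    generated = [name for name in schema if name not in base_schema]
    return base, generated


final_schema = {
    "merchant_id": "Utf8",
    "transaction_date": "Date",
    "amt__sum__7d": "Float64",
}
pre_metric_schema = {"merchant_id": "Utf8", "transaction_date": "Date"}

base_names, generated_names = split_by_base_schema(final_schema, pre_metric_schema)
```

A set-membership test like this needs no assumptions about column names, whereas a name-based fallback can misclassify a base column that happens to look like a metric.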

mlforge.manifest.write_metadata_file

write_metadata_file(
    path: Path, metadata: FeatureMetadata
) -> None

Write feature metadata to a JSON file.

Parameters:

  • path (Path, required): Path to write the .meta.json file
  • metadata (FeatureMetadata, required): FeatureMetadata to serialize

Source code in src/mlforge/manifest.py
def write_metadata_file(path: Path, metadata: FeatureMetadata) -> None:
    """
    Write feature metadata to a JSON file.

    Args:
        path: Path to write the .meta.json file
        metadata: FeatureMetadata to serialize
    """
    with open(path, "w") as f:
        json.dump(metadata.to_dict(), f, indent=2)

mlforge.manifest.read_metadata_file

read_metadata_file(path: Path) -> FeatureMetadata | None

Read feature metadata from a JSON file.

Parameters:

  • path (Path, required): Path to the .meta.json file

Returns:

  • FeatureMetadata | None: FeatureMetadata if file exists and is valid, None otherwise

Source code in src/mlforge/manifest.py
def read_metadata_file(path: Path) -> FeatureMetadata | None:
    """
    Read feature metadata from a JSON file.

    Args:
        path: Path to the .meta.json file

    Returns:
        FeatureMetadata if file exists and is valid, None otherwise
    """
    if not path.exists():
        return None

    try:
        with open(path) as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        logger.warning(f"Invalid JSON in {path}: {e}")
        return None

    try:
        return FeatureMetadata.from_dict(data)
    except KeyError as e:
        logger.warning(f"Schema mismatch in {path}: missing key {e}")
        return None
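
The reader returns None in several different situations rather than raising. The sketch below reproduces the first two checks (missing file, invalid JSON) in a hypothetical helper, load_json_or_none, and exercises them against temporary files so it runs without mlforge.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_json_or_none(path: Path) -> Optional[dict]:
    """Hypothetical helper mirroring the first two checks in read_metadata_file."""
    if not path.exists():
        return None
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    # Case 1: the file does not exist.
    missing = load_json_or_none(Path(tmp) / "absent.meta.json")

    # Case 2: the file exists but is not valid JSON.
    broken_path = Path(tmp) / "broken.meta.json"
    broken_path.write_text("{not valid json")
    broken = load_json_or_none(broken_path)

    # Case 3: the file parses, so its contents are returned.
    good_path = Path(tmp) / "good.meta.json"
    good_path.write_text(json.dumps({"name": "user_spend"}, indent=2))
    good = load_json_or_none(good_path)
```

In the real function a file that parses but lacks a required key such as name also yields None, so a caller that needs to tell these cases apart has to check path.exists() itself or read the logged warning.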

mlforge.manifest.write_manifest_file

write_manifest_file(
    path: Path | str, manifest: Manifest
) -> None

Write consolidated manifest to a JSON file.

Parameters:

  • path (Path | str, required): Path to write the manifest.json file
  • manifest (Manifest, required): Manifest to serialize

Source code in src/mlforge/manifest.py
def write_manifest_file(path: Path | str, manifest: Manifest) -> None:
    """
    Write consolidated manifest to a JSON file.

    Args:
        path: Path to write the manifest.json file
        manifest: Manifest to serialize
    """
    with open(path, "w") as f:
        json.dump(manifest.to_dict(), f, indent=2)

mlforge.manifest.read_manifest_file

read_manifest_file(path: Path) -> Manifest | None

Read consolidated manifest from a JSON file.

Parameters:

  • path (Path, required): Path to the manifest.json file

Returns:

  • Manifest | None: Manifest if file exists and is valid, None otherwise

Source code in src/mlforge/manifest.py
def read_manifest_file(path: Path) -> Manifest | None:
    """
    Read consolidated manifest from a JSON file.

    Args:
        path: Path to the manifest.json file

    Returns:
        Manifest if file exists and is valid, None otherwise
    """
    if not path.exists():
        return None

    try:
        with open(path) as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        logger.warning(f"Invalid JSON in {path}: {e}")
        return None

    try:
        return Manifest.from_dict(data)
    except KeyError as e:
        logger.warning(f"Schema mismatch in {path}: missing key {e}")
        return None

Usage Examples

Reading Feature Metadata

from mlforge import LocalStore

store = LocalStore("./feature_store")

# Read metadata for a specific feature
metadata = store.read_metadata("user_spend")

if metadata:
    print(f"Feature: {metadata.name}")
    print(f"Rows: {metadata.row_count:,}")
    print(f"Last updated: {metadata.updated_at}")
    print(f"Columns: {len(metadata.columns)}")

Listing All Metadata

from mlforge import LocalStore

store = LocalStore("./feature_store")

# Get all feature metadata
all_metadata = store.list_metadata()

for meta in all_metadata:
    print(f"{meta.name}: {meta.row_count:,} rows")

Creating a Consolidated Manifest

from mlforge import LocalStore
from mlforge.manifest import Manifest, write_manifest_file
from datetime import datetime, timezone

store = LocalStore("./feature_store")

# Create manifest from all features
manifest = Manifest(
    generated_at=datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
)

for meta in store.list_metadata():
    manifest.add_feature(meta)

# Write to file
write_manifest_file("manifest.json", manifest)

Inspecting Column Metadata

from mlforge import LocalStore

store = LocalStore("./feature_store")
metadata = store.read_metadata("user_spend")

if metadata:
    # Base columns live in `columns`; rolling-metric columns live in `features`
    for col in metadata.columns + metadata.features:
        if col.agg:
            # Rolling aggregation column
            print(f"{col.name}: {col.agg}({col.input}) over {col.window}")
        else:
            # Regular column
            print(f"{col.name}: {col.dtype}")

Metadata Schema

Feature Metadata JSON

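Per-feature metadata is stored in _metadata/<feature_name>.meta.json. The example below uses the pre-v0.5.0 layout, which from_dict still accepts; files written by v0.5.0 and later use updated_at instead of last_updated, always include version, created_at, and the four hash fields, and list rolling-metric columns under features rather than columns: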

{
  "name": "merchant_spend",
  "path": "merchant_spend.parquet",
  "entity": "merchant_id",
  "keys": ["merchant_id"],
  "source": "data/transactions.parquet",
  "row_count": 15482,
  "last_updated": "2024-01-16T08:30:00Z",
  "timestamp": "transaction_date",
  "interval": "1d",
  "columns": [
    {"name": "merchant_id", "dtype": "Utf8"},
    {"name": "transaction_date", "dtype": "Date"},
    {
      "name": "amt__count__7d",
      "dtype": "UInt32",
      "input": "amt",
      "agg": "count",
      "window": "7d"
    },
    {
      "name": "amt__sum__7d",
      "dtype": "Float64",
      "input": "amt",
      "agg": "sum",
      "window": "7d"
    }
  ],
  "tags": ["merchants"],
  "description": "Merchant spend aggregations"
}

Consolidated Manifest JSON

The manifest consolidates all feature metadata into a single file:

{
  "version": "1.0",
  "generated_at": "2024-01-16T08:30:00Z",
  "features": {
    "merchant_spend": {
      "name": "merchant_spend",
      "path": "merchant_spend.parquet",
      ...
    },
    "user_spend": {
      "name": "user_spend",
      "path": "user_spend.parquet",
      ...
    }
  }
}

Column Naming Convention

For features with Rolling metrics, columns follow this pattern:

{feature_name}__{column}__{aggregation}__{interval}__{window}

Examples:

  • user_spend__amt__sum__1d__7d - Sum of amt over 7-day window with 1-day interval
  • user_spend__amt__count__1d__30d - Count of amt over 30-day window with 1-day interval

The derive_column_metadata() function parses these column names to extract:

  • input: Source column name (amt)
  • agg: Aggregation type (sum, count, etc.)
  • window: Time window (7d, 30d, etc.)
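
The regex that mlforge actually uses is not shown on this page. As a minimal sketch of how the five-part pattern above could be taken apart, the hypothetical helper below strips the feature-name prefix and splits the remainder from the right; it is an illustration, not the library's parser.

```python
from typing import Optional


def parse_rolling_column(column: str, feature_name: str) -> Optional[dict]:
    """Hypothetical parser for {feature_name}__{column}__{aggregation}__{interval}__{window}."""
    prefix = feature_name + "__"
    if not column.startswith(prefix):
        return None
    # Split from the right so a source column that itself contains "__" stays intact.
    parts = column[len(prefix):].rsplit("__", 3)
    if len(parts) != 4:
        return None
    source, agg, interval, window = parts
    return {"input": source, "agg": agg, "interval": interval, "window": window}


parsed = parse_rolling_column("user_spend__amt__sum__1d__7d", "user_spend")
not_rolling = parse_rolling_column("user_id", "user_spend")
```

Returning None for names that do not match lets a caller treat every unmatched column as a regular column with only a dtype.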

See Also