Retrieving Features¶
mlforge provides two retrieval functions for different use cases:
| Function | Use Case | Store Type | Point-in-Time |
|---|---|---|---|
get_training_data() |
Training, batch scoring | Offline (LocalStore, S3Store) | Yes |
get_online_features() |
Real-time inference | Online (RedisStore) | No (latest only) |
Offline Retrieval (Training)¶
Use get_training_data() to join features to your entity DataFrame for training or batch scoring.
Basic Usage¶
import mlforge as mlf
import polars as pl
# Load your entity data (e.g., labels, predictions)
entities = pl.read_parquet("data/labels.parquet")
# Get features joined to entities
training_data = mlf.get_training_data(
features=["user_total_spend", "user_avg_spend"],
entity_df=entities
)
Function Signature¶
def get_training_data(
features: list[str],
entity_df: pl.DataFrame,
store: str | Path | Store = "./feature_store",
entities: list[EntityKeyTransform] | None = None,
timestamp: str | None = None,
) -> pl.DataFrame
Parameters¶
features (required)¶
List of feature names to retrieve. Must match the names of built features.
training_data = mlf.get_training_data(
features=["user_age", "user_tenure_days"],
entity_df=entities
)
entity_df (required)¶
DataFrame containing entity keys and optionally timestamps. This is typically your:
- Training labels
- Prediction entities
- Evaluation dataset
store (optional)¶
Path to feature store or a Store instance. Defaults to "./feature_store".
import mlforge as mlf
# Using default path
training_data = mlf.get_training_data(
features=["user_age"],
entity_df=entities
)
# Custom path
training_data = mlf.get_training_data(
features=["user_age"],
entity_df=entities,
store="./my_features"
)
# Store instance
store = mlf.LocalStore("./my_features")
training_data = mlf.get_training_data(
features=["user_age"],
entity_df=entities,
store=store
)
entities (optional)¶
List of entity key transforms to apply to entity_df before joining. Use this when your entity DataFrame doesn't have the required keys.
import mlforge as mlf
# Define transform
with_user_id = mlf.entity_key("first", "last", "dob", alias="user_id")
# Apply during retrieval
training_data = mlf.get_training_data(
features=["user_spend_stats"],
entity_df=raw_entities,
entities=[with_user_id] # Adds user_id column
)
See Entity Keys for details.
timestamp (optional)¶
Column name in entity_df to use for point-in-time joins. When specified, features with timestamps are joined using asof joins.
training_data = mlf.get_training_data(
features=["user_spend_mean_30d"],
entity_df=transactions,
timestamp="transaction_time" # Point-in-time correct
)
See Point-in-Time Correctness for details.
Join Behavior¶
Standard Joins¶
When timestamp is not specified, features are joined using standard left joins:
entities = pl.DataFrame({
"user_id": ["u1", "u2", "u3"],
"label": [0, 1, 0]
})
training_data = mlf.get_training_data(
features=["user_total_spend"],
entity_df=entities
)
# Joins on common columns (user_id)
Point-in-Time Joins¶
When timestamp is specified and features have timestamps, asof joins are used:
transactions = pl.DataFrame({
"user_id": ["u1", "u1", "u2"],
"transaction_time": ["2024-01-05", "2024-01-15", "2024-01-10"],
"label": [0, 1, 0]
})
training_data = mlf.get_training_data(
features=["user_spend_mean_30d"], # Has feature_timestamp
entity_df=transactions,
timestamp="transaction_time" # Asof join
)
# Features reflect data available at each transaction_time
Join Key Detection¶
Join keys are automatically detected from common columns:
# entity_df has: user_id, merchant_id, label
# feature has: user_id, merchant_id, total_spend
# Joins on: user_id, merchant_id
training_data = mlf.get_training_data(
features=["user_merchant_spend"],
entity_df=entities
)
Timestamp columns are excluded from join keys when performing asof joins.
Complete Example¶
import mlforge as mlf
import polars as pl
# 1. Load entity data
transactions = pl.read_parquet("data/transactions.parquet")
# 2. Define entity transform
with_user_id = mlf.entity_key("first", "last", "dob", alias="user_id")
# 3. Retrieve features with point-in-time correctness
training_data = mlf.get_training_data(
features=["user_spend_mean_30d", "user_total_spend"],
entity_df=transactions,
entities=[with_user_id],
timestamp="trans_date_trans_time",
store="./feature_store"
)
# 4. Use in training
from sklearn.ensemble import RandomForestClassifier
X = training_data.select(["user_spend_mean_30d", "user_total_spend"])
y = training_data.select("label")
model = RandomForestClassifier()
model.fit(X.to_pandas(), y.to_pandas())
Error Handling¶
Missing Features¶
If a requested feature hasn't been built:
import mlforge as mlf
try:
training_data = mlf.get_training_data(
features=["nonexistent_feature"],
entity_df=entities
)
except ValueError as e:
print(e)
# Feature 'nonexistent_feature' not found. Run `mlforge build` first.
Build the feature first:
Missing Join Keys¶
If entity_df and features don't share common columns:
# entity_df has: customer_id, label
# feature has: user_id, total_spend
try:
training_data = mlf.get_training_data(
features=["user_total_spend"],
entity_df=entities
)
except ValueError as e:
print(e)
# No common columns to join 'user_total_spend'.
Solution: Use entity transforms to add required keys:
with_user_id = mlf.entity_key("customer_id", alias="user_id")
training_data = mlf.get_training_data(
features=["user_total_spend"],
entity_df=entities,
entities=[with_user_id]
)
Timestamp Type Mismatch¶
For asof joins, timestamp columns must have matching data types:
# entity_df["event_time"] is String
# feature["feature_timestamp"] is Datetime
try:
training_data = mlf.get_training_data(
features=["temporal_feature"],
entity_df=entities,
timestamp="event_time"
)
except ValueError as e:
print(e)
# Timestamp dtype mismatch: entity_df['event_time'] is String,
# but feature has Datetime.
Solution: Convert timestamps before calling mlf.get_training_data():
entities = entities.with_columns(
pl.col("event_time").str.to_datetime().alias("event_time")
)
training_data = mlf.get_training_data(
features=["temporal_feature"],
entity_df=entities,
timestamp="event_time"
)
Multiple Feature Stores¶
You can retrieve features from different stores by calling mlf.get_training_data() multiple times:
import mlforge as mlf
# Features from store A
training_data = mlf.get_training_data(
features=["user_age", "user_tenure"],
entity_df=entities,
store="./store_a"
)
# Add features from store B
training_data = mlf.get_training_data(
features=["user_spend_stats"],
entity_df=training_data,
store="./store_b"
)
Best Practices¶
1. Convert Timestamps Early¶
Always convert timestamp columns to proper datetime types before retrieval:
entities = (
pl.read_parquet("data/labels.parquet")
.with_columns(
pl.col("event_time").str.to_datetime("%Y-%m-%d %H:%M:%S")
)
)
training_data = mlf.get_training_data(
features=["temporal_features"],
entity_df=entities,
timestamp="event_time"
)
2. Use Type Hints¶
Add type hints for clarity:
import polars as pl
import mlforge as mlf
entities: pl.DataFrame = pl.read_parquet("data/labels.parquet")
training_data: pl.DataFrame = mlf.get_training_data(
features=["user_age"],
entity_df=entities
)
3. Verify Features Exist¶
Check built features before retrieval:
4. Handle Missing Values¶
Features may have nulls for entities not in the feature source:
training_data = mlf.get_training_data(
features=["user_total_spend"],
entity_df=entities
)
# Fill nulls if needed
training_data = training_data.with_columns(
pl.col("user_total_spend").fill_null(0)
)
Online Retrieval (Inference)¶
For real-time inference, use get_online_features() to retrieve features from an online store like Redis.
Basic Usage¶
import mlforge as mlf
from mlforge.online import RedisStore
import polars as pl
# Connect to Redis
store = RedisStore(host="localhost", port=6379)
# Define entity transform (same as training)
with_user_id = mlf.entity_key("first", "last", "dob", alias="user_id")
# Inference request
request_df = pl.DataFrame({
"request_id": ["req_001", "req_002"],
"first": ["John", "Jane"],
"last": ["Doe", "Smith"],
"dob": ["1990-01-15", "1985-06-20"],
})
# Retrieve features
features_df = mlf.get_online_features(
features=["user_spend"],
entity_df=request_df,
store=store,
entities=[with_user_id],
)
Function Signature¶
def get_online_features(
features: list[str],
entity_df: pl.DataFrame,
store: OnlineStore,
entities: list[EntityKeyTransform] | None = None,
) -> pl.DataFrame
Key Differences from Training Retrieval¶
| Aspect | get_training_data() |
get_online_features() |
|---|---|---|
| Versioning | Supports ("feature", "1.0.0") |
No versioning (latest only) |
| Point-in-time | Uses timestamp parameter |
Not applicable |
| Store parameter | Path string or Store instance | OnlineStore instance required |
| Missing entities | Returns null | Returns null |
Prerequisites¶
Before using online retrieval:
-
Configure online store in your definitions:
-
Build to online store:
-
Ensure Redis is running:
See Online Stores for detailed setup instructions.
Next Steps¶
- Entity Keys - Work with surrogate keys and transforms
- Point-in-Time Correctness - Understand temporal joins
- Online Stores - Set up Redis for real-time serving
- API Reference - Detailed API documentation