Validation¶
mlforge.validation ¶
Validation runner for feature data quality checks.
This module provides the infrastructure for running validators against DataFrames and collecting results. It is used by the build process to validate feature function outputs before metrics are applied.
Example
from mlforge.validation import validate_dataframe from mlforge.validators import not_null, greater_than
validators = { "amount": [not_null(), greater_than(0)], "user_id": [not_null()], }
results = validate_dataframe(df, validators) if not results.passed: for failure in results.failures: print(f"{failure.column}: {failure.result.message}")
ColumnValidationResult
dataclass
¶
Result of running a single validator on a single column.
Attributes:
| Name | Type | Description |
|---|---|---|
column |
str
|
Name of the column that was validated |
validator_name |
str
|
Name of the validator that was run |
result |
ValidationResult
|
The ValidationResult from the validator |
Source code in src/mlforge/validation.py
FeatureValidationResult
dataclass
¶
Aggregated validation results for a feature.
Attributes:
| Name | Type | Description |
|---|---|---|
feature_name |
str
|
Name of the feature that was validated |
column_results |
list[ColumnValidationResult]
|
List of individual column validation results |
Source code in src/mlforge/validation.py
validate_dataframe ¶
validate_dataframe(
df: DataFrame, validators: dict[str, list[Validator]]
) -> list[ColumnValidationResult]
Run validators against DataFrame columns.
Validators are only run on columns that exist in the DataFrame. Missing columns are silently skipped (dbt-style behavior).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
DataFrame to validate |
required |
validators
|
dict[str, list[Validator]]
|
Mapping of column names to lists of validators |
required |
Returns:
| Type | Description |
|---|---|
list[ColumnValidationResult]
|
List of ColumnValidationResult for each validator run |
Example
results = validate_dataframe( df, {"amount": [not_null(), greater_than(0)]} ) for r in results: if not r.result.passed: print(f"{r.column} failed {r.validator_name}: {r.result.message}")
Source code in src/mlforge/validation.py
validate_feature ¶
validate_feature(
feature_name: str,
df: DataFrame,
validators: dict[str, list[Validator]],
) -> FeatureValidationResult
Validate a feature's DataFrame and return aggregated results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
feature_name
|
str
|
Name of the feature being validated |
required |
df
|
DataFrame
|
DataFrame output from feature function |
required |
validators
|
dict[str, list[Validator]]
|
Mapping of column names to lists of validators |
required |
Returns:
| Type | Description |
|---|---|
FeatureValidationResult
|
FeatureValidationResult with all validation outcomes |
Example
result = validate_feature( "user_spend", df, {"amount": [not_null()]} ) if not result.passed: print(f"Feature {result.feature_name} failed validation")