Datasets and transformations in Featureform

Register datasets and define SQL or DataFrame transformations for Featureform features.

Datasets and transformations are the bridge between your raw source data and your feature definitions.

  • Datasets point Featureform at existing tables or files.
  • Transformations create reusable feature engineering logic on top of those datasets.

Register datasets

Register the source objects that Featureform should treat as named inputs.

Warehouse tables

transactions = snowflake.register_table(
    name="transactions",
    table="TRANSACTIONS",
    description="Raw transaction table",
)

For Snowflake providers, the provider-level database, schema, and catalog configuration determines where Featureform resolves the table unless you override those values during registration.

Delta tables

transactions = spark.register_delta_table(
    name="transactions",
    database="my_catalog.my_schema",
    table="transactions",
)

Use catalog.schema in the database parameter when you are working with Unity Catalog.

Apache Iceberg tables

If your provider is configured with an Iceberg-compatible catalog, register the table against that provider and then use it like any other dataset:

orders = snowflake.register_table(
    name="orders_iceberg",
    table="ORDERS_ICEBERG",
    description="Apache Iceberg table registered through the configured catalog",
)

You can then reference the Iceberg-backed dataset in transformations:

@snowflake.sql_transformation(inputs=[orders])
def recent_orders(orders):
    return """
    SELECT
        order_id,
        customer_id,
        order_total,
        order_timestamp
    FROM {{ orders }}
    WHERE order_timestamp >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    """

File-based datasets

events = spark.register_file(
    name="events",
    file_path="s3://my-bucket/events/transactions.parquet",
    description="Event data stored in Parquet",
)

Spark providers also support JSON-backed primary sources:

events = spark.register_json(
    name="raw_events",
    file_path="s3://my-bucket/events.jsonl",
)

Flatten JSON-backed sources in a transformation before you use them for features or labels.
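
As a rough illustration of what flattening means here, the following plain-Python sketch turns one nested JSON Lines record into a flat row. The record shape and the flatten helper are hypothetical, and the sketch uses neither Featureform nor Spark; in a real pipeline the same reshaping would live inside a transformation.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the joined key prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical JSON Lines record; the field names are illustrative only.
line = '{"user": {"id": 42, "country": "DE"}, "amount": 19.5}'
row = flatten(json.loads(line))
print(row)
```

Each nested key becomes a flat column name (for example, user.id becomes user_id), which is the shape that feature and label definitions usually expect.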

Choose dataset types that match your source-of-truth platform. The goal is to create stable named inputs that other definitions can depend on.

SQL transformations

Use SQL transformations when your provider is SQL-native and the logic is easiest to express in SQL:

@snowflake.sql_transformation(inputs=[transactions])
def user_spend(transactions):
    return """
        SELECT
            user_id,
            AVG(amount) AS avg_transaction_amount,
            MAX(timestamp) AS latest_transaction
        FROM {{ transactions }}
        GROUP BY user_id
    """
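
To see the shape of the result such a query produces, it can help to run the same aggregation locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for Snowflake, with made-up rows. The render helper that swaps the {{ transactions }} placeholder for a table name is a hypothetical illustration, not how Featureform resolves placeholders, and the ORDER BY clause is added only to make the local output deterministic.

```python
import sqlite3

QUERY = """
    SELECT
        user_id,
        AVG(amount) AS avg_transaction_amount,
        MAX(timestamp) AS latest_transaction
    FROM {{ transactions }}
    GROUP BY user_id
    ORDER BY user_id
"""


def render(sql, **tables):
    """Hypothetical helper: replace each {{ name }} placeholder with a table name."""
    for name, table in tables.items():
        sql = sql.replace("{{ " + name + " }}", table)
    return sql


# In-memory database with invented sample rows (sqlite3 stands in for Snowflake).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL, timestamp TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("u1", 10.0, "2024-01-01T09:00:00"),
        ("u1", 30.0, "2024-01-03T12:00:00"),
        ("u2", 5.0, "2024-01-02T08:30:00"),
    ],
)
rows = conn.execute(render(QUERY, transactions="transactions")).fetchall()
conn.close()
print(rows)
```

The result has one row per user_id, which is the grain a per-user feature would be defined on.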

For Snowflake providers configured with an Iceberg catalog, you can also create dynamic-table-backed transformations:

from featureform import SnowflakeDynamicTableConfig, RefreshMode, Initialize

@snowflake.sql_transformation(
    inputs=[transactions],
    resource_snowflake_config=SnowflakeDynamicTableConfig(
        refresh_mode=RefreshMode.INCREMENTAL,
        initialize=Initialize.ON_CREATE,
        target_lag="30 minutes",
    )
)
def incremental_user_features(transactions):
    return """
    SELECT
        user_id,
        AVG(amount) AS avg_amount,
        COUNT(*) AS tx_count,
        MAX(timestamp) AS last_tx
    FROM {{ transactions }}
    GROUP BY user_id
    """

DataFrame transformations

Use DataFrame transformations when you need programmatic control or Spark-native operations:

@spark.df_transformation(inputs=[transactions])
def user_transaction_features(transactions_df):
    from pyspark.sql import functions as F

    # One row per user: mean amount, transaction count, and most recent timestamp.
    return transactions_df.groupBy("user_id").agg(
        F.avg("amount").alias("avg_transaction_amount"),
        F.count("*").alias("transaction_count"),
        F.max("timestamp").alias("latest_transaction"),
    )

Chaining transformations

You can build transformations on top of other transformations. This keeps feature engineering logic modular and easier to review:

@spark.df_transformation(inputs=[user_transaction_features])
def high_value_users(features_df):
    return features_df.filter("avg_transaction_amount > 100")

Accessing transformation data

For development and validation, Featureform can retrieve the output of registered datasets or transformations as data frames:

df = client.dataframe(user_transaction_features)

Use this sparingly in production workflows. Its main value is validation and iteration.

Apply registrations

Once datasets and transformations are defined, register them with Featureform:

client.apply()

This step records the metadata and dependency graph so later resources, such as features and training sets, can reference them.
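
One way to picture the dependency graph is as a mapping from each resource to the resources it reads from. The sketch below uses the standard library's graphlib module to order the names used on this page so that every input comes before its consumers. It is a conceptual illustration only and makes no claim about how Featureform stores or resolves the graph internally.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it reads from (names taken from this page).
dependencies = {
    "transactions": set(),
    "user_transaction_features": {"transactions"},
    "high_value_users": {"user_transaction_features"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because a dataset must exist before anything built on it, renaming or removing an upstream resource affects everything later in this order.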

Best practices

  • Register stable business-level datasets, not every transient table.
  • Keep transformation names descriptive and reusable.
  • Push provider-specific parsing, casting, and joins into transformations before defining features.
  • Include timestamps where temporal correctness matters later in training or serving.
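
The last point is easiest to see with a point-in-time lookup: for a label observed at time t, only feature values recorded at or before t should be used. The following plain-Python sketch illustrates that rule with invented data; value_as_of is a hypothetical helper, not part of Featureform.

```python
from bisect import bisect_right

# (timestamp, value) pairs for one entity, sorted by timestamp (invented data).
history = [
    ("2024-01-01", 10.0),
    ("2024-01-05", 25.0),
    ("2024-01-09", 40.0),
]


def value_as_of(history, ts):
    """Return the latest value recorded at or before ts, or None if there is none."""
    timestamps = [t for t, _ in history]
    # bisect_right counts the entries whose timestamp is <= ts.
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


print(value_as_of(history, "2024-01-06"))
```

Without a timestamp column on the transformation output, a lookup like this has nothing to compare against, and later values can leak into training rows.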