Prepare Snowflake for RDI

Prepare Snowflake databases to work with RDI

This guide describes the steps required to prepare a Snowflake database as a source for Redis Data Integration (RDI) pipelines.

During both the snapshot and change data capture (CDC) phases, RDI uses Snowflake Streams to read data from the monitored tables. For the initial snapshot, RDI creates each stream with SHOW_INITIAL_ROWS = TRUE so it can read the current table contents before continuing with ongoing CDC. RDI creates and manages the required streams automatically.
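Although RDI issues this DDL for you, the statement it runs is conceptually similar to the following sketch (the stream, schema, and table names here are illustrative, not objects RDI requires you to create):

```sql
-- Illustrative only: RDI creates and manages its streams itself.
-- SHOW_INITIAL_ROWS = TRUE makes the first read of the stream return the
-- rows that already exist in the table (the snapshot); subsequent reads
-- return only change records (CDC).
CREATE STREAM MYDB.RDI_CDC.customers_stream
  ON TABLE MYDB.PUBLIC.customers
  SHOW_INITIAL_ROWS = TRUE;
```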

Setup

The following checklist shows the steps to prepare a Snowflake database for RDI, with links to the sections that explain the steps in full detail. You may find it helpful to track your progress with the checklist as you complete each step.

Note:
Snowflake is only supported with RDI deployed on Kubernetes/Helm. RDI VM mode does not support Snowflake as a source database.

- [ ] [Set up Snowflake permissions](#1-set-up-snowflake-permissions)
- [ ] [Configure authentication](#2-configure-authentication)
- [ ] [Set up secrets for Kubernetes deployment](#3-set-up-secrets-for-kubernetes-deployment)
- [ ] [Configure RDI for Snowflake](#4-configure-rdi-for-snowflake)

1. Set up Snowflake permissions

The following are the minimum runtime permissions for the RDI role to read the source tables and create the Snowflake objects RDI uses for CDC:

  • USAGE, OPERATE on the warehouse used for RDI reads
  • USAGE on the source database and source schema
  • SELECT on the source tables
  • USAGE on the CDC schema used by RDI
  • CREATE STREAM, CREATE TABLE on the CDC schema used by RDI

If you configure cdcDatabase and cdcSchema, grant the CDC permissions there. Otherwise, grant them in the source schema. If your Snowflake setup requires it, also grant any additional cross-database privileges needed for the CDC schema to reference the source tables.

Note:

RDI manages the Snowflake streams it uses for snapshot and CDC. The collector creates the stream in the configured CDC schema and later issues CREATE OR REPLACE STREAM statements to keep the stream aligned with the expected offset, so the RDI role must be able to create and own those stream objects in the CDC schema.

There is one stricter bootstrap requirement for the first stream created on a source table: if Snowflake change tracking is not already enabled on that table, only the table owner can create that initial stream. If the source tables are not owned by the RDI role, ask a Snowflake administrator or table owner to enable change tracking first:

ALTER TABLE MYDB.PUBLIC.customers SET CHANGE_TRACKING = TRUE;
ALTER TABLE MYDB.PUBLIC.orders SET CHANGE_TRACKING = TRUE;
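To confirm that change tracking is enabled, you can inspect the change_tracking column in the SHOW TABLES output (table and schema names as in the examples above):

```sql
-- The change_tracking column should read ON for each monitored table.
SHOW TABLES LIKE 'customers' IN SCHEMA MYDB.PUBLIC;
```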

Grant the required permissions to your RDI user:

-- Grant usage on the warehouse
GRANT USAGE, OPERATE ON WAREHOUSE COMPUTE_WH TO ROLE rdi_role;

-- Grant usage on the source database and schema
GRANT USAGE ON DATABASE MYDB TO ROLE rdi_role;
GRANT USAGE ON SCHEMA MYDB.PUBLIC TO ROLE rdi_role;

-- Grant SELECT on tables to capture
GRANT SELECT ON TABLE MYDB.PUBLIC.customers TO ROLE rdi_role;
GRANT SELECT ON TABLE MYDB.PUBLIC.orders TO ROLE rdi_role;

-- Grant permissions on the schema RDI uses for CDC objects
GRANT USAGE ON SCHEMA MYDB.RDI_CDC TO ROLE rdi_role;
GRANT CREATE STREAM, CREATE TABLE ON SCHEMA MYDB.RDI_CDC TO ROLE rdi_role;

-- Assign the role to your RDI user
GRANT ROLE rdi_role TO USER rdi_user;

If you use centralized grant management, you can also add future grants in the CDC schema so newly created tables and streams automatically receive the desired privileges. These grants are optional and are not part of the minimum runtime permissions:

GRANT SELECT ON FUTURE TABLES IN SCHEMA MYDB.RDI_CDC TO ROLE rdi_role;
GRANT SELECT ON FUTURE STREAMS IN SCHEMA MYDB.RDI_CDC TO ROLE rdi_role;
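After granting, you can verify that the role holds the expected privileges before starting the pipeline:

```sql
-- List every privilege granted to the RDI role.
SHOW GRANTS TO ROLE rdi_role;

-- List the privileges granted on the CDC schema.
SHOW GRANTS ON SCHEMA MYDB.RDI_CDC;
```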

2. Configure authentication

RDI supports two authentication methods for Snowflake; you must configure exactly one of them.

Password authentication

Use standard username and password credentials. Store these securely using Kubernetes secrets (see step 3).

Note:
Many Snowflake accounts require MFA for password-based sign-ins. If you want to use password authentication for RDI, configure the Snowflake user as a service user that is allowed to authenticate non-interactively. Otherwise, use private key authentication instead. For more information, see the Snowflake MFA rollout documentation.

Private key authentication

For enhanced security, use key-pair authentication:

  1. Generate a private key:

    openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
    
  2. Generate the public key:

    openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
    
  3. Register the public key with your Snowflake user:

    ALTER USER rdi_user SET RSA_PUBLIC_KEY='<public_key_content>';
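To confirm that the key was registered, describe the user and check the RSA_PUBLIC_KEY_FP property, which shows the fingerprint of the registered public key:

```sql
-- RSA_PUBLIC_KEY_FP is empty until a public key has been registered.
DESC USER rdi_user;
```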
    

3. Set up secrets for Kubernetes deployment

Before deploying the RDI pipeline, configure the necessary secrets.

Password authentication

kubectl create secret generic source-db \
  --namespace=rdi \
  --from-literal=SOURCE_DB_USERNAME=your_username \
  --from-literal=SOURCE_DB_PASSWORD=your_password

Private key authentication

Create a secret with the private key file:

kubectl create secret generic source-db-ssl \
  --namespace=rdi \
  --from-file=client.key=/path/to/rsa_key.p8

Also create the source-db secret with the username:

kubectl create secret generic source-db \
  --namespace=rdi \
  --from-literal=SOURCE_DB_USERNAME=your_username

4. Configure RDI for Snowflake

Use the following example configuration in your config.yaml file:

sources:
  snowflake:
    type: riotx
    connection:
      type: snowflake
      url: "jdbc:snowflake://myaccount.snowflakecomputing.com/"
      user: "${SOURCE_DB_USERNAME}"
      password: "${SOURCE_DB_PASSWORD}"  # Omit for key-pair auth
      database: "MYDB"
      warehouse: "COMPUTE_WH"
      # role: "RDI_ROLE"                 # Optional: Snowflake role
      # cdcDatabase: "CDC_DB"            # Optional: Separate database for CDC streams
      # cdcSchema: "CDC_SCHEMA"          # Optional: Separate schema for CDC streams
    schemas:
      - PUBLIC
    tables:
      PUBLIC.customers: {}
      PUBLIC.orders: {}
    advanced:
      riotx:
        poll: "30s"
        snapshot: "INITIAL"              # Or "NEVER" to skip initial snapshot
        # streamPrefix: "data:"          # Optional: Redis stream prefix
        # streamLimit: 100000            # Optional: Max stream length
        # keyColumns:                    # Recommended: stable key columns
        #   - "id"
        # clearOffset: false             # Optional: Clear offset on start

targets:
  target:
    connection:
      type: redis
      host: ${TARGET_DB_HOST}
      port: ${TARGET_DB_PORT}
      user: ${TARGET_DB_USERNAME}
      password: ${TARGET_DB_PASSWORD}

processors:
  target_data_type: json

Note:
Snowflake uses one configured database and one or more source-level schemas. In the tables section, specify each table as SCHEMA.table. Even when you configure only one schema, explicit SCHEMA.table names are recommended for clarity.

Snowflake connection properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| type | string | Yes | Must be "snowflake" |
| url | string | Yes | JDBC URL: jdbc:snowflake://<account>.snowflakecomputing.com/ |
| user | string | Yes | Snowflake user |
| password | string | No* | Snowflake password |
| database | string | Yes | Snowflake database name |
| warehouse | string | Yes | Snowflake warehouse name |
| role | string | No | Snowflake role name |
| cdcDatabase | string | No | Database for CDC streams (if different from source) |
| cdcSchema | string | No | Schema for CDC streams (if different from source) |

\* Required for password authentication; omit when using key-pair authentication.

Snowflake source properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| schemas | array | Yes | Schema names to capture from |
| tables | object | Yes | Tables to capture, keyed as SCHEMA.table |

Advanced configuration options

Configure under sources.<name>.advanced.riotx:

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| poll | string | "30s" | Polling interval for stream changes |
| snapshot | string | "INITIAL" | Snapshot mode: INITIAL or NEVER |
| streamPrefix | string | "data:" | Prefix for the Redis stream written by RDI |
| streamLimit | integer | - | Maximum stream length (XTRIM MAXLEN) |
| keyColumns | array | - | Stable source columns to use as message keys |
| clearOffset | boolean | false | Clear existing offset on start |
| count | integer | 0 | Limit records per poll (0 = unlimited) |

For reliable update and delete handling, define keyColumns with a stable business key or surrogate key when possible.
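For example, if the customers table has a stable customer_id column (a hypothetical name, shown only for illustration), you could configure it as the message key alongside the rest of the source configuration:

```yaml
sources:
  snowflake:
    advanced:
      riotx:
        keyColumns:
          - "customer_id"   # hypothetical stable key column
```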

Troubleshooting

Connection issues

Error: "Failed to connect to Snowflake"

  • Verify the account URL is correct (format: <account>.snowflakecomputing.com)
  • Check network connectivity to Snowflake
  • Verify the warehouse is running and accessible
  • Check firewall rules allow outbound HTTPS (port 443)

Error: "Authentication failed"

  • For password auth: verify username and password are correct
  • For key-pair auth: verify the private key matches the public key registered in Snowflake
  • Ensure the user has appropriate permissions

Error: "Warehouse not found"

  • Verify the warehouse name is correct
  • Ensure the user has USAGE permission on the warehouse

CDC issues

No data appearing in Redis

  1. Verify Snowflake Streams exist in the CDC schema:

    SHOW STREAMS IN SCHEMA my_cdc_database.my_cdc_schema;
    
  2. Check the polling interval configuration

  3. Verify Redis connection is working

  4. Check the collector logs:

    kubectl get deployments -n rdi | grep riotx-collector
    kubectl logs -n rdi deployment/<riotx-collector-deployment>
    

Stale or missing changes

  • Snowflake Streams depend on Snowflake change tracking and retention settings
  • If the collector was offline longer than the available retention window, changes may be lost
  • Consider using clearOffset: true to restart from current state
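You can check whether a stream has gone stale from the stale and stale_after columns in the SHOW STREAMS output:

```sql
-- A stream with stale = true can no longer be read and must be recreated.
SHOW STREAMS IN SCHEMA MYDB.RDI_CDC;
```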

Performance tuning

High Snowflake warehouse usage

  • Increase poll interval (e.g., "60s" or "120s")
  • Use a dedicated warehouse for CDC operations
  • Each poll first calls Snowflake's SYSTEM$STREAM_HAS_DATA function to check whether the stream has new data. This check does not start the warehouse; warehouse compute starts only when RDI reads rows from the stream.
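You can run the same check manually to see whether a poll would trigger warehouse compute (the stream name below is hypothetical):

```sql
-- Returns TRUE if the stream contains unread change records.
SELECT SYSTEM$STREAM_HAS_DATA('MYDB.RDI_CDC.CUSTOMERS_STREAM');
```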

Redis memory concerns

  • Set streamLimit to cap stream length
  • Use count to limit records per poll batch

Initial snapshot too slow

  • Use snapshot: "NEVER" to skip the initial snapshot
  • Pre-load data by other means if needed

Enable debug logging

Enable debug logging in the source configuration:

sources:
  snowflake:
    type: riotx
    logging:
      level: debug
    # ... rest of configuration

View collector logs:

kubectl get deployments -n rdi | grep riotx-collector
kubectl logs -n rdi deployment/<riotx-collector-deployment> -f

5. Configuration is complete

Once you have followed the steps above, your Snowflake database is ready for RDI to use.
