Change Data Capture (CDC)

CDC is a feature that allows you to not only query the current state of a database’s table, but also query the history of all changes made to the table.

As an example, suppose you made a sequence of changes to some table in the given order:

UPDATE ks.t SET v = 0 WHERE pk = 0 AND ck = 0;
UPDATE ks.t SET v = 1 WHERE pk = 0 AND ck = 0;
UPDATE ks.t SET v = 2 WHERE pk = 0 AND ck = 0;
UPDATE ks.t SET v = 2 WHERE pk = 0 AND ck = 1;
UPDATE ks.t SET v = 1 WHERE pk = 0 AND ck = 1;
UPDATE ks.t SET v = 0 WHERE pk = 0 AND ck = 1;

Normally, querying the table would return

 pk | ck | v
----+----+---
  0 |  0 | 2
  0 |  1 | 0

(2 rows)

but with CDC, you can also learn the history of all changes:

change at 2020-01-29 14:37:32: UPDATE ks.t SET v = 0 WHERE pk = 0 AND ck = 0;
change at 2020-01-29 14:37:33: UPDATE ks.t SET v = 1 WHERE pk = 0 AND ck = 0;
change at 2020-01-29 14:37:35: UPDATE ks.t SET v = 2 WHERE pk = 0 AND ck = 0; <- latest change
change at 2020-01-29 14:37:38: UPDATE ks.t SET v = 2 WHERE pk = 0 AND ck = 1;
change at 2020-01-29 14:37:39: UPDATE ks.t SET v = 1 WHERE pk = 0 AND ck = 1;
change at 2020-01-29 14:37:40: UPDATE ks.t SET v = 0 WHERE pk = 0 AND ck = 1; <- latest change

(not an actual syntax, the above example just presents the general concept).

Use cases for CDC

Some examples where CDC may be benificial:

  • Heterogeneous database replication: applying captured changes to another database or table. The other database may use a different schema (or no schema at all), better suited for some specific workloads. An example is replication to ElasticSearch for efficient text searches.

  • Implementing a notification system.

  • In-flight analytics: looking for patterns in the changes in order to derive useful information, e.g. for fraud detection.

In Scylla CDC is optional and enabled on a per-table basis. The history of changes made to a CDC-enabled table is stored in a separate associated table.

Terminology

  • Base Table - this is the original table, where all changes are made.

  • Log Table - this is the table associated to the base table which is created when CDC is enabled. Read about it in the log table document.

Enabling CDC

You can enable CDC when creating or altering a table using the cdc option, for example:

CREATE TABLE ks.t (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};

CDC Parameters

The following table contains parameters available inside the cdc option:

Parameter name

Definition

Default

enabled

If true, the log table is created and each base table write will get a corresponding log table write: a delta row. The delta row describes “what has changed”.

false

preimage

If true, each base write will get a corresponding preimage row in the log table. The preimage row describes “what was before”. Note that preimages are costly: they require an additional read-before-write.

false

postimage

If true, each base write will get a corresponding postimage row in the log table. The postimage row describes “what does the row look like after the change”. Note that postimages, similarly to preimages, are costly: they require an additional read-before-write. However, if you enable both preimage and postimage, only one read will be required for both of them.

false

ttl

Each log table row has a TTL (time-to-live) set on each of its columns, so that the log doesn’t grow endlessly. This option specifies what the TTL should be in seconds; the default is 86400 seconds (24 hours). You can also set it to 0, which means that the TTL won’t be set, thus log rows won’t be removed. Be careful however: in that case the log will consume more and more disk space. You will probably want to setup a separate cleaning mechanism if you set TTL to 0.

86400