Tracing

Tracing is a ScyllaDB tool meant to help debugging and analyzing internal flows in the server. One example of such a flow is CQL request processing.

NOTE: If you’re planning to use either probabilistic tracing or slow query logging (see below), it’s advisable to change the replication_factor of the system_traces keyspace to ALL.

cqlsh

Enabling

While inside a cqlsh prompt you can enable tracing for this session with the command TRACING ON|OFF.

NOTE: source_elapsed starts over on 127.0.0.1 when execution gets there.

Viewing

The raw tracing data can be queried as below (scroll to view):

Storing

Traces from cqlsh are stored in the system_traces keyspace.

Probabilistic Tracing

Tracing implies a significant performance penalty on a cluster when enabled. Therefore, if tracing is required for some ongoing workload, it is undesirable to enable it for every request but rather for some (small) portion of requests. This can be done using probabilistic tracing, which randomly chooses a request to be traced with some defined probability.

Enabling

If you want to trace 0.01% of all queries in the cluster, you can set a probabilistic tracing with the probability 0.0001:

$ nodetool settraceprobability 0.0001

Viewing

If we need trace points for a specific session, we can query the events table for a given session’s id. For example:

SELECT * from system_traces.sessions where session_id = 141ab010-d994-11e7-899e-000000000002;

Storing

Traces are stored in the system_traces keyspace for 24 hours. The keyspace consists of two tables with a replication factor of 2:

  • sessions table contains a single row for each tracing session;
  • events table contains a single row for each trace point.

Traces are created in the context of a tracing session. For instance, if we trace an INSERT CQL command, a tracing session with a unique ID (session_id column in the tables mentioned above) will be created and all trace points hit during its execution will be stored in a context of this session. This defines the format in which tracing data is stored.

sessions table column descriptions

  • session_id: ID of this tracing session
  • command: currently, this can only have a QUERY value
  • client: address of the client that sent this query
  • coordinator: address of the coordinator node that received this query from the client
  • duration: the total duration of this tracing session
  • parameters: this map contains string pairs that describe the query. This may include query string or consistency level.
  • request: a short string describing the current query, like “Execute CQL3 query”
  • started_at: a timestamp taken when the tracing session has begun

events table column descriptions

  • session_id: ID of this tracing session
  • event_id: ID of this specific trace entry
  • activity: a trace message
  • source: address of a node where the trace entry has been created
  • source_elapsed: a number of microseconds passed since the beginning of the tracing session on a specific node (see examples above)
  • thread: currently this contains a number of the shard on which this trace point has been taken

Slow Query Logging

Often in real life installations, one of the most important parameters of the system is the longest response time. Naturally, the shorter it is, the better. Therefore, capturing a request that takes a long time and understanding why it took it so long is a very critical and challenging task.

Slow query logging will greatly ease debugging related to long requests. When enabled, it records queries with handling time above the specified threshold. As a result, there will be a new record created in system_traces.node_slow_log table. All tracing records created in the context of the current query on a coordinator node will also be written. In addition, if handling on a given replica takes too long, traces will be stored.

Enabling and configuring

Slow query logging is disabled by default. A REST API allows configuring and querying the configuration of the feature.

To set the parameters, run:

$ curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://<Node's address>:10000/storage_service/slow_query?enable=<true|false>&ttl=<in seconds>&threshold=<threshold in microseconds>"

For example, to disable the feature on a node with the address 127.0.0.1, set the ttl to 8600 and the threshold to 10000:

$ curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/slow_query?enable=false&ttl=8600&threshold=10000"

To get the current configuration, run:

$ curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://<Node's address>:10000/storage_service/slow_query"

After the POST command above, the query and result will look as below:

$ curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/slow_query"
{"threshold": 10000, "enable": false, "ttl": 8600}

Viewing

Two time series helper tables were introduced that will help simplify the querying of traces.

sessions_time_idx is for querying regular traces. Another table, the node_slow_log_time_idx table, is for querying slow query records.

sessions_time_idx and node_slow_log_time table column descriptions

  • minute: the minute, from epoch time, from when the record was taken
  • started_at: a timestamp taken when the tracing session has begun
  • session_id: the corresponding tracing session ID
  • start_time: time when the query was initiated
  • node_ip: address of a coordinator node
  • shard: shard ID on a coordinator, where the query has been handled

With these tables, one may get the relevant traces using a query like the one below:

SELECT * from system_traces.sessions_time_idx where minutes in ('2016-09-07 16:56:00-0700') and started_at > '2016-09-07 16:56:30-0700';

Storing

Slow query logging results are stored in the node_slow_log table.

node_slow_log table column descriptions

  • start_time and date: time when the query was initiated
  • node_ip: address of a coordinator node
  • shard: shard ID on a coordinator, where the query has been handled
  • command: the query command, e.g. select * from my_ks.my_cf
  • duration: the duration query handling in microseconds
  • parameters: query parameters like a parameters column in a system_traces.sessions table
  • session_id: the corresponding tracing session ID
  • source ip: address of the client that sent this query
  • table_names: a list of tables used for this query, where applicable
  • username: a user name used for authentication with this query