Topic: System administration
Learn: How to configure time synchronization for Scylla
Audience: Scylla and Apache Cassandra administrators
Apache Cassandra and Scylla depend on an accurate system clock. Kyle Kingsbury, author of the Jepsen distributed systems testing tool, writes:
Apache Cassandra uses wall-clock timestamps provided by the server, or optionally by the client, to order writes. It makes several guarantees about the monotonicity of writes and reads given timestamps. For instance, Cassandra guarantees most of the time that if you write successfully to a quorum of nodes, any subsequent read from a quorum of nodes will see that write or one with a greater timestamp.
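To make the consequence concrete, here is a minimal sketch of timestamp-based conflict resolution ("last write wins") with two skewed clocks. The `Node` and `Write` classes, the `skew_ms` field, and the `resolve` helper are hypothetical names made up for illustration; they are not Cassandra or Scylla APIs, and the skew values are assumed.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_ms: int  # wall-clock timestamp assigned by the coordinating node


@dataclass
class Node:
    name: str
    skew_ms: int  # assumed offset of this node's clock from true time

    def write(self, value: str, true_time_ms: int) -> Write:
        # The node stamps the write with its own, possibly skewed, clock.
        return Write(value, true_time_ms + self.skew_ms)


def resolve(a: Write, b: Write) -> Write:
    # Last write wins: keep the write with the greater timestamp.
    return a if a.timestamp_ms >= b.timestamp_ms else b


fast = Node("node-a", skew_ms=15)  # clock runs 15 ms ahead
slow = Node("node-b", skew_ms=-5)  # clock runs 5 ms behind

first = fast.write("old", true_time_ms=1000)   # stamped 1015
second = slow.write("new", true_time_ms=1010)  # stamped 1005

winner = resolve(first, second)
print(winner.value)  # prints "old": the write that happened first wins
```

The second write happened 10 ms later in real time, but it carries the smaller timestamp, so it is silently discarded.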
So servers need to keep their time in sync. Not a hard problem, since we all have NTP on our Linux systems, right? Not quite. The way that NTP ships out of the box is fine for a stand-alone server, but can be a problem for a distributed data store.
The default NTP configuration that comes with a typical Linux system
uses “NTP pools”, lists of publicly available time servers contributed
by public-minded Internet timekeeping system administrators. The pools
are a valuable service, but in order to spare the NTP traffic load on
any given server, they’re managed with DNS round robin. One client that
tries to resolve the hostname
0.pool.ntp.org will usually get a different
result from another client.
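If you want to see the round robin for yourself, something like the following sketch should work. It only uses `socket.getaddrinfo` from the Python standard library; the helper names `resolve_ipv4` and `lookups_differ` are made up for this example, and the `resolver` parameter exists only so the lookup can be swapped out when no network is available.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses that one lookup of hostname yields."""
    # Port 123 is the NTP port; it does not affect which addresses come back.
    infos = resolver(hostname, 123, socket.AF_INET, socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})


def lookups_differ(hostname, attempts=2, resolver=socket.getaddrinfo):
    """Return True if repeated lookups produced different address sets."""
    results = [tuple(resolve_ipv4(hostname, resolver)) for _ in range(attempts)]
    return len(set(results)) > 1
```

On a machine with Internet access, calling `lookups_differ("0.pool.ntp.org")` a few times will often return `True`, though a local caching resolver can hide the rotation for a while.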
As Viliam Holub points out in a two-part series (part 1, part 2), if Apache Cassandra nodes in a cluster independently obtain their time from random pool servers out on the Internet, the chance that two nodes will have widely (by NTP standards) differing times is high. For example, in a cluster of 10 nodes, 50% of the time some pair of nodes will have clocks that differ by more than 10.9 ms. The problem only grows as more nodes are added.
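The growth with cluster size is easy to reproduce in a rough simulation. The sketch below assumes each node's offset from true time is drawn independently from a normal distribution with a standard deviation of 5 ms. Both the distribution and the 5 ms figure are assumptions chosen for illustration, not Holub's measured data, so it will not reproduce the 10.9 ms number. It only shows the trend.

```python
import random
import statistics


def median_max_spread_ms(nodes, sigma_ms=5.0, trials=2000, seed=42):
    """Median, over many trials, of the largest pairwise clock difference."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    spreads = []
    for _ in range(trials):
        # Assumed model: independent, normally distributed clock offsets.
        offsets = [rng.gauss(0.0, sigma_ms) for _ in range(nodes)]
        # The largest pairwise difference is simply max minus min.
        spreads.append(max(offsets) - min(offsets))
    return statistics.median(spreads)


for n in (2, 10, 50):
    print(n, round(median_max_spread_ms(n), 1))
```

The median of the worst-case difference is the value that some pair of nodes exceeds half the time, which is the same kind of statistic as the one quoted above.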
The solution is to take the ntp.conf file that came with your Linux distribution, remove the default “pool” servers, and put in your data center’s own NTP servers.
Instead of lines that look something like:
server 0.fedora.pool.ntp.org iburst
server 1.fedora.pool.ntp.org iburst
server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
use your own servers. So ntp.conf will have “server” lines pointing to your own NTP servers, and look more like:
# begin ntp.conf
# Store clock drift -- see ntp.conf(5)
driftfile /var/lib/ntp/drift
# Restrict all access by default
restrict default nomodify notrap nopeer noquery
# Allow localhost access and LAN management
restrict 127.0.0.1
restrict ::1
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use our company’s NTP servers only
server 0.ntp.example.com iburst
server 1.ntp.example.com iburst
server 2.ntp.example.com iburst
# End ntp.conf
The same ntp.conf can be deployed to all the servers in your data center. Not just Apache Cassandra nodes, but the application servers that use them. It’s much more important for the time to be in sync throughout the cluster than for any node to match some random machine out on the Internet. It’s also helpful to keep the data store time the same as the application server’s time, for ease in troubleshooting and matching up log entries.
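Before rolling the file out, it can be worth checking that no public pool entries slipped through. Here is a minimal sketch of such a check. The function name `public_pool_entries` is made up for this example, and it only looks at `server` and `pool` lines, so treat it as a starting point rather than a full ntp.conf parser.

```python
def public_pool_entries(conf_text):
    """Return hostnames from server/pool lines that point at pool.ntp.org."""
    flagged = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("server", "pool"):
            host = fields[1].lower().rstrip(".")
            if host == "pool.ntp.org" or host.endswith(".pool.ntp.org"):
                flagged.append(fields[1])
    return flagged


sample = """\
driftfile /var/lib/ntp/drift
server 0.ntp.example.com iburst
server 1.debian.pool.ntp.org iburst
# server 2.debian.pool.ntp.org iburst
"""
print(public_pool_entries(sample))  # prints ['1.debian.pool.ntp.org']
```

An empty list means every active time source in the file is one of your own servers.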
Dedicated NTP appliances are available, and might be a good choice for large sites. Otherwise, any standard Linux system should make a good NTP server.
On the NTP servers, you can go ahead and use the “pool.ntp.org” server lines that shipped with your Linux distribution if you don’t have a known good time server. But a good hosting provider or business-class ISP probably has NTP servers that are close to you on the network, and that would be better choices to replace the pool entries.
Your NTP servers should peer with each other:
peer 0.ntp.example.com prefer
peer 1.ntp.example.com
peer 2.ntp.example.com
What happens when the network goes down? In most cases, NTP should just work. Your NTP servers will establish a new consensus time among themselves. Old-school NTP documentation had “fudge” lines to let the NTP server rely on the local system clock if the network connection failed. On modern versions of NTP, the “fudge” functionality has been replaced with Orphan mode.
Add an “orphan” line to ntp.conf on each NTP server:
tos orphan 9
And the NTP servers will do the right thing and stay synchronized among themselves if there’s a problem reaching the servers on the outside.
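Putting the server-side pieces together, the sketch below renders an ntp.conf body for one of the NTP servers. The function name `render_server_conf` and the upstream hostname `time.isp.example.net` are hypothetical. Leaving a server out of its own peer list is a choice made for this example, not something the configuration above requires. The drift file path and orphan stratum 9 are the values used earlier in this article.

```python
def render_server_conf(self_name, ntp_servers, upstream, orphan_stratum=9):
    """Build ntp.conf text for one in-house NTP server."""
    lines = ["driftfile /var/lib/ntp/drift"]
    # Outside time sources, for example your ISP's NTP servers.
    lines += [f"server {host} iburst" for host in upstream]
    # Peer with the other in-house servers (skipping this server itself).
    lines += [f"peer {host}" for host in ntp_servers if host != self_name]
    # Orphan mode keeps the peers in agreement if upstream is unreachable.
    lines.append(f"tos orphan {orphan_stratum}")
    return "\n".join(lines) + "\n"


NTP_SERVERS = ["0.ntp.example.com", "1.ntp.example.com", "2.ntp.example.com"]
UPSTREAM = ["time.isp.example.net"]  # hypothetical upstream time source

print(render_server_conf("0.ntp.example.com", NTP_SERVERS, UPSTREAM))
```

Generating the file from one list of server names keeps the peer lines consistent across all of your NTP servers.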
That’s all it takes. One relatively simple system administration project can save a bunch of troubleshooting grief later on. Once your NTP servers are working, have a look at the instructions for joining the NTP pool yourself, so that you can help share the correct time with others.
© 2016, The Apache Software Foundation.
Apache®, Apache Cassandra®, Cassandra®, the Apache feather logo and the Apache Cassandra® Eye logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.