Adding Context to CockroachDB's Article "Your Database Should Work Like a CDN"

I was excited this morning when I checked Hacker News and saw an article from Cockroach Labs titled “Your Database Should Work Like a CDN”. I’m a big fan of CockroachDB and enjoy their blog posts. I clicked on this one, but came away disappointed. I don’t think it fairly contrasted Cockroach with its competitors, and it made some incomplete arguments instead of selling Cockroach on its merits.

In this post I will analyse sections of the article and add more context where I think some was left out.

Analysis

Availability

To maximize the value of their services, companies and their CTOs chase down the elusive Five Nines of uptime

Only companies with sophisticated operations teams can seriously set an SLA for five nines. It’s doable, but it comes with a heavy lift if your service is doing anything non-trivial. It’s certainly not the default position in the industry, as far as I can tell. The Google SRE book has a good section on this.

For many companies the cost of moving from a single region deployment to a multi-region one is too great and doesn’t provide enough benefits. Some services don’t actually need to be available to five nines, and if not done well, moving to multi-region deployments may make your system less fault-tolerant, not more.
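For a sense of scale, here’s a quick back-of-the-envelope calculation (a minimal sketch; the minutes-per-year figure is approximate) of how little downtime each extra nine leaves you:

```python
# Approximate downtime budget per year for each availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} ({availability:.3%}): ~{budget:.1f} minutes of downtime per year")
```

Five nines leaves you roughly five minutes per year: not much room for a botched multi-region failover that takes longer to debug than the outage it was meant to survive.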

This is particularly crucial for your customer’s data. If your data’s only located in a single region and it goes down, you are faced with a “non-zero RPO”, meaning you will simply lose all transactions committed after your last backup.

If a whole region dropped into a sinkhole in the earth, then you would lose all transactions after your last backup. However, that’s not a failure scenario that we tend to worry about (although maybe we should?). Every time a major cloud provider has had a zone/region go down, data was partially/wholly unavailable during the outage, but no committed data was lost once service was restored (that I know of?).
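To make the RPO point concrete, here’s a toy calculation (the timestamps are invented for illustration) of the worst-case data-loss window under a backup-only, single-region strategy:

```python
from datetime import datetime

# Hypothetical example values -- not from any real deployment.
last_backup = datetime(2018, 5, 1, 12, 0)
region_lost_at = datetime(2018, 5, 1, 12, 47)

# Everything committed between the last backup and the failure is gone.
data_loss_window = region_lost_at - last_backup
print(f"Worst-case loss: all commits in the last {data_loss_window}")  # 0:47:00
```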

…and Bureaucracy?

This impending legislation requires that businesses receive explicit consent from EU users before storing or even processing their data outside the EU. If the user declines? Their data must always (and only) reside within the EU. If you’re caught not complying to GDPR? You’ll face fines of either 4% of annual global turnover or €20 Million, whichever is greater.

I’m not a GDPR expert, but this seems like an incomplete summary of the GDPR rules around processing, especially as the processing rule applies only to the processing of an EU user’s personal data (AFAICT). It also conflates requiring consent to store or process personal data at all with requiring consent to store or process it outside the EU. From Article 6:

Processing shall be lawful only if and to the extent that at least one of the following applies:

a) the data subject has given consent to the processing of his or her personal data for one or more specific purposes;

b) processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract;

c-f) [Public/government interest exceptions]

If you have entered into a contract with the data subject, and that data is necessary for the contract, then I think you are fine (IANAL). As far as I can tell, transfers to Third Countries are OK as long as there are appropriate safeguards.

This is the part of my critique that I am least confident about; I’d welcome links confirming or rebutting it.

When you take GDPR in the context of the existing Chinese and Russian data privacy laws––which require you to keep their citizen’s data housed within their countries […]

If you follow the links in the article, you see this for China:

Does the Cybersecurity Law require my company to keep certain data in China?

[…] To that end, the Cybersecurity Law requires “critical information infrastructure” providers to store “personal information” and “important data” within China unless their business requires them to store data overseas and they have passed a security assessment. At this point, it remains unclear what qualifies as “important data,” although its inclusion in the text of the law alongside “personal data” means that it likely refers to non-personal data. […]

“Critical Information Infrastructure” providers are defined a bit more narrowly, but the law still casts a fairly wide net. […] the law names information services, transportation, water resources, and public services, among other service providers, as examples.

and this for Russia:

3.6 Generally, to transfer personal data outside the Russian Federation, the operator will have to make sure, prior to such transfer, that the rights of personal data subjects will enjoy adequate and sufficient protection in the country of destination.

Some companies will need to store data on Chinese or Russian servers, but the laws here are far narrower than “if you have any data from a Chinese or Russian person or company, you must store it in their country”.

Managed & Cloud Databases

Managed and Cloud databases often tout their survivability because they run in “multiple zones.” This often leads users to believe that a cloud database that runs in multiple availability zones can also be distributed across the globe.

It might be an assumption you would make if you have no experience with databases, but it’s not one that I’ve ever seen a software engineer make.

There are caveats to this, of course. For example, with Amazon RDS, you can create read-only replicas that cross regions, but this risks introducing anomalies because of asynchronous replication: and anomalies can equal millions of dollars in lost revenue or fines if you’re audited.

I’m not sure how asynchronous replication lag could result in failing an audit and incurring millions of dollars of fines. I spent a few minutes trying to come up with a scenario and couldn’t. Losing revenue from users also seems speculative; I’m not really clear how this would happen.

Designing a system to run across multiple regions with asynchronous replication is certainly not trivial, but people do it every day. If they were losing millions of dollars from it, they would probably stop.
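To be fair, the underlying anomaly is real: with asynchronous replication, a replica can serve stale reads until the primary’s writes catch up. Here’s a toy simulation of that (none of this is a real database API; the class names and the fixed half-second lag are assumptions for illustration):

```python
import time

class Primary:
    """In-memory stand-in for a primary that replicates asynchronously."""
    def __init__(self):
        self.log = []  # (visible_at, key, value)

    def write(self, key, value, lag=0.5):
        # The write commits immediately on the primary, but only becomes
        # visible on the replica after a simulated replication lag.
        self.log.append((time.time() + lag, key, value))

class Replica:
    """Read replica that only sees log entries whose lag has elapsed."""
    def __init__(self, primary):
        self.primary = primary

    def read(self, key):
        now = time.time()
        visible = {k: v for visible_at, k, v in self.primary.log
                   if visible_at <= now}
        return visible.get(key)

primary = Primary()
replica = Replica(primary)

primary.write("balance", 100)
print(replica.read("balance"))  # None -- stale read: the write hasn't arrived
time.sleep(0.6)
print(replica.read("balance"))  # 100 -- visible once the lag has elapsed
```

This is the read-your-writes anomaly the article is gesturing at; my point is not that it doesn’t exist, but that applications routinely tolerate it or design around it.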

In addition, this forces all writes to travel to the primary copy of your data. This means, for example, you have to choose between not complying with GDPR or placing your primary replica in the EU, providing poor experiences for non-EU users.

Again, I don’t think GDPR requires this.

NoSQL

For example, NoSQL databases suffer from split-brain during partitions (i.e. availability events), with data that is impossible to reconcile. When partitions heal, you might have to make ugly decisions: which version of your customer’s data do you choose to discard? If two partitions received updates, it’s a lose-lose situation.

This paints all NoSQL databases with a broad brush. While some NoSQL databases are (were?) notorious for losing data, that is not inherent to NoSQL databases.

Certainly, if you are using an AP NoSQL database, you need to design your application to correctly handle conflicts, use CRDTs, or make idempotent writes. Partitions do happen, and it’s not trivial to handle them correctly, but neither is it the case that you always need to discard your customer’s data.
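As a concrete example of the CRDT approach, here’s a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and merging after a partition heals takes the element-wise maximum, so no update is ever discarded. (This is illustrative only; the node names are arbitrary.)

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merged by element-wise max."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        # Each node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max never loses an increment from either side.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas take writes while partitioned from each other...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)

# ...and when the partition heals, merging in either order converges.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Because the merge is commutative and idempotent, it doesn’t matter in what order replicas reconcile, and nobody has to choose which customer’s data to throw away.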

Sharded Relational Databases

Sharded Relational databases come in many shapes and suffer from as many different types of ailments when deployed across regions: some sacrifice replication and availability for consistency, some do the opposite.

I assume the author is referring to systems like Citus. I don’t have enough experience with systems like this to judge the assertion, but this seems fair.

Conclusion

If you do need more reliability/availability than is possible from a single region, then Cockroach is a strong option to consider. I think a far better argument for CockroachDB and against NoSQL, Replicated SQL, and Sharded Relational DBs is minimising complexity and developer time. It is possible for developers to design their application for the nuances of each of these databases, but it’s certainly not easy or cheap, especially if you want it to be correct under failure. The reason Google created Spanner (the inspiration for Cockroach) was that developers found it hard to build reliable systems with weak consistency models.

[…] It is much easier to reason about and write software for a database that supports strong consistency than for a database that only supports row-level consistency, entity-level consistency, or has no consistency guarantees at all. - Life of Cloud Spanner Reads & Writes

CockroachDB provides consistent reads and writes, supports SQL, and can be deployed across multiple regions and in any datacenter. That’s a compelling set of features. If your application can handle the latency and performance tradeoffs it makes (which are getting better all the time), it will let you write software against a consistent datastore without spending as much time reasoning about hard consistency problems. Cockroach is a great product, and I think it stands well on its merits.