Tuesday, March 09 2010

Digg And Cassandra, sitting in a B-Tree

Digg recently started transitioning parts of their platform to the Cassandra open-source, Facebook-originated NoSQL solution.

They're the perfect customer for NoSQL: The value per user and transaction is very low, demanding solutions that allow them to scale at minimal cost; some data loss or inconsistency can be accepted; and a lot of the data can be effectively siloed into islands.

Nonetheless, the article they posted about the move is filled with the sort of thinking that has littered the web with misinformation about the relational database.

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes

The relational database "mindset" imposes no such burden.

It's All About Finding Balance

Indexes, for instance, are a rudimentary tool of every competent database user. Each additional index adds an expense to every write to the table, forcing row changes to update every index in addition to the base table, in return easing certain read scenarios.

You apply as appropriate, striving for the perfect balance between read and write performance.

I posted parts I and II of a very simple "introductory to databases" article back in 2005 (never getting around to finishing part III), and I strongly encourage it for anyone who doesn't understand how indexes work, or how important concepts like covering indexes are (which I'll touch upon later in regards to the Digg scenario).

Many relational database users make heavy use of triggers and cascade activities that slow writes while lubricating reads. While many are wary of triggers in general (especially where business logic gets embedded in the data layer), this is common in the relational database world and makes an appearance in most solutions.

For Digg's particular scenario, however, the RDBMS analogy to their NoSQL approach is a basic materialized view (aka indexed view), which is a feature of most RDBMS products, from big to small.

Implementing materialized views adds a sometimes substantial cost to writes in return for supercharged reads. If I have a particular set of joins and functions that are queried often, I can materialize the view with the appropriate indexes and every change to any of the source tables automatically, as an added cost of the DML, updates the materialized view as well.

Some RDBMS systems support deferred materialized view updates where it automatically queues up the view changes without adding cost to the origination tables.

This is very old hat for virtually anyone competent with relational databases, though real-time materialized views need to be used judiciously because they fall under the auspices of ACID and can front-load write operations significantly.

Digg Don't Do Indexing (properly)

Ignoring that obvious solution of materialized views (which, to be fair, aren't natively supported by MySQL despite being a basic feature of most other database products), it is revealed that they aren't using the database in the appropriate manner — or that MySQL is simply a broken platform and is turning people against the RDBMS when really they should be against MySQL — when they note that they are manually performing the joins in PHP, claiming that the join takes too long to run as a simply query.

A likely contributor to their poor performance is that while they've made the artificial key "id" their clustered index, their userid/friendid index is only a unique index, and I suspect, from the operation of their site, that they are likely making use of the denormalized friendname column in their consumer as well, forcing a full row lookup for every match.

If they retrieve columns outside of a non-clustered index (the most common mistake is doing a "SELECT *" when you don't actually care about all of the columns), on every lookup match the database server pulls the row id (in this case the primary key) and then has to do another lookup for the actual row data. In their case — given that the relationship is unique — they should have made the compound key of user_id/friend_id the clustered index and eliminated the id column altogether.

This oversight means that instead of doing a simple partial index scan by user_id and pulling the limited set, the query engine is forced to pull the list of rows, and then lookup each and every row individually. So someone with 400 friends yields 400 IO cycles, versus 1 with a proper index.

The same problem exists in the Diggs table, but is made worse. The userid index is of limited value given that again it only helps them look up the surrogate record key (again, why not a primary key on itemid/userid with a secondary index of userid/itemid? Surrogate keys are usually a mistake if there's another unique key on the table, though of course it depends upon the scenario: foreign-keys or numerous secondary indexes might make such a simple key the best choice). The query engine is forced to lookup the records by either the itemid or the userid (by the friendid) and then lookup the root record, and then compare the corresponding value.

So many developers are so blissfully ignorant of how databases work, quick to ascribe their own shortcomings to the platform. Most will wave their hands and talk about how hard to come by a "good DBA" is, which is akin to pushing brutal bubble-sort algorithms and just distributing them across a MapReduce deployment, claiming that a good "sort algorithm guy" is hard to find and "scaling out" is what the big boys do.

So they could see a major performance improvement by indexing properly (I'm allowing that maybe they just gave a bad example, though their atrocious query performance seems to validate its accuracy), but even then looking up hundreds of seemingly randomly distributed records can be a costly exercise.

Change Is In The Air

Let's step back for a minute and ignore materialized views and appropriately created and used indexes and look at the core performance issue that Digg faced — looking up several hundred rows in the Friends table, and interrogating the Diggs table by userid/itemid for the same. Presume that the dataset is very large and it can't be cached in memory, which should be a normal design assumption.

Why is looking up several hundred randomly distributed records such a big deal?

Hard Drive

That's why. Most hard drives can only manage to seek to different locations on the disk about a hundred times per second. If you're relying on Amazon's EBS you have it even worse, with an esimated 72 IOPS per second.

That's slow.

Imagine that the query engine has a hundred row locations in hand; It would take it a full second to jump over the disk to gather up the data necessary to retrieve the contents of those rows. That's a best case scenario because in the real world it usually has to walk the index b-trees, find the matching data, and in Diggs' inappropriately indexed table case do yet another lookup to find the actual row itself.

This is why database systems often completely ignore indexes if the estimated match count exceeds a relatively small percentage of the data, anemic storage systems forcing them to do expensive operations like full scans because in the end it's a cheaper choice. Why it often just reads and filters a burst of MBs of data rather than select a few sparse records from an index.

It's why it's desirable to have the data in RAM, and why database servers should be loaded with copious memory. [Sidenote: It's also why denormalizing can paradoxically slow down a database in many scenarios because it grows a database beyond RAM unnecessarily. In the Digg case note the username and friendname fields in the Friends table]

The IOPS weak-point is why most enterprise databases add SANs with ranks and ranks of hard drives, ganging them together in such a way that many seeks occur simultaneously, vastly increasing the I/O rate.

A more attainable and far more disruptive advance is moving into reality, however, and that is SSDs.

Take a look at the Anandtech review of the OCZ Vertex LE 100GB MLC SSD. In particular look at the 4KB random read - MB/s results on page 10. Near the bottom are a couple of magnetic disks, including the esteemed VelociRaptor, which are absolutely decimated by the SSDs.

That is the test that is most applicable to the Digg scenario, and it is clearly evident how big of an impact it would have on their situation.

Instead of 100 IOPS, they would be looking at 15,000 IOPS. Put 6 of these in a RAID-10 array and you'd have a yield of 45,000 IOPS and reliability. Even without learning how to properly index they could see an easy 5000x performance improvement in that class of RDBMS queries. Add a materialized view and...the speed would be so obscene it would get banned from the App Store.

Those units are just $400 a piece, and the technology keeps getting bigger and faster and cheaper. SSDs are a deeply, deeply disruptive change, especially to the large-scale database world.

The drive I mentioned is an MLC unit that isn't intended for the enterprise market, but in some ways it fits the same role as NoSQL — less reliable, but it gets the job done. The nature of the Digg table (that it is largely an additive table with likely little churn) is the perfect use-case for an MLC SSDs.

And really, 100GB is a lot of space for an operational database, even for a social media site. While it isn't appropriate for Facebook's 25TB "figure out how to sell you junk you don't want" daily activity log, it is certainly adequate for all of the Diggs and Friend relationships Digg would need, especially when removing denormalization that was put in place because of the poor IOPS of magnetic disks. And of course with the magic of RAID you can scale it up to whatever heights you'd like.

For $400.

Soon we'll have even faster, larger drives that are cheaper, and so on. The nature of flash technology is that they can keep making it more and more parallel, so the IOPS are going to keep going up and up and up.

Optimizing against slow seek times is an activity that is quickly going to be a negative return activity. Many who embrace NoSQL are seeking a solution to yesterday's problem. Digg, for instance, yields their entire NoSQL benefit from optimizing data locality — that all data for a given need is nicely bunched together, which of course is what materialized views do as well.

There Are Incredible MPP Options in the RDBMS World...But They'll Cost You

The people who really demand high levels of database performance usually have a lot of money. Which is why many of the products that deliver options like column-oriented storage (an implementation detail of a RDBMS that is primarily suited to very large-scale column aggregations. It isn't suitable for a OLTP DB), or MPP (Massively Parallel Processing), cost absurdly high amounts.

Greenplum, Vertica, TeraData, parAccel, Oracle RAC, Sybase ASE, DB2 MPP...these things are often priced out of all but the largest enterprise's reach.

Look at the pricing of the upcoming release of SQL Server 2008 R2, in particular the Parallel Data Warehouse product that brings MPP to that server. $58K per processor, which obviously excludes it from contention for the vast majority of applications.

Come on.

If there is one thing that I would like to see come out of the NoSQL advocacy movement, it would be that mainstream databases feel the pressure to push down the functionality that they currently limit to the people with the biggest bank accounts (which they sell using the "how much do you have?" pricing model).

 SQL  NoSQL 
   

Reader Comments

Another brilliant entry. You are obviously back from your blogging retirement.
Jeremy @ 3/9/2010 3:43:50 PM
Commercial RDBMs are often not even on the radar for a small cash-strapped media startup.

When your options are MySQL, PostgreSQL, or NoSQL, NoSQL seems like an attractive hedge against future growth.

Agree completely, RDBMs would get less of a bad rap if their pricing wasn't so insane.

I do find it somewhat amusing though, we've been having to scale managing assets in the region of 500-1GB per asset (1 asset = thousands of records) on SQL Server, in databases with thousands of assets, and still have it yield decent performance, and lose/corrupt no data, as its in the manufacturing sector.

Developers with web-only eyes are a blight.
James @ 3/10/2010 1:04:24 AM
You have my complete agreement, James.

It is sort of remarkable how the universe of so many RDBMS detractors has been completely defined by MySQL, when time and time again it has shown to be one of the weakest players in that field.

I completely understand the reasoning behind the choice of that product (financial, ease of entry, etc), but it is a bit disheartening how frequently deficiencies of that single product are held up and heralded as deficiencies of non-NoSQL solutions, when it is nothing of the sort.

Many anti-RDBMS pieces are more accurately anti-MySQL pieces that overstep their bounds.
Dennis Forbes @ 3/10/2010 4:33:05 AM
Dennis,
I'm loving your posts on NoSQL, very grounded and objective, as apposed to all the hype. One thing I wanted to add to the 'cost' barrier of entry issue. In addition to the cost of 'enterprise' RDBMS products, there's also the cost of their maintenance. In the context of a small start up, there is probably not going to be the funds to either hire a full time DBA(s) or send a developer out to get trained to ensure that they have someone with the domain knowledge of optimizing a database.

I don't think it's necessarily fair to fault a developer for not having some of the more in depth knowledge for database optimization, but I am totally on board for calling BS when someone without this type of knowledge makes sweeping statements about the deficencies in paradigms they don't understand.

So I think another good argument for when to use a NoSQL solution (similar to the ones you've already offered) may be "We're a small team of 3 software developers, we can't afford to bring someone on board solely for their RDBMS knowledge, so we're going with a solution that we already know how to tune properly".

-Mark
Mark Roddy @ 3/10/2010 8:17:11 AM
I agree with you that MySQL is NOT slow.

It's also true that the heavyweight features there sometimes take people down the rabbit hole and kill performance: for example triggers and lots of consistency constraints can be expensive. You don't have these philosophically in NoSQL. You *could* choose not to use them in RDBMS world and get pretty darn good single srever performance. But really, things like that are core to the orthodox philosophy there: if moving away from that, why not try something new? And on nosql lighter-weight data tiers is the orthodoxy.

I have done tests where MySQL has performed great, and others where things like MongoDB performed better. When MongoDB did better, it was often because the data was effectively denormalized: one MongoDB document equaled 10 normalized rows in mysql. Mysql has been tuned a lot of the years, but storing the data in a different matter can get one to some good speeds.

All that said, to me the real motivation for NoSQL usually isn't higher single server performance, but (1) horizontal scalability; (2) easier elasticity, replication, failover; and (3) easier coding (easier mapping from object data model and schemaless capabilities).
dwight_10gen @ 3/11/2010 5:55:13 PM
Thank you for the insight and clarity added to a very muddy discussion. I am co-founder of a company looking to launch a product in April/May time-frame. MySQL has been the workhorse of the product, simply because of cost. The product and company is bootstrapped, and $50k for an RDBMS is outside the remotest realm of possibility.

With that said, my two biggest worries are scalability and simplicity, because our coding department, administration department, graphics department, and tech support department all have one member: me. Toward this end, "we've" been looking at NoSQL favorites such as CouchDB, MongoDB, or Cassandra to lend a hand with non-numeric data. At this point, it's just an idea, though... seeing as I'll need to close myself to implement it.

On another note, James put forward a very strong statement: "Developers with web-only eyes are a blight." James, you might have years of experience w/ rich RDBMSes but that doesn't mean everyone else has. And there are tons of people developing with web-only eyes, enjoying it, and making a living from it. If you'd like to make such a strong statement, at least post a link back to a site where you can help educate the rest of us to your enlightened world view. I for one would welcome more information on the topic.
Daniel @ 3/12/2010 8:38:20 AM
A couple points to add.

"So many developers are so blissfully ignorant of how databases work..." in the case of MySQL, you can't lay all the blame for that on the developers. MySQL has virtually no usable instrumentation. Figuring out what it's really doing, which is necessary for figuring out how to improve things, is way too much work and actually requires reading source code sometimes.

In response to Mark, "there is probably not going to be the funds to either hire a full time DBA(s) or send a developer out to get trained to ensure that they have someone with the domain knowledge of optimizing a database" I find this to be a flawed argument. What is the alternative? Train someone on a new product that is much less popular and has much less mindshare? How big is the candidate pool for CouchDB experts? And what if you can't do everything with CouchDB, but you need a little MongoDB here and a little Cassandra there, too? How many esoteric skills are you going to be lucky enough to hire for? I think that a big reason people overload MySQL with things it's not good at is that they can actually find resources to support it.
Xaprb @ 3/13/2010 2:18:29 PM
Sure would love to see Part III...
jeo @ 3/14/2010 6:25:51 PM
Dennis,

Great post. I find this through a friend who shared it via Google Reader.

This debate has been happening on my site as well, most recently I gave space to a friend who is on the NoSQL side of the house in <a href="http://www.oraclenerd.com/2010/03/case-for-bit-bucket.html">The Case for the Bit Bucket</a>.

I've theorized for a couple of years now that the move towards NoSQL or ORM tools is precisely because application developers don't know how to properly utilize a database. I like some of the technology, and it has it's place, but at a cost, which most people don't quite understand.

chet
chet @ 3/21/2010 9:12:24 PM
Dennis,
Great article.

Chet,
Your theory is exactly what I've seen as well. Today's web developers seem to have an aversion to all things SQL. For some reason they absolutely don't want to learn the language beyond the simple CRUD statements and certainly have no clue about performance tuning.

I've now witnessed 3 different projects at different companies where they jumped on an ORM bandwagon (LINQ, nHibernate, etc) only to deliver a product that was barely usable by 100 simultaneous users. Then promptly swear it was all problems with the database server without bothering to check what was going on. The "obvious" next step to the visionaries on these projects was to add more servers..

By comparison, I've also witnessed a few well run (ie: no orm and dev's had great database skills) projects which handle thousands of simultaneous users without breaking a sweat and without having to "scale out".

It boils down to having an Architect on the team who actually knows what they are doing versus a bunch of people that jump on the latest language garbage without grokking what they are doing.
Chris Lively @ 3/23/2010 7:07:22 AM
Look.

If I have to hire a $100K (really, post taxes, $200K) engineer to manage my $50K DB, or a $50K (thus $100K) engineer to manage my $0K DB, guess what?

I'm going with the $100K expense over the $250K expense, no question.
Dan Kaminsky @ 3/26/2010 1:37:44 AM
As a sidebar issue, why do people choose MySQL over, say, PostgreSQL? Daniel, why did your company start with MySQL? The tenor of my question is nothing like, "Why on Earth would anyone use MySQL?", but rather, "Very smart people from time-to-time choose MySQL over PostgreSQL. There must be good reasons. I wonder what they are."

Best,
David
Dave Ventimiglia @ 4/6/2010 3:23:19 PM

Add Comment

Name *:

Email Address:

(your email address is not displayed)
Website:

Comment *:



About the Author
Dennis Forbes Dennis Forbes is a Toronto-based software architect. While focused primarily on the .NET and SQL Server worlds, Dennis frequently ventures outside of this comfort zone into game development and image processing. He has been published in several industry magazines, has been quoted in the Wall Street Journal and has been interviewed by NPR.

He is a vice president and lead software architect at an innovative New York City hedge fund back-office services firm.

Dennis has been working on solutions for the financial, telecommunications, and power generation markets for over 15 years.





 

Dennis Forbes