Back in May I wrote about the poor performance of Microsoft's SQL Server product when doing trivial queries like simple key-value lookups. In such cases the massive overhead of the query engine led to return rates several orders of magnitude lower than many specialized KV systems, such as Memcached or Redis.
In most operational scenarios this isn't a big problem, as most queries are complex enough that the overhead diminishes to the point of irrelevance, and the "best route" of the query plan overcomes that inefficiency.
At the time I was testing with SQL Server 2008 and 2005. I finally got around to redoing those tests with SQL Server 2008 R2.
What a difference an R iteration makes.
Instead of 5,000 key lookups per second per caller, with 2008 R2 I'm seeing more in the range of 200,000+ simple lookups per second per caller. I've replicated this result on several installs.
Remarkably, this has drawn little to no commentary in the community.
For larger queries the performance doesn't show such a deviation from the 2008 baseline, but for simple lookups the improvement really is incredible.
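For anyone wanting to try a similar measurement, the "simple lookups per second per caller" figure boils down to a timing loop like the one below. This is only a minimal sketch: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in engine (an assumption made so the example runs anywhere, not what was tested above), and the table name `kv` and the function `measure_lookups_per_second` are made up for illustration. The numbers it prints say nothing about SQL Server.

```python
import sqlite3
import time


def measure_lookups_per_second(num_keys=10_000, num_lookups=50_000):
    """Time single-row primary-key lookups issued by one caller."""
    # Stand-in engine: in-memory SQLite (an assumption for portability;
    # the figures in the post are for SQL Server, not SQLite).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO kv (k, v) VALUES (?, ?)",
        ((i, f"value-{i}") for i in range(num_keys)),
    )
    conn.commit()

    start = time.perf_counter()
    for i in range(num_lookups):
        key = (i * 7919) % num_keys  # cheap deterministic spread over the keys
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        assert row is not None  # every key in range was inserted above
    elapsed = time.perf_counter() - start

    conn.close()
    return num_lookups / elapsed


if __name__ == "__main__":
    rate = measure_lookups_per_second()
    print(f"{rate:,.0f} simple lookups per second for one caller")
```

Against a real server the same loop would go through a network driver, so the per-query overhead of the engine and the round trip dominates, which is exactly what a trivial key lookup exposes.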
Joe Stump – the former Digg lead architect with the coolest name in tech – posted a peripheral response to my recent entry about SSDs and NoSQL.
The original post was motivated by claims found on Digg’s technology blog.
And on Joe’s post.
Joe has been in the Web 2.0 trenches. He built a solution that powers one of the top sites on the net.
Remember when getting "Slashdotted" was a big deal? Getting on the front-page of Digg makes a Slashdotting-at-its-peak look like a little traffic bump. There are probably a hundred PR reps busy trying to botnet their clients onto the front-page of Digg for every one punished into spamming Slashdot these days.
Far more people know Joe’s all-out-of-bubblegum name than will ever know mine, and rightly so.
Joe comes out of the gate resorting to the venerable old-versus-new tactic: "It's just those old-school DBAs upset that us kids are rewriting the rules," he says in not so many words, while nailing himself and his peers onto a cross, seeking pity for the flames they doth receive for their unconventional, rebellious ways.
This is a bit strange, really. Barely a day goes by lately without Hacker News or Reddit’s /r/programming featuring another front-pager about how the Incredible NoSQL is rewriting the rules of, well, everything. The general demeanour is one that, I think, is far more sympathetic to completely unsupported and undemonstrated pro-NoSQL claims than it is to anything that questions the hype.
Countless NoSQL blogs have appeared (though if you browse them looking for actual content you’ll instead find that most feature few facts but lots of zealous punditry. Advocacy seems to be the primary focus right now). Anyone involved with any sort of NoSQL initiative is spinning off their own start-up to capitalize on this sure-win formula, acting like it’s some sort of magic ingredient that will assure them of success.
It is very reminiscent of the XML heyday – I’m a very big fan of XML in its place, as an aside – when countless start-ups appeared with business models that could be boiled down to “something to do with XML”.
The big database vendors have remained quiet, largely because the minuscule-budget operations all clamouring for their piece of the NoSQL pie aren’t worth bothering with.
“But what about Google, Amazon, and Twitter!” you say. Joe resorted to that same appeal to authority by incanting the same magical trio (say it three times quickly and your TPS rate will quadruple!). Not really much to bother with there, beyond pointing out what a cargo cult is. Your bamboo headset won't make you successful like Google. It really won’t.
Unless you are targeting the same problem space as those companies – say like providing very low performance but highly “scalable” database solutions for countless low-value start-ups – their solution choices are utterly irrelevant.
I'm not a DBA (though knowing how indexes work now strangely qualifies one for such a title). I'm just a technically curious solutions guy who has an innate need to keep asking questions and probing deeper until the Want-To-Believe fog that often hides hype dissipates.
In Joe’s entry he focuses a lot of attention on the costs of RDBMS solutions.
One such argument is that it’s better to use computing hardware as a service than to buy, seemingly implying that while you can buy good hardware to run a RDBMS, it is better to rent less-good virtual hardware to run your NoSQL instances.
Yet leasing is what all the cool kids are doing these days, largely for the same financial reason. Writing it all off beats dealing with depreciation BS, and it makes financial planning a lot easier.
On the leasing front, $600 a month gets you an insanely powerful, makes-an-Extra-Memory-Quadruple-Extra-Large-EC2-Instance-Look-Like-A-Pile-Of-Puke server.
You’ll probably be paying 20x that for every developer you have working on your solutions. Is this really so astronomically high?
That less-than-the-cost-of-the-office-cleaners price tag gets you a server with a bank of striped SSDs that will almost certainly demolish your impressive-in-count-but-not-in-throughput big scale-out cluster, at least with a non-broken RDBMS system.
No really, it will. Of course for any sort of reliable system you’d have to pay for some DB licenses (presuming you aren’t going with PostgreSQL), and then you’ll want to double everything up into mirrors or some other reliable setup, so triple the price.
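To put rough numbers on that, here is the back-of-the-envelope arithmetic using only the figures already quoted: the $600 lease, the "triple the price" allowance for licenses and mirroring, and the "20x that" per developer. The constant and function names are just illustrative.

```python
LEASE_PER_MONTH = 600        # leased server, dollars per month (from the post)
RELIABILITY_MULTIPLIER = 3   # "triple the price" for licenses plus mirroring
DEVELOPER_MULTIPLIER = 20    # "20x that" for every developer


def monthly_costs():
    """Return the monthly dollar figures implied by the post's multipliers."""
    reliable_setup = LEASE_PER_MONTH * RELIABILITY_MULTIPLIER
    one_developer = LEASE_PER_MONTH * DEVELOPER_MULTIPLIER
    return {
        "reliable_setup": reliable_setup,
        "one_developer": one_developer,
        # Share of one developer's monthly cost taken up by the whole setup.
        "setup_vs_developer": reliable_setup / one_developer,
    }


if __name__ == "__main__":
    costs = monthly_costs()
    print(f"Reliable setup: ${costs['reliable_setup']:,} per month")
    print(f"One developer:  ${costs['one_developer']:,} per month")
    print(f"Setup as a share of one developer: {costs['setup_vs_developer']:.0%}")
```

Even tripled, the database tier comes to $1,800 a month against $12,000 for a single developer, or about 15% of one salary line.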
And really, is the $7,500 spent by 37signals on a disk array even worth mentioning? I suspect that sort of number ends up almost as a rounding error on their expense sheets, and given that it's pivotal to their operation – it sits under the very foundations of their business – I doubt they spent many sleepless nights over it.
What sort of rinky-dink operations are we talking about here? Does Digg still qualify as a start-up? Don't they have a payroll and all of that, yet they're clamouring to wire up a collection of discount bin servers?
I posted the SSD entry because SSDs really do fascinate me, and I do think they change a lot of the rules of the game. It just happened to dovetail nicely with my investigation of the Digg scenario, where Digg solved their very real I/O issue by essentially pre-caching every possible query result for a targeted need.
Through extreme denormalization they traded storage to reduce I/O needs.
This is a very important point, because it’s far more pivotal to Digg’s solution than the NoSQL versus RDBMS debate.
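The storage-for-I/O trade can be shown with any database at all, which is rather the point. The sketch below uses Python's built-in sqlite3 module and a hypothetical "which of my friends voted on this story" query; the tables `friends`, `votes`, and `friend_votes` and all the function names are invented for illustration and are not Digg's actual schema. The normalized read needs a join, while the denormalized table answers the same question with a single indexed lookup, at the price of writing extra rows on every vote.

```python
import sqlite3


def make_db():
    """Create an in-memory database with normalized and denormalized tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
        CREATE TABLE votes (user_id INTEGER, story_id INTEGER);
        -- Denormalized copy: one row per (reader, story, friend who voted).
        CREATE TABLE friend_votes (
            reader_id INTEGER, story_id INTEGER, friend_id INTEGER
        );
        CREATE INDEX idx_friend_votes ON friend_votes (reader_id, story_id);
        """
    )
    return conn


def add_friend(conn, user_id, friend_id):
    conn.execute("INSERT INTO friends VALUES (?, ?)", (user_id, friend_id))


def record_vote(conn, voter_id, story_id):
    """Store the vote once, then fan it out to everyone who follows the voter."""
    conn.execute("INSERT INTO votes VALUES (?, ?)", (voter_id, story_id))
    # Extra storage paid at write time: one pre-computed row per follower.
    # (Simplification: only friendships that exist at vote time are covered.)
    conn.execute(
        "INSERT INTO friend_votes (reader_id, story_id, friend_id) "
        "SELECT user_id, ?, ? FROM friends WHERE friend_id = ?",
        (story_id, voter_id, voter_id),
    )


def friends_who_voted_normalized(conn, reader_id, story_id):
    """Answer the question at read time with a join."""
    rows = conn.execute(
        "SELECT v.user_id FROM friends f "
        "JOIN votes v ON v.user_id = f.friend_id "
        "WHERE f.user_id = ? AND v.story_id = ? ORDER BY v.user_id",
        (reader_id, story_id),
    )
    return [r[0] for r in rows]


def friends_who_voted_denormalized(conn, reader_id, story_id):
    """Answer the same question with one indexed lookup, no join."""
    rows = conn.execute(
        "SELECT friend_id FROM friend_votes "
        "WHERE reader_id = ? AND story_id = ? ORDER BY friend_id",
        (reader_id, story_id),
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = make_db()
    for reader in (1, 2, 3):
        add_friend(conn, reader, 10)
    record_vote(conn, 10, 100)
    print(friends_who_voted_normalized(conn, 1, 100))
    print(friends_who_voted_denormalized(conn, 1, 100))
    print(conn.execute("SELECT COUNT(*) FROM votes").fetchone()[0], "vote row(s)")
    print(conn.execute("SELECT COUNT(*) FROM friend_votes").fetchone()[0],
          "denormalized row(s)")
```

Both reads return the same answer; the difference is that one vote by a user with many followers becomes many stored rows. Nothing in that trade requires a NoSQL store.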
Call up your old Digg coworkers, Joe, and have them set up a real database server with a couple of SSD drives and see how it compares with their impressive cluster. I’ll bet Dell would happily lend them a real server.
All of this is a bit humorous, really: the whole point of my original entry on this NoSQL topic was simply to say “what is good for Digg isn't necessarily appropriate for all database needs”, so it’s a bit unfortunate that it has come to this, with Digg’s former architect justifying their decision when Digg was held up as a scenario where NoSQL is likely the perfect solution.
Then, after seeing the Digg case-study, I felt obliged to respond to their RDBMS claims because I saw them as flawed, indicative that the movement should really be called NoMySQL instead of NoSQL. It still doesn’t diminish the correctness of their choice.
But really, while I originally entered into this debate believing simply that NoSQL is being oversold (it is grossly inappropriate for the vast majority of non-Web 2.0 projects), the more I investigate the more I’m coming to think that it is a solution for the rapidly disappearing problem of pathetic I/O rates, at least assuming that you aren’t running on one of the several cloud solutions where that is your only choice.
There are many other differences that come with NoSQL (many strongly questionable, like the oft lauded “no schema” claim for some of the solutions), but the I/O restriction is by far what sold it on the high end, and the high end is what convinced the little guy that it’s the way to go.
I very strongly agree with Joe about one thing: the licensing costs of the big RDBMS products are way too high.
They know that 2% of their potential customer base have giant budgets, and that they can squeeze more from that 2% than they could ever get from the other 98% who then get relegated to fighting over scraps like MySQL.
Not really sure how to solve that problem, but I concede that it is a non-trivial issue. PostgreSQL is probably the best low-to-no-cost database server, but even then quite a few performance features are missing (like real-time materialized views or SQL Server style clustered indexes).