Max Schireson, president of 10gen (the makers of MongoDB), has made the case for document-based data systems – such as MongoDB and CouchDB – by arguing against the heavily-normalized relational model.
Max offers up his entry as a challenge to the “relational-is-always-best set”, asking them to prove that the complexity of storing data in a relational form is worth the trouble, at least for the scenario he describes.
Given that I’ve been anointed as an anti-NoSQL crusader on a number of occasions, I feel obligated to argue on behalf of the relational model, which I will do in a later entry.
I say that despite being a big fan of MongoDB. As I have done many times in the past, I encourage everyone to download and play around with the excellent MongoDB product. Do yourself a favour by running through the tutorial.
All things have a place.
I once sat in a meeting where a peer described the purportedly intractable complexity of a task they were failing at. They did this by drawing the various actors on the whiteboard and then detailing their many complex relationships.
Imagine the best path-finding algorithm. Now imagine the opposite: the least efficient, most unnecessarily sloppy routing imaginable.
That was how complexity was deceptively exaggerated, with absurdly circuitous relationship lines weaving to and fro. It was comical.
That memory, and the realization that the deception goes both ways, came to mind while reading Max’s entry, and again when reading the linked entry by MongoDBer Kyle Banker.
When comparing the document model with the relational model, many if not all examples seem to contrast a complex relational model – one that encapsulates an end-to-end platform for a whole domain – against a trivial island of a tiny subset of data in a document structure. The former is usually built to support entire operations and systems, while the latter tends to be crafted for one single purpose (like "allow customer services to look at an order", as was used in Max's scenario).
Max highlights relational complexity by pointing to an Oracle end-to-end order reference platform containing “126 tables”. Kyle does the same thing when comparing a simple could-be-one-single-row document (which humorously includes four relationships, each of which would require an expensive round-trip to the MongoDB server to resolve, given the platform’s bizarre lack of server joins) against a complex catalogue schema. Both explain their arguably deceptive comparisons with statements like “Of course, this is not a complete representation of a product”…
I would argue that in such a case the comparison shouldn’t be made at all. Why contrast an incomplete example of a document-based implementation – simplistic in its useless innocence – against a fully scoped relational platform?
It is the “MySpace angle” used to hide the ugly reality of technology. If you have a MongoDB equivalent of the compared product, have at it, but simply hiding the ugly details and zooming in on a non-functional, cherry-picked subset just misleads potential suitors.
Realtors use this trick when taking photos of homes, showing just enough of the grass while avoiding nearby structures. Your mind naturally extrapolates, imagining expanses of lush green fields, when in reality there’s usually another house imposing itself four feet over.
I have a full workload right now, but in the near future, during a mental lull, I will respond to Max. There is a very compelling counterargument to be made.
One of the most referenced papers in software development has to be Dijkstra's seminal paper titled "Goto Statement Considered Harmful".
Dijkstra didn't actually write the title; it was the creation of an editor as the piece made its way into an ACM publication. It was changed from its original title of "A case against the goto statement".
While the core of the essay is indeed that the goto statement can be harmful, Dijkstra wasn't making an absolute statement (as is commonly claimed, absolutism being a tendency of far too many in this industry). Instead he was commenting on the abuse of goto that was occurring in the industry, calling for a sober evaluation of where it is appropriate and, more importantly, where it is not.
Nonetheless, the meme was created and has been reused and abused in innumerable Considered Harmful declarations since.
A month or so back the development webosphere was awash with references to Scott Hanselman's excellent blog, all excitedly linking (rel="titillating"?) to his piece titled "The Weekly Source Code 13 - Fibonacci Edition". This was particularly common in the .NET community, with many linkers describing it as an elucidating example of the many advantages of .NET 3.5 / C# 3.0.
I perused the entry, always eager to absorb that sort of information, but found it less than perfect. I withheld critical comment, hoping it would all just blow away.
Then this morning I opened up Visual Studio and happened to notice a link to his entry on the Start page.

Maybe it's been there for a while (the last date is pretty old) and I just didn't notice it before, but the title used on the Start page pushed me over the edge, coercing me to comment.
There are several issues I have with Scott's Fibonacci entry.
First, the C# 2.0 version (henceforth I'm dumping the minor-version precision on the language versions) is oddly dumbed down: C# 2 also has the ternary operator, and it even has anonymous functions (including closure functionality). Yet the demonstrations given contrast the simplest possible C# 2 implementation with the most obtuse C# 3 example.
Basically the only novel difference with the C# 3 example is that it uses a lambda, though of course it would be an absolutely terrible thing to use a lambda for.
It's not a very good example of the implementation differences between the versions, which is the claim made by the Visual Studio start page, and was the description often used during the dissemination of this piece.
I like C# 3, but this isn't a good demonstration of any advantage of the language.
Worse yet, the only place you'll ever see recursion used to calculate Fibonacci numbers is in "Recursion for Dummies" type examples. To understand why that is, consider Scott's C# 3 example, which he leads into with the statement "Now, here's a great way using C# 3.0".
Here's a logarithmically scaled chart of the number of function calls necessary to calculate Fibonacci numbers in the C# 3 example Scott gave.

Obviously it gets unusable pretty quickly. Try calculating the 90th Fibonacci number using recursive algorithms...
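To put numbers on that growth, here is a minimal sketch that counts the calls made by a naive doubly recursive Fibonacci. It is written in Python rather than C#, and it assumes the example under discussion has the usual fib(n - 1) + fib(n - 2) shape; the function names are mine.

```python
def naive_fib(n, counter):
    """Doubly recursive Fibonacci; counter[0] tallies every invocation."""
    counter[0] += 1
    if n <= 1:
        return n
    return naive_fib(n - 1, counter) + naive_fib(n - 2, counter)


def count_calls(n):
    """Return (fib(n), number of calls the naive recursion made)."""
    counter = [0]
    value = naive_fib(n, counter)
    return value, counter[0]


def predicted_calls(n):
    """Call count without running the recursion: 2 * F(n + 1) - 1."""
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n + 1):
        a, b = b, a + b  # after the loop, a == F(n + 1)
    return 2 * a - 1


if __name__ == "__main__":
    for n in (10, 20, 30):
        value, calls = count_calls(n)
        print(n, value, calls, predicted_calls(n))
    # Far too slow to measure directly, so only the prediction is printed.
    print(90, predicted_calls(90))
```

For n = 30 the recursion already makes about 2.7 million calls, and the predicted count for n = 90 is roughly 9.3 quintillion, which is why the measured loop stops at 30.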
In the same way that Goto can be harmful, the use of recursion is often a sign of badness, and this is no exception. Epic inefficiency is used instead of the obviously simple approach.
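long CalcFibonacciNumber(long n)
{
    // F(0) = 0; the loop below assumes n >= 1.
    if (n <= 0) return 0;

    long current = 1, previous = 0, swapholder;
    while (n-- > 1)
    {
        // Slide the (previous, current) window forward one step.
        swapholder = previous;
        previous = current;
        current += swapholder;
    }
    return current; // fits in a long up to n = 92
}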
(Ignoring mathematical shortcuts)
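The original does not say which shortcuts it has in mind. One well-known candidate is the fast-doubling identity, F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2, sketched here in Python next to the plain loop for comparison. The function names are mine.

```python
def fib_iterative(n):
    """Linear-time loop, equivalent in spirit to CalcFibonacciNumber above."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_fast_doubling(n):
    """Fast doubling: O(log n) arithmetic steps instead of O(n)."""
    def pair(k):
        # Returns (F(k), F(k + 1)).
        if k == 0:
            return 0, 1
        a, b = pair(k // 2)      # a = F(m), b = F(m + 1), with m = k // 2
        c = a * (2 * b - a)      # F(2m)
        d = a * a + b * b        # F(2m + 1)
        if k % 2 == 0:
            return c, d
        return d, c + d
    return pair(n)[0]


if __name__ == "__main__":
    print(fib_iterative(90), fib_fast_doubling(90))
```

Somewhat ironically this version is recursive, but the recursion depth grows with the logarithm of n and each level makes exactly one recursive call, so the work does not fan out the way the naive version does.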
A lot of readers will be rolling their eyes right about now, muttering something along the lines of "Awww, come on...you didn't seriously think anyone thought that recursion was a good way to calculate Fibonacci numbers, did you? This is beginner's stuff, and no one really thinks that's the right way to do it!"
I'm optimistic about the profession, so no, I didn't really think it was a serious example (though I do think it nonetheless deserves some serious warnings to ensure no one becomes misled).
WARNING: The Code Contained In This Example Will Rot Your Brain. Never Do Something Like This In Real Life. Don't Let Peers See You Looking At Code Like This. Suspend All Critical Thought While Reading This Piece.
Instead it's a sample of "here's a demonstration of how to do something absolutely terrible — almost felony worthy — in a variety of programming languages....".
This is still a serious problem.
The example given is so very wrong — even if it is what's used in Recursion for Dummies books — that it makes it close to impossible to focus on the actual point being made, even if it had used comparable features of each language to demonstrate how the same task could be accomplished in each.
It reminds me of many early web service tutorials and advocacy pieces: Many used absurd examples like "a web service to add two numbers" (and amazing variations such as subtract two numbers, multiply two numbers, divide two numbers, compute the Log10 of a number, and so on. You get my point — things for which a web service would be entirely unsuited).
Stop it!
Stop with the ridiculous no-one-would-(or rather should)-ever-do-it-this-way examples. It completely undermines the value of the examples.
Surely there are realistic examples that would be more appropriate for demonstrating the advantages of lambdas (recursion {is recursion}; [goto {is recursion}], so there isn't much enlightenment provided there). How about "how to build a rudimentary regular expression parser in a variety of languages", or for a web service "pulling weather data from a remote weather station".
Something that a developer isn't going to have to slog through with their brain fighting them on every line, demanding an explanation for the terrible design or algorithm they're supposed to accept at face value.
I recently opted to throw together my own blog software (after going through the standard Build or Buy analysis), expediting deployment as a means of forcing follow-through. The goals of this micro-project were to improve the authoring and content management experience, to improve searchability of the content (without having to cast content out from the blog to a static form), and to improve the usability and navigation from the user's perspective (for instance the classic "date" navigator common on most blogs is something that I've opted to remove).
Despite having close to no time to allocate to this task, my tendency to over-engineer still showed through: The easiest option would have been a content-management system defined entirely in code (it's as easy for me to change and deploy code as it is to change templates and metadata), and of course to build it for a single author. Instead it supports many blogs through the same URLRedirector, and blog aggregations (where a blog is a publication of a set of blogs, each with distinct authors), each using its own templates and configurations.
Which brings me to templates. I failed to find a decent Smarty-type templating system for .NET (basic ASPX is really a templating system, but I mean something that can enumerate sections, retrieving data based upon an object structure of relationships and containment).
So I had to build a basic templating system, yielding the templates that follow. The first is for HTML output--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<title>{#blog.Title} {#docTitle}</title>
<link rel="stylesheet" type="text/css" media="screen, projection"
href="http://www.yafla.com/dforbes/style/css/blog.css"></link>
<script type="text/javascript" src="http://www.haloscan.com/load/dforbes"> </script>
</head>
<body>
<div class="clsHeader">
<div class="clsBlogHeader"><a href="{#blog.BaseUrl}">{#blog.Title}</a></div>
<div class="clsSubheader">{#blog.Description}</div>
</div>
<div class="clsBody">
{foreach $entry in $entries}
<div class="clsEntry">
<div class="clsDate">{#entry.EntryContent.PublishDateUTC|dddd, MMMM dd yyyy}</div>
<div class="clsTitle"><a href="{#entry.Permalink}">{#entry.EntryContent.EntryTitle}</a></div>
<div class="clsBody">{#entry.EntryContent.EntryContent}</div>
<div class="clsKeywords">{foreach $keyword in $entry.EntryKeywords}
<a href="{#blog.BaseUrl}{#keyword.KeywordText|escape}">{#keyword.KeywordText}</a> {/foreach}
</div>
<div class="clsPermalink">
<a href="javascript:HaloScan('{#entry.MappingId}');" target="_self">
<script type="text/javascript">postCount('{#entry.MappingId}'); </script></a>
<a href="{#entry.Permalink}">permalink</a>
</div>
{foreach $relatedentry in $entry.RelatedEntries}
{ifcond $LoopFirst = "True"}
<center>
<div class="clsRelatedEntries">
Related Entries
{/ifcond}
<div class="clsRelatedEntry">
<a href="{#relatedentry.Permalink}">{#relatedentry.EntryContent.EntryTitle}</a>
</div>
{ifcond $LoopLast = "True"}
</div>
</center>
{/ifcond}
{/foreach}
</div>
{/foreach}
<div class="clsAdBlock">
{#adBlockHorizontal}
</div>
<div class="clsNavigator">
<span class="clsNavigateEarlier">{#moveEarlier}</span><span class="clsNavigateLater">{#moveLater}
</span>
</div>
</div>
<br/>
<div class="clsAttribution">
<a href="mailto:{#entry.EntryContent.ContentAuthor.EmailAddress}">
{#entry.EntryContent.ContentAuthor.Name}
</a> -
{#entry.EntryContent.ContentAuthor.Description}
</div>
</body>
</html>
The next template is for RSS consumers--
<rss version="2.0">
<channel>
<title>{#blog.Title|escape}</title>
<link>{#blog.BaseUrl}</link>
<description>{#blog.Description|escape}</description>
<lastBuildDate>{#buildDate|r}</lastBuildDate>
<language>en-us</language>
{foreach $entry in $entries}
<item>
<title>{#entry.EntryContent.EntryTitle|escape}</title>
<link>{#entry.Permalink}</link>
<guid>{#entry.Permalink}</guid>
<pubDate>{#entry.EntryContent.PublishDateUTC|r}</pubDate>
<description><![CDATA[{#entry.EntryContent.EntryContent}]]></description>
</item>
{/foreach}
</channel>
</rss>
All in all, I think it works pretty well, and I can successfully run the W3C validations on the vast majority of generated pages and get the comforting green checkmark.
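For readers curious how such a template might be evaluated, here is a minimal sketch in Python. It is not the actual .NET implementation. It handles only {#dotted.path} placeholders, an escape modifier (assumed here to mean HTML/XML escaping) and nested {foreach} loops; date format modifiers such as |r are silently ignored, and {ifcond} is not handled at all. All function names are mine.

```python
import html
import re

# One pattern for the three tags this sketch understands.
TOKEN = re.compile(
    r"\{foreach \$(\w+) in \$([\w.]+)\}"    # groups 1, 2: loop variable, collection path
    r"|\{/foreach\}"
    r"|\{#([\w.]+)(?:\|([^}]*))?\}"         # groups 3, 4: value path, optional modifier
)


def lookup(scope, path):
    """Resolve a dotted path such as 'entry.Title' against dicts or objects."""
    head, *rest = path.split(".")
    value = scope[head]
    for part in rest:
        value = value[part] if isinstance(value, dict) else getattr(value, part)
    return value


def _matching_close(template, pos):
    """Find the {/foreach} that closes the loop whose body starts at pos."""
    depth = 1
    for m in TOKEN.finditer(template, pos):
        if m.group(1):
            depth += 1
        elif m.group(0) == "{/foreach}":
            depth -= 1
            if depth == 0:
                return m
    raise ValueError("unclosed {foreach}")


def render(template, scope):
    """Expand placeholders and loops in template using the values in scope."""
    out, pos = [], 0
    while True:
        m = TOKEN.search(template, pos)
        if m is None:
            out.append(template[pos:])
            return "".join(out)
        out.append(template[pos:m.start()])
        if m.group(1):
            # Render the loop body once per item, with the loop variable bound.
            close = _matching_close(template, m.end())
            body = template[m.end():close.start()]
            for item in lookup(scope, m.group(2)):
                out.append(render(body, {**scope, m.group(1): item}))
            pos = close.end()
        elif m.group(3):
            value = str(lookup(scope, m.group(3)))
            if m.group(4) == "escape":
                value = html.escape(value)
            out.append(value)
            pos = m.end()
        else:
            raise ValueError("unexpected {/foreach}")
```

Calling render("<title>{#blog.Title|escape}</title>", {"blog": {"Title": "A & B"}}) returns the title with the ampersand escaped. Slicing out each loop body and rendering it recursively keeps nesting simple, at the cost of re-scanning the body for every item.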
My original foray into the land of blogging was delayed while I stumbled towards the goal of building my own blogging software: like many software developers, I have a sometimes irrational desire to build it myself rather than admit “defeat” and use one of the many (and in the realm of blogging, there are many) available products.
I took a couple of stabs at building it myself originally, but due to another common foible – a tendency to over-engineer (I couldn’t simply write some blog software to post and publish my own thoughts. No…it had to be a full multi-author aggregation and collaboration suite, meaning that weeks went by while I mentally debated the database model for such a machination) – it just never seemed to get finished.
Other priorities always trumped it, and the little time I did allot towards this goal saw me solving absurd edge conditions.
I finally set a deadline for myself, and when I couldn’t find the time to finish anything before my marker (billable hours always came first), I went and bought a copy of Radio Userland and started publishing content the blog way.
That worked well enough for a while, but Radio Userland is a venerable publishing tool that is really showing its age. Authoring to it is a less than pleasant experience – which has been a huge contributor towards the dearth of content (it’s always a bit of a roll of the dice to see which characters it randomly replaces in posts, or which carefully authored HTML blocks it’s decided to mangle) – and simple tasks like cross-linking posts (e.g. a “related posts” sidebar to allow users to easily see follow-ups) were just far too manual to be worth the bother.
It’s also grossly under-featured for users, all the more so now that I have a powerful, fully dedicated server, making the experience of consuming and navigating through the information far less usable than it should be.
So I’ve gone and built my own blogging software, this time quickly bringing it to a sort of beta release.
Given that this is the venue with which I will publicize a ton of changes elsewhere on the site, I really considered this a roadblock on the critical path to the release of other web application functionality elsewhere on yafla.
With some focus, it took only a couple of hours this time, mostly accomplished while putting my toddler son to bed over the weekend. It was so ridiculously quick and easy that I kick myself for not having done it sooner.
I’m extremely pleased about the functionality built out (hey it isn't rocket science, and definitely falls within the realm of "trivial", but there's lots of little "gotchas" with software like this), though most of the kudos go towards .NET 2 and SQL Server 2005: A couple of tools that make short work of what would once have been an enormous task, bringing a robust, secure, high performance web application to a usable stage in less time than it takes to watch the Lord of the Rings trilogy.
You’ll probably notice that – at this moment at least – the HTML version of the blog looks absolutely terrible. That is somewhat by design (or rather an intentional time compromise)…momentarily. I’m working on the template (it’s of course parameterized template driven), and wanted to force myself to follow through by deploying (perhaps prematurely).
So what are the features of the blog software?
Well, firstly I migrated 100% of the old content over (including metadata such as categorization), running it all through Tidy first to try to make it a little more XHTML legitimate. Using an identifier mapping structure, every single link to the legacy content still works (which was important to me: I didn’t want to give link followers the frustrating “We moved everything so have fun trying to find it” 404 experience).
Everything works via URL remapping, and for now I’ve set it to redirect from old links to the new links where possible. E.g. http://www.yafla.com/dforbes/categories/softwareDevelopment/2005/09/28.html redirects to http://www.yafla.com/dforbes/Clean_Code. All new entries will follow that more transparent and obvious structure.
But the URLs aren’t limited to just single documents – All entries in June of 2006 can be accessed via http://www.yafla.com/dforbes/2006/06. Add in a category and you can refine further – http://www.yafla.com/dforbes/2006/06/.NET (or http://www.yafla.com/dforbes/.NET/2006/06. Whatever makes you happy).
Want that in RSS form? http://www.yafla.com/dforbes/2006/06/.NET/rss.xml. Add in the day if you wanted to refine further.
Of course, no longer are entries limited to the archaic “categories”. Now they’re basically keywords, so if you want to see the posts where I’ve abused categories and multi-tagged, take a look at
http://www.yafla.com/dforbes/.NET/SQL/Blogging/SoftwareDevelopment/Personal/IT/
Yikes!
So the tagging will be much more logical now that there aren’t broad categories, and given that anyone can filter content however they want (stick rss.xml on the end and you can get a feed of whatever you want).
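The URL scheme described above could be interpreted along these lines. This is a Python sketch under stated assumptions, not the actual URL-remapping code: the path is taken relative to the blog's base URL, the set of known keywords is supplied by the caller, and anything unrecognized is treated as a single-entry slug such as Clean_Code. The function and field names are hypothetical.

```python
def parse_blog_path(path, known_keywords):
    """Split a blog-relative URL path into date parts, keywords, slug and format."""
    keywords_by_lower = {k.lower(): k for k in known_keywords}
    query = {"year": None, "month": None, "day": None,
             "keywords": [], "slug": None, "format": "html"}

    segments = [s for s in path.split("/") if s]
    # A trailing rss.xml switches the output format and is not a filter.
    if segments and segments[-1].lower() == "rss.xml":
        query["format"] = "rss"
        segments.pop()

    for segment in segments:
        short_number = segment.isdigit() and len(segment) <= 2
        if segment.isdigit() and len(segment) == 4 and query["year"] is None:
            query["year"] = int(segment)
        elif short_number and query["year"] is not None and query["month"] is None:
            query["month"] = int(segment)
        elif short_number and query["month"] is not None and query["day"] is None:
            query["day"] = int(segment)
        elif segment.lower() in keywords_by_lower:
            # Keywords may appear before or after the date segments.
            query["keywords"].append(keywords_by_lower[segment.lower()])
        else:
            query["slug"] = segment
    return query
```

With this sketch, "/2006/06/.NET/rss.xml" and "/.NET/2006/06" resolve to the same year, month and keyword filter, differing only in output format, which matches the "whatever makes you happy" ordering described above.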
There’s also search, though I’m not comfortable enough with the finality of the API to publish anything about that.
Entries now have versioning, given that I want to be more transparent with edits that I make (I’m endlessly doing minor corrections and improving wording, and for those who consider that deceptive there’ll be a little version history to see what changed and when, along with a label of why the change was made). All links are auto-parsed and logged, so every entry has a list of posts that link into it, making for much more elegant self follow-ups without resorting to post-editing some “UPDATE: See also…“ notes into old entries, and without resorting to the ugliness of trackbacks.
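The "list of posts that link into it" idea can be sketched as an inverted index over the links found in each entry. This Python version is illustrative only: it assumes entries are available as a mapping from permalink to HTML body, and that links between entries use the same permalink strings as the keys. The names are mine.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_backlinks(entries):
    """Map each permalink to the sorted permalinks of entries linking to it.

    entries: dict of permalink -> HTML body.
    """
    backlinks = defaultdict(set)
    for source, body in entries.items():
        collector = _LinkCollector()
        collector.feed(body)
        for href in collector.hrefs:
            # Only record links between known entries, and skip self-links.
            if href in entries and href != source:
                backlinks[href].add(source)
    return {target: sorted(sources) for target, sources in backlinks.items()}
```

A follow-up post then shows up under the original automatically, with no "UPDATE: See also" edit needed on the old entry.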
Extensive caching ensures that it’s still sprightly and capable of handling peak loads with no fuss.
Oh, and the system supports many blogs by many authors, including publishing multiple authors into one system…so I still over-engineered, but in the end it was workable and I’m extremely happy with the core structure.
Great things lie ahead.
One of my PCs is a bit of a Frankenstein, having gone through countless small upgrades over the years.
A video card here. Some memory modules there. A replacement primary hard drive here (thank you g4u). A supplementary hard drive there. Half a dozen different CD and then DVD and then Dual-Layer DVD burners.
Every now and then it'd see a larger upgrade that mandated a motherboard replacement alongside a new CPU. Often that would require new memory modules as well. Maybe even a new power supply as connection standards changed.
Motherboard replacements have always been the most disruptive, and it's been interesting to watch as each has negated the need for some add-in or other. First the USB+firewire board got punted, having been replaced by onboard functionality. Then the network card. Then the Soundblaster card. The only true add-in card usually needed nowadays is the video card, and I'm sure it's only a matter of time before the on-board video reaches a credible level of performance, eliminating even that.
I've pursued this piecemeal approach to upgrading primarily because it minimized the software disruption in my life, usually requiring just a quick module swap, some driver updates, and it's up and running again. I actually enjoy the modular, hybrid-PC pursuit, individually scoping out and replacing components with the best bang-per-dollar option available at the time. It's a bit of a hobby.
[Clearly I'm not alone: A local "Tiger Direct" store opened recently in my town, featuring a huge floorspace stocked with esoteric power supplies, mod cases, and other components for DIY builders. I'm surprised that the demand is still there, having thought that the self-builder was an endangered species]
I've been negligent, however. Over the past while this PC had seen little attention. Running on an extremely dated Athlon XP 1800+ (overclocked to equal a 2200+), with a "measly" 1GB of DDR1 RAM and a dated collection of complementary components, it had fallen so far behind the times that it had dropped far off the current CPU charts. While it served its casual gaming task well (the video card is quite contemporary, and given that few games are constrained by the CPU, it held its own), and admirably provided the network storage for photos and videos, its anemic standings were a bit embarrassing. Sure, it didn't need to be decent given the various home and business laptops -- powerful, modern units that saw most of my computing activity -- but I felt like I was letting it down.
So following up the entry from a couple of weeks ago, I finally got around to ordering a new CPU and motherboard on Tuesday, ordering a retail boxed Intel Core 2 Quad Q6600 2.4GHz processor from Direct Canada for the extraordinarily low price of $279.99 CAD. I'd been directed to their site from a search-engine yielded link to "Shopbot.ca", so I was a bit wary placing my order with this unfamiliar provider, but at 1pm the next day the box arrived at my door, amazingly delivered less than 24 hours after I ordered, coming from a shop 3000km away. I'm very satisfied with the price and speed. (I received no considerations for that comment, and know nothing about the shop beyond the fact that they sold me a killer piece of hardware at a great price, delivering it very quickly. Your mileage may vary.)
Along the way I discovered that some new memory modules would be needed to get the full speed out of it (I went with 2GB, in deference to the oft-claimed speed advantage that frequently flies in complete contradiction to actual memory usage metering). Oh, and a new case, as it might make the whole process a little easier.
In the end, the only legacy pieces that made the migration to the "upgraded" box were the hard drives and the video card.
I booted up.
Minutes later the full-retail copy of Windows was running the right drivers, and after a quick re-activation it was storming along.
In a word (and a punctuation) - Wow!
What a tremendous amount of computational power on the cheap. Day to day activity really feels no different than it did before -- browsing is the same fast browsing that it was before, and given that I don't try to use Excel as a warehousing database, Office seems the same as well. Battlefield 2 plays the same given that I have the same video card, albeit now with absolutely zero stutters or hiccups as other threads demanding timeslices are generally satisfied by one of the other cores.
For the things that actually keep me waiting -- encoding a home video from the MiniDV, or building Firefox from CVS, as I do regularly -- the improvement is enormous. Not only are these operations massively sped up by the four cores available to them, better still I can configure them to only use one, two, or three threads of parallel execution (via the -j build option for Firefox, for instance), constraining them as a coarse fix for the deficiencies of the Windows scheduler. I can now run a full Firefox 3 build in just 12 minutes with full parallelism, or run it (or other demanding applications) with little or no impact on the usability and functionality of this PC for other tasks.
The build continued to speed up with more possible parallel operations, albeit with a decreased rate of return, with the fastest test build occurring in just over 12 minutes with the highest option tested: -j12. Having more parallel operations than cores can yield benefits when it increases the time utilization of a saturated resource, which in this case was the hard drive. At this point the cores were left twiddling their thumbs waiting for the storage to catch up.
Limiting the build process to two cores via the process CPU affinity had it CPU starved beyond -j2, yielding no benefit via more parallelism.
You can find a stacked graph detailing core processor usage for the above -j4 run (on 4 cores) at http://www.yafla.com/dforbes/images/Firefox_build_j4_4core.png. You can also look at a chart of building Firefox using the -j4 option, but setting the processor affinity to only allow the build access to two cores.
Not only is the build performance fantastic, but better still I can throttle it back to only run at most two parallel operations (-j2), getting a build in a still impressive 17 minutes while leaving two cores completely available for other tasks, like browsing the web with full responsiveness. I can even launch Battlefield 2, and remarkably it plays flawlessly...despite the fact that a full-scale, parallel build is going on in the background.
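As a back-of-envelope reading of those two timings, one can fit the simple model T(j) = fixed + parallel / j to them. This is a rough Python sketch under loud assumptions: it treats the 12-minute figure as the four-way result, the 17-minute figure as the two-way result, and pretends everything that does not scale (disk waits, serial link steps) is one constant. Only the 17 and 12 come from the measurements above; the rest is inference.

```python
def fit_fixed_and_parallel(t_a, j_a, t_b, j_b):
    """Solve T(j) = fixed + parallel / j from two (minutes, parallelism) samples."""
    parallel = (t_a - t_b) / (1.0 / j_a - 1.0 / j_b)
    fixed = t_a - parallel / j_a
    return fixed, parallel


def predicted_minutes(fixed, parallel, j):
    """Build time the fitted model predicts at parallelism j."""
    return fixed + parallel / j


if __name__ == "__main__":
    fixed, parallel = fit_fixed_and_parallel(17.0, 2, 12.0, 4)
    print("non-scaling minutes:", fixed)
    print("scalable minutes:", parallel)
    print("estimated single-threaded build:", predicted_minutes(fixed, parallel, 1))
```

Under those assumptions roughly 7 minutes of the build does not shrink with more parallelism and a single-threaded build would take about 27 minutes, which is consistent with the diminishing returns seen beyond -j4.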
(Sidenote: Threads can still be left stalled, stranded waiting for a shared resource like the limited memory bandwidth and I/O paths, for instance. In the sample above my build was on a second hard drive -- a configuration that I recommend for all power users -- and clearly the other shared resources didn't impact the game to a perceptible degree)
What a revolution in computer usage. What a discount-priced computational powerhouse.
I'm going through the process of upgrading some Infragistics NetAdvantage 2007 v1 components to 2007 v2, one step in the upgrade process being the uninstallation of v1. The uninstaller has now been running for some 65 minutes, saturating both the hard drive and the CPU during the entirety of that time.
What possible explanation is there for this? Remove some registrations, delete some files and directories. Done. Where's the big complexity?
"But it's doing complex things!" a friend of MSIEXEC might retort (this is hardly the first time I've encountered outrageous installer times). Like what? Calculating the next Mersenne Prime?
In the time that it has run it could have read and written my entire hard drive several times over, and from a computational perspective it has now processed trillions of CPU operations. Trillions.
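The "trillions" figure is easy to sanity-check. The sketch below assumes a sustained rate of one billion operations per second, which is a deliberately conservative round number of mine rather than a measurement of the machine in question.

```python
def total_operations(minutes, ops_per_second):
    """Operations executed over a period at a constant assumed rate."""
    return minutes * 60 * ops_per_second


if __name__ == "__main__":
    # 65 minutes comes from the text; 1e9 ops/second is an assumed rate.
    ops = total_operations(65, 1_000_000_000)
    print(f"{ops:,} operations")
```

Sixty-five minutes at that rate is 3.9 trillion operations, so the claim holds even with a pessimistic estimate of the CPU.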
Given the basic metrics, there is simply no rational explanation beyond absolutely mind-boggling inefficiency. Par for the course, unfortunately.
yafla has moved to some new, dedicated hardware, opening up some tremendous possibilities.
Some very exciting changes are afoot!