Before you read on and jump to scream foul, please realize that I write this with some humor, to get at the point that “in certain situations, de-normalization is good!” I am not suggesting that we throw away normalization! Are you kidding me? Might as well go play Russian roulette! However, I think this is an interesting trend among large Web 2.0 companies that are heavy on reads and light on writes. With that said as my preface, read on and enjoy.
I’m just going through the process of designing a database architecture for a Web 2.0 site we will be launching in a few months, and the first thing the database architect said to me at a review was, “Don’t worry, this design you’re about to see is going to rock. We have really normalized the data, it’s hot!” I was happy he said this, but at the same time, it got me thinking about a lot of things. Synapses started firing in my brain as they do at night when I can’t sleep (probably due to the Starbucks I was sipping on).
See, I have never seen eye to eye with database designers on their “Normalization” Zen. When I first started getting into databases and teaching database design and architecture, I always thought to myself, “Why all this normalization?” I know I’m treading into dangerous territory here, but I feel it has to be said. What is dangerous about this topic? I guess the fact that I think we should NOT normalize data, for the most part, in a Web 2.0 world. Inexperienced DBAs will probably find this completely incorrect and question my brain capacity, but I would bet experienced DBAs, after thinking about what I’m saying, would agree with me. I’ll tell you why…
The Argument Pro Normalization
I won’t go into too much detail, because all of you know that a proper database is normalized. My guess is at least 80% of web sites out there are running on normalized relational databases. In the current world of relational databases, we have all been taught that you must normalize your data to ensure data integrity by minimizing duplicate data. By doing this, you ensure that no application can write data in one location without updating another, creating two different records for the same data! This is also supposed to simplify software development by putting all the data integrity logic at the database layer. You can use stored procedures, views, and joins, and have fairly simple applications that rely on the database to keep the data integrity.
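To make that concrete, here is a minimal sketch of the single-source-of-truth idea. The table and column names are made up for illustration, not taken from any real design:

```sql
-- Hypothetical normalized layout: the user's address lives in exactly one
-- place, so it can never disagree with itself.
CREATE TABLE users (
    user_id    INT PRIMARY KEY,
    user_name  VARCHAR(100) NOT NULL
);

CREATE TABLE addresses (
    address_id INT PRIMARY KEY,
    user_id    INT NOT NULL REFERENCES users(user_id),
    street     VARCHAR(200),
    city       VARCHAR(100)
);

-- Update the address once, and every page that joins to it sees the new value.
UPDATE addresses SET city = 'Seattle' WHERE user_id = 42;
```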
The Argument Against Normalization
How many freaking joins do you have to do to get all the data you want? When I first started my career, I could write ASP pages amazingly fast, with simple databases in the back end. As I started learning more and reading up on “normalizing” your data, I started feeling crappy about some of my designs. I didn’t use any joins, didn’t really have any transactions, and had very few tables. But I wrote applications very fast, got results, and never really had any speed issues. I never really had issues with bad data either, but whatever.
I Learned to Normalize
I started writing more “enterprise” applications, and had to do more joins, left joins, outer joins, had to worry about transactions, and had layers upon layers of data access to go through just to add a new customer. I was told that “it’s the database’s job to maintain data integrity.” OK, it all made sense, and I did this. My code became way more complex, my database designs became way more complex, and you couldn’t put bad data into the database. Hot! But it took FOREVER to write the applications to put the data into the database. Not only that, but now, if I needed to scale the application, I couldn’t simply add more web servers! I had to worry about database replication, transactions, and indexes. I soon realized that doing joins took an INSANE amount of processing power.
Normalization Is Slow To Design
Are you still with me? Splitting up your data into small chunks and ensuring that no data was stored twice made writing even the simplest web forms super complicated. Sure, I could call a stored proc, but if it wasn’t written, I had to get the DBA to write it, or write it myself. If I had to get new data in a certain way, I had to create views, which I was told were fast. Not! (OK, they’re getting better… but still.) For one simple form, or one simple display page, you could have an INSANE amount of joins!
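Here is a hedged sketch of what I mean. The tables are hypothetical, but the shape of the query is the point: against a fully normalized schema, every extra lookup table is another join you pay for on every single page view:

```sql
-- One "simple" profile page against a normalized schema (hypothetical tables).
SELECT u.user_name,
       a.city,
       c.country_name,
       p.photo_url,
       s.status_text
FROM users u
JOIN addresses a ON a.user_id = u.user_id
JOIN countries c ON c.country_id = a.country_id
JOIN photos    p ON p.user_id = u.user_id AND p.is_primary = 1
JOIN statuses  s ON s.user_id = u.user_id
WHERE u.user_id = 42;
```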
Normalized Applications Are Slow On A Huge Scale
Why couldn’t I just have a table, with all the data I needed for that page? Then for other pages, I could have the same data duplicated! This would actually take stress away from the JOIN, and just let me get data from either location. When an update was done, say the user updates his address, it would be updated in two locations. Is that all that bad? I don’t think it is. Just add some constraints, and make sure the application layer follows the constraints.
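Roughly, that is the trade I am describing. This is just a sketch with made-up table names, and in real life the dual write would live behind the application’s data layer:

```sql
-- Denormalized, page-shaped tables: the city is intentionally stored twice.
CREATE TABLE profile_page (
    user_id   INT PRIMARY KEY,
    user_name VARCHAR(100),
    city      VARCHAR(100)   -- copy #1
);

CREATE TABLE order_page (
    order_id  INT PRIMARY KEY,
    user_id   INT,
    city      VARCHAR(100)   -- copy #2, duplicated so this page needs no join
);

-- Reads become single-table lookups, no joins:
SELECT user_name, city FROM profile_page WHERE user_id = 42;

-- The price: when the user updates his address, the application layer
-- has to update every copy of it.
UPDATE profile_page SET city = 'Seattle' WHERE user_id = 42;
UPDATE order_page   SET city = 'Seattle' WHERE user_id = 42;
```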
If Web 2.0 is all about sorting data, indexing our knowledge, sharing our knowledge, speed of development, speed in viewing Web 2.0 applications, and scaling your site quickly should your site become the next YouTube, I challenge you to do this successfully with a normalized database full of joins, stored procs, views, and transactions. You might do it, but you will need an army of rocket scientists to do it.
Do whatever works for your environment. In an enterprise environment where you are working with financials and maybe have 1,000 people entering and reading data, by all means, normalize your data. If you are writing an application in the web world, where one day you might have 100 users and a month later 10,000,000 users, you had better have a quick way of scaling! If you have kept your database de-normalized, with just tables and no business logic whatsoever, so that it is basically a storage area, you will be able to quickly scale your web application by adding servers, replicating code across your web farm, and even putting cool appliances like BIG-IP in place to cache data and load balance your web farm.
Good luck doing this in a normalized database. You will be able to, you will just be crying the whole way.
So, What Should We Do?
If you are doing more writing than reading, stick to normalized databases. If you are doing more reading than writing, still normalize… but then denormalize your data! Look at the web forms and pages you will have, and see what data will always be displayed, and if you need 18 tables to display a user’s profile, denormalize the data! Imagine what would happen if you had 10,000,000 users reading profiles and running joins on 18 tables.
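One way to do that, sketched below with made-up names: keep the normalized tables as the system of record, and maintain a flat, read-only table shaped exactly like the profile page. Whether you refresh it with a trigger, a queue, or a batch job is up to your environment:

```sql
-- Hypothetical read-optimized table, rebuilt from the normalized source.
CREATE TABLE profile_read (
    user_id   INT PRIMARY KEY,
    user_name VARCHAR(100),
    city      VARCHAR(100),
    photo_url VARCHAR(500)
);

-- Refresh the flat row whenever the normalized data changes:
INSERT INTO profile_read (user_id, user_name, city, photo_url)
SELECT u.user_id, u.user_name, a.city, p.photo_url
FROM users u
JOIN addresses a ON a.user_id = u.user_id
JOIN photos    p ON p.user_id = u.user_id AND p.is_primary = 1
WHERE u.user_id = 42;

-- The 10,000,000 profile reads now hit one table, not 18:
SELECT user_name, city, photo_url FROM profile_read WHERE user_id = 42;
```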
One of the Reasons for Web 1.0 Failure
I would argue that a big reason Web 1.0 companies failed was that you had an army of people getting money from VCs, hiring armies of expensive DBAs to design complex normalized databases, then getting web guys to create the application layers for these complex databases. Why is this a reason for failure? You had slow-performing sites that didn’t scale. You had tons of bugs, bad performance, and sites and ideas that never even launched because they were “in development” forever! In Web 2.0 you have things like Ruby on Rails, AJAX, and companies like eBay that rely on denormalized databases to scale. You have good-looking, fast-performing, scalable web sites that are built fast, and not on overly normalized databases.
If you have read this far, I should share some sites with you that give me some legs to stand on for my arguments:
- Normalization is for Sissies
Pat Helland – Database Guru from Microsoft, also had a stint at Amazon.
- Data normalization, is it really that good?
Let the comment bashing on denormalization begin!