Keep it lean and keep it clean

When most applications get deployed their databases start off at a moderate/normal/average/boring size, somewhere between 20MB and 5GB is not unusual.  Some large systems such as Oracle’s E-Business Suite and SAP have larger starting points (100+GB) but new applications or web sites normally start off with relatively small databases which hold application config data and sit waiting for user data to begin pouring in post go-live.

These small and agile databases are great for DBAs, they not only require small amounts of disk space (even once t/log and backup dumps are done) but backups are also fast, copies to other servers easy and housekeeping tasks relatively painless.  We also see efficient IO access by SQL Server and indexing isn’t so much of a problem – a pain free life!

Roll forward 2 years and we start seeing these small and agile databases become large and bloated.  Web application databases have stored every quote a client requested for the last 2 years, application audit logs detailing every change go  back 3 years etc etc.  I’ve often heard tales from clients of their 2GB database becoming a 150GB database after a couple of years of “successful” use, but backups now take hours, database copies to dev systems can’t be done in a night anymore and the thought of re-indexing the complete database becoming complex.

The diagram below shows a strategy which some may want to consider when they consider their data retention policies.

Every company now has to keep every piece of data they generate for one reason or another, people just like keeping it in case they need it.  But that doesn’t mean to say it has to be kept in the same physical data files or even database as the operational OLTP data.  Splitting the data out can have operational benefits for the DBA, reducing the amount of data which needs to be backed up every night, the amount of data which needs to be transferred for dev systems to be refreshed as well as the amount of time needed to re-index etc.

There are two strategies to physically splitting these large amounts of data up: data level and storage level.

Splitting the data at a data level can involve archiving off old data into a second database which distances it from the operational OLTP tables.  Application code will need to know how to find this new data but there is now both a logical and physical separation between the active and inactive data, it’s not all stored in a single million row table any more.  The second database can still keep the data in the same table format as the original but benefit from its own indexing strategy and backup schedules.

Splitting the data at a physical level requires functionality which is relatively specific to your database platform.  SQL Server has the concept of file groups which allows you to place specific tables on specific physical files.  Again, some logical re-distribution of data and application re-direction maybe required however you can now physically separate your operation and archived data, but within the same database.  File group level backups again help keep database backups and dev system refreshes swift.  Oracle has similar functionality using tablespaces and data files, the ability to logical dictate where data is physically held.

This strategy may not be suitable for every application but if you built the application you should at least consider it.  Your DBA will love you, backup times remain low, copying databases remains swift and housekeeping remains straightforward.


3 thoughts on “Keep it lean and keep it clean

  1. Matt Whitfield

    Is it worth considering the ‘half way house’? I.e. using filegroups to split historical data from it’s current counterpart. You can split the I/O easily, backups can still be fast by backing up individual filegroups, and you aren’t then putting in cross-database dependencies when you write a query that needs to access both sets of data – something that can make it a real pain to move databases around. Just a thought.

    1. gavinpayneuk Post author

      Hi Matt,

      You’re right that there are internal options available within the actual database platforms to scale out the storage of large amounts of data, I should probably follow up to mention that. The reason I didn’t initially is because the options vary depending on database platform (the client I saw recently was using Oracle for example), although the concept of multiple physical “partitions” is common across all, thanks for the reply.

      1. Matt Whitfield

        Ha – Sorry I read this and, being a bit of a self-confessed SQL Server fan boy, assumed it was talking about SQL Server! Definitely a nice take though, thanks.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s