The Great Crash of 2011

Firstly, welcome back to BlogShares! It has been a while.

tl;dr Version (for those who hate reading long blog posts)

  • The game is back up
  • We recovered all the data
  • We think we fixed all the server move-related bugs
  • A new game (‘BS3’) is coming
  • 4 months free Premium for anyone who was premium when we crashed
  • Please report any bugs/problems in The Forums
For the in-depth crash analysis, please read on.

The Background

BlogShares is a very large, very complex game. At the front – the bit the players see – it all seems relatively simple. A few dozen screens are available to the player, and the game-play concept sits within these 30-40 screens you can access via buttons and links.

On the back end, though, there is a database server handling upwards of 1,000 queries per second, holding 50 GB of table data across 300-odd tables and 300 million rows of information. There’s the web server, servicing Player and Non-Player requests, parsing and caching 80,000 lines of game code. And there’s the Spider server handling ‘the java spiders’, which alone can consume anywhere between 5 and 15 Mbit/s of bandwidth.

On or around December 19, 2011, the database server crashed. This was a machine that had been in service since 2007, and it had finally reached the end of its serviceable life. We were placing more and more demands on it with new game tools and code, and it had run out of room to expand. Eventually, on that fateful night, it lost the will to operate any longer and crashed, dead in the water.

After a couple of weeks of no news, I (Lee / The Architect) made an offer to Jay to take over hosting the game. It was clear that he no longer had the time to be able to fix and run BlogShares along with everything else going on at Santa Cruz Tech and it was likely 2011 would be the end of BlogShares for good.

This isn’t the first crash in the 9-year history of the game. It might be the 5th of any great note in terms of ‘time taken to fix’. We are always breaking it subtly when making upgrades or introducing new tools, but I don’t count those!

After a few days, Jay accepted my offer but could not immediately proffer access to the system as he was a 2-hour round trip away from the data center where the game was hosted. There was not even any guarantee that there was any data left to salvage.

Eventually he did find the time to travel to San Jose, and the database server was limped back into life, temporarily. It had a failed drive in the RAID array, and the crash had badly damaged the database.

The Recovery

With the help of Rob / SubWolf, after 3 or 4 days we managed to get the database repaired on the injured server. It was a mix of technical skill, blind luck, and a lot of optimism. We even managed to ‘dump’ it out so we could move all the data off the stricken server. Unfortunately, the complexity of databases means it isn’t as simple as just copying it ‘as is’ from one place to another.

Less than 72 hours later, it appears likely another drive failed in its degraded RAID array, and the server died for good. We were lucky. We got out what we needed.

Rest In Peace, 10.0.0.2.

The BlogShares development environment dates back to 2003. Because it is such a complex beast of interconnected systems, there had always been a reluctance to upgrade any one element of the backend servers (be that software versions, operating systems, and so on) purely because of a mortal fear it would break something and we’d not be able to revert. You have to remember the average hard drive size in 2003 was just 37 gigabytes, and with game data around 50 gigabytes, RAID was a must to be able to hold it all. That meant expense. That also meant backing up was a huge issue, and it was rarely done due to the cost involved in then storing that backup.

(Today, things are very different. The average hard drive size is 750 gigabytes. The new server has 2,000 gigabytes of disk space on it.)
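To put rough numbers on that comparison, here is a minimal sketch in Python. It uses only the sizes quoted above, and the helper name `drives_needed` is purely illustrative. It counts raw capacity only, assuming no RAID redundancy overhead, no filesystem overhead, and no room for growth, so real-world requirements would have been higher.

```python
import math

# Figures taken from the post; everything else here is an illustrative assumption.
GAME_DATA_GB = 50        # approximate size of the game data
AVG_DRIVE_2003_GB = 37   # average drive size in 2003
AVG_DRIVE_NOW_GB = 750   # average drive size at the time of writing
NEW_SERVER_GB = 2000     # disk space on the new server


def drives_needed(data_gb: float, drive_gb: float) -> int:
    """Minimum number of drives whose combined raw capacity holds data_gb.

    Ignores RAID parity/mirroring and filesystem overhead.
    """
    return math.ceil(data_gb / drive_gb)


drives_2003 = drives_needed(GAME_DATA_GB, AVG_DRIVE_2003_GB)
drives_now = drives_needed(GAME_DATA_GB, AVG_DRIVE_NOW_GB)
copies_on_new_server = NEW_SERVER_GB // GAME_DATA_GB

print(f"Drives needed in 2003: {drives_2003}")
print(f"Drives needed today: {drives_now}")
print(f"Full copies that fit on the new server: {copies_on_new_server}")
```

In other words, a single average 2003 drive could not hold the data on its own, while the new server has room for the live data plus many full backups.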

With the old setup as it was, the game couldn’t continue. We had no choice but to move it elsewhere, using the very latest versions of everything. We could only hope it had a “hope in hell” of running. Sure, we encountered issues along the way – some of them very time-consuming to fix or patch or code around – but for the most part, it was a resounding success. As I write, we have overcome every move-related technical issue outstanding on my list.

In the end, we didn’t lose a single chip, a single dollar, or a single blog in the entire recovery effort. In emergency last-ditch-attempt data recovery scenarios we call that “probably unlikely”, and we achieved almost the impossible.

The Now

By the time you read this, the final piece of the puzzle will have fallen into place. The last vestige of the ‘old system’ – the domain name – will have been transferred, and have become active once again. The game will hopefully be blindingly fast – in comparison to how it used to be, at least – and your ‘Board of Directors’ will be visible, active, and here for you, the players.

If you find problems, be sure to point them out in the Forums (which we have also copied over). We’ve tested extensively, but probably haven’t found every combination of play you use, or particular search queries, etc. We’ll try to reproduce the problem you experience and get it fixed as quickly as possible.

Every player that had a Premium Membership at the time of the crash will have their membership extended by 4 months from the date of re-launch. If we have missed yours, drop us a line either in the forums, or on IRC, or in the comments and we’ll get right on it.

Tell your friends. BlogShares lives! And everyone is welcome back. An email to everyone will go out in the coming days with an enticing offer for old players to return as well, now that the game is under new management.

The Future

Work has already begun on what we have termed internally as ‘BS3’, or BlogShares version 3. It’s a bit of a misnomer technically, because it doesn’t involve blogs at all, but something far more exciting, far more tied to the real world, and thus far more useful and fun to make a game out of. It’ll adhere to the core principles of BlogShares but without the blogs part. In their place will be something much, much more dynamic.

In conclusion, I want to thank you all for your patience, and your positivity on the Facebook group during the Great Crash of 2011. It’s likely we’d never have persevered as long as we did without you, because on quite a few occasions the thought did cross our minds that we were going beyond what any reasonably sane person would do to recover data, or fix a seemingly unfixable problem.

Yours,

The Volunteers of BlogShares.

8 thoughts on “The Great Crash of 2011”

  1. Minor niggles remain; we are working our way through them as we speak. Just to reiterate, please please please post any problems you discover in the forums so we can take a look. Thanks!
