The disaster recovery disaster and the file that never was…

This past week was the company’s annual Disaster Recovery exercise, where we simulate our mainframe going into complete system failure, forcing us to restore everything from backup tapes at a secondary site. It is a very important exercise that any big tech company does on a regular basis. It would start on Monday, with the tapes being driven to the secondary site and all the backups loaded onto the secondary mainframe, and by Tuesday afternoon we should have everything set up to process transactions again.

From what I had been told, things usually go according to plan and wrap up before the end of the week. Feeling optimistic (and we all know how that usually ends, don’t we?), I decided to sign up for the 3-to-midnight shift on Tuesday with a couple of my colleagues. One of them had done the exercise a couple of years ago, and the other was a fairly new employee who had been working only a couple of months longer than I had. I thought I was in good hands.

But things wouldn’t be fun if there was no challenge, now would they?

  • Monday:

I came to the office still slightly annoyed that I couldn’t log in on Sunday, only to discover that, as I mentioned in the post before last, things hadn’t gone well on Sunday. They were hopeful that they would be able to fix the issue before Tuesday’s run of the program, but we didn’t share their enthusiasm. More importantly, the Disaster Recovery (DR) status report said they were falling behind schedule after discovering that some of the tapes had seemingly expired…

A foreboding sign of what was to come…

  • Tuesday:

Being the ever punctual person that I am, I woke up as usual at 6:30. Not the best thing to do if you’re trying to stay up late, I might say. I took the opportunity to do a couple of chores before heading to Com Site Two, our backup system location. The name sounds like some kind of military code name, but the actual building takes the cake. It looks like a bunker: it’s on a hill, there’s a ten-foot-tall concrete wall surrounding the place, a field large enough for a helicopter or two to land, and the building has this weird airlock door you have to use. They even have one-way mirror glass at the entrance before the airlock for security to inspect you or whatever.

Anyways, as I entered the main room, everyone was looking busy, but not the good kind of busy. Files were missing, recovery jobs were going down, and we were at least three hours behind schedule.

Now it is important to note something: the exercise assumes that the incident that brought the main system offline happened sometime on Monday morning, meaning that all the system backups would be from Sunday. On Sunday, my ID was expired, which meant that when I tried to log into the DR mainframe, it wouldn’t let me…

After getting my access restored, I didn’t do much, to be honest. Technically, we were supposed to be there to make sure that the batch cycles were running properly and to create Incident Reports for any jobs that went down. The obvious problem was that the batch cycles were not running yet. As we were trying desperately to recover files, a pattern began to emerge: the directory tool that kept track of which file was archived on which tape was out of whack, and some of the tapes had been incorrectly labeled as scratch and overwritten…

While that was going on, I was also CC’d on an email thread about RTD-37, the procedure that I worked on for the install that has been giving us such a hard time. As it turns out, the other team (the one that uploads the input file for our programs) thought they had fixed the problem from Sunday, but they hadn’t. They thought the problem was with the internal layout of the file, while the actual problem was the record length. The data itself was formatted correctly but was uploaded with the wrong record length, causing the records to “wrap around”, as it were. Each record should take up exactly one line of the file, but because the lines were too short, each record spilled over onto the next line, causing everything to misalign into a jumbled mess.
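If you’ve never seen a fixed-length file do this, here’s roughly what I mean. This is just a little Python sketch with made-up account fields and made-up lengths (nothing to do with our actual file layout), showing how re-chopping the same data at a shorter record length makes every record bleed into the next line:

```python
# Toy example: field names, values and lengths are invented for illustration.
RECORD_LEN = 20        # each record is supposed to fill exactly one 20-character line
WRONG_LINE_LEN = 15    # a (hypothetical) shorter length the file gets uploaded with

records = [
    "ACCT0001AMT0000100X ",
    "ACCT0002AMT0000250Y ",
    "ACCT0003AMT0000975Z ",
]
assert all(len(r) == RECORD_LEN for r in records)

# Correct upload: one record per line, every column lines up.
correct = "\n".join(records)

# Bad upload: the same byte stream re-chopped at the wrong length, so each
# record spills onto the next line and the columns no longer align.
stream = "".join(records)
wrapped = "\n".join(stream[i:i + WRONG_LINE_LEN]
                    for i in range(0, len(stream), WRONG_LINE_LEN))

print("Correct layout:\n" + correct)
print("\nWrapped (misaligned) layout:\n" + wrapped)
```

Once the very first line comes out the wrong length, every record after it starts in the wrong column, which is why the whole file turns to mush instead of just one record.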

As we approached midnight, things were looking gloomy…

  • Wednesday:

After a short night of sleep (maybe five and a half hours), it was time to go back to the office. Not surprisingly, the situation with the DR exercise had not improved overnight. They attempted to start one of the processing cycles, only to discover that files were still missing. The client side of the exercise was not going well either, with many of them unable to access our software.

Things were quiet on the RTD front, however, and that got the project manager somewhat worried. As it turns out, the other team seemed to be fairly tight-lipped when asked how much progress they had made. They still seemed optimistic that they could fix the problem before the next day. Just like the three previous days, we ran an empty file.

  • Thursday:

Business as usual: nothing was working as it should for the disaster exercise, and we still were not producing the reports from our procedure. By midday, things were starting to move on the DR front. Someone decided that it was not worth trying to recover everything from the tapes and that we would instead pull directly from the production cycles (something you could not do in a real disaster, because there would be no production cycles running…). Ergo, the exercise was a complete and resounding failure. Or not, depending on who you asked. According to some of the upper management, the problems with the scratch tapes would not have happened because… reasons… and voila! Everything is fine. Even my manager thought those were some serious bullsh*t excuses.

By the end of the day, we also decided on a “fix” for the file upload problems. The other team would process and format the file and, instead of uploading it automatically, they would send it to us and have us manually upload it into the mainframe until they figured out what was wrong with their system…

SSSIIIIIIIGGGGGGHHHHHH………

Seriously though, this program/procedure has been running for the better part of a decade; it can’t be that hard to fix!

When I was in college, I remember reading a post about how being a programmer can be the stuff of nightmares, and the sad truth is that it is. Working on a project for a comp-sci class in college will never prepare you for the reality of the real world. You don’t write programs from scratch; you build on someone else’s work. I can look at each line of a program and understand what it does, but it would be a hopeless task to try to figure out the complex logic, the bigger picture, that drives said program. I am starting to see why so many small tech startups are popping up all over the place. You don’t have to deal with multiple compartmentalized programming teams, you can start from scratch rather than maintain decades-old systems and software, and, probably most importantly, it would be much more satisfying work.

Aahhhh…. Things should go better next week, but I’m not holding my breath too much.
