Sunday, February 3, 2013

Work-flow Diagram for Data Centre Relocation

I wrote here about the work-flow for planning and executing the move of a group of one or more servers from one data centre to another. Here's the picture:

Work-flow for a Data Centre Relocation

I've relocated a couple of data centres, and I've just started working on another. The last one moved over 600 servers, about half physical and half virtual. We moved over five months, counting from when the first production workload went live in the new data centre. Our team consisted of five PMs working directly with the server, network and storage admins, and application support teams.

[Update: Check out the visual representation of this post here.]

We knew we had a lot of work to do in a short time, and we were working in a diverse and dynamic environment that was changing as we tried to move it. We needed a flexible and efficient way to move the data centre. One thing that really helped was a work-flow for the PMs to work with the various technical and user teams that allowed teams to focus on doing what they needed to do.

Early in the project we collected all the inventory information we could to build up a list of all the servers, whether they were physical or virtual, make and model, O/S, etc., and put it in the Master Device List (MDL). We then did a high-level breakdown into work packets or affinity groups in consultation with the application support folks. These work packets were what we doled out to the individual PMs.

Each PM then began the detailed planning process for the work packet. Starting from a template, the PM began building the relocation plan, which was simply a spreadsheet with a few tabs:
  • One tab was the plan itself, a minute-by-minute description of the tasks that had to be done in the period immediately around the relocation, and who was responsible for doing them. Many plans also included the prerequisite tasks in the days preceding the relocation.
  • Another tab was the list of servers, and the method by which they would be moved. We had a number of possible move methods, but basically they boiled down to three: virtual-to-virtual (copying a virtual machine across the network), lift and shift (physically moving a server), and leap frog (copying the image from a physical server across the network to another, identical physical server).
  • The third tab was a list of contact information for everyone mentioned in the plan, along with the approvers for the hand-over to production, escalation points, and any other key stakeholders.
At this point many PMs also nailed down a tentative relocation date and time for the work packet and put it in the relocation calendar, a shared calendar in Exchange. The relocation calendar was the official source of truth for the timing of relocations. Some PMs preferred to wait until they had more information. My personal preference is to nail down the date early, as you have more choice about when to move.

The PM then got the various admins to gather or confirm the key information for the server build sheet and the server IP list.

The server build sheet contained all the information needed to build the new server in the new data centre. For a virtual machine, this was basically the number and size of mounted storage volumes including the server image itself. This information was key for planning the timing of the relocation, and in the case of VMs with extra attached storage volumes, made sure that everything got moved.

For physical servers the build sheet had everything needed for a VM, plus all the typical physical server information needed by the Facilities team to assign an available rack location and to rack and connect the server in the new data centre.

The server IP list simply listed all the current IPs used by the server, and their purpose. Most of our servers had one connection each to two separate redundant networks for normal data traffic, along with another connection to the backup network, and finally a fourth connection to the out-of-band management network ("lights-out operation" card on the server). Some servers had more, e.g. for connections to a DMZ or ganging two connections to provide more throughput.

The PM iterated through these documents with the admins and support staff until they were ready. One thing that often changed over the course of planning was the list of servers included in the work packet. Detailed analysis often discovered dependencies that brought more servers into the work packet. Or the volume of work proved to be too much to do in the available maintenance window and the work packet had to be split into two. Or the move method turned out to be inappropriate. We encouraged this, as our goal of minimizing or eliminating downtime and risk was paramount.

When the plan was done the Facilities team took the server build sheet and arranged for the physical move and connection of servers. The Network team took the server IP list and used it to assign the new IPs, and prepare the required network configuration and firewall rules.

The network admins put the new IPs into the same server IP list sheet, which was available to everyone, so for example the server admins could assign the new IPs at the time of the relocation.

At the time of the relocation, everyone did their tasks according to the relocation plan, and the PM coordinated everything. For simple single server, single application relocations, the team typically moved and tested the server without intervention from the PM.

Finally, the Backup and Monitoring teams used the server list in the relocation plan to turn backups and monitoring off for the relocated servers at the old data centre, and to turn backups and monitoring on for the relocated servers at the new data centre.

It wasn't all roses. We had a few challenges.

We set a deadline for the PMs to have the server build sheets and server IP lists completed two weeks before the relocation, to give time for the Facilities team to plan transport and workloads for the server room staff, and for the Network team to check all the firewall rules and ensure that the new configuration files were right. We often missed that deadline, and were saved by great people in the Facilities and Network teams, but not without a lot of stress to them.

There was some duplication of information across the documents, and it could be tedious to update. As an old programmer, I had to stop myself several times from running off and building a little application in Ruby on Rails to manage the process. But we were a relocation project, not a software development project, so we sucked it up and just worked with the tools we had.

In summary, we had a repeatable, efficient work-flow that still allowed us to accommodate the unique aspects of each system we were moving. We needed five key documents:
  • Master device list (MDL), a single spreadsheet for the whole project
  • Relocation calendar, a single shared calendar in Exchange
  • Relocation plan, per work packet
  • Server build sheet, per server, or per work packet with a tab per server
  • Server IP list, a single document for the whole project (which grew as we went)
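Because the five documents overlap, a small cross-check could catch a server that appears in a relocation plan but has no build sheet or IP list entry. The following is only a sketch of that idea: the project itself used spreadsheets and a shared calendar, not code, and every name here (WorkPacket, find_gaps, the sample servers and addresses) is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkPacket:
    """The server tab of one relocation plan (hypothetical structure)."""
    name: str
    servers: list = field(default_factory=list)


def find_gaps(packet, mdl, build_sheets, ip_list):
    """Return {server: [documents it is missing from]} for one work packet."""
    gaps = {}
    for server in packet.servers:
        missing = []
        if server not in mdl:
            missing.append("master device list")
        if server not in build_sheets:
            missing.append("server build sheet")
        if server not in ip_list:
            missing.append("server IP list")
        if missing:
            gaps[server] = missing
    return gaps


# Hypothetical sample data standing in for the spreadsheets.
mdl = {"app01", "db01", "web01"}
build_sheets = {"app01", "db01"}
ip_list = {"app01": ["10.0.0.5"], "db01": ["10.0.0.6"]}
packet = WorkPacket("payroll", ["app01", "db01", "web01"])

print(find_gaps(packet, mdl, build_sheets, ip_list))
```

Run against the sample data, this reports that web01 still needs a build sheet and an IP list entry, which is the kind of gap we wanted closed two weeks before a relocation.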
The PMs were working with various teams that knew how to do, and were very efficient at, certain repeatable tasks:
  • Communicating outages to the user base (Communication Lead)
  • Moving a physical server and connecting it in the new data centre, or installing a new server as a target for an electronic relocation of a physical server (Facilities team)
  • Moving a virtual machine or a physical machine image, and its associated storage (Server and Storage team)
  • Reconfiguring the network and firewall for the relocated servers, including DNS changes (Network team, although for simple moves the server admin often did the DNS changes)
  • Acceptance testing (Test Lead who organized testing)
  • Changing backups and monitoring (Backup team and Monitoring team)

Saturday, December 29, 2012

Fixing a Crash in Team Fortress 2

My son wanted Team Fortress 2 for Christmas. So far we've been mostly blessed with not having to feed a relentless video game appetite (aside from Minecraft). But I looked into it, and the game was free, with a very recently released Linux version. So I thought, "what the heck. It would probably only take me a few hours of fooling around to make it work."

Well, it was more than a few hours, but mostly because of my insistence on doing things "right".

TF2 is an interesting game. It runs in an environment, or framework, or something, called Steam. Steam supports many other games. In fact, it appears to be a whole ecosystem of games and communities around the games. There's a .deb to install Steam on Debian-derived Linuxes, and that's the first thing I installed.

I followed the Ubuntu forum for the installation, specifically using the experimental nVidia driver. I have a 9300, which is less than the forum says I need (9600 and above). Using the experimental driver allowed me to get Steam to run.

You have to sign up to the Steam community to use it. You can do so in the game.
 
To install TF2, I started Steam and found it in the on-line store. It's a long download. I think it took five or six hours on my reasonably fast ADSL. (I usually get 250-300 KB/s.)

Finally, I could run the game under my user.

Here's where my insistence on doing it "right" first caused issues. My son has his own Linux user on his computer, which is not the user that installed Linux. His user was created as an ordinary non-admin user. My son doesn't have any special privileges on his computer, which is fine for me at his age. I don't want him to be able to mess up the configuration of his computer.

TF2 gets installed under the user's home directory, so I had to download again for my son. (You could probably just copy the appropriate directory or directories from one user to the other, but that would make the problem of getting the game running even harder if it didn't work the first time, which it didn't.)

Trying to run the game from my son's user name caused some disk activity and a few progress dialogues to appear, but then I'd just end up staring at the Steam home page after a few minutes. Running Steam from the command line allowed me to see all sorts of output, including the report of a "Segmentation fault" at the time the disk activity stopped.

Many hours of thrashing about and googling followed. Finally, it dawned on me that the only real difference between the users (mine and my son's) had to be the groups that they were in. (The Linux security model allocates some privileges to "groups" rather than directly to users. You then assign the user to a group to allow them the privileges of the group.)
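I found the culprit by trial and error, but listing the groups one user has and the other lacks would have narrowed things down faster. Here's a minimal sketch of that comparison in Python, using the standard pwd and grp modules. The function names are my own invention, and it assumes both users are in the local account databases.

```python
import grp
import pwd
import sys


def groups_of(username):
    """Return the set of group names a local user belongs to."""
    primary_gid = pwd.getpwnam(username).pw_gid
    names = {grp.getgrgid(primary_gid).gr_name}
    # Supplementary groups list the user in their member field.
    names.update(g.gr_name for g in grp.getgrall() if username in g.gr_mem)
    return names


def missing_groups(working_groups, broken_groups):
    """Groups the working user has that the other user lacks, sorted."""
    return sorted(set(working_groups) - set(broken_groups))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python group_diff.py <working-user> <other-user>
    working, other = sys.argv[1], sys.argv[2]
    print(missing_groups(groups_of(working), groups_of(other)))
```

Each group in the output is a candidate to try. The user who installed Linux is usually in several admin-type groups that a child's account shouldn't get, so it still takes some judgment to pick which ones to test.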

Some trial and error fairly quickly determined that the user running TF2 has to be in the "sambashare" group. I logged in as me, the user who installed Linux. Then, in a Terminal, I could have typed:

sudo adduser user sambashare

However, I got intrigued that I couldn't find the GUI to manage users and groups. I discovered that it doesn't come installed by default on Linux Mint 13. So I installed the Gnome system tools:

sudo apt-get install gnome-system-tools

With the Gnome system tools installed, I:
  1. Went to Menu -> Administration -> Users and Groups
  2. Selected my son's user name
  3. Clicked "Advanced Settings"
  4. Entered my password
  5. Clicked the "User Privileges" tab
  6. Checked the box beside "Share files with the local network"
  7. Clicked OK all the way out again.

Note that I did all the above as myself, the user who installed Linux, not as my son.

Now I logged out of my session and logged in as my son and TF2 ran. Woo hoo!

Note that the Linux version of Steam and/or TF2 is very new right now (end of December, 2012). I found a lot of info on the net was no longer applicable, because of the evolution of the game and the platform. Even the contents of the Ubuntu forum for Steam changed drastically in the few days that I was working off and on to get the game running.

Off topic, but of interest to my geek friends: Here's a blog post about how the Steam effort is contributing to better graphics support in the Linux world.

Sunday, September 30, 2012

Cinnamon Performance -- It was Chrome's Fault

I've written lately about my struggles with sluggish Ubuntu and Mint desktops. Finally, I discovered that Chrome was the problem. At one point in my ramblings, I recommended using Mate instead of Cinnamon. Well, I'm happy to report that my slow Dell Vostro 1440 runs Cinnamon just fine, as long as I'm not running Chrome.

Long Fat Networks

Long fat networks are high bandwidth, high latency networks. "High latency" is relative, meaning high latency compared to a LAN.

I ran into the LFN phenomenon on my last data centre relocation. We moved the data centre from head office to a site 400 km away, for a round trip latency of 6 ms. We had a 1 Gbps link. We struggled to get a few hundred Mbps out of large file transfers, and one application had to be kept back at head office because it transferred large files back and forth between the client machines at head office and its servers in the data centre.

I learned that you can calculate how much data has to be in flight to keep such a network full. The calculation is called the "bandwidth delay product" (BDP), and it's the bandwidth times the round trip latency. One way to interpret the BDP is as the maximum useful window size for sending data: beyond it you'll see no performance improvement, and below it you can't fill the link.

For our 1 Gbps network with 6 ms latency, the BDP was 750 KB. Most TCP stacks in the Linux world implement TCP window scaling (RFC 1323) and would quickly auto-tune to send and receive 750 KB at a time (if there was enough memory available on both sides for such a send and receive buffer).
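For anyone who wants to check the arithmetic or plug in their own link, here's the calculation as a few lines of Python. The function name is mine, and it takes the bandwidth in bits per second and the round trip latency in milliseconds.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth delay product in bytes: bits in flight over one round trip, divided by 8."""
    bits_in_flight = bandwidth_bps * rtt_ms / 1000
    return bits_in_flight / 8


# Our link: 1 Gbps with a 6 ms round trip.
print(bdp_bytes(1_000_000_000, 6))  # 750000.0 bytes, i.e. 750 KB
```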

The SMB 1.0 protocol, used by almost anything you would be doing on pre-Vista Windows, is limited to 64 KB blocks. This is far from optimal for an LFN. Vista and later versions of Windows use SMB 2.0, which can use larger block sizes when talking to each other. Samba 3.6 is the first version of Samba to support SMB 2.0.
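To see why 64 KB hurts, turn the calculation around: if only one block can be outstanding per round trip, the block size divided by the round trip time is a ceiling on throughput. The sketch below makes that simplifying assumption (real clients may keep more than one request in flight, so treat it as a rough per-stream bound, not a measurement), and the function name is mine.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Throughput ceiling in Mbps when at most one window is sent per round trip."""
    bits_per_second = window_bytes * 8 / (rtt_ms / 1000)
    return bits_per_second / 1e6


# A 64 KB block per 6 ms round trip, versus a window the size of the BDP.
print(round(max_throughput_mbps(64 * 1024, 6), 1))
print(round(max_throughput_mbps(750_000, 6), 1))
```

Under that assumption a 64 KB block over a 6 ms round trip tops out a little under 90 Mbps, no matter how fast the link is, while a 750 KB window can fill the full 1 Gbps.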

We were a typical corporate network in late 2011 (read: one with lots of Windows machines), and those machines were likely to suffer the effects of an LFN.

Note that there's not much you can do about it if both your source and destination machines can't do large window sizes. The key factor is the latency, and the latency depends on the speed of light. You can't speed that up.

We had all sorts of fancy WAN acceleration technology, and we couldn't get it to help. In fact, it made things worse in some situations. We never could explain why it was actually worse. Compression might help in some cases, if it gets you more bytes passing through the window size you have, but it depends on how compressible your data is.

(Sidebar: If you're calculating latency because you can't yet measure it, remember that the speed of light in fibre is only about 60 percent of the speed of light in a vacuum, 3 × 10^8 m/s.)
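Here's the sidebar's estimate as a short Python sketch. The function name is mine, and the velocity factor is a parameter that defaults to the 60 percent figure above (values nearer two-thirds are also commonly quoted for fibre). It gives a lower bound only: the fibre route is usually longer than the straight-line distance, and network equipment adds delay.

```python
C_VACUUM_M_PER_S = 3e8  # speed of light in a vacuum


def round_trip_ms(distance_km, velocity_factor=0.6):
    """Estimated round trip time in ms over fibre, from propagation delay only."""
    one_way_s = distance_km * 1000 / (C_VACUUM_M_PER_S * velocity_factor)
    return 2 * one_way_s * 1000


# Our 400 km between head office and the new data centre.
print(round(round_trip_ms(400), 2))
```

For 400 km this comes out at about 4.4 ms, against the 6 ms round trip we actually had.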

There are a couple of good posts that give more detail here and here.

Sunday, September 16, 2012

Ubuntu and Mint Very Slow

I've been struggling for some time with poor performance of Ubuntu, and now Mint, on my Dell Vostro 1440. Admittedly it's a cheap laptop, but in this day and age a Linux desktop should run decently on pretty much anything, as long as you're not using a lot of fancy desktop effects.

Running top, I was seeing a lot of wait time. When the performance was really bad, I'd see over 90 percent wait time. Typically I'd be dipping into swap space when performance was bad, but it could be bad without swapping too (I "only" have 2 GB of RAM). I would see this when running only Thunderbird and Chrome, although Chrome had a lot of tabs open.
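The wait figure in top comes from counters the kernel exposes in /proc/stat, so you can also compute it over an interval of your choosing. Here's a rough Python sketch. The function names are mine, and it assumes the usual field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time


def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two 'cpu' lines from /proc/stat."""
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    # Index 4 is iowait in the field order: user nice system idle iowait ...
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line():
    """Return the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open("/proc/stat") as f:
        return f.readline()


def measure(interval_s=5):
    """Sample /proc/stat twice, interval_s seconds apart, and return the iowait percent."""
    before = read_cpu_line()
    time.sleep(interval_s)
    return iowait_percent(before, read_cpu_line())
```

Calling measure() while the desktop is crawling should give a number in the same ballpark as top's wait column.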

I spent many frustrating hours Googling for performance issues on Ubuntu or Mint and didn't find anything really promising.

Finally, last weekend I was dropping off some old computer gear for recycling at our local Free Geek and saw a pretty sweet Dell laptop for sale. I started playing with it, partly to see how it performed. They sell used computers with Ubuntu, and Ubuntu comes with Firefox. Firefox was snappy as all get out, and on a lower powered CPU than mine at home.

So I went home and tried Firefox. It works great. So I started Googling performance problems with Chrome on Linux and got all sorts of hits. This one looks like it's turning into a bit of an omnibus bug report, but has some good info and links to other places.

It looks like one factor is that Google has made its own Flash viewer, since Adobe is no longer supporting new versions of Flash on Linux. Many people report disabling the Google Flash viewer helps, but it didn't work for me.

Others report that it is indeed due to memory usage of Chrome with many tabs. Others report that it has something to do with using hardware graphics rendering, that the hardware is actually slower. Still others report issues with Chrome scanning for devices, and particularly webcams.

My gut says it's a combination of things -- perhaps all of the above are involved, but you only see the performance problem when two or more of the factors coincide.

I haven't found a solution that works for me yet, so I'm somewhat reluctantly using Firefox. It's certainly a lot faster than it was two years ago. However, I miss the combined link and search field in Chrome, amongst other things. It does seem like Firefox has stolen most of Chrome's good ideas, so it's not as hard as I thought it might be to readjust.

Installing Ruby on Rails on Linux Mint 13

It's only a few months after my last post about installing Ruby on Rails, and much has changed. There was an issue with zlib so I had to flail around a bit. The following instructions are what I think I should have done:
  1. Run "sudo apt-get install zlib1g zlib1g-dev" in a terminal window
  2. Install the Ruby Version Manager (rvm) from these instructions
  3. Run "rvm requirements" in a terminal window
  4. Install all the packages the output of "rvm requirements" tells you to install (apt-get install...). You must do this before you start installing any rubies with rvm. If you don't, you may have all sorts of problems crop up later, like weird messages from irb ("Readline was unable to be required, if you need completion or history install readline then reinstall the ruby.")
  5. Do the following in a terminal window:
rvm install 1.9.3-p194
rvm --default use 1.9.3-p194
gem install rails 
sudo apt-get install sqlite 
sudo apt-get install libsqlite3-dev libmysqlclient-dev
sudo apt-get install nodejs 

Now create an application to test:

rails new testapp 
cd testapp 
rails server 

Browse to localhost:3000 and you should see the Rails default page.