Sometimes a real-world object is the answer to a virtual-world problem.
We manage data backups for one of our clients. If you have ever done that job for even a small network, you know it can be a challenge. The ideal backup system quietly copies all the important data each night and leaves you alone. But if something goes wrong, it should let you know so you can fix it. Sometimes you have to reset a process, reboot a computer, or (if you acquired so much bad karma in a previous life that you have to use backup tapes in this one) reformat a tape.
The system we’ve built for them has been undergoing continuous improvement for many years, and now runs quite smoothly. With one exception: offsite backups.
The backup server receives new data from all the systems each night. As part of the disaster recovery plan, we periodically write a copy of backup data to an external disk drive which is stored in an offsite vault. We rotate a few of these offsite disks so there is always one offsite with a pretty recent copy of the company data. This protects against various scenarios: fire or other physical damage to the backup server, a natural disaster that disables the data center, etc.
The idea is that a human loads a disk in the external drive dock, later that night the system copies data to that disk, and the next morning the human removes that disk and inserts the next one in sequence. Ideally the person swapping the disk will first check the local web page that shows the status of the offsite disk to make sure it is OK to swap it. Most days, there’s no problem: the copy happened overnight and the system has released the device, so it’s OK to unplug it. But once in a while, something glitches, and the system doesn’t let go of the device. If you remove the device when it’s in that state, it can get corrupted. It doesn’t happen every time, but once or twice a year we would end up with a corrupted disk and have to deal with that. (It’s not a huge hardship, but it takes time to troubleshoot and ultimately re-initialize the offsite disk.)
In a way, the system is a victim of its own success. It works perfectly so many nights in a row that one naturally stops being concerned about the possibility of failure, and eventually it bites you. There’s a great book about this phenomenon called The Logic of Failure which I heartily recommend.
Our solution is a nifty gadget called the ThingM Blink(1) USB RGB LED. It’s a $30 programmable LED light that you plug into the USB port of any computer. Don’t be bothered by the “programmable”: you don’t have to be a programmer to use it. It comes with software for Mac, Windows, and Linux that can tell your Blink(1) to react to different events. It can react to email, changes to local files, and URLs (either local or on the Internet), and it integrates with the If This Then That (IFTTT) service.
One popular use is for email. You plug the USB stick into your laptop, and tell it to flash a specific color when you get certain kinds of email. Messages from the boss can flash red, messages from the family green, and from an important customer maybe blue. You can get more details at the ThingM Blink(1) web site.
If you ARE a programmer, you’ll be pleased to know that the Blink(1) comes with full developer support. The code is all open source and freely available on GitHub. For our project, we used the “blink1-tool” command line program to change the color based on the status of the offsite disk. We chose three possible states:
- Quick Flashing Red – Backup is in progress.
- Slow Pulsing Red – Backup is not running, but the device is mounted and unsafe to remove.
- Solid Green – It is safe to remove the disk.
We plugged the Blink(1) into the backup server, extending it with the provided USB cable, then used Velcro to attach it to the external drive dock. Then we had the backup software change the state of the Blink(1) to match the current state of the backup system. Now there’s no confusion at all: if the light is green, it’s OK to swap the disks. If it’s red, wait for it to finish or have someone look at the system to troubleshoot.
Note that we double-encoded the meaning. Each of the 3 states is indicated by color and a flashing or solid pattern, so even someone who is red/green colorblind can use it.
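For the programmers, here is a minimal sketch of how the three states could be mapped onto blink1-tool calls. It is not our actual backup code. The state names and function names are hypothetical, the timings are arbitrary, and the flags used (`--green`, `--rgb`, `--blink`, `-t`) should be checked against `blink1-tool --help` for your version.

```python
import subprocess

# Hypothetical names for the three states described above.
BACKUP_RUNNING = "backup_running"    # quick flashing red
MOUNTED_IDLE = "mounted_idle"        # slow red, unsafe to remove
SAFE_TO_REMOVE = "safe_to_remove"    # solid green


def disk_state(backup_running, disk_mounted):
    """Collapse two yes/no facts into one of the three states."""
    if backup_running:
        return BACKUP_RUNNING
    if disk_mounted:
        return MOUNTED_IDLE
    return SAFE_TO_REMOVE


def blink1_command(state):
    """Return a blink1-tool argument list for a state.

    Flags and timings are assumptions for illustration only.
    """
    if state == SAFE_TO_REMOVE:
        return ["blink1-tool", "--green"]
    if state == BACKUP_RUNNING:
        # Short delay between blinks gives a quick red flash.
        return ["blink1-tool", "-t", "200", "--rgb=255,0,0", "--blink", "10"]
    if state == MOUNTED_IDLE:
        # Long delay between blinks approximates a slow red pulse.
        return ["blink1-tool", "-t", "1500", "--rgb=255,0,0", "--blink", "3"]
    raise ValueError("unknown state: %r" % (state,))


def show_state(state):
    """Run blink1-tool; raises CalledProcessError if the tool fails."""
    subprocess.run(blink1_command(state), check=True)
```

A backup script might call `show_state(disk_state(backup_running=False, disk_mounted=True))` whenever the status changes. Note that `--blink` runs a fixed number of times, so in this sketch a red state would have to be re-issued periodically to keep the light flashing.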
The documentation from ThingM is quite good. We did not know anything about this device or how it worked before we got it, and it only took three hours from opening the package to having it fully integrated with our custom backup software.
$30 extremely well spent!