On Nov. 8 of last year, UAA’s IT service temporarily failed. This led to the failure of UAA’s website, employee email, WiFi services, Blackboard and more. Both students and teachers were dismayed while the system was down, as they could neither deliver important messages nor receive them.
The outage mostly affected students’ ability to receive and complete class assignments, but it also had a major effect on teachers’ ability to conduct classes.
The IT outage occurred after a piece of computer equipment, an Enterprise Virtual Array (EVA), crashed. The EVA stores electronic information for IT services including, but not limited to, Blackboard and employee email. UAA’s EVA holds 96 terabytes of information, which is about 48,000 times the amount an average computer contains. The unit is about the size of a refrigerator and contains 48 disks, each of which stores part of that information.
Every few years, one of the disks fails. Normally when a disk fails, it can be replaced without any delay or outage of the system. Starting in mid-October, IT noticed an increase in disk failures. IT contacted Hewlett-Packard (HP), which sells the EVAs, and was told that about six other similar cases had been reported and that a software update could solve the problem.
After IT installed the software, the failures continued. There were six other small failures before the big outage on Nov. 8. At approximately midnight on the eighth, a disk failed. That failure was in the 24-hour process of being fixed when another disk failed, creating, as Mike Driscoll described in the IT Service Disruption Review Committee Results, the “perfect storm.” After both disks failed, the EVA went into a self-preservation mode and shut down, resulting in the full-on outage.
A local engineer came in, and parts of the system were running again at around 4:30 a.m. on the ninth, only to fail again. Parts of the system, including the Content Management System (CMS), were finally up and running by 9:00 a.m. that day. This allowed important information to be posted on the university’s website, and some employees said that their email was working too.
Blackboard’s restoration failed on Nov. 10 due to some corrupt files, resulting in a longer down period.
IT has come up with several ways to prevent a severe outage like this from happening in the future, including “Reboot EVA after every firmware upgrade as Standard Operating Procedure.” Also, according to the review, it is important to “prioritize the restoration process with an emphasis on communications and those areas with a high degree of impact.” The restoration downtime might have been reduced to less than two days if IT had known then what it knows now.
The IT Service Disruption Review Committee held a poll, asking students, faculty and staff what affected them the most about the outage. Of the 751 students who took the poll, 88 percent said that the outage affected them most in the completion of assignments. Of the 240 faculty polled, 64 percent said that it affected their ability to teach. Of all 1,246 people polled, 91 percent were affected by Blackboard being down. Eighty-three percent of staff members were affected by the problem. According to the IT Service Disruption Review Committee Results, “departments that rely on e-mail for daily business had the greatest issues.”