System Level Website Failures – Technical, Process, and Organization

In BC, we recently had a windstorm that knocked out power for over 700,000 people in the province, some for 4 days.  One of the most difficult parts of the outage was that the BC Hydro website that provides outage updates went down at the same time. This made it very difficult for people without power to decide what to do, where to go, what to do with the food in the refrigerator, etc., and made for many unhappy customers.

Many critical websites are complex systems, and they fail more often than desired.  A good example is the launch of HealthCare.gov, the Obamacare website, which had serious technical problems at rollout; fixing the major issues subsequently took about 6 months. On launch day, the website became unusable as soon as it hit about 2,000 simultaneous users, which was a serious issue because 250,000 simultaneous users tried to get access that first day. HealthCare.gov had many other problems as well. The project had large budget overruns, with $1.7 billion spent, about 10 times more than the budget and what it should have cost. There are also lasting data and security problems with the website and its internal database.


The majority of the root causes of the Healthcare.gov failure were systems-level failures in all three major dimensions of any complex system delivery: technical, process, and organization.

  1. Technical: The system design used an outdated 1990s database server model that doesn’t scale well with many concurrent users, as opposed to a more typical e-commerce server model that can scale with users.
  2. Process: The system development process used a waterfall approach, building most of the website and only then testing it, vs. an agile approach where you test the important parts throughout the development process.  Additionally, there was very little testing during development.  They were even off by a factor of five on the concurrent user requirement.
  3. Organization: The organizational systems of the Government and the Contractor were poor, with too many delays, last-minute changes, poor subcontracting, poor reporting, and poor coordination.
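
To put the launch-day numbers quoted earlier in this post into perspective, here is a rough Python sketch of the kind of capacity check that could be run throughout development. The function and constant names are my own, hypothetical ones, and the figures are simply the ones quoted in this post, not measurements from the project itself.

```python
def capacity_shortfall(peak_users: int, sustained_users: int) -> float:
    """Return how many times peak demand exceeds the load the site can sustain."""
    if sustained_users <= 0:
        raise ValueError("sustained_users must be positive")
    return peak_users / sustained_users


# Figures quoted in this post for launch day (hypothetical constant names).
LAUNCH_CAPACITY = 2_000   # simultaneous users at which the site became unusable
LAUNCH_DEMAND = 250_000   # simultaneous users who tried to get access

shortfall = capacity_shortfall(LAUNCH_DEMAND, LAUNCH_CAPACITY)
print(f"Demand exceeded usable capacity by about {shortfall:.0f}x")
```

A load test that feeds a number like this back to the team early, while there is still time to change the design, is the sort of testing throughout development that the process point above is about.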

BC Hydro is conducting a root cause investigation of their website failure.  Perhaps the root cause was a simple and isolated issue, but when the investigation is done, I am interested to hear whether the failure had systems-level causes similar to those of the HealthCare.gov launch failure. For any complex problem with interrelated technical, process, and organizational dimensions, the Systems Approach is the best way to develop a solution that satisfies the overall needs and meets the expected behaviours of the system.