System Failure Case Study – Ice Falling from Port Mann Bridge Cables

Background:  The Port Mann Bridge is a ten-lane cable-stayed bridge that spans the Fraser River east of Vancouver, B.C.  At the time of its opening in 2012, it was the widest bridge in the world at 65 m (213 ft).  It is a key artery for Metro Vancouver, with about 100,000 crossings per day.

What Happened:  Shortly after opening, during a winter storm in December 2012, “ice bombs” – clumps of ice and snow, dropped from the overhead cables, hitting and damaging over 300 vehicles.  Windshields were broken, and dents were left in roofs and along the sides of cars.  A few people suffered injuries during the incident.

Proximate Cause:  Insufficient measures were in place to mitigate the hazard of ice and snow falling from the overhead cables, which, in the case of the Port Mann bridge, cross over the roadway.  The contract between Transportation Investment Corporation (TI Corp.), the Crown corporation responsible for the Port Mann, and Kiewit/Flatiron General Partnership, included the requirement that “Cables and structure shall be designed to avoid ice build-up from falling into traffic.”  The bridge’s designer maintained that measures including spacing the cable anchors away from traffic, using central pylons which avoid large cross frames over traffic, and using HDPE stay pipes were sufficient to minimize the risks associated with snow and ice accumulation.

Underlying Issues: There were many parties involved in the issue – the owner/operator, the design-build contractor, the maintenance contractor, and consultants.  An analysis of the documentation and internal emails on the issue (from a freedom of information request done by an independent journalist) shows that none of the parties really “owned” the issue, but rather they assumed that either the risk was negligible or at least sufficiently managed, or that someone else would take care of the risk mitigation.

The issue of falling ice and snow from cable-supported bridges is well known.  Examples can be found with bridges in Denmark, Sweden, Japan, the U.K., and in the U.S., including the Tacoma Narrows Bridge in Washington state.  Methods to manage the risk are also well documented, although in many cases there are few practical solutions beyond traffic closure.

Ice and snow accretion removal systems can be mechanical, thermal, or passive in nature.  In addition, localized monitoring of weather and ice/snow build-up can be used to pre-emptively close individual traffic lanes or the entire bridge when necessary.  Finally, the basic choice of having vertical cable planes instead of inclined cable planes (“fanned” cables suspended over traffic) can reduce the risk.

 Aftermath: B.C.’s transportation minister blamed the design-build contractor, Kiewit/Flatiron, stating that the bridge did not meet the requirements.  The provincial automobile insurance agency, ICBC, paid $400,000 in claims, and lawsuits were filed by people who were injured during the incident.  In response to one of the civil claims, TI Corp. and Kiewit/Flatiron stated “The buildup and subsequent release of ice and snow from the bridge structure was the result of a confluence of extreme environmental conditions, both unforeseen and unforeseeable to the defendants or any of them and was the inevitable result of an Act of God”.

Despite that statement, cable collars were installed in the fall of 2013 to help mitigate the hazard.  These are manually released from the top of the towers and slide down the cable stays to dislodge accumulated snow and ice.  The intent is to do this frequently enough to ensure the dislodged pieces are small enough to avoid damage to vehicles below.

In addition, a weather station and cameras were installed to allow for monitoring of conditions that could lead to ice and snow accumulation on the cables.  Operating procedures were put in place to shut down the bridge if dangerous conditions were detected.

The Port Mann collar system is said to be one of the most successful systems in the world.  However, it is not fool-proof.  Snowfalls during December 2016 have so far resulted in about 50 insurance claims from falling ice and snow.  This may have been a result of the operators not deploying the collars early enough during rapidly-changing weather conditions.

Interestingly, over the same period, 95 claims resulted from the same situation occurring on another Vancouver area bridge, the Alex Fraser, which has cables that do not hang over the bridge deck.  High winds together with rising temperatures can blow ice and snow off the cables into traffic.  The bridge was shut down for several hours on two days, due to “ice bombs”.  Falling ice also occurred in 2005, 2008, and 2012, but overall the Alex Fraser is less prone to these incidents than the Port Mann.  As a temporary measure, the Ministry of Transportation indicated that for the Alex Fraser, it will use a heavy-lift helicopter to blow snow accumulations off the cables.

The experiences on the Port Mann as well the Alex Fraser bridges will be applied to the upcoming bridge to replace the George Massey Tunnel.  In particular, the cable stays will not be allowed to cross over traffic and cable collars or an improved alternative will be required.

 Lessons Learned:

  • The requirements were written in a way that left the criticality and expectations somewhat open to interpretation. For some, the word “avoid” means you must completely ensure no risk of an incident.  For others, avoid means “try” or “use typical means and methods”.  One other section in the listed standards specified “practical” solutions were acceptable.  Yet, unless specifically defined, the word “practical” allows for many different measures of acceptable implementation.  The expectations around acceptable public safety were not met in this case, and best practices around requirements for public safety are typically better defined than what existed for the Port Mann Bridge.     
  • Designers should account for the complete system and its operating environment. The micro-climates of Vancouver are well known, and hazardous when heavy wet snow is mixed with freezing and thawing conditions.
  • Repeated questions raised regarding the risks of falling ice and snow could have resulted in a risk analysis leading to a more effective ice hazard mitigation strategy, rather than simply assuming the original design would be adequate.
  • Public safety issues need to be considered carefully and critically, and receive considerable attention from management.
  • Ultimately, the Owner is most likely to be held responsible for the performance of the implemented design and its impact on third parties (in this example, motorists being hit by “ice bombs”). While many projects have multiple contractors and parties involved in the design, construction, operation and maintenance, the Owner team needs to ensure that there is sufficient capability, staffing, mandate, and expertise held by these parties to be able to ensure quality in requirements definition and in the design, build, verification, validation, and operation and maintenance stages to mitigate issues.

Michael Eiche, P.Eng. Principal, SysEne Consulting

Why Conventional FMEAs fail too often, and why the Absolute Assessment Method FMEA is much better.

(Failure Modes and Effects Analysis)

amagasaki-0

On Oct 1, 2016, a commuter train crashed in New Jersey killing one and injuring 108 with high speed being a factor. The root cause of the crash is under investigation.

A similar crash happened in Amagasaki, Japan in April 2005 where 106 were killed and 562 injured, and high speed around a curve was a factor.  The conventional explanation of the root cause of the Amagasaki crash was corporate pressure on the driver to be on time.  Drivers would face harsh penalties for lateness, including harsh and humiliating “training” programs which included weeding and grass cutting duties.  In this case, the driver was speeding.  The resulting countermeasure in Amagasaki has been to put in an expensive $1-billion-dollar train speed control system on the small line to help mitigate a potential accident.

There have been many other high speed passenger train derailments, such as the Santiago de Compostela derailment in Spain in 2013 (79 dead, 139 injured out of 218 passengers), and the Fiesch derailment in Switzerland in 2010 (1 dead, 42 injured).  The root cause explanation of these accidents tends to focus on the drivers driving faster than they should, and countermeasures tend to focus on semi-automated systems to control train speed.

Do we really know the root cause of these accidents, and are the countermeasures both effective and economic?

One of the best root cause analyses I’ve seen on the Amagasaki crash comes from Unuma Takashiro, and his conclusion is unconventional.  Unuma-san is a Failure Modes and Effects Analysis (FMEA) consultant from Japan. FMEA is one of the best methods to analyze a design to help prevent failures.  FMEA was developed in the aviation and space industries in the 1960’s, adopted by the automotive industry in the 1990’s, and is now prevalent in many industries including health care.

Unuma-san argues that in the case of the Amagasaki crash, the speed control system is expensive and not fail-safe.  One economic and effective countermeasure would be to add a $250,000 guard rail, which at the very least would likely prevent a recurrence, and definitely be useful as an additional layer of countermeasure.  The advantage of low-cost and effective countermeasures is that they can be widely-deployed.

amagasaki-1

He argues the real root cause of this failure is that the overall engineering and management approach to mitigating failures was not adequate – both initially to prevent the accident in the first place, and subsequently after the crash by putting in the speed control system but not (also) the guard rail.

Unuma-san has a very interesting and useful website on FMEA practices, and uses the Amagasaki crash as one of many examples.  He promotes a FMEA method that uses an absolute evaluation method of countermeasures, as compared to the conventional FMEA which uses a relative evaluation method of countermeasures.  The problem with the relative evaluation method is that it can easily miss important failure modes that do not make an arbitrary priority cutoff.  Missing important failure modes often leads to unexpected incidents.

He also analyzes the conventional FMEA approach and teachings, and points out many problems seen in industry:

  • ineffective because of missing failure modes,
  • done too late in the design process, making it more difficult and less likely to implement countermeasures,
  • led by team members from other departments that are not responsible for the design, which both lowers the effectiveness of the analysis and can allow the designer to not be held fully accountable for the FMEA results,
  • doesn’t promote economical countermeasures, and
  • many of the common FMEA teachings contain flaws that promote the above problems.

Unuma-san shows that many FMEAs confuse failure mechanisms (the physical, chemical, thermal, electrical, biological, or other stresses leading to the failure mode) and the actual failure modes (ways a product or process can fail), leading to missing failure modes.  If a failure mode is missed, then there may be no countermeasure identified, and subsequently incorporated into the design.

He points out that the relative evaluation FMEA method promotes doing the FMEA on the entire design when enough of the design is done, then once the FMEA is done to a certain level, all of the issues are prioritized, and then acted upon.  The problem with this approach is that FMEAs take a lot of time, and by the time the results are done, the recommended changes to the design can be too late to be easily implemented.  He promotes instead that the designers do the FMEA as they are doing the design in a very concurrent and “local” manner, while evaluating the countermeasures in an absolute manner against the individual failure mode.  This more easily allows for countermeasures to get into the design of the product or process in the early stages.

When non-designers take too much of the FMEA responsibility and scope, the effectiveness of the FMEA is reduced and the results are available late in the design process.  The effectiveness is reduced because non-designers are unable to know all the key information in the heads of the designers, and the designers may feel less accountable for the FMEA quality.  Results are delayed because instead of countermeasures being considered at the time of the design decision, they are made available after the design decision has been made and it is then more difficult and less likely to have any countermeasure implemented.

Unuma-san’s method is simpler than many FMEAs, by using a four-point scale to the third power (64 ratings), vs. many conventional approaches using of a 10-point scale to the third power (1000 ratings).  He promotes determining countermeasures per failure mode, evaluating the likely success of those countermeasures, and whether there is opportunity for optimization and lower costs from reducing overdesign.

Unuma-san goes on to analyze the common teachings of FMEA by referring to many of the most common reference material available in books, training material, websites, etc. and he shows many flaws, inconsistencies, interpretation issues etc. that tend to exacerbate the above issues.  Much of the trouble with conventional FMEAs can be traced to poor teachings.

Unuma-san has consulted for a very long and impressive list of Japanese companies on FMEA in the transportation, health care, manufacturing, and consumer goods industries.

I’ve been both a lead designer of multiple complex systems, and I’ve been helping clients improve their product development processes, including FMEA.  The teachings of Unuma-san resonate strongly with me.  Too often I have seen poorly done FMEAs that miss critical failure modes, late FMEAs whose recommendations are too late to be useful, and FMEA study teams that don’t have enough participation by the design team.  The absolute evaluation method FMEA is a substantial improvement over the relative evaluation method, mostly because it evaluates the likely success of countermeasures.  I highly recommend his webpage on FMEAs, and it is linked here.  It is a little hard to read as the website translation to English isn’t the best, but worthwhile.

I think one of the reasons why FMEA teachings have many issues is that few FMEA teachers have been skilled design engineers, but are instead people that gravitate to process design.  The idea behind the FMEA is good and includes teaching early and effective analysis, unfortunately much of the applied practice falls short.  A skilled design engineer naturally considers failure modes and tries to design them out, while simultaneously considering many other design tradeoffs, such as performance, function, economics, aesthetics, ergonomics, etc.  I think that since the conventional FMEA trainers developed the applied practice of the FMEA, they have continued to build upon the original process of the relative assessment method, and have struggled to develop effective practices that overcome the conventional process shortcomings.  In my experience many design engineers have found FMEA to be a good idea but too slow, too time-consuming, and not effective enough to really embrace it.

What I like about Unuma-san’s method it is practical, effective, time-efficient, and evaluates the likely success of countermeasures.  It can be very useful to have FMEA experts, trained in this method, who can help designers with training, facilitation, documentation, review, etc.

There are a few other improved FMEA methods available that are trying to address some of the effectiveness and lateness problems with conventional FMEAs, such as “FMMEA”, (Failure Modes, Mechanisms, and Effects Analysis), and there are good teachings in these methods as well.  I have found Unuma-san’s method to be among the best and really resonates with me.

FMEA is one of the best methods to help avoid failures.  By making the method more effective, products, processes, projects, and infrastructure can have less problems and be more economical.  I highly recommend further study on this topic for engineers and managers delivering any system.

Craig Louie, P.Eng., Co-Founder, SysEne Consulting

Staying ahead of upcoming restrictive drone regulations

drone

How can drone developers avoid being shut down by an accident?

With the gradual increase in the use of commercial and consumer drones, we constantly hear about near-collisions and other incidents. Some of these incidents might be over-dramatized by the media or by whoever reported them, yet the overall risk is rising. In the UK,multiple close encounters of the drone kind have been reported recently, with some very close calls between small drones and large passenger jets. In a recent incident in LA, a helicopter was reportedly struck by what was probably a small drone, fracturing the windshield. Not only aircraft and their passengers are at risk – people on the ground can also get injured. In another incident, an innocent hobbyist’s drone clipped a tree and dropped towards the ground, causing serious eye injury to his friend’s son, a young toddler. In the Netherlands a small drone lost contact with its operator and flew away, gradually running out of battery, then eventually descended onto a busy highway. Although the incident did not result in any damage to people or equipment, it did damage the local drone industry, which was shortly thereafter subjected to extensive flight restrictions. In Vancouver, there have been a number of cases of drones reported near the final approach to YVR and at least a couple of cases when commercial drones crashed and caused minor damage to parked vehicles. YVR airport has recently launched a drone awareness program.

Small drones can help reduce emissions (by replacing larger aircraft for similar operation), save lives through search & rescue operations, replace manned aircraft in dangerous operations, aid in “precision agriculture” that helps produce better yields, help inspect smoke stacks and wind-turbines without the need for downtime, and many other possible applications. When designed and used appropriately, the utility of drones is enormous, and this utility is the essence of why the commercial drone industry has been growing so quickly. Drones can do a lot to advance society.

One of the major barriers to the full public acceptance of drones is that they pose safety concerns to the public. Typical root causes of incidents to date include:

  • Irresponsible operation. Not all drone operators are as experienced, cautious, or responsible as the best commercial operators. The fact that consumer drones are relatively inexpensive makes it possible to start a small business using drones at relatively low (apparent) risk, or for a user to purchase a unit for hobby purposes, with little experience or knowledge. Remote control airplane hobbyists are generally responsible, knowledgeable, and have the required skill to safely fly their models; however, modern drones and particularly the multirotor type are easy to operate anywhere, even for the uninformed or inexperienced operator. Owning and operating a drone safely, requires knowledge, skill, and responsibility. One can begin by taking not-too-costly drone courses, either online or in class.
  • Technical problems associated with performance, reliability, or other shortfalls, for example:
    • Drone flyaway, where a drone suddenly flies away due to a broken communication link with the remote control unit (because of remote control failure or radio interference), a software glitch, operator error, design issues, etc.
    • Engine failure. A variety of solutions have been developed that allow a multirotor drone to recover from engine failure, while many of the products currently on the market do not have such recovery capability.   With fixed wing drones, loss of an engine is generally easier to recover from.
    • Loss of control, for example because of interference confusing the unit’s compass, GPS, or inertial sensors, or due to incorrect orientation or calibration of the unit’s compass (“toilet bowl effect”).
    • Power failure due to battery failure or wiring issues.

Drone technology improvements are enabling more capable and lower cost drones, increasing the numbers of drones and the overall safety risk. Improvements include enabling technologies such as lithium polymer batteries, flight control and ground control station software, small-size (low-weight) cameras and other sensors, and small flight controllers based on solid state electronic components.   There are also numerous technologies and techniques that enhance the reliable operation of drones, such as monitoring battery hours/cycles/performance, setting parameters properly, performing calibration, etc. Flight control has become more affordable with software-enabled augmentation of lower-accuracy inertial sensors.

There is significant opportunity for improving the safety and reliability of small drones to the necessary level, by developing more robust system architectures, improving operational procedures and operator qualifications, technology innovation, improved regulation, and by following more rigorous techniques in the design and manufacturing of these products. Most drones do not fully employ the proven and robust approaches used in the manned aircraft industry in the areas of design, testing, regulation, maintenance, inspection, and other best practices. While some of these approaches are more rigorous and expensive than necessary, and can be relaxed somewhat to be suitable for the drone industry, there is high value in many of these approaches that can lead to both adequate safety and risk profiles, and low enough cost and weight.

Regulations are developing worldwide, and all in the direction of more restrictive or higher required capability and proof. In some countries, there is a move towards restricting the operation of drones near built-up areas and air-fields to “compliant systems” only, which is a challenge for all drone developers, and will likely leave some of the lower-cost manufacturers behind. Manufacturers who are proactive in economically developing reliable, safer, compliant products will be the most successful in the marketplace, as they will be able to operate where others cannot and will avoid reliability issues in the marketplace. Even consumer drones are complex products, and their safe operation an even more complex challenge. One major incident caused by technical failure, or by a design that does not prevent user error, could result in a damaging effect on the national or global drone market.

A systems approach to this complex issue that combines proactive strategy, careful risk analysis, economically innovative solutions, and best practices tailored to the drone industry will enable leading drone developers to get ahead on this issue. Such effort may seem costly, but is by far outweighed by the potential repercussions of failure to prevent an incident, even if it has been caused by “operator error”. It you think safety is too costly, try an accident!

A more useful Requirements Process Maturity Model

A useful diagnostic tool to help determine problem areas and areas for improvements are maturity models.  They can be used by both the client and consultant to determine the current level of performance.  The target level of performance doesn’t need level 5 (highest capability) for everything, as that is likely too expensive or difficult to achieve, or not necessarily needed.

One of the best ways of improving technology and product development is for your organization be good at developing and managing requirements.  About 70% of problems in technology and product development come from requirements and system interaction errors, and fixing these problems at the final acceptance test or in the customer’s hands costs about 100 times more than fixing them in the requirements development and management phases of the project.  Basically build the right thing, build things right, and find problems early.

For requirements development and management, there are a few maturity models published, but I have found them too specific to an industry (like for business analysts in the software industry), cover only certain aspects, or don’t cover integration, training, or culture well enough.  So I’ve developed the above model based on similar models from consulting houses, CMMI, Six Sigma, Model Based Systems Engineering, PLM, and my own background.  I think this can apply to all kinds of systems, from hardware-oriented (manufacturing, construction), software-oriented, or combinations of both.

Requirements Process Maturity Model

(click for full size)

Using this tool can then help structure the problem, ask the right questions and prioritize opportunities. Where does your organization stack up?

If you have comments or questions on the model, or have ideas for improvements, please contact me.

 

 

 

 

System Level Website Failures – Technical, Process, and Organization

In BC, we recently had a windstorm that knocked out power in the province for over 700,000 people, some for 4 days.  One of the most difficult parts of the outage was that the BC Hydro Website that provides outage updates also went out at the same time. This made it very difficult for people without power decide on what to do, where to go, what to do with the food in the refrigerator, etc. and made for many unhappy customers.

Many critical websites are complex systems, and fail more often than desired.  A good example was the failure of the ObamaCare’s HealthCare.gov website launch where there were serious technical problems at the rollout, which has subsequently taken about 6 months to fix the major issues. On launch day, as soon as the website hit about 2,000 simultaneous users, the website performance became unusable, which was an issue since on the first day, 250,000 simultaneous users tried to get access to the website. There are many other problems with the Healthcare.gov, as that project had large budget overruns, with $1.7 Billion dollars spent, which is about 10 times more than budget and what it should have cost. There are also lasting data and security problems with the website and internal database.

healthcare.gov-crash-1

The majority of the root causes of the Healthcare.gov failure were systems-level failures in all three major dimensions of any complex system delivery: technical, process, and organization.

  1. Technical: The system design used an outdated 1990’s database server model that doesn’t scale well with many concurrent users, as opposed to using a more typical e-commerce server model that can scale with users.
  2. Process: The system development process used a waterfall approach to build most of the website and then test it, vs. an agile approach where you test the important parts all throughout the development process.  Additionally there was very little testing during the development.  They were even off by a factor of five on the concurrent user requirement.
  3. Organization: The organizational system of the Government and the Contractor were poor with too many delays, last minute changes, poor subcontracting, poor reporting, and poor coordination.

BC Hydro is conducting a root case investigation of their website failure.  Perhaps the root cause was a simple and isolated issue, but I am interested to hear when the investigation is done on whether the failure had similar systems-level causes like the HealthCare.gov launch failure. For any complex interrelated technical, process and organizational complex problem, the Systems Approach is the best way to develop a solution that satisfies the overall needs and meets the expected behaviours of the system.

Why do so many industrial projects underperform these days?

shutterstock_155515529

About seven out of ten industrial projects underperform in production, operability, and/or have significant  cost or schedule overruns. Everyone working on the project, including the sponsors, want a successful, on budget, and on schedule project.  There are thousands of reference projects that have been done in the past decade, yet why is it so hard to learn from experience?

There are many reasons, and I’d like to comment on a few of the key ones that are of heightened risk because of today’s environment.  There is much material available on industrial project underperformance, and we have talked to many in industry, and unfortunately we hear painful stories too often.  As a general background:

  • Project complexity increases daily, with more difficult to reach resources resulting in a continual need to deploy new combinations of technologies, more difficult environmental regulations, more difficult community relations, etc.
  • We have been through 15 years of an economic boom in the Global Industry, and even the slowdown blip in 2008 was just a short term 10 month bust cycle with one of the fastest rebounds of industrial activity in 2009. During this boom, the underlying cost structure for engineering and construction services has increased much faster than inflation.  On a typical project in the North Sea, companies are having to pay $300/hr for mediocre quality engineering – mediocre since after 15 years of boom, engineering companies have been often taking on less and less capable staff in recent years.
  • In boom times, many unhealthy projects still make money.
  • For many years now, many owner companies have been shedding internal experts in the technical functions, and they try to offload work and risks to EPC(m) or other contracting firms. But much of the work and risk cannot be transferred from owner to contractor because they are structurally different. Owners make money from the capital asset and they can still survive a budget overrun.  Contractors cannot afford to take any financial liability of an underperforming project.  Many owners often try to offload their project management or technical work to the contracting firm, but this is can be problematic mostly owners and contractors have very different perspectives.  Owner’s teams need to be able to provide enough business and technical direction, and also provide contractor oversight.  When they struggle to do so because they don’t have the resources to do so, the whole project suffers.  Owner companies also struggle with internal coherence between all their internal departments and managers when they don’t have enough project resources.
  • Engineering and EPC(m) firms are always in search of the next project and don’t provide or develop enough long term continuity, R&D, productivity, or innovative support to the project over its entire life-cycle, or to the next project. These contractors cannot hold specialty resources or afford to invest in innovation. Engineering and EPC(m) firms are more service firms than total solutions firms – in part because this is what owner’s ask of them through the procurement process.
  • Much of the supply base, where much of the innovation does happen, struggle to afford or acquire all the necessary expertise needed to develop reliable and cost-effective solutions.

And now the Global economic macro-environment has weakened, especially in Canada’s Energy and Resource sectors.

With today’s drop in energy and commodity prices, and a general shortage of industrial capital financing, industrial companies are slashing their technical and project teams and departments to reduce their operating expenses.  Until mid-2014 or so, production was King.  Now we see significant consolidations, downsizing, and a focus on industrial company survival.    An overly-lean team without enough access to critical skills is going to make current and future industrial projects even more difficult to meet expectations, budget and schedule.

With weakened balance sheets, industrial companies are going to need successful projects more than ever.

Keys for Improvement

We need to do better and we can do better with an improved application of management, strategy, approaches, and more respect for the complexity of today’s industrial projects.  While all key stakeholders have to improve, the greatest leverage is with the project sponsors.  They control the highest level need, budget, scope, risk profile, etc., and so they have the largest leverage on the outcome.

  • There needs to be a common understanding by both business and technical professionals on why there are so many issues with these projects, and going forward, how these projects should be developed, governed, and executed.
  • The project team needs to have the right skills, adequate staffing levels, and then a robust training program on how to best manage and implement the industrial project
  • The up-front design and planning work needs to be adequately funded and given enough time. A weak design and/or poor plan causes too many problems downstream when the activity and capital spend ramps up.
  • The right contracting strategy should be chosen, and the overall team constructed in a complete way and consistent with the strategy. The owner’s team must have the right skills and do all the scoping, concept work, requirements development, and overall management that is typical of successful contracts.  The contracting must be done so that the professional service firms deliver quality and get paid well enough for doing so.
  • Experienced and systematic approaches to the:
    • technical solution,
    • process of doing the project,
    • build and organization of the team

While the above roadmap seems obvious, the root cause of the problematic projects are issues in the above five points, in either the understanding, approach, strategy, or implementation.  Furthermore, they have to be done well enough to the sophisticated level required by the complexity in today’s projects.

When owner’s companies become more open to a longer term value and improved partnering with the contracting firms and the supply base, it can enhance productivity and innovation from their products and services to the owner’s projects over the life of the asset.  For example, engineering firms could provide more long term asset support.  They have significant data on all the projects from the design phase, and can get operational data from the currently operating assets.  Currently after the project build is finished, the engineering contractor moves their resources onto other projects (or if it is slow lets them go).  The owner’s operating department of the asset struggle without the contractor engineering support, design models, people continuity, etc., and often the result is the asset does not operate to its potential. There can be a great business case to further optimization and operational improvements to the operating asset that could be turned into a long term support contract.  Everyone wins.

We must change the way we do things for a better outcome, and the ways do exist.

 

 

Tailoring Product Development Processes

There is a wide spectrum of product development processes, from stage gate to spiral processes.  Stage gate processes are able to stage scope and investment decisions and are typically employed in capital intensive industries.  Spiral processes take advantage of many repetitions of the design-build-test cycle and are typically employed in software development.  There are many variants in between.

PDP

 

To best tailor the product development process for the organization, it is important to understand the:

  • business and strategy of the organization
  • architecture and complexity of the product
  • product/project schedule, budget, and requirements
  • risks and uncertainties
  • needed iterations in the process
  • capability and culture of the organization, including Global aspects
  • customers, stakeholders, and suppliers
  • best practices

The resulting product development process is then “systems engineered” as it is an integration of systems and systems elements – technical, process, and people.

There are many useful methods to choose from during this design process, including:

  • Design Structure Matrix (Eppinger)
  • Agile Methods
  • Lean Methods
  • Model Based Engineering
  • Collaborative Supplier Integration
  • Risk-based Planning
  • Quality Approaches

A key aspect of product development is dealing with all the risks and uncertainties, which means iteration is inherent in the process.  There are both planned iterations and unplanned iterations (to fix it when it’s not right).  It is important to understand the linkages, interactions, and drivers behind how the iterations will happen.  From that understanding, iteration can be accelerated through information technology, coordination techniques, or decreased coupling.  After that, by prioritizing risks, planning the needed iterations, planning the integration and test activities, and scheduling reviews to control the process, the project risks can be addressed.

The process must also be tailored to the organization, specific people, and key stakeholders.  This is probably the most difficult part, as it is all about dealing with people, managing change, and shifting cultures.  It is important to pick and choose the most important methods, implement them, and sustain them, in a practical way. Too many processes fail because they are not used, unwieldy, inflexible, not fully coherent,  too conservative, too bureaucratic, take too many resources, or are only partly implemented.  Beyond process definition, there is training, coaching, fine-tuning, and ensuring the team sees that the change is in their self-interest to adopt, and really “owns” any new processes.

While overall improving the process is a complex and difficult initiative, having a competitive Product Development process is key to quality products, low costs, speed to market, satisfied customers, and good business.

How to Dramatically Improve Health Care; Speed, Quality, Costs


After my recent in-depth experiences with both the Japanese and Canadian Health Care systems, I’ve continued my investigation why the Japanese system has dramatically reduced wait times, better outcomes, and lower costs as compared to the Canadian system.  I have included the US system as well.  It is clear to me that applying systems engineering to health care will both improve the system and lower its costs.  When I experienced the Japanese health care system, I was so shocked at how much better and faster it was than the Canadian system, I wrote a post on it in Sept 2013, and I repeat one of the key tables here:

healthcomp

In Canada, the wait times for critical diagnoses are getting worse (see this recent article in the Globe and Mail).  Cancer diagnosis can take 1 to 6 months (!!!) in BC whereas in Japan it can often be done in one day.

The US is getting serious about applying Systems Engineering to their Health Care System.  The White House’s President’s Council of Advisors on Science and Technology (PCAST) has published an excellent report in May of 2014 called “Report to the President, Better Health Care and Lower Costs: Accelerating Improvement through Systems Engineering”.  The Report is an excellent read and is surprisingly bold in its recommendations.   One of the main recommendations is to transition from a fee-for-service model, which is a disincentive to efficient care, to one that pays for value instead of volume.

pcast

Health Care Systems are very complex, with evolving medical science and technology, multiple stakeholders, increased specialization, and rising expectations of what can be done to treat illnesses, and a lot of realpolitik.  Systems engineering has been used successfully and widely in many other complex industries, such as manufacturing or aviation.  Systems engineering has also been used to good effect in health care, but too rarely and not widely, and barely at all on the macro scale.

The need to improve health care is required, with increased population, aging, and budgetary pressures.  The opportunity for improvement is massive.  In the US, approximately 33% of health care costs are wasted, 20-33% of hospitalized patients experience a medical error with about half of them preventable, many quality issues, and caregivers and patients do not have enough necessary information when needed.  In Canada, we see many of the same issues as the US, and while we have Universal coverage, wait times for necessary diagnostics or treatment are unnecessarily and often crazily long.  Even in Japan, with its worldwide overall best outcomes, low costs, and low wait times, significant improvements are possible in overall efficiency, information flow, costs, and caregiver conditions.

Examples of how systems engineering can improve health care include:

  • Denver Health saving $200 million in 2006 by doing a systems redesign of their operations. As an example to reduce waste, one industrial engineer found the trauma surgery resident physicians walk 8.5 miles in a 24 hour shift!
  • Kaiser Permanente identified 3x as many sepsis cases and cut mortality from sepsis by 50%
  • Virginia Mason has the lowest rate of serious medical infections and falls and reduced medical malpractice liability by 40%

 Systems Engineering Processsystengprocess

How impactful could systems engineering be if applied at all levels of Health Care? The promise is outcomes as good as Japan or the top tier American care, Universally applied, and lower costs to Government and Patients, with essentially no wait times.  It won’t happen overnight, but with the right strategy we could get there in 3-5 years. It is very feasible – if others can do it, we can too. That will then allow us to also be prepared for the greying of our populations.  It is good business too, as the improved systems can be exported to other parts of the world.  When you have a good system, look how it can dominate the market – like Amazon with its great portal, logistics, and network; or the Internet, with its scalability and extensibility, or air travel with its convenience, low costs, widespread usage, and high safety.

The best studies on how to improve health care by applying systems engineering tools and principles comes from the US.  An excellent paper and collection of studies was published in 2005 by the combined efforts of the National Academy of Engineering and the Institute of Medicine called “Building a Better Delivery Systems, a New Engineering/Health Care Partnership”.  I highly recommend this paper.  Much of this material formed the basis for the 2014 PCAST report to President Obama.  Yet one of the last papers in this collection highlights the real difficulties with making improvements in the US Health Care system by analyzing and giving painful examples of the political difficulty, especially with so many interests, organizations and the huge amount of money in the Health Care systems.

Other barriers include:

  • Misaligned incentive structure – fee-for-service vs fee-for-outcomes or value
  • Availability for data and relevant analytics
  • Limited technical capabilities, especially in small practices that make up the bulk of health care
  • Workforce competencies – limited knowledge of systems engineering tools and practices
  • Leadership / culture / politics

Yet while difficult, governments, organizations, and people around the world understand the need for change, the urgency for change, and that there will be change in Health Care.  It is hard work, it will take time, and there are many barriers, especially politically.  Slowly and steadily, I expect systems engineering tools, principles, and activities to be applied into the Health Care system.  You can help by reading the PCAST and other reports and supporting the application of Systems Engineering to Health Care.

For me, I am approaching Industry, Government and Academic leaders with this message and analysis, participating in consultations, etc.

 

Improving Systems Education and Research at Canadian Universities

In today’s world, products and processes are becoming more complex, and systems engineering is the best method to manage change and complexity.  Students that have academic and experiential capability in systems engineering will be more useful and attractive to potential employers.  Universities that provide a strong program in Systems will attract better students and improve academic and industry collaborations.  Industry and Government will benefit by improved systems development.

Worldwide

Engineering education worldwide has begun to broaden from preparing students for technical careers in a particular discipline to also prepare technical leaders that will develop complex systems or have their “subsystem” fit better into the next higher level system.  Engineers today are expected to be capable in management concepts and social science that encompass supply chains, politics, economics, and customers.  The leading Universities have made cross-functional organizations that often combine engineering, management, and social science into “Engineering Systems” systems-oriented schools.  These organizations can better cut across the more siloed traditional disciplines to offer integrated systems education and research which benefits from discipline fusion.

The forefront of the Engineering Systems Education and Research Universities include MIT ESD, Georgia Tech ISyE, Stevens SSE and SERC, Keio SDM, TUDelft TPM, and others.  There is a Council of Engineering Systems Universities (CESUN) that helps coordinate the development of this field of study, with about 60 universities as members.  SFU and the University of Waterloo are members of CESUN.

engineering systems

Overall I find much of the best Systems content comes from MIT Engineering Systems Department and associated community, such as from Steven Eppinger, or their book on Engineering Systems by de Weck, Roos, and Magee.  There is a lot of other great material out there from many others, but if I had to choose the best Engineering Systems University program, it would be MIT’s ESD program.  MIT’s ESD Strategic Plan is a worthwhile read.  To also see that other regions are also at the forefront of Systems education, the “SDM in Two Minutes” video from Keio University’s program is also worthwhile.

There is also strong Systems Engineering Professional Education Programs available from places like Caltech or Georgia Tech, as many organizations send mid-career engineers, project managers, business analysts and management to these programs.  INCOSE, the International Council of Systems Engineering also provides links to training and certification as a Systems Engineering Professional, again primarily for professionals in the workforce.

The Systems Engineering discipline primarily came from Industry and Government, especially Defense and Aviation, and is now grown to be applied to develop and manage the complex systems in Energy, Transportation, Health Care and other industries.  Both the Systems Engineering Professional Education and the University Education in Engineering Systems are complementary and synergistic.

Universities that provide Systems education provide Undergraduate programs, Graduate programs, or Professional Certificate programs, or a combination of all three.  Undergraduates with Systems education are able to become useful as a Systems Engineer right away. Charles Wasson makes a great argument for comprehensive systems engineering training at the undergraduate level to all engineers in this paper. At the same time, it can also be good to become well educated in one of the disciplines, like Mechanical or Software Engineering, and then take a Graduate degree in Systems, often with some work experience in between.  Many engineers in the workforce find that their background in one of the disciplines is not enough for being a leader in developing complex multi-disciplinary systems, so they return to get either a Graduate degree or take Professional courses in Systems.  The average age of students in MIT’s System Design and Management Program is 34, reflecting more mature students.

Canada

The Canadian University Programs in Engineering Systems or Systems Engineering are not as well developed as the leading Universities in this field.  UBC and SFU have undergraduate programs in Integrated Engineering and Systems Engineering respectively, and both are a good first step towards multi-disciplined engineering, but neither school has a Graduate Level or Professional Programs, and the current curriculum does not generally include the Systems Engineering fundamentals or have the same level of fusion with social sciences or management science as in other leading Universities.  SFU’s program is more of a Mechatronics program than what Systems Engineering is typically known for.  The University of Waterloo has perhaps one of the best Systems program in Canada, with their System Design Engineering program, which is both Undergraduate and Graduate level, though it has a flavour of more “subsystems engineering” than “macro systems engineering”.  Concordia also seems to have a good Systems program, graduate level, and focused on Information Systems.   U of T has a graduate certificates in global engineering or multidisciplinary engineering final project programs, but the bulk of instruction is still in the traditional disciplines, and there isn’t the same level of Systems education or Research as the leading Universities.  Overall for Canadian Universities there is a good start but there is much room for improvement.

Note there is a large diversity in the naming of these “Systems” programs, as to a certain degree, each University likes to brand their program as unique.

In my home region of Vancouver, there are many local companies that heavily use systems engineering in their development.  They include MDA, Westport, Ballard, and many small tech start-ups.  They have all had to teach the Systems Engineering discipline by bringing in external resources, as BC graduates don’t come with much Systems educational background.  For future BC developments, such as a new LNG plant, or improving our Health Care System, Systems Engineering is of great benefit.  In the rest of Canada, we have world leading companies like Bombardier, GE Canada, SNC-Lavalin, Cisco, and Blackberry that all heavily use Systems Engineering.

Canada is shifting from a more Resource-centric economy to more of a Knowledge-based economy.  One of the most effective pillars to do that is to ensure Canada has a very strong systems-centric engineering education at our academic institutions to complement the traditional disciplines.  Canadian Universities must improve their Systems education and Research.  There are great examples by the leading Universities that Canadian Universities can incorporate.

While these changes are difficult to do, because it requires organizational changes, there can be tenure and political issues, there are fixed budgets and five year plans already in place, and it can be hard to fuse departments between different faculties of Engineering, Management, and Social Science – the incredible benefits of improved Systems education to Canada, the Provinces, Industry, Students, and the Universities is well worth the investment.

Insight and Heuristics in System Architecting

 One insight is worth a thousand analyses

iron man simul

-Engineering and Art: Iron Man 3

Systems Architecting is as much art as it is science.  The best book on this subject is from Maier and Rechtin, and I highly recommend it.

ArtArchitecting

-Maier and Rechtin, The Art of Systems Architecting, second edition, CRC Press, 2000

One of the best section of the book deals with using the method of Heuristics in architecting.  Insight, or the ability to structure a complex situation in a way that greatly increases one’s understanding of it, is strongly guided by lessons learned from one’s own or others’ experiences and observations.  Given enough lessons, their meaning can be codified into “heuristics”.  Heuristics are an essential complement to analytics.

As in the previous post, where the system engineer is to consider the whole and apply wisdom, Maier and Rechtin also promote the use of wisdom but they note that “Wisdom does not come easy”

  • Success comes from wisdom
  • Wisdom comes from experience
  • Experience comes from mistakes

While required mistakes can come from the profession as a whole, or from predecessors, it also highlights the importance of systems engineering education from those skilled in the art.

Examples of heuristics are:

  1. Don’t assume that the original statement of the problem is necessarily the best, or even the right one
  2. In partitioning, choose the elements so that they are as independent as possible; that is elements of low external complexity and high internal complexity
  3. Simplify. Simplify. Simplify.
  4. Build in and maintain options as long as possible in the design and implementation of complex systems.  You will need them.
  5. In introducing technological and social change, how you do it is often more important than what you do
  6. If the politics don’t fly, the hardware never will.
  7. Four questions, the Four Whos, need to be answered as a selfconsistent set if a system is to succeed economically; namely, who benefits?, who pays? and, as appropriate, who loses?
  8. Relationships among the elements are what give systems their added value
  9. Sometimes it is necessary to expand the concept in order to simplify the problem.
  10. The greatest leverage in architecting is at the interfaces.

-taken from Maier and Rechtin, The Art of Systems Architecting, second edition, CRC Press, 2000

 Heuristics are tools, and must be used with judgement.  The ones presented in the book are trusted and time-tested.  They may not apply specifically to your complex systems architecting work, though I think you will find most of them do.