Why Conventional FMEAs fail too often, and why the Absolute Assessment Method FMEA is much better.

(Failure Modes and Effects Analysis)

On September 29, 2016, a commuter train crashed in Hoboken, New Jersey, killing one and injuring 108, with high speed being a factor. The root cause of the crash is under investigation.

A similar crash happened in Amagasaki, Japan in April 2005, where 106 were killed and 562 injured, and high speed around a curve was a factor.  The conventional explanation of the root cause of the Amagasaki crash was corporate pressure on the driver to be on time.  Drivers faced harsh penalties for lateness, including humiliating “training” programs that involved weeding and grass-cutting duties.  In this case, the driver was speeding.  The resulting countermeasure in Amagasaki has been to install an expensive $1 billion train speed control system on the small line to help mitigate a potential accident.

There have been many other high-speed passenger train derailments, such as the Santiago de Compostela derailment in Spain in 2013 (79 dead, 139 injured out of 218 passengers), and the Fiesch derailment in Switzerland in 2010 (1 dead, 42 injured).  The root cause explanations of these accidents tend to focus on the drivers driving faster than they should, and countermeasures tend to focus on semi-automated systems to control train speed.

Do we really know the root cause of these accidents, and are the countermeasures both effective and economic?

One of the best root cause analyses I’ve seen on the Amagasaki crash comes from Unuma Takashiro, and his conclusion is unconventional.  Unuma-san is a Failure Modes and Effects Analysis (FMEA) consultant from Japan. FMEA is one of the best methods to analyze a design to help prevent failures.  FMEA was developed in the aviation and space industries in the 1960s, adopted by the automotive industry in the 1990s, and is now prevalent in many industries including health care.

Unuma-san argues that in the case of the Amagasaki crash, the speed control system is expensive and not fail-safe.  One economical and effective countermeasure would be to add a $250,000 guard rail, which at the very least would likely prevent a recurrence, and would definitely be useful as an additional layer of countermeasure.  The advantage of low-cost and effective countermeasures is that they can be widely deployed.

He argues the real root cause of this failure is that the overall engineering and management approach to mitigating failures was not adequate – both initially to prevent the accident in the first place, and subsequently after the crash by putting in the speed control system but not (also) the guard rail.

Unuma-san has a very interesting and useful website on FMEA practices, and uses the Amagasaki crash as one of many examples.  He promotes an FMEA method that evaluates countermeasures in absolute terms, in contrast to the conventional FMEA, which evaluates them in relative terms.  The problem with the relative evaluation method is that it can easily miss important failure modes that do not make an arbitrary priority cutoff.  Missing important failure modes often leads to unexpected incidents.
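To make the cutoff problem concrete, here is a minimal sketch in Python.  It assumes the relative method ranks failure modes by a risk priority number (severity × occurrence × detection, each rated 1 to 10) and only acts on those above a threshold, which is a common conventional practice.  The failure modes, ratings, and the cutoff of 100 are hypothetical illustrations and are not taken from Unuma-san’s material.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (minor) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (easily detected) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Conventional risk priority number: product of the three ratings.
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and ratings, for illustration only.
MODES = [
    FailureMode("door sensor drifts", 4, 6, 5),
    FailureMode("seat fabric wears", 2, 8, 7),
    FailureMode("derailment on curve", 10, 2, 4),
]

RPN_CUTOFF = 100  # arbitrary priority threshold (assumption)


def above_cutoff(modes, cutoff=RPN_CUTOFF):
    """Failure modes the relative method would act on."""
    return [m for m in modes if m.rpn >= cutoff]


def missed_severe(modes, cutoff=RPN_CUTOFF, severe=9):
    """Severe failure modes that fall below the cutoff and get no action."""
    return [m for m in modes if m.rpn < cutoff and m.severity >= severe]


selected = above_cutoff(MODES)
missed = missed_severe(MODES)
for m in MODES:
    print(f"{m.name}: RPN {m.rpn}")
print("acted on:", [m.name for m in selected])
print("severe but missed:", [m.name for m in missed])
```

In this made-up data set the catastrophic but rare failure mode scores below two minor ones, so a cutoff-based ranking would leave it without a countermeasure.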

He also analyzes the conventional FMEA approach and teachings, and points out many problems seen in industry.  Conventional FMEAs are often:

  • ineffective because of missing failure modes,
  • done too late in the design process, making it more difficult and less likely to implement countermeasures,
  • led by team members from other departments that are not responsible for the design, which both lowers the effectiveness of the analysis and can allow the designer to not be held fully accountable for the FMEA results,
  • weak at promoting economical countermeasures, and
  • based on common FMEA teachings that contain flaws that promote the above problems.

Unuma-san shows that many FMEAs confuse failure mechanisms (the physical, chemical, thermal, electrical, biological, or other stresses leading to the failure mode) with the actual failure modes (ways a product or process can fail), which leads to missing failure modes.  If a failure mode is missed, then a countermeasure may never be identified and incorporated into the design.

He points out that the relative evaluation FMEA method promotes doing the FMEA on the entire design when enough of the design is done, then once the FMEA is done to a certain level, all of the issues are prioritized, and then acted upon.  The problem with this approach is that FMEAs take a lot of time, and by the time the results are done, the recommended changes to the design can be too late to be easily implemented.  He promotes instead that the designers do the FMEA as they are doing the design in a very concurrent and “local” manner, while evaluating the countermeasures in an absolute manner against the individual failure mode.  This more easily allows for countermeasures to get into the design of the product or process in the early stages.

When non-designers take too much of the FMEA responsibility and scope, the effectiveness of the FMEA is reduced and the results are available late in the design process.  The effectiveness is reduced because non-designers are unable to know all the key information in the heads of the designers, and the designers may feel less accountable for the FMEA quality.  Results are delayed because instead of countermeasures being considered at the time of the design decision, they are made available after the design decision has been made and it is then more difficult and less likely to have any countermeasure implemented.

Unuma-san’s method is simpler than many FMEAs: it uses a four-point scale to the third power (64 ratings), whereas many conventional approaches use a 10-point scale to the third power (1000 ratings).  He promotes determining countermeasures per failure mode, evaluating the likely success of those countermeasures, and checking whether there is opportunity for optimization and lower costs from reducing overdesign.
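The size difference between the two rating spaces can be checked with a few lines of Python.  This sketch only assumes that three factors are each rated independently on the same scale; which three factors are rated is left open.

```python
from itertools import product


def rating_combinations(points: int, factors: int = 3):
    """All possible rating tuples when each factor is scored 1..points."""
    return list(product(range(1, points + 1), repeat=factors))


four_point = rating_combinations(4)
ten_point = rating_combinations(10)
print(len(four_point), len(ten_point))  # 64 1000
```

A team has far fewer distinctions to argue over with 64 possible ratings than with 1000.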

Unuma-san goes on to analyze the common teachings of FMEA by referring to many of the most common reference materials available in books, training material, websites, etc., and he shows many flaws, inconsistencies, and interpretation issues that tend to exacerbate the above problems.  Much of the trouble with conventional FMEAs can be traced to poor teachings.

Unuma-san has consulted for a very long and impressive list of Japanese companies on FMEA in the transportation, health care, manufacturing, and consumer goods industries.

I’ve been a lead designer of multiple complex systems, and I’ve been helping clients improve their product development processes, including FMEA.  The teachings of Unuma-san resonate strongly with me.  Too often I have seen poorly done FMEAs that miss critical failure modes, late FMEAs whose recommendations arrive too late to be useful, and FMEA study teams that don’t have enough participation by the design team.  The absolute evaluation method FMEA is a substantial improvement over the relative evaluation method, mostly because it evaluates the likely success of countermeasures.  I highly recommend his webpage on FMEAs, and it is linked here.  It is a little hard to read as the website translation to English isn’t the best, but it is worthwhile.

I think one of the reasons why FMEA teachings have many issues is that few FMEA teachers have been skilled design engineers; they are instead people who gravitate to process design.  The idea behind the FMEA is good and includes teaching early and effective analysis, but unfortunately much of the applied practice falls short.  A skilled design engineer naturally considers failure modes and tries to design them out, while simultaneously considering many other design tradeoffs, such as performance, function, economics, aesthetics, ergonomics, etc.  I think that since the conventional FMEA trainers developed the applied practice of the FMEA, they have continued to build upon the original process of the relative assessment method, and have struggled to develop effective practices that overcome the conventional process shortcomings.  In my experience, many design engineers have found FMEA to be a good idea but too slow, too time-consuming, and not effective enough to really embrace.

What I like about Unuma-san’s method is that it is practical, effective, time-efficient, and evaluates the likely success of countermeasures.  It can be very useful to have FMEA experts, trained in this method, who can help designers with training, facilitation, documentation, review, etc.

There are a few other improved FMEA methods available that try to address some of the effectiveness and lateness problems with conventional FMEAs, such as “FMMEA” (Failure Modes, Mechanisms, and Effects Analysis), and there are good teachings in these methods as well.  I have found Unuma-san’s method to be among the best, and it really resonates with me.

FMEA is one of the best methods to help avoid failures.  By making the method more effective, products, processes, projects, and infrastructure can have fewer problems and be more economical.  I highly recommend further study on this topic for engineers and managers delivering any system.

Craig Louie, P.Eng., Co-Founder, SysEne Consulting

There isn’t a dilemma in autonomous vehicles having to choose between harming their passengers or others.

A recent study in Science “The Social Dilemma of Autonomous Vehicles” has highlighted how self-driving cars need to have algorithms to decide on actions in extreme situations – and even having to choose between protecting the passengers vs. pedestrians.  The study results indicate that participants favor minimizing the overall number of public deaths even if it puts the vehicle in harm’s way.  But when asked about which cars they would actually buy, participants would choose a car that would protect them first.  This study highlights an apparent conflict between morality and autonomy.

I like the study, as it raises good questions, and it describes part of one of the many issues in autonomous vehicles.  I also like the many news articles that have been written on this topic based on the study, as it helps raise awareness of the complexity of the issue – that it is both social and technical.  At the same time, for the sake of being newsworthy, and controversial, most narratives I read on the topic frame the study and topic as a social dilemma.  Yet when examined through a technical perspective, we will have dramatically safer situations for both passengers and pedestrians with autonomous vehicles, and there isn’t any dilemma.

Traffic-related deaths exceed 1.25 million worldwide per year, with aging drivers, distracted driving, higher speeds, and the prevalence of substance abuse all contributing to keep the rate stubbornly high.  For every person killed in a motor-vehicle accident, 8 are hospitalized and 100 are treated and released from emergency rooms.  Autonomous driving, when implemented well, will easily reduce this by 90%, and perhaps by 99% when fully implemented.  The response time, sensing, spatial awareness, decision-making, and reliability of an autonomous vehicle will be better than those of most of us, except perhaps for highly trained and talented drivers, and far better than those of the too many drivers who cause most accidents (distracted, drunk, inexperienced, tired, with reduced reflexes, etc.).  The autonomous capability allows us to have a safer response for both the passenger and pedestrian.
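As a rough back-of-the-envelope sketch in Python, the figures above translate into the following annual numbers.  It assumes the 8:1 hospitalization ratio and 100:1 emergency-room ratio apply uniformly to the worldwide death count, which is a simplification made only for illustration.

```python
ANNUAL_DEATHS = 1_250_000      # worldwide traffic deaths per year (from the text)
HOSPITALIZED_PER_DEATH = 8     # ratio from the text
ER_TREATED_PER_DEATH = 100     # ratio from the text


def casualties_avoided(reduction_pct: int) -> dict:
    """Annual casualties avoided for a given percentage reduction."""
    deaths = ANNUAL_DEATHS * reduction_pct // 100
    return {
        "deaths": deaths,
        "hospitalized": deaths * HOSPITALIZED_PER_DEATH,
        "er_treated": deaths * ER_TREATED_PER_DEATH,
    }


for pct in (90, 99):
    print(pct, casualties_avoided(pct))
```

At a 90% reduction this works out to 1,125,000 deaths avoided per year, along with roughly 9 million hospitalizations under the stated assumption.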

Consider that the autonomous vehicle can respond faster than most humans.  I have the lane departure warning system on my car, and it is much faster than me.  An autonomous vehicle will be able to brake faster, more optimally, and steer a better adaptive path that is more likely to minimize injury to both passenger and pedestrian.  Most drivers can’t brake as fast, or optimize the braking pressure, or optimize the steering adjustments during the emergency maneuver as well as a well-implemented autonomous vehicle.  The following picture shows a better braking and adaptive steering path with the best overall outcome for both passenger and pedestrian.  In the event of a collision, the overall speed, impact angle, etc. will be reduced.

[Figure: braking and adaptive steering path]

With autonomous vehicles, there will still be accidents, and there will be cases where it will be determined that the autonomous vehicle did not make the best decision.  But the overall absolute level of safety will go up so dramatically, that the question will not be “isn’t this the wrong car to buy because it may decide wrongly in an extreme case?” but “isn’t this the right car to buy because it is overall so much safer for me and everyone else?”.  The moral path is to embrace autonomous vehicles, and work towards a proper system design and implementation in industry, government, and with consumers.

A more useful Requirements Process Maturity Model

Maturity models are a useful diagnostic tool for identifying problem areas and areas for improvement.  They can be used by both the client and consultant to determine the current level of performance.  The target level of performance doesn’t need to be level 5 (highest capability) for everything, as that is likely too expensive or difficult to achieve, or not necessarily needed.

One of the best ways of improving technology and product development is for your organization to be good at developing and managing requirements.  About 70% of problems in technology and product development come from requirements and system interaction errors, and fixing these problems at the final acceptance test or in the customer’s hands costs about 100 times more than fixing them in the requirements development and management phases of the project.  Basically: build the right thing, build things right, and find problems early.
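A small Python sketch shows how quickly that multiplier adds up.  The 70% share and the 100-times factor come from the text; the total of 100 problems and the cost of one unit per early fix are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
TOTAL_PROBLEMS = 100        # hypothetical project size
REQUIREMENTS_SHARE_PCT = 70 # share caused by requirements/interaction errors (from the text)
EARLY_FIX_COST = 1          # hypothetical cost units per fix in the requirements phase
LATE_COST_MULTIPLIER = 100  # late fixes cost about 100 times more (from the text)

requirement_problems = TOTAL_PROBLEMS * REQUIREMENTS_SHARE_PCT // 100
cost_if_found_early = requirement_problems * EARLY_FIX_COST
cost_if_found_late = cost_if_found_early * LATE_COST_MULTIPLIER

print("requirements-related problems:", requirement_problems)
print("cost if found early:", cost_if_found_early)
print("cost if found late:", cost_if_found_late)
```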

For requirements development and management, there are a few maturity models published, but I have found that they are too specific to an industry (like for business analysts in the software industry), cover only certain aspects, or don’t cover integration, training, or culture well enough.  So I’ve developed the model shown here based on similar models from consulting houses, CMMI, Six Sigma, Model Based Systems Engineering, PLM, and my own background.  I think this can apply to all kinds of systems, whether hardware-oriented (manufacturing, construction), software-oriented, or combinations of both.

Requirements Process Maturity Model

Using this tool can then help structure the problem, ask the right questions and prioritize opportunities. Where does your organization stack up?

If you have comments or questions on the model, or have ideas for improvements, please contact me.

System Level Website Failures – Technical, Process, and Organization

In BC, we recently had a windstorm that knocked out power in the province for over 700,000 people, some for 4 days.  One of the most difficult parts of the outage was that the BC Hydro website that provides outage updates also went out at the same time. This made it very difficult for people without power to decide what to do, where to go, what to do with the food in the refrigerator, etc., and made for many unhappy customers.

Many critical websites are complex systems, and fail more often than desired.  A good example was the launch of ObamaCare’s HealthCare.gov website, which had serious technical problems at rollout; the major issues subsequently took about 6 months to fix. On launch day, the website became unusable as soon as it hit about 2,000 simultaneous users, which was a serious problem because 250,000 simultaneous users tried to access it on the first day. HealthCare.gov had many other problems: the project had large budget overruns, with $1.7 billion spent, which is about 10 times more than the budget and what it should have cost. There are also lasting data and security problems with the website and internal database.

The majority of the root causes of the HealthCare.gov failure were systems-level failures in all three major dimensions of any complex system delivery: technical, process, and organization.

  1. Technical: The system design used an outdated 1990s database server model that doesn’t scale well with many concurrent users, as opposed to a more typical e-commerce server model that can scale with users.
  2. Process: The system development process used a waterfall approach to build most of the website and then test it, vs. an agile approach where you test the important parts all throughout the development process.  Additionally, there was very little testing during the development.  They were even off by a factor of five on the concurrent user requirement.
  3. Organization: The organizational system of the Government and the Contractor was poor, with too many delays, last-minute changes, poor subcontracting, poor reporting, and poor coordination.
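The scale of the capacity gap can be sketched with simple arithmetic in Python.  The 2,000 and 250,000 figures and the factor of five come from the text; reading “off by a factor of five” as a requirement set at one fifth of the actual first-day demand is an assumption made for this illustration.

```python
USABLE_CONCURRENT_USERS = 2_000    # load at which the site became unusable (from the text)
ACTUAL_CONCURRENT_USERS = 250_000  # first-day simultaneous users (from the text)
REQUIREMENT_ERROR_FACTOR = 5       # "off by a factor of five" (from the text)

# Assumption: the requirement underestimated the actual demand by that factor.
assumed_requirement = ACTUAL_CONCURRENT_USERS // REQUIREMENT_ERROR_FACTOR

shortfall_vs_demand = ACTUAL_CONCURRENT_USERS // USABLE_CONCURRENT_USERS
shortfall_vs_requirement = assumed_requirement // USABLE_CONCURRENT_USERS

print("assumed requirement:", assumed_requirement)
print("demand / usable capacity:", shortfall_vs_demand)
print("assumed requirement / usable capacity:", shortfall_vs_requirement)
```

Under that reading, the delivered system fell short of even its own understated requirement by a wide margin, which points to the lack of load testing as much as to the requirement error.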

BC Hydro is conducting a root cause investigation of their website failure.  Perhaps the root cause was a simple and isolated issue, but I am interested to hear, when the investigation is done, whether the failure had systems-level causes similar to those of the HealthCare.gov launch failure. For any complex problem with interrelated technical, process, and organizational aspects, the Systems Approach is the best way to develop a solution that satisfies the overall needs and meets the expected behaviours of the system.

Tailoring Product Development Processes

There is a wide spectrum of product development processes, from stage gate to spiral processes.  Stage gate processes are able to stage scope and investment decisions and are typically employed in capital intensive industries.  Spiral processes take advantage of many repetitions of the design-build-test cycle and are typically employed in software development.  There are many variants in between.

To best tailor the product development process for the organization, it is important to understand the:

  • business and strategy of the organization
  • architecture and complexity of the product
  • product/project schedule, budget, and requirements
  • risks and uncertainties
  • needed iterations in the process
  • capability and culture of the organization, including global aspects
  • customers, stakeholders, and suppliers
  • best practices

The resulting product development process is then “systems engineered” as it is an integration of systems and systems elements – technical, process, and people.

There are many useful methods to choose from during this design process, including:

  • Design Structure Matrix (Eppinger)
  • Agile Methods
  • Lean Methods
  • Model Based Engineering
  • Collaborative Supplier Integration
  • Risk-based Planning
  • Quality Approaches

A key aspect of product development is dealing with all the risks and uncertainties, which means iteration is inherent in the process.  There are both planned iterations and unplanned iterations (to fix it when it’s not right).  It is important to understand the linkages, interactions, and drivers behind how the iterations will happen.  From that understanding, iteration can be accelerated through information technology, coordination techniques, or decreased coupling.  After that, by prioritizing risks, planning the needed iterations, planning the integration and test activities, and scheduling reviews to control the process, the project risks can be addressed.
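One way to see where iteration will come from is a small Design Structure Matrix.  The following Python sketch uses hypothetical tasks and dependencies, and assumes the convention that a mark in row i, column j means task i needs information from task j.  Tasks that depend on each other, directly or through a chain, form a coupled group that has to be iterated together.

```python
# Hypothetical tasks and dependencies, for illustration only.
TASKS = ["requirements", "architecture", "mechanical design",
         "software design", "integration test"]

# DSM[i][j] == 1 means task i needs information from task j (assumed convention).
DSM = [
    [0, 0, 0, 0, 0],  # requirements
    [1, 0, 0, 0, 0],  # architecture needs requirements
    [0, 1, 0, 1, 0],  # mechanical design needs architecture and software design
    [0, 1, 1, 0, 0],  # software design needs architecture and mechanical design
    [0, 0, 1, 1, 0],  # integration test needs both designs
]


def reachability(dsm):
    """Transitive closure of the dependency matrix (Warshall's algorithm)."""
    n = len(dsm)
    reach = [row[:] for row in dsm]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return reach


def coupled_groups(dsm, tasks):
    """Groups of two or more tasks that depend on each other."""
    reach = reachability(dsm)
    groups, seen = [], set()
    for i in range(len(tasks)):
        if i in seen:
            continue
        group = [j for j in range(len(tasks))
                 if j == i or (reach[i][j] and reach[j][i])]
        seen.update(group)
        if len(group) > 1:
            groups.append([tasks[j] for j in group])
    return groups


print(coupled_groups(DSM, TASKS))
```

In this example the mechanical and software design tasks form the only coupled group, so that is where planned iterations, faster information exchange, or decreased coupling would pay off most.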

The process must also be tailored to the organization, specific people, and key stakeholders.  This is probably the most difficult part, as it is all about dealing with people, managing change, and shifting cultures.  It is important to pick and choose the most important methods, implement them, and sustain them in a practical way. Too many processes fail because they are not used, or are unwieldy, inflexible, not fully coherent, too conservative, too bureaucratic, too resource-intensive, or only partly implemented.  Beyond process definition, there is training, coaching, fine-tuning, and ensuring the team sees that adopting the change is in their self-interest, and really “owns” any new processes.

While overall improving the process is a complex and difficult initiative, having a competitive Product Development process is key to quality products, low costs, speed to market, satisfied customers, and good business.