Why Conventional FMEAs fail too often, and why the Absolute Assessment Method FMEA is much better.

(Failure Modes and Effects Analysis)

On September 29, 2016, a commuter train crashed in New Jersey, killing one and injuring 108, with high speed being a factor. The root cause of the crash is under investigation.

A similar crash happened in Amagasaki, Japan, in April 2005, where 106 were killed and 562 injured, and high speed around a curve was a factor.  The conventional explanation of the root cause of the Amagasaki crash was corporate pressure on the driver to be on time.  Drivers faced harsh penalties for lateness, including humiliating “training” programs that involved weeding and grass-cutting duties.  In this case, the driver was speeding.  The resulting countermeasure in Amagasaki has been to install an expensive $1 billion train speed control system on the small line to help mitigate a potential accident.

There have been many other high speed passenger train derailments, such as the Santiago de Compostela derailment in Spain in 2013 (79 dead, 139 injured out of 218 passengers), and the Fiesch derailment in Switzerland in 2010 (1 dead, 42 injured).  The root cause explanation of these accidents tends to focus on the drivers driving faster than they should, and countermeasures tend to focus on semi-automated systems to control train speed.

Do we really know the root cause of these accidents, and are the countermeasures both effective and economic?

One of the best root cause analyses I’ve seen on the Amagasaki crash comes from Unuma Takashiro, and his conclusion is unconventional.  Unuma-san is a Failure Modes and Effects Analysis (FMEA) consultant from Japan. FMEA is one of the best methods to analyze a design to help prevent failures.  FMEA was developed in the aviation and space industries in the 1960s, adopted by the automotive industry in the 1990s, and is now prevalent in many industries including health care.

Unuma-san argues that in the case of the Amagasaki crash, the speed control system is expensive and not fail-safe.  One economical and effective countermeasure would be to add a $250,000 guard rail, which at the very least would likely prevent a recurrence, and would definitely be useful as an additional layer of countermeasure.  The advantage of low-cost and effective countermeasures is that they can be widely deployed.

He argues the real root cause of this failure is that the overall engineering and management approach to mitigating failures was not adequate – both initially to prevent the accident in the first place, and subsequently after the crash by putting in the speed control system but not (also) the guard rail.

Unuma-san has a very interesting and useful website on FMEA practices, and uses the Amagasaki crash as one of many examples.  He promotes an FMEA method that evaluates countermeasures in absolute terms, whereas the conventional FMEA evaluates them in relative terms.  The problem with the relative evaluation method is that it can easily miss important failure modes that do not make an arbitrary priority cutoff.  Missing important failure modes often leads to unexpected incidents.
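As a rough illustration of how a priority cutoff can hide an important failure mode, here is a minimal Python sketch.  It assumes the conventional method ranks failure modes by the usual Risk Priority Number (severity × occurrence × detection, each on a 10-point scale); the failure modes, their ratings, and the cutoff of 120 are hypothetical and are not taken from Unuma-san’s material.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Conventional Risk Priority Number: product of the three ratings.
        return self.severity * self.occurrence * self.detection


def above_cutoff(modes, cutoff):
    """Relative method: rank by RPN and keep only modes at or above the cutoff."""
    ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
    return [m for m in ranked if m.rpn >= cutoff]


# Hypothetical ratings, for illustration only.
modes = [
    FailureMode("door seal wears early", 4, 7, 5),       # RPN 140
    FailureMode("display flickers", 3, 8, 6),            # RPN 144
    FailureMode("derailment on tight curve", 10, 2, 5),  # RPN 100
]

acted_on = above_cutoff(modes, cutoff=120)
missed = [m.name for m in modes if m not in acted_on]
print(missed)  # ['derailment on tight curve']
```

In this made-up example the most severe failure mode has the lowest score and falls below the cutoff, so it would never receive a countermeasure.  An absolute evaluation would instead judge each failure mode and its countermeasure on its own, regardless of how it ranks against the others.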

He also analyzes the conventional FMEA approach and teachings, and points out many problems seen in industry:

  • ineffective because of missing failure modes,
  • done too late in the design process, making it more difficult and less likely to implement countermeasures,
  • led by team members from other departments that are not responsible for the design, which both lowers the effectiveness of the analysis and can allow the designer to not be held fully accountable for the FMEA results,
  • doesn’t promote economical countermeasures, and
  • many of the common FMEA teachings contain flaws that promote the above problems.

Unuma-san shows that many FMEAs confuse failure mechanisms (the physical, chemical, thermal, electrical, biological, or other stresses leading to the failure mode) with the actual failure modes (ways a product or process can fail), leading to missing failure modes.  If a failure mode is missed, then no countermeasure may be identified and subsequently incorporated into the design.

He points out that the relative evaluation FMEA method promotes doing the FMEA on the entire design once enough of the design is done; then, when the FMEA has reached a certain level, all of the issues are prioritized and acted upon.  The problem with this approach is that FMEAs take a lot of time, and by the time the results are available, the recommended changes to the design can come too late to be easily implemented.  He instead recommends that the designers do the FMEA as they are doing the design, in a concurrent and “local” manner, while evaluating the countermeasures in an absolute manner against the individual failure mode.  This makes it easier to get countermeasures into the design of the product or process in the early stages.

When non-designers take too much of the FMEA responsibility and scope, the effectiveness of the FMEA is reduced and the results are available late in the design process.  The effectiveness is reduced because non-designers are unable to know all the key information in the heads of the designers, and the designers may feel less accountable for the FMEA quality.  Results are delayed because instead of countermeasures being considered at the time of the design decision, they are made available after the design decision has been made and it is then more difficult and less likely to have any countermeasure implemented.

Unuma-san’s method is simpler than many FMEAs, using a four-point scale to the third power (64 possible ratings), vs. many conventional approaches that use a 10-point scale to the third power (1000 possible ratings).  He promotes determining countermeasures per failure mode, evaluating the likely success of those countermeasures, and checking whether there is opportunity for optimization and lower costs from reducing overdesign.
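The difference in rating space is easy to verify.  The short Python sketch below simply enumerates every combination of three ratings on each scale; the function name is only illustrative, and it assumes three rating factors as implied by “to the third power”.

```python
from itertools import product


def rating_combinations(points: int, factors: int = 3) -> int:
    """Count the distinct rating tuples for `factors` scales of `points` levels each."""
    return sum(1 for _ in product(range(1, points + 1), repeat=factors))


four_point = rating_combinations(4)
ten_point = rating_combinations(10)
print(four_point, ten_point, ten_point / four_point)  # 64 1000 15.625
```

A team using the 10-point scale has to discriminate between roughly 15 times as many possible ratings, which is one reason the coarser scale can be quicker to apply.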

Unuma-san goes on to analyze the common teachings of FMEA by referring to many of the most common reference materials available in books, training material, websites, etc., and he shows many flaws, inconsistencies, interpretation issues, etc. that tend to exacerbate the above issues.  Much of the trouble with conventional FMEAs can be traced to poor teachings.

Unuma-san has consulted for a very long and impressive list of Japanese companies on FMEA in the transportation, health care, manufacturing, and consumer goods industries.

I’ve been a lead designer of multiple complex systems, and I’ve helped clients improve their product development processes, including FMEA.  The teachings of Unuma-san resonate strongly with me.  Too often I have seen poorly done FMEAs that miss critical failure modes, late FMEAs whose recommendations come too late to be useful, and FMEA study teams that don’t have enough participation by the design team.  The absolute evaluation method FMEA is a substantial improvement over the relative evaluation method, mostly because it evaluates the likely success of countermeasures.  I highly recommend his webpage on FMEAs, and it is linked here.  It is a little hard to read as the website translation to English isn’t the best, but it is worthwhile.

I think one of the reasons why FMEA teachings have many issues is that few FMEA teachers have been skilled design engineers; they are instead people who gravitate to process design.  The idea behind the FMEA is good and includes teaching early and effective analysis; unfortunately, much of the applied practice falls short.  A skilled design engineer naturally considers failure modes and tries to design them out, while simultaneously considering many other design tradeoffs, such as performance, function, economics, aesthetics, ergonomics, etc.  I think that since the conventional FMEA trainers developed the applied practice of the FMEA, they have continued to build upon the original process of the relative assessment method, and have struggled to develop effective practices that overcome the conventional process shortcomings.  In my experience many design engineers have found FMEA to be a good idea but too slow, too time-consuming, and not effective enough to really embrace it.

What I like about Unuma-san’s method is that it is practical, effective, time-efficient, and evaluates the likely success of countermeasures.  It can be very useful to have FMEA experts, trained in this method, who can help designers with training, facilitation, documentation, review, etc.

There are a few other improved FMEA methods available that are trying to address some of the effectiveness and lateness problems with conventional FMEAs, such as “FMMEA” (Failure Modes, Mechanisms, and Effects Analysis), and there are good teachings in these methods as well.  I have found Unuma-san’s method to be among the best, and it really resonates with me.

FMEA is one of the best methods to help avoid failures.  By making the method more effective, products, processes, projects, and infrastructure can have fewer problems and be more economical.  I highly recommend further study on this topic for engineers and managers delivering any system.

Craig Louie, P.Eng., Co-Founder, SysEne Consulting

There isn’t a dilemma in autonomous vehicles having to choose between harming their passengers or others.

A recent study in Science, “The Social Dilemma of Autonomous Vehicles”, has highlighted how self-driving cars need to have algorithms to decide on actions in extreme situations, even having to choose between protecting the passengers and protecting pedestrians.  The study results indicate that participants favor minimizing the overall number of public deaths even if it puts the vehicle in harm’s way.  But when asked about which cars they would actually buy, participants would choose a car that would protect them first.  This study highlights an apparent conflict between morality and autonomy.

I like the study, as it raises good questions, and it describes part of one of the many issues in autonomous vehicles.  I also like the many news articles that have been written on this topic based on the study, as it helps raise awareness of the complexity of the issue – that it is both social and technical.  At the same time, for the sake of being newsworthy, and controversial, most narratives I read on the topic frame the study and topic as a social dilemma.  Yet when examined through a technical perspective, we will have dramatically safer situations for both passengers and pedestrians with autonomous vehicles, and there isn’t any dilemma.

Traffic-related deaths exceed 1.25 million worldwide per year, with aging drivers, distracted driving, higher speeds, and the prevalence of substance abuse all contributing to keep the rate stubbornly high.  For every person killed in a motor-vehicle accident, 8 are hospitalized and 100 are treated and released from emergency rooms.  Autonomous driving, when implemented well, will easily reduce this by 90%, and perhaps by 99% when fully implemented.  The response time, sensing, spatial awareness, decision-making, and reliability of an autonomous vehicle will be better than most of us, except perhaps for highly trained and talented drivers, and definitely far better than too many of our driving population that cause most accidents (distracted, drunk, inexperienced, tired, reduced reflexes, etc.).  The autonomous capability allows us to have a safer response for both the passenger and pedestrian.
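To put those figures in perspective, here is a small back-of-the-envelope sketch in Python.  It uses only the numbers quoted above, and it makes two simplifying assumptions that are mine: that the 8:1 and 100:1 injury ratios can be applied to the worldwide death figure, and that injuries would fall in proportion to deaths.

```python
DEATHS_PER_YEAR = 1_250_000   # worldwide figure quoted in the text
HOSPITALIZED_PER_DEATH = 8    # ratio quoted in the text
TREATED_PER_DEATH = 100       # ratio quoted in the text


def casualties(deaths: int) -> dict:
    """Scale the injury ratios from a given number of deaths (assumed proportional)."""
    return {
        "deaths": deaths,
        "hospitalized": deaths * HOSPITALIZED_PER_DEATH,
        "treated_and_released": deaths * TREATED_PER_DEATH,
    }


def after_reduction(deaths: int, reduction_percent: int) -> dict:
    """Casualties remaining after a given percentage reduction in deaths."""
    remaining = deaths * (100 - reduction_percent) // 100
    return casualties(remaining)


print("today:      ", casualties(DEATHS_PER_YEAR))
print("90% fewer:  ", after_reduction(DEATHS_PER_YEAR, 90))
print("99% fewer:  ", after_reduction(DEATHS_PER_YEAR, 99))
```

Under these assumptions, a 90% reduction would mean about 1.1 million fewer deaths and about 9 million fewer hospitalizations per year.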

Consider that the autonomous vehicle can respond faster than most humans.  I have the lane departure warning system on my car, and it is much faster than me.  An autonomous vehicle will be able to brake faster, more optimally, and steer a better adaptive path that is more likely to minimize injury to both passenger and pedestrian.  Most drivers can’t brake as fast, or optimize the braking pressure, or optimize the steering adjustments during the emergency maneuver as well as a well-implemented autonomous vehicle.  The following picture shows a better braking and adaptive steering path with the best overall outcome for both passenger and pedestrian.  In the event of a collision, the overall speed, impact angle, etc. will be reduced.

blog1
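The effect of response time alone can be sketched with the standard stopping-distance formula: distance covered during the reaction time plus distance covered under constant braking.  The speed, the reaction times, and the deceleration below are hypothetical round numbers chosen for illustration, not measurements, and the sketch conservatively gives both drivers the same braking deceleration.

```python
def stopping_distance(speed_kmh: float, reaction_s: float, decel_ms2: float) -> float:
    """Distance travelled during the reaction time plus distance under constant braking."""
    v = speed_kmh / 3.6  # km/h to m/s
    return v * reaction_s + v ** 2 / (2 * decel_ms2)


# Hypothetical values, for illustration only.
SPEED_KMH = 50.0       # urban speed
DECEL = 7.0            # m/s^2, assumed equal for both cases
HUMAN_REACTION = 1.5   # seconds, assumed
AUTO_REACTION = 0.2    # seconds, assumed

human = stopping_distance(SPEED_KMH, HUMAN_REACTION, DECEL)
auto = stopping_distance(SPEED_KMH, AUTO_REACTION, DECEL)

print(f"human driver: {human:.1f} m")
print(f"automated:    {auto:.1f} m")
print(f"difference:   {human - auto:.1f} m")
```

With these assumed numbers the entire difference comes from the distance travelled before braking starts, which is why a faster response either avoids the collision or lowers the impact speed.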

With autonomous vehicles, there will still be accidents, and there will be cases where it will be determined that the autonomous vehicle did not make the best decision.  But the overall absolute level of safety will go up so dramatically, that the question will not be “isn’t this the wrong car to buy because it may decide wrongly in an extreme case?” but “isn’t this the right car to buy because it is overall so much safer for me and everyone else?”.  The moral path is to embrace autonomous vehicles, and work towards a proper system design and implementation in industry, government, and with consumers.

Staying ahead of upcoming restrictive drone regulations

How can drone developers avoid being shut down by an accident?

With the gradual increase in the use of commercial and consumer drones, we constantly hear about near-collisions and other incidents. Some of these incidents might be over-dramatized by the media or by whoever reported them, yet the overall risk is rising. In the UK, multiple close encounters of the drone kind have been reported recently, with some very close calls between small drones and large passenger jets. In a recent incident in LA, a helicopter was reportedly struck by what was probably a small drone, fracturing the windshield. Aircraft and their passengers are not the only ones at risk; people on the ground can also be injured. In another incident, an innocent hobbyist’s drone clipped a tree and dropped towards the ground, causing a serious eye injury to his friend’s son, a toddler. In the Netherlands, a small drone lost contact with its operator and flew away, gradually running out of battery, then eventually descended onto a busy highway. Although the incident did not result in any harm to people or damage to equipment, it did damage the local drone industry, which was shortly thereafter subjected to extensive flight restrictions. In Vancouver, there have been a number of cases of drones reported near the final approach to YVR and at least a couple of cases when commercial drones crashed and caused minor damage to parked vehicles. YVR airport has recently launched a drone awareness program.

Small drones can help reduce emissions (by replacing larger aircraft for similar operation), save lives through search & rescue operations, replace manned aircraft in dangerous operations, aid in “precision agriculture” that helps produce better yields, help inspect smoke stacks and wind-turbines without the need for downtime, and many other possible applications. When designed and used appropriately, the utility of drones is enormous, and this utility is the essence of why the commercial drone industry has been growing so quickly. Drones can do a lot to advance society.

One of the major barriers to the full public acceptance of drones is that they pose safety concerns to the public. Typical root causes of incidents to date include:

  • Irresponsible operation. Not all drone operators are as experienced, cautious, or responsible as the best commercial operators. The fact that consumer drones are relatively inexpensive makes it possible to start a small business using drones at relatively low (apparent) risk, or for a user to purchase a unit for hobby purposes, with little experience or knowledge. Remote control airplane hobbyists are generally responsible, knowledgeable, and have the required skill to safely fly their models; however, modern drones, particularly the multirotor type, are easy to operate anywhere, even for the uninformed or inexperienced operator. Owning and operating a drone safely requires knowledge, skill, and responsibility. One can begin by taking not-too-costly drone courses, either online or in class.
  • Technical problems associated with performance, reliability, or other shortfalls, for example:
    • Drone flyaway, where a drone suddenly flies away due to a broken communication link with the remote control unit (because of remote control failure or radio interference), a software glitch, operator error, design issues, etc.
    • Engine failure. A variety of solutions have been developed that allow a multirotor drone to recover from engine failure, but many of the products currently on the market do not have such recovery capability. With fixed wing drones, loss of an engine is generally easier to recover from.
    • Loss of control, for example because of interference confusing the unit’s compass, GPS, or inertial sensors, or due to incorrect orientation or calibration of the unit’s compass (“toilet bowl effect”).
    • Power failure due to battery failure or wiring issues.

Drone technology improvements are enabling more capable and lower cost drones, increasing the numbers of drones and the overall safety risk. Improvements include enabling technologies such as lithium polymer batteries, flight control and ground control station software, small-size (low-weight) cameras and other sensors, and small flight controllers based on solid state electronic components.   There are also numerous technologies and techniques that enhance the reliable operation of drones, such as monitoring battery hours/cycles/performance, setting parameters properly, performing calibration, etc. Flight control has become more affordable with software-enabled augmentation of lower-accuracy inertial sensors.

There is significant opportunity for improving the safety and reliability of small drones to the necessary level, by developing more robust system architectures, improving operational procedures and operator qualifications, technology innovation, improved regulation, and by following more rigorous techniques in the design and manufacturing of these products. Most drones do not fully employ the proven and robust approaches used in the manned aircraft industry in the areas of design, testing, regulation, maintenance, inspection, and other best practices. While some of these approaches are more rigorous and expensive than necessary, and can be relaxed somewhat to be suitable for the drone industry, there is high value in many of these approaches that can lead to both adequate safety and risk profiles, and low enough cost and weight.

Regulations are developing worldwide, and all in the direction of more restrictive or higher required capability and proof. In some countries, there is a move towards restricting the operation of drones near built-up areas and air-fields to “compliant systems” only, which is a challenge for all drone developers, and will likely leave some of the lower-cost manufacturers behind. Manufacturers who are proactive in economically developing reliable, safer, compliant products will be the most successful in the marketplace, as they will be able to operate where others cannot and will avoid reliability issues in the marketplace. Even consumer drones are complex products, and their safe operation is an even more complex challenge. One major incident caused by technical failure, or by a design that does not prevent user error, could have a damaging effect on the national or global drone market.

A systems approach to this complex issue that combines proactive strategy, careful risk analysis, economically innovative solutions, and best practices tailored to the drone industry will enable leading drone developers to get ahead on this issue. Such effort may seem costly, but its cost is far outweighed by the potential repercussions of failing to prevent an incident, even if it has been caused by “operator error”. If you think safety is too costly, try an accident!

A more useful Requirements Process Maturity Model

Maturity models are a useful diagnostic tool for identifying problem areas and areas for improvement.  They can be used by both the client and consultant to determine the current level of performance.  The target level of performance doesn’t need to be level 5 (highest capability) for everything, as that is likely too expensive or difficult to achieve, or not necessarily needed.

One of the best ways of improving technology and product development is for your organization to be good at developing and managing requirements.  About 70% of problems in technology and product development come from requirements and system interaction errors, and fixing these problems at the final acceptance test or in the customer’s hands costs about 100 times more than fixing them in the requirements development and management phases of the project.  Basically, build the right thing, build things right, and find problems early.
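A quick sketch shows how strongly those two numbers interact.  The 70% share and the 100-times multiplier come from the paragraph above; the project size of 100 problems, the unit cost, and the early catch rates are hypothetical, and the function name is only illustrative.

```python
def requirements_fix_cost(total_problems: int,
                          early_catch_rate: float,
                          early_cost: float = 1.0,
                          requirements_share: float = 0.70,
                          late_multiplier: float = 100.0) -> float:
    """Cost of fixing requirements-related problems, in units of one early fix.

    early_catch_rate is the fraction of those problems found during
    requirements development rather than at acceptance test or later.
    """
    req_problems = total_problems * requirements_share
    caught_early = req_problems * early_catch_rate
    caught_late = req_problems - caught_early
    return caught_early * early_cost + caught_late * early_cost * late_multiplier


# Hypothetical project with 100 problems in total.
for rate in (0.0, 0.5, 0.9, 1.0):
    print(f"early catch rate {rate:.0%}: cost {requirements_fix_cost(100, rate):.0f}")
```

Even catching half of the requirements problems early roughly halves the total cost in this simplified model, because the late fixes dominate everything else.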

For requirements development and management, there are a few maturity models published, but I have found that they are too specific to an industry (like for business analysts in the software industry), cover only certain aspects, or don’t cover integration, training, or culture well enough.  So I’ve developed the above model based on similar models from consulting houses, CMMI, Six Sigma, Model Based Systems Engineering, PLM, and my own background.  I think this can apply to all kinds of systems, whether hardware-oriented (manufacturing, construction), software-oriented, or combinations of both.

Requirements Process Maturity Model

Using this tool can then help structure the problem, ask the right questions and prioritize opportunities. Where does your organization stack up?

If you have comments or questions on the model, or have ideas for improvements, please contact me.

System Level Website Failures – Technical, Process, and Organization

In BC, we recently had a windstorm that knocked out power in the province for over 700,000 people, some for 4 days.  One of the most difficult parts of the outage was that the BC Hydro website that provides outage updates also went out at the same time. This made it very difficult for people without power to decide what to do, where to go, what to do with the food in the refrigerator, etc., and made for many unhappy customers.

Many critical websites are complex systems, and fail more often than desired.  A good example was the failed launch of Obamacare’s HealthCare.gov website, where there were serious technical problems at the rollout and it subsequently took about 6 months to fix the major issues. On launch day, as soon as the website hit about 2,000 simultaneous users, its performance became unusable, which was an issue since on the first day, 250,000 simultaneous users tried to get access to the website. There are many other problems with HealthCare.gov: the project had large budget overruns, with $1.7 billion spent, which is about 10 times more than the budget and what it should have cost. There are also lasting data and security problems with the website and internal database.

The majority of the root causes of the Healthcare.gov failure were systems-level failures in all three major dimensions of any complex system delivery: technical, process, and organization.

  1. Technical: The system design used an outdated 1990s database server model that doesn’t scale well with many concurrent users, as opposed to using a more typical e-commerce server model that can scale with users.
  2. Process: The system development process used a waterfall approach to build most of the website and then test it, vs. an agile approach where you test the important parts all throughout the development process.  Additionally, there was very little testing during the development.  They were even off by a factor of five on the concurrent user requirement.
  3. Organization: The organizational system of the Government and the Contractor was poor, with too many delays, last minute changes, poor subcontracting, poor reporting, and poor coordination.
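The scale of the capacity shortfall can be worked out from the figures quoted above.  In the Python sketch below, the 2,000 and 250,000 user figures come from the text; the planned requirement of 50,000 users is an assumption of mine, obtained by reading “off by a factor of five” as an underestimate of the launch-day demand.

```python
def capacity_shortfall(sustained_users: int, required_users: int) -> float:
    """How many times the required load exceeds the load the site could sustain."""
    return required_users / sustained_users


MEASURED_CAPACITY = 2_000                   # simultaneous users at which the site became unusable
ACTUAL_DEMAND = 250_000                     # simultaneous users on launch day
PLANNED_REQUIREMENT = ACTUAL_DEMAND // 5    # assumption: requirement underestimated fivefold

vs_demand = capacity_shortfall(MEASURED_CAPACITY, ACTUAL_DEMAND)
vs_plan = capacity_shortfall(MEASURED_CAPACITY, PLANNED_REQUIREMENT)

print(f"short of actual demand by {vs_demand:.0f}x")       # 125x
print(f"short of planned requirement by {vs_plan:.0f}x")   # 25x
```

Under that assumption the site fell about 25 times short of even its own requirement, a gap that load testing during development would be expected to reveal well before launch.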

BC Hydro is conducting a root cause investigation of their website failure.  Perhaps the root cause was a simple and isolated issue, but I am interested to hear, when the investigation is done, whether the failure had systems-level causes similar to those of the HealthCare.gov launch failure. For any complex problem with interrelated technical, process, and organizational aspects, the Systems Approach is the best way to develop a solution that satisfies the overall needs and meets the expected behaviours of the system.

Tailoring Product Development Processes

There is a wide spectrum of product development processes, from stage gate to spiral processes.  Stage gate processes are able to stage scope and investment decisions and are typically employed in capital intensive industries.  Spiral processes take advantage of many repetitions of the design-build-test cycle and are typically employed in software development.  There are many variants in between.

To best tailor the product development process for the organization, it is important to understand the:

  • business and strategy of the organization
  • architecture and complexity of the product
  • product/project schedule, budget, and requirements
  • risks and uncertainties
  • needed iterations in the process
  • capability and culture of the organization, including global aspects
  • customers, stakeholders, and suppliers
  • best practices

The resulting product development process is then “systems engineered” as it is an integration of systems and systems elements – technical, process, and people.

There are many useful methods to choose from during this design process, including:

  • Design Structure Matrix (Eppinger)
  • Agile Methods
  • Lean Methods
  • Model Based Engineering
  • Collaborative Supplier Integration
  • Risk-based Planning
  • Quality Approaches

A key aspect of product development is dealing with all the risks and uncertainties, which means iteration is inherent in the process.  There are both planned iterations and unplanned iterations (to fix it when it’s not right).  It is important to understand the linkages, interactions, and drivers behind how the iterations will happen.  From that understanding, iteration can be accelerated through information technology, coordination techniques, or decreased coupling.  After that, by prioritizing risks, planning the needed iterations, planning the integration and test activities, and scheduling reviews to control the process, the project risks can be addressed.
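One way to see where iterations come from is a small Design Structure Matrix.  The Python sketch below is a minimal illustration, not a full DSM partitioning algorithm: the five tasks and their dependencies are hypothetical, and the helper names are my own.  A mark above the diagonal means a task needs information from a task scheduled after it, which forces a loop back.

```python
# Tasks in planned execution order (hypothetical example).
tasks = ["requirements", "architecture", "detailed design", "prototype build", "test"]

# dsm[i][j] == 1 means task i needs information from task j.
dsm = [
    [0, 0, 0, 0, 0],  # requirements
    [1, 0, 0, 0, 1],  # architecture needs requirements, and test results (feedback)
    [1, 1, 0, 0, 0],  # detailed design
    [0, 0, 1, 0, 0],  # prototype build
    [1, 0, 0, 1, 0],  # test
]


def feedback_marks(matrix):
    """Marks above the diagonal: a task depends on one scheduled after it."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if matrix[i][j]]


def iteration_block(matrix):
    """Rough span of tasks caught in iteration, from the earliest dependent
    task to the latest task feeding information back. None if no feedback."""
    marks = feedback_marks(matrix)
    if not marks:
        return None
    return (min(i for i, _ in marks), max(j for _, j in marks))


for i, j in feedback_marks(dsm):
    print(f"{tasks[i]} waits on {tasks[j]}")

block = iteration_block(dsm)
if block is not None:
    print("iteration spans:", tasks[block[0]:block[1] + 1])
```

In this made-up example a single feedback mark pulls four of the five tasks into one iteration loop.  Removing or shortening that dependency, for instance with earlier architecture-level tests, is what decreased coupling means in practice.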

The process must also be tailored to the organization, specific people, and key stakeholders.  This is probably the most difficult part, as it is all about dealing with people, managing change, and shifting cultures.  It is important to pick and choose the most important methods, implement them, and sustain them, in a practical way. Too many processes fail because they are not used, or are unwieldy, inflexible, not fully coherent, too conservative, too bureaucratic, too resource-intensive, or only partly implemented.  Beyond process definition, there is training, coaching, fine-tuning, and ensuring the team sees that adopting the change is in their self-interest, and really “owns” any new processes.

While improving the process overall is a complex and difficult initiative, having a competitive Product Development process is key to quality products, low costs, speed to market, satisfied customers, and good business.