David Haber

Building mission-critical AI with Lakera. Based in Zurich, Switzerland.

Seeing the forest for the trees: A more disciplined approach for AI

Data-driven decision-making systems in production today exhibit serious conceptual flaws. Deployed as part of high-stakes applications, they not only risk compromising individual safety and a company’s financial performance but also threaten economic and political stability at scale.

To achieve reliable performance of AI in the real world, we need to shift the focus from the development of ML models toward a more systematic discipline for creating fail-safe systems and life cycles.

Cars go through rigorous stress testing as part of the normal development workflow. Fail-safe engineering is still underdeveloped in AI compared to more conventional technologies used in transportation, healthcare, finance, or other industries. - Photo: Mercedes-Benz

While many still argue that AI’s impact has been modest, the importance of addressing this now becomes clear when we look at the air balls that we have already managed to throw.

In the midst of recent wildfires in the US, one of Google’s algorithms recommended outdoor exercise to people [1], exposing their respiratory systems to harmful amounts of smoke and putting their health at severe risk. While the company might have acted in good faith (exercise is good after all!), they didn’t take into account that a wildfire event would invalidate both their data and an otherwise correct recommendation. The real world is always full of surprises.

In 2019, the Wall Street Journal reported that a group of researchers had found racial bias in a hospital algorithm: black patients were less likely than white patients to receive the medical help they needed [2]. Among the reasons given were that the algorithm ranked patients according to healthcare costs, and the fact that “health-care spending for black patients was less than for white patients with similar medical conditions”. With better transparency, this issue would have been found more quickly, likely before deployment.

During the current pandemic, news headlines have claimed that AI imaging systems can detect COVID-19 from chest x-rays. While this is all early work, a team at the University of Washington showed that the methodology behind those systems was terribly flawed. The models, which had been reported to have astounding accuracies in the first place, seem to have learned “spurious shortcuts” and imaging artifacts rather than medical pathology [3]. As one of the consequences, physically moving the patient up in the x-ray “increased the model’s predicted odds that the patient has COVID-19”. It’s a beautiful, and alarming, illustration of the gap between news headlines, early prototypes, and the complexities of operating these systems in hospital rooms.
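To make this failure mode concrete, here is a minimal, self-contained sketch of how one might probe a classifier for exactly this kind of positional shortcut. Everything in it is a hypothetical stand-in (the model, the image, the thresholds); it is not the system from the study, only an illustration of checking whether a clinically irrelevant transform changes the prediction.

```python
import numpy as np

def predict_covid_probability(image: np.ndarray) -> float:
    # Hypothetical stand-in "model": it keys on the vertical position of
    # the bright content in the frame - exactly the kind of positional
    # artifact a sound classifier should NOT rely on.
    rows = np.arange(image.shape[0])
    row_mass = image.sum(axis=1) + 1e-9
    centroid = (rows * row_mass).sum() / row_mass.sum()   # 0 = top of image
    score = 1.0 - centroid / image.shape[0]               # higher if content sits higher up
    return float(1.0 / (1.0 + np.exp(-12.0 * (score - 0.5))))

def shift_up(image: np.ndarray, pixels: int) -> np.ndarray:
    # Move the image content upwards, padding the bottom with zeros.
    if pixels == 0:
        return image.copy()
    shifted = np.zeros_like(image)
    shifted[:-pixels] = image[pixels:]
    return shifted

# Stand-in "chest x-ray": a bright region in the middle of the frame.
xray = np.zeros((224, 224))
xray[100:160, 60:170] = 1.0

for shift in (0, 10, 30, 60):
    p = predict_covid_probability(shift_up(xray, shift))
    print(f"shift up by {shift:>2} px -> predicted P(COVID-19) = {p:.2f}")
# A robust model's output would stay essentially flat across these shifts;
# a large change indicates the model has learned a positional shortcut.
```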

Clinical decision-making systems recommend procedures and drug treatments to doctors and patients every day. The cost of poor decision-making often only becomes apparent when we do the math at scale. While a single false prediction may pose a small risk at the individual level, we forget that such predictions needlessly harm many human beings every day once decision-making systems are deployed and used at scale - across offices, hospitals, and countries.
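A quick back-of-the-envelope calculation shows what “doing the math at scale” can look like. All numbers below are purely illustrative assumptions, not figures from any real deployment.

```python
# Back-of-the-envelope "math at scale": all numbers are hypothetical
# assumptions chosen only to illustrate how a small per-decision error
# rate adds up once a system is deployed widely.
decisions_per_day_per_site = 200   # recommendations per hospital per day (assumed)
number_of_sites = 500              # hospitals running the same system (assumed)
harmful_error_rate = 0.005         # fraction of decisions that are harmfully wrong (assumed)

harmful_per_day = decisions_per_day_per_site * number_of_sites * harmful_error_rate
print(f"{harmful_per_day:,.0f} harmful decisions per day")         # 500
print(f"{harmful_per_day * 365:,.0f} harmful decisions per year")  # 182,500
```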

At the same time, academic AI papers claim to outperform humans in a fascinating race to beat state-of-the-art performances. Yet it is clear to all that our machine learning (ML) models, which work well in a Jupyter notebook [4], perform less impressively in the real world. A set of metrics, ground truth, and predictions is all that is needed to arrive at the “my system works well on the test set” conclusion. While there is nothing wrong with that, the problem is that this is neither a complete nor a sound assessment, and yet all too often we erroneously extrapolate that performance to complex, dynamic, real-world environments.
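As a minimal sketch of how little a held-out test score can say about deployment, consider the following toy example (synthetic data, scikit-learn assumed; the spurious-feature setup is purely illustrative). The offline metric looks excellent, but the model has latched onto a shortcut that disappears once conditions change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def make_data(n, spurious_correlation):
    # The label depends on a weak "true" signal; a second feature is a
    # spurious shortcut that agrees with the label only as often as
    # `spurious_correlation` says it does.
    y = rng.integers(0, 2, size=n)
    true_signal = y + rng.normal(scale=2.0, size=n)        # weakly informative
    agrees = rng.random(n) < spurious_correlation
    spurious = np.where(agrees, y, 1 - y) + rng.normal(scale=0.1, size=n)
    return np.column_stack([true_signal, spurious]), y

# Training and test data share the shortcut; "real-world" data does not.
X_train, y_train = make_data(5_000, spurious_correlation=0.95)
X_test,  y_test  = make_data(1_000, spurious_correlation=0.95)
X_world, y_world = make_data(1_000, spurious_correlation=0.50)

model = LogisticRegression().fit(X_train, y_train)
print("test-set accuracy:    ", accuracy_score(y_test,  model.predict(X_test)))   # typically high
print("'real-world' accuracy:", accuracy_score(y_world, model.predict(X_world)))  # typically much lower
```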

ML code is only a small part of the solution.
Andrew Ng, Co-founder of Coursera and Adjunct Professor at Stanford University [5]

The truth is that the development of complete AI systems is extremely challenging, and the organizational effort required to drive such projects to completion is orders of magnitude higher than developing proof-of-concept or prototype models, especially as the challenges of real-world AI transcend data science and code. Complete systems need to be analyzed in the context of the intended application, how they interact with and are influenced by their environment, and any hardware that is used. Failure situations need to be examined through failure mode and effects analyses [6], and the findings need to shape the design and the procedures used during development and operation. Finally, these systems will be used by humans with different backgrounds, skills, and their own individual ways of interacting with technology. As a consequence, AI development requires great care in the design of human-computer interfaces and the consideration of human factors [7]. It is a cross-disciplinary effort that needs to involve multiple stakeholders and domain expertise to truly understand the end user we want to serve.
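To give a flavor of what such an analysis can look like for an ML-based system, here is a minimal sketch of a failure mode and effects analysis with the classic risk priority number (severity × occurrence × detection). The failure modes and ratings below are hypothetical; a real FMEA is carried out by a cross-disciplinary team for a concrete system and its environment.

```python
# Minimal sketch of a failure mode and effects analysis (FMEA) for an
# ML-based decision-support system. The failure modes and 1-10 ratings
# below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    effect: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (likely undetected)

    @property
    def risk_priority_number(self) -> int:
        # Classic FMEA prioritization: severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection

failure_modes = [
    FailureMode("Input data drifts away from the training distribution",
                "Silent degradation of recommendations", severity=7, occurrence=6, detection=8),
    FailureMode("Imaging sensor delivers corrupted frames",
                "Model produces confident nonsense", severity=8, occurrence=3, detection=4),
    FailureMode("Label errors introduced in the training pipeline",
                "Systematic bias against a patient subgroup", severity=9, occurrence=4, detection=7),
]

# Rank failure modes so the team addresses the riskiest ones first.
for fm in sorted(failure_modes, key=lambda f: f.risk_priority_number, reverse=True):
    print(f"RPN {fm.risk_priority_number:>3}  {fm.description}")
```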

What is the result of this? Businesses around the world struggle to realize a return on their AI investments - too many projects and even entire companies are stuck in the “pilot trap”.

Having gone through the development of complex, physical AI systems with strict safety guarantees ourselves, we understand the challenges and the need to rethink our strategies to develop real-world AI systems. This is particularly true in the context of evolving regulatory standards which will create additional pressure for AI developers and companies in all major industries over the next few years. But it is also relevant beyond regulations - for anyone looking to realize the impact that AI has long been promising.

So, what does it take? We need to better understand how to design, develop, and operate AI from a systems perspective. Reasoning about systems rather than models opens the toolboxes that safety and systems engineers have been using to build robust systems for decades. The tools exist. What’s missing is a toolbox of our own.

The building blocks are in place, the principles for putting these blocks together are not, and so the blocks are currently being put together in ad-hoc ways… What we’re missing is an engineering discipline with principles of analysis and design.
Michael Jordan, Professor at the University of California, Berkeley [8]

A more disciplined approach to AI development would not only create better products but also enable completely new applications. It would help us achieve the small failure probabilities that healthcare, transportation, finance, and other industries require.

Most importantly, it would help turn the term “AI” from an intellectual wildcard into something more fundamental, something that we can understand and reason about. Only then can we have meaningful discussions around safety, ethics, and regulations. Only then can we nudge AI projects out of the “pilot trap” and deploy them safely and robustly in our complex world.

What does this all mean for startups and corporations? How can they establish development life cycles, create AI products and structure their organizations? What does it mean for you? Bear with us! We will present some thoughts around these questions in future articles.

Thanks to Rui, Moritz, Matthias, Mateo, Anna, and Andy for reading drafts of this article, providing feedback, and discussing with me how to build better AI.


  1. How bad is Sacramento’s air, exactly? Google results appear at odds with reality, some say (The Sacramento Bee, 2018)

  2. Researchers Find Racial Bias in Hospital Algorithm (The Wall Street Journal, 2019)

  3. DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. “AI for radiographic COVID-19 detection selects shortcuts over signal.” medRxiv (2020).

  4. A popular programming environment used to run ML experiments, mostly at smaller scale.

  5. Ng, Andrew. “Bridging AI’s Proof-of-Concept to Production Gap” (2020).

  6. Wikipedia article: Failure mode and effects analysis.

  7. A term used by the European Union Aviation Safety Agency (EASA) to describe “anything that affects human performance”.

  8. Jordan, Michael I. “Artificial intelligence—the revolution hasn’t happened yet.” Harvard Data Science Review 1.1 (2019).