The Importance of Engineering Rigor, Discipline, and Testing in High-Consequence Applications

By Andrew Park | 2024-07-26

In the world of software development, the drive for rapid deployment and continuous updates often overshadows the critical need for rigorous engineering practices, discipline, and comprehensive testing. This imbalance can have disastrous consequences, especially for applications where failure can lead to significant disruptions or damage. The recent incident involving CrowdStrike serves as a stark case study illustrating the potential pitfalls of prioritizing speed over stability.

DevOps Research and Assessment (DORA) Philosophy and Its Risks

Since the publication of The Phoenix Project in 2013, DevOps has transformed the software development landscape by emphasizing rapid deployment and frequent updates for increased speed and agility in incremental software releases. In 2018, Accelerate further propelled the movement by popularizing DevOps Research and Assessment (DORA) metrics. These metrics have since become benchmarks for many organizations. The four DORA metrics are:

Deployment Frequency (higher is better),
Lead Time for Changes (lower is better),
Mean Time to Recovery (lower is better), and
Change Failure Rate (lower is better).
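
To make the four definitions concrete, here is a minimal sketch of how they could be computed from a log of deployments. The `Deployment` record, its field names, and the sample data are assumptions made for illustration; they are not taken from Accelerate or from any particular DORA tool, and real implementations differ in how they define a "failure" and a "recovery".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    # Hypothetical record layout, chosen for this example only.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def deployment_frequency(deploys: List[Deployment], days: int) -> float:
    """Deployments per day over the observation window (higher is better)."""
    return len(deploys) / days


def lead_time_for_changes(deploys: List[Deployment]) -> timedelta:
    """Mean time from commit to production (lower is better)."""
    seconds = mean((d.deployed_at - d.committed_at).total_seconds() for d in deploys)
    return timedelta(seconds=seconds)


def mean_time_to_recovery(deploys: List[Deployment]) -> Optional[timedelta]:
    """Mean time from a failed deployment to restored service (lower is better)."""
    failures = [d for d in deploys if d.failed and d.restored_at is not None]
    if not failures:
        return None
    seconds = mean((d.restored_at - d.deployed_at).total_seconds() for d in failures)
    return timedelta(seconds=seconds)


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Share of deployments that caused a failure (lower is better)."""
    return sum(d.failed for d in deploys) / len(deploys)


# Made-up data: four deployments over a two-day window, one of which failed.
t0 = datetime(2024, 7, 1, 9, 0)
day = timedelta(days=1)
hour = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0, t0 + 4 * hour, failed=True, restored_at=t0 + 5 * hour),
    Deployment(t0 + day, t0 + day + 3 * hour),
    Deployment(t0 + day, t0 + day + 3 * hour),
]

print(deployment_frequency(sample, days=2))
print(lead_time_for_changes(sample))
print(mean_time_to_recovery(sample))
print(change_failure_rate(sample))
```

Note that none of the four functions takes the severity of a failure as an input: a failed deployment is a single boolean regardless of how much damage it caused.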

While optimizing DORA metrics aims to drive faster incremental innovation in response to market demands, it introduces significant risks for applications with high consequences of failure. The focus on DORA metrics often incentivizes development teams to chase better numbers at the expense of rigorous code reviews and thorough testing, both of which are seen as obstacles to optimizing the metrics. Over the past decade, this emphasis has led to a sharp decline in both practices. The pressure to deploy frequently and reduce lead time also discourages software developers from thinking deeply about all the ways critical software can fail and from taking the measures needed to avoid those failures, which ultimately compromises the reliability and robustness of the software.

The CrowdStrike Incident: A Case Study

On July 19, 2024, CrowdStrike released a sensor configuration update that caused a system-wide crash on 8.5 million Windows devices running Falcon sensor version 7.11 and above. The update, intended to target newly observed malicious named pipes, inadvertently triggered a logic error, resulting in operating system crashes and blue screen errors that led to a global IT outage.

The effects of this outage were profound, impacting several critical industries:

Aviation: Over 5,000 flights were canceled globally, disrupting travel plans for thousands of passengers and causing significant financial losses for airlines.
Healthcare: Hospitals and medical facilities experienced interruptions in their systems, delaying patient care and leading to critical safety concerns.
Finance: Banking systems faced operational issues, resulting in delays and errors in financial transactions.
Telecommunications: Service providers experienced disruptions affecting emergency response efforts and communications for businesses and individuals.
Energy: Energy grid management systems faced disruptions affecting millions of people and critical services dependent on a consistent power supply.

According to a recent report by insurer Parametrix, the massive outage is projected to cost Fortune 500 companies more than $5.4 billion, with banking and healthcare companies taking the brunt of the hit, along with major airlines.

CrowdStrike detected and acknowledged the error within two hours, attributing the outage to a bug in its test software. By then, however, the damage could not be undone remotely: the logic error caused system crashes that could only be fixed through manual intervention on each affected device. This outage exposed the fragile nature of modern technological infrastructures, where a single flawed update can disrupt operations worldwide.

This incident underscores the fundamental mismatch between DORA metrics and applications with high consequences of failure, and it highlights the need for greater engineering design rigor, discipline, and testing in critical sectors.

CrowdStrike’s disastrous software update would have improved the first three DORA metrics listed above. It did count as a change failure, but because it registered as only one change failure that day, CrowdStrike’s DORA statistics for July 19, 2024, likely looked much the same as on any other day.
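
The arithmetic behind this point can be sketched in a few lines. The change counts below are invented for illustration and say nothing about CrowdStrike's actual release volume; only the 8.5 million device figure comes from the incident described above. The `devices_affected` field is a hypothetical addition that the standard Change Failure Rate does not use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Change:
    failed: bool
    devices_affected: int = 0  # hypothetical field; ignored by the rate below


def change_failure_rate(changes: List[Change]) -> float:
    """Failed changes divided by all changes; severity plays no part."""
    return sum(c.failed for c in changes) / len(changes)


def total_devices_affected(changes: List[Change]) -> int:
    return sum(c.devices_affected for c in changes)


# Invented baseline: 200 changes, 4 of which failed on 10 devices each.
ordinary = [Change(failed=False)] * 196 + [Change(failed=True, devices_affected=10)] * 4

# Add one more failed change: a minor one in the first case,
# an outage on the scale described above in the second.
minor_case = ordinary + [Change(failed=True, devices_affected=10)]
outage_case = ordinary + [Change(failed=True, devices_affected=8_500_000)]

for label, changes in (("baseline", ordinary), ("minor", minor_case), ("outage", outage_case)):
    rate = change_failure_rate(changes)
    print(f"{label}: rate={rate:.2%}, devices affected={total_devices_affected(changes):,}")
```

Under these assumed numbers, the minor case and the outage case report exactly the same Change Failure Rate (5 failures out of 201 changes), even though the number of devices affected differs by several orders of magnitude.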

Software applications where the consequences of failure are severe demand greater discipline, design rigor, and thorough testing. For high-consequence software, the primary goal should be ensuring that each update is reliable and risk-free, rather than over-prioritizing speed and agility. Unfortunately, rigorous design is woefully underemphasized in most DevOps teams today. Despite the widespread adoption of automated testing in DevOps CI/CD systems, the weak link remains human developers, who frequently lack the skills, discipline, and time to create rock-solid, comprehensive unit test frameworks. In addition, DevOps CI/CD automated test frameworks typically have sparse coverage at higher levels because DevOps engineers have been conditioned by the Agile Testing Pyramid philosophy, which emphasizes unit and component tests while de-emphasizing integration and end-to-end tests.

The CrowdStrike software team placed too much confidence in their automated CI/CD unit tests, while also operating with critical gaps in their integration testing and end-to-end testing. CrowdStrike’s Preliminary Post Incident Review of this event reveals:

3 Lapses in CrowdStrike’s Integration Testing

Inadequate Validation: The engineering team failed to detect a bug in the Content Validator, allowing problematic content to pass through.
Insufficient Stress Testing for Instances: The engineering team did not perform thorough stress testing on individual Template Instances.
Lack of Continuous Integration Monitoring: The engineering team did not implement adequate continuous integration monitoring to detect issues in real-time.

3 Lapses in CrowdStrike’s End-to-End Testing

Failure in Scenario Testing: The engineering team did not conduct complete real-world simulation testing.
Absence of Staggered Deployment: The engineering team did not employ a phased rollout, leading to widespread crashes.
Insufficient Rollback Mechanism: The engineering team lacked effective rollback mechanisms, delaying the response to the issue.
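
As an illustration of what a staggered deployment with a rollback path can look like, here is a minimal sketch. It is not CrowdStrike's process: the ring names, the `deploy`, `healthy`, and `rollback` callbacks, and the simulated failure are all hypothetical stand-ins for whatever a real deployment system provides.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy to one ring at a time, smallest audience first.

    After each ring, run a health check. On the first unhealthy ring,
    roll back every ring updated so far (most recent first) and stop.
    Returns the rings left running the update: all of them on success,
    none of them if the rollout was aborted.
    """
    updated: List[str] = []
    for ring in rings:
        deploy(ring)
        updated.append(ring)
        if not healthy(ring):
            for done in reversed(updated):
                rollback(done)
            return []
    return updated


# Simulated fleet: the update misbehaves once it reaches the canary ring.
rings = ["internal", "canary", "early-adopters", "general"]
unhealthy_rings = {"canary"}  # assumption for the demo
events: List[str] = []

result = staged_rollout(
    rings,
    deploy=lambda ring: events.append(f"deploy {ring}"),
    healthy=lambda ring: ring not in unhealthy_rings,
    rollback=lambda ring: events.append(f"rollback {ring}"),
)

print(result)
print(events)
```

In this simulation the update never reaches the "early-adopters" or "general" rings. One caveat: an automated rollback only helps if the affected devices can still receive it. When an update crashes the operating system, as in this incident, recovery requires manual intervention, so the health gate between rings is what limits the damage.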

CrowdStrike management has committed to addressing all these integration and end-to-end testing gaps to prevent future issues, but their DevOps engineering teams will have to shift focus away from solely chasing DORA metrics. Many modern software organizations prioritize speed and agility at the expense of managing risk, but high-consequence software requires a balanced testing philosophy. It is clear that DORA statistics are not suitable targets for optimizing high-consequence software applications. Companies working on such critical applications should enhance engineering discipline by imposing rigorous code reviews by qualified staff and strengthening their integration and end-to-end testing to prevent similar catastrophes.

The Need for Higher Standards in High-Consequence Software

Software applications with high consequences of failure require a higher level of discipline and rigor, with a heavy focus on robustness and reliability rather than rapid delivery. Below is a list of software applications, ranked from highest to lowest consequences of failure. For applications in the upper half of this list, Agile and DevOps methodologies are not sufficient on their own, because they inherently prioritize speed over risk mitigation. Greater engineering rigor, discipline, and testing than these methodologies typically provide are paramount.

Nuclear Power Plant Control Systems: Failures can cause radiation leaks, environmental disasters, and significant human casualties.
Military and Defense Systems: Failures can compromise national security and lead to unintended military actions.
Aviation Software: Failures can lead to catastrophic accidents and loss of life.
Medical Devices and Healthcare Software: Failures can lead to incorrect diagnoses, treatment errors, and even patient death.
Space Exploration Software: Failures can result in mission failures, loss of expensive equipment, and scientific data loss.
Energy Grid Management: Failures can cause blackouts, affecting millions of people and critical services.
Industrial Control Systems: Failures can lead to production shutdowns, environmental harm, and safety hazards.
Banking Systems: Failures can result in significant financial loss, market instability, and legal issues.
Telecommunications Infrastructure: Failures can disrupt communication and hinder emergency response efforts.
Automotive Systems: Failures can lead to serious accidents and fatalities.
Patch & Vulnerability Management: Failures can leave systems unpatched and exposed to known vulnerabilities, leading to security breaches.
Data Loss Prevention (DLP): Failures can result in unauthorized data exfiltration and breaches.
Identity and Access Management (IAM) & Multi-Factor Authentication (MFA): Failures can result in unauthorized access to critical systems and data.
Network, Web, Email Security: Failures can lead to widespread security breaches and malware infections.
Security Information and Event Management (SIEM): Failures can lead to undetected security incidents and delayed responses.
VPN and Encryption Software: Failures can lead to insecure connections, data interception, and exposure of sensitive data.
Endpoint Security: Failures can lead to compromised devices and data breaches.
IDS/IPS & Firewalls: Failures can result in undetected or unprevented cyber attacks, unauthorized access, and security breaches.
Antivirus Software: Failures can leave systems vulnerable to malware, causing data loss and operational disruptions.
E-commerce Platforms: Failures can lead to significant financial losses, data breaches, and loss of consumer trust.
Logistics and Supply Chain Management Software: Failures can disrupt the flow of goods and services, leading to delays, increased costs, and customer dissatisfaction.
Payment Processing Systems: Failures can result in payment delays, financial discrepancies, and customer dissatisfaction.
Reservation Systems for Hospitality: Failures can cause significant inconvenience and financial loss for both businesses and customers.
Inventory Management Systems: Failures can lead to supply chain disruptions, overstocking, or stockouts, affecting business operations.
Workforce Management Systems: Failures can cause payroll errors, scheduling issues, and workforce management challenges.
Customer Relationship Management (CRM) Systems: Failures can impact sales, customer service, and relationship management, leading to potential revenue loss and customer dissatisfaction.
Content Management Systems (CMS): Failures can disrupt business operations and online presence, affecting brand reputation and revenue.
Project Management Tools: Failures can disrupt project timelines, collaboration, and resource allocation, impacting project delivery and efficiency.
Smart Home Control Systems: Failures can cause inconvenience and potential security issues.
Booking Apps: Failures can lead to inconvenience and logistical issues but are usually not critical.
Shopping Apps: Failures can cause customer inconvenience and potential revenue loss but are not typically life-threatening.
E-learning: Failures can disrupt learning activities but generally do not have severe impacts.
Navigation Apps: Failures can lead to travel delays and route planning issues.
Food Delivery Apps: Failures can cause inconvenience and logistical issues with food orders.
Social Media: Failures can disrupt user interactions and content sharing but typically do not have severe impacts.
Weather Apps: Failures can cause inconvenience but usually do not have severe consequences unless critical weather information is missed.
Casual Games: Failures cause minor user inconvenience.
Music Streaming: Failures usually lead to user inconvenience rather than serious consequences.
Note-Taking: Failures can lead to minor productivity losses.
Fitness Tracking: Failures result in minor disruptions to personal routines.
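
One way to act on a ranking like this is to tie the release gates a change must pass to the consequence tier of the application. The sketch below is only an illustration of that idea: the four tiers, the gate names, and the assignment of gates to tiers are assumptions, not an established standard.

```python
from enum import IntEnum
from typing import Dict, FrozenSet, Iterable, Set


class Tier(IntEnum):
    # Hypothetical consequence tiers, from least to most severe.
    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4


# Gates ADDED at each tier; a tier also inherits every gate below it.
# Gate names are illustrative placeholders.
ADDED_GATES: Dict[Tier, FrozenSet[str]] = {
    Tier.LOW: frozenset({"unit_tests"}),
    Tier.MODERATE: frozenset({"integration_tests", "code_review"}),
    Tier.HIGH: frozenset({"end_to_end_tests", "staged_rollout"}),
    Tier.CRITICAL: frozenset({"independent_review", "rollback_rehearsal"}),
}


def required_gates(tier: Tier) -> Set[str]:
    """All gates required at this tier, including those of lower tiers."""
    gates: Set[str] = set()
    for t in Tier:
        if t <= tier:
            gates |= ADDED_GATES[t]
    return gates


def missing_gates(tier: Tier, passed: Iterable[str]) -> Set[str]:
    """Gates still outstanding before a release at this tier."""
    return required_gates(tier) - set(passed)


def may_release(tier: Tier, passed: Iterable[str]) -> bool:
    return not missing_gates(tier, passed)


passed = {"unit_tests", "integration_tests", "code_review"}
print(may_release(Tier.MODERATE, passed))
print(may_release(Tier.HIGH, passed))
print(sorted(missing_gates(Tier.HIGH, passed)))
```

With these assumed tiers, the same set of passed checks is enough to release a moderate-consequence application but blocks a high-consequence one until end-to-end testing and a staged rollout are in place.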

Conclusion

For applications where failure has severe consequences, Agile and DevOps methodologies, with their focus on rapid delivery over risk mitigation, may not be sufficient. DORA statistics are not suitable targets for optimizing high-consequence software applications, and the Agile Testing Pyramid is a similar mismatch because it places too little emphasis on integration and end-to-end testing. Organizations need to move beyond these philosophies and adopt a customized approach in which the amount of engineering rigor, discipline, and testing corresponds to the severity of the consequences of failure. Ensuring that each update is reliable and risk-free must be the primary goal. Companies developing high-consequence applications should re-evaluate their Agile and DevOps practices and incorporate enough rigor, discipline, and testing to prevent catastrophic failures.
