Engineering quality checks

April 27, 2026 · View on GitHub

This is part of a broader quality framework and forms a key part of the metrics section.

Summary

Quality checks are at the heart of good engineering, and are essential for rapid and safe delivery of software changes. This page provides an index of the various quality checks described within our principles, patterns and practices.

The checks are classified here based on the concerns they help to address:

  • Functionality
  • Security
  • Resilience
  • Maintainability

Usage

All applicable quality checks should be applied. Not all checks are universal, i.e. applicable in all contexts. For example, accessibility testing is only applicable to applications with a user interface.

The majority of these checks should be automated via continuous integration / continuous deployment: the optimal sequencing of these checks within CI/CD pipelines will depend on the project's branching strategy, deployment strategy, etc.

All of these checks are important, even where their purpose overlaps with other checks in the list. For example, comprehensive functional testing could be achieved without unit testing, relying only on the other functional test types on this list, but this would result in a very long-running and inefficient test suite, precluding fast feedback and impeding the rapid and safe delivery of software changes. For further details please see test practices, and especially the test pyramid.

Although we separate resilience checks into their notional components, when it comes to implementation these may be joined up where useful. For example, an automated test suite may test the application for peak load and then drop down to sustained load in order to conduct a soak test.

RAG scale

We rate our applications against each of these checks as follows:

  • Green = the quality check is applied frequently and consistently (in practice this typically means automated via continuous integration / continuous deployment), the output of the check is a quality gate (as opposed to just a warning / for information), and the tolerances for that quality gate (e.g. code coverage %) are agreed and understood.
  • Amber = the quality check is applied, but not all conditions for green are met - for example the check generates warnings that may or may not be acted on, or is executed on an ad-hoc basis and cannot offer a consistent quality gate guarantee.
  • Red = the quality check is not applied.
  • N/A = the quality check is not applicable.
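The scale above can be sketched as a small function. The `CheckStatus` fields below are a hypothetical summary of the conditions for green, not part of any endorsed tooling:

```python
from dataclasses import dataclass

@dataclass
class CheckStatus:
    applicable: bool         # does the check apply in this context?
    applied: bool            # is the check run at all?
    automated: bool          # run frequently and consistently, e.g. via CI/CD
    gating: bool             # does failure block the build (vs. just warn)?
    tolerances_agreed: bool  # e.g. an agreed coverage % for the gate

def rag_rating(check: CheckStatus) -> str:
    """Map a quality check's status onto the RAG scale described above."""
    if not check.applicable:
        return "N/A"
    if not check.applied:
        return "Red"
    if check.automated and check.gating and check.tolerances_agreed:
        return "Green"
    return "Amber"
```

For example, a check that runs in CI but only emits warnings (`gating=False`) rates as Amber.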

Tracking progress

We recommend tracking progress on an Engineering Quality dashboard, for example:

Example Dashboard

Details

Each quality check below is listed with its classification, applicability, what it means, why we care, tolerances for green, endorsed tools / configuration, and further details (where available).
Accessibility test
  • Classification: Other
  • Applicability: Universal
  • What it means: The practice of making applications usable by as many people as possible.
  • Why we care: It is a regulatory requirement that our applications are accessible to as many people as possible. Catching accessibility failures up front is essential to maximise the accessibility of our applications.
API / contract test
  • Classification: Functionality
  • Applicability: Contextual
  • What it means: Check whether the API interface adheres to the agreed contract.
  • Why we care: Any API interface is an integration point with another component or software system. Extra care has to be taken to ensure the compatibility and stability of that integration are maintained, so that we don't break applications that depend on our APIs.
  • Tolerances for green: Builds fail if any tests fail.
  • Endorsed tools / configuration: Postman
  • Further details: Automate Your API Tests with Postman
Capacity test
  • Classification: Resilience
  • Applicability: Contextual
  • What it means: Identify the application's breaking point in terms of an increasingly heavy load. Degradation may manifest itself as throughput bottlenecks, increasing response times, or rising error rates.
  • Why we care: Without this test, we don't know how much load the application can handle before the application breaks or degrades.
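As an illustrative sketch of the ramp-until-failure idea: step through increasing load levels and report the first level at which the error rate breaches a tolerance. The `fake_service` below simulates a system that degrades beyond 500 requests; a real capacity test would drive a deployed environment with a load-generation tool.

```python
def find_breaking_point(handle_request, loads, max_error_rate=0.01):
    """Ramp through increasing load levels and return the first load at
    which the error rate exceeds the tolerance, or None if none does."""
    for load in loads:
        errors = sum(1 for i in range(load) if not handle_request(i))
        if errors / load > max_error_rate:
            return load
    return None

# Simulated service that starts failing above 500 requests (illustrative only).
def fake_service(i):
    return i < 500

breaking_point = find_breaking_point(fake_service, loads=[100, 250, 500, 1000])
```

Here `breaking_point` comes back as `1000`, the first ramp step at which the error rate exceeded 1%.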
Chaos test
  • Classification: Resilience
  • Applicability: Contextual
  • What it means: Cause failures in a system to test the resiliency of that system and its environment, and our ability to respond to failures.
  • Why we care: Gives the team confidence that failures in a given environment will not lead to unplanned downtime or a negative user experience. Ensures that the team has the visibility (e.g. dashboards and alerts) to identify issues. Surfaces performance bottlenecks, weaknesses in system design, and tipping points that aren't visible through other types of testing. Helps the team understand their mean time to recovery (MTTR) and build muscle memory and confidence for recovery activities.
  • Tolerances for green: Regular (at least every couple of months) game days, and: builds fail if any test fails (note: these tests are slow, and are likely to be part of an infrequently-triggered, e.g. overnight, build); the tests cover whether the system self-heals, auto-scales, and alerts as expected.
  • Endorsed tools / configuration: aws-fis
Code review
  • Classification: Other
  • Applicability: Universal
  • What it means: A second person manually checking a code change.
  • Why we care: Quality check by a human, as opposed to via a tool.
  • Tolerances for green: Enforced and audited step within the workflow.
  • Endorsed tools / configuration: TBC
  • Further details: Code review guidance
Code smells
  • Classification: Maintainability
  • Applicability: Universal
  • What it means: Check whether the software code adheres to the principles, patterns and practices of writing clean code.
  • Why we care: The code is written once but read and executed many times. If the code is not clean, the cost and risk of making software changes both increase exponentially over time.
  • Tolerances for green: Must use SonarQube's default rules, profiles and gateways. The build pipeline must fail if the gateway is not met.
  • Endorsed tools / configuration: SonarQube
  • Further details: Clean Code: Smells and Heuristics
Code coverage
  • Classification: Maintainability
  • Applicability: Universal
  • What it means: The proportion of the application code which is executed (in this context: during testing).
  • Why we care: The higher the code coverage, the more thorough the testing, and therefore the higher the likelihood of detecting functional issues early.
  • Tolerances for green: Must use SonarQube's default rules, profiles and gateways. The build pipeline must fail if the gateway's test coverage is not met. For new code, must meet the coverage specified in the default SonarQube gateway. For legacy code, the coverage threshold can be lowered initially to deal with historic tech debt, but there must be a plan for increasing coverage over time.
  • Endorsed tools / configuration: SonarQube (in conjunction with testing frameworks)
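A sketch of such a coverage gate, assuming hypothetical thresholds: the 80% new-code figure mirrors SonarQube's default quality gate, while the legacy figure is an example of a temporarily lowered target that should rise over time.

```python
def coverage_gate(new_code_coverage, legacy_coverage,
                  new_code_threshold=0.80, legacy_threshold=0.50):
    """Return True if the build may proceed: new code must meet the default
    gate, legacy code a (temporarily) lower agreed figure."""
    return (new_code_coverage >= new_code_threshold
            and legacy_coverage >= legacy_threshold)
```

For example, 85% coverage on new code with 60% on legacy code passes, but 70% on new code fails the gate regardless of the legacy figure.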
Dead code
  • Classification: Maintainability
  • Applicability: Universal
  • What it means: Detecting unused code and files that are not needed.
  • Why we care: Code is written once but read and executed many times. The more code you have, the greater the risk of something going wrong.
Dependency scan
  • Classification: Security
  • Applicability: Universal
  • What it means: Check for security issues and vulnerabilities in dependent areas of code that are outside of our direct control.
  • Why we care: Without this we have no way of knowing about issues or security vulnerabilities in third-party components that we are not responsible for.
  • Tolerances for green: Must check against the CVE database. Must check dependencies of dependencies. Must fail the build if any High severity vulnerabilities are found. It should be easy to determine why the build failed: which vulnerability it was, and in which top-level dependency. Tools must include the ability to exclude accepted vulnerabilities; each exclusion should include a date at which it expires and the build fails again, and a description of why it is excluded.
  • Endorsed tools / configuration: One option is (other options are being added): dependency-check-maven
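The gating and exclusion rules above can be sketched as follows; the data shapes and CVE identifiers are hypothetical, not the output format of any particular scanner:

```python
from datetime import date

def dependency_gate(vulnerabilities, exclusions, today):
    """Return the findings that should fail the build: High severity
    vulnerabilities not covered by an unexpired exclusion.
    vulnerabilities: list of (cve_id, severity, top_level_dependency)
    exclusions: dict of cve_id -> (expiry_date, reason)"""
    failures = []
    for cve_id, severity, dependency in vulnerabilities:
        if severity != "High":
            continue
        exclusion = exclusions.get(cve_id)
        if exclusion and today <= exclusion[0]:
            continue  # accepted vulnerability, exclusion not yet expired
        failures.append((cve_id, dependency))  # report what failed, and where
    return failures
```

An empty result means the gate passes; once an exclusion's expiry date passes, the same finding fails the build again.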
Duplicate code scan
  • Classification: Maintainability
  • Applicability: Universal
  • What it means: Check whether the same code is used in multiple places.
  • Why we care: Duplicate code increases the cost and risk of making software changes, and that cost and risk will increase exponentially over time.
  • Tolerances for green: Must use SonarQube's default rules, profiles and gateways. The build pipeline must fail if the gateway is not met.
  • Endorsed tools / configuration: SonarQube
Integration test
  • Classification: Functionality
  • Applicability: Universal
  • What it means: Check interactions with other components and dependent systems, e.g. across microservices, authentication layers, databases and third-party systems. Ideally includes full end-to-end testing across all components.
  • Why we care: When components are developed in isolation, it's vital that we regularly test them working together. Changes in one component can break the whole system.
  • Tolerances for green: Builds fail if any tests fail.
Performance test
  • Classification: Resilience
  • Applicability: Contextual
  • What it means: Check whether application performance is acceptable at different levels of load. This may include: a baseline test (one-off) to establish how the system behaves; a smoke test to establish that the key functionality is working before performing longer tests; a regression test running a suite of repeatable test cases to validate existing functionality; and a load test to understand the system behaviour under an expected load.
  • Why we care: Without these tests, we don't know how load will affect the performance of the application, or whether existing functionality has been broken.
  • Tolerances for green: The performance of the system must be scored at build time so that it can be tracked. The build pipeline must fail if performance does not meet the acceptable level.
  • Endorsed tools / configuration: One option is to use Apdex to quantify performance as a numeric value, and to use this value to pass/fail the build pipeline.
  • Further details: Performance test practices
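Apdex combines satisfied, tolerating and frustrated response-time samples into a single score between 0 and 1, which a pipeline can compare against an agreed level. A minimal sketch, in which the 0.5 s target time and 0.85 threshold are illustrative assumptions rather than recommended values:

```python
def apdex(response_times, t=0.5):
    """Apdex score: (satisfied + tolerating/2) / total, where a sample is
    'satisfied' if <= t seconds and 'tolerating' if <= 4t seconds."""
    total = len(response_times)
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / total

# A pipeline could then fail the build if the score drops below an agreed level:
def performance_gate(response_times, threshold=0.85, t=0.5):
    return apdex(response_times, t) >= threshold
```

For example, with t = 0.5 s the samples [0.1, 0.3, 1.0, 3.0] give two satisfied, one tolerating and one frustrated, so the score is (2 + 0.5) / 4 = 0.625, failing the gate.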
Secret scan
  • Classification: Security
  • Applicability: Universal
  • What it means: Check for secrets (e.g. passwords, API keys, certificates or tokens) accidentally included in software code.
  • Why we care: This protects us against accidentally leaking secrets in source code, commit history, logs and configuration, which could compromise the security of the application.
  • Tolerances for green: Review the detection rules and exclusions regularly; then: a full repository (including history) scan, with all secrets removed; and local scanning to block commits containing the patterns; and server-side scanning within the code repository for new commits containing the patterns.
  • Further details: GitLeaks guidance
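A minimal sketch of pattern-based scanning. The two patterns below are illustrative only; real tools such as GitLeaks ship with much larger, maintained rule sets and also scan commit history:

```python
import re

# Illustrative example patterns, not a complete rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # PEM private key header
]

def scan_lines(lines):
    """Return (line_number, line) pairs that look like they contain a secret."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            findings.append((number, line))
    return findings
```

In local (pre-commit) scanning a non-empty result would block the commit; server-side, it would reject the push or raise an alert.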
Security scan
  • Classification: Security
  • Applicability: Universal
  • What it means: Check for indications of possible security issues (for example injection weaknesses).
  • Why we care: This gives fast feedback about security issues. Code analysis is not as thorough as security testing in terms of finding complex weaknesses or issues that only manifest themselves at runtime, but it has much greater coverage. It's a better option for finding simple weaknesses, and it's much quicker to execute. Security code analysis and security testing are both important to achieve rapid and thorough security testing.
  • Tolerances for green: If using SonarQube, must use SonarQube's default rules, profiles and gateways. The build pipeline must fail if the gateway is not met.
  • Endorsed tools / configuration: One option is SonarQube. For the purpose of security code analysis, Developer Edition or higher is required, as it includes advanced OWASP scanning.
Security test
  • Classification: Security
  • Applicability: Contextual
  • What it means: Check for security issues (for example injection weaknesses).
  • Why we care: More thorough than security code scanning, but much slower to execute, so both are important to achieve both rapid and thorough security testing.
Soak test
  • Classification: Resilience
  • Applicability: Contextual
  • What it means: Check whether sustained heavy load over a significantly extended period causes problems such as memory leaks, loss of instances, database failovers, etc.
  • Why we care: Without this test, we don't know whether application performance will suffer under prolonged heavy load, how stable the system is, or how it performs without interventions.
Stress test
  • Classification: Resilience
  • Applicability: Contextual
  • What it means: Check how the system performs under stress, including a level load near the maximum capacity for a prolonged period, and sudden spikes in load on top of a lower baseline load.
  • Why we care: Without this test, we don't know whether the application will begin to fail as a result of memory leaks, connection pool blocking, etc., or will fail under a sharp increase in load triggered by adverts, news coverage or TV tea breaks.
Tech radar check
  • Classification: Other
  • Applicability: Universal
  • What it means: Checking that the tools to be used are in line with organisational / team standards.
  • Why we care: To prevent the unnecessary proliferation of a wide variety of tools and technologies, which would have a negative impact on overall effectiveness.
UI test
  • Classification: Functionality
  • Applicability: Contextual
  • What it means: Check that the user interface components behave as expected, particularly checking the visual elements to verify that they function according to requirements.
  • Why we care: As the only aspects of the software that end users come into contact with, it is essential that these elements behave as expected and allow users to get what they need from our software applications.
  • Tolerances for green: Builds fail if any tests fail.
Unit test
  • Classification: Functionality
  • Applicability: Universal
  • What it means: Logic tests for individual blocks of code, e.g. individual methods.
  • Why we care: This is the quickest (to execute) type of functional test, so these tests are essential to achieve both rapid and thorough functional testing.
  • Tolerances for green: Builds fail if any tests fail.
  • Further details: Test practices
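As a minimal illustration (the `add_vat` function is invented for the example), a unit test exercises one block of logic, and a CI runner fails the build if any assertion does not hold:

```python
def add_vat(net_amount, rate=0.2):
    """Example function under test: apply a VAT rate to a net amount."""
    return round(net_amount * (1 + rate), 2)

# pytest-style unit tests: a test runner in the build pipeline fails the
# build if any assertion fails.
def test_standard_rate():
    assert add_vat(100.00) == 120.00

def test_zero_rate():
    assert add_vat(100.00, rate=0.0) == 100.00
```

Because such tests run in milliseconds, they can run on every commit, which is what makes them the fastest-feedback layer of the test pyramid.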

Publishing code

All code should be treated the same and treated well (please see everything as code), but code that is being published (i.e. made available outside of NHS Digital, for example in a public repository on GitHub) incurs additional considerations.

For example, it's never good to include credentials and other secrets in source code, but the impact of this is obviously greater if the code is made available to the public.

Therefore, for published code, the following minimums are required:

  • Unit tests: GREEN
  • Integration tests: AMBER (where applicable)
  • API / contract tests: AMBER (where applicable)
  • UI tests: AMBER (where applicable)
  • Secret scanning: AMBER (including removal of any secrets)
  • Security code analysis and Security testing: AMBER for at least one of these
  • Dependency scanning: AMBER
  • Code coverage: AMBER
  • Duplicate code scan: AMBER
  • Code smells scan: AMBER
  • Dead code scan: AMBER
  • Code review: GREEN
  • Accessibility tests: AMBER (where applicable)
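The minimums above can be checked mechanically against a repository's RAG ratings. The sketch below covers only a subset of the list for brevity, and the check names and ratings are illustrative:

```python
# N/A passes by definition (the check does not apply), so it ranks with Green.
RAG_ORDER = {"Red": 0, "Amber": 1, "Green": 2, "N/A": 2}

# A subset of the publishing minimums listed above, for brevity.
PUBLISHING_MINIMUMS = {
    "Unit tests": "Green",
    "Code review": "Green",
    "Secret scanning": "Amber",
    "Dependency scanning": "Amber",
    "Code coverage": "Amber",
}

def ready_to_publish(ratings):
    """Return the checks that fall short of the publishing minimums;
    an empty list means the code may be published."""
    return [check for check, minimum in PUBLISHING_MINIMUMS.items()
            if RAG_ORDER[ratings.get(check, "Red")] < RAG_ORDER[minimum]]
```

A repository with no ratings recorded fails every minimum, since an unrated check is treated as Red.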