Imagine that you’re working in a government subdepartment that tracks coronavirus infections and traces the patients’ contacts. You’ve just discovered that nearly 16,000 positive tests didn’t get into the system for a week. The people with the coronavirus know they have it, but their contacts don’t know. What do you do?
We don’t have to imagine it. In the United Kingdom, 15,841 people with coronavirus infections went unreported to Serco Test and Trace, the company running the U.K. government’s coronavirus tracing program, in the week from Sept. 25 to Oct. 2, with 48,000 of their possible contacts not warned—because of errors in a data import.
Patients were told they had tested positive, but their details weren’t passed to contact tracers until 1 a.m. U.K. time on Oct. 3, and the cases were not added to official totals until Oct. 5. The cause was simple: a mix-up between different Excel file formats. The U.K.’s response to COVID-19 is widely regarded as scattershot and haphazard. So how did they get here?
Excel is a top-of-the-line spreadsheet tool. A spreadsheet is good for quickly modeling a problem—but too often, organizations cut corners and press the cardboard-and-string mock-up into production, instead of building a robust, purpose-built system from the Excel proof of concept.
Excel is almost universally misused for complex data processing, as in this case—because it’s already on your work computer and you don’t have to spend months procuring new software. So almost every business has at least one critical process that relies on a years-old spreadsheet, set up by staff who have long since left, that nobody still at the company understands.
That’s how the U.K. went wrong. An automated process at Public Health England (PHE) transformed the incoming private laboratory test data (which was in text-based CSV files) into Excel-format files, to pass to the Serco Test and Trace teams’ dashboards.
Unfortunately, the process produced XLS files—an outdated Excel format, superseded in 2007—which have a limit of 65,536 rows per sheet, rather than the roughly 1 million-row limit of the more recent XLSX format. With several lines of data per patient, this meant a sheet could hold only about 1,400 cases. Further cases simply fell off the end.
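The failure is easy to picture in code. Here is a minimal sketch—assuming a Python conversion step, since PHE’s actual pipeline isn’t public—of the kind of guard that would have turned silent truncation into a loud error:

```python
import csv

# The legacy .xls format caps each worksheet at 65,536 rows (2**16).
XLS_MAX_ROWS = 65_536

def check_fits_in_xls(csv_path: str) -> int:
    """Refuse to convert a CSV that would silently overflow an .xls sheet.

    Hypothetical guard; the actual PHE conversion step is not public.
    """
    with open(csv_path, newline="") as f:
        row_count = sum(1 for _ in csv.reader(f))
    if row_count > XLS_MAX_ROWS:
        raise ValueError(
            f"{csv_path}: {row_count:,} rows exceed the .xls limit of "
            f"{XLS_MAX_ROWS:,}; everything past the limit would be dropped"
        )
    return row_count
```

One loud exception here, and the missing 15,841 cases would have been a support ticket instead of a national news story.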
Technicians at PHE monitoring the dashboards noticed on Oct. 2 that not all data that had been sent in was making it out the other end. The data was corrected the next day, and PHE announced the issue the day after.
It’s not clear if the software at PHE was an Excel spreadsheet or an in-house program using the XLS format for data interchange—the latter would explain why PHE stated that replacing it might take months—but the XLS format would have been used on the assumption that Excel was universal.
And even then, a system based on Excel-format files would have been an improvement over earlier systems—the system for keeping a count of COVID-19 cases in the U.K. was, as of May, still based on data handwritten on cards.
PHE was initially blamed for the data disaster. But the question to ask when a procedure leads to disaster is: What problem was the procedure supposed to solve in the first place? PHE’s process was broken in multiple ways; why did it exist at all?
The process that went wrong was a workaround for a contract issue: The government’s contract with Deloitte to run the testing explicitly stipulated that the company did not have to report “Pillar 2” (general public testing) positive cases to PHE at all.
Since a test-and-trace system is not possible without this data, PHE set up feeds for it anyway, as CSV text files sent directly from the testing labs. That improvised pipeline became the single system bridging testing and tracing for all of England. PHE had to apply technological duct tape just to make a system of life-or-death importance work at all.
For now, Public Health England has worked around the problem: Serco Test and Trace still takes an Excel 2003-format XLS spreadsheet as part of the data pipeline, but the process now splits the data across multiple sheets, so the files don’t overflow again.
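The fix is equally unglamorous. A minimal sketch of the splitting logic—again assuming a Python pipeline and a hypothetical write_sheet helper, not anything known about the actual code:

```python
from itertools import islice
from typing import Iterable, Iterator, List

XLS_MAX_ROWS = 65_536  # per-sheet row limit in the legacy .xls format

def chunk_rows(rows: Iterable[list], size: int = XLS_MAX_ROWS - 1) -> Iterator[List[list]]:
    """Yield chunks of data rows, each small enough for one .xls sheet.

    One row per sheet is reserved for a header line, hence the "- 1".
    """
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage: one sheet per chunk, so nothing falls off the end.
# for i, chunk in enumerate(chunk_rows(all_rows), start=1):
#     write_sheet(workbook, name=f"Sheet{i}", header=header, rows=chunk)
```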
The U.K. government has used wide-ranging outsourcing to the private sector to build a COVID-19 response system quickly. But outsourcing is fraught with hazards of exactly this kind, where faulty systems are created to get around regulatory problems or to cut corners.
A shortage of home testing kits led the government to worry that people would request multiple kits. Identity verification was outsourced to the credit agency TransUnion—which implemented it by running credit checks. People who couldn’t pass TransUnion’s particular credit checks were refused tests, including many who had bank accounts and were on the electoral roll and so should have been easy to verify as real people. Some 5.8 million U.K. residents have bad credit histories and so could be refused a test kit—a burden that falls disproportionately on the already poor and disadvantaged, with obvious implications for public health.
Many of those rejected by the credit check were told to visit testing centers at the other end of the country, such as London residents being told to go to Scotland for a test. One London resident was advised by staff at a testing center to say she lived in Aberdeen, Scotland—and this got her a test in London.
The Brookings Institution report Doomed: Challenges and solutions to government IT projects lists factors to consider when outsourcing government information technology. The outsourcing of tracking and tracing is an example where the government assumes all of the risk and the contractor takes all of the profit. The government also did the one thing you should never do: It outsourced a core function. Running a call center or the office canteen? You can outsource it. Tracing a pandemic? You must run it in-house.
If you need outside expertise for a core function, use contractors working within a department. Competing with the private sector on pay can be an issue, but a meaningful project can be a powerful incentive.
This sort of error doesn’t have to happen. There’s a set of basic principles to apply for vital functions when you’re working within an unreliable organization. Site Reliability Engineering principles are lifesavers when you’re short on both human and technical resources.
The first step is fully understanding your task. Document what you need. Document whether you can get it or not. Set explicit priorities, and get signoff. Document when there are requests to drop a priority because of lack of resources.
If the data import from the labs had been checked daily, the problem would have been noticed and solved immediately. Are you cross-checking your numbers? Do you have a list of the numbers that need to be verified—and do you know what “normal” numbers look like? If you can’t get the people you need to do this work, have you documented the refusal?
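A daily cross-check of this kind can be a few lines of code. As a sketch—the function and the numbers are purely illustrative, not anything from PHE’s actual systems:

```python
def reconcile(cases_received: int, cases_forwarded: int) -> None:
    """Daily sanity check: every case received from the labs must also
    appear in the feed handed to the contact tracers."""
    if cases_forwarded != cases_received:
        missing = cases_received - cases_forwarded
        raise RuntimeError(
            f"reconciliation failed: {missing} cases from the lab feed "
            f"are missing from the contact-tracing feed"
        )

# Illustrative numbers only:
# reconcile(cases_received=7_500, cases_forwarded=7_100)  # raises RuntimeError
```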
Consider your software support. In the U.K., the Government Digital Service, part of the Cabinet Office, uses LibreOffice, the actively maintained successor to the old OpenOffice. LibreOffice is free, and commercial support is easy to buy. The Government Digital Service sits on the advisory board of the Document Foundation, the nonprofit behind LibreOffice. The LibreOffice spreadsheet also deals with Excel’s known math bugs: where a function reproduces Excel’s wrong answer for compatibility, there is a corresponding function that gives the correct answer.
Document your duct tape stopgap solutions. Document and request the support that you need to do the job robustly—and document it being refused. If spreadsheets are all you have, send your staff on training courses. Training and caution are much cheaper than a widely publicized disaster. Particularly when lives hang in the balance.