There’s no uniformity in the approach to data collection and more importantly, disease is a stigma that Indians love to hide from their neighbours

Jeevan Prakash Sharma, Siddharth Premkumar
June 2020

So we have a bio-earthquake with shifting epicentres in Covid-19, still rumbling unpredictably over our demographic plates. How do you measure it in real time, count the damage, predict the next building to fall, the next neighbourhood to be hit, and evacuate everyone safely? Not easy at all, as it turns out. Our need to quantify has loosed on us a vast trove of unruly statistics: data itself gone viral, a tsunami in the wake of a quake. Its sheer scale and speed pose challenges. But there’s an additional qualification. The virus, for all the conspiracy theories around it, is an apolitical being, class-neutral and ideology-agnostic.

Data, on the contrary, is deeply political, and not merely at the level of nations or governments. As a unit of knowledge, it is in its very DNA to segue into power, and to bend to its whims. It is owned, it is withheld, it is manipulated. Knowledge, when shared, grows, or so goes an old Indian dictum. But deriving power from knowledge kept secret is tradition too. Democracy enjoins upon its keepers an oath to transparency on anything that affects the public; opacity means you don’t even know what questions to ask, a handy thing for those in power during one of human history’s most momentous events.


Data is also an infinitely plastic element, clay in the hands of the potter. And that’s not a uniquely Indian affliction. Witness the Surgisphere scandal in the US. Or wait…it’s an Indian-founded analytics company at the heart of the story. It had a data registry said to include 96,032 Covid patient records from 671 hospitals on six continents. That lured a star Indian-origin cardiac surgeon, with all the requisite gravitas, to push through major observational studies into venerable journals, The Lancet and The New England Journal of Medicine (NEJM). The Lancet study claimed the anti-malarial drug hydroxychloroquine (HCQ), touted by President Trump as a ‘wonder drug’, posed an increased death risk to Covid patients; the NEJM paper said certain blood pressure medications did not. Both were retracted within an hour of each other after suspicions mounted about the data sets, which were never made available for peer review. In their apology statement, the authors said they no longer had “confidence in the origination and veracity of the data, nor the findings they have led to”. But by then, based on that research, the WHO had halted the HCQ arm of its global Solidarity trial. That arm has since been resumed.

HCQ, meanwhile, still receives bullish endorsements from Indian advocates. But India is yet to extricate itself fully from the heart of darkness when it comes to data. How many people tested? How many men, how many women? What age-groups? Do we know their blood groups? Co-morbidities? Are there patterns of any sort? Why did we not test enough earlier? The best way to hide data is, of course, to not even generate it. What value was there in saying India had only around 500 Covid +ve cases on March 24 when we had tested only a grand total of 12,872 people by then—not even a handful of apartment blocks? We were logging 1,300-odd tests a day when we could have tested 12,000 a day. The tsunami was born as a trickle. Questions were posed before the Union health ministry and ICMR about their reluctance to release both raw data and its sourcing: the answer was mostly omerta. Not very conducive to analytics.


Over two months on, the ministry website still provides only basic data: the total number of active cases, recoveries, and fatalities, and their state-wise break-up. Asymptomatic? Mildly symptomatic? The number of beds they occupy? No. For more granular geographical details (for instance, snapshots at the district and taluka level), state government websites are the only recourse, though Delhi does not confer that honour on us either. Most cripplingly, there’s no uniformity in the approach to data collection, and hence no comparability. Some states occasionally release age/gender data, some are hazy on that detail. The taps shut completely when it comes to more nuanced information, such as the number of PPE kits available, or of patients on ventilators or oxygen support in hospitals in the country. There’s no such thing as a set of, say, 10 basic data-points that all states religiously generate, so that medical professionals and researchers can rifle through them and arrive at reasonable conclusions, so that they can feed bureaucrats and policymakers, so that the layperson can both be illuminated and forewarned. That would speak of a data culture that we just don’t seem to have; India as a collective largely behaves as if it’s still a family, and disease is a stigma that we love to hide from our neighbours. Sorry to disappoint you, we aren’t sick.


The ministry’s daily briefings dovetail perfectly with that information-shyness. When they do happen (one hadn’t been held for 10 days as this goes to print), they divulge genuinely valuable details and analyses less readily than they dispense homilies on the merits of physical distancing and pronouncements of ‘faraway peaks’. And yet, government press releases are the only official sources of information on such pressing questions as the doubling rate of positive cases, the status of PPE kits or updated guidelines for testing protocols. Other crucial elements remain elusive: to date, we have been handed zero details on testing kit production, capacity and import status. Nor is there any break-up of the number of tests conducted by private labs and government labs and the percentage of positive cases in each. When the ministry does update the total number of RT-PCR tests conducted, it uses the ambiguously worded ‘specimen’, an imprecise measure of the number of individuals tested, since all Covid +ve cases involve multiple tests with as many specimens.

“So we don’t know how many individuals have been tested across India so far. If this basic information is not provided to common people, what else one can expect?” says a senior government official, on condition of anonymity. That was a common request during Outlook’s information-gathering efforts, which mostly met with a wall of reluctance. Those who do speak from within officialdom say that besides the lack of data transparency, the situation is further aggravated by a dearth of data savvy at both the crisis response management and leadership levels.

What’s the SoP?

There’s none, and that’s a major impediment to data analysis, researchers say. The lack of uniformity across states in the way their data is compiled, analysed, reported and presented makes it difficult for those outside the system to access… let alone understand. Put that down to ICMR’s lack of interest in devising any standard operating procedure. “What we are working with is highly imperfect because different states, even different districts, have adopted different reporting and testing strategies. It’s not an apples-to-apples comparison. When we put them all on the same graph, it’s almost like committing a statistical sin,” says Prof Bhramar Mukherjee, a biostatistician at the University of Michigan who heads the Cov-Ind-19 Study Group, a collection of US academics tracking the pandemic in India. She spoke those words at an online symposium organised on June 6 by the Indian Scientists Response to Covid-19 (ISRC) collective, an initiative comprising about 600 scientists attempting to provide independent, data-driven, evidence-based information about the pandemic to the public.

Statistical and data reporting infrastructure itself is an issue, says Prof Mukherjee, speaking to Outlook. Owing to “gaps and misalignments in daily reporting from ICMR”, her group uses a crowd-sourced volunteer initiative that aggregates data from state press bulletins, official social media handles and media reports; the gap it fills is obviously left by a retreating officialdom. “I have been looking for daily admissions and death data related to Influenza-like illness (ILI) and Severe Acute Respiratory Illness (SARI) across India, both historically and now, but have not been able to locate it. This is very important to track in the future as a departure from the historical curve could indicate an outbreak,” she says.

Another speaker at the symposium, Dr T. Sundararaman, former executive director of the National Health Systems Resource Centre, had similar thoughts. “The ICMR protocol is very clear that ILI cases which are symptomatic, outside hotspots and containment zones will not be tested. There’s a phenomenal reluctance to test symptomatic patients. But a hotspot cannot be declared as such unless you test. So there’s this whole chicken or egg situation,” he said. And why had the Integrated Disease Surveillance Programme (IDSP) stopped its monthly reporting of ILI and SARI from February 24, since Covid-19 is “clinically indistinguishable from the flu”? The only plausible answer is, so that we don’t know. “There’s a mountain of data being churned out from various sources, including IDSP and health management information systems, among others. But there’s almost no inter-operability; these data don’t talk to each other despite a high degree of overlaps,” Dr Sundararaman noted.


Some data anomalies happen even within one system: on May 14, the Delhi government put out two sets of figures, one mandated by the high court, another in its daily health bulletin. In response to a court order, it said 1,807 tests were reported that day from various government labs, of which 472 (that is, 26 per cent) were positive. Meanwhile, the daily bulletin stated that the total testing done till May 14 was 1,19,736; that was 6,391 more than on May 13, hence 4,504 more than the number given to the court. Numbers from private labs, which report only to ICMR, could have accounted for the difference, except for the curious fact that the number of positive cases (472) was the same in both reports. So private labs tested 4,504 people, earning eighteen million rupees, and found zero? “How is this possible? That too when 1,807 public lab tests gave 472 positive cases? The state government is either unable to explain it properly or manipulating the data,” says a Delhi government official. “If this is happening in the national capital, can you imagine the situation in other states?”
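
The reconciliation behind that question is simple arithmetic, sketched below in Python using only the figures quoted above. The function and variable names are illustrative and not drawn from any official system. Taken at face value, 6,391 minus 1,807 comes to 4,584 rather than the 4,504 quoted, so one of the printed figures may be slightly off; the sketch simply reports what the subtraction gives.

```python
def positivity_rate(positives: int, tests: int) -> float:
    """Share of tests that came back positive."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return positives / tests


# Figures as quoted in the text for Delhi on May 14.
court_tests = 1807         # government-lab tests reported to the high court
court_positives = 472      # positives in the court filing
bulletin_positives = 472   # positives in the daily bulletin (same number)
bulletin_increase = 6391   # rise in cumulative tests, May 13 to May 14

# Tests counted in the bulletin but absent from the court filing.
unexplained_tests = bulletin_increase - court_tests

# Positives that those extra tests would have to account for.
extra_positives = bulletin_positives - court_positives

print(f"Court filing positivity: {positivity_rate(court_positives, court_tests):.1%}")
print(f"Tests not in the court filing: {unexplained_tests}")
print(f"Implied positivity of those tests: "
      f"{positivity_rate(extra_positives, unexplained_tests):.1%}")
```

A check of this kind, run on every bulletin, would flag any day on which two official sources cannot both be right.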

Human errors?

Several data experts blame a top-down lack of expertise and appreciation for data for a host of ills, including the poor quality of ground-level data collection. Some, however, call it a “legacy problem” involving previous governments; yet the manner of, and motivation for, data collection do change with each change of the guard. And the poverty of the data culture transmits downward. A government data expert says the problem is more individual than institutional. “If a person is not data-savvy, it leads to a lot of data collection in non-standard formats, leading to confusion. There have been multiple changes in the way data was recorded over the last two months even in the country’s most significant city…so public data cannot be considered accurate,” he says. Dr Raman Sharma, an expert on infectious diseases at Jaipur’s Sawai Man Singh Hospital, too says the Union health ministry is not entirely at fault because a lot of data contamination occurs from the ground up. “Sometimes the patient’s age is missing, often the gender is not marked. If data transmission from the field is not correct, its analysis is bound to give errors. The Centre is dependent on state governments and hospitals for all such data,” he says.

Dr Sundararaman, though, isn’t inclined to blame doctors or auxiliary nurse midwives (ANMs) for falsifying whole sets of data. “There’s no stake for them to do so. If the numbers upstairs are unreliable, there are complex processes responsible for this… the whole way in which this data management is done, with senior IAS officers at the top making pronouncements with raw numbers without any sense of the denominators, is problematic,” he said at the symposium. Concerns about the denominator have been raised in a number of indicators and metrics—including cases/deaths/tests per million, as also test positivity and case fatality rates (TPR and CFR). “In India, TPRs have been staying at 4-5 per cent. It’s been argued that this shows there’s no community transmission, but it’s undeniable that there’s wide variation in TPR across states. It’s important to monitor such trends at the granular level,” Prof Mukherjee notes. Her Cov-Ind-19 group published a preprint on medRxiv last month that illustrated the wider problem of national data masking state-level variations by using metrics that don’t reflect their heterogeneity, thereby hampering nuanced analyses and informed policy interventions.
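
How a pooled national figure can hide state-level variation is easy to see in a small Python sketch. The three states and their counts below are invented purely for illustration and are not real data; only the idea, a national TPR of 4-5 per cent sitting on top of very different state rates, comes from the text.

```python
def tpr(positives: int, tests: int) -> float:
    """Test positivity rate: positives divided by tests conducted."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return positives / tests


# Hypothetical (tests, positives) per state. NOT real data.
states = {
    "State A": (100_000, 2_000),
    "State B": (20_000, 3_000),
    "State C": (80_000, 4_000),
}

# Per-state rates keep the heterogeneity visible.
state_tpr = {name: tpr(pos, tests) for name, (tests, pos) in states.items()}

# The national rate pools every test into a single denominator.
total_tests = sum(tests for tests, _ in states.values())
total_positives = sum(pos for _, pos in states.values())
national_tpr = tpr(total_positives, total_tests)

for name, rate in state_tpr.items():
    print(f"{name}: {rate:.1%}")
print(f"National (pooled): {national_tpr:.1%}")
```

In this made-up example the pooled rate lands at 4.5 per cent even though one state sits at 15 per cent, because the states that test the most dominate the denominator.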

As for the case fatality rate, Ashwin Srivastava, CEO of Sapio Analytics, a data analytics firm working with various government agencies, told Outlook, “The government has adopted a wrong method to calculate CFR. It counts all cases together and then finds the death rate. That shows a low mortality rate because the denominator consists of cases that should not be included in the pool.” That is, the denominator lumps the active cases, whose outcomes are not yet known, together with the closed cases: the dead and the recovered. A patient admitted today may die in 10-15 days; that’s the relevant time-period. At the symposium, Dr Sundararaman too flagged the issue of time-lags being neglected. “If we use the right denominator accounting for the 10-15 days, the CFR would be 4.5-6 per cent. This is internationally comparable. It shouldn’t be surprising to learn that the virus behaves in India exactly as it does elsewhere in the world,” he noted.
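
The three denominators in play can be compared in a short Python sketch. All numbers are invented for illustration and are not real case counts; the 14-day lag is an assumption picked from within the 10-15 day window mentioned above, and the function names are hypothetical.

```python
def naive_cfr(deaths: int, total_cases: int) -> float:
    """Deaths over all confirmed cases, including still-active ones."""
    return deaths / total_cases


def closed_case_cfr(deaths: int, recovered: int) -> float:
    """Deaths over closed cases only (dead plus recovered)."""
    return deaths / (deaths + recovered)


def lagged_cfr(deaths: int, cumulative_cases: list, lag_days: int) -> float:
    """Deaths to date over the cumulative cases recorded lag_days earlier."""
    if not 0 <= lag_days < len(cumulative_cases):
        raise ValueError("lag_days must fall within the series")
    return deaths / cumulative_cases[-1 - lag_days]


# Hypothetical cumulative confirmed cases for days 0..14. NOT real data.
cumulative_cases = [1000 + 200 * day for day in range(15)]
deaths_to_date = 60        # hypothetical
recovered_to_date = 1140   # hypothetical

print(f"Naive CFR:       {naive_cfr(deaths_to_date, cumulative_cases[-1]):.1%}")
print(f"Closed-case CFR: {closed_case_cfr(deaths_to_date, recovered_to_date):.1%}")
print(f"14-day lag CFR:  {lagged_cfr(deaths_to_date, cumulative_cases, 14):.1%}")
```

While case counts are still climbing, the naive figure is the lowest of the three, since recently confirmed patients inflate the denominator before their outcomes are known.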

Others like Dr Sanjay Mehendale, director (research) at Pune’s Hinduja hospital, maintain that data has to be validated before being released. “In my 30 years of working with ICMR, I can assure you data will become available in public domains only once it is clearly validated and its quality ensured. It’s easy to criticise that the correct data is not available, but you have to agree that if non-validated data is put in the public domain, it will lead to much higher criticism.”

The teaching hospital

Many experts feel government bodies are so stuck in fire-fighting mode that they forget to plan for the medium and long term. “It’s one of the main reasons why our situation is so critical right now. Instilling a culture of data appreciation is much needed,” says Srivastava. Generating lateral data, going beyond the immediate focus of testing someone positive or negative, could be of immense value. Studies published in Europe and China, for instance, have also looked at possible statistical links between Covid and genetic variations, such as in the gene that determines blood types. The research papers, while preliminary, suggest that people with blood type A might be more vulnerable to coronavirus infection while people with blood type O (which a 2014 study found to be the most common in India) might be resistant to more severe symptoms. How do we know if there’s a potential correlation unless someone has taken the trouble, over the lakhs of samples tested, to also do a simple blood group test? Researchers now hope that at least ICMR’s nation-wide serological survey (see D-G’s Interview) analyses samples for genotyping. “It will help prioritise risk stratification, help public health planning. If an association is found between certain genetic variants and severe outcomes…future patients carrying that mutation can be screened in a targeted way. Identifying pathways, predicting treatment response and drug targets…it will aid all of that,” says Prof Mukherjee. Data, like time, is folded into three: there’s the past that needs to be understood, there’s the present emergency that needs to be solved, and most of all, there’s the orientation to the future that the other two must feed. This pandemic, and the next one. India as a civilisation cannot just be a hospital, it needs to be a research hospital that sows and reaps its knowledge for tomorrow.

Vox PoP

Prof Bhramar Mukherjee

John D. Kalbfleisch Collegiate Professor of Biostatistics, School of Public Health, University of Michigan

“Investment in public health, biostatistics and epidemiology is minimal in India and I am sure after this pandemic is over, the need for growing capacity in these domains will be forgotten until the next public health crisis emerges. It is a medicine/treatment-focused culture rather than one that invests in prevention/public health. Maintaining a strong public health infrastructure is a general broader need as there will be many other public health crises aside from a pandemic in the future. I hope this public health challenge gives us an opportunity to advocate for public health, data and data scientists. Data transparency and modelling is key not just in this pandemic, but for information-driven policymaking in general. We need a well-funded CDC in India, staffed with qualified data scientists and public health professionals.”

courtesy Outlook