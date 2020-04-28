“There are three kinds of lies: lies, damned lies and statistics.” Disraeli? Mark Twain? (see here)

INTRODUCTION

This is Part II of a series about dealing with the covid-19 pandemic. Part I dealt with government folly, focusing on Pennsylvania. In this article, Part II, I want to give some statistics and discuss what they tell us, even though for the most part they are incomplete and/or misleading. That last statement pains me greatly. I’ve outgrown most of my mild Asperger’s syndrome, but I still want to believe (with Pythagoras?) that numbers represent reality faithfully.

Since the middle of March, I’ve been putting statistics from the Johns Hopkins world map and from the Pennsylvania Department of Health into a Numbers spreadsheet (the Mac equivalent of Excel) and playing with them, as follows:

US “confirmed” cases, deaths, recovered cases;

Pennsylvania confirmed cases, deaths;

Montgomery county (Philadelphia suburb) cases, deaths;

Montour county (rural, north-central PA) cases, deaths;

S. Korea “confirmed” cases, deaths, recovery.

The locales were chosen to illustrate different environmental situations (and different ways of coping with covid-19).

Now, even though the numbers from these sources do not represent reality faithfully, they may still give a good representation of trends, that is, of how things are changing with time. That would require that the data be taken in a uniform way, so that if there is a bias, it remains the same for each data point and would thus tend to cancel out when differences are taken. To see how things change with time and thus determine a trend, I’ve taken daily differences (represented by “Δ(something)”) and differences of the daily differences (represented by “Δ(Δ(something))”). The daily difference is a measure of the rate of change (graphically: the slope of the plot, “speed,” or first derivative); the difference of the daily difference is a measure of how fast that rate is itself changing (graphically: the curvature of the plot, “acceleration,” or second derivative).
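For readers who would rather reproduce the differencing outside a spreadsheet, here is a minimal Python sketch. The cumulative counts are invented for illustration and are not taken from the spreadsheet; the helper name `differences` is likewise hypothetical.

```python
def differences(values):
    """Return successive differences: values[i+1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical cumulative case counts for six consecutive days (illustrative only).
cumulative = [100, 130, 170, 220, 260, 290]

delta = differences(cumulative)      # Δ: daily new cases (slope)
delta_delta = differences(delta)     # Δ(Δ): change in daily new cases (curvature)

print(delta)        # [30, 40, 50, 40, 30]
print(delta_delta)  # [10, 10, -10, -10]

# A constant additive bias (say, 25 cases miscounted on every day) drops out of Δ.
biased = [v + 25 for v in cumulative]
print(differences(biased) == delta)  # True
```

Note that only a constant additive bias cancels exactly. If a fixed fraction of cases is missed every day, the differences are scaled by that fraction, but their sign and the sign of Δ(Δ) are unchanged, so the direction of the trend still comes through.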

Now that the methodology has been explained (I hope), let’s look at the data.

WHAT THE NUMBERS TELL US (AND DON’T TELL US)

If you look at the featured image, number of US recoveries from covid-19, versus time, you’ll see a steady increase, going from about 5000 in the middle of March to a little less than 120,000 as of today (28 April). Well, as the Church Lady might say, “Isn’t that special!” Indeed it is, but given that just shy of a million cases for the US have been reported as of April 28th, it isn’t that encouraging—only about a 12% recovery rate.

But wait! Have all US covid-19 cases been reported? What about asymptomatic cases, and cases where the symptoms were so mild that they were never reported to a hospital? A study of antibody prevalence in Santa Clara County, California indicated that many more people have been infected with covid-19 than have been reported:

“These prevalence estimates represent a range between 48,000 and 81,000 people infected in Santa Clara County by early April, 50 (to) 85-fold more than the number of confirmed cases,” Eran Bendavid et al., “Antibody Seroprevalence in Santa Clara County, California”

If 50-fold is an accurate estimate of the missed covid-19 cases, that would give about 50 million infected in the US (50 times the roughly one million reported cases), most of them never counted and presumably recovered. That number is clearly an over-estimate for the whole US. The population of Santa Clara County is about 2 million, so it’s densely populated (relative to flyover country). The infection prevalence is likely to be much higher there than where I live (only 47 reported cases out of the 18,000 county population). Even so, it’s evident that a large number of covid-19 infected have not been reported, so the “recovery numbers” come from a biased sample (those sick enough to be tested and counted), not a representative one. The trend is encouraging, but the absolute numbers are not meaningful.
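The arithmetic behind these figures can be checked in a few lines of Python. This is only a sketch: the inputs are the rounded numbers quoted above (including the rough county population of 2 million), and the variable names are made up for illustration.

```python
# Figures quoted from the Santa Clara study above.
infected_low, infected_high = 48_000, 81_000
fold_low, fold_high = 50, 85

# Confirmed case count implied by each end of the range (the two should agree).
confirmed_from_low = infected_low / fold_low      # 960.0
confirmed_from_high = infected_high / fold_high   # about 953

# Implied prevalence, using the rough county population given in the text.
county_population = 2_000_000
prevalence_low = 100 * infected_low / county_population    # about 2.4 percent
prevalence_high = 100 * infected_high / county_population  # about 4.05 percent

# Naive nationwide extrapolation at 50-fold from roughly one million reported cases.
us_reported = 1_000_000
us_implied = fold_low * us_reported               # 50,000,000

print(round(confirmed_from_low), round(confirmed_from_high))
print(us_implied)
```

Both ends of the quoted range imply roughly 950 to 960 confirmed cases in the county, so the 50-fold and 85-fold multipliers are at least internally consistent. The nationwide figure simply assumes the Santa Clara multiplier holds everywhere, which is the assumption questioned above.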

There are other things that make me skeptical of the reliability of these statistics. In the World Map statistics I looked for data from my home county, Montour (PA). For four days running it reported an incidence of 54, while the PA Dept of Health gave 29. The discrepancy: Geisinger Medical Center, a tertiary care center in Montour County, takes in patients from a region containing several neighboring counties; some of the incidence data reported initially from Geisinger contained data that should have been attributed to these other neighboring counties; the error was corrected later, but evidently not taken into account by the World Map.

Here’s another: the CDC changed guidelines (around April 19th) for attributing deaths due to covid-19. Their guidelines suggested that deaths where covid-19 may have been a likely factor should be included in statistics, even though no test for covid-19 had been made. As a result deaths due to causes unrelated to covid-19 have been included in the stats, for example drug overdose deaths in California. If you look at a graph of reported Pennsylvania deaths you’ll see a sharp spike when the reporting changes were made. You’ll see similar behavior if you look at data for US deaths and NYC deaths, but not for deaths reported from other countries. So this jump is an artifact. There may be other reporting artifacts.

Looking at the number of confirmed cases or reported deaths one sees a weekly cycle. Numbers decrease on Saturday, Sunday and Monday and then jump again on Tuesday. Are the computers resting or are people not going to hospital on weekends? A similar cycle is seen for European countries, but not for Asian or Middle Eastern.

Now even though one may be skeptical about whether these stats give true numbers, they can still be used for comparison. For example, if I compare Montgomery County, a Philadelphia suburb, with my county, Montour, I find this ratio for cases: 4043/47 = 86. (The population ratio is only 46.) If I compare the case fatality rate for covid-19 (# of deaths/# of cases, as a percentage) for the US with that of S. Korea, I get 5.7% (and rising slightly) for the US, compared to 2.3% for S. Korea (approximately constant this last month). I’ll leave it to the reader to draw conclusions from these comparisons.
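A short Python sketch makes the county comparison explicit by dividing the case ratio by the population ratio. The case counts and the population ratio are the ones quoted above; the death and case totals passed to `fatality_rate_percent` are round, hypothetical numbers chosen only to reproduce a 5.7% rate, not actual reported totals.

```python
montgomery_cases = 4043
montour_cases = 47
population_ratio = 46  # Montgomery / Montour, as given in the text

case_ratio = montgomery_cases / montour_cases      # about 86
per_capita_ratio = case_ratio / population_ratio   # about 1.9

def fatality_rate_percent(deaths, cases):
    """Case fatality rate: deaths as a percentage of reported cases."""
    return 100 * deaths / cases

print(round(case_ratio), round(per_capita_ratio, 2))  # 86 1.87
# Hypothetical round totals, for illustration only.
print(round(fatality_rate_percent(57_000, 1_000_000), 1))  # 5.7
```

On these numbers Montgomery has a little under twice as many reported cases per resident as Montour. Both fatality rates use reported cases as the denominator, so they inherit whatever undercounting affects each country’s case numbers.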

Here’s another comparison that’s of interest, the ratio of recovered / deaths. For the US, the present value (28 April) is about 3.2 and rising slightly. For South Korea, the value is 36 and approximately constant. For various European countries it ranges from less than 1 (UK) to about 20 (Germany).

Let’s take a look now at what insights can be gleaned from daily difference and double difference manipulations.

RATES OF CHANGE AND THEIR RATES OF CHANGE; SLOPES AND CURVATURE

Here’s what to look for in the difference (Δ) and double difference (Δ(Δ)) numbers. When the difference numbers start to decrease consistently it signifies that the rate of increase is slowing down and the curve is starting to flatten out. This is also shown by the double difference numbers: they become consistently negative. Some such decreases appear on weekends, only for the numbers to start rising again on Tuesdays. Accordingly, a trend is indicated only if it lasts for at least a week. Here is an example for incidence of PA cases, given as a graph:

You see a weekly cycle superimposed on a slight decrease in the difference. So one could conclude that the curve is flattening slightly, although another week of data would be nice to confirm that. If you look at the double difference for this data, it would appear to go randomly up and down about zero.
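One simple way to see past the weekly cycle is a 7-day trailing average of the daily differences. To be clear, this is a swapped-in smoothing technique and not something done in the spreadsheet. The sketch below uses invented daily counts (three weeks, each starting on a Tuesday, with dips on Saturday, Sunday and Monday), and the function name `trailing_mean` is hypothetical.

```python
def trailing_mean(values, window=7):
    """Mean of each full run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily new-case counts, Tuesday through Monday, for three weeks.
# Each week repeats the same weekly pattern, 50 cases per day lower than the last.
daily = [
    1500, 1600, 1650, 1600, 1300, 1200, 1100,
    1450, 1550, 1600, 1550, 1250, 1150, 1050,
    1400, 1500, 1550, 1500, 1200, 1100, 1000,
]

smoothed = trailing_mean(daily)
print([round(v) for v in smoothed[:3]])  # [1421, 1414, 1407]
```

Because every 7-day window contains exactly one of each weekday, the weekly cycle averages out and the smoothed series in this example declines steadily, while the raw daily numbers go up and down. The price is that the first six days produce no smoothed value.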

Most of the quantities that I’ve surveyed in this mini-study show a similar recent behavior: a slight decrease of the difference (rate) on which is superimposed a cyclic variation. I could do some heavy statistics, regression or time series fitting with linear and sinusoidal components, but the game isn’t worth the candle given the basic unreliability of the fundamental data.
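For the curious, here is roughly what a fit with linear and sinusoidal components could look like, as a sketch in plain Python. Everything in it is an assumption made for illustration: the model form (a straight line plus one sine and cosine term with a fixed 7-day period), the function names, and the synthetic noise-free data. It is not an analysis of the actual spreadsheet numbers.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_trend_plus_weekly(y, period=7):
    """Least-squares fit of y[t] = a + b*t + c*sin(w*t) + d*cos(w*t), w = 2*pi/period."""
    w = 2 * math.pi / period
    rows = [[1.0, float(t), math.sin(w * t), math.cos(w * t)] for t in range(len(y))]
    # Normal equations: (A^T A) x = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    aty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(4)]
    return solve(ata, aty)

# Synthetic daily differences: a known decline of 10 per day plus a weekly wave.
y = [1500 - 10 * t + 200 * math.sin(2 * math.pi * t / 7) for t in range(28)]
a, b, c, d = fit_trend_plus_weekly(y)
print(round(a, 3), round(b, 3), round(c, 3))
```

On clean synthetic data the fit recovers the slope `b` (the trend) separately from the weekly amplitude. With data as unreliable as the real reports, the fitted numbers would look more precise than they deserve to, which is the point made above.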

FINAL THOUGHTS

I am less optimistic at this time about the ultimate value of this number crunching. By the time the data show long term trends one will be aware of them without needing a statistical confirmation. Well, it does help pass the time. Not as pleasurable as rereading Jane Austen and Trollope, or rewatching Gilbert and Sullivan productions, but more socially valuable?

NOTE

If any of you readers would like PDF copies of the spreadsheet on which this piece is based, please say so in a comment. An email address is required for commenting, so I’ll be able to email you the PDF file.