Date: 2020-04-28
“There are three kinds of lies: lies, damned lies and statistics.” Disraeli? Mark Twain? (see here)
INTRODUCTION
This is Part II of a series about dealing with the covid-19 pandemic. Part I dealt with government folly, focusing on Pennsylvania. In this article, Part II, I want to give some statistics and discuss what they tell us, even though for the most part they are incomplete and/or misleading. That last statement pains me greatly. I’ve outgrown most of my mild Asperger’s syndrome, but I still want to believe (with Pythagoras?) that numbers represent reality faithfully.
Since the middle of March, I’ve been putting statistics from the Johns Hopkins world map and from the Pennsylvania Department of Health into a Numbers spreadsheet (the Mac equivalent of Excel) and playing with them, as follows:
- US “confirmed” cases, deaths, recovered cases;
- Pennsylvania confirmed cases, deaths;
- Montgomery county (Philadelphia suburb) cases, deaths;
- Montour county (rural, north-central PA) cases, deaths;
- S. Korea “confirmed” cases, deaths, recovery.
The locales were chosen to illustrate different environmental situations (and different ways of coping with covid-19).
Now, even though the values for the numbers from the sources do not represent reality faithfully, they may give a good representation of trends, that is, of how things are changing with time. That would require that the data be taken in a uniform way, so that if there is a bias, it remains the same for each data point and would thus tend to cancel out when differences are taken. To see how things change with time and thus determine a trend, I’ve taken daily differences (represented by “Δ(something)”) and differences of the daily differences (represented by “Δ(Δ(something))”). The daily difference is a measure of the rate of change (graphically: the slope of the plot, “speed,” or first derivative); the difference of the daily difference is a measure of how fast the rate of increase is changing (graphically: the curvature of the plot, “acceleration,” or second derivative).
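For readers who would rather see this in code than in a spreadsheet, here is a minimal Python sketch of the two difference operations. The helper name `diff` and the six-day series are assumptions made up for illustration; they are not the Johns Hopkins or PA Dept of Health numbers.

```python
def diff(values):
    """Return successive differences: values[i+1] - values[i]."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Hypothetical cumulative counts for six consecutive days (not real data).
cumulative = [100, 130, 170, 220, 260, 290]

delta = diff(cumulative)      # Δ: daily new counts (first difference, "speed")
delta_delta = diff(delta)     # Δ(Δ): change in the daily count ("acceleration")

print(delta)        # [30, 40, 50, 40, 30]
print(delta_delta)  # [10, 10, -10, -10]

# A constant additive bias (here an assumed over-count of 25 every day)
# drops out of the differences entirely.
biased = [value + 25 for value in cumulative]
print(diff(biased) == delta)  # True
```

Note that only a bias that adds the same amount to every data point cancels this way. A bias that multiplies every count by the same factor would scale the differences by that factor, though the shape of the trend would survive.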
Now that the methodology has been explained (I hope), let’s look at the data.
WHAT THE NUMBERS TELL US (AND DON’T TELL US)
If you look at the featured image, number of US recoveries from covid-19, versus time, you’ll see a steady increase, going from about 5000 in the middle of March to a little less than 120,000 as of today (28 April). Well, as the Church Lady might say, “Isn’t that special!” Indeed it is, but given that just shy of a million cases for the US have been reported as of April 28th, it isn’t that encouraging—only about a 12% recovery rate.
But wait! Have all US covid-19 cases been reported? What about incidence of asymptomatic cases and cases where the symptoms were so mild that cases were never reported to a hospital? A study of antibody prevalence in Santa Clara County, California indicated that many more people have been infected with covid-19 than has been reported:
“These prevalence estimates represent a range between 48,000 and 81,000 people infected in Santa Clara County by early April, 50 (to) 85-fold more than the number of confirmed cases,” Eran Bendavid et al., “COVID-19 Antibody Seroprevalence in Santa Clara County, California”
If 50-fold is an accurate estimate of the missed covid-19 cases, that would give about 50 million (50 times the roughly one million reported cases) recovered from the virus. That number is clearly an over-estimate for the whole US. The population of Santa Clara County is about 2 million, so it’s densely populated (relative to flyover country). The infection prevalence is likely to be much higher there than where I live (only 47 reported cases out of the 18,000 county population). Even so, it’s evident that a large number of covid-19 infections have not been reported, so the “recovery numbers” represent a biased sample. The trend is encouraging, but the absolute numbers are not meaningful.
There are other things that make me skeptical of the reliability of these statistics. In the World Map statistics I looked for data from my home county, Montour (PA). For four days running it reported an incidence of 54, while the PA Dept of Health gave 29. The discrepancy: Geisinger Medical Center, a tertiary care center in Montour County, takes in patients from a region containing several neighboring counties; some of the incidence data reported initially from Geisinger contained data that should have been attributed to these other neighboring counties; the error was corrected later, but evidently not taken into account by the World Map.
Here’s another: the CDC changed guidelines (around April 19th) for attributing deaths to covid-19. Their guidelines suggested that deaths where covid-19 may have been a likely factor should be included in statistics, even though no test for covid-19 had been made. As a result deaths due to causes unrelated to covid-19 have been included in the stats, for example drug overdose deaths in California. If you look at a graph of reported Pennsylvania deaths you’ll see a sharp spike when the reporting changes were made. You’ll see similar behavior if you look at data for US deaths and NYC deaths, but not for deaths reported from other countries. So this jump is an artifact. There may be other reporting artifacts.
Looking at the number of confirmed cases or reported deaths one sees a weekly cycle. Numbers decrease on Saturday, Sunday and Monday and then jump again on Tuesday. Are the computers resting or are people not going to hospital on weekends? A similar cycle is seen for European countries, but not for Asian or Middle Eastern.
Now even though one may be skeptical about whether these stats give true numbers, they can still be used for comparison. For example, if I compare Montgomery county, a Philadelphia suburb, with my county, Montour, I find this ratio for cases: 4043/47 = 86. (The population ratio is only 46.) If I compare the fatality rate percentage for covid-19 (# of deaths/# of cases, %) for the US with that of S. Korea, I get 5.7% (and rising slightly) for the US, compared to 2.3% for S. Korea (approximately constant this last month). I’ll leave it to the reader to draw conclusions from these comparisons.
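The arithmetic behind these comparisons can be sketched in a few lines of Python. The case counts and the population ratio are the ones quoted above; the per-capita step and the function name `case_fatality_percent` are my additions for illustration, and the deaths/cases pair in the last lines is an assumed round-number example, not a reported figure.

```python
# Figures quoted in the article.
montgomery_cases = 4043
montour_cases = 47
population_ratio = 46        # Montgomery population / Montour population

case_ratio = montgomery_cases / montour_cases
# Dividing by the population ratio gives the ratio of cases per resident.
per_capita_ratio = case_ratio / population_ratio

print(round(case_ratio))           # 86
print(round(per_capita_ratio, 2))  # 1.87

def case_fatality_percent(deaths, cases):
    """Reported deaths as a percentage of reported cases."""
    return 100.0 * deaths / cases

# Assumed illustrative counts: 57 deaths per 1000 reported cases.
example_rate = case_fatality_percent(deaths=57, cases=1000)
```

Both ratios inherit whatever undercounting affects the reported case numbers, so they are best read as comparisons between locales and not as true rates.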
Here’s another comparison that’s of interest, the ratio of recovered / deaths. For the US, the present value (28 April) is about 3.2 and rising slightly. For South Korea, the value is 36 and approximately constant. For various European countries it ranges from less than 1 (UK) to about 20 (Germany).
Let’s take a look now at what insights can be gleaned from daily difference and double difference manipulations.
RATES OF CHANGE AND THEIR RATES OF CHANGE; SLOPES AND CURVATURE
Here’s what to look for in the difference (Δ) and double difference (Δ(Δ)) numbers. When the difference numbers start to decrease consistently it signifies that the rate of increase is slowing down, that is, the curve is starting to flatten out. This is also shown by the double difference numbers: they become negative and increase in magnitude. Some such decreases are seen on weekends, only for the numbers to start rising again on Tuesdays. Accordingly, a trend is indicated only if it lasts for at least a week. Here is an example for incidence of PA cases, given as a graph:
You see a weekly cycle superimposed on a slight decrease in the difference. So one could conclude that the curve is flattening slightly, although another week of data would be nice to confirm that. If you look at the double difference for this data, it would appear to go randomly up and down about zero.
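One common way to separate a slow trend from a weekly cycle is a seven-day moving average. I did not use it for the graph above; the sketch below is offered only as an illustration, and both the helper name `moving_mean` and the two weeks of daily counts are made up for the purpose.

```python
def moving_mean(values, window=7):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical daily new-case counts for two weeks (not real data).
# The second week repeats the first week's pattern, 7 cases lower each day.
daily = [80, 120, 130, 125, 120, 90, 70,
         73, 113, 123, 118, 113, 83, 63]

smoothed = moving_mean(daily)
print(smoothed)
# [105.0, 104.0, 103.0, 102.0, 101.0, 100.0, 99.0, 98.0]
```

Because every seven-day window contains each day of the week exactly once, the weekend dips and Tuesday jumps average out and only the slow decline remains.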
Most of the quantities that I’ve surveyed in this mini-study show a similar recent behavior: a slight decrease of the difference (rate) on which is superimposed a cyclic variation. I could do some heavy statistics, regression or time series fitting with linear and sinusoidal components, but the game isn’t worth the candle given the basic unreliability of the fundamental data.
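For the curious, here is roughly what such a fit with linear and sinusoidal components could look like: an ordinary least-squares fit of a constant, a linear trend, and one sine/cosine pair with a seven-day period. This is a sketch under stated assumptions, not something I ran on the spreadsheet. The function names, the seven-day period, and the synthetic test series are all my own choices for illustration.

```python
import math

def design_row(t, period=7.0):
    """Regressors for day t: constant, linear trend, one sine/cosine pair."""
    angle = 2.0 * math.pi * t / period
    return [1.0, float(t), math.sin(angle), math.cos(angle)]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_trend_plus_weekly(values, period=7.0):
    """Least-squares coefficients [a, b, c, d] of
    a + b*t + c*sin(2*pi*t/period) + d*cos(2*pi*t/period)."""
    rows = [design_row(t, period) for t in range(len(values))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, values)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic four-week series built from known coefficients (not real data).
true_coeffs = [100.0, -1.0, 20.0, 5.0]
series = [sum(c * x for c, x in zip(true_coeffs, design_row(t)))
          for t in range(28)]
fitted = fit_trend_plus_weekly(series)
```

On this noise-free synthetic series the fit recovers the coefficients it was built from; the second coefficient is the daily trend with the weekly cycle accounted for. With real, unreliable counts the coefficients would be only as trustworthy as the data, which is the reason I did not bother.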
FINAL THOUGHTS
I am less optimistic at this time about the ultimate value of this number crunching. By the time the data show long term trends one will be aware of them without needing a statistical confirmation. Well, it does help pass the time. Not as pleasurable as rereading Jane Austen and Trollope, or rewatching Gilbert and Sullivan productions, but more socially valuable?
NOTE
If any of you readers would like pdf copies of the spreadsheet on which this piece is based, please so indicate in a comment. Your email addresses are given as a prerequisite for commenting, so I’ll be able to email you the pdf file.
A question, please, from one who hates math and almost changed majors in college because the Poli Sci department required a course in social statistics for a degree (I talked the department head into letting me take pre-calculus trig instead): are there any of the various rates and ratios being pushed by the media that are of any real value in understanding what’s going on? Or should we just be looking at raw numbers, and trying to judge which ones are reasonably probative of something, given how much book-cooking seems to be going on by governments and their agencies?
Here’s another: the CDC changed guidelines (around April 19th) for attributing deaths due to covid-19. Their guidelines suggested that deaths where covid-19 may have been a likely factor should be included in statistics, even though no test for covid-19 had been made. As a result deaths due to causes unrelated to covid-19 have been included in the stats.
Bob:
1) How do you get from “may have been a likely factor” to “causes unrelated”? Seems like a huge jump there.
2) I’ve talked to 6 doctor friends in 3 states since the ‘false statistics’ meme arose last week and all told me there is no way there is any appreciable inflation of the actual deaths. For example, my radiologist friend in Boston said there is no need for a nasal or blood test when the imaging is so distinctive. My fellow parishioners said no one stands over them when they sign death certificates.
3) There is going to have to be a retrospective CDC estimate of COVID-19 deaths that were not counted in the actual death totals. This is no different than what they do with the flu every year. It should be noted that most flu stats that are thrown around, such as the “up to 90,000 deaths in a bad flu year” that Don quotes, are estimates based on comparatively small actual numbers. So, may I ask, why the major concern with exact COVID counts and a lack of such concern with exact flu counts?
Don or Bob: I don’t know how my full name got posted with that comment, but if possible could you change it back to just “Frank?” If that’s not possible, no problem, I am retired and don’t have a job to worry about any more. 🙂
Tom, thanks for your comment. I’ll admit that saying “causes unrelated” was perhaps too strong. However, look at the link below where drug overdose deaths were included. I’ve looked at the CDC directive as it’s been quoted on the PA Dept. of Health website and it does say “likely” without much other qualification. In other words, people tested positive for covid-19 but died of other serious conditions UNRELATED to covid-19.
By the way, I worked with radiologists and MD’s in a tertiary care facility, taught undergraduate premeds and graduate radiology interns, and I would trust only about 10% of them to make statistically valid inferences. (I apologize beforehand to MD’s reading this comment.) In any case, it may very well be the case that some deaths with other conditions should be included, but I doubt that all should. And that makes the data suspect, as indicated by the way it jumped for US, NYC, PA cases, whereas that for other countries did not.
Frank, as I said in the article, I don’t feel that raw data–cases or deaths–are useful. I think changes are useful in indicating trends, but because the sources are so unreliable, it is only comparisons or changes that give information.
Looking at the number of confirmed cases or reported deaths one sees a weekly cycle. Numbers decrease on Saturday, Sunday and Monday and then jump again on Tuesday. Are the computers resting or are people not going to hospital on weekends?
Saturday’s numbers represent Friday’s reports, Sunday, Saturday’s and Monday Sunday’s. The paper and pencil pushers are taking the weekend off. The numbers trickle in on Monday and get reported on Tuesday.