Different sources; different data
When I read the news about the numbers of cases and deaths for the coronavirus and I compare those numbers against my set, I see that the numbers are slightly different.
I pull from two different Wikipedia pages (one with data for countries around the world and the other for the US), and each of them gave a different death count: 147,650 and 134,301. The Covid Tracking site managed by The Atlantic showed 137,450. Johns Hopkins had 146,050. And finally, the New York Times reported 145,430. The two Wikipedia numbers and the Covid Tracking number were pulled last night, 7/24, while the Johns Hopkins and NYT numbers were pulled today, 7/25.
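To put the gap in perspective, here is a rough sketch in Python that lines up the five figures quoted above and works out the spread between the highest and lowest. The source labels are my own, and I am assuming the two Wikipedia numbers are listed in the same order as the pages (world first, then US).

```python
# Death counts quoted above (pulled 7/24 and 7/25).
# Assumption: the two Wikipedia figures follow the order in which the pages
# are mentioned (world page first, US page second).
death_counts = {
    "Wikipedia (world page)": 147_650,
    "Wikipedia (US page)": 134_301,
    "Covid Tracking Project": 137_450,
    "Johns Hopkins": 146_050,
    "New York Times": 145_430,
}

# Find the sources with the highest and lowest counts.
highest = max(death_counts, key=death_counts.get)
lowest = min(death_counts, key=death_counts.get)

# Absolute spread, and the spread as a percentage of the lowest count.
spread = death_counts[highest] - death_counts[lowest]
spread_pct = 100 * spread / death_counts[lowest]

# List the sources from lowest to highest with their gap to the lowest count.
for source, count in sorted(death_counts.items(), key=lambda kv: kv[1]):
    gap = count - death_counts[lowest]
    print(f"{source:<24} {count:>8,}  (+{gap:,} vs lowest)")

print(f"Spread: {spread:,} deaths, about {spread_pct:.1f}% of the lowest count")
```

Keep in mind the figures were not all pulled on the same day, so part of the spread is just timing.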
What’s going on?
Background
When I first started looking into this coronavirus, my main goal was to learn how to use Power BI and maybe follow the thinking of scientists and journalists on how they track and analyze what is going on. It was basically a learning exercise. My first foray into this started with doing an internet search on “coronavirus” and seeing what was out there.
The first site that allowed me to pull in data was Wikipedia, and all of my initial Power BI visualizations stemmed from the data pulled from Wikipedia. Later on, I read about the Covid Tracking Project and found that its site included APIs as well as articles analyzing the trends and the prognostications. So far, I haven’t found any mechanism to pull data from the Johns Hopkins site, but that site is really beautiful and well worth going through. And today, I finally found a repository of data on cases and deaths from the New York Times sitting on GitHub.
In the beginning, I was interested in how I could pull in data using Power BI and in watching how the infection spread throughout the US over time. I was not really concerned about precise or accurate numbers; I was more concerned about trends. And I really didn’t think that the numbers would be so different.
My thoughts on why the numbers are different: how data is collected
It could be timing, but I think the more fundamental reason is how the data is collected. My guess is that Wikipedia is updated manually by volunteers. Maybe those volunteers update the figures by looking at the state sites.
The Covid Tracking Project also seems to be updated from the states, as well as from meetings and webinars. I have noticed that the Wikipedia data comes trickling in earlier in the day, while the Covid Tracking Project appears to be updated around 4 or 5 pm. When I’m updating through Power BI, I’m doing it between 9 and 10 at night. The Wikipedia and Covid Tracking state data (not the Wikipedia data from the world comparison) track pretty closely to each other.
The Johns Hopkins site may be pulling data from additional sources, because its death counts are always greater than those in the Wikipedia state data or the Covid Tracking Project. The Johns Hopkins count tracks more closely with the Wikipedia world data. And I almost think that Johns Hopkins has some kind of automation going on, where programs pull in data from the state sites and whatever other sources of data there are. But Johns Hopkins also references the Covid Tracking Project, so that is interesting.
The New York Times has said it is using state data, and maybe additional sources, because its count is closer to Johns Hopkins. The site was specific in mentioning that it does not rely on the CDC for data, so the quality of the data should not be imperiled by the recent governmental stipulation that hospital data bypass the CDC and go to the HHS.
Possibility two: definition of what goes into the count
There are also “rules” as to what gets counted in the death counts (and possibly in the infection case counts). For the death counts, a lot of states only include deaths that have been confirmed as coronavirus related, and there is a suspicion that coronavirus related deaths are being undercounted as a result, possibly by as much as 10 times. Some states, but not all, also include probable deaths in the count, where the death has not been confirmed by testing as coronavirus related but the prior symptoms suggest that the coronavirus could have been involved.
I haven’t sifted through what kind of death count each site is using, but I have a feeling that sites manned by volunteers are probably not consistent in their counts. It’s enough to get a ballpark number. These are volunteers freely giving up their time to help log historical information for free access. The main purpose is to tell a story.
I’m sure Johns Hopkins, the Covid Tracking Project and the New York Times have had to change their collection processes, because we are all learning as we research and dig into what is going on with this virus. This is a new virus and a huge collective effort to try to figure out how to get us out of this crisis with as many people as we can.
At the end of the day…
I’m still just interested in the trends and in how we can delve into the data to spot them or to project where we’re going. I’m not really as concerned about the precise numbers, because everybody is doing the best they can. Precision is not the goal. So 137,000 versus 147,000… a 10,000 difference is, yes, kind of huge, but my graphs track in the same direction as everybody else’s and, at this point, it is all we have. When the news discusses such and such state having surges, and I can say, “yes, I saw that starting to happen last week or two weeks ago”, then I know that I’m spotting trends early on. So learning and catching trends early on is enough for me.
Oh, and also, getting on the other side of this crisis.