|

Playing around with John Hopkins Coronavirus data

I’ve had a chance to play around with the John Hopkins data and found that there is a little twist in how Power BI pulls in the data. The access is not like the Wikipedia or the Covid tracking project; some additional manipulation needs to be done.

The GitHub twist

First, be aware that the John Hopkins University (JHU) data resides on GitHub, so for average folks like me, there’s some “tricks” to deploy in order to pull in data. When you first access the data, you will see something like the following image – yeah, it’s kind of blurry but it’s from “Snip & Sketch”:

The site web address is https://github.com/CSSEGISandData/COVID-19

I went into the CSSE_Covid_19_data folder, as highlighted in yellow.

For the time being, I’m using the stuff in the time series folder, which contains data for both US and global data for confirmed cases and confirmed deaths and global recoveries. Click on any of the desired links and you will encounter a screen like either of the following:

For screens that show a “download” button, click on that button and you should get something like a screen full of numbers. You want to use that page’s https address when “getting data” in Power BI or power query.

For pages that don’t offer up a “download” button but does include a “raw” button, click on that “raw” button instead. You’ll get the same string of numbers running across the screen. Again, it is that page’s https address you want to use in Power BI/power query.

The problem with the JHU set up

The issue that I had was Power BI would not update with the latest day’s information, so if I pulled in information through November 28, that was it. When November 29 data got posted onto GitHub the next day, Power BI would not pull in the update. Here’s why:

Look at the code for the Source step of the Applied steps (the red circle marks out the Source step). In that code you will see a very specific coding showing Columns = 324 (highlighted in yellow) which I think was related to November 28; however, everyday will bring in a new column, and the code only goes up to column 324. I needed something that will keep on adding columns each day.

Even if I were to correct the column to include the latest additions (say Column=325 for November 29), I would still encounter another problem:

The column headings are very specific in the “Changed Type” step. The first image on the left shows the front end of the code for “Changed Type” step and the second image shows the code scrolled to the end where one can see “Column324” (highlighted in yellow). However, if you had added in a new column in the “Source” step, Column325 for November 29, that new column will not be part of the “Changed type” and could kick up an error message.

In short, the thing to watch out for in the codes for the Applied steps (or in the Advanced Editor) is anything that is very specific such as a header or column header that could change in the next update because the source data is constantly changing the header or column header. So far, for me, the issues have stemmed from dates changing, so if a date is part of a generic header or column, be aware there could be problems).

My solution for this specific situation

The following site offers a solution to issues like this where the Source code or the column headers are so specific but prone to changing that Power BI gives you an error message.

https://www.howtoexcel.org/power-query/how-to-deal-with-changing-data-formats-in-power-query/

I only needed to use one of the solutions offered by the site and the rest was a matter of rearranging the process. I have two tables being pulled from the JHU GitHub site: the confirmed cases and the confirmed deaths. The confirmed cases table is the table that have been corrected to enable pulling up new dates while confirmed deaths table was left untouched so I could retrieve images for explication purposes.

Briefly, here are the changes I made to the code in the Applied steps:

  1. Source: changed “Column=324” to “Column=500”. This action doesn’t eliminate the problem totally but it does extend the time that I can refresh the Power BI without having to change this line of code. What this does is bring in a bunch of empty columns which get eliminated later on. On the left is the image from the confirmed death table showing “Column=324” and on the right is the image from the confirmed cases table showing the change.
  1. Changed Type: It is safe to delete this step. This step is always added by Power BI, even if you don’t actually take steps to change type.
  1. Promoted Headers: After pulling in the data in the Source, I usually try to promote headers if it is possible – sometimes there are other junk rows that have to be deleted first but not in this JHU data source. The code for the Promoted Headers step does not include any specific headers so it is safe to perform at this point, but be aware that far off to the right there will be columns with blanks.
  1. Normally after promoting rows/headers I would start changing the type of data. For instance, I would make sure the date columns with confirmed cases or confirmed deaths data underneath are set to whole numbers rather than text, but because of those new additional columns that are empty, Power BI will throw up an error when I refresh the data due to those columns heading being changed. I don’t want to change type at this point – I need to do it later in the steps, if it is possible.
  1. So before changing the data type, I will unpivot all of those columns with dates as headers. I want those dates to be in its own column. By unpivoting those columns, the table will be better set up for charting. Note that the code for unpivoting does not reference to any columns whose heading could be changed during an update. All of the column names in this unpivot step will remain constant upon future refreshes.
  1. Now I can change the data type for the dates from text to date types and the “value” column from text to whole numbers. I can also change the column heading to a better description.
  1. After changing the data types, you may notice that errors started to pop up under the date column, so I had to remove rows with errors – there is an option to remove rows with errors under “Remove rows” button in the menu. I also filtered out any nulls.

After that, the tables from JHU GitHub are now ready to be refreshed in Power BI.

Similar Posts