Unpivoting your data source
I love using Pivot tables in Excel because they can make gathering, slicing and dicing data so much easier; however, I probably don’t like them as a data source. This post will kind of explain why using a data source laid out in a “pivot table” fashion may not be the best way to pull in data in Power BI and I will explain how to arrange the data for easier use – so far that I know of.
For this post the source data I’m using is: https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_pandemic_data/United_States_medical_cases
When I first started using Covid-19 data as a way of learning more deeply how to use Power BI, I found data in Wikipedia that was presented as a table with data running down the left-hand column and states running across the top as column headers. This is the “pivot table” format I’m talking about.
It looks something like the following. I had to cropped the image to fit the dimension that I get when I export Power BI PDFs – that particular size seems to work okay when uploading to the website. All images are created via Microsoft’s “Snip & Sketch” app, so the images will not be crisp but hopefully the image will still be legible.
When I pull that data into Power BI I get the following data table:
I had to do some transformation (which I will later show) to finally arrive at a data table looking like this:
So far so good. When you look at the list of fields, you will find each state has its own field for the daily value of confirmed cases of infections.
I suspected I really wanted a column for Region, a column for States and then a final column for the daily infection cases in the data table, but I went ahead and tried out this table, just for experimentation purposes. I quickly found that when I wanted to calculate the daily cumulative totals for each state, the daily 7-day moving average (or running totals) for each state, and the max single day infection count, I found I had to create new calculation measures for each state.
I ended up with a lot of measures:
Every time I wanted to do a new state, I had to create its own set of measures and those measures were just growing.
But if I unpivoted that “pivot table”, I can reduce the number of measures being created down to only one cumulative totals (or running totals), one 7-day moving average, one max single day infection count. So I want to change the data table from A to B:
A
to
B
And now the field list look something like:
Why will unpivoting reduce the number of measures? By unpivoting, I created a column for states, a column for region and a column for daily infection rates. Now I can create generic measures for moving average, maximum infection case count, running totals rather than creating state specific calculation measures. Then when I want to look at a specific state, I can just filter on that state (because I now have a new state column). Without that separate column for states, I can't do any filtering of states in the filter pane.
I haven’t yet created a measure for cumulative/running totals because I’m currently thinking of not using it, but we’ll see. Also, you will note that there are other calculation measures (Change, %Change, LastWeek_Moving_Ave, PerCapita). Those are additional calculations that I'm not going to be talking about in this post.
Now with this simplified field list, I can specify which state or region (Attribute.1 holds the regions) in the filter pane rather than creating a new set of calculation measures for each state graphic I want to create.
How I created the data table
Upon first pulling in the data, the table will need to be fixed before you can unpivot the data into the final form. I want to move from A (raw unformatted data from the source) to B (cleaned up data but pivoted):
A
to
B
Here are some of the notable steps:
1. Filter out the non-date rows in the first column which holds the dates.
2. Change that date column from text to date type.
3. If an error occurs from changing the text into date type, remove that error using “remove rows that contain errors”. That should generally do the trick. The "error message" will be a little red bar rather than a green bar beneath the column header.
4. Note that all of the columns with daily infection cases are in text. Change them into integers or numbers.
5. Remove any columns you don’t want. In this table being pulled from Wikipedia, there is an extra date column so I removed that column.
Now, we’re at the point where I originally was that led to all of those calculated measures for each state.
To transform the table further to reduce the number of calculated measures:
1. I used the unpivot feature in Power BI. The unpivot feature is found in the Transform tab. See graphics 1A and 1B below.
2. We get a new column showing both the region and the state, amongst other things. We need to split the region from the state by using the Split Column feature found under the Home tab.
See graphics 2A and 2B.
3. We get two new columns: Attribute.1 and Attribute.2. I renamed Attribute.2 to States and I eventually renamed Value to Daily Confirmed Cases.
1A
1B
2A
2A
2B
I eventually ended up with a table like below which shows an additional column so I did perform extra transformation of the table, but I won’t be describing them today. The extra column was to assist in calculating the last week's 7-day moving average for comparison against “today’s” 7-day moving average to see what kind of % change we are seeing in the moving day average.
That's it for this post. There are other things that I've been doing but I want to end this post right here.