Update #3
This post is part 3 of my Power BI updates. Post 1 was an update on the state tracking on Power BI and contained a rehash of how I created some measures such as running total (or cumulative) from the daily cases, moving averages for “x” days (in my case, I’m using 7 days moving average), and maximum infection case number. I also talked about how I cleaned up the data and I want to continue on this topic in this post because I did some changes. And finally Post 1 ended up with new state charts showing the new colored bars and the per capita charting. Post 2 contains all of the images of charts for 5/9/2020, including the new states that were re-opening their economy. I want to see if cases show an uptick since re-opening. Right now, I’d say the results are mixed.
In this third and final installment, I will repeat the data clean up and add the changes, show how I imposed colored bars in the chart, and how I developed the per capita measures. In addition, there will be a brief discussion about linking tables because I brought in population figures for per capita measures, which required linking state’s population with the daily infection cases.
Before diving into my post, I will start off again with my data sources, provide a refresher on some calculations of cumulative or running totals and moving average, and then lead into today’s discussion on a new data clean up.
Data Sources
Raw daily data: https://en.wikipedia.org/wiki/Template:2019–20_coronavirus_pandemic_data/United_States_medical_cases
Table – Non-repatriated COVID-19 cases in the US by state
Population data: https://worldpopulationreview.com/states/
Table – Table(0)
State abbreviations: https://www.50states.com/abbreviations.htm
I actually copied the data into Excel because the site would not let me pull directly from the website. Then Power BI pulled the excel information.
Stay-at-home orders: https://en.wikipedia.org/wiki/U.S._state_and_local_government_response_to_the_2020_coronavirus_pandemic
Table – Regions that formerly had stay at home orders or advisory
Refresher: DAX calculations for some measures
- Running total or cumulative (I couldn’t make the quick measure work)
South GA running total in South GA = CALCULATE( SUM('Non-repatriated COVID-19 cases in the US by state ( vte )'[South GA]), FILTER( ALLSELECTED('Non-repatriated COVID-19 cases in the US by state ( vte )'[Date]), 'Non-repatriated COVID-19 cases in the US by state ( vte )'[Date]<= MAX('Non-repatriated COVID-19 cases in the US by state ( vte )'[Date]) ) )
- Moving average – This version is a 3-day moving average
South GA_Moving_Ave = CALCULATE ( AVERAGEX ( 'Non-repatriated COVID-19 cases in the US by state ( vte )', 'Non-repatriated COVID-19 cases in the US by state ( vte )'[South GA]), DATESINPERIOD ( 'Non-repatriated COVID-19 cases in the US by state ( vte )'[Date], LASTDATE ( 'Non-repatriated COVID-19 cases in the US by state ( vte )'[Date]), -(3), DAY ))
- Moving average with flexibility to denote how many days (or weeks, months, etc.)
South GA_Moving_Ave = CALCULATE ( AVERAGEX ( 'Non-repatriated COVID-19 cases in the US by state ( vte )', 'Non-repatriated COVID-19 cases in the US by state ( vte )'[South GA]), DATESINPERIOD ( 'Non-repatriated COVID-19 cases in the US by state ( vte )'[Date], LASTDATE ( 'Non-repatriated COVID-19 cases in the US by state ( vte )'[Date]), -('Non-repatriated COVID-19 cases in the US by state ( vte )'[Day Variable]), DAY ))
- Maximum # of cases in a day – this is a quick measure rather than a regular measure. This is easier to create. You don’t really need to write the formula but here it is anyway.
South GA max per South GA = MAXX( KEEPFILTERS(VALUES('Non-repatriated COVID-19 cases in the US by state ( vte )'[South GA])), CALCULATE(SUM('Non-repatriated COVID-19 cases in the US by state ( vte )'[South GA])) )
Data Cleanup – the original plus the new steps
When trying to develop the per capita measures, I realized I could have gone further with the original data cleanup, leading to a very different table structure to enable me to create the per capita measures. I suspect I could have used this method initially when developing the charts but I still need to test this theory; however, I do believe that this different table would have reduced the need to create so many measures that I currently have.
In Post 1, I showed how the table looked when I first pulled the data into the Power Query Editor:
Here’s how it ended when I finished cleaning up the data:
For the per capita measures (a type of measure that eliminates the impact of population differences and becomes apples to apples on a per 100,000 basis), I expanded on that data clean up, but first, I made a copy of that original table. By the way, that table was called “Non-repatriated COVID-19 cases in the US by state (vte)” (the name of the table on the Wiki website). So, I copied this table and named the new table “Per Capita Calculations”.
Upon completion of data cleanup, the new table has the following data structure:
Steps 1 through 7 details the cleanup of the first table; steps 8 through 10 explains what I did to convert the first table into the second table (after duplicating the first table). On the right hand side, the same steps are spelled out in the Advanced Editor in Power Query Editor for those of you who know DAX – yellow highlights are the new additions.
Per Capita Calculation
The per capital calculation is a calculated measure, but first I brought in a table of states’ populations. I have listed my source at the beginning of this blog. When I pulled it in, it looked like the following:
I made no changes to the table.
I also had to bring in a state abbreviations table because the Population table had the full name of the states but the Per Capita Calculation table had the abbreviated form (that form originated from the website). Bringing in additional tables meant I had to build relationships between the tables. I will talk briefly in the next section about relationships, so keep this fact in the back of your mind.
Once I got all of the tables connected, I then was able to write up a new measure “PerCapita”:
PerCapita = (Sum('Per Capita Calculations'[Value])/Sum('Population'[2020 Pop.]))*100000
This measure is a per capita per 100,000. All I’m doing is dividing some value – in this case, the number of infection cases – by the population to eliminate population effects. Then I multiplied that number by 100,000. You could use 1 million rather than hundred thousand. I’m using 100,00 because that’s the common usage and because not all states have a population of over a million.
Table Relationships
This section has to necessarily be very brief and light on how-to information since I’m not an expert in getting tables to link up properly. First, it’s easier to link up tables when most of the tables have the key listed once rather than multiple times. These kinds of links are commonly called one-to-one relationship. For example, the key that would connect the Population and State abbreviation tables would be the states. Population has the full state listed and State abbreviation has both the abbreviated name and the full name, but for both tables, the states are listed once. In the PerCapita Calculations table, the states will be listed multiple times due to the fact that daily data is being captured.
I start trying to link the tables by using the common keys that could connect the tables which in this case would be the states, either the full name or the abbreviation. I admit that this linking is done by trial and error in a search for one-to-one relationship or one-to-many relationship since I'm not an expert yet. I will usually create a matrix table visual with numbers to check that the linkings result in the way I expect it too – I always have a tab for testing numerical results when creating measures.
So, here are my table relationships - the red boxed area is of relevance to today's post.
Alright, I think I'm going to end here. I was going to do my usual thoughts on our current state of affairs but it is getting late and it's time to move on.
You must be logged in to post a comment.