Lately I’ve been reading a lot about people solving hard problems such as saving the world, eliminating poverty, solving climate change. These are BIG, HAIRY problems that won’t be easily eradicated. Most of us don’t deal with stuff like this. But we all do work on hard problems, just on a smaller scale.
Right now I’m working on a new project that entails absorbing a lot of data in a short period of time. For the past week I’ve been working on pulling data that are laid out in different formats in Excel or in PDF. Most of the data were in Excel files with monthly financials running horizontally. But there were a couple that came in PDF files and laid out as either “one month prior, two months prior, three months prior” in horizontal fashion (ie. monthly but not in the January through December timeframe) or PTD (period to date) format. These were historical data and I needed them in order to review some budgets that were submitted.
The formats and layouts were not conducive to slicing and dicing for analysis.
Planning ahead, I also knew we would need a system to track actuals against budget, so I needed something to help me manage the current data as well as future incoming data.
The first half of the week was spent copying and pasting the Excel data into one sheet as a temporary holding fashion. That was the easy part. Then I had to spend some time figuring out how to cull data from the PDF files. This was the hard problem that I was trying to figure out.
The first step was to highlight the relevant data in the PDF file and then paste them into Excel which created one column of data going down the rows. Then, I converted the data using Data – Text to Columns, choosing Delimited and then choosing Space. What this does is separate each element, word or number, into each own’s cells. Since I’m working with GL accounts and their descriptions, the description could be something like “waste disposal”. When converting, waste will fall in one cell and disposal will fall in the cell next to waste and following that column would be the monthly data. GL descriptions may be one word, two words, three words, or however many words. This means the monthly numbers will not line up under the same columns. So I next worked on capturing the GL descriptions in one cell and then lining up the monthly data into the proper columns.
The short technical discussion below provides the basic framework of logic (not just the formulas but the thinking/solving approach) for pulling the PDF converted data into the proper columns. Of course, this discussion is for a specific PDF layout. Other layouts will require modifications of the formulas, but I have found, so far, that these set of formulas provide a good starting point for the logic: most downloads will require pulling together the descriptions, finding the start column with numbers, and the pulling in the rest of the numbers.
After I got a table of monthly cost data for each GL account, my next step was to convert this table into a form usable for pivot tables. Pivot tables require a specific format where you have monthly data running down a column. But…this topic of a macro is for another post. This one is already long enough. The PDF conversion was a hard problem alone and the macro was another.
[divider]Short discussion on formulas to line up PDF data[/divider]
In the example below, PDF converted data starts in column O. Formulas to extract the data into proper columns are written in the cells to the left.
1. I first set up a formula to the left of the data to count up how many columns contain words and mark the first column that contains the monthly data. So “waste disposal” would be in two columns and the following formula would give 3, meaning numbers would start under column 3.
2. The next formula next to the counting column is to pull together the GL description. Column A is the counting column developed in step 1. Column B is to display the GL description:
=IF(A31=5,O31&” “&P31&” “&Q31&” “&R31,IF(A31=4,O31&” “&P31&” “&Q31,IF(A31=3,O31&” “&P31,O31)))
3. Next, since this particular PDF did not display information as January through December but as 11 months prior, 10 months prior, 9 months prior and so on, January information really did not start until 8 months prior. The first month displayed was actually October. So I set up a couple of rows to help me keep track of what month I was dealing with. Row 24 had the 8 in column D, 7 in column E, and so on. This row was to mimic the 8 months prior, 7 months prior and so on. Row 25 had Jan in column D, Feb in column E and so on. Finally, row 23 had the formulas to find the appropriate monthly data in the converted PDF data off to the right:
This formula basically says, looking through columns O through AS on row 31, bring me back the data that is so many columns (D23) to the right of the first column with numbers (A31). Since in this dataset, column A represented October and I needed January, D23 gave me 3 which meant 3 columns beyond the first column of numbers. So for GL description waste disposal, October would begin in the 3rd column of converted data and January would be in the 6th column.