Worked on a Unique Puzzle: Matching Names When There Are Different but Close Spellings
First - it is so hard to get anything done. I'm still trying to update my "About" page so my blogging has slowed down. But, because it is taking so long to do (difficulty with coming up with decent pictures and decent phrasings), I decided I would do a quickie blog.
This past week, I've been working on a really cool puzzle: how to match up names with a site code, especially when there are many spelling permutations.
Every month my boss has to report some financial data to the owner. Most of reporting has been automated but the revenue piece has to be hand coded. That is because of the way the revenue is being booked by the accountant - the revenue for each site is recorded by state rather than by site name, for tax purposes.
So, if you have several sites in Oklahoma, you have to separate out the different revenues and code them into your spreadsheet. There is a description field where the site name can be entered, and it is being entered, but each month the spelling may differ.
I asked my boss, "Why don't you just ask her to work on using the same spelling every month? Or better still, why can't she upload a standard file?"
"No, that is one battle I'm not doing." My boss already has to push her to close the books in a timely manner (this is a private company so things are done a little differently), so she didn't want to expend needed energy and capital on getting her to do consistent naming. The first priority is to get the accounting done.
Okay, so we will have to make do with the descriptions she provides and try to make the automation work with the different spellings she may provide. The spreadsheet will have to be able to recognize that "The Coolest Restaurant" is the same as "Coolest Restaurant" and is also the same as "TCR".
And I think I've done it. I did do some research on fuzzy logic to see if I could learn and apply anything, but I really didn't find anything that I could apply using Excel. And I wanted to avoid programming as much as I could. I also tried using Microsoft's Fuzzy Logic Add In - yes, there is an add-in for this type of conundrum. However, it did not work as well as the final method, largely due to the significant abbreviations she sometimes apply.
My final method was to use some kind of master table that housed possible spelling permutations. On Monday, I started to pull together a list of spellings she had used over the last 6 months and found that the most spelling permutation was 5 different spellings. Most of the sites had either 1 or 2 spelling variations. Now I knew that a VLOOKUP was not going to work because I wasn't created a long list of spelling permutations in a vertical method like:
The Coolest Restaurant 2222 where 2222 is the site code for The Coolest Restaurant
Coolest Restaurant 2222
TCR 2222
I figured that would make the table unwieldy and hard to maintain. Instead I set it up as:
The Coolest Restaurant Coolest Restaurant TCR
I set up each site name along with its various spelling permutations along the same row, with the logic being each row represent one site and its corresponding site code (2222). Then, I set the first cell as a catenation of all of the spelling variations. Let's say this catenation is done in column A. The formula would look like =C2&" "&D2&" "&E2&" "&F2&" "&G2, where C2 contains "The Coolest Restaurant", D2 has "Coolest Restaurant", E2 has "TCR" and cells F2 and G2 are left blank. As I said before, the most spelling variations I found was 5 so cells C2, D2, E2, F2 and G2 represents the 5 possible spelling permutations she might provide. If she ended up creating a 6th variation, I can add a 6th column H and add that into the formula in column A. Column B will be reserved for the site code (2222).
Now I only need to search along A to find a match and pull the corresponding site code in column B. VLOOKUP will not work because the spelling variation will have to exactly match those set in column A. So, if the accountant uses "The Coolest Restaurant", the VLOOKUP won't find that phrasing in column A because column A contains a catenation of multiple spelling variations.
But what will work is the combination of INDEX and MATCH because MATCH can use wild cards "*". So let's pretend you are working with a download of the general ledger that contains the site descriptions the accountant provided in column D. And let's pretend your Master Table resides in another tab called Table. So the formula to search and pull the site code becomes something like: =INDEX('Table'!$B1:SB2000,@MATCH("*"&D4&"*",'Table'!$A1:$A2000,0),1) where D4 is "The Coolest Restaurant".
Now, this isn't the perfect solution but I think it will help out tremendously. The biggest weakness are those sites that are one word and others containing that one word. So if you have one site called Highrise and another site called The Blue Highrise, you might run into problems. We actually have a couple of those, so with those sites, we will have to manually fix them in order to make it work. But that beats manually entering everything.
You must be logged in to post a comment.