Attempting Text Analysis
Lately, I’ve been trying to figure out how text analysis works. Last year, when I was looking at some comments, I created a dictionary of similar phrases so I could see trends in the comments – what kinds of comments came up most frequently. In the business I’m in, certain phrases come up again and again, so one could receive hundreds of comments with the exact same phrasing – the phrases tended to cluster around the same usage. This is very different from searching and analyzing comments on Facebook, where the comments could cover anything, whereas the comments I was looking at were made by employees and covered a small slice of the business. So in this situation, I was able to create a rather small dictionary of similar phrases falling into maybe 15–20 categories of comments. It took a while to create even this small dictionary, though I imagine it could be re-used from year to year.
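If I were to sketch that dictionary approach in Python rather than Excel, it might look like this. The phrases and category names below are made-up illustrations, not my actual dictionary:

```python
from collections import Counter

# Hypothetical phrase-to-category mapping; the real dictionary
# had roughly 15-20 categories.
phrase_to_category = {
    "bought a house": "buying",
    "buying a home": "buying",
    "long hours": "workload",
}

def categorize(comment):
    """Return the category of the first known phrase found in the comment."""
    text = comment.lower()
    for phrase, category in phrase_to_category.items():
        if phrase in text:
            return category
    return "uncategorized"

comments = [
    "I just bought a house!",
    "Buying a home was stressful",
    "We work long hours here",
    "Nothing to report",
]
trends = Counter(categorize(c) for c in comments)
```

The `Counter` at the end gives the trend view: how many comments fall into each category.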
Out of curiosity, I wanted to see if there was a way to automate this process, or at least learn some of what goes into text analysis. I imagine most text analysis is done with more robust tools, maybe Python, but I thought the exercise would at least be instructive. The first stab was to count up the phrases, which was easy enough – a pivot table is sufficient. What was missing, however, was that some phrases were similar and fell into one category, but a pivot table will not count similar phrases as the same category. For example, a pivot table will regard "bought a house" and "buying a house" as totally different, but in my mind they belong to the same category of buying: buying a house, buying a home, bought a home, buying a townhome, etc.
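One way around the pivot table's literal matching is to normalize the words before counting, so that similar phrases collapse into one bucket. A minimal sketch, assuming a hand-built synonym table (a real solution might use a proper stemmer instead):

```python
from collections import Counter

# Hypothetical normalization table mapping word variants to a canonical form.
SYNONYMS = {"bought": "buying", "home": "house", "townhome": "house"}

def normalize(phrase):
    """Map each word to its canonical form so similar phrases collapse together."""
    return " ".join(SYNONYMS.get(w, w) for w in phrase.lower().split())

phrases = ["bought a house", "buying a home", "buying a townhome", "sold a car"]
counts = Counter(normalize(p) for p in phrases)
```

Here the first three phrases all normalize to "buying a house" and count as one category of three, while "sold a car" stays separate.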
So the next attempt was to count individual words to see which words were most common. A problem cropped up: filler words (very important filler words, though) such as “a”, “the”, “in”, etc. would rise to the top, so I had to develop a macro that counts individual words but leaves out the filler words. To do this, I had to create a dictionary of filler words (commonly called stop words).
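The same idea in Python form: count every word across the comments, skipping anything in the stop word list. The stop word set below is a tiny stand-in for the full filler-word dictionary:

```python
from collections import Counter
import re

# A small stand-in for the filler-word ("stop word") dictionary.
STOP_WORDS = {"a", "the", "in", "to", "of", "and", "was", "i"}

def count_words(comments):
    """Count individual words across all comments, skipping stop words."""
    counter = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            if word not in STOP_WORDS:
                counter[word] += 1
    return counter

word_counts = count_words(["Bought a house in the spring",
                           "The house was a great buy"])
```

`word_counts.most_common()` then surfaces the top words without "a" and "the" crowding out everything else.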
Next, I thought, okay, how about finding the most common words associated with each of the “n” most common words? For example, if “home” was the most common single word, count the words that appear alongside “home” and see which is most common. Then do the same for the second most common word, and so on down the list. I decided to restrict the count to the top 5 most common words and then the top 5 words associated with each. I got a macro that accomplishes this, but I’m not sure if this is going in a direction that will be useful.
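A sketch of that co-occurrence idea: take the top-n words, then for each one count the other words appearing in the same comments. This is my reconstruction of the macro's logic in Python, not the macro itself:

```python
from collections import Counter
import re

STOP_WORDS = {"a", "the", "in", "to", "of", "and"}

def tokens(comment):
    """Split a comment into lowercase words, dropping stop words."""
    return [w for w in re.findall(r"[a-z']+", comment.lower())
            if w not in STOP_WORDS]

def top_cooccurring(comments, top_n=5):
    """For each of the top_n most common words, list the top_n words
    that appear in the same comments."""
    token_lists = [tokens(c) for c in comments]
    word_counts = Counter()
    for toks in token_lists:
        word_counts.update(toks)
    result = {}
    for word, _ in word_counts.most_common(top_n):
        co = Counter()
        for toks in token_lists:
            if word in toks:
                co.update(t for t in toks if t != word)
        result[word] = co.most_common(top_n)
    return result

result = top_cooccurring(["bought a home office",
                          "home office setup",
                          "new home"], top_n=2)
```

With this toy data, "home" is the most common word and "office" is its most frequent companion.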
I then started to do research using Google, searching on Excel text analysis, and found mostly word counts or “n”-step word counts (there’s a term for this, n-grams I believe, but I couldn’t find it at the time). So this is where I’m at right now.
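For completeness, here is what that n-step word count looks like when sketched out: slide a window of n consecutive words across each comment and count each run. With n=2 these are word pairs (bigrams):

```python
from collections import Counter

def ngram_counts(comments, n=2):
    """Count runs of n consecutive words (bigrams when n=2) across comments."""
    counter = Counter()
    for comment in comments:
        words = comment.lower().split()
        for i in range(len(words) - n + 1):
            counter[" ".join(words[i:i + n])] += 1
    return counter

bigrams = ngram_counts(["buying a home", "buying a townhome"], n=2)
```

Notice that "buying a" shows up twice here, which hints at why pairs can reveal phrase-level trends that single-word counts miss.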