Tools for Cleaning Data in Excel
I think a couple of posts ago, I wrote something about using Word’s special auto feature so store commonly used words, long words, most often misspelled words, in the “auto dictionary” that comes up when you type in the shortened “code” form. In that post, I used the longest village name as an example.
Well, a funny thing happened when I copied and pasted that word into Excel to calculate how many letters made up the word. It mysteriously attained hyphens in the word.
I have read somewhere that data scientists have complained about spending up to 60% of their time on data cleaning.
In working with data, especially those pulled from somewhere else or those typed in by others, we often have to clean them up or standardize them in the format we need. You are probably familiar with dates: sometimes they come in as text and other times they come in as date formats or as a numeric format.
But sometimes, you will get something weird and perplexing.
For some reason, I can’t attach my PDF so I’m going to have to try to explain this as concisely as possible.
TRIM and CLEAN
The first thing I do whenever I encounter something weird is to use Excel functions TRIM and CLEAN.
SUBSTITUTE
Once I finished cleaning with what usually works, I did the next thing to get rid of the hyphens: substitute the hyphens with blanks or “”.
Except that didn’t work. The hyphens were still there!
What could be going on? My suspicion was that the hyphens were not really hyphens. How do I test that? began by extracting how that hyphen, either by using the MID function or just by deleting the characters around the hyphen.
CHAR and CODE
Now, I never remember which to use, CHAR or CODE, so I play with them. In this case, it is CODE. See cell C13? That is the original text with hyphens with all of the letters deleted. Then cell C15 contains the formula CODE which gave me 173.
If you were to type in a regular hyphen from the keyboard, you would find that its code is 45. Ah ha!
So, I have confirmed that the hyphen really was not a hyphen, at least not the one typed from the keyboard.
New Approach with SUBSTITUTE
Originally, when I used SUBSTITUTE, I wrote it as =SUBSTITUTE(B2,”-“,””).
Now, I’m going to replace that hyphen with CHAR(173) (which converts 173 into the fake hyphen).
I blew up the image to make it easier to see the formula.
The written formula became =SUBSTITUTE(B2,CHAR(173),””) and I finally got results without hyphens. After that it became a simple matter to attain the number of letters in the word.
LEN
Wow! It is 58 letters long! Imagine typing or writing that out everytime you have to mail something to that village.
You must be logged in to post a comment.