I’m going to try out a new topic called Excel Simulations, learning from a book with that very name, written by Dr. Gerard M. Verschuuren. The book covers 80 simulations, extending over a wide range of fields. I’m not sure how this will work here because I wanted to see if I could up with some real world application or a creative real world application, but the problem is I’m not that creative. And to come up with a decent picture. And to do this every week, which for 80 of them means doing this for over a year. This could very well be ugly.
And I have already failed on the first one: I can’t come up with a real world example.
The first simulation is very basic: rolling of a die and showing the result on the screen. The image below is that of the file. You can place the cursor in the body of the screenshot and hit F9 and see what the file does but unfortunately, you cannot see the formulas. (Update: I used to be able to see the screenshot but now it has disappeared?) By the way, F9 appears to only work on the desktop.
I’m going to have to work on the approach a little more to see what it is I can achieve with my limited imagination and skill set.
While I haven’t been able to come up with a real world application with this scenario, you can use the RAND function that plays a major role in this simulation. For those with a more updated Excel, you can use RANDBETWEEN (X,Y) where X=1 and Y=6 to generate a number between 1 and 6. The RAND formula in cell B2 is structured as INT(RAND()*6)+1. RAND() will generate a number between 0 and 1, but never exactly zero or 1. INT(RAND()) will give you just 0, because INT rounds to the nearest whole integer. Nearest whole integer of .99 is still zero. Multiplying the RAND() by 6 will give you something between 0.0…. and 5.6….. Taking the integer of that [INT(RAND()*6)] gives you a range between 0 and 5. And finally, adding 1 will get you a range between 1 and 6, perfect for rolling a die.
To figure out what formulas to use in the black square representing the die, it helps to draw out all of the results, which is what columns O through P does. The show you where the dots lie for the numbers 1 through 6. First off, I noticed that the very center spot is used only when you roll numbers 1, 3 and 5, so I write a formula in that center for when the INT(RAND()*6)+1 equals 1, 3, or 5.
Next, I notice the very middle row only happens when the die is 6, so I write the appropriate formula in those 2 middle cells.
And the reasoning process follows in the same fashion for the remainder of the cells.
I haven’t figured out when I can use something like this (other than the RAND function where I’ve used it for simulations) but I can keep in mind of the idea of generating pictures. Most of the stuff I do don’t entail these kind of nifty picture. Or maybe I haven’t dreamed up a reason yet.
The next simulation in the book is pretty much the same thing but with 6 dice. (Okay, a little time out: according to some wiki-dictionary and Answer, die is supposed to be singular and dice plural, but it sure is hard not to add an S at the end of dice.) So to do this exercise, you pretty much copy what you did in the first one and then add some additional stuff such as a chart and some coloring to highlight when a die has 6 dots or when at least 2 dice have 6 dots.
To add the chart, all I did was just “Insert” it via the menu bar. The menu leads you just right into it.
Now, to highlight when a die has 6 dots, all you have to do is do a conditional formatting with criteria “=A1=6”, where A1 is the location of the random function generating the die’s numbers. Each die will have its own random generator, so A1 will change for each die.
For the instance where at least 2 dice show 6 dots, we want the entire row of dice to be highlighted. Again, that is just a conditional formatting with a criteria of “=countif(row of random generator,6)>=2”.
If you are on the desktop, placing your cursor in the body of the spreadsheet and hitting F9 might work. Making this thing work is tricky and right now, it does not appear to work for mobile devices.
The next simulation in the book brings up a topic of frequencies. Here, we are counting how frequently a certain item comes up. This example uses an array function called FREQUENCY to do the counting. The simulation rolls two dice using the RAND function, much like the prior two simulations, and then counts up the results using the FREQUENCY function and graphs it. There is also an AVERAGE calculation with some conditional formatting to highlight when the average becomes extreme, in this case, less than 5.5 or greater than 8.5.
To use the FREQUENCY function, you must first figure out your bins which would be ranges of numbers you want to count. In this example, we’re using single numbers as bins: counting the number of times a total of 2 comes up, the number of times a total of 3 comes up and so on. But sometimes you want to use a range of numbers as bins: how many people have a height of 4 feet or less, how many with height between 4 and 5 feet, and so on. Once you figured out what you want, highlight the range of cells where you want the count to reside. In this example, I highlighted cells H3 to H13. Then start typing in your formula FREQUENCY(range1, range2) where range1 is the input range to be counted and range2 is the location of your bins. So for this example, cells D3 to D11 holds the sums we want to count and cells G3 to G13 holds the bins. Since FREQUENCY is an array function, you won’t just hit – you will use Ctrl+Shift+Enter, holding the keys down as you hit those keys.
The graphing is fairly simple with the exception of applying a vertical line designating the average. I don’t know how to add that second line…yet.
FREQUENCY is very similar to the histogram which you can pull from Data Analysis found under the Data tab. This is an add-in from the Analysis Pak (I think) so you may have to pull it in. Histogram will give you the same result but FREQUENCY has one advantage: you can use F9 key to re-calculate the results whereas histogram has to be regenerated each time by going into Data – Data Analysis. The histogram method is a little more tedious if you want to run a couple of simulations.
As usual, the sample spreadsheet works best on a desktop where you can place a cursor in the body and hit F9. The Excel plug in tool that I found is NOT mobile friendly. Also, I tried creating a macro that would enable you to switch between seeing the formulas to seeing the numbers, but the plug in will not allow macro files to be upload. I believe it is for security reasons.
Now, I have actually used the histogram before or a variation thereof. I have used it when I first got to SAIC and was investigating the nature of the project works we were doing. I wanted to see what contract values our project works were bringing in and whether the 80/20 Pareto rule applied. When I first got there, out of 150 to 170 projects, most of our projects clustered around the $10K range and we had only maybe 2 or 3 at around $500K and maybe one $1 million. So most of the project managers’, project controllers’ and billers’ time were spent on the $10K projects, because there were so many of them. I displayed that histogram during our EAC review, mainly as a silent way of showing of where the work effort was being expended. After one or two years, the project contract values started to shift more towards the $500K range and a couple of $1 million dollar contracts. We had fewer projects but the values were greater and the overall revenue increased.
The next simulation involves “betting” that the next number will be even or odd. We’re using the RAND function again, combined with INT function, to generate numbers between 1 and 36. I could have used RANDBETWEEN but I wanted to practice using the logic of RAND and INT, as if I didn’t have the latest version of Excel. What you do is generate a list of numbers between 1 and 36, using RAND and INT. Column B in the attached file found below does that. In column C, you enter either 1 if you think the number is odd or 2 if you think the number is even. Once you put in your guess, the number appears in column B and your right or wrong score is depicted in column D, with cumulative score shown in cell L4. In this set up, you get 1000 guesses per play.
A couple of little quirks to point out: because RAND will generate a random number each time you enter your guess (unless you have set your file to manual calculation), I had to figure out a mechanism to freeze the random numbers so that you could use it for one set of play. It involves letting the random generator create the list of numbers in column B (a list with 1000 rows), copying that list, pasting special value that list into column R and then “freezing” the random generator by marking “x” in cell R1. When you want to generate a new list, you just delete the “x” in cell R1 and F9 should generate a new list. The green box provides more specific instructions.
I also have a choice of two scoring systems: system “a” gives you a zero for when you guess incorrectly and system “b” subtracts 1 when you answer incorrectly. System “a” doesn’t give you a losing incentive but system “b” takes away “your money” if you guess incorrectly. When I first built this “game”, I started with system “a” and could not see what the author was talking about when he said “you could lose” when using the “law of averages” approach. But after building in system “b” scoring, I began to see what he was talking about. Just like you have a 50/50 chance of guessing even or odd correctly, you have roughly 50/50 chance of winning, or put another way, 50/50 chance of losing.
For this simulation, you will have to download the file in order to play with it to see what the author is talking about. Again, this file is best seen on your desktop. The plug-in I’m using does not have a response friendly mechanism and I haven’t found anything better yet.
So, what aspect of this simulation can we use in our everyday life or work problems? Right now, I don’t see much use but the MOD function might come in handy if you want to alternate rows of color. You can use the formula (MOD(ROW(cell),2)) in conditional formatting to set up alternating bands of color. I’ve done alternating bands of color but by setting up a column of alternating zeros and ones and setting up conditional formatting to assign colors to zero and/or one. You can totally skip adding a column of ones and zeros and utilize the MOD(ROW(cell),2) in your conditional formatting. That little formula would take the row number and divide it by 2 to give you either a 1 remainder or a zero remainder. But other than that, I can’t think of any other use…at least not at this time.
Tuesday, June 30
This next simulation uses data tables in a really interesting way – one that I had to do some testing to convince myself it really does work. The basic scenario is we set up a situation where the gambler has a 50/50 chance of guessing correctly. He gets 1 point if he guesses correctly and one point deducted if he guesses incorrectly. Also, the gambler gets 100 guesses. That’s one simulation. We do that same scenario nineteen more times to create a total of 20 simulations. We collect statistics on those simulations.
In the file below, which again is not responsive, the guesses are under column B and the scoring is under column C. The statistics being collected is found in the table to the right. The guesses under column B are RAND() and are forced to 50/50 chance by setting up the scoring where you get a point if B>.5, otherwise, a point is deducted. In the statistics tables, row 3 contains the formulas for average (AVERAGE), maximum (MAX), minimum (MIN), and standard deviation (STDEV.S) (I believe this is a new one for Excel 2013). Beneath that row of formula is left blank to set up the data table.
I normally don’t use data tables because my world typically does not ask for it but whenever I come across a situation where I can use it, I always have to get instructions on how to set it up. When I do use data table, it typically has been a 2 variable data table but you can do a one variable data table and can have multiple formulas. This simulation uses a one variable data (the RAND) and contains multiple formulas. The way I have the spreadsheet set up, I have a sample one variable and two variable data tables set up to the right (basically columns O through S) where the highlighted regions are inputs to the data table. So for a one variable table, you have an input cell P31, the variables you want to try in column O23 to O27, and the formulas across in a row P22 to Q22. Highlight the region O22 to Q27, then go to the menu and select Data – What-If Analysis – Data Table. If you don’t see Data Table, you may have to pull it in via the Analysis Pak. After selecting Data Table, you will get an input box asking for row input cell and column input cell. Since I have the variables going down in a column, I want to use the column input cell. My column input cell is going to be P31. My understanding is that all of the variables in column O23 to O27 will pass through input cell P31 which feeds into the formulas in P22 to Q22 and the answers get spit out into the highlighted are O22 to Q27.
The two variable data table is set up slightly differently in that you can only have one formula to query. I won’t go into details here.
For this simulation, the formulas are arranged in a row H3 to L3. The variables, which will be empty, are in column G3 to G22. The table to highlight before going into the menu bar will be G3 to L22. The input cell will be an empty cell, but note, the formulas will not reference that input cell. Instead, those formulas in row H3 to L3 references column C which contains the scoring system (which itself references the RAND formula). This set up will automatically generate 20 simulations.
In order to convince yourself that this works, do these series of tests:
- First set up a normal one variable data table with at least 2 formulas (make them up), a column of variables, and an input column cell that the formulas references.
- Now do a new one variable data table, keeping everything the same except make the column of variables empty and see what happens. You will have to make your input cell empty too.
- Copy scenario 2 or data table 2 and change your input cell to be another cell which does not feed into the formulas. See what happens then.
- Finally, copy scenario 3 or data table 3, and enter RAND into the cell that the formulas point to, keeping the input cell the same as in #3. In other words, the formulas will reference to a cell with a RAND function but does not reference the input cell.
Scenario 4 or data table 4 is the set up being used in the simulation. Hope this helps you see why this set up creates 20 simulations.
Below is the Excel file showing the simulation. This is best seen with a desktop and not a mobile device. This was the only plug in I could find that would embed Excel but it is not responsive.
I think this one is going to be short as not much creativity is running through me. Also, I’m going to have to work on how to make this section another blog but separate from my original, that way this page is not one long page. The author’s book, Excel Simulations, has about 80 simulations.
This next simulation is mainly about plotting random numbers. One set of random numbers (the x variable) moves up or down depending on whether your RAND() generator is greater than .5 or not. These set of numbers create one set of graph. The second set of numbers (the y variable) moves left or right and creates the second and more interesting graph. Again, the Excel view shown below is best seen on your desktop and not on a mobile phone. The plug in I’m using is not responsive and I haven’t found anything better. If you are on a desktop, place your cursor in the Excel view and hit F9 to generate new calculations. Both graphs change and the second one is rather mesmerizing.
I kind of wonder if the second graph is similar to or is the genesis of the “life” simulations where “things” get born, grow and die. It has that “life” feel. Maybe you start from this basic set up and then add some constraints or conditions? Not sure yet, have to think about this. The other interesting thing I find is that the movement does not go beyond 80 or 90, in any direction. In other words, we don’t move up 1000 times – we go up 2 and then down 1 before moving back up. I’m not sure 80 or 90 is the threshold for 1000 runs and would differ by the number of runs, but it does seem that there is a limit to the movements. Is this similar to the theory (or is it fact?) that there are only certain conditions that can have the current physical constants and it is only with these physical constants that we have life in the universe?
The big learning take away from this simulation is the function COUNTIFS where you can use multiple criteria instead of just one. Right now I use array formulas to perform multiple criteria and on this simulation, I used an array formula to cross check the COUNTIFS function. The COUNTIFS function, which probably can only be used with 2013 Excel version and maybe with 2007, is easier to use than array formulas and probably does not use up much calculating space. The author also talks about the MATCH function, but I already know about this function. It’s a useful function, especially in conjunction with INDEX, and can replace VLOOKUP. VLOOKUPs have some limitations that the MATCH/INDEX combo addresses.
Okay, that’s it for now.
July 12, 2015
The next installment of simulations uses a technique I’ve seen used in discussions about Monte Carlo simulations in Excel. Basically you assign a percentage that an occurrence will happen and you force the simulation to come very close to that assignment. For example, in the embedded spreadsheet below (again, this embedded spreadsheet is best seen on a desktop as the plug in is not responsive), cells E1 to F6 shows the dollar denominations and the percentages. In this example, we are going to deploy some dollars in some order and with some frequency. The frequency is determined by the percentages, so 60% of our monetary allocations will be in $1 denomination, 20% will be in $5 denominations, 10% will be in $10 and so on. Column B shows the order of the spending of the denominations, using the RAND function as well as the VLOOKUP. In column B, there is a hundred runs of “spending” or “allocation” and cells E11 to F16 shows the frequency of the dollar denominations deployed in column B. A chart comparing the decided allocation percentages (cells E1 to F6) versus the resulting allocations (cells E11 to F16) is shown to the right.
The RAND function generates the percentage and the VLOOKUP looks up that percentage and assigns the appropriate dollar denomination. The quirk with the VLOOKUP is that the first column of the search range must contain the percentage it will look up, so column D plays that role rather than column F. Cell D2 shows zero and cell D3 shows 60, so any percentages between zero and 59 created by RAND will be for the $1 denomination. Any percentages between 60 to 79 will be assigned the $5 denomination. For clarification, cells H1 to H6 show the percentage ranges assigned to the denomination. The formula in column B is VLOOKUP(RAND(),$D$2:$E$6,2). In the beginning, in order to test how that formula would work, I first started with the RAND function in cell C2 with cell B2 referring to C2 as VLOOKUP(C2,$D$2:$E$6,2). I hit F9 multiple times to check how the VLOOKUP formula was assigning the dollar denominations. Once I was satisfied with the table layout in cells D1 to E6, I embedded the RAND function within the VLOOKUP function so that column B’s formula was VLOOKUP(RAND(),$D$2:$E$6,2).
Hitting F9 multiple times will show you that the frequency result is close but not exactly what you assigned. So the occurrence of the $1 denomination happened close to 60% of the time but not exactly. If you want to force an exact percentage allocation, you will probably have to combine RAND and RANK to force a random percentage with a rank between 0 and 59 to be $1, anything with a rank between 60 and 79 to be $5 and so on. The RANK pretty much forces the RAND to allocate as per the percentages given. Now this works with 100 runs. With a different number of runs, you will have to adjust the rankings range to come up with the desired frequency. I’m currently doing something very similar to this approach and found that RANK pretty much did the job of stabilizing the frequency to what we wanted.
This particular posting will be done in multiple pieces, possibly two or three, as I try to understand some of the materials the author writes about. But first, I want to touch upon the simulation that I would have done for last week if I hadn’t had problems with my PC doing a bunch of upgrades. I really didn’t see a whole lot of use for that posting even though it was about cracking passwords. In that instance, it was more about searching for a word in a random list of 4 letter words. The only thing worth noting about that simulation is that you can use the function CHAR along with numbers 97 to 122 to generate random 4 letter words. I already use CHAR and CODE, so there really was nothing new there and the use of them was not very startling.
This week’s, and for the next couple of weeks, simulation will cover binomial distribution. There were some aspects of his simulation that I did not quite understand and I’m finding that my study of it will take a while. The simulation is “toss a coin 6 times and count how many times you get heads” (zero heads, 1 heads, 2 heads, etc). Do many runs of the 6 toss and see what the distribution of the heads turn out to be. I used RAND to simulate a 6 toss trial to see how many heads come up, and then do that 500 times. It turns out that 3 heads and 3 tails are the most likely toss – duh! That made sense but what I was really testing was the 31% probability assertion he was claiming. I believe him but I wanted to test it for myself. Section A1:H501 was the brute force test using RAND function. A table summarizing the frequency was developed in cells K1:N9. If you place your cursor in the body of the embedded spreadsheet (this will only work if you are seeing this web page on your desktop; mobile phone won’t work) and hit F9 a couple of times, you will that the 3 heads and 3 tails toss ranges around a 31% probability.
Then, I did a variation where I listed out all of the possible arrangement of heads and tails. Cells T1:AA65 depict that. Table K13:N30 counts up those variations and you see where the 31% for 3 heads and 3 tails come from.
Now what I’m striving to do, before I continue on with his simulation, is figure out how do you get the probability without such brute force. I can figure out the total number of variations for tossing coin 6 times: 2 to the 6 times (or 2 x 2 x 2 x 2 x 2 x 2) and that equals 64 variations. That’s what cells T1:AA65 shows. If you toss a coin 8 times, then you would have 2 to the 8 times = 256 variations (and trying to do that in excel would take time). If you had a weird die with 3 faces (say heads, tails, and guts) and you tossed that die 4 times then you would get 3 to the 4th times = 81 variations. The image I had for this post attempts to display why the math works this way. Think of a coin toss where you toss it 2 times. Then you get a 2 x 2 table of possibilities or 4 cells in a table. That’s the flat figure of 2 x 2. If you toss the coin 3 times, then you start off with the 2 x 2 and add the third dimension to get a table of 2 x 2 x 2 possibilities. At least that was how I was visualizing the generation of possibilities. So I can figure out how to get the total number of variation possibilities.
What I can’t figure out yet is how to determine how many variations of 2 heads and 4 tails there are or 1 head and 5 tails without generating a list much like cells T1:AA65. I’m trying to figure out if there is a mathematical way of figuring this out and thus figure out the probability. That’s where I’m at right now.
The embedded spreadsheet below is not responsive and thus will not work with mobile devices. You must view using your desktop.
Sunday, August 2, 2015
I played around trying to figure out if there was a pattern or formula that would tell me how many times “x” number of heads would show up if you flip a coin “n” times. So, if I flip a coin 8 times, how many different ways can 5 heads appear? I can figure out how many head/tail combinations there will be if I flip 8 times: 2^8 (2 to the 8th power). I assume if you had a 3 headed die, flipping it 8 times would lead to 3^8 combinations.
But I wanted to know, out of all of those combinations, how many would yield 5 heads and 3 tails? Once I had that, I could calculate the probability of that combination occurring. From that, I could run a test to see if the coin or die is a “fair” coin or die or something doctored to lead to a certain outcome.
One of the thing I noticed was that upon building up a table, I found that by adding 2 numbers from the previous “flip” gave me my results. An example would help clarify what I just said. In the embedded spreadsheet below (you might be able to see it in your mobile device but you probably won’t be able to move around; it is best seen on your desktop), cells A1:H13 shows a table I developed that depicts the counts of the head/tail combinations for “n” flips of the coin. So, if you flip the coin 6 times, the pattern of 2 heads and 4 tails would show up 15 different ways (pattern 1: h, h, t, t, t, t; pattern 2: h, t, h, t, t, t; pattern 3: h, t, t, h, t, t; and so on). Cell G7 shows you the 15. The number 15 is the sum of 5 and 10, under the previous column – the previous “flip”. All of the numbers in the table is the sum of 2 numbers in the previous flip.
But I wanted a formula, not a table because suppose I had a situation where I wanted to do 50 flips. I could build a table but it would be much easier with a formula.
In looking at that table, it brought to mind the Fibonacci numbers/sequence. I vaguely remember seeing somewhere a table where you added 2 numbers from a previous row or column and wondered if it was related to the Fibonacci sequence. It turns out that Fibonacci sequence is related to the table and that the table is known as Pascal’s Triangle. And from there, I found the formula: n!/(k!*(n-k)!). And, further research yielded an Excel function COMBIN(n,k), where n is similar to the number of flips and k is the number of heads showing up. You could use it to figure out the number of ways of choosing 8 people to play out of a 10 member team (n=10, k=8).
Because I had to spend the last 2 or 3 weeks on the video creation, I’m now trying to catch up on other activities. So this is a belated entry.
This entry will probably be the last one related to this simulation. I think I figured out all I am going to figure out. There are some things I’m not sure I get where he is going but I suspect I won’t get it until much later on.
Last week, I was trying to figure out how to count the number of variations of a certain coin flip out of the total pool of possibilities. For example, how many variations of 5 heads and 1 tail flip will there be out of a pool of 2^6 flips (64 total combinations of heads and tails when you flip a coin 6 times). From that inquiry, I learned there is a Pascal’s triangle that gives you the answer but there is also a formula: n!/(k! * (n-k)!). In Excel, you could use the FACT function to do the “!” part or you could use the COMBIN(n,k) function which is a shorthand for the formula. To get probability, you would then divide that number by the total variations which would be described by the formula 2^n (where 2 is for the 2 outcomes heads or tails). I presume you could do the same if you had 3 outcomes: heads, tails, and balls (if you had a 3 headed die).
The author talked about a fair coin where you had a 50/50 chance of getting a head. An unfair coin would be something where the odds were skewed towards a certain outcome, such as a 40% chance of getting a head. The table I developed last week out of the Pascal’s triangle was for the 50/50 chance but the author had a table of probabilities for the 20%, 30%, 40% and 50% chance of flipping a head. As an example, see cells B1 to H10 in the embedded Excel below (best seen using desktop). Here was where I got lost and I wanted to understand how we derive that, not just use the Excel function. It turns out that the Bernoulli trials play a role here.
I believe the Bernoulli trials only work where you have 2 outcomes such as succeed/fail, yes/no, head/tails, etc. I won’t go into all of the details because it is still sketchy and I haven’t internalized everything but the basic formula is: [n!/(k! * (n-k)!)] * [p ^ k] * [(1-p)^(n-k)] where n is the number of flips, k is the number of heads and p is the probability of getting heads. Note the first part is the formula that was obtained from the Pascal’s triangle so you still have the count of the variations of head/tail combinations. The next part is the probability of succeeding, so if you flip a head 5 times out of 6 tosses and the probability of getting a head is 20%, then the probability of flipping 5 heads becomes .2*.2*.2*.2*.2. The third part, or the last part, is the probability of failure or tails which would be .8^(6-5) = .8. In this instance, with a probability of heads at 20%, flipping 5 heads is practically zero. Column Z and on to the right shows my thinking, if you can understand it. There is a proofing of the formula for the 50% probability which gets back to the fair coin situation.
There is an Excel function that calculates these kind of probabilities: BINOM.DIST(k,n,p,false). I could have just used this formula, but in the beginning I wasn’t understanding where the author was going. Now I have a better understanding of what this function is doing so that I don’t just blindly use it. The last part that I still haven’t squared away is his comment that it is very hard to tell if the coin is fair or not. I sort of see what he is talking about but I’m not totally convinced that I do understand. But that is for another time. Now it’s getting late and I need to get into bed.
Monday, August 8, 2015
With this simulation, I’m not sure where the author is going with this. What is our takeaway? That the “mean of the means” does not fluctuate much? The definition of SKEW? What?
This simulation is basically 20 trials or runs of ten numbers generated randomly. The random numbers range from zero to ten, using the combination of INT and RAND. Nothing new here. An average of the 10 random numbers is calculated for each of the 20 rows. Then a grand average of all 20 averages is calculated. If you hit F9 often enough, you will note that the 20 averages fluctuate somewhat wildly whereas the main average of the averages does not.
Off to the right, you will see a frequency table that counts up the averages. The function used is an array function FREQUENCY which we’ve used before so nothing earth shattering here.
Then there is a graph. I need to figure out how to graph two very different set of numbers. I can graph the averages but I can’t seem to add in the vertical line representing the “mean of the means”.
Here’s the embedded file of the simulation. Again, this file is best seen on a desktop, not on a mobile device, since the tool I’m using is not responsive.
This week is an expansion of last week’s discussion. In fact, the simulation is almost exactly the same: there is a number of rows of 10 randomly generated numbers from 0 to 10. Each row is a sample of 10 randomly generated numbers, the 20 rows represent a sampling population “culled” from the greater population. Generally, when you do surveys or polls, such as that being done during the run up to the Republican nomination, you can’t survey or poll every single person in the population. It’s too expensive to do that, so there is a sampling of the population. Now I don’t know how to equate the sample of 10 random numbers to the Republican sampling unless 1) the general population is the Republican voters in the whole US, 2) the sampling population is the Republican population in a specific state, and 3) a sample is those polled?
The embedded file below shows on the left hand side the frequency of the random numbers that shows up in row 6 whereas the table on the right hand side shows the frequency of the average of the numbers in the sampling population. The table on the right is last week’s table. Beneath the frequency table on the left hand side, you will see the standard deviation for a single sample. The interesting thing is, that standard deviation enables you to determine the sampling error for the sampling population table on the right. It is interesting how a single sample can determine the error of a sampling population. I’m not getting an intuitive sense of why that is so, unless…that single sample is supposed to be representative of the whole sampling population in which case, the more items in the sample, the smaller the sampling error. The formula for the sampling error is (standard dev)/(square root of # of items in a sample). Hmmm, interesting thought process here.
The embedded file below is best seen on desktop and not on a mobile device. The widget used to embed the file is not a responsive widget but it’s the only one I could find.
The next simulation in the book is Normal Distribution and I learned a new analytical tool that I need to explore further. The tool comes from the Analysis Toolpak add-in and is called the Random Number Generator. At first, I thought it was cool but then I found that it had one odd quality. What is does is generate a list of random numbers that fits the normal distribution coalescing around a certain mean (or average) that you input along with a certain standard deviation that you also input. In this example I asked for 100 random numbers that fit the normal distribution profile with a mean of 10 and standard deviation of .5. With the new Excel 2013, you have a choice of using standard deviation for sample or standard deviation for population and they do give slightly different results. I used sample. You also have to input a seed number which in this instance, I chose 5 as the seed. The Random Number Generator then pops out a list of random numbers (100 of them in column A in the embedded file below). These numbers are hard coded, so if you hit F9, those numbers will not change.
But that’s not really the weird thing. The weird thing is that if you do over the Random Number Generator with the same set of criteria as input, you will get the same list. I don’t know if that is because for each set of mean, standard deviation and seed, there is only one solution that meets the normal distribution model. Somehow, I don’t think so but I really don’t know yet. I do know that if the seed is different, then you will generate a different set of numbers. And of course, differing means and standard deviations will also lead to different sets of numbers.
So my idea is if you want to use the Random Number Generator, use RAND to create a randomized seed number and input that so that something is random.
With the Random Number Generator, you can also create different kinds of distributions beside normal: uniform, Bernouilli, binomial, Poisson, patterned and discrete. So this could have some interesting play.
The final thing to point out is there is a new function that I’ve never used before: NORM.DIST. This function gives a percentage probability of a number occurring in a series with a certain mean and a certain standard deviation. The formula is used in column G in the embedded file below (which, by the way, is best seen on desktop – the widget is not responsive for mobile). Columns D and E are just the ranges of number created by the Random Number Generator. Column F is the FREQUENCY or count of the numbers found in column A. To the right, there is a graph displaying the typical normal distribution curve along with the graphing of the count of the results. And then even further to the right are my efforts of using the Random Number Generator again using the same criteria.
Sunday, September 5, 2015
Ah, Labor Day Weekend is officially here.
The next stimulation is still on the topic of normal distribution but it has some more substance to learn. One thing, I re-read the last simulation in the book and saw that the author discusses running the Random Number Generator again to get slightly different numbers. What is he doing differently? I do not get slightly different – I get completely the same set of numbers.
On to the new topic – this new simulation topic discusses a new way to generate a normal distribution of random numbers. There is first a column of 100 randomly generated numbers using the RAND function (column B in the embedded file). Then there is another column of a normally distributed random numbers created by using NORM.INV.
In the last simulation, we used NORM.DIST(x, mean, standard deviation) that produced the probability of x appearing, given the average and standard deviation. In this simulation, NORM.INV(probability, mean, standard deviation) generates the number with a given probability. The RAND function goes into the probability part of the formula: NORM.INV(RAND(), mean, standard deviation). The chart shows how using NORM.INV sort of conforms to the normal distribution.
In addition to a new function, the author talks about how 1 SE (or standard error) leads to 70% confidence level that the values will lie between 90 and 110. At 3 SE, we have a 99.7 confidence level that the values will lie between 70 and 130. BUT…when I add up the percentage occurrence next to the frequency column (cells H34 to H36 in embedded file), for values between 90 and 110, I keep getting around 85%. I must be missing something. I’m assuming that standard error is roughly the same as standard deviation?
This week’s simulation was to be the normal distribution again but with a data table to show how more trials will get you closer to a normal distribution. It was to be 20 runs of 100 number trials or 2000 numbers, which, according to the normal is still relatively small data set. I did do the data table so that I could refresh my memory on the methodology but rather than post my efforts here, I thought I would do something different.
When I first started doing these normal distribution simulations, an idea popped into my head but I didn’t have the time to play it out. Now I have had a chance to pull at least some of the data but it was rather tedious. I wanted to do a normal distribution of the IQ for Chinese and Indian populations and compare to the US population but pulling the population data took time so I did just the Chinese population, which was eye-opening enough alone. All information I pulled came from Wikipedia. The first thing I learned about IQ is that the tests were developed to have a mean IQ of 100 and a standard deviation of 15. There appears to be an assumption that IQ is normally distributed and Wikipedia implies that with the statement that 2/3 of the IQ will fall within the range 85 to 115 (1 standard error or the 70% confidence interval). Wikipedia also says only 5% have IQ 125 and greater. Cells F31 to G52 in the embedded file (which is best seen on desktop, not mobile, since the widget is not responsive) show the IQ ranges and the counts to mimic the normal distribution. Column C does the random creation of IQs, fitting a normal distribution curve with a mean of 100 and standard deviation of 15. The random normal distribution in column C was created by using the function NORM.INV(RAND(),mean, standard dev).
The population information I also pulled from Wikipedia but from an infographic which was interesting to look at but tedious to pull numbers from. First the overall Chinese population is around 1.4 billion whereas the US is around 321 million. In percentage terms, the US population is only 23% of that of China’s! Already I can see problems – their population number swamps ours. India’s is roughly 1.28 billion. Combine China and India together you get roughly 2.68 billion and US is just 12% of the combined population. Furthermore, I broke out China’s population into age groups, using the infographic that I found. I wanted to get at the working age population. First, I assigned the population numbers to each age, from newborn to 90 and the total came up to 1.335 billion, obviously way short of the 1.4 billion. But I decided to go with it because the numbers came from 2010 data and what’s a few millions here and there when the numbers are so large anyway? I decided to define the working age population to be 20 to 65 and the total number turned out to be around 900 million, still way more than US’ 321 million. (And that 321 million number includes children and the elderly!) Then I used the percentages culled from the IQ normal distribution to determine how many Chinese working age folks have enough intelligence to do the work (IQ 100 and above). Depending on how the random numbers are generated, you are looking at around 500 to 600 million. China alone have way more sufficiently intelligent working age people than the entire US population alone. How do you generate and distribute sufficient number of decent jobs? The entire IT industry does not produce a lot of good jobs to replace the ones going away.
Now add in the Indian population and the coming robotization that is going to eliminate 47% of the jobs (probably numbers stated for the US).
The next topic, which is on page 30, is confidence margins. Don’t get confused like I did. On page 27, he talked about confidence as in 70% confidence, 95% confidence, etc. It took me a while to figure out, but that is a different confidence than the 95% confidence the author talks about on page 30. Page 27 uses confidence as a measure of the area under the normal curve to depict 95% confidence that the values will be between x and y. Or I like to rephrase it as 95% of the values will lie between x and y. On page 30, the 95% confidence is related to the fact that a single number, the mean, will be between x and y, which are different from x and y on page 27.
If I write in numbers, you will see the source of my confusion. Page 27 depicted 95% confidence as being 1.96 SE from the mean. In the example on page 30, the mean was 4.5 and standard deviation was .7, which meant a range of 3.13 to 5.87, per the confidence concept on page 27. But the page 30 example said that the range was .23 around 4.5, or 4.27 to 4.73, a completely different set of range. It took me a while to understand the page 30 range as a range having 95% probability that the mean will fall in it, given the standard deviation of .7. That range did not mean that 95% of the values fitting a normal curve will fall within the range.
In my effort to understand this, I went out to the web to do some research. In doing so, I learned that there are different formulas used to calculate the confidence margins. The formula used in Excel is for the mean; there is another formula for proportions, whatever that means. So there appears to be more formulas going into the confidence margins than the book tells you. I might have to dig into this to understand better.
The big takeaway from this simulation is that there is a function, CONFIDENCE, that gives you the confidence margin to add to or subtract from your mean to arrive at the range of numbers where you are confident the mean will lie. The function syntax is CONFIDENCE(5%, standard deviation, size). So for the example in the book, CONFIDENCE gave plus or minus .23 which meant a range of 4.27 to 4.73 where you are 95% confident the mean lies within. The other takeaway is the use of data table to create a table of confidence margin ranges, given your sample size and standard deviations. Previously, we’ve been using that data table to create simulations. Now we use it to create what-if tables to show how the result changes due to changing variables (in this case, sample size and standard deviations).
I worked on this on Sunday but some things did not make sense so I decided to hold off another day, in the hopes that clarity would come to me. Nope, no such luck. Monday was too busy and I ended up exhausted as it was not a good day. And today, I’m not feeling too hot so I will keep this post even more brief than usual.
This week’s simulation involves more of the statistics that I’ve been working on. The functions NORM.DIST, NORM.INV, CONFIDENCE, NORMSDIST (new) and NORMSINV (new) are all interrelated. I need to grab a statistics book that covers these topics so I understand them better. I think I understand only on the surface level.
In this week’s simulation, I do not understand the basis for the formula found in column D, which is highlighted in yellow in the embedded file (said file is best seen on desktop). Why did he set up a formula like that? Also, reading the chart, how does one read it? It’s beautiful but I’m not sure I would use it properly. Finally, I would like to understand better how to use it in real life.
Okay, that’s it for this week. I feel either a cold or flu coming on.
The last few simulations have been centered around normal curves. This week discusses how subgroups can make up a total population and make the population non-normal. The author cautions us that when the population is not normal, you can’t use the normal statistical tests such as Student’s t-values. So that brings up the question of how do you know when the population is not normal?
The other question that I have is how did the author managed to have smooth looking curves in his chart?
And finally, there is some kind of potential in the way he combined two subgroups to make the population. I don’t know how I would use that concept but there is something there.