Need help BS'ing my way through statistical sampling
Upchurch
4th May 2007, 09:07 AM
I'll start off by coming clean. I suck at statistics. It is probably the root of my problem with Quantum Mechanics in college and why I left the field of physics. In helping me, assume I know nothing.
Here's my situation: I have a series of 100+ online courses that use a common courseware engine. In the past, when a change was made to the courseware, we used an automated testing program to verify that 100% of the courses still functioned properly. We recently lost our vendor for maintaining the automated test and have been thoroughly unsuccessful in replacing them.
I have managed to convince the client that running a representative statistical sampling of the courses through a manual testing protocol will be more than sufficient for a reasonable quality assurance.
Now: if we round the number of courses off to an even 100, how many courses do I need to sample to obtain a reasonable representation and what would my confidence be for that sample size?
Thank you very much in advance.
69dodge
4th May 2007, 09:41 AM
I don't think software works like that. Computers aren't random. They do what you tell them to do. (Maybe programmers' mistakes are in some sense random? But the relationship between a change to a program and the resulting change to the program's behavior is in general very complicated. I can't imagine it being easy to model statistically.)
If all the courses are basically the same, all will work or none will. Maybe some use features of the software that others don't use? I think the testing should be done at a lower level of granularity than a course. Test individual features of the software. Focus on what has been recently changed.
What does "maintain the automated test" mean? It still does what it always did, presumably. Software doesn't wear out.
I guess you know all this already? Apologies if I sound condescending.
Who wrote the software for the courses? Who changes it? You?
Or are you using software someone else wrote, but you're still responsible to your client for making sure it works? Yuck. What are you supposed to do if it doesn't?
drkitten
4th May 2007, 09:41 AM
I have managed to convince the client that running a representative statistical sampling of the courses through a manual testing protocol will be more than sufficient for a reasonable quality assurance.
Now: if we round the number of courses off to an even 100, how many courses do I need to sample to obtain a reasonable representation and what would my confidence be for that sample size?
I'm afraid that I can't answer your question as you posed it, since you're the one who needs to decide what "reasonable" is. I suspect that what you want isn't possible. But I can walk you through the procedure to get the answer.
Assuming I can get the notation to do what I want.
Let me start out by introducing you to the Z (standard normal) distribution. Upchurch, Z distribution. Z distribution, Upchurch. You'll find it at the back of any stats book. It is your oracle for magic numbers; in particular, if you want 95% confidence, then the magic number is 1.96. For example, if at the end, I give you a result like 98.8% +/- 3.0% of the courses work properly, this means that I am "95% confident" that the true value lies between 95.8% and 101.8% -- and I used the number 1.96 somewhere in my calculation to get it.
Now, what we're really going to do is try to estimate the proportion of the population of courses with a given attribute (in this case, "working properly"). So let's say there are N courses, of which X work and Y don't. We will sample n courses, find that x work and y don't. CAPITAL letters are the true values, lower case are our samples and the estimates we get from them. So the true proportion P is X/N, and our estimate p is x/n. Similarly, the true proportion of failures Q is Y/N and our estimate q is y/n. So far so good?
Our margin of error e (half the width of the confidence interval) can be calculated as 1.96 * sqrt( p * q / n). In particular, note the use of 1.96 there. If we were happy with only 90% confidence, we could get a smaller interval by pulling the magic number 1.645 from the table. If we wanted 99% confidence, we'd need to use 2.576.
Now, we take that equation and solve for n in terms of e.
If we have an idea of what p and q should be, then we can use n = (1.96)^2 * p * q / e^2
If we have no idea of what p and q are, we can just assume p = q = 0.50 -- maximally uninformative -- and we get
n = (1.96)^2 * 0.25 / e^2
Here's where you need to tell me what a "reasonable" margin of error is. If you tell me that you're happy with 5 classes (out of a hundred) not working, this means a margin of error of +/- 0.05 Setting e to be 0.05 means you would need to sample about 385 classes to get that kind of certainty, which of course is more classes than you have.
If you're happy with a margin of error of 25%, then you'd need to check about 16 classes -- but that would mean that you're only really confident that 75% of the classes work. For a margin of error of 10%, you'd need to check 96 classes.
Basically, I'm not sure you can do it. You don't have enough data.
Now, if I make the a-priori assumption that it will work with about 90% of the classes (based on the fact that it's worked so far), then I can do a little better. I'd still need to look at 139 classes for a 5% margin of error, but a 10% margin of error would only need 35 classes. So by looking at 35 classes, I could say that I was 95% confident that at least 80% of the classes worked properly (the assumed 90%, less the 10% margin).
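The arithmetic above can be sketched in a few lines of Python. This is only an illustration of n = z^2 * p * q / e^2 under the normal approximation; the function name `sample_size` is made up for the example, and it rounds up, so the 10% case with p = 0.5 comes out as 97 rather than the 96 you get by rounding 96.04 to the nearest class.

```python
import math


def sample_size(margin, p=0.5, z=1.96):
    """Smallest n with z * sqrt(p * q / n) <= margin (normal approximation).

    p is the assumed proportion of working courses; 0.5 is the
    'no idea' worst case. z = 1.96 corresponds to 95% confidence.
    """
    q = 1.0 - p
    return math.ceil(z ** 2 * p * q / margin ** 2)


# Reproduce the cases discussed: no prior idea (p = 0.5) and a prior of 90%.
for p in (0.5, 0.9):
    for margin in (0.25, 0.10, 0.05):
        print(f"p={p:.2f}  margin={margin:.2f}  n={sample_size(margin, p)}")
```

Note that the formula ignores the fact that there are only about 100 courses in total, which is why it can ask for more classes than exist.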
Upchurch
4th May 2007, 09:45 AM
(as I was writing this post, I noticed drkitten posted. I will continue what I was going to say and then read his.)
Found this calculator (http://www.surveysystem.com/sscalc.htm).
I'm using the second one to massage the numbers:
Confidence Level: 95%
Sample Size: 10
Population: 100
Percentage: 99
Confidence Interval: 5.88
The way I read this, if I verify that 10 courses are working 99%, then I am 95% confident that they will all work with a 6 course margin of error.
Am I way off?
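For what it's worth, the 5.88 can be reproduced by hand if you assume the calculator uses the normal approximation with a finite population correction. That is an assumption about how the site works, not something taken from its documentation, and `margin_with_fpc` is just a name invented for this sketch.

```python
import math


def margin_with_fpc(p, n, population, z=1.96):
    """Margin of error for a sample proportion, shrunk by the finite
    population correction sqrt((N - n) / (N - 1)).

    Assumed formula; the online calculator's exact method is not confirmed.
    """
    standard_error = math.sqrt(p * (1.0 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


moe = margin_with_fpc(p=0.99, n=10, population=100)
print(f"margin of error: {100 * moe:.2f} percentage points")
```

Under that assumption the result is a margin in percentage points around the entered 99% figure, not a count of courses.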
69dodge
4th May 2007, 09:48 AM
He didn't notice me... :(
Cuddles
4th May 2007, 09:51 AM
I think I'm with 69dodge on this, it isn't just a simple probability question. If you change something that all the courses use then they will either all break or all keep working. Likewise for a subset of courses that use a common feature that is changed. You don't want to sample randomly, you want to take a representative sample. For example, if 20 courses use one feature and only 1 course uses another then you need only test 2 courses, one from the group of 20 and the 1 on its own. A random sample is likely to miss out the smaller group entirely.
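A minimal sketch of that idea, assuming you can list which features each course uses. The course names and the feature names ("quiz", "video") are hypothetical, and "one course per distinct feature set" is just one simple way to define representative.

```python
from collections import defaultdict


def representative_sample(course_features):
    """Pick one course for each distinct set of features.

    course_features: dict mapping course name -> iterable of feature names.
    Returns a sorted list with one course name per group.
    """
    groups = defaultdict(list)
    for course, features in course_features.items():
        groups[frozenset(features)].append(course)
    # Take the alphabetically first course in each group so the
    # selection is repeatable from one delivery to the next.
    return sorted(min(members) for members in groups.values())


# 20 courses share one feature, a single course uses another.
courses = {f"course_{i:02d}": {"quiz"} for i in range(1, 21)}
courses["course_21"] = {"video"}
print(representative_sample(courses))
```

With the 20-and-1 example this selects two courses, one from each group, which is the case a purely random sample of a few courses would probably miss.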
drkitten
4th May 2007, 09:51 AM
The way I read this, if I verify that 10 courses are working 99%
What does it mean for a course to be "working 99%"?
I thought a course either worked or it didn't.
Upchurch
4th May 2007, 09:52 AM
He didn't notice me... :(
Sorry, drkitten was at the top of the reply list. I didn't scroll down far enough. :o
69dodge
4th May 2007, 09:59 AM
I'm just kidding around.
Upchurch
4th May 2007, 10:00 AM
I don't think software works like that. Computers aren't random. They do what you tell them to do. (Maybe programmers' mistakes are in some sense random? But the relationship between a change to a program and the resulting change to the program's behavior is in general very complicated. I can't imagine it being easy to model statistically.)
The key here is that I get the client to believe that it works that way. I don't want to get caught on sloppy math, but I'm prepared to risk it if there is no other way.
What does "maintain the automated test" mean? It still does what it always did, presumably. Software doesn't wear out.
The automated test needs to be modified to account for changes in the courseware. It's just complicated and time-intensive enough that I don't want to go through that learning curve to do it myself.
Who wrote the software for the courses? Who changes it? You?
I write the courseware, other people write the content, still other people maintain (and modify) the server on which it is run. Plus it is run off the Adobe Flash Player which is a sometimes moving target and then people try to run it through terminal services, which is yet another two cooks in the kitchen. Although it is rare these days, occasionally we get unexpected results.
This whole thing is more political than practical.
Upchurch
4th May 2007, 10:01 AM
What does it mean for a course to be "working 99%"?
I thought a course either worked or it didn't.
It wouldn't let me enter 100%
eta: although, it could be working 99% if had a bug or something, but that is neither here nor there. Let me rephrase:
if I verify that 10 courses are working 100%, then I am 95% confident that they will all work 100% with a 6 course margin of error.
Baron Samedi
4th May 2007, 10:04 AM
(as I was writing this post, I noticed drkitten posted. I will continue what I was going to say and then read his.)
Found this calculator (http://www.surveysystem.com/sscalc.htm).
I'm using the second one to massage the numbers:
Confidence Level: 95%
Sample Size: 10
Population: 100
Percentage: 99
Confidence Interval: 5.88
The way I read this, if I verify that 10 courses are working 99%, then I am 95% confident that they will all work with a 6 course margin of error.
Am I way off?
Upchurch,
I just checked the calculator. If I read this right, the number under "percentage" should be the percent either "pass" or "fail", based upon your sample size. So for 10 classes checked, 99 isn't a valid estimate. 90 may be, or 50, but 99 isn't.
However, you're trying to find an estimate that the probability of pass is 100%, correct? If you type in 100%, you cannot do the calculation. May I suggest being slightly conservative, and jury-rigging the calculation so you have at least one bad for the percentage, or at least half a bad? Testing for 100% or 0% becomes a huge problem. It's easy if you have one "fail", but not 10 or 20 or 40 "passes". The big question is, generally speaking, just how often do these types of errors occur?
drkitten
4th May 2007, 10:04 AM
It wouldn't let me enter 100%
I don't think you understand what that number really means....
Baron Samedi
4th May 2007, 10:06 AM
It wouldn't let me enter 100%
eta: although, it could be working 99% if had a bug or something, but that is neither here nor there. Let me rephrase:
if I verify that 10 courses are working 100%, then I am 95% confident that they will all work 100% with a 6 course margin of error.
One more question... when you say "all courses", how many is "all"?
Upchurch
4th May 2007, 10:09 AM
I don't think you understand what that number really means....
I think that is fairly safe to assume.
My goal here is to present the client with a statement along the lines of "We verified that 10 courses are fully functional. There is only an X% chance that one* of the other courses will show up with an unexpected bug"
eta: * or two or three or whatever
Upchurch
4th May 2007, 10:14 AM
The big question is, generally speaking, just how often do these types of errors occur?
When we deliver, we deliver with no known errors. And honestly, we haven't had any surprises in the last year and a half which constitutes maybe 10-12 deliveries. (which had more to do with making sure our development server correctly matched the hosting server environment than anything else.)
Upchurch
4th May 2007, 10:17 AM
One more question... when you say "all courses", how many is "all"?
I don't have an exact number of currently running courses off the top of my head, but it is around 100.
Baron Samedi
4th May 2007, 10:23 AM
When we deliver, we deliver with no known errors. And honestly, we haven't had any surprises in the last year and a half which constitutes maybe 10-12 deliveries. (which had more to do with making sure our development server correctly matched the hosting server environment than anything else.)
So it's a true 0% bad rate problem! It's wonderful for your reputation, but awful for us stats geeks. You perfectionists, always making life so hard for us. :)
I don't have an exact number of currently running courses off the top of my head, but it is around 100.
I've done this jury-rigging calculation a few times for BizNez people. Here's how I've played it.
I sample n people, and all are "good". For my lower 95% confidence bound, I need to find the true probability p such that the chance of all n observations being "good" is (1-0.95). For example, if the probability of "good" is 90%, then the probability of 10 "goods" in a row is 0.9^10 = 35%. So to solve, always just use: lower estimate p = (0.05)^(1/n). So if you take 10 observations, and they're all good, then your 95% CI is between 74%->100% true good rate.
Worst case scenario, then, is that out of the remaining 90 courses, 26% may be bad. 23 may be a little high. How will they go? At most 2?
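Here is the same calculation as a small Python sketch. It treats each course as an independent pass/fail trial with the same pass probability and solves p^n = 1 - confidence for p; the function name is invented for the example.

```python
def lower_bound_all_pass(n, confidence=0.95):
    """Lower confidence bound on the true pass rate when all n sampled
    courses pass, found by solving p ** n = 1 - confidence.

    Assumes independent trials with a common pass probability.
    """
    return (1.0 - confidence) ** (1.0 / n)


for n in (10, 25, 40):
    p_low = lower_bound_all_pass(n)
    print(f"n={n}: pass rate at least {p_low:.1%}, "
          f"failure rate at most {1.0 - p_low:.1%}")
```

For n = 10 this gives a lower bound of roughly 74%, i.e. a worst-case failure rate of about 26%.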
Upchurch
4th May 2007, 10:30 AM
I've done this jury-rigging calculation a few times for BizNez people. Here's how I've played it.
Now, you're talkin'!
Worst case scenario, then, is that out of the remaining 90 courses, 26% may be bad. 23 may be a little high. How will they go? At most 2?
Okay, this is good, but I don't understand your question. At most 2 what?
Is there a way to translate this to a per course number? We're not saying 26 of the courses will be bad, only that they may be bad. Can we turn it around to something along the lines of "There is only a 26% chance that a course will be bad"?
Baron Samedi
4th May 2007, 10:37 AM
Now, you're talkin'!
Okay, this is good, but I don't understand your question. At most 2 what?
Is there a way to translate this to a per course number? We're not saying 26 of the courses will be bad, only that they may be bad. Can we turn it around to something along the lines of "There is only a 26% chance that a course will be bad"?
Sorry, I was thinking way too fast for my own mouth again. I get that way with numbers. :P When I asked "at most how many?", I meant how many "bads" would The BizNez tolerate? But forget I asked in the first place, since for The BizNez, any number that is greater than 0 will cause them to panic.
So I would think it would be safe to say to them, "Based on my 95% confidence interval, the worst case is a 26% chance that an individual course will be bad."
ETA: Just remember, we statisticians are really, really, completely, notoriously cynical and skeptical. Even if you came back with 25 passes in a row, I'd still claim that there's an 11% chance of failure. 40 passes just brings you down to a 7% chance of failure. That's why I've been banned from meetings with Marketing.
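Turning the formula around tells you how many clean passes in a row you would need before you could quote a given worst-case failure rate. This is a sketch under the same independence assumption as above; `passes_needed` is a hypothetical helper name.

```python
import math


def passes_needed(max_failure_rate, confidence=0.95):
    """Smallest n such that (1 - max_failure_rate) ** n <= 1 - confidence,
    i.e. n straight passes justify the stated worst-case failure rate.
    """
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate)
    )


for rate in (0.26, 0.10, 0.05):
    print(f"worst-case failure rate {rate:.0%}: "
          f"{passes_needed(rate)} consecutive passes")
```

The numbers grow quickly as the target rate shrinks, which is why a zero-failure record is so hard to turn into a strong guarantee with a small sample.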
Upchurch
4th May 2007, 10:45 AM
But forget I asked in the first place, since for The BizNez, any number that is greater than 0 will cause them to panic.
Ain't that the truth? Business is truly a pseudoscience.
So I would think it would be safe to say to them, "Based on my 95% confidence interval, the worst case is a 26% chance that an individual course will be bad."
Exactly what I was looking for. Thank you and thanks to everyone who helped me muddle this out.
69dodge
4th May 2007, 10:56 AM
I write the courseware, other people write the content, still other people maintain (and modify) the server on which it is run. Plus it is run off the Adobe Flash Player which is a sometimes moving target and then people try to run it through terminal services, which is yet another two cooks in the kitchen.
You have my sympathy.
This whole thing is more political than practical.
Try telling them:
Sir, $\frac{a + b^n}{n} = x$, hence the Software works -- reply!
(cf. Euler on God (http://en.wikipedia.org/wiki/Leonhard_Euler#Philosophy_and_religious_beliefs).)
slyjoe
4th May 2007, 11:01 AM
...
So I would think it would be safe to say to them, "Based on my 95% confidence interval, the worst case is a 26% chance that an individual course will be bad."
...
Most engineers I work with don't know how to interpret that statement, let alone business people. Good luck though :)
Upchurch
4th May 2007, 11:12 AM
Most engineers I work with don't know how to interpret that statement, let alone business people. Good luck though :)
If you can't wow them with facts, dazzle them with BS.
mhaze
4th May 2007, 11:24 AM
I'll be happy to logon a couple of the courses and tell you if they work.
I'll pay the fees to take the course.
When I find one that does not work, I want a 1000x payoff on the course fee.
Where do I send a contract for this work?
Alternate solutions method: Just listen to complaints from customers. That's called forcing the customers to do your beta testing.....
Baron Samedi
4th May 2007, 11:35 AM
Most engineers I work with don't know how to interpret that statement, let alone business people. Good luck though :)
Exactly. But it makes them feel like it's a meaningful statement. It's not what you know, it's how you look while you claim to know.
ingoa
4th May 2007, 03:06 PM
I'll be happy to logon a couple of the courses and tell you if they work.
I'll pay the fees to take the course.
When I find one that does not work, I want a 1000x payoff on the course fee.
Where do I send a contract for this work?
Alternate solutions method: Just listen to complaints from customers. That's called forcing the customers to do your beta testing.....
Up to now I thought this was called the Microsoft business model.
ingoa
4th May 2007, 03:13 PM
Upchurch,
I do not understand your problem. If you can test one course, you can test them all.
You will know exactly how many failed.
Why do you need a sample? It's only electrons running around in a CPU. Why not test all the courses? I am absolutely puzzled.
I did my share of automated software testing, but such a question (like yours) never crossed my mind. :confused:
T'ai Chi
4th May 2007, 04:11 PM
Here's my situation: I have a series of 100+ online courses that use a common courseware engine. In the past, when a change was made to the courseware, we used an automated testing program to verify that 100% of the courses still functioned properly. We recently lost our vendor for maintaining the automated test and have been thoroughly unsuccessful in replacing them.
I have managed to convince the client that running a representative statistical sampling of the courses through a manual testing protocol will be more than sufficient for a reasonable quality assurance.
Now: if we round the number of courses off to an even 100, how many courses do I need to sample to obtain a reasonable representation and what would my confidence be for that sample size?
So you're taking a course, and manually checking to see if it works properly. You're keeping track of how many out of 100 fail.
A ME (margin of error) is the +- part in an estimate. Here we are estimating a percentage, the percentage that work properly. Because we're estimating a proportion,
ME = 1.96*sqrt[p*(1-p)/n]
You won't know what p is until you're done sampling, but the worst case scenario for ME, is when p = .5 (because that is when the above function attains its maximum point)
ME = 1.96*sqrt[(.5*.5)/n]
ME ~ sqrt(1/n)
and substitute various n in there until you find a ME that you are satisfied with.
So any p you estimate will have a +- ME% attached to it with 95% confidence (because we used the 1.96 value above).
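To try various n, a short Python loop is enough. This is just the worst-case (p = .5) formula above next to the sqrt(1/n) shortcut; the function name is made up for the sketch.

```python
import math


def worst_case_margin(n, z=1.96):
    """Worst-case margin of error z * sqrt(.5 * .5 / n) for a proportion."""
    return z * math.sqrt(0.5 * 0.5 / n)


for n in (10, 25, 50, 100):
    print(f"n={n:3d}  ME={worst_case_margin(n):.3f}  "
          f"sqrt(1/n)={math.sqrt(1.0 / n):.3f}")
```

The two columns stay close because 1.96 * .5 = .98, which is nearly 1.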
CapelDodger
4th May 2007, 05:20 PM
Upchurch,
I do not understand your problem. If you can test one course, you can test them all.
You will know exactly how many failed.
Why do you need a sample? It's only electrons running around in a CPU. Why not test all the courses? I am absolutely puzzled.
I did my share of automated software testing, but such a question (like yours) never crossed my mind. :confused:
I have a suspicion - it's no stronger than that - that what Upchurch refers to as "maintenance" includes what we would refer to as "operation". In other words, the guy who knew how to work it has gone AWOL and not left any notes.
Baron Samedi
4th May 2007, 05:22 PM
So you're taking a course, and manually checking to see if it works properly. You're keeping track of how many out of 100 fail.
A ME (margin of error) is the +- part in an estimate. Here we are estimating a percentage, the percentage that work properly. Because we're estimating a proportion,
ME = 1.96*sqrt[p*(1-p)/n]
You won't know what p is until you're done sampling, but the worst case scenario for ME, is when p = .5 (because that is when the above function attains its maximum point)
ME = 1.96*sqrt[(.5*.5)/n]
ME ~ sqrt(1/n)
and substitute various n in there until you find a ME that you are satisfied with.
So any p you estimate will have a +- ME% attached to it with 95% confidence (because we used the 1.96 value above).
I've always been bothered by this calculation when the estimated p is very extreme. The +- 1.96*sqrt[(.5*.5)/n] assumes a normal two sided distribution. However, if you do the calculation, and your CI ends up being p(pass) = 93% +- 12%, it's clearly not approximately normal in nature. In this case, it's a very special case, since Upchurch's p(pass) is 100%. (Although I have known a software developer who did indeed say, "Yes, there are bugs in my code, but we're promoting into production anyway.")
But I digress. That was a statistical and theoretical argument. For quickie general BizNez calculations, p +- 1.96*sqrt[(.5*.5)/n] is a great back of the envelope number. :)
T'ai Chi
4th May 2007, 05:56 PM
I've always been bothered by this calculation when the estimated p is very extreme. The +- 1.96*sqrt[(.5*.5)/n] assumes a normal two sided distribution. However, if you do the calculation, and your CI ends up being p(pass) = 93% +- 12%, it's clearly not approximately normal in nature. In this case, it's a very special case, since Upchurch's p(pass) is 100%. (Although I have known a software developer who did indeed say, "Yes, there are bugs in my code, but we're promoting into production anyway.")
In that case there are some options. One can go with the CI of [81, 100]. One can also use a Wilson estimator.
That is, instead of using p = x/n, use p = (x+2)/(n+4).
This moves p away from extreme cases of p=0 or p=100.
And then confidence intervals would be calculated as
p +- 1.96*sqrt[p(1-p) / (n+4)]
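As a sketch, here is that adjusted interval for the case of 10 courses tested and 10 passing. This "add 2 successes and 4 trials" adjustment is often called the plus-four or Agresti-Coull estimate in textbooks. Clipping the result to the range 0 to 1 is an extra step added here, not part of the formula as stated.

```python
import math


def plus_four_interval(x, n, z=1.96):
    """Adjusted interval: p = (x + 2) / (n + 4),
    then p +- z * sqrt(p * (1 - p) / (n + 4)).

    The result is clipped to [0, 1] (an addition for this sketch).
    """
    p = (x + 2) / (n + 4)
    half_width = z * math.sqrt(p * (1.0 - p) / (n + 4))
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = plus_four_interval(x=10, n=10)
print(f"pass rate between {low:.1%} and {high:.1%}")
```

With 10 out of 10 the adjusted estimate is 12/14, so the interval no longer collapses to a single point at 100%.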
CapelDodger
4th May 2007, 05:59 PM
(Although I have known a software developer who did indeed say, "Yes, there are bugs in my code, but we're promoting into production anyway.")
You find this worth remarking on why?
"Is it good?"
"It's good."
"How good?"
"You won't get fired."
"Good enough, let's hit the streets."
(We sometimes lied about the not getting fired; it was usually personal, not mindlessly vicious.)
"Does it tip over if you swerve suddenly?"
"Best not go there."
"OK, we'll downplay stability."
"Don't. Go. There."
"Forget I asked."
Yeah, right, like we're gonna get sued instead. We know where emails go when they die.
JoeTheJuggler
4th May 2007, 06:35 PM
Are these courses being used? Why not just give users the ability to report any malfunction (or even a simple "everything's OK" when they get to the end)? Then your client will know what he really wants to know, and if there are problems, you'll have some idea of where to track them down.
Or is there a way the course can fail and the user not be aware of it?
Baron Samedi
4th May 2007, 06:45 PM
You find this worth remarking on why?
"Is it good?"
"It's good."
"How good?"
"You won't get fired."
"Good enough, let's hit the streets."
(We sometimes lied about the not getting fired; it was usually personal, not mindlessly vicious.)
"Does it tip over if you swerve suddenly?"
"Best not go there."
"OK, we'll downplay stability."
"Don't. Go. There."
"Forget I asked."
Yeah, right, like we're gonna get sued instead. We know where emails go when they die.
Heh. But in that case, you blame the user for pushing the software past its designed specifications, no? :D
Baron Samedi
4th May 2007, 06:55 PM
In that case there are some options. One can go with the CI of [81, 100]. One can also use a Wilson estimator.
That is, instead of using p = x/n, use p = (x+2)/(n+4).
This moves p away from extreme cases of p=0 or p=100.
And then confidence intervals would be calculated as
p +- 1.96*sqrt[p(1-p) / (n+4)]
Oooh, that's awesome. I've just been cheating and using p=(x+-0.5)/n as my estimate and just claiming to be adjusting due to continuity correction. I'll have to try Wilson instead.
And for the CI, that's my point. Normal means symmetrical. Even by truncating to [81, 100], you've violated the assumption of normality, or even a 95% CI. If p(x<81%)=0.025, then this is a 97.5% CI. So really, your CI could be tighter and you can wow the pants out of people even more so. :)
autumn1971
4th May 2007, 11:57 PM
If you can't wow them with facts, dazzle them with BS.
I have nothing to add to the meat of the OP, but a quote from one of Robert Asprin's "Myth" series is "If you can't dazzle them with dexterity, baffle them with bull****", attributed to Harold Hill, of "The Music Man".
© 2001-2008, James Randi Educational Foundation. All Rights Reserved.