View Full Version : Statistical significance


Gr8wight
13th December 2006, 08:23 AM
Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment. I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

drkitten
13th December 2006, 08:47 AM
Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment. I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

Will you settle for me just trying to explain? It's easier than finding a well-written web site.

Every experiment is really a comparison between two hypotheses -- an "experimental" hypothesis and a "null" hypothesis. For example, my experimental hypothesis might be that I can predict coin flips with better than 50% accuracy, while my "null" hypothesis would then be that I can't.

So if I flip a coin eight times, the null hypothesis would say that the most likely number of correct predictions would be four (which has about a 22% chance of happening). [ETA: make that a 27% chance, 'cause I can't read statistical tables. "A mind is a terrible thing to waste." Illiteracy bites. Don't be a fool, stay in school. There once was a lass from Nantucket....]

However, the actual number of correct predictions that I get might be anywhere in the range. I might get all eight right by chance (with a probability of about 0.4%), I might get all eight wrong (with the same chance), or I might get seven right (with a probability of about 3%) or six right (probability about 11%, if I did the math right). So if you graph the number right on the x-axis and the probability of getting exactly that many right on the y-axis, you see a standard bell curve.

Now, remember the two hypotheses? There are correspondingly two types of errors. A type I error is where you believe the experimental hypothesis is true when it isn't (incorrectly reject the null hypothesis). A type II error is where you incorrectly accept the null hypothesis. We can't say anything about the probability that the experimental hypothesis is correct -- but we can say something about the probability that you would have gotten at least as good results as you did if the null hypothesis were true.

If the null hypothesis were true, the chance of my getting eight of eight is less than 1%, which is pretty unlikely -- therefore, my chance of a type I error is less than 1%. If the null hypothesis were true, my chance of getting at least six out of eight is about 15%.
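
For anyone who wants to check the figures above without a statistical table, here is a minimal sketch. It assumes a fair coin and independent guesses; the function names `prob_exactly` and `prob_at_least` are made up for this illustration.

```python
from math import comb

N_FLIPS = 8  # number of coin flips in the example


def prob_exactly(k, n=N_FLIPS):
    """Chance of exactly k correct guesses in n flips if the null hypothesis is true."""
    return comb(n, k) / 2 ** n


def prob_at_least(k, n=N_FLIPS):
    """Chance of k or more correct guesses in n flips if the null hypothesis is true."""
    return sum(prob_exactly(j, n) for j in range(k, n + 1))


# One line per possible score, then the "at least six" tail used in the text.
for k in range(N_FLIPS + 1):
    print(f"exactly {k} right: {prob_exactly(k):.1%}")
print(f"at least 6 right: {prob_at_least(6):.1%}")
```

The "at least" version is the one that matters for significance, since it counts every result as good as or better than the one observed.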

"Statistical significance" represents the arbitrary cutoff where we consider the chance of a type I error to be acceptably small. Usually that's set at a 5% level, but it can sometimes be set higher or lower depending upon the circumstances.

andyandy
13th December 2006, 08:50 AM
edit.

dr kitten's was better.

drkitten
13th December 2006, 08:58 AM
a basic intro....

http://www.surveysystem.com/signif.htm

A basic and utterly misleading intro.


Significance levels show you how likely a result is due to chance.

No, they don't.


The most common level, used to mean something is good enough to be believed, is .95. This means that the finding has a 95% chance of being true.

No, it doesn't.


No statistical package will show you "95%" or ".95" to indicate this level. Instead it will show you ".05," meaning that the finding has a five percent (.05) chance of not being true,

No, it doesn't. It means that the finding has a five percent chance of being observed, if the experimental hypothesis happens to be false.

How the hell can a "finding" be true or false in the first place? It's the number that came out of my observation equipment. If I flipped eight coins and got five right, are you suggesting that I'm lying and I didn't actually get five right?

joobz
13th December 2006, 09:04 AM
So if I flip a coin eight times, the null hypothesis would say that the most likely number of correct predictions would be four. (which has about a 22% chance of happening).
Isn't this a 27% chance of happening?
getting either 5 or 3 right would be 22%, right?

andyandy
13th December 2006, 09:04 AM
A basic and utterly misleading intro.



No, they don't.



No, it doesn't.



No, it doesn't. It means that the finding has a five percent chance of being observed, if the experimental hypothesis happens to be false.

How the hell can a "finding" be true or false in the first place? It's the number that came out of my observation equipment. If I flipped eight coins and got five right, are you suggesting that I'm lying and I didn't actually get five right?

oh you've replied to it - i just edited it out :)

Mercutio
13th December 2006, 09:07 AM
Very nice, drk.

Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")

drkitten
13th December 2006, 09:13 AM
Isn't this a 27% chance of happening?
getting either 5 or 3 right would be 22%, right?

Did I read from the wrong line of the table? Wouldn't be the first time.

drkitten
13th December 2006, 09:22 AM
Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")

Well, thank you. I don't know if my "few words" will address what you consider the important issues, but you're welcome to explain the areas I miss here.

The basic problem is that of proving a negative. I can't prove that leprechauns don't exist until I've looked everywhere in the universe, including in orbit around distant stars, for them. Similarly, I can't even say that they don't exist in my garden -- since they might just be invisible, or too small for me to see, or something like that. Leprechauns might exist but not be detectable by the instrument I'm using.

So say we're testing a new drug. There are two possibilities -- either it does something, or it doesn't. I can't prove that it doesn't do anything, since it might do something that I'm not equipped to detect. But I have a pretty good idea of what "not doing something" would look like.

So the idea is that I set up two hypotheses. My "experimental" hypothesis is that it does something, and my "null" hypothesis is that it doesn't. (it more or less has to be set up this way, because of this detection issue.) From a standpoint of drug performance -- and my future career as a research pharmacologist -- the best I can hope for is clear and convincing evidence that people treated with this drug differ from people left untreated. That will be strong evidence against the null hypothesis, and therefore for my experimental hypothesis. I can "reject the null hypothesis."

But let's say I'm unlucky and I don't notice any difference? Does that mean that my drug doesn't do anything? Of course not. It just means that I didn't notice any difference. So I don't have any evidence for my experimental hypothesis, and I have "failed to reject the null hypothesis." But I haven't proven it. Someone else might come along with more sensitive detectors or more people and find that my drug actually works.....

You can see this in the coin flips example. If all I claim is to be "psychic," that just means I can predict with higher than 50% accuracy. But what counts as "higher"? 100%? Obviously. Eight out of eight is definitely significant. 75%? Six out of eight isn't really "significant," but it's noticeable. 55%? That's probably not going to show up as significant on a test as small as eight flips. 50.0001%? That will get lost in the noise on any reasonable-sized experiment. So I might "fail" the test of "psychic," not because I'm not psychic, but because I'm not strongly enough psychic for anyone to care.
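
The point about weak effects getting lost can be made concrete with a rough sketch. It assumes a one-sided test at the usual 5% level and independent flips; `tail` and `critical_count` are names invented here, and the "true accuracy" values are just the ones mentioned above.

```python
from math import comb


def tail(k, n, p):
    """Probability of k or more successes in n trials with success chance p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def critical_count(n, alpha=0.05):
    """Smallest number of hits whose chance under pure guessing is at most alpha."""
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return k
    return None  # no outcome is rare enough (this happens for very small n)


n = 8
k_needed = critical_count(n)
print(f"hits needed out of {n}: {k_needed}")

# How often would someone with a given true accuracy actually pass the test?
for true_accuracy in (0.55, 0.75, 1.0):
    chance_to_pass = tail(k_needed, n, true_accuracy)
    print(f"true accuracy {true_accuracy:.0%}: passes with probability {chance_to_pass:.1%}")
```

Under these assumptions, even a guesser who really is right 75% of the time clears the bar well under half the time on eight flips, which is why a failed small test says so little.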

Ben Tilly
13th December 2006, 02:11 PM
Very nice, drk.

Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")

Well I'll say some things, but they won't be what you're expecting. ;)

The problem is that people always want to solve an impossible problem. So statisticians have come up with a problem they can solve which sounds enough like the impossible problem that we can't solve that people are happy. And then statisticians get annoyed when people mistake the one for the other.

Allow me to expand.

The problem that people want to answer is, "What is the probability that this is true?" Well anyone who is familiar with Bayes' Theorem can tell you that this doesn't have a well-defined answer - it depends on how likely you thought it was before you did your experiment. But people don't like that answer much.

So we say, "Well let's figure out the probability of getting a result this weird under the null hypothesis. If that is low enough, then we'll reject the null hypothesis." Now this question is something we can answer, it is well-defined, and people are happy to use the answer.

The problem is that people insist on mistaking the second question for the first. And that is where statisticians get annoyed. Because they aren't the same thing at all. Or worse yet, people want a concrete answer of the form, "I see that this is better than that, with 95% probability, how much better is it?" Which again is the impossible question. No matter how much you explain it to them, people want the simple answer, and want to state it in the simple way.
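
A small sketch of why the first question has no single answer. Everything specific here is an assumption made for illustration: the result (7 hits in 8 flips), a "psychic" who is right 75% of the time, and the three priors. The function names are hypothetical too.

```python
from math import comb


def likelihood(hits, n, accuracy):
    """Chance of exactly `hits` correct calls in n flips at the given accuracy."""
    return comb(n, hits) * accuracy**hits * (1 - accuracy) ** (n - hits)


def posterior_psychic(prior, hits=7, n=8, psychic_accuracy=0.75):
    """Posterior probability of the 'psychic' hypothesis via Bayes' theorem."""
    like_psychic = likelihood(hits, n, psychic_accuracy)
    like_null = likelihood(hits, n, 0.5)
    evidence = prior * like_psychic + (1 - prior) * like_null
    return prior * like_psychic / evidence


# Same data every time; only the belief held before the experiment changes.
for prior in (0.5, 0.01, 0.000001):
    print(f"prior {prior}: posterior {posterior_psychic(prior):.6f}")
```

The p value for 7 of 8 is the same number no matter who computes it, while the posterior swings with the prior. That is the gap between the question people ask and the question the test answers.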

So to keep the statisticians happy, all we need to do is just get it straight and keep it straight? Well, that depends on the statistician. You see, there is a debate among statisticians about whether or not the standard procedure makes much sense at all. Bayesians like to bring up cases like the following one.

Suppose we know that a couple planned to have children until they had both a son and a daughter. They have 7 sons in a row, then a daughter. At a 95% confidence level, should we reject the hypothesis that they are equally likely to have sons or daughters? (*) Well the null hypothesis is equal probabilities, under which a result this strange or stranger requires a string of 7 boys or 7 girls in the first 7 kids, which will happen 1 time in 2^6, or 1.5625% of the time. So at the 95% confidence level (even at a 98% confidence level) we'd reject the null hypothesis.

Now let's change the problem. Suppose they just were going to have 8 kids. What then? Well the odds of a result that odd are the odds of 7 boys and 1 girl (happens 8 ways), or 7 girls and 1 boy (happens 8 ways) or 8 boys (1 way) or 8 girls (1 way). So there are 18/2^8 ways in which we could get a result this odd, which has probability 7.03125%. So at the 95% confidence level we should not reject the hypothesis.
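
The two calculations can be checked with exact fractions. This is only a sketch of the counting described above, assuming sons and daughters are equally likely under the null; the variable names are made up.

```python
from fractions import Fraction
from math import comb

# Design 1: keep having children until there is both a son and a daughter.
# A result as odd or odder than "7 sons, then a daughter" means the first
# 7 children are all the same sex: 2 sequences out of 2**7.
p_stop_rule = Fraction(2, 2**7)

# Design 2: exactly 8 children, decided in advance.
# As odd or odder means a 7-1 or 8-0 split in either direction.
n = 8
extreme_outcomes = 2 * (comb(n, 7) + comb(n, 8))
p_fixed_n = Fraction(extreme_outcomes, 2**n)

alpha = Fraction(5, 100)
print(f"stop at first mixed pair: p = {float(p_stop_rule):.5f}, reject: {p_stop_rule < alpha}")
print(f"fixed at 8 children:      p = {float(p_fixed_n):.5f}, reject: {p_fixed_n < alpha}")
```

Same eight births, different verdicts, and the only thing that changed is what the couple intended to do.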

But according to Bayes' theorem, no matter what prior probabilities you assign, your posterior probabilities will not depend on the knowledge that they were going for 8 kids or both a boy and a girl. In any valid system of inference that piece of data is a red herring that should not make a difference. Therefore standard statistical methods lead to nonsensical results.

In the real world the Bayesians lose for two reasons. First, everyone is used to the standard solution. And second, Bayesian alternatives to the standard methods are far more complex to understand and explain.

Cheers,
Ben

* In fact this hypothesis is generally wrong. Population statistics demonstrate that there is a small but significant bias towards having sons rather than daughters.

bpesta22
13th December 2006, 07:04 PM
Let me try too!

Suppose you claim you know the difference between pepsi and coke, just by taste.

I'm skeptical, so we set up an experiment to see if you can do it.

But what does "know the difference between" really mean?

For example, would you have to have 100% accuracy for me to believe that you know the difference?

Likely not.

What about 75% accuracy?

That probably still would support your claim (you can tell the difference with better than chance accuracy).

What about 55% accuracy-- less impressive, but still might be better than chance guessing.

So, what hypothesis should I test: That you know the difference with 100% or 75% or 55% accuracy?

We dunno, so we don't actually test any of these "experimental hypotheses" precisely because we likely don't know what their true value is (and if we did know the true value, well, then we wouldn't need to do the experiment).

Instead, we always test the null hypothesis, which is typically the exact opposite of what we're interested in testing.

The null here would be: You can't tell pepsi from coke.

This is the default; we're going to assume it's true unless we have evidence suggesting it's not.

Why bother testing the null? Because we know exactly what your performance should be if the null were true. If you DON'T know the difference between pepsi and coke, then you should perform at 50% accuracy in any taste test.

Even if your taste buds were dead, if I gave you an unlabeled glass and had you pick pepsi or coke, you would be right 50% of the time by chance alone.

What we're looking for is statistical evidence that lets us reject the null (thereby accepting the alternate, that you do know the difference between pepsi and coke).

So, we expect to get 50% accuracy, given the null. We do the taste test and record your actual accuracy rate (suppose it is 67%).

The question then becomes: Is 67% actual performance different enough from 50% expected performance for us to reject the null (i.e., and conclude that you can indeed tell the difference).

To answer the question, we have to draw a line in the sand. This is called our alpha value, and the convention in science is to set it at .05.

Assuming the null is true, the probability of getting 67% accuracy or better when we expect 50% has to be less than .05 for us to reject the null (your performance has to be so improbably better than 50% accuracy that the only reasonable conclusion is that the null is false here and you indeed can tell the difference between pepsi and coke).

In this simple scenario, whether we reject the null or not depends on how many trials we gave you.

Assume it was just 3 taste trials-- you got 2 right and one wrong.

The probability of getting 2 right just by guessing (as would be the case if the null were indeed true) is 3/8 or about .375.

Since .375 (observed) is greater than .05 (our alpha value) we cannot reject the null. We don't have enough statistical evidence to rule out chance guessing and so we haven't proved that you can tell the difference.

If you achieved 67% accuracy over 50 trials, however, the actual p value would be some number much lower than .05. So, 67% accuracy here would lead us to reject the null because it was based on N=50 here (whereas the exact same accuracy with N=3 led us to not reject the null).
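
Here is a minimal sketch of the two cases, assuming independent trials and a one-sided test. Two caveats: 67% of 50 is not a whole number, so 34 correct out of 50 is used as a stand-in, and a p value strictly counts results at least as good as the one observed, so for 2 of 3 it also includes 3 of 3. The function name `p_value` is made up for this example.

```python
from math import comb


def p_value(hits, trials):
    """One-sided p value: chance of at least `hits` correct in `trials` pure guesses."""
    favourable = sum(comb(trials, j) for j in range(hits, trials + 1))
    return favourable / 2**trials


ALPHA = 0.05  # the conventional line in the sand

for hits, trials in ((2, 3), (34, 50)):
    p = p_value(hits, trials)
    verdict = "reject the null" if p < ALPHA else "cannot reject the null"
    print(f"{hits}/{trials} correct: p = {p:.4f} -> {verdict}")
```

Counting the "at least" tail changes the number for the 3-trial case but not the conclusion: roughly the same accuracy fails to reject at N=3 and rejects at N=50.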

So, rejecting the null is reaching the conclusion that: the difference between actual and expected values is too big to be due to chance (i.e., the null being true) and is therefore hopefully due to the experimental manipulation (to the extent the experiment possesses internal validity).

Note that in the N=3 trials case, we didn't really offer a fair test. We need more trials than 3 to fairly test you. Not rejecting the null here was likely due to our lack of statistical power.

If you achieved 50% accuracy over 50 trials, then our not rejecting the null here would be more convincing (the test was fair and reasonably powerful, yet you failed).

Mercutio
13th December 2006, 08:28 PM
Let me try too!
[snip]
So, rejecting the null is reaching the conclusion that: the difference between actual and expected values is too big to be due to chance (i.e., the null being true) and is therefore hopefully due to the experimental manipulation (to the extent the experiment possesses internal validity).


Lemme add a bit more, and put my own neck on the chopping block for the other statisticians to take a whack at.

With the null and alternative hypotheses, we have the option "***** happens" (which is the null hypothesis--nothing but random) and the option "something happened in addition to *****" (the alternative hypothesis). Sadly, when we reject the null (which is either a hit, or type I error, and we can never know which), all we are left with is "something in addition..." We don't know, to borrow Pest's example, whether our subject has 55%, 75%, 62.377639103% psychic abilities, or what. And Pest's parenthetical warning is key--"something in addition..." can mean cheating instead of psychic ability.

ETA--how strange that we get 5 asterisks for a 4-letter word...

joobz
13th December 2006, 09:42 PM
With all of that said, it simply means you can't substitute logic for statistics. If you don't understand at least the foundations of what you are studying, no amount of statistical treatments will help you.

If I had one suggestion, though: anyone who's going to plan a large set of studies, talk to a statistician. They really can save you some time and effort. Experimental design is a very important step, but you'd be surprised how often it's not performed.

Gr8wight
13th December 2006, 10:35 PM
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:
On a 0-to-10 scale, children in the guided-imagery group had an average post-pain intervention score of 4.3, a point lower than children in the control group. While the difference was not statistically significant, Schmidt believes it is "clinically" significant.

"If it works for you, and it reduced your pain by one point or two points, isn't it worth it?" she asked.
I understand why that is bull(four asterisks), but I was having trouble putting into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.

JoeTheJuggler
13th December 2006, 11:29 PM
I think this stuff finally clicked for me visually.

You've got a number from your experiment, but you don't know if it is from the distribution curve described by the null hypothesis or not. Generally, if it falls way out in the skinny little tail (one tail or both, depending on the experimental model), we say it probably was not from that distribution.

We suppose then that the measurement is from another distribution curve that is described by the effect we "wanted" (predicted by the hypothesis).

If it falls in the fat part (of the distribution predicted by the null hypothesis)--well, then we can't say that the measurement probably doesn't belong to the null hypothesis distribution curve.

Really--it's MUCH clearer with pictures!

Dustin Kesselberg
13th December 2006, 11:33 PM
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons of why this argument is flawed....?


If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50. The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly probability isn't calculated that way and secondly it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.


Can anyone refresh my memory?

69dodge
14th December 2006, 06:38 AM
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:
On a 0-to-10 scale, children in the guided-imagery group had an average post-pain intervention score of 4.3, a point lower than children in the control group. While the difference was not statistically significant, Schmidt believes it is "clinically" significant.

"If it works for you, and it reduced your pain by one point or two points, isn't it worth it?" she asked.

I understand why that is bull(four asterisks), but I was having trouble putting into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.

Yes, of course, if the guided imagery will reduce my pain by one or two points, that could certainly be worth it, depending on how much pain one or two points is. (Calling the reduction in pain "clinically significant" means that it's quite a noticeable amount.)

But if the difference in the experiment wasn't statistically significant, that means that the results of the experiment don't give me very much reason to believe that guided imagery will in fact reduce my pain. Even if it were totally ineffective in reducing pain, some children would presumably end up with somewhat less pain than others, due to unknown factors unrelated to the treatment: in this experiment, it turned out to be the ones in the guided-imagery group; in the next, it might turn out to be the ones in the control group.

While it's true that statistical significance isn't very important without clinical significance, clinical significance is meaningless without statistical significance, because without statistical significance, there's not much reason to believe that the clinical significance will continue to be present in the future.

Jeff Corey
14th December 2006, 07:02 AM
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:

I understand why that is bull(four asterisks), but I was having trouble putting into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.

It would help if you could provide more details or a link to the article. Were the subjects randomly assigned to the treatment vs. no treatment groups? Was the pain rating on the typical 0 to 10 scale? How many subjects?
But, in any case, no competent scientist would ever state that statistically insignificant results, even at the less rigorous level of .05, were clinically significant.
Quite often, the opposite is true. With a large number of subjects, statistical significance can be obtained with trivial effects.
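
To see how sample size alone can produce significance, here is a rough sketch. It reuses the thread's guessing example rather than the pain study, and every number in it is an assumption: a trivial 51% hit rate, two made-up sample sizes, and a normal approximation to the binomial. The function name is hypothetical.

```python
from math import erfc, sqrt


def one_sided_p(hits, trials, chance=0.5):
    """Normal-approximation p value for scoring at least `hits` out of `trials`."""
    observed = hits / trials
    std_error = sqrt(chance * (1 - chance) / trials)
    z = (observed - chance) / std_error
    return 0.5 * erfc(z / sqrt(2))  # upper-tail area of the standard normal


# The same 51% hit rate at two very different sample sizes.
print(f"51 of 100:            p = {one_sided_p(51, 100):.3f}")
print(f"510,000 of 1,000,000: p = {one_sided_p(510_000, 1_000_000):.3g}")
```

The effect is identical in both lines; only the number of trials differs, yet one result is nowhere near significant and the other is overwhelmingly so.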

Jeff Corey
14th December 2006, 07:08 AM
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons of why this argument is flawed....?


If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50. The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly probability isn't calculated that way and secondly it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.


Can anyone refresh my memory?

Just because there are two possibilities doesn't make them equally likely.
"If I am walking out to my car, the chances of me being struck by lightning are 50% because either I get struck by lightning or I don't."

Beth
14th December 2006, 07:09 AM
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons of why this argument is flawed....?


If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50. The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly probability isn't calculated that way and secondly it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.



Can anyone refresh my memory?

Just because there are only two possibilities doesn't imply that the two possibilities are equally likely i.e. 50/50.

Probabilities are only additive when the events are mutually exclusive. If they are not, then you have to subtract the probability of the intersection of the two events.

eta: I see Jeff beat me to the response.

Beth
14th December 2006, 07:12 AM
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:

I understand why that is bull(four asterisks), but I was having trouble putting into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.


One thing to keep in mind is the sample size. If the sample size is small, the effect may well be real even though it isn't statistically significant.

Gr8wight
14th December 2006, 07:17 AM
It would help if you could provide more details or a link to the article. Were the subjects randomly assigned to the treatment vs. no treatment groups? Was the pain rating on the typical 0 to 10 scale? How many subjects?
But, in any case, no competent scientist would ever state that statistically insignificant results, even at the less rigorous level of .05, were clinically significant.
Quite often, the opposite is true. With a large number of subjects, statistical significance can be obtained with trivial effects.

Hi Jeff,

I didn't link to the article because that quote I supplied is all they say about the actual study that was allegedly undertaken. Pardon me, other than to say that the study group was a measly 17 children.

Here is the link: http://www.livescience.com/healthday/535689.html

Jekyll
14th December 2006, 07:18 AM
If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50.

This bit is wrong. According to this argument there is a 50:50 chance of throwing a 6 with a normal die because either it will happen or it won't.

The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

You multiply independent probabilities rather than adding them, so if the chance of not having a crash as you drove on the road between the shop and your house was 1/2, the chance of not having a crash on the way there or back is 1/4.
So you'd have a 75% chance of crashing on the way there or back if the 50% figure was correct.

Entertainingly, you misstated the conclusion which probably should have been "there is a 100% chance that the car will crash on the trip." Your conclusion that afterwards you will "either have a wreck or not" is actually correct.

Edit: Abridged version: I agree with Beth and Jeff.

Rodney
14th December 2006, 07:19 AM
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons of why this argument is flawed....?


If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50. The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly probability isn't calculated that way and secondly it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.


Can anyone refresh my memory?
What some people reason is that, after you are back from the store (assuming you ever get back ;)), pre-trip probabilities are meaningless. Either you had a wreck or you did not. However, you could logically conclude that the pre-trip probability of a wreck on the way to and from the store is 50% only if an analysis that factors in all known variables (such as prior number of wrecks on the way to and from the store, traffic conditions, weather conditions, etc.) indicates that there is an equal chance of having a wreck or not having a wreck. If it does, you should consider walking to the store or staying home . . .

69dodge
14th December 2006, 07:22 AM
One thing to keep in mind is the sample size. If the sample size is small, the effect may well be real even though it isn't statistically significant.

Sure. But the experiment in question doesn't provide much evidence in favor of its reality. Nor, of course, against its reality. A small experiment just isn't too informative either way, and we're left more or less where we started.

fls
14th December 2006, 07:36 AM
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons of why this argument is flawed....?

If I get into my car and drive to the store the chances of me having a wreck are 50% because either I have a wreck or I don't so it's 50/50. The chances of me having a wreck on the way home is also 50% either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly probability isn't calculated that way and secondly it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.

Can anyone refresh my memory?

You can use "either you will get into a wreck or you won't", but you actually have to figure out the frequency with which you get into wrecks and the frequency with which you won't.

If you want to combine the chance from the trips to and from the store, you are looking at "the chance of having a wreck on the way to the store and not on the way back" or "the chance of having a wreck on the way back from the store and not on the way there" or "both on the way there and on the way back". The "or" is a good indication that the probabilities are additive. So you need to figure out the probability of each scenario and add them up.

Let's look at the chance of having a wreck on the way to the store and not on the way back. The use of "and" is a good indication that the probabilities are multiplicative. So you have the chance of having a wreck (50% using your example) times the chance of not having a wreck (50%), which comes out to 25%. The chance for each of the other two scenarios is also 25%. Adding it all together, you have a 75% chance that you will have at least one wreck on the way to and from the store (or a 50% chance of only one wreck).

There is an easier way to solve this particular problem (1 - (the chance of not having a wreck on the way to the store and not having a wreck on the way back)), but I did it this way to illustrate the difference between when you add probabilities and when you multiply probabilities.
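
The arithmetic above, written out as a short sketch. It keeps the deliberately silly 50% per-trip figure from the question and assumes the two trips are independent; the variable names are made up.

```python
P_WRECK = 0.5  # the per-trip figure from the question, not a realistic one

p_safe = 1 - P_WRECK

# "and" -> multiply (the two trips are assumed independent)
wreck_there_only = P_WRECK * p_safe
wreck_back_only = p_safe * P_WRECK
wreck_both = P_WRECK * P_WRECK

# "or" between mutually exclusive scenarios -> add
at_least_one = wreck_there_only + wreck_back_only + wreck_both
exactly_one = wreck_there_only + wreck_back_only

# The shortcut: everything except "no wreck on either trip"
shortcut = 1 - p_safe * p_safe

print(f"at least one wreck: {at_least_one:.0%}")
print(f"exactly one wreck:  {exactly_one:.0%}")
print(f"shortcut agrees:    {shortcut == at_least_one}")
```

Changing `P_WRECK` to something more believable shows how quickly the combined risk shrinks, while the add-versus-multiply logic stays the same.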

Linda

lenny
26th December 2006, 05:29 PM
Every experiment is really a comparison between two hypotheses -- an "experimental" hypothesis and a "null" hypothesis.

really? every experiment? how does that work if i am looking for a number, say if i am measuring the speed of light (before it was set equal to 1, of course)?

or if i am a Bayesian estimating how far my car will go on a tank of gas, and end up with a posterior distribution?

that is not to say that there aren't many cases where the dual hypothesis structure is extremely useful! but the original post also asked:

...what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

lenny
3rd January 2007, 05:08 PM
You can use "either you will get into a wreck or you won't", but you actually have to figure out the frequency with which you get into wrecks and the frequency with which you won't.

Linda
so how do i do that on My first trip to the store?

fls
3rd January 2007, 05:15 PM
so how do i do that on My first trip to the store?

Guess.

Linda

zooloo
3rd January 2007, 05:41 PM
Assume it was just 3 taste trials-- you got 2 right and one wrong.

The probability of getting 2 right just by guessing (as would be the case if the null were indeed true) is 3/8 or about .24

How do you decide the figure 3/8 please?

fls
3rd January 2007, 05:57 PM
Quote:
Assume it was just 3 taste trials-- you got 2 right and one wrong.

The probability of getting 2 right just by guessing (as would be the case if the null were indeed true) is 3/8 or about .24
How do you decide the figure 3/8 please?

There are 2 possible answers for each taste test. Three taste tests gives you 8 possible outcomes (2x2x2). Three of those outcomes involve two correct guesses and one wrong guess.

Linda

lenny
3rd January 2007, 05:59 PM
Guess.

geez. what is the point of a discussion board when someone finds a good one word answer...

Elaedith
3rd January 2007, 06:17 PM
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:

I understand why that is bull(four asterisks), but I was having trouble putting into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.


The silliness in the quote is that the researcher is implying that everyone will get a reduction of one or two points from the method, even though the lack of statistical significance means that the probability of an effect of that size occurring when chance alone is operating is unacceptably high. It's almost as though the researcher thinks that a statistically significant effect is just a bigger effect than the one they got but that this doesn't matter as long as the effect is big enough to be useful.

Also, 'one or two points' is meaningless as a measure of effect size. I'm sure somebody will correct me, but I think that the difference between the means should be divided by the average standard deviation to get a meaningful measure of effect size.
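One common way to do this is Cohen's d, which divides the difference in means by a pooled standard deviation. The sketch below assumes that definition (the pooled SD is one choice among several), and the pain scores are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(treated, control):
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    s1, s2 = stdev(treated), stdev(control)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treated) - mean(control)) / pooled

# Hypothetical 0-10 pain scores: both pairs of groups differ by one point on average.
tight_treated, tight_control = [4, 5, 6, 5, 4, 6], [6, 5, 7, 6, 5, 7]
noisy_treated, noisy_control = [1, 5, 9, 5, 1, 9], [6, 2, 10, 6, 2, 10]

d_tight = cohens_d(tight_treated, tight_control)
d_noisy = cohens_d(noisy_treated, noisy_control)
print(round(d_tight, 3), round(d_noisy, 3))
```

The same one-point drop is a large standardized effect when scores barely vary and a small one when they vary a lot, which is why "one or two points" on its own says little.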

Neither 'clinical' nor statistical significance matters if the study wasn't properly conducted, and there isn't any information about what was done to the control group.

lenny
3rd January 2007, 06:28 PM
A small experiment just isn't too informative either way, and we're left more or less where we started.

agreed, but is this not, arguably, a circular definition of "small"?

a not-small experiment might only consist of one photographic plate, with a few stars in the "wrong" place, even in those rare cases where one might believe that "Every experiment is really a comparison between two hypotheses -- an "experimental" hypothesis and a "null" hypothesis"

Jarom
3rd January 2007, 10:07 PM
The probability of getting 2 right just by guessing (as would be the case if the null were indeed true) is 3/8 or about .24

Sorry, kept expecting someone else to correct here, but I guess I will. 3/8 is actually 0.375, not about 0.24. This makes a bit of difference if you're eyeballing the numbers to make a decision.

Good explanation, though, bpesta.

zooloo
4th January 2007, 03:56 AM
Sorry, kept expecting someone else to correct here, but I guess I will. 3/8 is actually 0.375, not about 0.24. This makes a bit of difference if you're eyeballing the numbers to make a decision.

Good explanation, though, bpesta.

Thank you for answering my second question.

Also thank you fls for explaining the 3/8 to me.

What a lovely forum :)

Cheers

bpesta22
4th January 2007, 08:26 AM
Thank you for answering my second question.

Also thank you fls for explaining the 3/8 to me.

What a lovely forum :)

Cheers

Whoops, my bad on the 3/8 = .24 thing. It was definitely a mistake, but I did leave a bit out of the explanation for teaching purposes. I think the probability you would use to evaluate the null here would actually be .50.

In other words, guessing 2 of 3 right would result in us concluding only 50% accuracy for evaluating the null.

Here's all possibilities for three taste trials (T= true, you got it right; F=False, you got it wrong):

ttt
ttf
tft
tff
ftt
ftf
fft
fff

Three of them have exactly 2 right and 1 wrong (which gives the 3/8 probability), but we need to actually calculate the probability of performing at "2 out of 3 correct OR better" to properly test the null.

So, for binomial tests, we also have to factor in not only the P of the subject's actual performance, but the sum of all Ps for performance even better than that.

Since there are 4 ways where the subject can perform at 2 out of 3 right or better, the observed probability would be .50 (4/8).

Since the .50 is greater than the .05 alpha level, we would not reject the null.
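The counting above can be checked by brute force. This is just a sketch of the enumeration for three taste trials under the guessing null; the variable names are illustrative.

```python
from itertools import product

# All sequences of three trials: t = got it right, f = got it wrong.
outcomes = ["".join(seq) for seq in product("tf", repeat=3)]

exactly_two = [o for o in outcomes if o.count("t") == 2]
two_or_better = [o for o in outcomes if o.count("t") >= 2]

p_exact = len(exactly_two) / len(outcomes)      # 3/8
p_value = len(two_or_better) / len(outcomes)    # 4/8, the one-tailed p-value

alpha = 0.05
reject_null = p_value <= alpha

print(outcomes)
print(p_exact, p_value, reject_null)  # 0.375 0.5 False
```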

If you think of the bell curve, your performance needs to be at the tail end (to the right of whatever alpha level you set). To get where your performance is on the curve, you have to calculate not only the P of your actual performance, but the P values for all outcomes that are even rarer than this.

I left this out of the original because it doesn't help conceptually.

bpesta22
4th January 2007, 08:27 AM
note also that with only 3 trials and alpha =.05, you would never be able to reject the null as perfect performance here would have a probability of .125, which is > alpha.

Moral of the story: Add more trials.
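How many more? A quick sketch, assuming a one-tailed test, a 50/50 guessing null, and a subject who gets every trial right:

```python
alpha = 0.05

def p_perfect(trials):
    """Chance of getting every trial right by pure guessing."""
    return 0.5 ** trials

# Find the smallest number of trials where a perfect score beats alpha.
min_trials = 1
while p_perfect(min_trials) > alpha:
    min_trials += 1

print(p_perfect(3), min_trials, p_perfect(min_trials))  # 0.125 5 0.03125
```

With fewer than five trials even a flawless performance cannot reach the .05 level, and any mistake raises the number needed.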

bpesta22
4th January 2007, 08:35 AM
I'm on a manic roll here, but to further complicate things, the above assumes a one-tailed test (we're testing only better than chance accuracy, not the possibility that the guy could be performing significantly worse than chance).

If we were doing a two tailed test-- which makes little sense here as it's testing whether the guy is either better or worse than chance at detecting coke versus pepsi-- we would have a really strange result with only 3 trials.

2 out of 3 right or better and the opposite (to cover both ends of the tail) would have a 100% probability.

In other words, for the two tailed test and only 3 trials, one is guaranteed to get at least 2 right or better, or at least 2 wrong or worse!

drkitten
4th January 2007, 08:55 AM
If we were doing a two tailed test-- which makes little sense here as it's testing whether the guy is either better or worse than chance at detecting coke versus pepsi-- we would have a really strange result with only 3 trials.

Not really that strange, is it? Perhaps the guy can detect the difference, but can't label it properly. A two-tailed test captures and controls for that possibility. And any time you perform at exactly the midpoint -- or as close to the exact midpoint as the quantization of the data will permit -- you get a 100% result on a two-tailed test. If I flip a hundred coins, I'm guaranteed to get either at least fifty heads or at least fifty tails.
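A small sketch of that two-tailed calculation, assuming a fair-coin null and defining "at least as extreme" as "at least as far from the midpoint as the observed count". The function name is made up for the example.

```python
from math import comb

def two_tailed_p(correct, trials):
    """P(a count at least as far from trials/2 as the observed one), fair-coin null."""
    distance = abs(correct - trials / 2)
    hits = sum(comb(trials, k) for k in range(trials + 1)
               if abs(k - trials / 2) >= distance)
    return hits / 2 ** trials

print(two_tailed_p(2, 3))     # 1.0  -> 2 of 3 is as close to the midpoint as you can get
print(two_tailed_p(3, 3))     # 0.25 -> all right or all wrong
print(two_tailed_p(50, 100))  # 1.0  -> exactly at the midpoint
```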

DevilsAdvocate
5th January 2007, 02:40 AM
Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment. I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

On public television they run some classroom-type shows. There was a great one (made in the 1970s, like the best of them) that had a guy that did a series and explained statistics more clearly than anything I have ever seen. Unfortunately, I can't find the series or any website that even comes close.

You asked about the “+/- range of accuracy”. This is a margin of error. You often hear about this in polls. Our poll for the U.S. presidential election shows Aaron has 60% of the vote and Barry has 40%, with a margin of error of +- 3%. What does that mean?

This means the poll resulted in 60% of the people saying they would vote for Aaron and 40% would vote for Barry. But how accurate is this poll? And how confident are we that the numbers are realistic?

We have to look at sample size. If the poll was based on just 10 people, then we don’t have much confidence in the results. If the poll was based on 10 million people, we would be very confident in the result. The larger the sample size, the larger the confidence that our poll numbers are accurate.

Of course even with a very large sample size, our poll isn’t going to be exact. Even if we poll 10 million people and the results are 60% for Aaron, this doesn’t mean that Aaron will get EXACTLY 60% of the vote.

We have to use a combination of margin of error and a confidence level. We can say with 100% confidence that Aaron will get 60% of the vote with a margin of error of +- 60%. Or you could say that Aaron will get 60% of the vote with a margin of error of +- 0% at a 0% confidence level. Neither means much. It means you could be equally right or wrong.

You can’t be 100% confident that your projected numbers are correct. You have to allow for some margin of error. So the goal is to calculate a margin of error that has a reasonable level of confidence—like 95%. The lower your margin of error, the lower the confidence level.

If your sample size for a U.S. election was only 10 people, you probably couldn’t get a 95% confidence level without having a margin of error somewhere around +- 30%. Which means it would be close to meaningless. If your sample size were 10 million, you would have a low margin of error and a 95% confidence level would be no problem.
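To put rough numbers on this, here is a sketch using the usual normal-approximation formula for a simple random sample, margin = z * sqrt(p(1-p)/n) with z = 1.96 for 95% confidence. That formula is an assumption brought in for illustration (it is crude for a sample as small as 10), and the function names are invented.

```python
from math import ceil, sqrt

Z_95 = 1.96  # normal quantile for a 95% confidence level

def margin_of_error(share, n, z=Z_95):
    """Approximate half-width of the confidence interval for a polled share."""
    return z * sqrt(share * (1 - share) / n)

def sample_size_for(margin, share, z=Z_95):
    """Smallest sample size whose approximate margin is at most `margin`."""
    return ceil(z**2 * share * (1 - share) / margin**2)

for n in (10, 1_000, 10_000_000):
    print(n, round(100 * margin_of_error(0.6, n), 2), "points")

print(sample_size_for(0.03, 0.6))  # about a thousand people for +- 3 points
```

Under these assumptions ten respondents give roughly +- 30 points, which is too wide to be useful, while the +- 3 points quoted in the poll example needs only about a thousand respondents.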

I wish I could explain this better. :(

lenny
6th January 2007, 06:28 PM
You asked about the “+/- range of accuracy”. This is a margin of error. You often hear about this in polls. Our poll for the U.S. presidential election shows Aaron has 60% of the vote and Barry has 40%, with a margin of error of +- 3%. What does that mean?

note it is not just the size of the sample, but whether or not those selected to be in the sample reflect the distribution of the population. you could have a huge sample of people who were all "odd in some way" and get a rather poor "forecast". this fact is often NOT included in the "margin of error" reported in news papers. (and such systematic errors are often hard to avoid a priori).

blutoski
6th January 2007, 08:12 PM
Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment. I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

The other posters have done a good job of describing statistical significance. I'd like to address your example above, and distinguish it from what they were talking about.

Statistical significance is different than errorbars. That's the +/- you were talking about. You get errorbars in two ways: instrument limitations, or sample error.

Instrument limitations make the most sense: your thermometer has lines every 1.0 degrees. So, you can only measure temperature in +/- 0.5 degrees. Instrument error has special rules for when you add or manipulate data: it frequently grows larger when you combine data.

Sample error has to do with assumptions about the sampling and the population being sampled. Usually assumes a uniform sample, in a normalized distribution. Not always, though. Typically, it's ballpark root(N) / N. So, for example, (again, this is just ballpark) if you sample 100 people, your error could be about +/-10%.
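As a sanity check on the ballpark, this sketch compares root(N) / N with the worst-case 95% margin for a proportion (a 50/50 split, normal approximation). The second formula is an assumption added for comparison, not something stated in the post.

```python
from math import sqrt

def rule_of_thumb(n):
    """The ballpark from the post: root(N) / N, i.e. 1 / root(N)."""
    return sqrt(n) / n

def worst_case_95(n, z=1.96):
    """Normal-approximation 95% margin for a proportion at a 50/50 split."""
    return z * 0.5 / sqrt(n)

for n in (100, 400, 1600):
    print(n, rule_of_thumb(n), round(worst_case_95(n), 4))
```

The two agree to within a couple of percent of each other, and both show that quadrupling the sample only halves the error.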

This is why you hear reports of surveys that have two error disclaimers: "Fifty percent of people surveyed prefer Brand X, plus or minus five percent, nineteen times out of twenty."

The first statement is the sampling error; the second is a claim that this survey achieves statistical significance ("nineteen times out of twenty" = "95% confidence interval" = "p<=.05")

Art Vandelay
7th January 2007, 01:07 PM
A statistical test consists of the following steps:
1. Choose a parameter that you want to test.
2. Choose a null hypothesis regarding the parameter.
3. Choose some random variable (presumably, one that has something to do with the parameter).
4. Choose a rejection region for the associated statistic.
5. Calculate the probability, under the null hypothesis, of the statistic falling in the rejection region.
6. Perform an experiment that creates an instantiation of the statistic.
7. Evaluate whether the resulting statistic falls in the rejection region; if so, declare the null to be rejected.

The statistical significance is the probability calculated in step 5. Notice that it is a statement about the experiment, not the data (and should be calculated before you even know what the data is). The statistical significance has a quantitative value: a number between zero and one. The data, however, is purely binary: either it is in the rejection region, or it's not. Within the context of a statistical test, there is no such thing as data that is "very significant" or "low significance" or "almost statistically significant".
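The seven steps can be sketched in code using the coin-flip prediction example from earlier in the thread. The specific choices here (8 flips, a rejection region of 7 or 8 correct, an observed score of 7) are assumptions made up for illustration.

```python
from math import comb

# Steps 1-2: parameter = chance of a correct prediction; null says it is 0.5.
p_null = 0.5
n_flips = 8

# Step 3: the statistic is the number of correct predictions out of 8.
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Step 4: pick the rejection region BEFORE seeing any data.
rejection_region = {7, 8}

# Step 5: probability, under the null, of landing in the rejection region.
significance = sum(binom_pmf(k, n_flips, p_null) for k in rejection_region)

# Step 6: run the experiment (hypothetical result).
observed_correct = 7

# Step 7: the outcome is binary: in the region or not.
reject_null = observed_correct in rejection_region

print(significance, reject_null)  # 0.03515625 True
```

Note that `significance` is fixed before `observed_correct` exists, which is the point being made about it describing the experiment rather than the data.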

Now, on to confidence intervals. Sometimes, the value of a parameter is estimated with a statistic. Since statistical tests involve randomness, this isn't the exact value. So statisticians come up with an interval where it might be. They then can calculate the probability, given that the parameter is correct, of getting an interval that includes it. Notice that it is often misinterpreted as the probability, given an interval, of getting a parameter in that interval, when in fact it's the opposite.
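A simulation makes the direction of that probability statement concrete: fix the true parameter, draw many samples, and count how often the computed interval contains it. This sketch assumes a normal-approximation interval for a polled share, and every number in it is hypothetical.

```python
import random
from math import sqrt

def coverage(true_share=0.6, n=500, polls=2000, z=1.96, seed=0):
    """Fraction of simulated polls whose 95% interval contains the true share."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(polls):
        yes = sum(rng.random() < true_share for _ in range(n))
        estimate = yes / n
        half_width = z * sqrt(estimate * (1 - estimate) / n)
        if estimate - half_width <= true_share <= estimate + half_width:
            covered += 1
    return covered / polls

observed_coverage = coverage()
print(observed_coverage)  # should land close to 0.95
```

The parameter never moves; it is the interval that varies from poll to poll, and about 95% of those intervals catch the fixed value.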

Suppose we know that a couple planned to have children until they had both a son and a daughter. They have 7 sons in a row, then a daughter. At a 95% confidence level, should we reject the hypothesis that they are equally likely to have sons or daughters?

"95% confidence level"? What does that mean? You're not asking a valid statistical question.

This is data mining, since the data comes before the calculation of alpha. It seems to me that your example is simply an example of misdirection. What the couple was planning to do has nothing to do with it; what matters is what statistic we use. Basically, what you're doing is deciding what statistic to use after the data has been collected, finding that the conclusion depends on which statistic is used, then declaring the results "nonsensical". To cover up your malfeasance, you bring in the red herring of what the couple was planning on doing, to make the choice of statistic seem nonarbitrary.

Here's your example made a bit more transparent. Suppose there's a class of 30 students, and I've labeled them from 1 to 30. If I tell you that students 1,4,5,8,10,11, and 12 are all boys, you would, according to your above logic, conclude that more than half the class is boys. If I tell you that students 2,3,6,7,9,13,14, and 17 are girls, then you would conclude that more than half is girls. And you are saying that there is something nonsensical about this, because two different sets of data resulted in two different conclusions.

But according to Bayes' theorem, no matter what prior probabilities you assign, your posterior probabilities will not depend on the knowledge that they were going for 8 kids or both a boy and a girl. They depend no less in the Bayesian system than they do in the standard system.

Therefore standard statistical methods lead to nonsensical results.

That is a complete non sequitur. You didn't present an example of standard statistical methods; you presented an example of ignoring basic statistical rules.

Merko
7th January 2007, 01:27 PM
really? every experiment? how does that work if am looking for a number, say if i am measuring the speed of light (before it was set equal to 1, of course)?


Well, let's call the currently accepted speed of light (in vacuum) c (i.e. if we find out it is wrong, we don't change c).

So we make an experiment, and come up with a measurement of 1.1 c for the speed of light. However, depending on how the experiment is set up, there might be alternative explanations for the measurement. Let's say our clock is not really accurate enough. In this case, we could set up a null hypothesis - the measurement is caused by an inaccuracy of the clock. At least in theory, we might even know the distribution of the clock error and we could assign a probability for the measurement occurring, given that the speed of light is actually c.

69dodge
7th January 2007, 03:07 PM
Therefore standard statistical methods lead to nonsensical results.

That is a complete non sequitur.

It's not a non-sequitur if one wants the technical notion of a statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false, and in particular, if one wants the level of significance of the rejecting test to correspond to the amount of evidence it provides against the truth of the hypothesis rejected.

Fisher certainly wanted this, even if Neyman and Pearson didn't. See chapter 4, "Some Misapprehensions about Tests of Significance," of his book Statistical Methods and Scientific Inference, where he rails against them about it.

Art Vandelay
7th January 2007, 04:38 PM
It's a non sequitur because it doesn't follow from the preceding. Ben Tilly didn't present an example of standard statistical methods.

As for what else you say,
"statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false"
I guess that as long as P(reject Ho|Ho)<P(reject Ho|Ha), there is such a correspondence. Of course, P(reject Ho|Ha) only is definable if we have a particular Ha in mind.

"one wants the level of significance of the rejecting test to correspond to the amount of evidence it provides against the truth of the hypothesis rejected"
Well, there are clearly more factors than just alpha. I don't think that there is anything "nonsensical" about this failing to hold. Would it be "nonsensical" for one car to get worse gas mileage than another, even though it is lighter? Perhaps indicative of inefficiencies, but hardly "nonsensical".

a_unique_person
7th January 2007, 07:07 PM
One thing I have wondered. If a 95% significance level is good enough, for example, does that mean that 1 in 20 tests will actually be wrong?

Art Vandelay
7th January 2007, 07:50 PM
Yes, by definition, if the null hypothesis is true, and the significance is 5%, then there is a 5% chance of being wrong.

There was a study once that purported to show that prayer helps people heal, and they had split it up into a bunch of subexperiments, testing different diseases to see whether prayer helps them. Well, if you test 20 diseases, you should expect one of them to be "helped" just by chance. 5% is often cited as the "standard" number, but it's rather weak. The idea of, for instance, having it as the significance level for the JREF challenge is rather ridiculous; if a hundred people applied, we'd expect 5 of them to walk away with a million dollars. Someone determined enough and rich enough can "prove" pretty much anything at 5%, by simply having a bunch of experiments and a bunch of different statistics. That's why when evaluating an experiment, you should look at whether the procedures, statistics, and rejection region are well documented prior to the beginning of the experiment, and whether the experimenter releases the results of all of the experiments, or just some.
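The arithmetic behind those figures, as a sketch that assumes every test is independent and every null hypothesis is true:

```python
alpha = 0.05

def expected_false_positives(tests, alpha=alpha):
    """Average number of tests that reject a true null by chance."""
    return tests * alpha

def p_any_false_positive(tests, alpha=alpha):
    """Chance that at least one of the independent tests rejects by chance."""
    return 1 - (1 - alpha) ** tests

print(expected_false_positives(20), round(p_any_false_positive(20), 3))
print(expected_false_positives(100), round(p_any_false_positive(100), 3))
```

With 20 subexperiments the expected count is one spurious "success", and the chance of getting at least one is roughly 64%. With a hundred applicants at 5%, about five would pass by luck.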

Ben Tilly
7th January 2007, 08:51 PM
It's a non sequitur because it doesn't follow from the preceding. Ben Tilly didn't present an example of standard statistical methods.

Actually I did, but you may not have understood that.

Standard statistical methods say that you do the following:

1. Set up an experiment, get a result.
2. Produce a null hypothesis.
3. Figure out the odds of getting the result you got from the experiment, or anything less likely. That is your confidence level. (This is the step you likely did not recognize because the experiments that I set up did not follow a distribution that you're used to using hypothesis testing on. However take the description to a statistician and they'll confirm that I followed the appropriate method.)
4. Make some decision based on the confidence level of your experiment.

So let's set up two different experiments. In experiment A, a couple decides to have children until they have both a son and a daughter. They have 7 sons in a row, and then one daughter. In experiment B, a couple decides to have 8 children. The first 7 are sons and the last is a daughter.

The null hypothesis in both cases is that sons and daughters are equally likely. However the different design of the experiments means that you calculate different probabilities. (They are different because in experiment A getting a daughter on the second try ends the experiment, while in experiment B getting a daughter on the second try and having the other 7 be sons is as unlikely as the observed outcome. So there are more combinations that are as unlikely as what was observed in experiment B than experiment A.) Therefore you make different choices under hypothesis testing.
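One way to carry out the calculation described here is sketched below. The choice of statistic for each design (family size for A, number of sons for B) and the "as likely or less likely" tail are assumptions about what is intended, and that choice is exactly what is being disputed in this thread.

```python
from fractions import Fraction
from math import comb

half = Fraction(1, 2)  # null hypothesis: sons and daughters equally likely

# Experiment B: exactly 8 children. Statistic: number of sons.
def p_sons(k, n=8):
    return comb(n, k) * half**n

observed_sons = 7
# Add up every count of sons that is as likely as, or less likely than, 7.
p_value_b = sum(p_sons(k) for k in range(9) if p_sons(k) <= p_sons(observed_sons))

# Experiment A: keep going until both sexes appear. Statistic: family size.
# A family of 8 or more happens only if the first 7 children are the same sex.
observed_size = 8
p_value_a = 2 * half ** (observed_size - 1)

print(p_value_a, float(p_value_a))  # 1/64  0.015625
print(p_value_b, float(p_value_b))  # 9/128 0.0703125
```

On this reading, the same eight births fall below 0.05 under design A and above it under design B.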

This result is problematic because Bayes' Theorem shows that no reasonable method of drawing inferences would give a different conclusion from experiment A than experiment B. Hypothesis testing does, therefore it is an unreasonable method of drawing inferences.

Cheers,
Ben

Art Vandelay
8th January 2007, 12:04 AM
Standard statistical methods say that you do the following:

1. Set up an experiment, get a result.
2. Produce a null hypothesis.
3. Figure out the odds of getting the result you got from the experiment, or anything less likely. That is your confidence level. (This is the step you likely did not recognize because the experiments that I set up did not follow a distribution that you're used to using hypothesis testing on. However take the description to a statistician and they'll confirm that I followed the appropriate method.)
4. Make some decision based on the confidence level of your experiment.

That's not the standard statistical method, as I said in my post. I consider myself a statistician, and I say that number three is wrong. And, at the risk of sounding conceited, I would consider anyone who disagrees to not be a statistician. "the odds of getting the result you got from the experiment, or anything less likely" is not a meaningful phrase. In the example that you gave, every result is equally likely. Seven boys, then a girl, is just as likely as three boys, then two girls, then two more boys.

Proper statistical method requires that you decide on a rejection region before any data is collected.

So let's set up two different experiments. In experiment A, a couple decides to have children until they have both a son and a daughter. They have 7 sons in a row, and then one daughter. In experiment B, a couple decides to have 8 children. The first 7 are sons and the last is a daughter.

As I said, what the couple decides is a red herring. All that matters is the statistic used.

This result is problematic because Bayes' Theorem shows that no reasonable method of drawing inferences would give a different conclusion from experiment A than experiment B.

The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it? Mathematical theorems make no statements about anything but mathematical concepts, therefore Bayes' Theorem cannot say anything about "reasonable" methods except insofar as you are redefining "reasonable" to be a mathematical concept.

Hypothesis testing does, therefore it is an unreasonable method of drawing inferences.

Ah. The reason that it is unreasonable is it gives different results, and you've decided that everything that gives different results is unreasonable.

Badly Shaved Monkey
8th January 2007, 06:17 AM
An implicit assumption has been made in all the answers in this thread that warrants being stated explicitly.

All of this testing and statistical significance relates to the behaviour of the average (mean or median depending on circumstances) value of a parameter for some group of objects. The point of testing is to determine whether there is truly a difference between the average values of two groups.

This is fine, provided the question you are asking can be properly answered by reference to the behaviour of group average values. But that is a very narrow view of the behaviour of data. The fact that it is useful in so many circumstances is because that narrow view often suffices for the situation.

Here is an example where the mere asking of a question that is answerable by reference to the behaviour of group averages means that you have analysed the situation wrongly.

I have been looking at the behaviour of our business bank balance to see whether there are identifiable patterns across the month that we could exploit to manage our account better. I pooled data for 36 months and sure enough, there is an obvious cycle through the month. Let's use notional figures for illustration: we start the month with a mean bank balance of £50,000 and there is a mid-month peak at £80,000. The s.e.m. around these values is quite tight, about £5,000, so we can confirm, at high probability, that this monthly cycle is real and not just a fluke. But I want to know when I can safely write big cheques to pay big bills. The problem is that the standard deviation is about £30,000, i.e. about 95% of the time the actual account balance on any given day is +/- £60,000 of that day's mean value. That's great if the day I write a £20,000 cheque the account is at £80,000 + £60,000, but if it is actually at £50,000 - £60,000 I am likely to receive an embarrassing phone call. So, in this instance, I have shown a statistically real behaviour but if I relied on it for my intended action I would find I had answered the wrong question.
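The gap between the two spreads comes straight from the formula s.e.m. = s.d. / root(n). A sketch with the notional figures above, using "mean +/- 2 spreads" as the rough 95% range:

```python
from math import sqrt

months = 36          # pooled months of data
sd_daily = 30_000    # spread of the actual balance on a given day of the month
start_mean = 50_000  # notional mean balance at the start of the month

# Standard error of the mean: how well the AVERAGE is pinned down.
sem = sd_daily / sqrt(months)

# Rough 95% ranges: mean +/- 2 spreads.
mean_range = (start_mean - 2 * sem, start_mean + 2 * sem)
single_day_range = (start_mean - 2 * sd_daily, start_mean + 2 * sd_daily)

print(sem)               # 5000.0
print(mean_range)        # where the long-run average probably sits
print(single_day_range)  # where the balance on one particular day might sit
```

The average is known to within about £10,000, but a single day's balance can plausibly be anywhere from overdrawn to more than double the mean, and the single day is what matters when writing the cheque.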

A related problem in medicine is similar to the above. There is an important distinction between statistical significance and biological/clinical significance. If I pool data from 1,000,000 patients, I might find that a certain drug really, genuinely, honestly does lower blood pressure, by an average of 0.1mmHg. This is not very likely to be clinically useful. This has been alluded to on the previous page when the idea of statistical power was introduced. It is important to decide what size of an effect would matter clinically or biologically then design the test to look for an effect of that size. Veterinary medicine is plagued by low-powered studies because of the difficulty of recruiting enough subjects to look at real medical conditions for useful lengths of time.
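A rough sketch of how sample size alone can push a trivial effect past the significance threshold. It assumes a two-group comparison with a normal approximation, and the patient-to-patient standard deviation of 15 mmHg is a hypothetical figure chosen only for illustration.

```python
from math import erfc, sqrt

def two_sided_p(mean_difference, sd, n_per_group):
    """Normal-approximation p-value for a difference between two group means."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(mean_difference) / standard_error
    return erfc(z / sqrt(2))

effect = 0.1  # mmHg, the tiny average drop from the example
sd = 15.0     # hypothetical spread between patients, in mmHg

p_huge_trial = two_sided_p(effect, sd, n_per_group=500_000)
p_small_trial = two_sided_p(effect, sd, n_per_group=50)

print(p_huge_trial)   # well below 0.05
print(p_small_trial)  # nowhere near significant
```

The effect is identical in both cases. Only the sample size changed, which is why significance says nothing about whether 0.1 mmHg matters clinically.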

69dodge
8th January 2007, 09:39 AM
Proper statistical method requires that you decide on a rejection region before any data is collected.

And what is the reason for this requirement?

There's no way to look at the results of an experiment directly, and see what they tell us about a hypothesis?

What an experiment tells us about a hypothesis depends not only on the actual results of the experiment, but also on some arbitrary decision we made beforehand about rejection regions?

The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it?

Before we can decide whether a method is reasonable or not, we need to decide what goal we want it to accomplish. Then we can say that it's reasonable if it accomplishes that goal, and unreasonable if it doesn't.

So what's the goal of a statistical significance test?

I think it's to help us decide whether a hypothesis is true or not. The decision to "reject the null hypothesis" should depend on, and only on, how much evidence there is that it is false.

So if two different experiments give us the same amount of evidence against the truth of a hypothesis, it makes no sense to reject the hypothesis in one case but not in the other.

Do you think that the results of Ben Tilly's experiments A and B give different amounts of evidence against the hypothesis of equal boy/girl probabilities? How could they? They're the same results!

The problem with significance tests based on p-values is that they take into account all sorts of experimental results that didn't happen (namely, all those in the predetermined rejection region). Where's the sense in that?

As Sir Harold Jeffreys wrote in Theory of Probability (third edition, pp. 384--385, emphasis in original):

[some discussion of the $\chi^2$ statistic and p-values based on it, then...]

If P was less than some standard value, say 0.05 or 0.01, the law was considered rejected. Now it is with regard to this use of P that I differ from all the present statistical schools, and detailed attention to what it means is needed. The fundamental idea, and one that I should naturally accept, is that a law should not be accepted on data that themselves show large departures from its predictions. But this requires a quantitative criterion of what is to be considered a large departure. The probability of getting the whole of an actual set of observations, given the law, is ridiculously small. Thus for frequencies 2.74 (6) shows that the probability of getting the observed numbers, in any order, decreases with the number of observations like $(2\pi N)^{-\frac{1}{2}(p-1)}$ for $\chi^2 = 0$ and like $(2\pi N e)^{-\frac{1}{2}(p-1)}$ for $\chi^2 = p - 1$, the latter being near the expected value of $\chi^2$. The probability of getting them in their actual order requires division by N!. If mere improbability of the observations, given the hypothesis, was the criterion, any hypothesis whatever would be rejected. Everybody rejects the conclusion, but that can only mean that improbability of the observations, given the hypothesis, is not the criterion, and some other must be provided. The principle of inverse probability does this at once, because it contains an adjustable factor common to all hypotheses, and the small factors in the likelihood simply combine with this and cancel when hypotheses are compared. But without it some other criterion is still necessary, or any alternative hypothesis would be immediately rejected also. Now the P integral does provide one. The constant small factor is rejected, for no apparent reason when inverse probability is not used, and the probability of the observations is replaced by that of $\chi^2$ alone, one particular function of them.
Then the probability of getting the same or a larger value of $\chi^2$ by accident, given the hypothesis, is computed by integration to give P. If $\chi^2$ is equal to its expectation supposing the hypothesis true, P is about 0.5. If $\chi^2$ exceeds its expectation substantially, we can say that the value would have been unlikely to occur had the law been true, and shall naturally suspect that the law is false. So much is clear enough. If P is small, that means that there have been unexpectedly large departures from prediction. But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure. On the face of it the fact that such results have not occurred might more reasonably be taken as evidence for the law, not against it. The same applies to all the current significance tests based on P integrals. [footnote: On the other hand, Yates (J.R. Stat. Soc., Suppl. 1, 1934, 217--35) recommends, in testing whether a small frequency $n_r$ is consistent with expectation, that $\chi^2$ should be calculated as if this frequency was $n_r + \frac{1}{2}$ instead of $n_r$, and thereby makes the actual value contribute largely to P. This is also recommended by Fisher (Statistical Methods, p. 98). It only remains for them to agree that nothing but the actual value is relevant.]

69dodge
8th January 2007, 10:28 AM
As for what else you say,
"statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false"
I guess that as long as P(reject Ho|Ho)<P(reject Ho|Ha), there is such a correspondence. Of course, P(reject Ho|Ha) only is definable if we have a particular Ha in mind.

An experiment has three possible outcomes: A, B, and C. On hypothesis H0, their probabilities are 0.02, 0.02, 0.96. On hypothesis Ha, their probabilities are 0.01, 0.04, 0.95.

I choose a rejection region of {A, B}, whose probability on H0 is 0.04, which is less than 0.05, its probability on Ha.

I run the experiment and the outcome is A, which is in the rejection region. Does this result therefore constitute evidence against H0 and in favor of Ha? Or the opposite?

The opposite, obviously.

Why should I care about the probability of possible outcomes that happen to be in the rejection region, if they didn't actually occur? And if I don't care about them, why bother picking a rejection region to begin with?
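
The numbers in this example can be checked with a small Python sketch. The dictionary and function names are made up for illustration; only the six probabilities and the rejection region come from the example itself.

```python
from fractions import Fraction as F

# Outcome probabilities under each hypothesis, as stated in the example.
h0 = {"A": F(2, 100), "B": F(2, 100), "C": F(96, 100)}
ha = {"A": F(1, 100), "B": F(4, 100), "C": F(95, 100)}
rejection_region = {"A", "B"}

# Probability of landing anywhere in the rejection region.
p_reject_h0 = sum(h0[o] for o in rejection_region)
p_reject_ha = sum(ha[o] for o in rejection_region)
print(p_reject_h0, p_reject_ha)  # 1/25 1/20


def likelihood_ratio(outcome):
    """P(outcome | H0) / P(outcome | Ha); above 1 favors H0, below 1 favors Ha."""
    return h0[outcome] / ha[outcome]


for outcome in ("A", "B", "C"):
    print(outcome, likelihood_ratio(outcome))
```

The region as a whole is more probable under Ha, yet the single outcome A is twice as probable under H0, which is the tension the example is built to expose.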

drkitten
8th January 2007, 10:51 AM
An experiment has three possible outcomes: A, B, and C. On hypothesis H0, their probabilities are 0.02, 0.02, 0.96. On hypothesis Ha, their probabilities are 0.01, 0.04, 0.95.

I choose a rejection region of {A, B}, whose probability on H0 is 0.04, which is less than 0.05, its probability on Ha.

I run the experiment and the outcome is A, which is in the rejection region. Does this result therefore constitute evidence against H0 and in favor of Ha? Or the opposite?

The opposite, obviously.

I'm sorry, I'm perhaps not following this properly. But it seems that your experiment as proposed offers next door to no information at all -- and to the extent that it offers information, offers information in favor of H0.

So, basically, you ran the wrong experiment.

Why should your poor choice of experiments be an argument for or against a statistical theory?

Ben Tilly
8th January 2007, 01:33 PM
That's not the standard statistical method, as I said in my post. I consider myself a statistician, and I say that number three is wrong. And, at the risk of sounding conceited, I would consider anyone who disagrees to not be a statistician. "the odds of getting the result you got from the experiment, or anything less likely" is not a meaningful phrase. In the example that you gave, every result is equally likely. Seven boys, then a girl, is just as likely as three boys, then two girls, then two more boys.

Let's get the ad hominems out of the way first, shall we?

This specific problem is one I first heard about from Dr. Laurie Snell. http://www.dartmouth.edu/~chance/jlsnell.html I have discussed it since with a number of people, including several statisticians who were tenured professors at different universities. I have no idea what your bona fides are to back up your self-identification as a statistician, but if you claim that anyone who disagrees is not a statistician, then you've made a claim that is very much on the outrageous side.

Now let's turn to actual matters of substance.

Proper statistical method requires that you decide on a rejection region before any data is collected.

The rejection region will depend on the set of results one might possibly observe, which in turn depends on the experimental design.

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on. The cumulative probability of being in this set is readily calculated to be 1/64.

In experiment B the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons and a girl (in any order), 7 girls and a son (in any order), 8 sons, or 8 girls. The cumulative probability of being in this set is readily calculated to be 9/128.

At a 95% confidence level the observed outcome is in the rejection set for experiment A but not for experiment B.
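
As a rough check of the two cumulative probabilities, here is a minimal Python sketch. The function names are invented for illustration, and it assumes the null hypothesis is that each child is a boy or a girl with probability 1/2, independently.

```python
from fractions import Fraction
from math import comb

HALF = Fraction(1, 2)


def p_value_until_one_of_each(run_length):
    """Experiment A: the couple stops once they have both a son and a daughter.

    Returns P(the first `run_length` children all share a sex) under the null,
    which covers 7-then-1, 8-then-1, 9-then-1, and so on, for either sex.
    """
    return 2 * HALF ** run_length


def p_value_fixed_family(n_children, majority):
    """Experiment B: exactly n_children births, whatever the sexes.

    Returns P(at least `majority` children share a sex) under the null.
    Assumes majority > n_children / 2 so the two tails do not overlap.
    """
    one_tail = sum(comb(n_children, k) for k in range(majority, n_children + 1))
    return 2 * one_tail * HALF ** n_children


p_a = p_value_until_one_of_each(7)
p_b = p_value_fixed_family(8, 7)
print(p_a, p_b)  # 1/64 9/128
print(p_a < Fraction(1, 20), p_b < Fraction(1, 20))  # True False
```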

As I said, what the couple decides is a red herring. All that matters is the statistic used.

What the couple decides affects what the set of possible outcomes are, and therefore affects the odds that one might have gotten an outcome that would be taken for evidence that is as strong or stronger evidence against the null hypothesis than what was observed.

Which therefore affects the results of hypothesis testing.

The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it? Mathematical theorems make no statements about anything but mathematical concepts, therefore Bayes' Theorem cannot say anything about "reasonable" methods except insofar as you are redefining "reasonable" to be a mathematical concept.

Ah. The reason that it is unreasonable is it gives different results, and you've decided that everything that gives different results is unreasonable.

This is true. Now let me defend the view that anything that gives different results is unreasonable.

According to Bayes' Theorem, under no prior set of beliefs should the difference in design of the experiments make any difference in your conclusions. If one takes the view that reasonable people start with a set of prior beliefs which they then continuously modify in the light of experience, then no reasonable person can ever draw the distinction between these two cases that hypothesis testing does.

Of course if you do not believe that reasonable people should have beliefs and modify those beliefs in the face of experience in a logical fashion, then you may not think that the results of hypothesis testing are unreasonable.

Cheers,
Ben

drkitten
8th January 2007, 01:48 PM
The rejection region will depend on the set of results one might possibly observe, which in turn depends on the experimental design.

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on. The cumulative probability of being in this set is readily calculated to be 1/64.

In experiment B the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons and a girl (in any order), 7 girls and a son (in any order), 8 sons, or 8 girls. The cumulative probability of being in this set is readily calculated to be 9/128.

In particular, if I understand the example right, the outcome probability space is different for different experiments.


For example, the probability of the couple having a single boy and a single girl is 0.25 for experiment A, the case where the parents just want one of each. The corresponding probability is zero for experiment B, where they just want eight kids, irrespective of sex. Similarly, the probability of three boys and five girls is zero for experiment A, non-zero (I'm too lazy to figure it exactly) for experiment B.

Given that the underlying probability mass is different, the fact that the probability mass in the rejection region defined by the same words differs can hardly be considered to be a fault of the statistics.

Ben Tilly
8th January 2007, 04:19 PM
[...]

So what's the goal of a statistical significance test?

I think it's to help us decide whether a hypothesis is true or not. The decision to "reject the null hypothesis" should depend on, and only on, how much evidence there is that it is false.

So if two different experiments give us the same amount of evidence against the truth of a hypothesis, it makes no sense to reject the hypothesis in one case but not in the other.

Do you think that the results of Ben Tilly's experiments A and B give different amounts of evidence against the hypothesis of equal boy/girl probabilities? How could they? They're the same results!

The problem with significance tests based on p-values is that they take into account all sorts of experimental results that didn't happen (namely, all those in the predetermined rejection region). Where's the sense in that?

This is the key point. Experiments A and B differ only in what didn't happen but could have. Hypothesis testing takes those possibilities into account so you come to different conclusions. However it seems absurd that your conclusion about what is true is based on what didn't happen. Bayes' Theorem allows us to quantify the reason why our intuition says that this is absurd. Therefore hypothesis testing leads to absurd distinctions being made.

Allow me to add more variations.

Experiment C is like experiment A except that the couple agreed to have children until they had a girl. Now the p-value drops to 1/128.

Experiment D is like experiment A except that the couple decided to flip a coin after each child to decide whether to stop the experiment. Now the p-value drops to 1/8192! (They would have been at the old p-value of 1/64 after 3 sons and a daughter!) This is a drastic change in the strength of our conclusion, yet the extra coin flips gave us absolutely no information about the likelihood of sons versus daughters!

And so it goes. Things that should be irrelevant matter greatly in hypothesis testing. That they do is integral to the procedure.

Cheers,
Ben

Art Vandelay
8th January 2007, 04:24 PM
All of this testing and statistical significance relates to the behaviour of the average (mean or median depending on circumstances) value of a parameter for some group of objects.

You're alluding to an important point, but you don't have it quite correct. The average of the population is a parameter. It makes no sense to speak of the "average value" of a parameter; a parameter has only one value. Statistical tests compare one parameter to another. Usually, that parameter is the average of the population, but sometimes other parameters, such as the standard deviation, are considered. And, of course, the parameter is a simplified measure. In the example you gave, you talked about a difference of 0.1mmHg, and said that it might not be clinically significant. Well, that's not quite the point. More to the point, the average blood pressure and the average utility may not be the same. For instance, suppose that a bp of 200 means a 50% chance of dying in the next year, while a bp of 180 means a 30% chance of dying in the next year, and a bp of 150 means a 20% chance of dying in the next year. And suppose, magically, everyone's bp is exactly equal to one of those three values. Now suppose that for drug A, the distribution is as follows: 50% 200, 10% 180, 40% 150. Average = 178. For drug B, it's 10% 200, 80% 180, 10% 150. Average = 179. Since drug A reduces the average bp more, it's slightly better, right? But if you calculate the death rates, drug A has a death rate of 36%, while drug B has a death rate of 31%. Death rates are more important than bp, but it's a lot easier to test bp. And even if we did try to measure death rates, then there are more factors to consider, such as whether a 10% chance of death and 90% chance of perfect health is better than a 1% chance of death and 99% chance of very poor health.
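
The arithmetic in the drug example can be reproduced with a few lines of Python. This is only a sketch: the variable and function names are invented, and all shares and risks are kept as whole percentages so the results come out exact.

```python
# One-year death risk (percent) at each blood pressure, from the example.
death_risk_pct = {200: 50, 180: 30, 150: 20}

# Share of patients (percent) ending up at each blood pressure under each drug.
drug_a = {200: 50, 180: 10, 150: 40}
drug_b = {200: 10, 180: 80, 150: 10}


def mean_bp(distribution):
    """Average blood pressure across the patient population."""
    return sum(bp * share for bp, share in distribution.items()) / 100


def death_rate_pct(distribution):
    """Overall one-year death rate, in percent."""
    return sum(death_risk_pct[bp] * share for bp, share in distribution.items()) / 100


print("A", mean_bp(drug_a), death_rate_pct(drug_a))  # A 178.0 36.0
print("B", mean_bp(drug_b), death_rate_pct(drug_b))  # B 179.0 31.0
```

The drug with the lower mean blood pressure has the higher death rate because the risk is not a linear function of blood pressure, so the mean alone hides what matters.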

Here is an example where the mere asking of a question that is answerable by reference to the behaviour of group averages means that you have analysed the situation wrongly.

More precisely, it's an example where the mean value isn't as important as some other measure, such as the percentage of balances above £20,000.

drkitten
8th January 2007, 04:25 PM
However it seems absurd that your conclusion about what is true is based on what didn't happen.

Huh? That makes no sense to me whatsoever.

"I lit the fuse, but the firecracker didn't explode. Therefore it must have been a dud."

"That's absurd!"

"What do you mean, that's absurd?"

"Well, how do you know that a leprechaun didn't come out and pee on the fuse while your back was turned?"

".... um,.... what?"

Ben Tilly
8th January 2007, 05:01 PM
In particular, if I understand the example right, the outcome probability space is different for different experiments.

It sounds like you understand the example right.

For example, the probability of the couple having a single boy and a single girl is 0.25 for experiment A, the case where the parents just want one of each. The corresponding probability is zero for experiment B, where they just want eight kids, irrespective of sex. Similarly, the probability of three boys and five girls is zero for experiment A, non-zero (I'm too lazy to figure it exactly) for experiment B.

The odds you were too lazy to figure out are 7/32.

Given that the underlying probability mass is different, the fact that the probability mass in the rejection region defined by the same words differs can hardly be considered to be a fault of the statistics.

The statistics calculates exactly what it said it would calculate. I did not mean to imply fault in the calculation of the statistics.

The problem lies in how people interpret and act on those statistics. We decide whether or not to reject a null hypothesis. We will make decisions and carry out actions differently after these two experiments. Is it reasonable to do so?

Well let's take the most reasonable of all possible procedures for drawing an inference. And that is to use Bayes' Theorem. Suppose, for instance, that the experimenter starts with the following prior expectations:

50% chance of the couple having boys vs girls be 50-50.
20% chance of having the odds be 55-45.
20% chance of 45-55.
5% chance of 100-0.
5% chance of 0-100.

What should the expectations be after observing either experiment A or B? Well it turns out to be the same, you just apply Bayes' Theorem. Under option 1 the odds of the observed outcome is 0.00390625, so the odds of option 1 and the observed outcome is 0.001953125. Option 2 gives 0.0068509585546875 for odds, so that plus the outcome is 0.0013701917109375. Option 3 gives 0.0020551819921875 so that plus the outcome is 0.0004110363984375. Options 4 and 5 say the result was impossible.

So let's crank it into Bayes' formula. According to the experimenter's expectations, the probability of the outcome was 0.003734353109375 so our revised expectations are 52.3% for option 1, 36.7% for option 2, 11% for option 3 and 0% for options 4 and 5.

This change in expectations is true whether the experiment that was run is version A or B.
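
For anyone who wants to rerun the update, here is a minimal Python sketch. The names are invented for illustration. The first likelihood is the probability of seven sons followed by a daughter; the second, which multiplies in the number of possible orderings, is an added assumption meant to show that a factor common to every hypothesis drops out of the posterior.

```python
from math import comb

# (label, P(boy), prior probability) for each option in the list above.
options = [
    ("50-50", 0.50, 0.50),
    ("55-45", 0.55, 0.20),
    ("45-55", 0.45, 0.20),
    ("100-0", 1.00, 0.05),
    ("0-100", 0.00, 0.05),
]


def seven_sons_then_daughter(p_boy):
    """Probability of the observed birth sequence."""
    return p_boy ** 7 * (1 - p_boy)


def seven_sons_one_daughter_any_order(p_boy):
    """Same data summarized as a count; differs only by a constant factor."""
    return comb(8, 7) * p_boy ** 7 * (1 - p_boy)


def posterior(likelihood):
    """Bayes' Theorem: prior times likelihood, renormalized to sum to 1."""
    joint = {label: prior * likelihood(p) for label, p, prior in options}
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


by_sequence = posterior(seven_sons_then_daughter)
by_count = posterior(seven_sons_one_daughter_any_order)
for label in by_sequence:
    print(label, round(by_sequence[label], 3), round(by_count[label], 3))
```

Both columns should agree, and should match the 52.3%, 36.7%, 11% and 0% figures quoted above.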

In short, by the most reasonable method we can find for adjusting our expectations in the light of further evidence, the differences in experimental design are absolutely and completely irrelevant. In fact it isn't hard to prove that, no matter what set of prior expectations the experimenter had, the design difference will be irrelevant.

So when we take the step of using the results of hypothesis testing to draw an inference and make a decision, we are making our decisions in a way that is not consistent with any set of possible prior expectations. And we are doing so because (as 69dodge pointed out) we are explicitly taking into account in our decision the likelihood of things that didn't happen. (Note that Bayes' formula completely ignores the might have beens that didn't happen - they can't matter to it.)

Cheers,
Ben

PS Note that I am not arguing for throwing out hypothesis testing. As I said before, it gives simple-to-interpret results when the alternatives either produce nothing or give very complex answers. While acting according to what hypothesis testing tells you can lead to some absurd choices, most of the time it leads to fairly reasonable decisions.

Ben Tilly
8th January 2007, 05:07 PM
One thing I have wondered. If 95% significance level is good enough, for example, does that mean that 1 in 20 tests will actually be wrong?

The short answer is, "No."

The medium answer is, "That statement shows the confusion that most people have about hypothesis testing."

The long answer is, "1 in 20 times when the null hypothesis is true, we will incorrectly reject it. However we have no idea how often the null hypothesis is true, and without knowing the correct hypothesis we have no way to determine how often we do not correctly decide to reject the null hypothesis."

The nutshell is that hypothesis testing only concerns itself with limiting the odds of making one type of error (incorrectly rejecting the null hypothesis) and says absolutely nothing useful about the true odds of any hypothesis. It is very often misinterpreted as doing so, and that is always a mistake in someone's understanding.
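
To make the distinction concrete, here is a small Python sketch. Everything in it is a made-up illustration: a 100-flip coin test, an alternative in which the coin lands heads 60% of the time, and three guesses for how often the null hypothesis is true. It computes how often the test rejects when the null is true, and then what share of all rejections are mistaken.

```python
from math import comb


def binom_pmf(n, k, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def reject_prob(n, c, p):
    """P(|K - n/2| >= c), i.e. the chance of landing in the rejection region."""
    return sum(binom_pmf(n, k, p) for k in range(n + 1) if abs(2 * k - n) >= 2 * c)


def critical_distance(n, alpha):
    """Smallest c whose rejection region has probability <= alpha for a fair coin."""
    c = 0
    while reject_prob(n, c, 0.5) > alpha:
        c += 1
    return c


def wrong_share_of_rejections(share_null_true, size, power):
    """Of all rejections, the fraction made when the null was actually true."""
    wrong = share_null_true * size
    right = (1 - share_null_true) * power
    return wrong / (wrong + right)


n, alpha = 100, 0.05
c = critical_distance(n, alpha)
size = reject_prob(n, c, 0.5)    # chance of rejecting a true null (at most alpha)
power = reject_prob(n, c, 0.6)   # chance of rejecting under the assumed alternative
for share_null_true in (0.1, 0.5, 0.9):
    print(share_null_true, round(wrong_share_of_rejections(share_null_true, size, power), 3))
```

The rejection rate under a true null stays at or below 5% throughout, while the share of rejections that are wrong moves with how often the null is true, which the test itself cannot tell us.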

Cheers,
Ben

Ben Tilly
8th January 2007, 05:32 PM
Huh? That makes no sense to me whatsoever.

Why is it absurd? It is a fact that, no matter what beliefs you have about the world, possibilities that didn't happen shouldn't affect how you change your beliefs. (Your beliefs about the odds of that not happening might matter, but the things that didn't happen don't.)

Mathematically it can't. Arguing that it should is like arguing that your bank account is going down because there is dust blowing on Mars. (Actually it is worse than that because there is a logical possibility that you will lose money from a bet about whether dust is blowing on Mars.) You are caring about what is irrelevant.

"I lit the fuse, but the firecracker didn't explode. Therefore it must have been a dud."

"That's absurd!"

"What do you mean, that's absurd?"

"Well, how do you know that a leprechaun didn't come out and pee on the fuse while your back was turned?"

".... um,.... what?"

I don't see how you think this analogy relates to the discussion. Unless you are trying to support my point. (Which is that possibilities that didn't happen, like the leprechaun peeing on the fuse, are totally irrelevant.)

Cheers,
Ben

drkitten
8th January 2007, 05:32 PM
The problem lies in how people interpret and act on those statistics. We decide whether or not to reject a null hypothesis. We will make decisions and carry out actions differently after these two experiments. Is it reasonable to do so?

Er --- yes, it is? Different questions and backgrounds provoke different experimental designs, which in turn generate different actions. There's an implicit "duh" in there somewhere, I think.



Well let's take the most reasonable of all possible procedures for drawing an inference. And that is to use Bayes' Theorem. Suppose, for instance, that the experimenter starts with the following prior expectations:

50% chance of the couple having boys vs girls be 50-50.
20% chance of having the odds be 55-45.
20% chance of 45-55.
5% chance of 100-0.
5% chance of 0-100.

What should the expectations be after observing either experiment A or B? Well it turns out to be the same, you just apply Bayes' Theorem.

But why on Earth should the experimenter start out with that particular set of prior expectations?

The problem with Bayesian analysis is that it just pushes the assumptions back one more level, and furthermore, it specifically ignores information (such as the stated intentions of the couple).


This change in expectations is true whether the experiment that was run is version A or B.

In short, by the most reasonable method we can find for adjusting our expectations in the light of further evidence, the differences in experimental design are absolutely and completely irrelevant.

I think I still fail to see why this is a Good Thing.

drkitten
8th January 2007, 05:35 PM
Why is it absurd? It is a fact that, no matter what beliefs you have about the world, possibilities that didn't happen shouldn't affect how you change your beliefs. (Your beliefs about the odds of that not happening might matter, but the things that didn't happen don't.)

This is gibberish. You say that something not happening shouldn't affect how I change my beliefs -- but my beliefs about the odds of something not happening are, by definition, part of my beliefs, and will be changed as a result of something not happening.

If I think such-and-such is a dead cert, and it doesn't happen, I'm certainly changing my belief!

Robin
8th January 2007, 06:01 PM
OK, what about an example. Say I am going for the JREF Million and my claim is that if a person is in an isolated room looking at a series of randomly selected pictures on a computer screen and I am in another isolated room looking at four pictures, one of which is the image being viewed by the sender, then I will click on the correct image 35% of the time.

What is the design of the experiment, what is the value of p and the size of the sample that would get me the million?
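
One way to start exploring this question is a one-sided exact binomial test of "25% hit rate" (pure guessing among four pictures) against the claimed 35%. The Python sketch below is only that, a sketch: the function names, the significance level of 0.001 and the trial counts are all assumptions picked for illustration, not anything specified by the challenge.

```python
from math import comb


def upper_tail(n, k_min, p):
    """P(K >= k_min) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def critical_hits(n, p_null, alpha):
    """Fewest hits whose chance under pure guessing is at most alpha."""
    for k in range(n + 2):
        if upper_tail(n, k, p_null) <= alpha:
            return k


def power(n, p_null, p_claim, alpha):
    """Chance of reaching the critical number of hits if the claim is true."""
    return upper_tail(n, critical_hits(n, p_null, alpha), p_claim)


P_NULL, P_CLAIM, ALPHA = 0.25, 0.35, 0.001   # ALPHA is a hypothetical choice
for n in (100, 200, 400, 800):
    k = critical_hits(n, P_NULL, ALPHA)
    print(n, k, round(power(n, P_NULL, P_CLAIM, ALPHA), 3))
```

Each printed row gives the number of trials, the number of hits needed to pass, and the chance that someone with a genuine 35% hit rate would pass; the sample size question then becomes how high that last figure needs to be.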

Art Vandelay
8th January 2007, 06:50 PM
And what is the reason for this requirement?

So that a proper statistical statement can be made. Otherwise, you're just data mining.

There's no way to look at the results of an experiment directly, and see what they tell us about a hypothesis?

There's no way to make a proper statistical statement about the result, without setting up a statistical test beforehand. One can make all sorts of conclusions, such as "That looks very convincing" or "I don't think I'll be eating plutonium after seeing what it did to that guy", etc. One just can't make a statistical statement.

What an experiment tells us about a hypothesis depends not only on the actual results of the experiment, but also on some arbitrary decision we made beforehand about rejection regions?

What statistical decisions we make depend on the decisions beforehand. And the decisions are arbitrary regardless of when we make them; making them before collecting the data ensures that we aren't creating them ad hoc. Think of an archer hitting a target: we make decisions about how good the archer is based not only on where the arrow goes, but on arbitrary decisions made beforehand, such as where to put the target. If we put a target on an elm tree, and an archer hits the elm tree, and then we move the target to an oak tree, and another archer hits the elm tree again, we conclude that the first archer is better than the second, even though their arrows went to the exact same place. If we don't have any targets, and we try to engage in reasoning like "well, the elm tree is smaller, so the first archer is better", that's rather fallacious.

Before we can decide whether a method is reasonable or not, we need to decide what goal we want it to accomplish. Then we can say that it's reasonable if it accomplishes that goal, and unreasonable if it doesn't.

Well, I think that "efficient" is better than "reasonable" to express the concept you're talking about.

I think it's to help us decide whether a hypothesis is true or not. The decision to "reject the null hypothesis" should depend on, and only on, how much evidence there is that it is false.

I disagree. Also, you need to define "how much evidence".

So if two different experiments give us the same amount of evidence against the truth of a hypothesis, it makes no sense to reject the hypothesis in one case but not in the other.

Sure it does. If you think that some procedure is inefficient, you should reject it before the experiment. You can't look at the procedure after the experiment, declare that it's inefficient, and say that you're therefore going to use some other.

Do you think that the results of Ben Tilly's experiments A and B give different amounts of evidence against the hypothesis of equal boy/girl probabilities?

Yes.

How could they? They're the same results!

No, they're not.

Consider this: there's a game show called "Deal or No Deal". There are 26 briefcases. A contestant chooses a briefcase, then opens the rest, one by one, until they either accept a deal to sell their briefcase, or there are only two briefcases left (at which point they are offered the option to switch). One of the briefcases contains a million dollars. If all of the contestants hold out until there are two briefcases, then 1/13 will still have the million dollar briefcase in play, and 1/26 will have the million dollar suitcase. So there is no benefit to switching.

Now suppose there were some variant of the game where instead of the contestant choosing which suitcases to open, the host opens suitcases, and always opens suitcases that don't have a million dollars. Now, in 100% of the cases, the million dollars will still be in play, but the contestant will have the million dollars only 1/26 of the time. So now the contestant should switch.

So if one player plays the first game, and ends up with the million dollars in play, and another plays the second game, and also ends up with the million dollars in play, are those the same result? Do they both give the same amount of evidence against the null hypothesis of "I don't have the million dollars"? Should both players come to the same conclusion?
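
The fractions for the two versions of the game can be worked out exactly. The Python sketch below uses invented names and assumes that, in the first game, the contestant's choices of which cases to open are effectively random with respect to where the million is.

```python
from fractions import Fraction


def final_two(n_cases, host_avoids_million):
    """Return (P(million still in play), P(contestant holds it | still in play)).

    The contestant keeps one case and the other n_cases - 1 are opened down to
    a single remaining case.
    """
    p_own = Fraction(1, n_cases)          # million is in the contestant's case
    p_elsewhere = 1 - p_own               # million is among the other cases
    if host_avoids_million:
        survives_elsewhere = Fraction(1)  # host never opens the million
    else:
        # Blind openings: the million must be the one other case left unopened.
        survives_elsewhere = Fraction(1, n_cases - 1)
    in_play = p_own + p_elsewhere * survives_elsewhere
    return in_play, p_own / in_play


print(final_two(26, host_avoids_million=False))
print(final_two(26, host_avoids_million=True))
```

In the first game the million stays in play 1/13 of the time and is then equally likely to be in either case; in the second it is always in play but sits in the contestant's case only 1/26 of the time.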

The problem with significance tests based on p-values is that they take into account all sorts of experimental results that didn't happen (namely, all those in the predetermined rejection region). Where's the sense in that?

Well, of course they do. How could they not? If you only look at what happened, then whatever happened happened, so every experiment will have the exact same result (whatever happened, happened).

For an experiment to be worth anything, there must be at least two sets of results, with different conclusions for each. So every conclusion must be based on the fact that it's not in the other set.

As Sir Harold Jeffreys wrote in Theory of Probability (third edition, pp. 384--385, emphasis in original):

But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.

Well, that's based on the conception of the test as having the calculations done after the data is collected, and gives more support to the rule that they should be done before data is collected. With the latter conception, there is a rejection region determined beforehand, and all one needs to do is check whether the result is in that region; p-values are not needed. Now, it is possible to set up an experiment such that one can use p-values, but I don't recommend it. Addressing Jeffreys' argument further, he focuses on the things that didn't happen and are "greater departures". But there are also the things that didn't happen and are "lesser departures". The issue is not just how large the former is, but what the ratio of the former to the latter is. Both are "things that didn't happen", but the latter is stuff that "might have happened, just as likely", and therefore isn't included in the p-value.

An experiment has three possible outcomes: A, B, and C. On hypothesis H0, their probabilities are 0.02, 0.02, 0.96. On hypothesis Ha, their probabilities are 0.01, 0.04, 0.95.

I choose a rejection region of {A, B}, whose probability on H0 is 0.04, which is less than 0.05, its probability on Ha.

I run the experiment and the outcome is A, which is in the rejection region. Does this result therefore constitute evidence against H0 and in favor of Ha?

Yes. Your pulling A out of the rejection region is just data mining. You're constructing a new set around data that you already have.

I can play this game, too. Suppose that we split A into two further subsets, A1 and A2. Under Ho, the probabilities are .019 and .001; under Ha, they are .001 and .009. And suppose the result is in A2. Now, if we view it as in A2, it supports Ha. If we consider it as being in A, then it supports Ho. If we see it as being in {A,B}, then it supports Ha again. You can't just pick and choose which set you want to consider it to be a member of.

The opposite, obviously.

Why?

Why should I care about the probability of possible outcomes that happen to be in the rejection region, if they didn't actually occur?

Because points don't have measure, sets do. To have a probability, you need a set. And you need to decide what set you're using beforehand; otherwise you're engaging in data mining. So they aren't "possible outcomes that... didn't actually occur"; the rejection region must be considered as a whole.

Ben Tilly
8th January 2007, 07:04 PM
Er --- yes, it is? Different questions and backgrounds provoke different experimental designs, which in turn generate different actions. There's an implicit "duh" in there somewhere, I think.

Different experimental designs testing different questions should generate different reactions.

However here we are talking about identical data produced by an identical physical process which will be interpreted differently by the same person. It isn't obvious that there should be a difference in that case. (And indeed the point of the argument is that there should not be one.)

But why on Earth should the experimenter start out with that particular set of prior expectations?

I was using that as an example.

While there is no reason that an experimenter would start with that particular set of expectations, the experimenter probably starts with some set of expectations. And no matter what set of expectations that experimenter has, experiments A and B should not lead to a different posterior conclusion.

The problem with Bayesian analysis is that it just pushes the assumptions back one more level, and furthermore, it specifically ignores information (such as the stated intentions of the couple).

Not guilty in either regard.

The complexity with Bayesian analysis is that it explicitly avoids making an assumption. It even avoids making it possible to accidentally make an unwarranted assumption by misunderstanding a critical term (like "confidence"). As for "ignored" information, that information is not ignored. It is merely provably irrelevant.

I think I still fail to see why this is a Good Thing.

I assume that you have beliefs about the world. I assume that as you encounter more information, you update your beliefs about the world. Well if you're able to quantify your beliefs about the world, then Bayes' formula describes how you absolutely, logically should update your beliefs in light of further information. If you update your beliefs in any other way then you are being illogical.

If you don't understand why this is the case, then review the explanation of Bayes' Theorem either online or in the probability book of your choice.

Of course in practice people don't do this because it is too complex for us to do on the fly. We are, after all, illogical creatures. However Bayes' Theorem is the ideal method of drawing inference. As we've known for over 2 centuries, there is no other method that has nearly as strong a logical foundation as this one.

So if you're using a method of inference that gives results that are nonsense according to Bayes' Theorem, then that is a fact that should make you sit up and pay attention.

Ben

Ben Tilly
8th January 2007, 07:15 PM
This is gibberish. You say that something not happening shouldn't affect how I change my beliefs -- but my beliefs about the odds of something not happening are, by definition, part of my beliefs, and will be changed as a result of something not happening.

If I think such-and-such is a dead cert, and it doesn't happen, I'm certainly changing my belief!

I admit it was poorly phrased.

What I meant is that your prior beliefs about things happening or not happening factor into your updated beliefs after you observe what you observe. But your beliefs should change solely based on what you did observe, and not based on what you didn't observe.

In other words if you see a cat I shouldn't be able to change your beliefs about what you are seeing there by saying, "That isn't a dog." That simply isn't relevant, it is a cat and that is that.

If this still doesn't make sense, then we should drop this subthread and accept that this was a confusing phrasing on my part which didn't convey anything useful.

Cheers,
Ben

69dodge
8th January 2007, 07:31 PM
I'm sorry, I'm perhaps not following this properly. But it seems that your experiment as proposed offers next door to no information at all -- and to the extent that it offers information, offers information in favor of H0.

So, basically, you ran the wrong experiment.

Why should your poor choice of experiments be an argument for or against a statistical theory?

It's definitely not a great experiment. The most probable result is C, which is about as likely on either hypothesis, and so would tell us very little. But a good statistical theory should let us extract as much information as possible from whatever result happens in whatever experiment we do. If the result of my poorly designed experiment should happen to be A, as unlikely as that result was a priori, there's no reason to ignore what it can tell us. Since A was twice as likely on one hypothesis as on the other, its occurrence provides evidence in favor of the one and against the other.

It doesn't matter how likely B or C were on either hypothesis, because, although they were more likely a priori, it turns out that they didn't happen. What happened was A, so its likelihood on the two hypotheses is all that matters.
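To put rough numbers on that, here is a minimal Python sketch. The outcome probabilities and the even prior odds are made up for illustration; only the 2:1 ratio for A comes from the discussion above. Changing how the remaining probability is split between B and C leaves the updated odds where they were.

```python
def posterior_odds(prior_odds, p_obs_h1, p_obs_h0):
    """Odds of H1 against H0 after one observed outcome (Bayes' rule in odds form)."""
    return prior_odds * (p_obs_h1 / p_obs_h0)

# Hypothetical outcome probabilities under each hypothesis.
h0 = {"A": 0.05, "B": 0.15, "C": 0.80}
h1 = {"A": 0.10, "B": 0.10, "C": 0.80}
# Same probability for A under H1, but B and C are split differently.
h1_alt = {"A": 0.10, "B": 0.30, "C": 0.60}

prior = 1.0  # even prior odds, also an assumption

# Only the probability of the outcome that actually occurred (A) is used.
odds = posterior_odds(prior, h1["A"], h0["A"])
odds_alt = posterior_odds(prior, h1_alt["A"], h0["A"])
print(odds, odds_alt)
```

Both calls use only the entries for A, which is the whole point: the B and C entries never enter the calculation.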

Art Vandelay
8th January 2007, 07:48 PM
Let's get the ad hominems out of the way first, shall we?

Huh? Ad hominems? What do you mean? You're the one trying to make an argument from authority. I just responded by disputing the alleged authority. That's hardly "ad hominem".

I have discussed it since with a number of people, including several statisticians who were tenured professors at different universities.

And none of them took issue with your step three?

I have no idea what your bona fides are to back up your self-identification as a statistician, but if you claim that anyone who disagrees is not a statistician, then you've made a claim that is very much on the outrageous side.

It's a rather basic principle of statistics. Without it, you're not doing statistics.

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on.

This is a bit of an abuse of the word "experiment", as it is more of an observational study than an experiment.

According to Bayes' Theorem, under no prior set of beliefs should the difference in design of the experiments make any difference in your conclusions.

Theorems do not speak of "should". I'm rather suspicious of your repeated, and unsupported, claims of what BT "says".

If one takes the view that reasonable people start with a set of prior beliefs which they then continuously modify in the light of experience, then no reasonable person can ever draw the distinction between these two cases that hypothesis testing does.

Sure they can.

Of course if you do not believe that reasonable people should have beliefs and modify those beliefs in the face of experience in a logical fashion, then you may not think that the results of hypothesis testing are unreasonable.

I don't see what they have to do with each other.

However it seems absurd that your conclusion about what is true is based on what didn't happen.

Not to me.

Bayes' Theorem allows us to quantify the reason why our intuition says that this is absurd.

How?

Therefore hypothesis testing leads to absurd distinctions being made.

Just because you don't understand the reason doesn't mean they are absurd.

Now the p-value drops to 1/8192!

You should put a space between 8192 and !. 1/8192! is less than 10^-25000.

This is a drastic change in the strength of our conclusion, yet the extra coin flips gave us absolutely no information about the likelihood of sons versus daughters!

It is not, on average, an increase in the "strength" of our conclusion. It's an increase in the "strength" when we get the result, but we will get the result less often, resulting in no average increase.

Things that should be irrelevant matter greatly in hypothesis testing.

More precisely, one can design an experiment in which issues that aren't part of what's being tested will have an effect.

And we are doing so because (as 69dodge pointed out) we are explicitly taking into account in our decision the likelihood of things that didn't happen. (Note that Bayes' formula completely ignores the might have beens that didn't happen - they can't matter to it.)

That's not true.

The nutshell is that hypothesis testing only concerns itself with limiting the odds of making one type of error (incorrectly rejecting the null hypothesis) and says absolutely nothing useful about the true odds of any hypothesis.

More precisely, it deals with quantifying that error. How large one wants it to be is up to the experimenter. And there is usually an effort made to avoid the other type of error; it's just that it can't be quantified.

Why is it absurd? It is a fact that, no matter what beliefs you have about the world, possibilities that didn't happen shouldn't affect how you change your beliefs.

If possibilities that don't happen don't affect your beliefs, then possibilities that do happen shouldn't matter, either.

(Your beliefs about the odds of that not happening might matter, but the things that didn't happen don't.)

If my beliefs matter, then how can the things not matter?

Mathematically it can't.

Why not?

Art Vandelay
8th January 2007, 08:34 PM
But a good statistical theory should let us extract as much information as possible from whatever result happens in whatever experiment we do.

You can choose the most efficient statistic. Or you can choose a particular alpha. You just can't, in general, do both.

Since A was twice as likely on one hypothesis as on the other, its occurrence provides evidence in favor of the one and against the other.

Then you should have thought about that before you started the experiment. Here's something to try: go into a casino, go to a blackjack table, and hit on everything less than 21. Now, suppose you hit on 20, bust, and then see that the dealer ends up with 19. You should tell the dealer that, with 20, it was more likely that you'd win than that you'd lose, so you should get your money back, and it's silly to ignore that information. See how well that plays out.

But your beliefs should change solely based on what you did observe, and not based on what you didn't observe.

But those are logically indistinguishable. Observing X is the same thing as not observing not X.

In other words if you see a cat I shouldn't be able to change your beliefs about what you are seeing there by saying, "That isn't a dog."

The knowledge that it's not a dog absolutely may change my beliefs.

However here we are talking about identical data produced by an identical physical process which will be interpreted differently by the same person.

No, it's a different process.

While there is no reason that an experimenter would start with that particular set of expectations, the experimenter probably starts with some set of expectations. And no matter what set of expectations that experimenter has, experiments A and B should not lead to a different posterior conclusion.

They don't.

It is merely provably irrelevant.

Yet you haven't presented the proof.

If you update your beliefs in any other way then you are being illogical.

Nope.

If you don't understand why this is the case, then review the explanation of Bayes' Theorem either online or in the probability book of your choice.

That's both a fallacious and arrogant thing to say.

So if you're using a method of inference that gives results that are nonsense according to Bayes' Theorem, then that is a fact that should make you sit up and pay attention.

"Nonsense" is not a mathematical term.

OK, what about an example. Say I am going for the JREF Million and my claim is that if a person is in an isolated room looking at a series of randomly selected pictures on a computer screen and I am in another isolated room looking at four pictures, one of which is the image being viewed by the sender, then I will click on the correct image 35% of the time.

What is the design of the experiment, what is the value of p and the size of the sample that would get me the million?

P depends on the data from the experiment; perhaps you mean to ask what alpha is? I don't know what JREF uses, but I would imagine it would be one in a million, or stronger. To get that, you would have to answer correctly 10 times in a row (and you would, assuming that your claim is correct, have a probability of 27 in a million of doing so). If you have 100 pictures, then you would have to identify 45 correctly (giving you a 1.5% chance). With 1000, you would need to get 310 correct, giving you a 99.6% chance.
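For anyone who wants to check figures like these, here is a rough Python sketch using exact binomial tails. It assumes a one-in-a-million alpha (a guess, as noted above, not a known JREF figure), a 25% chance rate under the null, and the claimed 35% hit rate. The function names are made up for this example, and the exact cutoffs it prints need not match the rounded figures quoted above.

```python
import math

def binom_pmf(n, k, p):
    """P(exactly k hits in n trials), computed through logs to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def upper_tail(n, k, p):
    """P(k or more hits in n trials)."""
    return sum(binom_pmf(n, j, p) for j in range(k, n + 1))

def critical_count(n, p_null, alpha):
    """Smallest hit count whose upper-tail probability under the null is <= alpha."""
    tail = 0.0
    best = None  # stays None if even n hits out of n is not rare enough
    for k in range(n, -1, -1):
        tail += binom_pmf(n, k, p_null)
        if tail > alpha:
            break
        best = k
    return best

ALPHA = 1e-6     # assumed significance level
P_NULL = 0.25    # guessing among four pictures
P_CLAIM = 0.35   # the claimed hit rate

for n in (10, 100, 1000):
    k = critical_count(n, P_NULL, ALPHA)
    power = upper_tail(n, k, P_CLAIM)  # chance of passing if the claim is true
    print(f"n={n}: need at least {k} hits; chance of passing if claim is true = {power:.3g}")
```

The pattern is the one described above: with few trials the claimant almost never passes even if the claim is true, and with many trials the claimant almost always does.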

Ben Tilly
8th January 2007, 10:45 PM
Let's get the ad hominems out of the way first, shall we?

Huh? Ad hominems? What do you mean? You're the one trying to make an argument from authority. I just responded by disputing the alleged authority. That's hardly "ad hominem".

Ad hominem means "of the man", and an ad hominem argument means one where you are appealing for or against the person making the argument rather than for the quality of the argument. So you claimed in essence, "I am an authority and anyone who is will agree with me." I respond by saying, "So and so is an authority and disagrees with you." This is an ad hominem argument on both of our sides.

I was acknowledging that was happening before moving on to the actual discussion of substance. Which you'll note I have not been conducting through "argument from authority", but rather by presenting detailed examples and calculations.

I have discussed it since with a number of people, including several statisticians who were tenured professors at different universities.

And none of them took issue with your step three?

Not only did they not, but it was one of them who first led me through those calculations. Several of them pointed to places in the literature where I could find further debate on whether Bayesian statistics should be used more.

I have no idea what your bona fides are to back up your self-identification as a statistician, but if you claim that anyone who disagrees is not a statistician, then you've made a claim that is very much on the outrageous side.

It's a rather basic principle of statistics. Without it, you're not doing statistics.

Who is arguing from authority now?

Would you mind explaining why it is a basic principle of statistics? If it is, you should be able to provide a cogent reason why it should be so. While you're at it, I would appreciate an explanation of how you would analyze the results of both experiments A and B.

(As an aside, I hate the fact that the interface here loses nested quotes. Because as things stand I have no idea what the exact phrasing of the claim is unless I want to go back and track it down by hand. And I'm getting really tired of cutting and pasting in the previous discussion to add context for what I am saying...)

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on.

This is a bit of an abuse of the word "experiment", as it is more of an observational study than an experiment.

I don't care what term you wish to use to describe the situation as long as the situation described is clear. However I'll note that the hypothetical mother who was described had to do a lot more than just observe.

According to Bayes' Theorem, under no prior set of beliefs should the difference in design of the experiments make any difference in your conclusions.

Theorems do not speak of "should". I'm rather suspicious of your repeated, and unsupported, claims of what BT "says".

My claims are a matter of easily established fact. If you wish to convince me that I am wrong, all that you need to do is produce a set of prior beliefs which would lead to a different set of posterior beliefs after observing case A and B. I am quite confident that you will fail.

The cause of my confidence is that Bayes' Theorem says that P(X given Y) is P(X and Y)/P(Y). (With appropriate amendments for probability density functions if you wish to go from discrete to continuous distributions.) This concrete formula provides a clear way to factor in one's prior expectations and the observed results. It provides no way for the difference in experimental design to matter.
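A small numerical sketch of that claim, in Python. It assumes, purely for illustration, that the data are six sons followed by one daughter, that design A stops at the first child whose sex differs from the earlier ones, that design B fixes the family at seven children in advance, and that the prior is an arbitrary bump on a grid of values for theta = P(son). The two likelihoods differ only by a constant factor, which cancels when the posterior is normalized.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def posterior(prior, likelihood, grid):
    """Bayes' rule on a grid: posterior is proportional to prior times likelihood."""
    return normalize([pr * likelihood(t) for pr, t in zip(prior, grid)])

# Grid of candidate values for theta = P(son).
grid = [i / 100 for i in range(1, 100)]
# Arbitrary (hypothetical) prior, peaked near 0.5.
prior = normalize([t * t * (1 - t) ** 2 for t in grid])

def like_design_a(theta):
    # Stop at the first change: six sons in a row, then a daughter.
    return theta ** 6 * (1 - theta)

def like_design_b(theta):
    # Seven children fixed in advance: six sons and one daughter in any order.
    return 7 * theta ** 6 * (1 - theta)

post_a = posterior(prior, like_design_a, grid)
post_b = posterior(prior, like_design_b, grid)

# Largest difference between the two posteriors across the grid.
max_gap = max(abs(a - b) for a, b in zip(post_a, post_b))
print(max_gap)
```

Swapping in any other prior on the grid should give the same outcome, since the factor of 7 drops out regardless of what it multiplies.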

However, surprise me. Please.

If one takes the view that reasonable people start with a set of prior beliefs which they then continuously modify in the light of experience, then no reasonable person can ever draw the distinction between these two cases that hypothesis testing does.

Sure they can.

Example. Please.

By that I mean give me a detailed set of prior beliefs which, when modified according to Bayes' Theorem in the light of these two experiments, leads to different conclusions. If you succeed I will be both astonished and fascinated to see how it happened.

Of course if you do not believe that reasonable people should have beliefs and modify those beliefs in the face of experience in a logical fashion, then you may not think that the results of hypothesis testing are unreasonable.

I don't see what they have to do with each other.

I strongly suspect that if you try and fail to provide me with the requested example, the connection wi