p-values, Samples, and Populations
TellyKNeasuss
21st February 2008, 07:29 PM
When computing a correlation coefficient, it is SOP to also compute the p-value - the probability that the computed correlation could have been that large when there really wasn't any actual correlation between the 2 variables. There are 2 main reasons why a false correlation might be computed: because ordinarily you only have a subset of the population and the subset might not be representative; and because the data have errors. My question is: Does the p-value have the same significance if you have the entire population? For example, if a company asks you to compute the correlation between salary and years of education and gives you access to all the personnel records, you would have the entire population and the only possible source of error in the correlation would be errors in the personnel records. Would the p-value be meaningful in this case?
Walter Wayne
21st February 2008, 08:22 PM
Not a stats guy, but just thinking out loud.
For any pair of variables, you shouldn't be surprised to see a non-zero correlation between them even if they are completely unrelated. You need to know how far from zero that correlation should be in order to suspect there might be a relationship between the two variables.
Walt
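To put a rough number on "how far from zero", here is a minimal simulation sketch. It is not from the thread: the function names, the sample size of 20, and the choice of independent normal draws are all assumptions made for illustration. It draws many pairs of unrelated variables and reports the value that |r| stays below 95% of the time.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def null_abs_r_quantile(n, trials=2000, q=0.95, seed=0):
    """Estimate the q-quantile of |r| when x and y are independent.

    Assumption: both variables are standard normal draws; n is the
    number of (x, y) pairs in each simulated data set.
    """
    rng = random.Random(seed)
    abs_rs = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        abs_rs.append(abs(pearson_r(xs, ys)))
    abs_rs.sort()
    return abs_rs[round(q * trials) - 1]


threshold = null_abs_r_quantile(n=20)
print(f"With 20 unrelated pairs, 95% of |r| values fall below about {threshold:.2f}")
```

Rerunning with a larger n should shrink the threshold, which is the usual reason the same r can be unremarkable in a small data set and striking in a large one.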
3bodyproblem
21st February 2008, 08:30 PM
Dear God, man, why? Statistics is for... well, no one knows. Most probably gamblers and their relatives.
[B - Special Functions, F+ - Statistics]
Piggy
21st February 2008, 08:35 PM
For example, if a company asks you to compute the correlation between salary and years of education and gives you access to all the personnel records, you would have the entire population and the only possible error in the correlation would be errors in the personnel records. Would the p-value be meaningful in this case?
Are you attempting to calculate the correlation for that exact population only?
Or are you attempting to extrapolate from that population in order to make assumptions that you believe would hold true for other populations -- for example, the staff you expect to have 10 years from now, or the staff at a new plant you plan to open?
In the latter case, your population is a sample, even though it's currently all you have.
In the former case, you're not dealing with statistics.
TellyKNeasuss
21st February 2008, 08:47 PM
So if I were attempting to calculate the correlation for just that exact population, and say that I computed r = 0.5, then could I conclude with certainty that (if the data are all correct) years of education explains exactly 25 percent of the variance in salaries? Would the p-value give any information as to the accuracy of the data (maybe this was a poor example, because in many situations the data will be known to have inaccuracies)?
Piggy
21st February 2008, 08:54 PM
If you are only concerned with that group, what you're doing is measurement. As you say, your only concern regarding error has to do with GIGO -- "Is my data correct?"
If your input is correct, your measurement is accurate.
There's no question of statistical "noise".
However, if you then ask a question like, "If we start a new plant in another state, how likely is it that we'll see the same relationship between education and salary there?", then you're in a different ballgame.
Piggy
21st February 2008, 08:58 PM
Or, to put it another way, if you're not using samples, you do not have a statistical problem on your hands.
bpesta22
21st February 2008, 09:32 PM
It's still stats; I think the difference is between descriptive and inferential stats. If dealing with a population, you're not making inferences, but you can still describe what's going on in the population.
p values are misleading because the sample size can make trivial correlations "significant". All the p value tells you here is the probability of getting the correlation you did, assuming the real correlation (known only by god) was zero.
Since you have the whole population, the correlation is what it is. It describes the population, but you still don't know if the correlation is real or just a fluke. In this sense, the p value can be meaningful.
When you have say 30-100 subjects, p values can be informative. With larger samples, really small R's become significant. It's better to interpret the size of the correlation and the % of variance explained like you did. To me, .50 is a pretty big correlation. To others, it might not be impressive.
I guess it depends on what you're trying to conclude.
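The sample-size point can be sketched numerically. The snippet below is an illustration, not anything posted in the thread: `approx_p_value` is a hypothetical helper that uses the Fisher z-transformation, a normal approximation rather than the exact t-test, so the values are only approximate for small n. It holds r fixed at 0.05 and shows the p-value shrinking as n grows, while the variance explained stays at 0.25%.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is zero.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal deviate. This is an approximation, not the
    exact t-based test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.05  # a trivially small correlation
for n in (30, 100, 1000, 10000):
    p = approx_p_value(r, n)
    print(f"n={n:>6}  r={r}  r^2={r * r:.4f}  approx p={p:.2g}")
```

The r^2 column never changes, which is why the size of the correlation is usually the more informative thing to report once the data set is large.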
Walter Wayne
21st February 2008, 09:41 PM
I disagree with you on that, Piggy. He is attempting to describe/summarize a population, which is one part of statistics. Even if he only collected data on salaries, the population of salaries would still have a distribution, and that distribution would have a mean, median, mode and variance. Of course, if he only did that he would just be doing descriptive statistics. Now if you do descriptive statistics where your sampled population is the entire population, then you have no error bars on the mean, median etc. (assuming you collect accurate data).
Telly, from your description, you're trying to do inferential statistics. In this case one still doesn't have an error bar on the correlation, but one wants to know how confident one should be of the conclusion. And this is where care must be taken in inferring anything from the correlation value. With only 3 people in the company, you can't infer very much from an r of 0.5. So given a correlation of A and B, there are 4 possible conclusions.
- Variation in A is responsible for variation in B.
- Variation in B is responsible for variation in A.
- A third common factor is responsible for both A and B.
- The correlation of A and B is merely a result of chance.
The p-value is an attempt to quantify the likelihood of the fourth possibility. Once one has ruled that out ... well you still have three out of the four possibilities left.
Walt
P.S. And Pesta said it better.
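The "3 people, r of 0.5" case can be checked by brute force. The following is a sketch under stated assumptions: the education and salary figures are made up to give r = 0.5 exactly, and `exact_permutation_p` is a hypothetical helper that implements a two-sided exact permutation test, which is one way (not the only way) to quantify the chance explanation.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def exact_permutation_p(xs, ys, tol=1e-12):
    """Two-sided exact permutation p-value for the correlation.

    Tries every reordering of ys against xs and returns the fraction
    whose |r| is at least as large as the observed |r|. Only practical
    for very small data sets, since there are n! reorderings.
    """
    r_obs = abs(pearson_r(xs, ys))
    perms = list(itertools.permutations(ys))
    hits = sum(1 for p in perms if abs(pearson_r(xs, p)) >= r_obs - tol)
    return hits / len(perms)


# Made-up data for a three-person company (years of education, salary in thousands)
education = [12, 14, 16]
salary = [40, 60, 50]

print("r =", pearson_r(education, salary))
print("permutation p =", exact_permutation_p(education, salary))
```

With only three people, every possible pairing of salaries with education levels gives |r| of at least 0.5, so the chance explanation cannot be ruled out at all. That says nothing about which of the other three explanations applies once chance has been excluded in a larger data set.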
Piggy
22nd February 2008, 04:22 AM
Ok, gotcha. Thanks for the clarification!
[We need a lightbulb-icon]
dakotajudo
22nd February 2008, 07:03 AM
Another thing to consider is that the p-value of the correlation coefficient will give you some estimate of whether the relationship between observations is linear.
So, you might just as well do a regression - that gives you a p-value for slope and intercept. Then you can compare the linear model to other models. I suspect the calculations of p for correlation and for regression are equivalent, but I'd have to work through them by hand to be sure.
For example,
So if I were attempting to calculate the correlation for just that exact population, and say that I computed r = 0.5, then could I conclude with certainty that (if the data are all correct) years of education explains exactly 25 percent of the variance in salaries?
You might better say that a linear relationship between education and salary explains 25% of the variance. Education might explain more of the salary, but only if you use a different model.
It may be that education is a better predictor of salary, but it's asymptotic - a four year degree predicts a 50% boost in salary, but an additional year of grad school only adds 10%. Maybe a four year degree accounts for 30%, a two-year masters adds another 30%. (OK, I'm just making up numbers, but you can see the point).
Regression against a different model might give a different r.
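A small sketch of that last point, using invented numbers in the same spirit as the ones above: salary is assumed to rise with the logarithm of years of education (diminishing returns), which is a hypothetical model chosen only for illustration. The code checks that r squared matches the variance explained by a straight-line fit, and then compares it with the r obtained after transforming education.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linear_fit_r_squared(xs, ys):
    """Fraction of the variance in ys explained by the least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data: salary grows with log(years), so each extra year adds less
years = list(range(1, 11))
salary = [30 + 20 * math.log(t) for t in years]

r_linear = pearson_r(years, salary)
r_log = pearson_r([math.log(t) for t in years], salary)

print("linear model:  r =", r_linear, " variance explained =", linear_fit_r_squared(years, salary))
print("log model:     r =", r_log)
```

Because the made-up salaries follow the log model exactly, the transformed r comes out at 1 while the straight-line r falls short of it; real data would show a smaller gap, but the comparison works the same way.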
© 2001-2009, James Randi Educational Foundation. All Rights Reserved.