IPPCR 2019 Overview of Hypothesis Testing Part 3 of 5

>>Paul Wakim: Hello.
I’m Paul Wakim. I’m chief of the Biostatistics and Clinical Epidemiology Service at the NIH Clinical Center, and this is part three of hypothesis testing. And in this part, we’re going to talk about something really exciting that is happening right now, spring 2019. And you’ll see
what I’m talking about. It’s about the ASA
and the reporting of p-values. The ASA stands for the American
Statistical Association. But let me start with
the reporting of the p-value. And let me start with
how not to report p-values. This is the old way, and we really need
to change this practice. We, the scientific community. And what I’m talking about
is when, you know,
you publish your manuscript and you put a table and you
have stars and then a footnote. And it says p-value
of less than .05. We need to stop doing that.
The [laughs] irony of this is that
I’m a co-author of this paper. This was back in 2009,
and I basically — I remember even back then,
I didn’t like that. I didn’t like the star,
p-value less than .05. And I told the lead author
about it. That — you know, can’t we put the actual values of the p-values? And basically, the answer was, “Well, this is the format
of the journal. This is the way
the journal wants it. And so, that’s why we had
to do it this way.” But it’s changing.
This is going to change. That was, of course, 10 years ago. So, it is changing. But I just want to make sure that
this is not the way to do it. Instead, how to report it is you
just put the value of the p-value,
like in this table. The authors actually
give the value of the p-value. And you say, okay, fine. Well, it is important
because look at those two p-values
that are highlighted. One is .062, and one is .042.
Same table. Two p-values. Now, if we were to put stars, you know, the first one
would not get a star, because it is greater than .05. And the bottom would get a star
because it’s less than .05. Well, let me make things
very, very clear: .062 is not very different from .042.
It’s really not different. So, to have one with a star
and one without a star gives the impression
to the reader that, oh, yeah, one is
statistically significant, and the other one is not. When, in fact, the two
are pretty much the same. So, again, put the value
of the p-value. And we’re going to talk,
believe me, we’re going to talk more about this in this segment. So, that’s what I want
to talk about, the American
Statistical Association. And just to give you some background about the American
Statistical Association. According to its website, it was founded
in Boston in 1839. So, it’s 180 years old. It is the world’s largest
community of statisticians. It is the second oldest
continuously operating professional association
in the country, and it has members
from all over the world. So, it’s not just
American statisticians. So, the point here is that it is a very respectable association that has been around for a while. And so, what about the American
Statistical Association? Well, in March 2016,
the ASA issued a statement on p-values
and statistical significance. And it says this is the first time the ASA has spoken so publicly about a fundamental part of statistical theory and practice. The first time. And it’s a 180-year-old association. So, it was a big deal.
It was a big deal back in 2016. And I’m going to talk about
what this is about. But I’m just giving you sort of,
first the history, and then we’re going to
go into more detail. Then in October 2017,
there was an ASA symposium on statistical inference.
And the title was, “Scientific Method
for the 21st Century: A World Beyond p less than .05.” This was a two-day conference
talking only about p-values. I attended that conference
because it was here in Bethesda. And so, it was really a very good conference with very eye-opening information. And then, in March 2019 —
so we’re just talking about, what, two months ago,
well three months ago — there was a special issue
of the American Statistician. And the title of
that special issue was, “Statistical Inference
in the 21st Century: A World Beyond p less than .05.” This special issue started
with a 19-page editorial paper, 19 pages, followed by 43 additional papers just on the p-value. And in it, or rather in the announcement of the special issue, it says, “This is a significant day in the history of the ASA and statistics.” So, again, it’s a very
exciting time. And it’s a big deal,
and we’re just starting. The transition period
is just starting. And I’m going to talk more
about all this. So, let’s start
with the statement. So, this is the 2016 statement.
And it had basically — just to give you a summary
of what it was about, it gave six principles.
And it started by saying, “Let’s be clear, nothing
in the ASA statement is new.” So, this is an accumulation
of decades of things that were misinterpreted
and published with the wrong interpretation.
And the ASA said, “Okay. Let’s, once and for all,
let’s make things clear.” So, nothing new
in that statement, but just sort of pushing it out as, you know, scientific community, be aware of all these things.
And the six principles were: One, p-values can indicate
how incompatible the data are with
a specified statistical model. Two, p-values do not measure
the probability that the studied hypothesis
is true or the probability
that the data were produced by random chance alone.
This is what we talked about earlier. Three, scientific conclusions
or business or policy decisions should not be based only on whether a p-value
passes a specific threshold. Four, proper inference requires
full reporting and transparency. Five, a p-value
or statistical significance does not measure the size
of an effect or the importance of a result. You may have heard,
you know, statistical significance
versus clinical significance or clinical importance,
different concepts. Not to be mixed. And six, by itself,
a p-value does not provide a good measure of evidence
regarding a model or hypothesis. And it says, “No single index
should substitute for scientific reasoning.” We can’t have just one number and then say that’s the conclusion, just based on one number. We have to have scientific reasoning behind it as well. And we’re going to talk more
about this in this segment. Now, that was the statement
in 2016. Let’s talk about
the special issue in March 2019. It says, “Moving to a world
beyond p less than .05.” So, what is this about? Well, the special issue —
so again, sort of a summary — it says, “Don’t base
your conclusions solely on whether an association
or effect was found to be statistically
significant, i.e.,
the p-value passed some arbitrary threshold such
as p less than .05.” Don’t do that. Don’t believe that
an association or effect exists just because it was
statistically significant. Don’t believe that an
association or effect is absent just because it was not
statistically significant. Don’t believe that your p-value gives the probability
that chance alone produced the observed
association or effect or the probability that
your test hypothesis is true. A lot of “don’ts.”
Don’t conclude anything about scientific
or practical importance based on statistical
significance or lack thereof. And it says,
in that special issue, “We know it’s
a lot of ‘don’ts.'” The authors of the editorial are saying, “Okay, this time, we’re not just going to give you ‘don’ts,’ but we’re also going to give you some dos.” And they do. They do in those
43 additional papers. So, what’s the bottom line of this special issue? The bottom line is that the terms
statistically significant, statistically different,
p less than .05, non-significant, there is an effect,
there is no effect, and any similar expressions
should not be used at all, whether expressed in words,
by an asterisk in a table, or any other way. Very categorical about it.
Just don’t do it. P-values can be used. Now, what I particularly like about the special issue and the editorial is that they’re not saying, “Let’s completely
ban p-values.” No, they’re saying,
“They can be used, but when they are used, they should be reported
as a continuous quantity.” So, give the value
of the p-value, not a threshold, not a yes or no based on a threshold. They should not be dichotomized.
It’s the same thing. There should not be a threshold with a yes or no for less than or greater than.
Give the value. And think of p-values
as measuring compatibility between hypotheses and data
and interpret interval estimates as compatibility intervals. You know, again,
it’s a transition. It’s easy to say,
but it’s really hard. Because we have to change
the whole way of thinking. And, you know, even though I’m aware of this special issue, I read the editorial, and I’m very much supportive of it, very much so, it is hard. It’s a hard thing. It’s going to take time. And it’s going to take effort. And we have to get used to it, but we have to move
in that direction. I forgot to say that, along with this editorial in March, there was also a special paper in Nature magazine, also in March 2019. And there was a National Public Radio piece on March 20th, 2019, about this editorial. So, you may think, oh well, maybe I’m kind of exaggerating
what the ASA is saying. You know, maybe they’re not
that strong about it. Maybe they’re just saying,
“Well we recommend,” or “We would prefer.” Well, just to be clear,
let me show you this. This is in the editorial,
section two. The title says, the title, “Don’t say statistically significant.” I mean, you can’t be more direct than this. It says the ASA statement on p-values and statistical significance, that’s the 2016 statement, stopped short,
just short of recommending that declarations of statistical
significance be abandoned. We just didn’t
do that in 2016. But guess what,
we are doing it now. Just don’t say
statistically significant. Let’s stop that. So, they summarize —
in the editorial, there’s kind of a summary
of all of this and moving to a world
beyond p less than .05. And they’re all good
points here. And I think it’s important
for the scientific community to train ourselves
to think that way. Accept uncertainty.
You know, you do an experiment. You find a result.
Well, all this is uncertain. It’s not definitive.
It still has uncertainty. Just because the data shows something, it doesn’t mean that’s it, that in one shot we discovered the truth. Just because we have
a statistical model, doesn’t mean that we got
rid of all uncertainty. Things are still uncertain.
Accept uncertainty. Be thoughtful.
Just don’t rely on one number. You look at one number.
That’s it. We’re done. That’s the conclusion,
based on one number. No, what does it mean? Does that make sense
scientifically? What’s the uncertainty
around it? Just think a little bit more, rather than just reacting to one number. I mean, I like this statement. It says, “A declaration
of statistical significance is the antithesis
of thoughtfulness.” And that’s basically
what it’s about. Be open.
Basically, it says, you know, just register your clinical trial. Put your raw data, de-identified, out there. Put your analysis code out there. Let people try to reproduce
what you did. Sort of, let’s be open.
And let’s share. And we’re all going
to benefit from it. But just more transparency. What did you do?
Was this prespecified? So, be open.
Be modest [laughs]. Again, just accept that what we
find with one experiment is not, “that’s it.” We found the cure.
We found this. No. There’s still uncertainty.
We still have to confirm it. We still have to reproduce it.
So, modesty. And institutional
change is needed. And that’s basically huge. Because basically what it says is journals
have to change their thinking, and investigators have
to change their thinking, statisticians have
to change their thinking. Education, the way we teach statistics, has to change. Universities’ incentives to publish what’s significant have to change. Lots of things
have to change. Again, it’s not
going to be easy. It’s not going to be quick, but we really have to move
in that sense. And like I said, you know, I’m still reviewing manuscripts
where I am a co-author. And I said, “Okay, I’m going
to start a new thinking. I’m going to get
rid of anything that says
statistically significant.” Well, it’s easier
said than done. And then I say,
“Okay, what do we say instead?” And it’s not obvious. So, it is going
to take some effort. And the ASA puts these together as the acronym ATOMIC: accept uncertainty,
thoughtful, open, modest,
and institutional change. So, what is the connection
between p-value and sample size? I want to talk about that. So, you have a randomized controlled trial. You’re comparing two treatments, folic acid plus vitamin B6 versus placebo. Your primary outcome measure
is systolic blood pressure. And you get the results from 130
clinically healthy individuals. And the difference between the two treatment arms in the reduction in systolic blood pressure is 3.7 millimeters of mercury.
Okay. And the confidence interval
is from .6 to 6.8 millimeters. And the two-sided p-value
is .02. All right. And that’s published in
that article in that reference. So, what I’m going to do is
I’m going to show you something. So, we’re going to talk
about the distribution of the difference
between the two treatment groups when in fact
there is no difference. So, the difference is zero.
It’s centered at zero. This is the distribution
of the difference. So, we said the observed
difference is 3.7 millimeters and the standard deviation — so the variability
of the data — is 17.96 millimeters of mercury.
That’s the result. These are the results
from the clinical trial. And we also said
that the sample size was 130 and this was
the confidence interval. And the two-sided p-value
is .02. Again, the two-sided p-value means the probability of getting a result as extreme as the one we got, 3.7, or more extreme on both sides, under the assumption of no difference; that’s .02. Okay, so why am I
doing all this? Well, what I’m going to do next
is I’m going to keep pretty much everything the same,
except for the sample size. We’re trying to answer
what’s the connection between p-value and sample size. So, the observed difference
is the same. The standard deviation
is the same. But now my sample size is 40. Now, you may think, wait a second, if the standard deviation
is the same, how come this is flatter than
the other distribution with 130? Because this distribution is based on the standard error, whereas I’m keeping the standard deviation constant. The standard error is the standard deviation divided by the square root of n. With a standard deviation of 17.96, that’s about 1.6 for 130 individuals but about 2.8 for 40. So, as n goes down, the standard error goes up. In that case, the confidence
interval goes from -1.9 to 9.3, and the two-sided p-value
is .21. All I’ve changed
is the sample size. I didn’t change the results,
the observed difference and the spread of the data,
the variability of the data. I didn’t change that. All I changed
is the sample size. Different results. Now, let’s go the other way. Let’s say the sample size
is 250, instead of 130. The distribution
of the difference is much narrower. The confidence interval
goes from 1.5 to 5.9. And the p-value is .0013.
Again, the data has not changed. Same mean difference.
Same standard deviation. Same spread.
Just the sample size. So, what is the connection
between p-value and sample size? If you keep everything the same but you increase the sample size, the p-value is going to go down.
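To make that connection concrete, here is a minimal sketch in Python (mine, not part of the lecture) that redoes the three scenarios above from the numbers given. It assumes the standard error formula quoted above, the standard deviation divided by the square root of n, and a normal approximation, so it should roughly reproduce the intervals and p-values quoted (.21, .02, and .0013); small differences are possible if the original analysis used a t-distribution.

```python
# A minimal sketch (not from the lecture): same observed difference and
# standard deviation, three different sample sizes.
# Assumes SE = SD / sqrt(n), as quoted above, and a normal approximation.
from math import sqrt
from scipy import stats

observed_diff = 3.7   # mmHg reduction in systolic blood pressure
sd = 17.96            # mmHg, spread of the data (held constant)

for n in (40, 130, 250):
    se = sd / sqrt(n)                        # standard error shrinks as n grows
    z = observed_diff / se                   # test statistic under "no difference"
    p_two_sided = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    ci_low = observed_diff - 1.96 * se       # approximate 95% confidence interval
    ci_high = observed_diff + 1.96 * se
    print(f"n = {n:3d}: SE = {se:.2f}, 95% CI ({ci_low:.1f}, {ci_high:.1f}), "
          f"two-sided p = {p_two_sided:.4f}")
```

Only n changes between the three lines of output, and the p-value moves from about .2 down to about .001.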
So, there are two ways to get small p-values, guaranteed. Sounding like a commercial. You know, if you want
a small p-value, I guarantee you, there are two ways
you can get a small p-value. You just increase
your sample size. Analyze a very large sample; you’re going to get
a small p-value. Guaranteed. The second way,
I’ll tell you later. But that’s one way. So, p-value and power. By definition, and this is
another misinterpretation, another thing that, you know, is quite often done incorrectly. By definition, we cannot calculate a p-value at the design stage, by definition. Why by definition? Because the p-value is the
probability of getting results as extreme or more extreme
than the one we got. So, we can’t do that
at the design stage, since we don’t have results. So, the p-value’s meaningful
only after results are known. By the same token, power is meaningful
only before results are known. And post-hoc power
is meaningless. Calculating power
after the experiment, based on the data,
is meaningless.
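For contrast, here is a small sketch (again mine, not from the lecture) of what a design-stage power calculation looks like: every input is an assumption made before any data exist, including the true difference you hope to detect, taken here to be the 4 mmHg minimum clinically important difference discussed later, and the anticipated variability.

```python
# Sketch of a design-stage power calculation: every input is an assumption,
# made before any results exist. Numbers are illustrative only.
from math import sqrt
from scipy import stats

assumed_true_diff = 4.0   # mmHg difference we would like to detect
assumed_sd = 17.96        # mmHg anticipated variability
alpha = 0.05              # two-sided significance level
z_crit = stats.norm.ppf(1 - alpha / 2)   # about 1.96

for n in (40, 130, 250):
    se = assumed_sd / sqrt(n)
    # Probability of crossing the critical value if the true difference
    # really is assumed_true_diff (normal approximation, dominant tail).
    power = stats.norm.sf(z_crit - assumed_true_diff / se)
    print(f"n = {n:3d}: power is about {power:.2f}")
```

Running the same kind of calculation after the fact, plugging in the observed difference, is the post-hoc power the lecture calls meaningless; it is essentially a restatement of the p-value.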
What is the connection between alpha and p-value? If the p-value
is less than alpha, typically .05, the null hypothesis
of no difference is rejected. And the result is declared
statistically significant at the 5 percent alpha level. I know what you’re thinking. If the p-value
is greater than alpha, the result is not
statistically significant at the 5 percent alpha level. I’m saying
statistically significant. That’s the old thinking. Now, it’s correct.
These statements are correct. They’ve been used the way they’ve been used; that doesn’t make them wrong, or mean that, you know, we’ve been doing the wrong thing all this time. But we’re trying to move away from this kind of thinking.
The problem is that sample size and minimum clinically important
difference are ignored when we talk about just the p-value. So, the new thinking is not to reject the null hypothesis solely based on the p-value,
not to do that. And do not dichotomize
your conclusion, statistically significant,
not statistically significant. There is an effect;
there is no effect. Don’t do that anymore. So, here’s a paper. This is
from the same special issue. And it’s a very,
very good paper. It basically says use confidence intervals along with p-values. We’re not banning p-values. We’re saying, along with p-values, use confidence intervals. But even more than that —
so, let’s say we talk about the example
of systolic blood pressure and, you know, reduction
is going to the right. So, the right is a good thing.
No difference is zero. And we say, “Okay,
I found the confidence interval where the p-value is .09. And here’s the confidence
interval, covers zero. So, not statistically
significant.” And the author of this paper
is saying, you know,
don’t do just that. Don’t rely just on the p-value.
Look at something else. Because you know,
you’re ignoring sample size. What if the sample size
was bigger? Same difference,
same difference, but the sample size is bigger. The confidence interval
is going to be narrower. The p-value is going to go down.
Now, you have a p-value of .01. All you’ve changed
is the sample size. So, now, it’s statistically
significant all of a sudden. Well, what if I told you that the minimum
clinically important difference is four millimeters of mercury
for the systolic blood pressure? So, in other words, any reduction less than 4 really is not that big of a deal. Well, in this case, with that bottom confidence interval, is that really an interesting result? And the author claims,
maybe not. Even though p is .01. So, that goes back
to the thoughtfulness. Think more than
just one number. Now, again,
just because of sample size. But let’s say I increase
the sample size, but now I find
a confidence interval that’s completely above
the minimum clinically important difference. The p-value is .001, but I’m not
just relying on the .001. I’m also looking at
the confidence interval and comparing it to the minimum
clinically important difference. And now, I can be confident that the reduction
in systolic blood pressure is indeed more
than four millimeters. Or at least, again, we go back
to accept uncertainty. It looks like, it seems like, we can be pretty confident, or the data are compatible with the hypothesis that the reduction is more than four millimeters. See, it’s very easy to fall
into the trap of the old thinking.
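To tie this together, here is one more minimal sketch (again mine, using the lecture's numbers and the same normal approximation as above): report the estimate, the interval, and the p-value as continuous quantities, and read the interval against the 4 mmHg minimum clinically important difference instead of declaring significance.

```python
# Sketch: report the estimate, interval, and p-value, and compare the
# interval to the minimum clinically important difference (MCID),
# rather than declaring "significant" / "not significant".
from math import sqrt
from scipy import stats

observed_diff = 3.7   # mmHg reduction in systolic blood pressure
sd = 17.96            # mmHg
n = 130
mcid = 4.0            # mmHg, minimum clinically important difference

se = sd / sqrt(n)
ci_low, ci_high = observed_diff - 1.96 * se, observed_diff + 1.96 * se
p = 2 * stats.norm.sf(observed_diff / se)

print(f"Estimated reduction {observed_diff} mmHg, "
      f"95% CI ({ci_low:.1f}, {ci_high:.1f}), p = {p:.3f}")
if ci_low >= mcid:
    print("The whole interval is above the MCID: compatible with a "
          "clinically important reduction.")
elif ci_high <= mcid:
    print("The whole interval is below the MCID: compatible only with "
          "reductions smaller than the MCID.")
else:
    print("The interval straddles the MCID: the data are compatible with "
          "both clinically important and unimportant reductions.")
```

With the original n of 130 the interval runs from about 0.6 to 6.8, so even with p around .02 the data are compatible with reductions well below the 4 mmHg threshold, which is exactly the point being made above.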
Here are the references for this segment. So, in summary,
do not dichotomize p-values or statements
related to p-values. When reporting results
on differences, use p-values, the continuous
value, point estimates — that’s the dot in
the confidence interval — confidence intervals
in the lower and upper limit, along with the minimum
clinically important difference. And that’s where
the thoughtfulness comes in. What is the minimum
clinically important difference? It’s not a statistical question.
It is a clinical question. And so, my questions to you are what is the connection
between p-value and alpha; and what is the connection
between p-value and sample size? Thank you for watching.
