>>Paul Wakim: Hello.

I’m Paul Wakim. I’m chief of the biostatistics and clinical

epidemiology service at the NIH clinical center, and this is part three

of hypothesis testing. And in this part, we’re going to

talk about something really exciting that is happening

really right now, spring 2019. And you’ll see

what I’m talking about. It’s about the ASA

and the reporting of p-values. The ASA stands for the American

Statistical Association. But let me start with

the reporting of the p-value. And let me start with

how not to report p-values. This is the old way, and we really need

to change this practice. We, the scientific community. And what I’m talking about

is when, you know,

you publish your manuscript and you put a table and you

have stars and then a footnote. And it says p-value

of less than .05. We need to stop doing that.

The [laughs] irony of this is that

I’m a co-author of this paper. This was back in 2009,

and I basically — I remember even back then,

I didn’t like that. I didn’t like the star,

p-value less than .05. And I told the lead author

about it. That — you know,

can’t we put the actual values of the p-values?

And basically, the answer was, “Well this is the format

of the journal. This is the way

the journal wants it. And so, that’s why we had

to do it this way.” But it’s changing.

This is going to change. This is, of course,

10 years ago. So, it is changing. But I just want to make sure that

this is not the way to do it. Instead, how to report it is you

just put the value of the p-value,

like in this table. They actually, the authors,

give the value of the p-value. And you say, okay, fine. Well, it is important

because look at those two p-values

that are highlighted. One is .062, and one is .042.

Same table. Two p-values. Now, if we were to put stars, you know, the first one

would not get a star, because it is greater than .05. And the bottom would get a star

because it’s less than .05. Well, let me make things

very, very clear, .062 is not very different from .042.

It’s really not different. So, to have one with a star

and one without a star gives the impression

to the reader that, oh, yeah, one is

statistically significant, and the other one is not. When, in fact, the two

are pretty much the same. So, again, put the value

of the p-value. And we’re going to talk,

believe me, we’re going to talk more

about this in this segment. So, that’s what I want

to talk about, the American

Statistical Association. And just give you a background about the American

Statistical Association. According to its website, it was founded

in Boston in 1839. So, it’s 180 years old. It is the world’s largest

community of statisticians. It is the second oldest

continuously operating professional association

in the country, and it has members

from all over the world. So, it’s not just

American statisticians. So, the point here is that

it is a very respectable association and has been around for a while.

And so, what about the American

Statistical Association? Well, in March 2016,

the ASA issued a statement on p-values

and statistical significance. And it says, this is the first time

the ASA has spoken so publicly about a fundamental part of

statistical theory and practice. The first time. And it’s a 180-year-old

association. So, it was a big deal.

It was a big deal back in 2016. And I’m going to talk about

what this is about. But I’m just giving you sort of,

first the history, and then we’re going to

go into more detail. Then in October 2017,

there was an ASA symposium on statistical inference.

And the title was, “Scientific Method

for the 21st Century: A World Beyond p less than .05.” This was a two-day conference

talking only about p-values. I attended that conference

because it was here in Bethesda. And so, it was really

a very good conference and very eye-opening

information. And then, in March 2019 —

so we’re just talking about, what, two months ago,

well three months ago — there was a special issue

of the American Statistician. And the title of

that special issue was, “Statistical Inference

in the 21st Century: A World Beyond p less than .05.” This special issue started

with a 19-page editorial paper, 19 pages, and then followed

by 43 additional papers just on the p-value. And in the announcement

of the special issue, the ASA says,

“This is a significant day

in the history of the ASA and statistics.” So, again, it’s a very

exciting time. And it’s a big deal,

and we’re just starting. The transition period

is just starting. And I’m going to talk more

about all this. So, let’s start

with the statement. So, this is the 2016 statement.

And it had basically — just to give you a summary

of what it was about, it gave six principles.

And it started by saying, “Let’s be clear, nothing

in the ASA statement is new.” So, this is an accumulation

of decades of things that were misinterpreted

and published with the wrong interpretation.

And the ASA said, “Okay. Let’s, once and for all,

let’s make things clear.” So, nothing new

in that statement, but just sort of pushing it out as,

you know, “scientific community, be aware of all these things.”

And the six principles were: one, p-values can indicate

how incompatible the data are with

a specified statistical model. Two, p-values do not measure

the probability that the studied hypothesis

is true or the probability

that the data were produced by random chance alone.

This is what we talked about earlier. Three, scientific conclusions

or business or policy decisions should not be based only on whether a p-value

passes a specific threshold. Four, proper inference requires

full reporting and transparency. Five, a p-value

or statistical significance does not measure the size

of an effect or the importance of a result. You may have heard,

you know, statistical significance

versus clinical significance or clinical importance,

different concepts. Not to be mixed. And six, by itself,

a p-value does not provide a good measure of evidence

regarding a model or hypothesis. And it says, “No single index

should substitute for scientific reasoning.” Just one number,

we can’t have just one number. And then let’s say,

that’s the conclusion, just based on one number. Have to have also

scientific reasoning behind it. And we’re going to talk more

about this in this segment. Now, that was the statement

in 2016. Let’s talk about

the special issue in March 2019. It says, “Moving to a world

beyond p less than .05.” So, what is this about? Well, the special issue —

so again, sort of a summary — it says, “Don’t base

your conclusions solely on whether an association

or effect was found to be statistically

significant, i.e.,

p-value passed some arbitrary threshold such

as p less than .05.” Don’t do that. Don’t believe that

an association or effect exists just because it was

statistically significant. Don’t believe that an

association or effect is absent just because it was not

statistically significant. Don’t believe that your p-value gives the probability

that chance alone produced the observed

association or effect or the probability that

your test hypothesis is true. A lot of “don’ts.”

Don’t conclude anything about scientific

or practical importance based on statistical

significance or lack thereof. And it says,

in that special issue, “We know it’s

a lot of ‘don’ts.'” This time we’re not just going

to give you the editorial, the authors of the editorial,

they’re saying, “Okay, this time, we’re not just

going to give you ‘don’ts,’ but we also going to

give you some dos.” And they do. They do in those

43 additional papers. So, what’s the bottom line

of this special issue? The bottom line is that terms like

statistically significant, statistically different,

p less than .05, non-significant, there is an effect,

there is no effect, and any similar expressions

should not be used at all, whether expressed in words,

by asterisks in a table, or any other way. Very categorical about it.

Just don’t do it. P-values can be used. Now, that’s what I particularly

like about the special issue and the editorial,

is that they’re not saying, “Let’s completely

ban p-values.” No, they’re saying,

“They can be used, but when they are used, they should be reported

as a continuous quantity.” So, give the value

of the p-value, not a threshold, not yes

no based on the threshold. They should not be dichotomized.

It’s the same thing. It should not have

a threshold and say yes or no, less than or greater than.

Give the value. And think of p-values

as measuring compatibility between hypotheses and data

and interpret interval estimates as compatibility intervals. You know, again,

it’s a transition. It’s easy to say,

but it’s really hard. Because we have to change

the whole thinking and, you know, even though

I’m aware of this special issue. I read the editorial. And I’m very much supportive

of it, very much so. It is hard. It’s a hard thing.

It’s going to take time. And it’s going to take effort.

And we have to get used to it, but we have to move

in that direction. I forgot to say that also,

with this editorial in March, there was also

a special paper in Nature, also in March 2019. And there was

a National Public Radio piece also on March 20th,

2019 about this editorial. So, you may think, oh well, I’m kind of maybe exaggerating

what the ASA is saying. You know, maybe they’re not

that strong about it. Maybe they’re just saying,

“Well we recommend,” or “We would prefer.” Well, just to be clear,

let me show you this. This is in the editorial,

section two. The title says, the title, “Don’t say

statistically significant.” I mean you can’t be

more direct than this. It says the ASA statement

on p-values and statistical significance, that’s the 2016 one, stopped short,

just short of recommending that declarations of statistical

significance be abandoned. We just didn’t

do that in 2016. But guess what,

we are doing it now. Just don’t say

statistically significant. Let’s stop that. So, they summarize —

in the editorial, there’s kind of a summary

of all of this and moving to a world

beyond p less than .05. And they’re all good

points here. And I think it’s important

for the scientific community to train ourselves

to think that way. Accept uncertainty.

You know, you do an experiment. You find a result.

Well, all this is uncertain. It’s not definitive.

It still has uncertainty. Just because the data

shows something, it doesn’t mean that’s it. In one shot,

we discovered the truth. Just because we have

a statistical model, doesn’t mean that we got

rid of all uncertainty. Things are still uncertain.

Accept uncertainty. Be thoughtful.

Just don’t rely on one number. You look at one number.

That’s it. We’re done. That’s the conclusion,

based on one number. No, what does it mean? Does that make sense

scientifically? What’s the uncertainty

around it? Just to be a little bit

more thinking than just reacting

to just one number. I mean, I like this statement. It says, “A declaration

of statistical significance is the antithesis

of thoughtfulness.” And that’s basically

what it’s about. Be open.

Basically, it says, you know, just register

your clinical trial. Put your raw data,

de-identified, out there. Put your analysis code

out there. Let people try to reproduce

what you did. Sort of, let’s be open.

And let’s share. And we’re all going

to benefit from it. But just more transparency. What did you do?

Was this prespecified? So, be open.

Be modest [laughs]. Again, just accept that what we

find with one experiment is not, “that’s it.” We found the cure.

We found this. No. There’s still uncertainty.

We still have to confirm it. We still have to reproduce it.

So, modesty. And institutional

change is needed. And that’s basically huge. Because basically what it says is journals

have to change their thinking, and investigators have

to change their thinking, statisticians have

to change their thinking. Education, the way we teach

statistics, has to change. University incentives

to publish what’s significant

have to change. Lots of things

have to change. Again, it’s not

going to be easy. It’s not going to be quick, but we really have to move

in that sense. And like I said, you know, I’m still reviewing manuscripts

where I am a co-author. And I said, “Okay, I’m going

to start a new way of thinking. I’m going to get

rid of anything that says

statistically significant.” Well, it’s easier

said than done. And then I say,

“Okay, what do we say instead?” And it’s not obvious. So, it is going

to take some effort. And the ASA puts these

as an acronym ATOMIC for: accept uncertainty,

thoughtful, open, modest,

and institutional change. So, what is the connection

between p-value and sample size? I want to talk about that. So, you have

a randomized controlled trial. You’re comparing

two treatments, the folic acid plus vitamin

B6 versus placebo. Your primary outcome measure

is systolic blood pressure. And you get the results from 130

clinically healthy individuals. And the difference between

the two treatment arms, between the two treatments, the reduction

in systolic blood pressure is 3.7 millimeters of mercury.

Okay. And the confidence interval

is from .6 to 6.8 millimeters. And the two-sided p-value

is .02. All right. And that’s published in

that article in that reference. So, what I’m going to do is

I’m going to show you something. So, we’re going to talk

about the distribution of the difference

between the two treatment groups when in fact

there is no difference. So, the difference is zero.

It’s centered at zero. This is the distribution

of the difference. So, we said the observed

difference is 3.7 millimeters, and the standard deviation — so the variability

of the data — is 17.96 millimeters of mercury.

That’s the result. These are the results

from the clinical trial. And we also said

that the sample size was 130 and this was

the confidence interval. And the two-sided p-value

is .02. Again, a two-sided p-value means the probability

of getting a result like the one we got, 3.7 or more extreme

on both sides, under

the assumption of no difference; that’s .02. Okay, so why am I

doing all this? Well, what I’m going to do next

is I’m going to keep pretty much everything the same,

except for the sample size. We’re trying to answer

what’s the connection between p-value and sample size. So, the observed difference

is the same. The standard deviation

is the same. But now my sample size is 40. Now, you may say, you may think,

now, wait a second, if the standard deviation

is the same, how come this is flatter than

the other distribution with 130? Because this distribution

is based on standard error. Whereas, I’m keeping

the standard deviation constant. The standard error

is the standard deviation divided by square root of n. So, as n goes down,

the standard error goes up. In that case, the confidence

interval goes from -1.9 to 9.3, and the two-sided p-value

is .21. All I’ve changed

is the sample size. I didn’t change the results,

the observed difference and the spread of the data,

the variability of the data. I didn’t change that. All I changed

is the sample size. Different results. Now, let’s go the other way. Let’s say the sample size

is 250, instead of 130. The distribution

of the difference is much narrower. The confidence interval

goes from 1.5 to 5.9. And the p-value is .0013.
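The three scenarios can be reproduced with a short sketch. This is my own illustration, not code from the lecture: it treats the reported 17.96 as the standard deviation of the difference and uses a normal approximation, so the p-values come out close to, though not exactly equal to, the quoted .21, .02, and .0013.

```python
from math import sqrt
from statistics import NormalDist  # standard library, no SciPy needed

def p_and_ci(diff, sd, n, conf=0.95):
    """Two-sided p-value and confidence interval for an observed mean
    difference, using SE = sd / sqrt(n) and a normal approximation."""
    se = sd / sqrt(n)  # standard error shrinks as n grows
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    zc = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    return p, (diff - zc * se, diff + zc * se)

# Same observed difference (3.7 mm Hg) and same SD (17.96); only n changes.
for n in (40, 130, 250):
    p, (lo, hi) = p_and_ci(3.7, 17.96, n)
    print(f"n={n:3d}: p = {p:.4f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```

Under this approximation, n = 40 gives a p-value near .19 with an interval covering zero, n = 130 gives one near .02, and n = 250 one near .001: the p-value falls as the sample size grows even though the mean difference and the spread of the data never change.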

Again, the data has not changed. Same mean difference.

Same standard deviation. Same spread.

Just the sample size. So, what is the connection

between p-value and sample size? If you keep everything the same, but you increase

the sample size, the p-value’s going to go down. So, there are two ways to get

small p-values, guaranteed. Sounding like a commercial. You know, if you want

a small p-value, I guarantee you, there are two ways

you can get a small p-value. You just increase

your sample size. Analyze a very large sample; you’re going to get

a small p-value. Guaranteed. The second way,

I’ll tell you later. But that’s one way. So, p-value and power. By definition, and this is

another misinterpretation, another thing that, you know, quite often

is done incorrectly. By definition,

we cannot calculate p-value at the design stage,

by definition. Why by definition? Because p-value is the

probability of getting results as extreme or more extreme

than the one we got. So, we can’t do that

at the design stage, since we don’t have results. So, the p-value’s meaningful

only after results are known. By the same token, power is meaningful

only before results are known. And post-hoc power

is meaningless. Calculating power

after the experiment, based on the data,

is meaningless. What is the connection

between alpha and p-value? If the p-value

is less than alpha, typically .05, the null hypothesis

of no difference is rejected. And the result is declared

statistically significant at the 5 percent alpha level. I know what you’re thinking. If the p-value

is greater than alpha, the result is not

statistically significant at the 5 percent alpha level. I’m saying

statistically significant. That’s the old thinking. Now, it’s correct.

These statements are correct. They’ve been used the way

they’ve been used. So, it doesn’t make them wrong, or mean that we’ve been doing

the wrong thing all this time. But this is,

we’re trying to move away from this kind of thinking.

The problem is that sample size and minimum clinically important

difference are ignored when we talk about

just the p-value. So, the new thinking is not

to reject the null hypothesis based solely on the p-value,

not to do that. And do not dichotomize

your conclusion, statistically significant,

not statistically significant. There is an effect;

there is no effect. Don’t do that anymore. So, here’s a paper. This is

from the same special issue. And it’s a very,

very good paper. It basically says use confidence

intervals along with p-values. We’re not banning p-values.

We’re saying, along with p-values,

use confidence intervals. But even more than that —

so, let’s say we talk about the example

of systolic blood pressure and, you know, reduction

is going to the right. So, the right is a good thing.

No difference is zero. And we say, “Okay,

I found the confidence interval where the p-value is .09. And here’s the confidence

interval, covers zero. So, not statistically

significant.” And the author of this paper

is saying, you know,

just don’t do just that. Don’t rely just on the p-value.

Look at something else. Because you know,

you’re ignoring sample size. What if the sample size

was bigger? Same difference,

same difference, but the sample size is bigger. The confidence interval

is going to be narrower. The p-value is going to go down.

Now, you have a p-value of .01. All you’ve changed

is the sample size. So, now, it’s statistically

significant all of a sudden. Well, what if I told you that the minimum

clinically important difference is four millimeters of mercury

for the systolic blood pressure? So, in other words,

any reduction less than 4 really

is not that big of a deal. Well, in this case,

that bottom confidence interval is that really

an interesting result? And the author claims,

maybe not. Even though p is .01. So, that goes back

to the thoughtfulness. Think more than

just one number. Now, again,

just because of sample size. But let’s say I increase

the sample size, but now I find

a confidence interval that’s completely above

the minimum clinically important difference. The p-value is .001, but I’m not

just relying on the .001. I’m also looking at

the confidence interval and comparing it to the minimum

clinically important difference. And now, I can be confident that the reduction

in systolic blood pressure is indeed more

than four millimeters. Or at least, again, we go back

to accept uncertainty. It looks like, it seems like

we can be pretty confident, or the data is compatible

with the hypothesis that it is, the reduction is more

than four millimeters. See, it’s very easy to fall

in the trap of the old thinking. Here are the references

for this segment. So, in summary,

do not dichotomize p-values or statements

related to p-values. When reporting results

on differences, use p-values, the continuous

value, point estimates — that’s the dot in the middle of

the confidence interval — confidence intervals

with the lower and upper limits, along with the minimum

clinically important difference. And that’s where

the thoughtfulness comes in. What is the minimum

clinically important difference? It’s not a statistical question.
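This comparison against the minimum clinically important difference can be sketched in a few lines. A minimal illustration of my own, not code from the lecture; the MCID of 4 mm Hg comes from the example above, and the interval values passed in are hypothetical.

```python
def interpret(ci_lower, ci_upper, mcid):
    """Compare a confidence interval for a reduction against a minimum
    clinically important difference (MCID), instead of only checking
    whether the interval excludes zero."""
    if ci_lower > mcid:
        return "interval entirely above the MCID: likely clinically important"
    if ci_upper < mcid:
        return "interval entirely below the MCID: unlikely to be clinically important"
    return "interval overlaps the MCID: clinical importance remains uncertain"

# MCID = 4 mm Hg for systolic blood pressure, as in the lecture's example.
print(interpret(0.6, 6.8, 4))   # "significant" at .05, yet it straddles the MCID
print(interpret(4.5, 8.1, 4))   # hypothetical narrower interval above the MCID
```

The code only does the bookkeeping; the threshold that makes it meaningful, the MCID itself, has to come from clinical knowledge.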

It is a clinical question. And so, my questions to you are what is the connection

between p-value and alpha; and what is the connection

between p-value and sample size? Thank you for watching.