Innovative Clinical Trial Designs in NHLBI Related Research Areas – Meeting at ASH

All right. Thank you– ooh, I am on. Thank you all for coming. A few housekeeping items. This is being
recorded in the back, so you’re all aware of that. If you’re interested
in getting slides, they are available
on the ICTR website. So you can log into that. If you set up an
account, you will get emailed on future things
that are put onto that website. In terms of the internet
access here, you should be using
the Horton Grand, and the password is 311ISD. 311ISD, which should give
you access to the web. If you need to go
to the bathroom, those are right
out– so basically go that way, which involves
a quick jog to the left and then a quick
turn to the right, but they’re basically
right behind that wall in a little hallway. Let’s see. Attendance is enough
that I can kind of know– how many of you are clinicians? Clinicians, one,
two– statisticians? A couple of those. OK, so a mix. That’s all good. I just wanted to kind of get an idea of who I’m talking to. What we’re going
to do today, we’re going to go through
a number of topics. Some of the structure is going
to look like the webinar that’s been posted online,
but we’re going to add a lot more meat
to it as we go forward. So a quick description,
this is also part of the webinar
that was online. Kind of talking about what
a gold standard RCT is. So keep in mind, typically this
is a two-arm trial treatment versus control. It has basic elements
of randomization. And all of this is designed
and has been done for about 60, 70 years or so. When we’re talking
about innovative trials, by necessity, we’re
talking about breaking a few of those components
or doing them differently. And so what we’re
going to do today is we’re going to talk about
a number of innovations, going to talk about the
motivation for each one. As opposed to the
first webinar, we’re going to give examples and some
main ideas of each one of them. So we’re going to start to
quantify some of what they do and why. And certainly, all of the
trade-offs involved in this. So if you pick an
innovation, sometimes there is, quote, a “free lunch.” Other times you
have to make sure that you account for the
sample size might be increased, the power might be
decreased, and you have to weigh the
trade-offs involved in that. So keep in mind,
modern RCTs go back to the streptomycin
trial in 1946. There’s a publication
that came out in 1948 describing that trial. Really, kind of the last innovation was randomization. Prior to this, people either used alternating assignment or pretty much assigned things at will. And at some level, this has been rather– it’s been stuck at this point. This 1948 trial is run over and over again, and in lots of ways, we have advanced. And so we want to talk about the ways that we’ve advanced. Certainly a standard
RCT does well. So we want to make sure the
deviations are for a reason. So again, we’re
going to talk about what are the features in
this gold standard, what are the negative features,
what are the trade-offs? The main components
here, we have a lot of components that
are designed to avoid bias. These include randomization. Randomization is
kind of a catch-all. It avoids biases on essentially
anything you don’t know. We also routinely stratify. That’s intended to
directly equalize groups on biases that we
may know in advance. We also employ blinding. And often we have a
fixed sample size, so we avoid the investigator
being able to kind of look at the data until they
get the answer they want. So all of these
things avoid biases. The rest of this is
aimed at obtaining interpretable results. So we want to make sure we
often do a two-arm trial. That’s in the sense that if we see
a difference between groups, we’ll directly attribute
it to the two groups. We randomized, we
saw a difference, the only thing they differed
on was the two arms, this caused the difference. We often try to look at
a homogeneous population, again, trying to
avoid the uncertainty. We use lots of validated
standard clinical endpoints. All of this is really intended
to make things as simple as possible in the trial so
that we can get clean answers. The trouble is that they’re
very expensive answers. It also asks us to look at really narrowly-focused questions. So if I say you only have two-armed– [INAUDIBLE] that’s effectively all you can do. What’s the population? Oh, I’m going to focus on this group of people, when perhaps I really want to know, well, maybe it works in everybody, but maybe it only works in this subset. And so these are
conducted at– you have one or more interim analyses. I’ll give an example in a moment. And at each of these interims, we can stop, make a decision, we might continue, look at the data again, we might continue, so on and so forth. As I said, these have several titles: futility stopping, group sequential, sample size re-estimation. A Goldilocks design is an alternative here, which is intended to account for incomplete information at each of the interims. All right. So an example of this. Suppose that I’m testing
a dichotomous response. So everybody is a yes or no. And suppose that on my control, I anticipate a 30% response rate. And in my treatment, I’m expecting 50%. This is my hope. So going in, I’m powering the trial for a 30% versus 50% effect. A standard trial here might have about 100 patients per arm. That gets us 83% power. So a standard way of
conducting this is I would enroll 200
patients, 100 in each arm, I would do at the end of it a
standard kind of normal test. If my p-value is less than 0.025
one-sided, then I’d say, hey, the treatment is efficacious. OK, well that
requires 200 subjects. I’m going to try to
convince you in a moment that what you’ve just bought is
a horribly expensive insurance policy against bad luck, and
you typically don’t need it. So let me ask a question. I’m going to take a poll here. What if we actually observe? So this isn’t what if
the truth of the drug. This is what if I actually
get in my trial 30 out of 100 responders on control and I
get 50 out of 100 on treatment? So I powered it for that effect. Suppose it actually
happens, suppose I actually get that effect. What’s the p-value? More, equal, or less than 0.025? So I powered it to get the
significance at this level. How many people think it’s more? How many people
think it’s equal? How many people think it’s less? How many refuse to answer? Fair enough. All right, it’s
actually really small. It’s about 0.0016. And that’s for designing
a trial at an 80% power. If you design a
trial at 90% power and you get the observed effect,
the p-value is about 0.0006. So if you actually get what you expect might happen, you’re far into the realm of significance.
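To make that concrete, here is a minimal sketch of both calculations: the approximate power of the fixed 100-per-arm design and the p-value you get if you actually observe 30/100 versus 50/100. It assumes a plain one-sided two-proportion z-test (unpooled standard error for the observed p-value), which is consistent with the numbers quoted above but may not be the exact test behind the slides.

```python
import math
from scipy.stats import norm

def one_sided_p(x1, x2, n):
    """One-sided p-value that treatment (x2/n) beats control (x1/n), unpooled z-test."""
    p1, p2 = x1 / n, x2 / n
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return norm.sf((p2 - p1) / se)

def power(p1, p2, n, alpha=0.025):
    """Approximate power of the one-sided z-test with n patients per arm."""
    pbar = (p1 + p2) / 2
    se0 = math.sqrt(2 * pbar * (1 - pbar) / n)               # SE under the null
    se1 = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)   # SE under the alternative
    zcrit = norm.isf(alpha)
    return norm.sf((zcrit * se0 - (p2 - p1)) / se1)

print(power(0.30, 0.50, 100))     # ~0.83, the 83% power quoted
print(one_sided_p(30, 50, 100))   # ~0.0016, the observed p-value quoted
```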
So why did we need 200? What were we really doing this for? I had an interesting conversation with the CEO of a small biotech. We were doing a kind of
adaptive design trial, and he asked me the question,
why did you pick 200? It happened to be
the size of his trial as well, not the same as this. But in any case,
he was an engineer. For what it’s worth, I broke
five generations of engineers to become a statistician,
much to the disappointment of my father. So anyway, I have these
conversations a lot. Essentially– and
interesting, I’m putting some words
in his mouth, but I think what was essentially
the expectation here is that I picked 200
because somehow I knew when the p-value
would be significant. That somehow I was going
to get a pattern like this. So at n equal 100,
the p-value’s 0.27. And as I go to
150 and 200, 200’s kind of the magical time where
the p-value gets under 0.025, and I picked 200 because I
knew that’s what would happen. And you certainly can
get data like this. But in general, data can
come in lots of forms. I didn’t cherry-pick
this particular example, I just basically sat on my
computer and flipped coins. And this is also a pretty
standard trial that comes up. p-value’s 0.00015 at 100,
I’m already significant, and it’s tiny at 0.000001
when I get to 200. So I’m in a position– I didn’t need all
those patients, I’m incredibly significant. And certainly what I did
is I flipped these coins under the assumption of 30/50. This is what I powered for, this
is what all statistical method said you need 200 subjects. Well, in this case, I didn’t. I got it a lot
earlier than that. Here’s another one. This is one, it’s– again, I’m flipping
30 versus 50. This one never gets
to significance. We got the 0.1, 0.06, 0.19,
it’s going up and down. This is a trial that
we know that if it has 83% power, 17% of the time
you don’t get significance. This is one of those 17%. All right. So what happens in
lots of these trials? I just gave you
three and showed you what happened at 100 to the
p-value, at 150 to the p-value, and at 200 to the p-value. So I generated 50 trials. And all of these
lines, I have plotted each of these lines– this is
the p-value over time at 100, 120, 140, 160, 180, and 200. All of these are simulated
30% control, 50% in treatment. Now one thing that’s
certainly going on is all of these
lines are going up. Not uniformly, so some
of them go up and down, but the general trend is up. The other thing
that’s going on here is these are the
lines of significance. That’s the– for
0.025, 0.01, and 0.001, I’m plotting the
z-score on this axis so I didn’t put lots of
tiny p-values in my graph. So what happens here? Well at 100, you can
see a lot of these are already significant, and
they become highly significant. At 150, even more
are significant. So most of these,
you can see there’s a whole bunch of trials,
they’re significant and they’re significant
far before I got to 200. So why did I get 200? What I’ve done is I’ve
bought an insurance policy. I’ve picked 200 not because
that’s when the p-value is going to get above 0.025. What I’m waiting for
is the slow ones. So like this red
one, I’m waiting for trials like
that to finally have gotten above that
line of significance. And I’ve chosen that 200– when I say I have 83%
power, 83% of the lines are in the significance region
above 0.025, and another 17% are below. And what I’m doing when I choose
a sample size of 200 is I’m waiting for that to occur. I’m waiting long enough
for all of these laggards, and I’m not taking
advantage of the fact that most of the trials
have been significant for a long time
before I got to look.
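If you want to reproduce the flavor of that plot, here is a minimal simulation sketch: it flips coins at 30% versus 50%, computes a one-sided two-proportion z-test at total enrollments of 100, 120, up to 200, and reports how often the trial is already below 0.025 at each look. The seed and the 5,000 replicates are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)               # arbitrary seed
looks_total = [100, 120, 140, 160, 180, 200]    # total patients enrolled at each look

def trajectory(p_ctrl=0.30, p_trt=0.50):
    """One simulated trial: the one-sided p-value at each interim look."""
    ctrl = rng.random(100) < p_ctrl             # up to 100 patients per arm
    trt = rng.random(100) < p_trt
    out = []
    for n in looks_total:
        m = n // 2                              # patients per arm at this look
        p1, p2 = ctrl[:m].mean(), trt[:m].mean()
        se = np.sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / m)
        out.append(norm.sf((p2 - p1) / max(se, 1e-12)))
    return out

pvals = np.array([trajectory() for _ in range(5000)])
for n, frac in zip(looks_total, (pvals < 0.025).mean(axis=0)):
    print(f"total n={n}: {frac:.0%} of simulated trials already below 0.025")
```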
All right, so a group sequential is a way of handling this. We have to pick interim analyses, and you can pick a couple– 100, 150, 200. You can pick many– 100, 120, 140, 160, 180, 200. So I can pick lots of interim analyses. Generally, the more you pick, the more efficiency you get. There are diminishing returns. So one thing we often
do is we basically calculate, if you do two, if
you do three, if you do four, if you do eight,
here are the results. And you need to
decide how complicated it is to run an interim versus
the statistical efficiency involved. We’ve had trials
where it’s incredibly easy to run an interim and you
may as well run tons of them. Other times, oh, we’ve got
to have a huge DSMB meeting, there’s all kinds of reviews,
and that may be an incentive to run fewer. So here is the kind of
example that you might see, and this is 120, 140, up to–
sorry, 100, 120, 140 up to 200. One thing that’s going on
here, you’ll see at 100, I’m going to declare
success if the p-value is less than 0.0031. I need a highly significant
result in order to get that. Now keep in mind, there are
a few trials that hit this. That’s 0.001. So some are going to make it. At 120, it gets a little bit
lighter, at 140, a little bit. And these are going up. And finally at the end, I
need a p-value of 0.0183. The way I tend to
think about this, if you go to like a carnival
game, you’re standing in front, they’d say, OK, you’ve got to
throw a ball through a basket in order to win a prize? There’s a certain
sized hoop that you’ve got to throw it into. Ordinarily if you’re
going to get one throw, I’m going to let you have– you need a p-value of 0.025. There’s a certain sized hoop
you have to get it into. If I’m going to give
you multiple throws, so I’m going to let you throw
it– how many is in this? 100, 120– that
are six different throws that you’re going to get. I can’t give you
the same sized hoop for every single one of
them, because what happens is that increases your
chance of getting one even if you’re a bad shot. So what all these are
aimed at, the reason that all these
p-values have changed, is because you don’t get
one throw at a big hoop, you’re getting six throws at smaller hoops. It’s designed so you don’t lose much power, you don’t inflate the type I error, but it’s designed to allow you to win this trial earlier. And again, the notion going back
to this graph, what I’m doing is I’m just figuring out
when this line finally gets high enough, I’m going to
go ahead and declare success. I’m not going to wait for 83%
of the trials to get there, I’m going to stop this trial
for success whenever a trial actually gets high enough. It turns out, in this particular
trial, 28% stop at 100. And that’s, again, 30 versus 50. This is what I’m powering for. So that I’m not– this is
not for a dramatic effect. These are just ones
that got lucky early. 12% at 120, 11%, 11%, 10%. 7.9% win at 200, and then here
is 18% that lost at the end. So these are ones that went
through the entire sequence and never actually made it
through any of these hoops. If I look at this,
my power is 81.8%. Now remember, I was at 83.3%. So the first trade-off
here is I’ve lowered power. By looking earlier
and splitting all– by throwing through six hoops,
I’ve lost a little bit of power by doing this. What I’ve gained is my expected
sample size is now 148. Some of these trials stop at
100, 120, 140, they stop early. A few of them go
all the way to 200. So I’m basically saving
25% of my sample size, and it has cost me 1.5% power. So I have to decide
about that trade-off. I can actually pay that
trade-off in a different way. Suppose I say that I’m
just going to increase my maximum sample size. So now I’m going to
interims from 100 up to 220. So this is a possibility
I may need a larger trial. But what does this do? So again, I’ve
split this all out. You can see, here’s
20% of my trials, they’ve had to go to 200. No, this one. 21.5% reached 220. So I’ve made 21% of my
trials longer at 220. And in addition, a lot
of them are smaller, my expected sample size is 156. So if I have to fund a lot of
trials, if I’m doing this– how many of you have run more
than one trial in your life? How many people have run
more than five in your life? How many have run more
than 100 in your life? OK, there is a maximum– OK, just making sure. By the way, I know an
Alzheimer’s researcher who’s run like 100 trials
and never had a success, and you really ought to be
hitting the type I error rate at this point, so
that’s a little unlucky. But in any case, this is– so we’re saving sample
size on average. If I have to do this repeatedly,
I have to go up to 220 every once in a while, but I
get to go down to 140, 160 a lot more. So that’s why I get to save
sample size in doing this repeatedly. It’s also worthwhile
to consider futility. So suppose we were
140 patients in, and what we observe is 15 out
of 70, that’s 21% on control, and 19 out of 70,
that’s 27% on treatment. Where am I right now? I’ve got p-value of 0.2147. OK, that’s not bad. Certainly not 0.025. One question to ask, is
this trial worth continuing? So certainly if you are
on a standard DSMB– I’m on a lot of
DSMBs, we would look at this– in the absence of
a rule that says to stop, what would the discussion be? Well, this isn’t going
as well as we hoped, but there’s no evidence
it causes harm, the trend’s in the
right direction. We have no reason
to stop this trial, we’re going to go
ahead and continue it. So one question to
ask is, well, how likely is it that we’re actually
going to win this thing? We currently have
a p-value of 0.21. We know that we need
to get that to 0.018. And if we– you could
imagine applying this to a non-group sequential
design where you have to get the p-value to 0.025. It would look very close. But here, we’ve got
to get to 0.018, and we’ve got to do it by 220. OK. In order to get a
p-value of 0.018, that requires about a
15% observed effect. Your friendly statistician can
back-calculate that for you. All right, so we
need a 15% effect. We’ve got about a
6% effect right now, and we’re 140 patients in. So at 220, in the next 80
patients, what do we need? Well we need a 32% effect
on those next 80 patients, and right now we’re
running at six.
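That 32% is just bookkeeping on the observed differences. These are per-arm rates, so 140 total is 70 per arm and 220 total is 110 per arm; the slide numbers are rounded, so this lands just over 30%.

```python
# effect needed over the remaining patients to average ~15% at the final look
needed_final, seen_so_far = 0.15, 0.06
n_done, n_final = 70, 110                      # patients per arm, now and at the final look
print((needed_final * n_final - seen_so_far * n_done) / (n_final - n_done))  # ~0.31, roughly the 32% quoted
```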
So something’s got to give here. Either we’ve been unlucky to date and our luck is going to
have to change significantly, or this might be a trial that
we’re just not on pace to win. We often use predicted
probabilities in order to calculate this. So I can ask the question,
what is the likelihood that this is going to win? There are two pieces to
the uncertainty in that. In the next 80 patients,
there is sampling variability. If I told you– if mother
nature came out of the forest and said the true rate
is 40%, in 80 patients you’re going to get
40% plus or minus. So that’s sampling variability. The other thing is,
mother nature does not come out of the forest
and tell you it’s 40%. So you don’t really know
what the rate is either. So there’s uncertainty
in the parameter. What predicted
probabilities do– and I’m omitting the details here, but I can certainly
refer to papers, and I’ll be happy to
write blogs on this or whatever people
would like to see– but a predicted
probability incorporates both of those uncertainties. And you can calculate
the predicted probability of success is 6.9% here. It’s quite low. And I hope that actually matches the intuition: 6% on the first 140, needing 32% on what’s left, that’s a big shift. It’s not very likely to happen.
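Here is a minimal sketch of that predicted-probability calculation. It is not the trial’s exact method: I am assuming flat Beta(1, 1) priors on each arm, a plain z-test at the final look, and the 0.018 final boundary quoted above, so it lands in the same ballpark as the 6.9% rather than reproducing it exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)   # arbitrary seed

def predicted_prob_success(x_c, n_c, x_t, n_t, n_final, p_win=0.018, sims=100_000):
    """Monte Carlo predictive probability that the final one-sided z-test wins.

    Flat Beta(1, 1) priors on each arm's rate are assumed; 0.018 is the final-look
    threshold quoted in the talk; n_final is the final per-arm sample size."""
    # parameter uncertainty: posterior draws for each arm's true rate
    p_ctrl = rng.beta(1 + x_c, 1 + n_c - x_c, sims)
    p_trt = rng.beta(1 + x_t, 1 + n_t - x_t, sims)
    # sampling variability: simulate the patients still to be enrolled
    xc_fin = x_c + rng.binomial(n_final - n_c, p_ctrl)
    xt_fin = x_t + rng.binomial(n_final - n_t, p_trt)
    # final one-sided z-test on the completed data
    p1, p2 = xc_fin / n_final, xt_fin / n_final
    se = np.sqrt(p1 * (1 - p1) / n_final + p2 * (1 - p2) / n_final)
    pvals = norm.sf((p2 - p1) / np.maximum(se, 1e-12))
    return (pvals < p_win).mean()

# 15/70 on control, 19/70 on treatment, final look at 110 per arm (220 total)
print(predicted_prob_success(15, 70, 19, 70, 110))   # a few percent, in line with the ~6.9% quoted
```

The futility rule described next is just this quantity compared against a 5% cutoff at each interim.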
All right, so I’m going to add futility to the example that I just had. And at each interim I’m
going to stop for futility if the probability of eventual
success is less than 5%. Now, that 6.9% I just calculated, I would have let that go with this rule at 5%. I would have continued
on to the next interim. The value that you pick
here, this is something, again, your friendly
neighborhood statistician ought to be computing. If I do 5%, if I do 10%, if I do
15%, what are the consequences? The things that you’re
trying to manage here, you certainly want to
stop poor treatments. If things aren’t going well
because your drug is not working, you would
like to stop the trial. You also– remember that even
in good scenarios, sometimes you get unlucky. That 17% of trials
that didn’t win, I don’t want to go
out to 220 for those. If I’m going to lose,
I’d rather lose at 140 than lose at 220, it’s
just easier on everybody. The other thing that
I have to do here is I don’t want to declare
a futility too often, because every once
in a while, there’s a trial that looks bad early
that comes back and wins. And if I declare a futility
too aggressively, what happens is I’m going to lower my
power because I’m going to eliminate that possibility. So here what I’ve done, we have
probability of eventual success is at 5%. I incorporated that rule. And so now what you
can see here is these are the same p-values
up to 220, this is the probability of winning
at each of the interims, and then there’s a probability
of declaring futility. There’s a small
number of trials, they look really
bad really early. And so those 3% stop. And then just a few stop
kind of it each interim. I want to emphasize that
most of the trials that are stopping for futility,
if you would let them go, they would have lost. Again, we’re trying to
find that 17% that failed and get rid of them earlier. I have made a few mistakes. So remember, the power was 85%. Futility has cost me 1.7%
power by putting it in. 1.7% of trials I stopped
for futility that would have gone on to success. However, that 83.3, I’m
now exactly the same as the trial you started
with, n equal 200 that had no
adaptation whatsoever. I’ve increased the
sample size, I’ve put on the group sequential,
I’ve put on futility, my overall power from all
of those things combined is the same. Only 10% of my trials
are reaching to 220. So this little
extra that I spent, I’m not doing it very often, and
almost 80% I’m actually saving. So certainly I’m trying to get
rid of this insurance policy that I bought. And the expected
sample size is 150. So I gained a little
bit in this scenario. This is when the drug works. Futility is not designed
for when the drug works, it’s designed for when
the drug doesn’t work. So what I illustrated
here was that I didn’t cost too much power. This is what happens
under the null. I actually can stop almost
60% of trials at 100. So if the drug does nothing,
30% versus 30%, I’m out early. I can get rid of 80% of the
trials at 140 or before. So I’ve still got 83%
power in the alternative, I’m stopping on the null,
the expected sample size here is 123. They’re all out really early. Now if I were having to
fund a lot of trials, what does this do
to my population? I took a quick survey on
NHLBI areas and kind of what the success rate is in trials,
imagine that you had to fund a bunch of trials and 30% of
the drugs work and 70% don’t. Remember, trials, it’s hard
to develop drugs and have everything work here. If you were in that
boat, what would happen– we’ve obtained the same power. The expected sample size here,
in the 30% of trials that work, we averaged 150 subjects. In the 70% that didn’t,
we averaged 123 subjects. Overall for this population, I
averaged 131 patients a trial. Some of them are trials
that work, some of them are trials that
fail, the futility works better than the
success stopping, actually, and you can fund 52% more
trials than you could otherwise by doing this kind of thing.
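As a quick arithmetic check on that portfolio claim, using the expected sample sizes quoted above:

```python
# 30% of drugs work (expected n ~ 150), 70% don't (expected n ~ 123), per the talk's numbers
expected_n = 0.30 * 150 + 0.70 * 123
print(expected_n)            # ~131 patients per trial on average
print(200 / expected_n - 1)  # ~0.52, i.e. about 52% more trials for the same patient budget
```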
So again, with flexible sample sizes, the expected sample size is significantly reduced, with numbers on that in the 25% to 40% range. I haven’t talked
about this, but I said it was so uncertain
when significance occurs, if you actually didn’t
know it was 30 versus 50, maybe it’s 30 versus 70,
maybe it’s 30 versus 50, maybe it’s 30 versus 40, that
increases the uncertainty all the more. You want to have more
interims to account for that. And finally, you do
have to pay for this. You have to account for
the multiple analyses; those p-values can’t all be 0.025. You may have to increase
the maximal sample size. And also, keep in mind, if
you have safety requirements, stopping early might
invalidate those. If I need 200 subjects to get
an adequate safety database, I can’t do this. If I need to make sure I have
adequate numbers of people in different ethnic
groups, for example, this may be difficult to do,
you have to make sure that those kind of
operational concerns are also handled in
this, so all of those are part of this
group sequential. All right. I’m going to switch gears. My colleague will. You wanna ask questions? Questions? Yes, sir? Oh, can we give you the
mic for the recording? Adam [INAUDIBLE]
from [INAUDIBLE]. So what we find is,
DSMBs, as you noted, do they listen to the
futility and efficacy rules that you put in? How binding are these, and
what type of information do you put in the
protocol to make sure that the interim
analysis and futility have a lot more context around it
so when they get to a meeting and he’s able–
should we stop or not, there’s more context behind
what went into the boundaries? So the question was
essentially, what goes into a DSMB, what goes
into the protocol, what goes into the charter, how
is this actually implemented? I’m thinking three
things, I’m going to start going through them,
I’m going to end up with two. There are three kinds of people,
those that can count and those that can’t. So anyway, one thing,
certainly in the protocol, it needs to be very explicit. Futility will be declared
if the predicted probability is less than 5% and
here is the calculation, either in the
protocol or in the SAP referenced from the protocol. Usually at the kickoff meeting
to the DSMB if not prior, you also would have a
discussion with the DSMB. If you’re the sponsor, you want
to make your expectations clear here. No, this is the
rule for futility, and we want this to be
followed, so that everybody is clear on that. The other thing that you
would show is example trials. Here is the kind of data. If you’re at 140 and this
is the data that you see, these are our
criteria for stopping. And so that’s a reason– effectively you’d end
up saying things, look, if we have less than a
6% effect at n equal 140, we want you to stop, and
that’s what this rule says. And at that point, if the
DSMB, if there’s uncertainty, you can have a long
conversation about it and nobody’s blinded yet
because there isn’t any data. So you want to be upfront about
what the expectations are. I have heard horror stories of– and academic settings
are a little different. Private sponsors are really
picky about making sure that futility
rules are followed. I know of cases where the DSMB
went rogue and essentially said, we kind of
want to see what’s going on in this one subgroup. So they really ran the
trial twice as long after the futility rule had been hit. They have not been hired as DSMB members very often anymore. But anyway, you
want to make sure that those expectations–
really, it’s about, you don’t want this
to be embedded. This is part of the discussion
and laying it all out in advance, and
making sure everybody has bought-in before
anybody has to be separated. All right. All right. So the next topic we’re
going to talk about is single-arm trials. This kind of falls
under the same banner as how do we minimize
our resources when we’re running a clinical trial. So one way, as
Kert described, is to have a flexible sample size. Another thing we sometimes
do is run a trial where all of our
participants receive the investigational therapy. So these are what we
call single-arm trials. There’s no placebo arm or
no control arm in the trial. So when we run a single-arm
trial, what that really means is we’re assuming that
we have some information external to the trial that we’re
going to bring into the trial. So we’ve essentially
made some assumption about what patients
might have looked like if we had a control arm. So we’ve made an assumption
about what our average rate is, about what our response rate is. Assuming that that’s
what the behavior is for patients that would have
been assigned to a control arm. So this might come from
historical data, for example. Maybe we’ve run another
trial in this population. It’s common to do
this in rare diseases. One of the big
reasons that we might consider a single-arm trial is
that patients are hard to find. So if we have a hard
time recruiting patients, finding patients, a single-arm
trial might be beneficial here. So we’ll talk in
detail a little bit about some of these
gains and losses. One of the main reasons
that we would consider a single-arm trial
is the sample size is typically going to
be much, much smaller than what you would
need if you were running a randomized trial,
because you’re only going to have one arm. So instead of having to
randomize subjects between two arms, you just have one
arm and you need about half as many subjects. It’s also– one
of the big reasons is it’s a lot easier to enroll
patients in a single-arm trial potentially. So if patients are unwilling
to be accrued to a trial where they know they have a
chance of getting a placebo arm, and maybe they would
prefer to enroll to a trial where they know they’re going
to get an experimental therapy, then it may be hard to
enroll to a randomized trial because patients just don’t
want to commit to that. So those things make the
single-arm trial beneficial, but some of the losses– and we’ll go through
in detail and describe these– one is that you need
that historical estimate that you’re going to compare to. So you need to have some
assumption about what your control arm rate is. And as we’ll see,
that if we’re not accurate about that, if we
are off in that estimate, that we can see some
severe losses in terms of our power and our
type I error rate. And one of the other big
issues with single-arm trials is, we’ve violated one
of those kind of pillars that Kert described at the
beginning of the session today. We’ve lost blinding. And because of that,
we can see some biases both from the
patients– they know they’re receiving the
experimental treatment; and also from their assessors,
their clinicians who also know that these are the patients that
received experimental therapy. So we lose that randomization,
we lose the blinding, and because of that,
we start to see biases. So one of the innovative
trials that we are going to talk about
now is, how can we use historical information
that we have but do it in a way that we’re not running a
single-arm trial and all of the caveats that
come along with that? And what we’re going to do is
borrow historical information in a statistical sense. Essentially what
we’re going to do here is augment a control arm
in a randomized trial with historical data. And what that allows
us to do is now instead of doing just a
one-to-one randomization– so half of our patients
get control, half of them get the treatment, because we’re
going to now start borrowing information from
the historical data, we can adjust that
randomization ratio. So instead of a
one-to-one, we can maybe do a two-to-one, a three-to-one,
or even more extreme so that more of our
patients are receiving the experimental
therapy, fewer of them are receiving the control arm. But because we’re using
that historical data, that helps to augment our
data on the control arm. So the benefits that we get
from this is we retain blinding. So we now still have that– we removed some of those
biases that we might see in a single-arm trial. The other thing is that the– since we are still
collecting data in our trial about
control subjects, that acts as a verification. So if we know what the
historical data was and we get data in the trial,
we can see how closely those two things align. So in a single-arm trial, we’re
just assuming that we know it. In a randomized trial, we get
to compare and see how similar or how different was the control
data in our current trial to the historical data. So historical
borrowing– and we’ll talk through a little bit of
the details of how this happens. The gains here is
that because we are using that
historical information, we require fewer subjects
in our randomized trial. So the sample size
benefits, compared to a standard one-to-one
randomized trial, are significantly smaller. So that lowers our risks. And we talked earlier about how
single arm trials may be easier to enroll if patients are averse
to receiving a placebo arm, they may not want to enroll
to a randomized trial. So here, if we’re able to
run a randomized trial that does borrow from
the historical data, it can potentially be
easier to accrue because we need fewer control subjects. Most of the subjects
that enroll to the trial are going to be assigned
to the treatment arm. So that helps to mitigate
that risk of our accrual being small being
slower than expected. As before, if we were
doing a single-arm trial, we would need a
historical estimate. If we’re going to borrow
from historical data, well, we need the
historical data. So we’re relying on
having some information, either from a previous
trial, maybe a registry that we can use. And the other thing is– so what this design
is kind of built on is how closely the
current control data is to our historical data. And we’re kind of banking on
that being true that these two things are similar. If it turns out that our
current control data differs from what we’ve
seen in the past, we have a limited
backup plan for what to do if that doesn’t match. So we’ll talk through a
concrete version here. Suppose that we’re going
to run a single-arm trial and we’re looking at a
dichotomous outcome– this is maybe a responder rate kind of an outcome. And what we would like to show is that our responder rate is greater than some estimate p0. So p0 is essentially our historical rate that we’re going to compare to. So where do we get that? So one option is that we have
an expert or a panel of experts who agree that 30% is the
right number just based on their expertise. It could be that
we’ve seen some data from a small natural
history study. So from patients
who weren’t treated or who were given kind
of a standard treatment. So maybe in this small
natural history study, we’ve seen three patients that
responded out of 10 patients, so we have about a
30% rate here too. Other ways that we
can get information about this control
rate, maybe we look at retrospective
chart reviews. Maybe we have some data
from a large clinical trial. So for example,
suppose there was a trial that had 200 patients
and 60 of them were successes. Again, that comes out
to about a 30% rate. But of course, if we
were able to enroll a large clinical trial,
there’s no reason why we should have to
run a small trial now. So now suppose that we’re
going to test the hypothesis that our response rate
is bigger than 30%, and our rule here was
we’re going to run a trial with 20 subjects. And if we see 10 or more
of our subjects that respond to the
treatment, we’re going to declare that trial a success. So if we apply this rule, it
turns out that our type I error rate– so if our
rate is truly at 30%, our type I error
rate is about 5%– 4.8%. And if we assume that are
our treatment has a 60% rate and we’re comparing to
that 30% on the control– so have a 30% improvement, this
rule of 10 out of 20 successes, that gives us the power of 87.2. All right, so we’re done, right? We have over 80% power, we
have 5% type I error rate. But the problem
here is that we’ve made an assumption that 30% is
the correct historical rate. And there’s not always a lot
of certainty around that. If we look at the
literature, we’ll see a lot of times
in clinical trials that are run very
similarly, there’s variability in
what that rate is. So in these characteristics
that I just showed you, a 5% type I error
rate and an 87% power, this is all predicated on
that 30% being the right rate. So we’ve chosen that, then
we’ve assumed that it was true. But what happens if that 30%
is actually not the right rate? So we said on the previous
slide that our type I error rate was about 5%– 4.8%. So what we’ve done when we
calculated that type I error rate is we assumed that our true
response rate on the treatment and our true response rate
on the control arm is 30%. And it turns out that
if we run a trial and we count 10 out
of 20 successes, then that gives us a type
I error rate of 4.8%. But this definition never asks
whether 30% is the right rate. So suppose instead that both the
treatment and the control arm are both a 40% instead of a 30%. So here again, there’s no
difference between the control and treatment, it’s
a null scenario. But now instead of both being
at 30%, they’re both at 40%. What happens? Do we think that the type I error rate is still 5%? So it actually turns out
that if the true rate is 40%, the probability
of a successful trial under that rule that we’ve
created is no longer 4.8%, it’s actually gone up to 24.5%. So this is our type I error
rate if in fact our treatment and control are both 40%
instead of 30% that we assumed. And it gets worse from there. So if we’re even more wrong– so if the truth is actually
a 50% rate on both control and treatment, our type I error
rate goes up to almost 59%. So we’re nearing 60%.
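Those numbers are just binomial tail probabilities for the 10-or-more-out-of-20 rule, so they are easy to check for yourself:

```python
from scipy.stats import binom

# single-arm rule from the example: declare success if 10 or more of 20 respond
for p_true in (0.30, 0.40, 0.50, 0.60):
    print(p_true, binom.sf(9, 20, p_true))   # P(X >= 10 | n=20, p_true)
# 0.30 -> ~0.048 (the 4.8%), 0.40 -> ~0.245, 0.50 -> ~0.588, 0.60 -> ~0.872 (the 87.2% power)
```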
So the graph here– what this is showing is, on the x-axis, the true control rate. So remember, we assumed that we had a rate of 30%. So that’s this horizontal
green line right here. And if in fact that is true,
our type I error rate is 4.8%. So that’s where this line
crosses that green threshold. So the green line
here is at 4.8%. So what happens is
as our truth starts to differ from
what we’ve assumed, as it starts to
get bigger, what we see is the type I error
rate starts to go up. And the further away
from 30% that we get, the more we increase
our type I error rate. What happens to power? So unfortunately, our
troubles aren’t limited just to type I error rate. We also have bad things
happen to our power. So remember, what
we were looking for was a 30% difference. So 30% on the control arm versus
a 60% on the treatment arm. And we had seen that
we had about 80% power if the truth is that
our control rate is 30%. So what happens if our control
rate is actually smaller? So what if the control
rate is actually 20% and the treatment is at 50%? So we still have a 30%
difference from 20 to 50. But what happens is now our
control rate has drifted, and what happens, as we’ll
see on the next slide, is that our power
starts to go down. So now on the y-axis,
instead of type I error rate, we’re looking at power. Again, if we looked at our
30% truth on the control rate, what we’re seeing is 87% power. But as our true control
rate starts to go down and we still assume
a 30% improvement, our power is now less
than what we had assumed. So we have bad things
happen on both sides. We start to get increases
in our type I error, we start to get
decreases in our power. And this is all because
we’ve made an assumption that 30% was the right rate. And when we were wrong about
that, bad things happen. So why did we run a
single-armed trial in place of– yeah. OK. Sorry. Are there questions, comments? So this idea of our
type I error rate and our power changing as a
function of the control rate, why didn’t we talk about this
when we run a randomized trial? Why am I talking about this
only when we’re talking about single-arm trials? Well we can see,
what I’ve done now is added a line to our plot
that shows what happens if we did a randomized trial. So now we still have a
20-patient trial, but instead of all 20 of those going
to a treatment arm, I’ve now randomized
10 of those patients to treatment, 10
of them to control. So there’s a line– it
may be difficult to see, it’s an orange line right here– that shows the type I error
rate across the same range of control rates. And as you can see,
that orange line is always below our 5% rate. So in a randomized trial,
the type I error rate– in the tails, as we start to get really far away, we see that it actually goes down. But across the range of control rates, our type I error rate is controlled. So we’ve limited the chance of a false positive here, unlike that red line that was our single-arm trial that we saw just a minute ago.
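Here is one way to see that behavior numerically. This is a hedged sketch, not the analysis behind the slide: I am using a one-sided Fisher exact test at 0.025 for the 10-versus-10 randomized comparison, which keeps the false-positive rate at or below the nominal level for any true control rate, next to the single-arm rule from before.

```python
from scipy.stats import binom, fisher_exact

def single_arm_t1e(p_true, n=20, cutoff=10):
    """Chance the 10-of-20 single-arm rule declares success when the drug does nothing."""
    return binom.sf(cutoff - 1, n, p_true)

def randomized_t1e(p_true, n_arm=10, alpha=0.025):
    """Exact type I error of a one-sided Fisher test, 10 vs 10, both arms at p_true."""
    total = 0.0
    for x_t in range(n_arm + 1):
        for x_c in range(n_arm + 1):
            _, pval = fisher_exact([[x_t, n_arm - x_t], [x_c, n_arm - x_c]],
                                   alternative="greater")
            if pval < alpha:
                total += binom.pmf(x_t, n_arm, p_true) * binom.pmf(x_c, n_arm, p_true)
    return total

for p in (0.30, 0.40, 0.50, 0.60):   # true rate shared by both arms (null scenarios)
    print(p, single_arm_t1e(p), randomized_t1e(p))
```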
What happens for power– again, we’ve added this orange line, this is now our
randomized trial– is again fairly
steady, fairly flat over the range of control rates. It does start to drop. But the other thing
that we see here is the single arm
trials, over a wide range here, the single-arm
trial has higher power, and that’s really a
function of the sample size. So the randomized trial,
we have fewer subjects on our treatment arm. So the main point behind
this is that history– in this case, we’ve
assumed something about our control rate– leads us to the best
guess of our parameter, but we could be wrong. Sometimes history
leads us astray. What we see is that our best
properties for our trial occurs when the truth is
close to our best guess. So if we have the sweet
spot where if our truth is– if our assumption is
close to the truth, we get good behavior,
but when our truth starts to differ from
what we’ve assumed, we start to get biases. So if our truth is
actually a lot different, we can either see inflated
type I error rate, or we can see some
big losses in power. All right. So now what we’re going to
transition to talking about is how we can use
historical borrowing– so use an innovative
statistical method to run a randomized
trial so that we get those benefits
of randomization, but try to control not
needing a large study. So now I’m going to
consider an example, we’re running a phase II trial. And our sample size, the
most that we can afford is about 210 subjects. I’m going to
allocate my subjects in a two-to-one randomization. So out of those 210
subjects, 140 of them are going to go to the
treatment arm and 70 of them are going to go
to a control arm. And then I’m going to
use historical data to kind of argument that data
that I get on the control arm here. So suppose that we have
this historical data. So maybe this was a
previous study that was run on the control therapy. And out of those 120 subjects
in that previous study, 72 of them were responders. So that’s a 60% rate from
that historical study, and that’s what we saw
in the control arm. So I’m going to show
you two examples here. I’m going to show you a
trial that kind of ignores that historical
data, and then I’m going to show you a trial
that uses that historical data and just pools it in with the
data from our current trial. So we take those 72 responders that we got in that previous trial, and we add them to whatever we see in our new trial. So we go back to
this graph again. So the green line here, this
is a 5% type I error rate. What we see, this is our trial
that pools the data together. So we take those 72
responders that we got on the previous
trial, we add that data to what we get in the
new trial and what we again see is inflation of our
type I error rate. Remember that that historical
data had a rate of 60%. So if the truth is
60% and we get data around that same number
in our current trial, our type I error
rate is actually a little bit lower than 5%. But as the truth
starts to differ from what we saw in
that historical data, we again start to see this
inflation of our type I error. So what happens
to the power? So here, again, there’s a
reference line here at 60%. The purple line is what happens
if we pool the data together– so we take those 72 patients
from our previous trial we add them in with
our new patients. What we see is that
power starts to increase when we do the
borrowing as compared to a randomized trial that
doesn’t do the borrowing. So the orange line
here is if we just do that two-to-one
randomization but we ignore those 72 patients
that were responders on the previous trial. So if in fact our true
control rate is 60%– so it’s on point with what we
saw in that previous study, we see these big gains between
pooling that data together and ignoring it. If our true rate starts
to differ from that 60%, though, we see the
power for the pooling to start to go down
quite substantially, where it starts to plateau
if we ignore that data. So what we’ve done
in this example was we’ve just decided
to completely 100% use the data that we saw
in the previous trial. We can also think about rules
that allow us to use that data, but in some sense downplay
it so that we’re not using the data to the full extent. These are what we call power priors. These are static weights that basically say, hey, I’m going to take
that historical data and I’m going to weight
it by some amount. So a weight of zero here would
correspond to no borrowing, so I ignore that
historical data. A weight of one
corresponds to pooling, so that’s what we saw
in our previous example. But now we can allow
weights that are in between. So for example, you could
assign a weight of 50%. So what that means is
each of those subjects in the historical data set
essentially counts as half of a subject in the new trial. So you’re kind of
accounting for the fact that that historical
data is potentially different in some way than
in your new population. So this is an
attempt to recognize the potential for drift
in your population, and account for the fact that
subjects in the new trial might be a little bit
different from subjects in the past trial, and so
we’re going to weight them a little bit differently. So for our statisticians
in the room, the way that we handle
this in Bayesian methods is what’s called a power prior. So what we get is if we were
pooling the data together, what happens is that
historical data essentially just gets treated like it
comes from the current trial. So we take our prior times
our historical likelihood times our current likelihood. What happens when we do a weight
that’s in between zero and one is essentially we’re
taking that historical data and raising it to a power. So what happens is if the
weight is equal to one, then this becomes
equal to pooling. If weight is equal to zero,
then this historical likelihood goes away. And anytime we have
a weight that’s different from zero or one,
that essentially just adds some kind of a weight
to the historical data. So how do we choose
So how do we choose a weight for W? So common choices here
might be half or a third. So again, this is an
attempt to recognize that patients in
the current trial are somehow different
than patients in the past. So we’re going to
weight them maybe a half or maybe a third of the
patients in our current trial. As we start to get
values near zero, that essentially becomes like
ignoring the historical data. It’s possible that we could
use values bigger than one. So what that does is
essentially add more weight to that historical data. So if we have a really large
weight, what that essentially does is act like a
single-arm trial– so it makes the current
data that you’ve collected not as
important, it gets overwhelmed by that
historical data. So essentially, assuming
weights bigger than one, assuming large values
of W, it is essentially equivalent to
assuming that you have a massive data set from the
past and essentially know the answer. All right. So now what I’ve
done on this graph is show different values
of W– so different ways that we could weight
the historical data. So let’s see. So what we saw in the
past is our purple line– this is right here–
that was if we just assume that the historical
data counts just like our current data. So a weight of one
is that purple line. So this is our type I error
graph, and as we saw before, we have this
inflation as we start to differ from our 60% rate. But now what we see is that this
inflation of our type I error rate is dependent on how much
we weight the historical data. The more weight that we
give to the historical data, the more type I
error inflation we see as our rate starts
to differ from the truth. But as we put less weight
on the historical data, we see less of that
type I error inflation. So this line right
here, for example, is weighting the historical
data at just 10%. And we see that we still
do have some type I error inflation as our rate starts
to drift away from 60%, but it’s much lower
than if we just treat our historical patients
the same as our current data. So what happens to our power? So again, I’ll point out that
the line we looked at before– it’s a little hard to
see, but here’s our rate if the weight is equal to one. So again, we see this kind
of continuum of these lines: the less we weight our historical data, the more we start to approach the randomized trial; and the more we weight
our historical data, the more we start to see
these drops in our power. But again, most of the
time, as long as we’re close to our historical rate– sorry, as long as the truth is
close to the historical rate, we see this spot where
our power is always increased by the borrowing
relative to doing a trial that ignores that historical data. So how do we choose
a weight for W? How do we know
what’s the right way to weight the historical data? What we would really like
to do is have some mechanism that if the true parameter is
close to the historical data, we borrow a lot and
we put a lot of weight on that historical data. But if our true
parameter is different from the historical
data, then we would like to
downgrade that data and not count it quite so much. The problem, of
course, is that we don’t know what the true
parameter is, so how do we decide? Well, as in most
cases, the current data is a good way for us to judge. So what we’re going
to do is let the data help us decide
how much we should weight the historical data. So if we’re running our trial,
and then the current study, the data comes in around
85%, and remember, our historical
rate was 60%, well that tells us that something
has likely changed. The parameter is
likely different from that historical
rate and maybe that’s a situation
where we don’t want to weight our historical
data quite so high. So any method that
we can consider that assesses the agreement
between the historical data and our current data is what
we call dynamic borrowing. So what happens is we
assess that difference between the current
and historical data, and we decide on a weight
that’s based on that agreement. So as in most things,
no method is foolproof. So it’s possible that the
current data leads us astray. So what could happen is
that there’s high drift– so really the current– I’m sorry, really, the
historical data differs from the truth, but just randomly our
current data happens to agree with the historical data and
we decide to borrow a lot, even when we shouldn’t. So that’s possible. It’s also possible that
there’s really in truth no drift– the
historical data is very much on point with the truth. But just for some random reason,
the current data randomly deviates, and so we
decide to not borrow so much when we should. Question? Yes, if you don’t mind
me asking a question now. Mm-hmm. So when are you making the
decisions about the weighting? Is this being done in an
iterative manner or not? And then the second question
is, what about a contemporaneous cohort? Would that be similar to
historical bothering– bothering. Sometimes it is bothering. Because it sounds like quite
a bit of work, actually, and I don’t know about that. Right. So the question is, when do
we decide how much to borrow? So what we’re going to
do is actually create a statistical model that kind
of does this automatically. So when we run the analysis,
the statistical model does this assessment
to see the agreement between the historical
data and the current data and automatically adjusts
the degree of weighting. Does that answer your question? On multiple occasions? You can do this borrowing
just once during the trial. You can do interims where
you assess it multiple times. Each time you do the analysis,
it would reassess that. Was there another
question in the back? [INAUDIBLE] Oh OK, thank you. OK. [INAUDIBLE] So I think
the follow-on question to that would be, though, you’re
designing the trial upfront with an n and trying to figure
out how many subjects you need, and at that point, you don’t
take the historical borrowing into account, right? When you’re preliminarily
designing the trial or do you? Because the question is,
if it doesn’t turn out as your interim analysis
or continuous analysis actually shows
you, then what does that do to the
initial trial design and how do you sort
of make up for that? I think that would be I think
the follow-on question to that. Yeah. So I think if you’re considering
a design where you’re going to potentially borrow from
this historical information, that should be part of the
pre-specification that you do. So you should be
considering this upfront when you design the trial and
write down the analysis plan. And what else was
I going to say? Yeah, Kert. Hi. I think I got one. So another aspect to this,
this is somewhat getting back to this insurance policy idea. What you’re gaining here
is essentially upfront, you’re making an assumption. That by using this
historical data– I’ve seen multiple
clinical trials, they all come at a 30% rate, I
expect to see another 30% rate. I’m going to save
30% of my sample size by doing it this way. What you would
want in your design is it to be prospective that
if I can go from 1,000 to 700, I use 700. But if in fact they don’t
agree, it does make sense to have this possible extension
prospectively into your design, where at the beginning,
the protocol says, I’m going to go to 700,
this is the analysis, if certain conditions
aren’t met, then the trial will
automatically go to 1,000. You could also–
we’ve designed trials where we’ve
randomized two-to-one, making use of the fact that
we’re borrowing these controls. And if in fact they
don’t agree, not only does the trial
potentially get larger, but it also switches to
one-to-one randomization because it’s not borrowing
those controls anymore as much. So all of this is, we’re
making an assumption, we’re trying to make
use of it, but you have this prospective
backup plan which makes the trial larger
if the assumption’s not met. Another question. So I’m going to ask this at
the risk of having been late, so I apologize if you’ve
already gone through this. But this makes tremendous
sense in terms of– sense to me in terms
of trial efficiency. What I worry about is if the
natural history of the disease or the care is
changing, then you are going to be slow
to recognize that by continuously
borrowing from prior data to inform your existing trial. Is there a way of handling that? Is there a philosophy
around that concern? Why don’t you show the
dynamic borrowing graphs? OK. So I think I think some
of this will be addressed in what we’re about to show. So the question is
really about the drift and if there is a drift. And potentially if it’s
slow to materialize, are you going to be
able to recognize that during the
course of the trial, I think that’s part
of the question. All right. So I’m going to go through
some of the details of how we actually implement
this borrowing. So the methodology
here is called Bayesian hierarchical models. What these models do
is they explicitly have a parameter that assesses
the variability from study to study. So in our case, if we
have our current study and we have the historical
study, what the model is doing is assessing the variability
and the response rate from the historical data
to our current control. And this parameter that actually
measures that variability has a direct relationship to
how much weight we assign. So if the variability is small– so if our current data looks a
lot like the historical data, we have small variability. The model’s going
to borrow a lot and we’re going to make heavy
weight on that historical data. But if that parameter
for the study variability, if it says that
these two studies differ very widely, then that means
we’re going to have a smaller weight on the historical data. So in the model, as we
estimate this parameter, and we do that through
Bayesian methods, we get a posterior distribution
that determines the weight. So as we saw, the more borrowing
happens when you agree, less borrowing happens
when you disagree. So we do have some
statisticians in the group, the mechanism for this. So in general, we’re going to
let PC be the historical rate– sorry, the control rate
in our current trial. So C here stands for current. And then we have
potentially more than one historical studies. So in the example we
just had one data set, but we could
potentially have more, maybe these are
several small data sets that we’ve
looked at that all have the same kind
of control treatment. So P1 through P big
H, these are the rates from our historical studies. What we do is we
model the responders. So Y is the number of responders
on our current data accrued, and we assume that’s binomial. Historical data also has
a binomial distribution. And then what we do is
we assume that this set of historical control
rates has a distribution. Here we do the modeling
on the logit scale. And then what we do is
we assume that this group of historical rates has
a normal distribution with some mean and some
standard deviation tau. And this parameter tau is
what we just talked about it. That’s the parameter
that measures the study-to-study
variability, and that’s what’s going to control the
degree of our borrowing.
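To pin down the structure being described, here is one standard way to write this kind of hierarchical model down; the priors shown in the last line are illustrative assumptions, not necessarily the ones used in any trial discussed here.

    Y_C ~ Binomial(n_C, p_C),    Y_h ~ Binomial(n_h, p_h),   h = 1, ..., H
    logit(p_C), logit(p_1), ..., logit(p_H)  ~  Normal(mu, tau^2)    (independently)
    mu ~ Normal(0, s^2),   tau ~ Half-Normal(sigma_0)                (illustrative priors)

A small posterior for tau pulls logit(p_C) toward the historical rates, which is heavy borrowing; a large tau lets the current control stand on its own, which is little borrowing.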
So tau, this is the most important parameter for the borrowing. If we just assumed that
we knew tau in advance and we don’t estimate
it from the data but we just assume
that we know it, that basically
corresponds to giving a specific static weight. But in Bayesian methods,
what we’re going to do is actually estimate tau as
we start to collect our data. And then what happens is
as we start to estimate tau and we see that there’s a high
study-to-study variability, we start to downweight
that historical data, but if we see that there’s
agreement between our two studies, then we start to
increase our borrowing. So what that does is it creates
what we call dynamic borrowing. So the degree of borrowing is
not pre-specified in advance. Rather, the model is
pre-specified in advance, and then as we
accumulate the data, we estimate that
variability, and the model determines how much we borrow
and it assesses that drift. So I think we've got that.
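As a concrete illustration of the dynamic part, here is a minimal sketch of fitting such a model with the PyMC library; the counts, prior scales, and sampler settings are all made up for illustration and are not from any trial discussed here.

    # Minimal sketch (illustrative data and priors) of dynamic borrowing with a
    # Bayesian hierarchical model, using the PyMC library.
    import numpy as np
    import pymc as pm

    y = np.array([18, 150])   # responders: [current control, historical control]
    n = np.array([60, 500])   # sample sizes for the two data sources

    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)       # common mean on the logit scale
        tau = pm.HalfNormal("tau", sigma=1.0)          # study-to-study SD: governs borrowing
        theta = pm.Normal("theta", mu=mu, sigma=tau, shape=2)
        pm.Binomial("obs", n=n, p=pm.math.invlogit(theta), observed=y)
        trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)

    # Posterior for the current control rate. If it is much tighter than 18/60
    # alone would give, the model is effectively borrowing from the 150/500.
    p_current = 1.0 / (1.0 + np.exp(-trace.posterior["theta"].values[..., 0]))
    print(p_current.mean(), np.quantile(p_current, [0.025, 0.975]))

When the two sources agree, the posterior for tau concentrates near zero and the historical data carry real weight; when they conflict, tau inflates and the weight falls away on its own, which is the dynamic borrowing just described.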
So what are the properties of this model? So we previously
looked at graphs where we had a static weight,
either a weight of one where we fully pooled the data
together, or a weight of zero where we completely ignored it. Now when I'm showing this
bright lime green line, this is what happens if we
do a hierarchical model. So we now allow
the data to dictate how much of the borrowing we do. So on the left this is
our type I error graph. What we see is as
the truth starts to differ from that
historical data, we do start to see some
type I error inflation. But importantly, that type I
error inflation is bounded, and it eventually
starts to go down. So what happens is the
model starts to recognize– as you get way out here so far
away from your true parameter, it starts to recognize,
hey, these two things don’t look the same,
it starts to downweight that historical data and
your type I error rate starts to come down. And similarly, in the power,
we also see a mild power loss compared to full pooling. So here's the
historical borrowing. And you can see that
compared to if we just pool, we do lose a little
bit of power, but if we compare
to a design that ignores that historical
data– that’s the orange, we do see improvements
in our power. In terms of power loss–
so if our true parameter differs from the data
that we’re borrowing from, again, if we just pooled
that data, what we see are these huge losses in power. If we ignore the data, that curve kind of plateaus, and with the historical borrowing, again, it starts to plateau. So as your current data
starts to differ too much from the historical
data, the model recognizes that, starts to
downweight the historical data more so you don’t see these
dramatic drops in the power as we did before. So I’ll quickly go through
an example of this. So we had a trial
that was originally designed as a
non-inferiority trial with a fixed sample
size of 750 subjects, one-to-one randomization. So 375 subjects per arm. There were some historical
data available. There were actually two
historical studies here. If we use the
historical data and we use the hierarchical
model, what we found is that we could do a trial that
had 600 subjects rather than 750– so about 20% fewer subjects. Here we changed the
randomization ratio to two to one– sorry– yeah. Instead of one-to-one, we
changed it to two-to-one. So put 400 patients on
treatment, 200 on control. And for the expected drift–
so within the range where we thought there might
be some difference from the historical rate,
the type I error rate was controlled, and we
actually had comparable power to our original design. So we were able to
use 20% fewer subjects for about the same power and
about the same type I error rate. Let me pause again for
questions and see– I wanted to make sure I
addressed your question. Questions? So I think– It may not be appropriate
for this discussion, it’s more of a philosophical
one that if the natural history of the disease– that is, the focus of a trial
is actually changing, then the assumption around
how close this trial is to very good, very well-run
multiple studies with event rates before, one shouldn’t
assume that this trial should be weighted less
in its new control rate than the prior ones. And if we all do that
with all of our trials, we’re going to miss the idea
that the biology is changing or that the treatment effect is
not what it was 10 years ago. I think you raise the key point here, which is: what is your expectation? There is a sweet spot
where there is a benefit, and outside of that sweet
spot there is a loss. And really, it’s an
expectation of what is our belief on how often we’re
going to hit that sweet spot and how often we’re going to
have excessive drift on there. And in a lot of cases,
we expect the sweet spot to be likely to be hit, and
that doesn’t mean every time but enough of the time that this
is worthwhile in repeated use. I think kind of the other
philosophical point, if I think the natural
history of the disease is changing quickly, I
also have the worry– if yesterday’s trial is
not relevant to today, then today’s trial may not
be relevant to tomorrow. And so I have that other issue
of going forward as well. So I think that becomes
just a really hard problem to deal with, and this isn’t
designed to solve that. I think– OK. So I think this is a more naive
question than Neil’s, but it comes down to the
same issue, because I think the issue that
maybe a lot of us would be struggling
with is when do you even consider doing this? Because it sure
sounds like you could save a lot of
time, and certainly from an NIH perspective,
a lot of money if someone was able to use
historical borrowing. But when do people
consider this one? Neil raised the question of
the changing natural history of the disease. Another question I would
have is, do you even consider it, for instance,
if the historical data is retrospective
historical cohorts. Do you even consider– versus, let’s say, a
prospective cohort. Do you consider
observational study data versus prior-randomized
clinical trial data? Because a lot of the examples
that you’ve been using have been other randomized
controlled trials. In some of our
rare diseases where we want to use a lot of
these innovative methods in design, the best
historical data we have may be retrospective cohort
data, it might even be if we’re lucky
some observational prospective cohort data. It’s less likely in some
of our rarer diseases to be another randomized trial. So I mean, do you even
think about using this in those kinds of situations? So it depends. One of the key issues– so if
you go to the FDA perspective on this, they’re very upfront
that they fully accept the statistical methodology. And the hangup is
essentially, do we trust the historical data
that’s being brought to bear, which is exactly the
right question to ask. It does depend on the setting. There’s a pharmaceutical
group called TransCelerate which they’ve been collecting– essentially a bunch
of pharma companies have agreed to data dump
a repository exactly for this purpose
among many others, but this is one of
their key work streams. And they’re writing
some white papers on under what conditions do
we think this is valuable? What that white paper
will say is, essentially you need to look at the
area, areas like glaucoma, for example, very
stable over time. pCR (pathologic complete response) rates in breast cancer,
very stable over time. Depression studies, you can get
lots and lots of variability, and essentially it’s so large
this might not be a good idea, so you really have
to know the area. In terms of rare
diseases, the smaller the trial you’re running today,
the broader that sweet spot. There’s so much
variability in the study that anything you can do
to reduce the variability is bigger than any risk you
have to creating a bias. And so in rare diseases, you
have the interesting trade-off that the data might
not be as nice, but because the sweet
spot is so large, it can be in your benefit. So one thing we want to
do is when we design this, we write out the sweet
spot is this big. You need to be comfortable
that your rate is within these bounds or this
isn’t going to work for you. And then it becomes– I mean, I do not want to call it
a gamble, but in the long-run, it’s, do you believe
the sweet spot? And if you’re going to
be right 80% of the time, then this is worth doing. If you’re going to be right 40%
of the time, then this is not. And it’s a matter of obtaining
that confidence in repeated use. And I think– I didn’t ask
the second part of my question very clearly now that we’ve
gone through some of this, but to get away from the problem
of the historical patients as we’ve been discussing,
could you use this methodology for a contemporary
control That’s not included in the study,
but is an observational cohort or on another registry? So that’s what I was
trying to get to. Do you have something to say? I’ll say it. Yeah, I don’t want to hog– –thinking. This is– sorry, this is
a dear to my heart area. So contemporary–
the general issue of drift, one part of that
is separation in time. Other parts of it
are going to be different sites, different standards of care, other aspects. So I think the general
notion of how close they are, time is one aspect of that
and other factors also come into play. So contemporaneous gets rid of
the time aspect that may not get rid of different
sites, other differences between what’s going on. We’ve had a lot of use. We run platform trials which
are 10-year continuous trials. Those are nice because you
can see the whole time course, and you can get an idea how
much have they been changing over time, and you’ve got
almost the self-mechanism for checking this assumption. So contemporaneous,
I think they’re good, but it doesn’t eliminate all
the possible discrepancies. Let’s see, I think we
need to take a break. Deraus is informing me that
we’ve got goodies in the back. We’re going to take 10
minutes, will that work? Let everybody get some food. If it takes 12 minutes,
we’ll take 12 minutes. So we’ll let everybody
get some food and rest. We’re going to go ahead
and get started back up, so feel free to grab your
coffee and your goodies. All right. So we’re going to get
started back up again. So this morning– sorry, the
previous session, we mostly talked about ways that we can
efficiently use our resources. So we’ve talked about ways
that we can make our sample size flexible, ways
that we can use data that we have in hand to help
make our trials more efficient. One of the other broad topics
that we want to talk about is how innovative
trials can help us to answer broader questions. So a standard randomized
clinical trial requires focus. So typically what
we see is a trial that is a one-to-one
randomization looking at one treatment versus control. We have a very specific
homogeneous patient population that we’re looking at. We’ve made an assumption
about our treatment effect. And a lot of times, our
standard randomized trial doesn’t look at a lot
of different things. Like we don’t look at
treatment combinations, we don’t look at different
what we call domains. So there’s reasons
historically for why we have this focus in
our clinical trials, but one of the
things that we often think about when we’re in the
process of designing a trial is we try to anticipate if
the trial doesn’t come out the way we hoped. If we have our p-value at
0.06, for example, what would we regret
and what would we think we would have done
differently in our trial? Would I think, oh,
I should have looked at a different population? Or I should have had
a bigger sample size? Or I should have looked
at a different dose? And so we try to
think about what are those things in
our trial that we might regret if we don’t do them. And if we can think about
those things in advance, and then we can make those
changes to our design up front to help reduce
that anticipated regret. So innovative trials,
rather than focusing on a small question, are often aimed at answering a bigger question or a set of questions. So because of this, we
can broaden our focus so that we’re not just maybe
looking at a single population or maybe not just
looking at a single dose. And this makes our
trials more robust. So robust meaning that
we’re robust to some of these uncertainties. So if we have a narrow
focus and we’re just designing a trial with one
dose versus our control arm, we’re making an assumption that
we’ve picked the right dose. Or if we just look
at a population, certain characteristics,
we’re making an assumption that that’s the right
population where the treatment benefit occurs. And if there’s uncertainty
about that, there’s uncertainty about what the right dose is
or what the right population is and we choose incorrectly, there
are obvious consequences to that. So if we can make our trial
more robust and looking at broader questions–
multiple doses, multiple populations,
not only does that help with accounting for
the uncertainty that we have and maybe giving us some
protection against making a wrong assumption,
but it also allows us to look for things like
interactions between treatments that you don’t get
to do if you just selected one single treatment. So this idea of looking
at more questions, this is not a new idea. So a lot of us
may have been told that when you’re running
a clinical trial, you can only change one
thing at a time, right? But actually, R.A. Fisher, one of the giants in the statistical
world, recognized in 1926 even that asking a very
narrow set of questions is not necessarily
the best idea, and that if we ask logical, well-thought-out questions, we have a better chance of answering those questions. So one of the designs
that we can talk about to ask broader questions
is a factorial design. So the idea here is
that instead of asking to compare just one single
treatment versus control, we may be interested
in combinations or different types of
treatments in combination with one another. So for example, we may have
a trial where we have maybe– what I call a domain. So Domain A, you can either
give big A or little a. So maybe this is– in the example
we’re going to use, maybe this is patients
who are on a ventilator. So you have patients that’s
either on a ventilator, big A, or not on a ventilator,
that’s little a. And maybe you’re also
interested in another treatment, antibiotics. Patients say they’re on
antibiotics or they’re not antibiotics, so that’s Treatment
Domain B. Patients can either get big B, which means they’re
assigned to antibiotics, or little b, which
means they’re not. So a trial that just
looks at ventilator versus no ventilator, or a trial
that just looks at antibiotics versus no antibiotics,
that would be a trial with a
single narrow focus. And what that doesn’t
allow us to do is look at any interactions
between these things. So what we could
do instead is look at a trial that looks at the
combination of these things. So you could do big A in
combination with big B; big A in combination with little b; little a with big B; and little a with little b. So in the example that we
use, we just had two domains. You can generalize this. So maybe our example is not just
a ventilator and antibiotics, but maybe it’s a ventilator,
antibiotics, and steroids. We now have three domains,
and each of those domains has two possible treatments
with it or without it. And we could run
a trial where we look at all of those
different combinations. So now we’ve got ventilator,
yes; antibiotics, yes; steroids, yes; versus
ventilator, yes; antibiotics, no; steroids, yes. And looking at all of
these things together allows us to answer
a broader question. So this design is called
a factorial design. And these types of
designs are always going to be better
than a separate trial for each variable
because it allows us to look at the interactions
between those two things. So for example, if
you were running a trial where you
were just looking at steroid use in combination
with antibiotics– so big A, big B, and versus neither of
those things– so a patient either gets both or
they get neither, that doesn’t allow you
to answer the question of whether the
antibiotics are important or whether the
steroids are important. So if you see a benefit
in both of them together, which one is it that
causes a benefit? Or is it really only
when they’re together? And you can’t
answer that question unless you do a factorial
design that allows you to look at that interaction.
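One way to make "interaction" concrete, as an illustration only (the actual analysis model for any given trial may differ), is a simple model for the outcome with the two treatments coded 0/1:

    E[Y | A, B] = beta_0 + beta_1*A + beta_2*B + beta_3*(A*B),    A, B in {0, 1}

Here beta_1 and beta_2 are the main effects and beta_3 is the interaction, and beta_3 can only be estimated if all four combinations (A, B) = (0,0), (1,0), (0,1), (1,1) are actually randomized, which is exactly what the factorial design provides.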
And let's see– so if the factors really do interact, only the trial that looks at all of them, only a factorial design, is the way to find that interaction. If these factors
don’t interact, you can still estimate the
effects of all of the factors, and the nice thing
about a factorial design is that the cost
is actually much less than if you just explored all of them independently. So an example of
this type of design is called the PROSpect trial. This is a pediatric
respiratory trial. It’s an NHLBI-funded trial. This was looking at the
comparative effectiveness on two factors or two domains. So the first one was the
positioning for a ventilation. So patients were
either supine or prone while receiving the ventilation. And then the type
of ventilation was either conventional or a
high frequency ventilation. So now we have in combination,
there are four possibilities. So we have four
arms in our study. And the trial was designed
to have interaction– sorry, to have interims where we could
consider either dropping a row or dropping a
column of our table. So potentially we could
have a situation where we stopped all of the arms
that were in the prone position and continue the
supine position, or we could do the opposite
where we continued the prone and stopped the other. We could also drop
one of the rows. So for example,
you could stop all of the arms that had the
high frequency ventilation and continue the conventional,
or do the opposite and stop the conventional
and continue the other. So this is an example
of a factorial design that allowed us to look at
these different domains. So as we said, the
factorial design is really the only way that you
can detect these interactions if they exist, the
only way that we can account for which
of those factors contributes to our
treatment benefit. These trials can be
slightly harder to plan and implement when you
have multiple factors. But the nice thing is that if
you don’t have interactions, it’s almost free to
do this, so why not? And if the
interactions are there and you want to estimate
them, this does typically lead to a larger sample size. So that’s one way that we can
address broader questions, is a factorial design where
we look at multiple treatment domains. Another way we can look
at broader questions is by doing a dose
ranging study. So many of the trials
that we see today have a small number of treatment
arms, maybe one or two doses versus control. So we have often
made an assumption that those are the
right two doses that we should be looking at. But if we make a guess, it’s
a possibility that we’re wrong and it may be the best dose
isn’t actually one of those one or two that we’ve chosen. So a dose ranging trial,
what this is going to do is a trial that
incorporates more doses. So three doses, six
doses, we’ve looked at trials that have many
more doses than that. A lot of times when we
do a dose-ranging trial where we have multiple
doses, we want to have a model that
explicitly looks at the relationship between the
efficacy and the dose level. And a lot of times
will often look at adaptive features such
as maybe eliminating arms that aren’t performing well. So we can start to
drop arms or maybe change the allocation
to our doses– so if some doses start to look
like they’re performing better, we increase the
randomization to those arms. So we can have adaptive
features and we’re going to talk about those a
little later this afternoon. We’ll talk about something
called response adaptive randomization or RAR and
give an example of that. That was a little
teaser, so we’re going to talk more about
dose ranging trials, but wanted to talk about
response adaptive randomization before we go into that. So the gains of this is– well, first of all, we
have a better chance of picking the right dose. If we can look at a
broader set of doses instead of just
picking one or two, we have a better chance
that we’ve picked the right dose in our trial. So the losses, a lot
of times we think if we’re going to add more
doses, that’s more arms, isn’t that more patients? So if I’m going to
add another dose, does that mean I now have to
almost double my sample size? And the answer is
usually actually no. So with modern modeling, so
when we add these dose-response models and we start
to add features like adaptive
randomization, there may still be a small
increase to the sample size, but it’s actually
a marginal cost. So the cost of going from
three arms to four arms is actually not that large. So one of the other
things that happens with dose ranging
trials, we often pick a dose response model. Again, that’s an assumption. Basically we’re
looking at what’s the form of the relationship
between efficacy and dose. So that does rely
on an assumption, so that has to be
carefully considered.
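As one example of what such a dose-response model can look like (a common choice, named here as an illustration rather than anything a specific trial used), an Emax model ties the mean response at dose d to three parameters:

    E(d) = E_0 + E_max * d / (ED_50 + d)

E_0 is the response at dose zero, E_max is the largest achievable effect, and ED_50 is the dose giving half of that effect. Every dose in the trial contributes to estimating this one curve, which is part of why adding an extra arm costs relatively little.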
So we'll circle back around to dose-response modeling in just a moment. But first I wanted to talk a little bit about enrichment. It's kind of a buzzword lately. So the idea is
sometimes we don’t know the right
population of patients to enroll in our trial. An enrichment trial recognizes that there's some uncertainty about which population actually has the benefit. So an enrichment trial, you can
do this maybe one of two ways. You can either start with
the largest population and then start to narrow
that down and start to focus on the population
where you’re seeing benefit, or you can go in the
opposite direction. So maybe start with
a small population, and then as the
trial moves on, you start to expand into a
larger, broader population to see if that
benefit still holds. Some examples of
this, we’ve seen this in a lot of oncology trials. In particular,
when we’re looking at a targeted therapy that’s
tested in multiple tumor types– so the idea is that maybe
it’s not as important where the tumor originated, but
rather the molecular target of the agent and
whether it’s going to be beneficial in patients
with those types of tumors. So in oncology, we see
these trials, we sometimes call them basket trials,
where instead of just running a trial in this histology
and then running a separate trial in lung
cancer and running one in breast cancer, what
we can often do instead is enroll a broad population. So we enroll patients in
different types of tumors, and then we use our methodology
to try and figure out which of those populations
is there a benefit. We’ve also seen an example– or some examples in other areas. Stroke, and we’re going to
walk through an example here, where the enrichment was
done to determine the time window between when
the stroke occurs and when the patient gets to
the hospital and the stroke severity. And the idea is, does the
effectiveness of the treatment differ across this population,
or is it effective everywhere? So enrichment trials, the
main idea here in the gain is that we can identify where
the treatment is effective and where it’s not. So one of the
downsides is that we do have some more complicated
statistical methodology to account for the fact that
we’ve got multiple groups and if we have multiple
chances to win. So we have to account for that. But one of the big gains
is that this is robust. Because we’re looking
at a broader population, there’s a better
chance that we’re going to ultimately end up with
the right population instead of just guessing at the
outset and potentially getting it wrong. These trials tend to be
larger than if we were just doing a trial in one
homogeneous focused population. But then, of course, the
potential benefit of that is it might be easier to enroll
because we’ve now got a broader inclusion/exclusion criteria. So an example of this is a trial
called DAWN run by Stryker. Jeff Saver was the
PI of this trial. And this is investigating the
use of a device called TREVO for the treatment of stroke. Thanks to Roger Lewis who
provided these slides for us. So before the trial
was started, there were some papers that came
out showing that endovascular therapy was highly
effective in some, but the population for
where that benefit happens was a little unclear. So which patients are
most likely to benefit from this treatment? So at the start of the
trial, the investigators, they sat down and
tried to think about where are their uncertainties
about this device? What are the things
that we would like to learn in the trial? So this slide shows kind of
this set of uncertainties that they listed, things
that they don’t know that they would like to learn. One of those was
where is the benefit? Where does this occur? So in stroke, the
primary endpoint here was the modified Rankin
scale, mRS disability scale. So this is a seven-point scale,
kind of an ordinal scale. So zero means the patient
has no symptoms at all, a 6 on this scale means
the patient has died. So the question was,
is the treatment most beneficial for
subjects who are maybe in the lower range of that scale
or more beneficial for patients who are in upper range? Or is it beneficial
across the entire range? So that was one of the
things that we didn’t know going into the trial. Another question was
about who benefits. And one of the things
they wanted to look at was the core infarct size. So this is, as the
way I understand it, was a measure of the
damage to the brain tissue. And the one of
the hypotheses was that only those patients with
smaller core infarct sizes would benefit from
the treatment, but those with the
larger infarct sizes were less likely to benefit. But that was an
uncertainty, we didn’t know, and that was one of
the things that we would like to learn
about during the trial. One of the other
uncertainties was the magnitude of the benefit
for this device called TREVO. Does it have no benefit,
does it have a large benefit, does it have a small benefit? And the implications
here were, well, if it has a large benefit,
we could run a small trial, but if the benefit is smaller,
we would need a larger trial. This has kind of– the idea that we might
need a flexible sample size because we have
some uncertainties about the size of
the treatment effect. So for all of these
questions, if we were to just make an
assumption and say, OK, we’re only going to
enroll patients who have the small core infarct
sizes, that would be making an assumption, and if that
assumption is incorrect, that could have
consequences for the trial that maybe the
device is beneficial and we just picked
the wrong population. So what we have done
instead is designed a trial that’s going to look
at all of those factors. First of all, we’re going to
talk about that first one, the mRS scale. So this is a seven-point scale. Typically in the analysis
of these types of trials, the scale is dichotomized. So for example, we
look at patients who had an outcome of 0, 1, or 2. So these are good
outcomes, patients have small-to-no symptoms. And we count the
rate of patients who have an outcome of
0, 1, or 2 versus those that have 3 through 6 in kind
of a responder rate analysis. So the downside of that is
that if the benefit occurs– if we move patients who
might have otherwise been a 4 and we move them
to a 3, this kind of analysis that just lumps
together the 3’s and 4’s doesn’t account for that. So we miss kind of the subtlety
in the treatment effect there. So one way to handle that is
to have an analysis method that looks at that scale. So it’s not just looking– does
a patient go from somewhere of 3 to 6 down to somewhere 0 to
2, but looking at how they move and what degree of
movement they have. So what we’ve done
here is constructed what we call a utility-weighted
mRS. So what this does is it takes each value of
that mRS score, 0 through 6, and assigns it a utility. So here, utilities that are– a utility of 0 is essentially
there’s no benefit, and a larger utility score means
that there is a bigger benefit. So there were a couple of
papers in the literature that had proposed
utility-weighted scores. So these top two
rows in the table shows some utility weights
that had been proposed. So for example, these were
on a scale from 0 to 10– so a utility of 0 is no benefit
to the patient, a utility of 10 is a good benefit. So patients who end up
with a mRS score of 0 were assigned a
utility weight of 10– so that’s a good outcome. Patients who died were
assigned a utility of 0. The score of 5 is kind of an interesting one. So 5 on the mRS scale
is essentially a patient who is in a vegetative state. So one of these
proposed methods treats that– gives it a very
small utility score of 0.1. The other proposed
method actually gave this a negative
utility score, so that this
outcome was actually considered worse than the
outcome of 6, which was death. And then you can
see that across the rest of the mRS range, the two proposals agreed pretty closely. So in the DAWN
trial, the new trial, they actually used the
weights that are shown here. So outcomes of 5 and 6, both
received a utility score of 0. mRS of 0 got the
good utility of 10, and then the various
degrees in between. So then when the
analysis is run, that allows us to look a
little more subtly at where the treatment effect is. So instead of just doing
this dichotomous looking at 3-to-6 versus
0-to-2, we can now have a better idea of the
magnitude of the benefit. So for example, a patient
going from a 4 to a 3, that’s a pretty large
increase in their utility. Whereas going from a 1 to
a 0 has a smaller change in their utility.
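Here is a minimal sketch of what that utility-weighted comparison looks like in code. The weights for 0 and for 5 and 6 follow the values just described; the intermediate weights and the outcome data are placeholders for illustration, not the trial's actual numbers.

    # Utility-weighted mRS comparison between two arms (illustrative sketch only).
    import numpy as np

    # Utility per mRS level 0..6: 0 -> 10 and 5, 6 -> 0 as described above;
    # the intermediate values are made-up placeholders.
    UTILITY = np.array([10.0, 9.0, 7.5, 6.0, 3.0, 0.0, 0.0])

    def mean_utility(mrs_scores):
        """Average utility-weighted mRS for one arm (scores are integers 0..6)."""
        return UTILITY[np.asarray(mrs_scores)].mean()

    # Made-up outcomes for two arms.
    device_arm = [0, 1, 2, 2, 3, 4, 1, 0, 5, 2]
    control_arm = [2, 3, 4, 4, 5, 6, 3, 2, 4, 3]

    # Moving a patient from 4 to 3 adds 3.0 utility points, while 1 to 0 adds
    # only 1.0: graded credit that a 0-2 vs 3-6 dichotomy would throw away.
    print(mean_utility(device_arm) - mean_utility(control_arm))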
All right, so that was one question that we addressed. The other question
was, who benefits? And this idea of, should
we enroll only patients who have small infarct
sizes, or should
this was thought to be something they could
identify as an eligibility criterion for
entering the study, and might define
different populations and how they respond. So the strategy could
be that we start with the small infarct
sizes and then expand if we see benefit
there and start to enroll the larger ones. The other strategy,
what we call enrichment, is we start by enrolling
the broad population, and then we start to restrict
the enrollment criteria if necessary. And the DAWN strategy
actually did this latter one. So they started by enrolling up
to 50 cc core infarct volume. And then they allowed
enrichment rules– so they had interim
analyses that would allow you to then prune
off certain sections of that so they could potentially
decrease that from 50 to 45, from 45 to 40, et cetera. And the way that
that was done was by looking at the likelihood
of a positive trial. And if you were more likely
to get a positive trial from decreasing from 50
to 45, then that was done and you enriched to
a smaller population. And I think I have another
slide about that in a moment. So that the third question that
we had some uncertainty about was the magnitude
of the benefit. So ideally, if the truth
is that this device doesn’t work at all, if
there’s no benefit, we would like to be able to
stop the trial early and save that sample size. Or, we would like
to say, OK, if there are certain regions of the
population where there’s no benefit, we would like
to stop enrolling just in that population,
but if we see benefit in other areas of
the population, we would like to keep going. So if there’s no benefit,
we want to be able to stop. If there is a small benefit
but it’s clinically important, we would like the
trial to keep going. So they set a large
maximum sample size of up to 500
patients, but then had interim analyses that
allowed us to stop early. The other idea behind a
clinically-important difference and understanding
what that means was related to the
choice of the outcome. And instead of just using
the dichotomous measure, trying to pick an
outcome that’s going to be more sensitive
in recognizing where that benefit is. And finally, if we have a
large treatment benefit, that means we don’t really need
the full insurance policy, as it was described
earlier, of a large trial. So we’d like to build in
rules that allow our trial to stop early and potentially
as early as 200 patients if it’s likely that the trial
is going to be successful. So everything about
this design was pre-specified– they actually
wrote up a methods paper describing their methods. So the sample size
was a maximum of 500, and they had multiple
interim analyses. Starting at 150, 200,
all the way up to 450. And they pre-specified
what types of decisions could be made at each
interims– so some interims they had the possibility
of enriching, some interims they had the
possibility of stopping the trial. The rules for
early stopping were based on the predicted
probability of success. And then the rules for
stopping for futility were based on if the
probability of success in any of those subpopulations
was less than 10%– sorry, in all of those
populations was less than 10%. The enrichment criteria
here was based, again, on the predicted
probability of success, and the idea was,
if we could increase the probability of winning
the trial by restricting the population by– and if by doing that, increase
the predicted probability by at least 10%,
then we would enrich. And we would
eliminate populations based on their core
infarct size if they had less than a 40% chance of
benefit in that population. And the enrichment here
was done from the top. So we started with infarct
sizes from 0 to 50, and then we would
calculate what’s the predicted
probability of success with the full population? What’s the predicted probability
of success if we restrict it to 45 and below? What’s the predicted
probability if we restrict it to 40 and below? And if by doing that,
enrichment could increase the probability
of a successful trial, then we would enrich.
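Schematically, the interim enrichment logic just described can be sketched as below. The predicted probabilities of success and the per-band probabilities of benefit would come from the trial's Bayesian model and simulations; here they are simply handed in, and everything beyond the 10%-gain and 40%-benefit thresholds quoted above is illustrative.

    # Sketch of the DAWN-style enrichment decision (illustrative, not trial code).
    def next_enrollment_cutoff(ppos, band_benefit, current=50,
                               min_gain=0.10, min_benefit=0.40):
        """Return the core-infarct cutoff (cc) to enroll after this interim.

        ppos:         dict cutoff -> predicted probability of a positive trial
                      if future enrollment is capped at that cutoff.
        band_benefit: dict cutoff -> probability of benefit in the band between
                      that cutoff and the next one up (45 means the 45-50 cc band).
        """
        best = current
        for cutoff in sorted((c for c in ppos if c < current), reverse=True):
            gain = ppos[cutoff] - ppos[best]
            if gain >= min_gain and band_benefit[cutoff] < min_benefit:
                best = cutoff        # enrich: stop enrolling the weak band
            else:
                break                # no further restriction is justified
        return best

    # Example interim: capping at 45 cc lifts the predicted probability of success
    # from 0.55 to 0.68 and the 45-50 cc band shows only a 0.30 chance of benefit,
    # so enrollment for the rest of the trial drops to 45 cc.
    print(next_enrollment_cutoff({50: 0.55, 45: 0.68, 40: 0.69},
                                 {45: 0.30, 40: 0.55}))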
So some of the statistical details around this. So this, as you can see, became
kind of a complicated trial, the enrichment,
the early stopping, so we had to do a
lot of simulations and make sure that the
trial was going to behave the way we wanted it to. One of the aspects
that we looked at was, what is the criteria
for a successful trial? So we were calculating
the posterior probability of the mRS– utility-weighted
mRS score being better than the control arm. If no enrichment happens–
so if all through the trial we enroll up to 50 cc
[INAUDIBLE] infarct volume, then the probability of benefit
has to be at least 98.6% to be positive. And this threshold
takes into account the multiplicities of our early
interims for early stopping. And that’s if no
enrichment occurs. If enrichment
occurs, we actually need a more restrictive
criteria because of the fact that we’ve kind of
cherry-picked which population we’re
going to be analyzing in the final analysis. So we started out with 98.6,
but if enrichment occurs, there was a rule for how we
would adjust that threshold to account for the enrichment. We had some statistical
models kind of similar to the models that we’ve
talked about before that shared information across
infarct-sized populations. So they had a model looking
at the benefit as a function of the infarct size. And we ran before the trial
started massive computer simulations to show
that this all worked, that we controlled type I
error, that we could understand how the enrichment happens. So here’s the paper,
this was published in the Journal of Stroke
for the DAWN trial. And this is a trial
that has actually run. It actually finished
and reported out in January of this year
in the New England Journal of Medicine, and actually saw
quite significant efficacy across the wide
range of patients. So I believe the outcome was
it never actually enriched, but the trial did stop early and
was successful at an interim– and I don’t recall
which interim it was. It was the first one– OK, so they saw the
wildly positive results in the first interim, they
were able to stop early. Pause to see if there are
questions on the enrichments? All right. And to hand it over to Kert
to talk about platform trials. On a pacer. So I’ve had
instructions, I have– there’s actual
tape on the floor, I’m not meant to
go past the tape. So I’m going to sit
here very carefully. All right. So I want to talk
about platform trials. These are trials that
enroll multiple treatments within the same structure. So we’re investigating
lots of things at once. They’re motivated by the notion
that if you look at a disease– and I’ve got all
Alzheimer’s up here– there’s been lots of
Alzheimer’s trials, unfortunately
they’ve all failed. What’s happened is if you take– I have an example here, 25,
there’s been more than 25. But as a society, what we’ve
done is we’ve invested 50,000 to 100,000 patients
in Alzheimer’s. We’ve failed over and over and
over and over and over again, and we’ve invested one to
2000 patients in each novel treatment, and we’ve
invested 50,000 patients in– 25 to 50,000 patients
under control. The net result of that is we
know a ton about the natural course of Alzheimer’s. And do we need 25,000 to 50,000
patients enrolled on control in Alzheimer’s? Or would it have been
valuable to do a few less and maybe test a few more
drugs while we were at it? We could have had more
shots on goal, so to speak. So what platform
trials are aimed at is trying to
make that happen? So this is just a graphic. This is really how we’ve
allocated patients. A whole bunch of
controls, and, I mean, certainly there’s a lot of
patients in each of these bars, but the relative
proportions, this is a little off from
what we would think would be the
natural thing to do. So a platform trial is a
special example of something called a master protocol. Master protocols are kind of
another buzzword these days. Essentially they're a
protocol that– imagine a protocol with no drug names. Sounds kind of odd. But if you’re investigating
Alzheimer’s, there are certain endpoints
that you’re interested in. Patients are visited
at particular times. They have certain medical
tasks, medical procedures. You can write a
lot of the protocol without really knowing what
the experimental drug is. We’re going to be
testing ADAS-Cog and we’re going to
do it at this rate, so we’re going to do this
kind of analysis at the end. So the master protocol
is really specifying all of that stuff–
the visits schedule, the endpoints, the
analysis, and so forth. A master protocol is
designed so that it can be run over
multiple treatments, and in particular, it
can be run perpetually. So you can say, I’m going to
start a trial in Alzheimer’s. There actually is something
called the EPAD trial in Alzheimer’s,
which is a platform. And at any given time, there
are three to six treatments in EPAD. It's still getting
going at this point, but there will be three
to six treatments in EPAD. And drugs can go in
and out of that trial. And when a trial– when a drug comes
in, what happens, instead of having to write
an entire new protocol, you say, look,
most of this stuff, we’re going to
have the same visit schedule as all the other
drugs, the same endpoints, the same tests. But when a new drug
comes in, instead of writing an entirely
new protocol, what we’re going to do is we’re
going to add an appendix, and that appendix
says, this is what we’re going to do for
drug X within this. So usually that
appendix gives details of what the treatment is. So certainly all the
medical information, any specific safety risks,
any particular safety concerns if you have to
do extra kinds of testing. We have to do a certain kind
of blood test for this drug that we wouldn’t
ordinarily have to do. And if there’s any
patient exclusions. We ran an Ebola
trial, for example, and some of the drugs could be
used on pregnant women and some of them couldn’t. So the appendix would
say, oh, this one’s not applicable to pregnant women. A platform trial is a specific
form of master protocol. And it’s defined by
its being continuous. So certainly most
master protocols are intended this
way, but some of them we’re going to have six
drugs from the start. A platform trial
is usually intended to run for a long time,
and drugs are going to enter and exit over time. It’s also intended to have a
cohesive inferential structure. And so what do I mean by that? I’m not trying to run– platform
trials are often confused with cooperative networks. Where if you run a
cooperative network, you’ve got a large
patient infrastructure, but every trial that comes
in is really separate. So I’m going to run
this trial and I’m going to get patients
from this network, and I’m going to run
that trial and I’m going to get patients from
the network and so on. But the trials really aren’t
talking to each other. These are not just separate trials stapled together. What happens is patients–
this trial, if there are six treatments
running in it, you can get randomized to
any of the six treatments. So it all has a single umbrella
that decides randomization and it has a single umbrella
that describes the analysis structure. One of the key
reasons for doing that is if you have multiple
arms and you’ve randomized across
everything, that allows you to pool controls
and it allows you to compare across different arms
because you had randomization as opposed to if you
didn’t, if you just ran this trial in this part
of the network and that trial in that part of
the network, you’d worry about potential
biases because you didn’t have this randomization. Yes? So Kert, that’s
very interesting. Are there– in these
master protocols as you’re describing them, do
the inclusion criteria for each of, let’s say, the
subtrials that are going on within the master protocol
have to be the same in order for it to work? Because I can envision
where you might be able to have a master
protocol for a trial, but the inclusion
criteria for trial A might be different than B, C,
or D, just maybe even slightly. So there are a couple different
directions you can go. And the efficiency that you
gain depends on the details. So I’m trying to remember–
what’s the European oncology trial? If you asked me any other time,
I’d think of it immediately. All right. Anyway, there is a trial
that for some reason I’m blanking on,
what it is designed to do is there are several
different oncology treatments, but every one of
them is intended for a different
molecular mutation. So they’ve got drug A is
intended for mutation A, drug B for mutation B, and so on. That is really more of a,
they are using the platform as a screening mechanism. If you imagine these
trials run separately, trial A would exclude out 80%
of the people they screen, trial B would do the same thing,
trial C– because they’re just looking– each one’s
looking for a small segment with their particular mutation. If you do all of these
together, then what happens is that screening
mechanism has gained you operational efficiency. Other trials are intended to
look at the same treatment and you’ve got
multiple drugs that work on the same population. That gives you
inferential efficiency because you can share
controls among that. So it really– you can
go both directions there, and the efficiency you gain– you get efficiency, but it’s a
very different kind depending on which direction you go. Oh, yes, sir? I have two questions. So one, how do you
get the companies to agree to a master protocol? Which, I don’t
know, maybe you can point to a reference
or something that you can point
for the lawyers to kind of debate about? So the second question is
more realistic is, is there a certain type of randomization
platform schema that you would recommend for a platform
trial where it’s more efficient to
bring in new drugs as other drugs
kind of get exited? How do you get
companies to agree? Very, very carefully. So the first thing
that you can always do is you can always get
a company to agree to be the second, the
third, or the fourth drug into the platform, because you
can effectively promise them the cost is cut in half. The thing that you can’t
ever do is convince someone to be the first drug
in the platform trial because they’re difficult
to get together. Now that’s a little
bit tongue-in-cheek, but that part’s absolutely true. The other sticking point,
which I think you probably were intending,
is companies often don’t want to play in
the same sandbox, period. One thing that I think is
fascinating about platform trials, there is a
direct comparison between drugs within a platform
and you randomize in-between them. Usually what has happened,
platform trials nowadays actually have, as part of their charter, a statement that they will
never publish and compare the drugs together. And often that is enough. I-SPY 2– you asked for papers. The I-SPY 2 has been
running for about 10 years, it’s had lots of
companies go through. I don’t know if there’s
any publication that describes the interactions. Probably by design as
we never actually want to describe the
private interactions. There is a paper
that’s been submitted from the adaptive platform trial
coalition, which is self-named. But Derek Angus is
leading that, Derek Angus and Brian Alexander. And they I think
have some discussions on how to get
companies to buy in. So I’m not giving you a
good answer because it’s a delicate question,
but we have successfully done it for a number of trials. An example of this– so let me just give you a kind
of– this is the efficiency. So we’ve got an
emerging epidemic, this is inspired by our
work on the Ebola epidemic in West Africa. Multiple possible treatments
were suggested for Ebola. If you remember,
there was something called ZMapp, a number
of other antivirals, there were some suggestions
certain cancer drugs might be effective, and people were
thinking of combinations, all kinds of stuff going on. So one question was, if
you have this large space of possible treatments,
how do you look at those? What do you need to do– what’s the right way to
evaluate 20 different drugs? We’ve done this in all
Alzheimer’s as a society over the past 20 years. We’ve done drug A
failed, drug B failed, drug C failed, drug D failed– we’ve run a lot of
trials in sequence. So how do we look through
these 25 different issues? In order to make the problem
a little bit more concrete, suppose that under
the standard of care there is a 30% survival rate. That’s consistent with
our Ebola example, it might have been closer to 40. But I will say 30 here. We’re going to
make the assumption this is a hard disease. So I’m going to assume that only
10% of novel treatments work. 90% don’t. And that depends on
what area you’re in. If you’re in Alzheimer’s,
nothing’s worked in Alzheimer’s. Probably assuming 2%
might be optimistic. Sepsis, an incredibly hard area. Certain kinds of
cancer immunotherapies right now fortunately are
in much better situations– you might expect a larger
number in certain areas. But I’m setting up this problem. So 30% chance of survival,
and of this large bucket of possible treatments,
only 10% work. How do I find the golden
nugget among that, those large numbers? That 10% of novel agents that
work, they have 50% survival. So working here means I
increased from 30 to 50. All the rest do nothing. So 30/30. So certainly things could be
harmful, I’m not assuming that. So how do we start? So this is what we might
do in a traditional trial. We’ll pick one. We’ll run a trial. I’m going to do 100 patients
per arm, so I do 200 patients. 0.025 type I error. If that first trial
is successful, I go, yay, I found
a treatment and I declare the process a success. If that trial fails, then I’m
going to pick another drug out of this bucket and I’m
going to run another trial. So this is just
intended in a sequence. Grab a drug; run a trial; if
it works, great; otherwise, grab another drug,
if that works, great; otherwise,
grab another drug– one after the other
after the other. Now this is actually
a pretty good strategy if you can pick what the
absolute most promising drug is in advance. Generally speaking,
that’s hard to do. If you actually knew
which drugs work, you probably wouldn’t
need to do the trials. So this– if we’re in this– if our bucket has 10%
good drugs, what happens? So on average, we’ve got a
run through 9.8 treatments to find a successful drug. That’s about in line with
our 10% of the drugs work. Our n, we have to– so this n is the total number
of patients we had to treat– 1,966. And that’s effectively–
really, we’re running 10 different
trials on average in order to find that
one golden nugget.
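A quick back-of-the-envelope check, under the stated assumptions and treating each trial's verdict as independent: the chance any single trial declares success is about 0.1 x power + 0.9 x alpha, so with roughly 80 to 85% power for 100-per-arm at 30% versus 50% survival and a one-sided alpha of 0.025, you expect around 1/0.10, that is, roughly ten trials of 200 patients each, in the same ballpark as the 9.8 and 1,966 on the slide. The sketch below simulates that one-at-a-time search; the test statistic, counts, and seed are illustrative choices, not anything from the actual Ebola work.

    # Monte Carlo sketch of the "one drug at a time" search (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)

    def one_trial(p_treat, p_ctrl=0.30, n=100):
        """Single 100-vs-100 trial with a two-proportion z-test, one-sided ~0.025."""
        x_t, x_c = rng.binomial(n, p_treat), rng.binomial(n, p_ctrl)
        p_hat = (x_t + x_c) / (2 * n)
        se = np.sqrt(p_hat * (1 - p_hat) * 2 / n)
        return se > 0 and (x_t - x_c) / (n * se) > 1.96

    def search_until_success(prob_active=0.10):
        """Keep grabbing drugs (10% truly lift survival 30% -> 50%) until a win."""
        trials = 0
        while True:
            trials += 1
            p_treat = 0.50 if rng.random() < prob_active else 0.30
            if one_trial(p_treat):
                return trials, 200 * trials

    results = np.array([search_until_success() for _ in range(2000)])
    print(results.mean(axis=0))   # roughly (10 trials, ~2,000 patients) on average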
Let's see, with the assumed accrual rate, we found one in
12 months on here. So here’s another way
to do this search. Instead of looking at one-to-one
over and over again, well what does that do? My first one-to-one trial,
I enrolled 100 controls. My second one-to-one trial, I
enrolled another 100 controls. The third one, I ran
another 100 controls. That’s the same
control over again. So that doesn’t
make a lot of sense, let me at least combine those. So I’m going to do a
shared control design. I’m going to do five
agents at a time. So still 100 per arm,
but now five agents are going to get compared
to a single control arm. And again, I’m going to do
pairwise independent analysis. So now I run 600 patients– 100 controls, 100 on
each five active arms. If I find a successful
drug, I go, yay, I won. Otherwise, I go
back and I enroll another five arms at a time. So I run this sequence. This is good. And I get savings here from
sharing the control resource. The fact that I haven’t
enrolled 100 controls over and over and over again is
what’s gaining me efficiency. So my average
sample size, it now takes me 1,528 patients
in order to finally find a successful treatment. And I can do this in eight
months rather than 12 months. I can do a little bit better. Remember, we talked
about futility earlier being able
to stop bad drugs? I’m going to do that. So those five active
arms, if some of them aren’t performing
well, I’m going to stop them for futility. That gets me an additional
benefit, a large one. So that 1,528 I can drop to 971. So now I've almost cut it in half. So sharing controls,
using futility, I can now find an effective
treatment twice as fast. What a platform does is it
adds another piece to this. So we're still going to
investigate five agents at a time, but now
what we’re going to do is we’re going to run interims
every 150 total subjects. I was about to get
up, but I refrained. So the futility
for an arm here– so we’re still
declaring futility, what you can think
of this platform as doing is whenever an
arm stops for futility, I’m going to bring a
new arm in immediately. So I’m not running
a six-arm trial and waiting for it to finish and
then running another six arms. Whenever a slot opens, I’m
going to bring a new drug in absolutely immediately. So I’m just investing
things forward. Question? So is there a point in
this mechanism in which your controls can crossover? If it’s somebody
having repeated events? Can someone– oh,
so can somebody– so that depends. Some of these are designed
to allow crossovers. There is a platform
trial called PanCAN which it allows patients to be
enrolled on multiple therapies over time. In other ones– in
this one, we actually did not allow crossovers. Usual issues with crossovers on
washout effects and carryover effects would apply to this,
but in theory, it’s possible. So if you had
asthma, for example, and you wanted to
see what was the best abortive agent in the
middle of an asthma attack, and someone was on the
trial as a control, there was other agents,
after some period of time, the controls could
switch after they’d been controls for
X amount of time, because the events are
in themselves isolated although the patients
are the same? OK. I don’t know that
much about asthma, but it sounds plausible,
that sounds like the PanCAN. As long as you don’t think– if I’ve been on drug A– or control and I
switch to drug A, if you think that’s no
different than being on drug A, that usual
question, if the answer is– if you could do a crossover
in a regular trial, you could do a
crossover in a platform. So I just want to touch
on the– actually, go back to the eligibility
question for a second here. Because if you’re replacing
one of these arms, let’s say the new treatment
comes in with, for example, you can’t give it to pregnant
women who have Ebola, how does that modify the
overall exclusion criteria for kind of the platform trial? Does it have to
change or are they not eligible to be randomized
to one of the arms? How is that handled
operationally? Yeah. It’s a– oh,
there’s a follow-up. [LAUGHS] I’m a
little nervous now. So the– no, you’re good. So generally speaking,
it’s complex. What you would often do is have
to analyze, say, pregnant women separately. In the Ebola trial, the
two special populations were pregnant
women and children. And it was actually
an interesting debate because the pregnant women
issue, the obvious question, if you die of Ebola,
that’s bad for the baby. So it was actually–
it was a huge debate over which drugs
actually ought to be excluded for pregnant women. You’d have to make
a modeling decision. Do you want to combine
the pregnant women and the normal data– not normal, it’s the wrong
word– pregnant women and everyone else
together in the analysis and pool it when you’re
evaluating an arm? Do you want to view
pregnant women separately when you do it? You’d have to basically come
up with a statistical model to account for that, and I don’t
have an easy answer for you in that situation. From an enrollment perspective,
though, as they come in, they wouldn’t be eligible to
be randomized to that arm. Yeah. So the operational– sorry–
the operational [INAUDIBLE] is easier. If you had arms A,
B, C, D E, and you said that that C and E were
ineligible for pregnant women, then a pregnant
woman coming in would be randomized to A, B, or
D. And so potentially you’ve got biases in those
populations and that’s where the modeling
would come into play. And so then with kind of
changing exclusion/inclusion criteria potentially
for treatments and also different treatment
modalities, I mean, these essentially are
unblinded trials, then? Depends. So the EPAD in Alzheimer’s, for
example, some of the treatments are pills and some of the
treatments are injections. For the purposes of EPAD,
you can get a placebo pill or you can get a
placebo injection. So the way EPAD works
is you are randomized to arm A, B, C, D, or E.
If arm B is an injection, then you’re randomized, you have
a chance of getting a placebo injection or a active B.
If you’re randomized to C and it’s an oral, you
might get a placebo oral or you might get
active C. So it’s kind of a two-tier randomization. It’s a question of whether– in an EPAD, they actually
pool the oral placebos with the injection
placebos, which was actually a controversial choice. What was interesting, one of the
compelling arguments that was made is, we’ve spent 20 years
and haven’t been able to move the needle on Alzheimer’s. Somehow if we could give an
injection as opposed to a pill, and that will affect
Alzheimer’s, that would be a little strange. But it is a choice
on whether you think that different
placebo modality matters, and it will make life
more complicated. So I haven’t given
you a direct answer, we try to deal with
it as best we can. All right. Let’s see. OK, so this open platform. So again, you can think of
the adaptive plus every 50. This was essentially
the five-at-a-time, but then we waited for
the trial to finish. This is when we invest
the resources forward. So now you can see, we’re
down to 849 subjects. So we’re searching for
this needle in a haystack. We’ve gone from
1,966 down to 849. We’ve saved about
60% of the subjects, and we’ve really done
it in multiple ways. We’ve shared our control
arms, done multiple treatments at once, we’ve employed
adaptive features, and we’ve invested
any savings that we’ve gotten forward as best we
can by bringing in new drugs. This is in a paper, this
reference in clinical trials, essentially this table. There are lots of
examples of this. They have different features. There’s a paper by Janet
Woodcock and Lisa LaVange in the New England Journal of
Medicine really touting these. These are the
examples in the paper. Some of these have been running a long time– I-SPY 2, for example, has been running for, I think, about eight years. Lung-MAP is another oncology trial. ADAPT is really a concept trial
for resistant antibiotics. This is not the Ebola
trial I was describing, we did something for
the Gates Foundation. This is something with the NIH. This is another Alzheimer's
trial separate from EPAD, and then this is
an antipsychotic. There’s actually a whole bunch
of– so these are some more examples. I don’t intend to go
through these slides. The only thing this slide
is meant to indicate is this is getting
back to your question. There are just lots of weird
complexities that happen. And really, some of these
are blinded and some are not; some of these are embedded in
a learning health care system and some are not;
and all of them involve quirks to
deal with the issues. And there are
occasionally trade-offs. We will pool the placebo arms. We don’t think it’s going to
make a big difference, but we can investigate
twice as many drugs, so we’ve made that
trade-off consciously. Certainly these are a
challenge to set up. So if you’re going to
do a platform trial, getting the first drug
in it is a long haul: convincing companies to play in the same sandbox, convincing researchers to play in the same sandbox. There are operational aspects
that are more complex. Informed consent is
often two-tiered. Often there is an informed consent– are you willing to be randomized into a drug?– and then separately, are you willing to be randomized to the placebo arm or the active arm of this drug? That is how EPAD does informed consent, so you have to do it twice, for example. On the plus side, for the
second and future agents, this is a no-brainer
thing to do. If you have these
platform trials set up– if the whole world were set up where here's a breast cancer platform, here's an Alzheimer's platform, here's a heart disease platform, here's a stroke platform– it would be ridiculous for anybody to skip these platforms and go run their own trial with their own controls. What you've got
here is a warm base. Sites are up and running. There are larger
networks allowing you to find subsets of patients. Everything is better in
this particular world. And I think that is good. All right, so just as
a conclusion, again, with these things the real trade-off is the startup cost versus the massive savings at the end. All right. Anna. All right. So the next topic is about
changing the randomization ratio during the
course of a trial. So the idea behind this is
instead of doing a one-to-one randomization, or if
we have multiple arms, one-to-one-to-one randomization,
the idea is that we’d like to efficiently focus our
resources to the arms that are most promising. So the place where we’ll
typically see these is dose ranging
trials, for example, that’s the example we’ll talk
about in a moment; platform trials as Kert just
described; factorial design– so if you can identify
among multiple arms which arms are the most promising
and then start to focus your resources on those arms. Different ways this
can be achieved. Kind of the simplest
method in many ways might be considered
arm dropping. So if you have
sufficient evidence that an arm is not performing
well, you get rid of it. Another way that
this could be done is a little more subtle
than arm dropping– that is adaptive randomization. So if you have an arm
that’s not performing well, you start to decrease the
allocation to that arm and start to increase the
allocation to the arms that are performing well. So you can see
that in the figure. Going to describe this by
walking through an example. This is a trial that was
designed for smoking cessation. The final endpoint
is looking to see if a patient is still
smoking in weeks 3 through 6 of the treatment. So the idea is in weeks 1
or 2, they may not be fully weaned off of the
cigarettes yet, but by weeks 3, 4,
5, and 6, if they don’t smoke in any
of those weeks, they’re considered a
success, a failure otherwise. The other idea in this
trial was that patients who drop out of the
trial and maybe don’t complete the full six
weeks of follow-up are also considered failures. So this could be– we assume
that they dropped out, maybe that’s because
they are smoking again. So we consider those
patients a failure.
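As a small illustration of how that primary endpoint might be derived from weekly data, here is a sketch; the variable names and the weekly data structure are our own assumptions, not the trial's actual case report form.

```python
def smoking_endpoint(weekly_smoked, weeks_required=(3, 4, 5, 6)):
    """Return True (success) only if the subject has data for weeks 3-6 and
    did not smoke in any of them; missing weeks (dropouts) count as failure.
    `weekly_smoked` maps week number -> True/False for "smoked that week"."""
    for week in weeks_required:
        if week not in weekly_smoked:   # dropped out before completing follow-up
            return False
        if weekly_smoked[week]:         # smoked during the abstinence window
            return False
    return True

print(smoking_endpoint({1: True, 2: False, 3: False, 4: False, 5: False, 6: False}))  # True
print(smoking_endpoint({1: True, 2: False, 3: False}))                                # False (dropout)
```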
This trial has multiple doses. In addition to the placebo, we have six dose arms ranging from 1 milligram up to 100 milligrams. And what we would like to
do, the goal of our trial is to identify the dose that
has the maximal response. So the dose that has the
highest response rate. One of the quirks
of this is that we expect that a U-shaped
curve is possible. So the idea is that maybe
those top doses may have higher dropout rates, for example. Maybe those higher doses
come with more adverse events and patients are more likely
to drop out at those doses. And because those
doses, if you drop out, you’re counted as
a failure, it’s possible that the dose response
curve is not just increasing, but that actually
at some point, it starts to decrease because
those adverse events start to outweigh the benefits. So– oh, let’s see. So in this particular
case, we are assuming that U-shaped
curve is a possibility, and we’re looking for the
maximum effect, which may be at one of the middle doses. Other trials, instead of
looking for the maximum effect, they might be interested
in looking for something like an EDx, like ED90, which
would be the dose that gives you 90% of the maximal effect. Again, the idea
here is that as you start to get to the
higher doses, maybe those are doses that
have adverse effects. So if you can back
away from that and give a slightly
smaller dose and sacrifice only a minimal
amount of efficacy, maybe that’s the
optimal dose to give. So one way that we
often handle that is by looking for this ED90. So trying to find a dose that
has at least 90% of the benefit but maybe is not
the highest dose. Another way we sometimes
handle that idea, that trade-off between
efficacy and safety, is explicitly by using
a utility function, and the idea behind this is
to combine several endpoints. So maybe we can explicitly
define that trade-off between the safety
and the efficacy, and we do that with
a utility function, and then we try to
find the dose that has the maximum utility which
combines those endpoints.
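To make those two ideas concrete, here is a minimal sketch of both approaches on made-up dose-response estimates: an ED90-style pick and a simple utility that trades efficacy against a toxicity rate with a weight lambda. The numbers and the form of the utility are illustrative assumptions, not this trial's actual rules.

```python
import numpy as np

doses = np.array([0, 1, 5, 10, 25, 50, 100])      # placebo plus six doses
eff   = np.array([0.20, 0.24, 0.30, 0.36, 0.40, 0.42, 0.43])  # response rates (made up)
tox   = np.array([0.02, 0.02, 0.04, 0.06, 0.10, 0.18, 0.30])  # dropout/AE rates (made up)

# ED90: smallest dose achieving at least 90% of the maximum benefit over placebo.
benefit = eff - eff[0]
ed90 = doses[np.argmax(benefit >= 0.9 * benefit.max())]

# Utility: explicitly trade efficacy against toxicity with a chosen weight.
lam = 1.0
utility = eff - lam * tox
best_by_utility = doses[np.argmax(utility)]

print(ed90, best_by_utility)
```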
So our current trial is a little bit simpler than that, because we're just going to count patients who drop out as failures. And so that aspect
of that trade-off is just explicitly handled
in the primary endpoint. Some details about
the trial here. So we’re going to enroll
280 subjects into our trial. In addition to our goal of
identifying the best dose, we also would like
to calculate– this is a phase II trial,
we’re interested in what’s the probability that a
dose that we carry forward is going to be successful
in a future trial. So what we’re going to do
for each dose in the trial, is we’re going to
calculate– suppose that we choose this dose and
we carry this dose forward into a phase III, one-to-one
randomized trial with placebo, and in that future
trial, we plan to enroll 400 patients, 200 on each arm. And what I would like to
know is the probability that that dose is
going to be successful. And that could be used
as a futility rule, or it could be used as part of our success criteria.
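For instance, one way to compute that quantity is by posterior predictive simulation: draw the response rates from the phase II posteriors, simulate the future 200-per-arm trial, and count how often it wins. The Beta(1, 1) priors and the one-sided z-test success rule below are our own assumptions for the sketch, not necessarily what the actual design uses.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def prob_phase3_success(x_dose, n_dose, x_ctrl, n_ctrl,
                        n_future_per_arm=200, n_sims=10_000, alpha=0.025):
    """Posterior predictive probability that a 1:1 phase III trial succeeds."""
    # Draw response rates from the phase II posteriors (Beta(1, 1) priors assumed).
    p_dose = rng.beta(1 + x_dose, 1 + n_dose - x_dose, size=n_sims)
    p_ctrl = rng.beta(1 + x_ctrl, 1 + n_ctrl - x_ctrl, size=n_sims)

    # Simulate the future trial's observed counts.
    y_dose = rng.binomial(n_future_per_arm, p_dose)
    y_ctrl = rng.binomial(n_future_per_arm, p_ctrl)

    # One-sided z-test for a difference in proportions (pooled variance).
    phat_d = y_dose / n_future_per_arm
    phat_c = y_ctrl / n_future_per_arm
    pooled = (y_dose + y_ctrl) / (2 * n_future_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n_future_per_arm))
    z = np.where(se > 0, (phat_d - phat_c) / np.where(se > 0, se, 1.0), 0.0)
    return np.mean(z > norm.ppf(1 - alpha))

# Example: 22/40 responders on a dose vs 8/40 on control in phase II (made up).
print(prob_phase3_success(22, 40, 8, 40))
```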
So I'll describe several iterations of the design that we went through,
starting with what we call a basic trial. So in our basic
trial, we’re going to enroll 40 subjects to
each of the seven doses– so equal randomization. This may not be the
optimal strategy. Oftentimes when you
have multiple doses, it’s actually beneficial
to enroll more subjects to the control arm. But here, we’re going to
consider a simple example with just equal allocation to all of the doses. In our basic trial,
we’re not going to do any modeling of the
efficacy across doses. Instead, we’re just going to
analyze each dose independently without any borrowing
between the doses. So in order to do
that, what we’re looking at here is the response
rate we call Pd for dose d. And we’re just going to give
independent prior distributions to those doses. So we’re doing this
in a Bayesian fashion, and what that means is
at the end of the trial, our drug is going to be
considered successful if our posterior probability
of beating the control is higher than 99.2%. So right now, we don’t
have any adaptations, we don’t have any
early stopping. This is just a fixed
sample size trial of 40 per arm analyzing the
doses independently. And the reason that we
have this high threshold is a multiplicity adjustment. The fact that we have
multiple arms in our trial, we have multiple
shots on goal, so we need an adjustment that makes it harder to win on each dose.
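As a sketch of what that final analysis looks like for one dose, assuming Beta(1, 1) priors (the talk only says the priors are independent), the posterior probability can be computed by simple Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_dose_beats_control(x_dose, n_dose, x_ctrl, n_ctrl, n_draws=100_000):
    """Posterior P(p_dose > p_control) with independent Beta(1, 1) priors."""
    p_dose = rng.beta(1 + x_dose, 1 + n_dose - x_dose, size=n_draws)
    p_ctrl = rng.beta(1 + x_ctrl, 1 + n_ctrl - x_ctrl, size=n_draws)
    return np.mean(p_dose > p_ctrl)

# A dose "wins" if this posterior probability exceeds the 0.992 threshold.
x_dose, x_ctrl = 18, 8          # responders out of 40 per arm (made-up data)
win = prob_dose_beats_control(x_dose, 40, x_ctrl, 40) > 0.992
print(win)
```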
All right. So we've described some scenarios that we're interested in. We'd like to see how
well this type of design behaves under these
different scenarios. So in the table here,
I’ve got the doses along the top and a
few different scenarios along the rows. So first of all, we’d like
to see how well the design performs in the null scenario. Well, we think that the control
arm has about a 20% rate. So about 20% of patients that
get a placebo stop smoking. And then the null
scenario assumes that there is no benefit
on any of the doses, it’s 20% across the board. We’ve got a scenario
we call Harm. Here is where the lower doses
are equivalent to control, and then as you start
to increase the dose, the response rate
actually goes down. We’ve got a set
of scenarios where we have a treatment benefit. In all of these scenarios, the
benefit increases with dose– so that the highest doses are
more beneficial than the lower doses to various degrees. And then finally, we
have a set of scenarios we’d like to investigate where
the maximum benefit actually occurs somewhere in the
middle of the dose range. So here in the
inverted-U scenario, the best dose is
the 25 milligrams, and then the effect
starts to decrease as you get higher doses. And then we looked at a
scenario where that peak occurs earlier in the curve– so it occurs at the
five-milligram dose. So we’d like to see across
this range of scenarios what happens to this trial. So in this table,
what we’re showing is essentially the
power and the type I error rate, which is
in this column here. So under the null scenario, our type I error rate is about 5%. If we look at the scenarios
where we have a treatment benefit, this particular trial
has a pretty modest power here. So ranging from 14%
up to about 56%, and in our inverted-U scenario,
somewhere around 45% to 50%. So this is an
underpowered trial. We don’t have the
highest power to detect these kinds of scenarios
that we’re interested in. The far column on the
right shows the probability of picking the best dose, and
the best dose is described here in parentheses. So for example, in
the positive scenario, we had about a 40% probability
of choosing the best dose. And in the inverted-U scenario, we had about a 48% probability. So some of that depends
on how many doses are good and what’s the magnitude of the
difference between your best dose and maybe the
second best dose. All right. And that gives me 10 minutes. OK. All right. Moved the wrong way. All right. So now we're going to
take that basic trial that we described and we’re
going to add some modeling, and we’re going to consider
a dose response curve. And this is just a function
that describes the efficacy as a function of the dose. Here in our example,
we can't consider a curve that assumes a monotonic relationship, because we expect there is a possibility that the curve increases for a while and then plateaus or maybe even comes down. So we can't assume something
like a logistic curve or Emax curve because those have
a built-in assumption that the effect
increases with dose. So we’re going to use a model
called a second order Normal Dynamic Linear Model or NDLM. So this is kind of a
smoothing spline function.
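To give a feel for what a second-order NDLM does, here is a sketch that draws dose-effect curves from that kind of prior on the logit scale: the second differences across adjacent doses are Gaussian, so the curves are smooth but free to rise, plateau, or turn back down. The starting level, slope, and smoothing parameter tau are illustrative choices, not the trial's actual prior.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ndlm_prior(n_doses=7, tau=0.3, n_curves=5):
    """Draw dose-effect curves (logit scale) from a second-order NDLM prior:
    the second differences theta[d] - 2*theta[d-1] + theta[d-2] are N(0, tau^2),
    which smooths across neighboring doses without forcing monotonicity."""
    curves = np.zeros((n_curves, n_doses))
    for i in range(n_curves):
        theta = np.empty(n_doses)
        theta[0] = rng.normal(-1.4, 0.5)            # roughly logit(0.2)
        theta[1] = theta[0] + rng.normal(0.0, 0.5)  # initial slope
        for d in range(2, n_doses):
            theta[d] = 2 * theta[d - 1] - theta[d - 2] + rng.normal(0.0, tau)
        curves[i] = theta
    return 1 / (1 + np.exp(-curves))                # back to response rates

print(np.round(sample_ndlm_prior(), 2))
```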
So now I'm going to go back to that table where we're looking at the
probability of success. So now in each one of these
cells, we have two numbers. So the number on the
left is our basic trial that we just looked at before. The number on the right is now
we’ve done nothing else except add a dose response model. So under the null scenario,
we have about the same type I error rate, about 5%. But in some of our
other scenarios, we start to see some pretty
big increases to our power. So take a look, for example,
at the positive scenario. Previously we had 56% power. Just by adding on a
dose response curve, we’ve increased
that to 70% power, so we can see some
pretty big gains. The idea here is that we’re
able to borrow information across these doses. Let’s see– inverted-U scenario,
we saw some jumps from 50% to about 64% by adding
the dose response model. The probability of
picking the right dose, we see some modest gains
here, about 40% to 46%. Here it didn’t change
very much, about 55%. Another way we can think
about choosing the right dose, or whether we made the right decision, is: how many doses continue into the next trial? So what we're
showing in this table is the probability that we
pick the right dose given that we can carry
one dose forward, the probability that we get
the right dose if we carry two doses forward, and the
probability that we get the right one if we
carry three doses forward. And here, this is conditional
on trials that were a success. So if we just take the subset
of trials that were successful and looked at which doses we carried forward, the trend here is obviously
if we take more doses forward, we have a better chance
of choosing the right one in that step that
we take forward. So the next thing we’re
going to add to our trials– so we’ve looked at just going
from a basic trial to adding a dose response
model, now we’re going to start to change
the adaptation. So now instead of just
allocating patients equally across all the
doses, we’re going to do something different. So we’re going to
start the trial with what we call a
burn-in period where we have a fixed allocation. This could be one-to-one, this
could be some other ratio, but we’re going to have a
small portion of our trial that has this
allocation that’s fixed. And from that data, that
initial set of data, we’re going to estimate
the dose response curve. And then for the
next set of patients, we’re going to change
the allocation ratio and start to allocate more
patients to the doses that have the best efficacy. And we’ll continue
to iterate this. So we’ll have interims
that are frequently spaced, and each time we do
an interim, we’re going to update those
randomization probabilities. So how do we decide how to
change these allocation ratios? So one simple method is, for each dose, we're going to calculate: what's the probability that this is the best dose? And we'll allocate in
proportion to that. So doses that have the
highest probability of being the best dose, those
get the highest randomization probability. And doses that have
lower probability of being the best dose are going
to have that smaller weight there. So in our example, so
to make this concrete– I should go back. So here, what we do is we
calculate this probability. What you can do is
actually also raise this to a power which helps you
control how aggressive you are. So if you calculate
this probability and then raise it to a
power of 2 or a power of 3, that allows you to be more
aggressive about going after the best dose.
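Here is a minimal sketch of that weighting step for the active doses, assuming independent Beta(1, 1) posteriors; the actual design estimates the probability of being best from the NDLM fit rather than from independent posteriors, so this only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)

def allocation_weights(successes, n, k=1.0, n_draws=50_000):
    """Randomization weights proportional to P(this dose is best)^k, computed
    by Monte Carlo from independent Beta(1, 1) posteriors; k = 1 matches the
    choice in this example, while k > 1 chases the leading dose harder."""
    successes, n = np.asarray(successes), np.asarray(n)
    draws = rng.beta(1 + successes, 1 + n - successes, size=(n_draws, len(n)))
    p_best = np.bincount(np.argmax(draws, axis=1), minlength=len(n)) / n_draws
    w = p_best ** k
    return w / w.sum()

# Six active doses with made-up interim counts out of 10 subjects each.
print(np.round(allocation_weights([2, 3, 5, 7, 8, 6], [10] * 6), 3))
```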
In our example we're not going to do that; we're not going to put an exponent on it. So what we're going to do
is start with a burn-in of 10 patients per dose. And after that burn-in period,
we’ll estimate our model and we’ll update our allocation
probabilities every four weeks. So the other kind of
bells and whistles– there’s lots of different
ways you can do this– is, if any of our doses has less
than a 5% probability of being the best, we’re going to
temporarily just zero out that dose so that
it’s going to receive no patients until
the next interim, and we’ll reassess
that probability at the next interim. So effectively what that does
is a temporary arm dropping, but it’s not a
permanent arm drop. So it’s just zeroing
out that dose, and then as we start to get
more data, we’ll re-evaluate it. In these trials where we
do adaptive allocation, how we treat the control arm turns
out to be really important. So remember here, our final
analysis is analyzing relative to control– so we’re comparing
the response rate of our best dose to control, and in order
to make a good comparison there, we need control patients. So we want to make sure
that we don’t allocate away from control. So for example, we don’t
want to just put the control arm in with the
adaptive randomization and say, well, if the control
arm starts to perform poorly, we’ll allocate away from it. If we start to allocate away
from control, what that does is it hinders our ability
to make comparisons to the control at the end of the
trial and that hurts our power. So here, what we’re going
to do is fix the allocation to the control arm. So two out of every
eight subjects are going to be allocated
to the control arm, and then the remainder of
those subjects in the block will be allocated adaptively.
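Putting those pieces together, a per-block allocation might look like the sketch below: control is pinned at two out of eight, the active doses share the rest in proportion to their probability of being best, and anything under 5% is zeroed out until the next interim. The renormalization and edge-case handling are our own assumptions, not the trial's exact rules.

```python
import numpy as np

def block_allocation(p_best_active, control_frac=2 / 8, drop_threshold=0.05):
    """Per-block randomization probabilities: fixed control share, adaptive
    split of the remainder, and temporary zeroing of weak doses."""
    w = np.asarray(p_best_active, dtype=float)
    w[w < drop_threshold] = 0.0          # temporary "arm drop" for this block
    if w.sum() == 0:                     # degenerate case: fall back to equal
        w = np.ones_like(w)
    w = (1 - control_frac) * w / w.sum()
    return np.concatenate(([control_frac], w))   # control first, then doses

# Interim P(best) for the six active doses (illustrative values).
print(np.round(block_allocation([0.02, 0.10, 0.28, 0.35, 0.20, 0.05]), 3))
```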
So here are our results. Now we've got the basic trial, the NDLM trial, and now we've added the
adaptive allocation. And we can start to see
some large benefits again. Especially look at here,
the scenario where the low dose is the best dose. So now by allocating adaptively, we've gone from about 50% power, 47-ish, to over 60% power. We see some gains in
our other scenarios. The positive scenario
went from 70 to 79% power. So because we’re now making more
efficient use of our resources, putting the subjects on the
doses that we care about, the ones that are
performing the best, we see some gains in the power. So again, some of the
gains here ethically. The idea behind
adaptive randomization is to avoid exposing patients
to arms that aren’t performing well, and instead,
give the patients who were enrolled in the trial
the best chance of getting allocated to a good arm. So these trials can be a little
more complicated to set up. Careful planning,
careful calibration, making sure you’re
not making decisions too early, and some
operational details about doing these interims and changing
the adaptive allocation ratios, so those have to be
considered as well. Questions? All right. Last topic. Everybody tired? OK. So I want to talk about rare
diseases and more informative endpoints. So I’m going to do this
by way of an example. This is GNE myopathy, which many of you may know more about than I do. But essentially this is
a progressive disease where your muscles are
gradually replaced by fat. And it goes essentially
from your feet up. So what happens early in
the course of disease, you basically lose muscles
around your ankles. It moves up to your
knees, through your hips, through your arms, and
it is eventually fatal. Very rare. Worldwide prevalence, four
to 21 out of a million. There's no treatment that helps this particular disease. This is challenging to do– nothing to [INAUDIBLE] particular indication. This is a challenging
disease to do research on. And the reason is, while the
prevalence is low, in addition, people are at different
stages of the disease. So people early in the
disease, an endpoint like six-minute walk
might be relevant because they’re losing
function in their ankles and they’re having trouble
walking and so forth. People late in the disease
are not ambulatory at all. So a six-minute walk is
irrelevant to people like them. You may be interested in
the ability of somebody to– grip strength or something
like that late in the disease. So it’s difficult to come up
with a clinical trial design that accounts for all of that. So what we did is we came up
with a disease progression model. So we essentially came up with a
statistical model that describes the entire disease course. We had natural history
data on 38 patients. Each one of them is at a
different place in the disease. So for example, this person,
their strength is a lot lower. Each of these curves– I’m not going to go through
all the details of this, but these are different measures: knee extension, elbow flexion strength, shoulder strength, knee flexion. These are different components of the disease. You can see this
patient has completely lost the knee and
dorsiflex, but still has strength in other areas. So each of these
patients, what we can do– I’m going to skip this model. What we essentially did is
we took that disease course and we said, this is what
happens to dorsiflex over time. This is what happens to
your knees over time, to your grip over time, to
your hip extension over time. And what we were able to do
is we could map each patient onto their disease age. So some patients
might be in this area where if we were going to
follow them– this is in years. So if they’re in
this area, you can see the dorsiflex
is the thing that’s going to give us the
most information. In this area of
the disease, this is a situation where the
hip extension or the grip gives us more information
about the progression of their disease. So what we tried to do is
take each person’s data and tried to map it
onto each of these. So this is the kind of fits that
we got to the natural history data. So this is one subject
analyzed over several years, and you can see essentially
these are the observations and this model is
fitting it quite well. So this is saying, if we know
your five or six strength measures, we can
essentially map you to a disease age of 10 or 20 or 30, depending on where you are in the disease.
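To illustrate the mapping step only, here is a toy sketch: given model curves for each strength measure as a function of disease age, find the age that best matches one patient's observed measures. The real model is Bayesian, fits all measures jointly, and carries uncertainty; the curves and the grid search here are purely illustrative.

```python
import numpy as np

def estimate_disease_age(measurements, model_curves, ages):
    """Map one patient's strength measurements to an estimated disease age:
    pick the age on a grid that minimizes squared error against the modeled
    mean curves."""
    errors = []
    for age in ages:
        sq = [(measurements[k] - f(age)) ** 2
              for k, f in model_curves.items() if k in measurements]
        errors.append(sum(sq))
    return ages[int(np.argmin(errors))]

# Toy declining strength curves (not the fitted GNE myopathy model).
curves = {
    "dorsiflexion":   lambda a: max(0.0, 100 - 8 * a),
    "knee_extension": lambda a: 100.0 if a <= 5 else max(0.0, 100 - 4 * (a - 5)),
    "grip":           lambda a: 100.0 if a <= 10 else max(0.0, 100 - 3 * (a - 10)),
}
observed = {"dorsiflexion": 5.0, "knee_extension": 60.0, "grip": 95.0}
print(estimate_disease_age(observed, curves, np.linspace(0, 30, 301)))
```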
This also includes the uncertainty bars around each of the
different ones. You can see, knee is
an interesting one. The knee behaves
very differently than the other measures. This is more fits. I recognize we’re a
little short on time. As an advertisement,
Melanie Quintana is giving a webinar on
disease progression models sometime early in the spring. There will be more detail on
this in that webinar as well. But I’m still
trying to indicate, this is a model that fits very
well when all is said and done to the natural history data. And it fits well even
though the patients are at different stages of their disease. What we did in the trial design is take those six endpoints and make the endpoint for the primary analysis the disease age. So for each patient, what we did
is we took those six measures, we mapped it back to a
disease age for each patient, and the primary analysis
is basically measuring, can we halt the
progression of the disease? If you come in at
baseline and you– if Anna’s at disease progression
age 10, two years later, has she gone to 12 or
is she still at 10? And if it were somebody
else, if Doray came in at disease age 30, two years
later, is she still at 30 or is she at 32? And so it was the time involved
in that we were measuring as the primary endpoint. The trial has 50 patients, enrolled three-to-one, treatment to placebo. And again, if you think of this
as the disease progression– this is the overall disease
age, what we’re really trying to measure
is are we halting the progression of the disease? So it might be that Anna going
from 10 to 12 over two years is full progression, Anna going
from 10 to 11 over two years, we’ve cut the
progression in half. That’s our measure,
our primary endpoint.
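As a tiny illustration of that endpoint, a patient's progression over the two-year follow-up can be summarized as the change in model-estimated disease age per calendar year; the function below is a sketch of that calculation only, not the trial's actual analysis model (which compares the treatment and placebo arms on this scale).

```python
import numpy as np

def progression_rate(age_baseline, age_followup, years=2.0):
    """Disease-age progression per calendar year for each patient: 1.0 means
    the disease advanced at its natural pace, 0.5 means progression was cut
    in half, and 0 means it was halted."""
    return (np.asarray(age_followup) - np.asarray(age_baseline)) / years

# One patient progresses from disease age 10 to 12, another from 10 to 11.
print(progression_rate([10, 10], [12, 11]))   # -> [1.  0.5]
```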
The purpose of this is to combine all the information in the six strength endpoints and get us an adequately powered
trial, which this does. So this ends up–
you can see, if we’ve got a treatment effect of
50%, we’ve slowed the disease progression in half. The trial power is 98%. In fact, if you slowed
the disease progression down to 75% of its rate– so it's 25% slower, we
still have about 75% power. So this is, again,
a trial that tried to work with a very small number of patients by combining endpoints, and I recognize I haven't given it a full description, since we are running out of time here, but there is more on this. This trial is what
came out of the NIH program on rare and neglected diseases. It is currently being
run by NeuroNEXT or about to be run by NeuroNEXT
as one of their trials. Let’s see. So let me end here. Make sure I have questions. Please sign up for
the ICTR website. When you download slides, you
can sign up for an account. Our email addresses are on here. If you have any questions,
we’d be happy to answer those. And certainly in close,
anybody have any questions, comments to end? [INAUDIBLE] recorded this
within the next two weeks or so. [INAUDIBLE] or you want to share
something with a colleague, you’ll have the whole
audio/visual presentation [INAUDIBLE] to share. Thanks. Thank you all. [APPLAUSE]
