All right. Thank you– ooh, I am on. Thank you all for coming. A few housekeeping items. This is being

recorded in the back, so you’re all aware of that. If you’re interested

in getting slides, they are available

on the ICTR website. So you can log into that. If you set up an

account, you will get emailed on future things

that are put onto that website. In terms of the internet

access here, you should be using

the Horton Grand, and the password is 311ISD. 311ISD, which should give

you access to the web. If you need to go

to the bathroom, those are right

out– so basically go that way, which involves

a quick jog to the left and then a quick

turn to the right, but they’re basically

right behind that wall in a little hallway. Let’s see. Attendance is enough

that I can kind of know– how many of you are clinicians? Clinicians, one,

two– statisticians? A couple of those. OK, so a mix. That’s all good. I just wanted to kind of get an

idea of who I’m talking to. What we’re going

to do today, we’re going to go through

a number of topics. Some of the structure is going

to look like the webinar that’s been posted online,

but we’re going to add a lot more meat

to it as we go forward. So a quick description,

this is also part of the webinar

that was online. Kind of talking about what

a gold standard RCT is. So keep in mind, typically this

is a two-arm trial treatment versus control. It has basic elements

of randomization. And all of this is designed

and has been done for about 60, 70 years or so. When we’re talking

about innovative trials, by necessity, we’re

talking about breaking a few of those components

or doing them differently. And so what we’re

going to do today is we’re going to talk about

a number of innovations, going to talk about the

motivation for each one. As opposed to the

first webinar, we’re going to give examples and some

main ideas of each one of them. So we’re going to start to

quantify some of what they do and why. And certainly, all of the

trade-offs involved in this. So if you pick an

innovation, sometimes there is, quote, a “free lunch.” Other times you

have to make sure that you account for the fact that the sample size might be increased, the power might be

decreased, and you have to weigh the

trade-offs involved in that. So keep in mind,

modern RCTs go back to the streptomycin

trial in 1946. There’s a publication

that came out in 1948 describing that trial. Really, kind of the last innovation was randomization. Prior to this, people either used alternating assignment or pretty

much assigned things at will. And at some level,

this has been rather– it’s been stuck at this point. This 1948 trial is run over and over again, and in lots of ways, we haven’t advanced. And so we want to talk about the ways that we’ve advanced. Certainly a standard

RCT does well. So we want to make sure the

deviations are for a reason. So again, we’re

going to talk about what are the features in

this gold standard, what are the negative features,

what are the trade-offs? The main components

here, we have a lot of components that

are designed to avoid bias. These include randomization. Randomization is

kind of a catch-all. It avoids biases on essentially

anything you don’t know. We also routinely stratify. That’s intended to

directly equalize groups on biases that we

may know in advance. We also employ blinding. And often we have a

fixed sample size, so we avoid the investigator

being able to kind of look at the data until they

get the answer they want. So all of these

things avoid biases. The rest of this is aimed at obtaining interpretable results. So we often do a two-arm trial. That’s in the sense that if we see

a difference between groups, we’ll directly attribute

it to the two groups. We randomized, we

saw a difference, the only thing they differed

on was the two arms, this caused the difference. We often try to look at

a homogeneous population, again, trying to

avoid the uncertainty. We use lots of validated

standard clinical endpoints. All of this is really intended

to make things as simple as possible in the trial so

that we can get clean answers. The trouble is that they’re

very expensive answers. So we often have to do one population– that’s effectively all you can do. It also asks us to look at really narrowly-focused questions. So if I say you only have two arms– [INAUDIBLE] What’s the population? Oh, I’m going to focus on this group of people, when perhaps I really want to know, well, maybe it works in everybody, but maybe it only works in this subset. And so these are

conducted at– you have one or more interim analyses. I’ll give an example in a moment. And at each of these interims, we can stop, make a decision; we might continue, look at the data again, we might continue, so on and so forth. As I said, these have several titles: futility stopping, group sequential, sample size re-estimation. A Goldilocks design is an alternative here, which is intended to account for incomplete information at each of the interims. All right. So an example of this. Suppose that I’m testing

a dichotomous response. So everybody is a yes or no. And suppose that on my control arm I anticipate a 30% response rate. And in my treatment, I’m expecting 50%. This is my hope. So going in, I’m powering the trial for a 30% versus 50% effect. A standard trial here might

have about 100 patients per arm. Gets us 83% power. So a standard way of

conducting this is I would enroll 200

patients, 100 in each arm, I would do at the end of it a

standard kind of normal test. If my p-value is less than 0.025

one-sided, then I’d say, hey, the treatment is efficacious. OK, well that

requires 200 subjects. I’m going to try to

convince you in a moment that what you’ve just bought is

a horribly expensive insurance policy against bad luck, and

you typically don’t need it. So let me ask a question. I’m going to take a poll here. What if we actually observe? So this isn’t about the true effect of the drug. This is what if I actually

get in my trial 30 out of 100 responders on control and I

get 50 out of 100 on treatment? So I powered it for that effect. Suppose it actually

happens, suppose I actually get that effect. What’s the p-value? More, equal, or less than 0.025? So I powered it to get the

significance at this level. How many people think it’s more? How many people

think it’s equal? How many people think it’s less? How many refuse to answer? Fair enough. All right, it’s

actually really small. It’s about 0.0016. And that’s for designing

a trial at an 80% power. If you design a

trial at 90% power and you get the observed effect,

the p-value is about 0.0006. So if you actually get what
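[Editor's aside: that observed-data p-value is easy to sanity-check. Below is a minimal sketch, my reconstruction rather than the speaker's actual calculation, using a one-sided two-proportion z-test with an unpooled standard error.]

```python
import math

def one_sided_p(x_c, n_c, x_t, n_t):
    """One-sided p-value for a two-proportion z-test (unpooled SE)."""
    p_c, p_t = x_c / n_c, x_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = (p_t - p_c) / se
    # Normal upper-tail probability via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# Observe exactly the effect the trial was powered for: 30/100 vs 50/100
print(round(one_sided_p(30, 100, 50, 100), 4))  # ≈ 0.0016
```

[A pooled-variance version of the test shifts the number slightly, closer to 0.002, so the exact figure depends on the test chosen.]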

you expect might happen, you’re far into the

realm of significance. So why did we need 200? What were we really

doing this for? I had an interesting conversation with the CEO of a small biotech. We were doing a kind of

adaptive design trial, and he asked me the question,

why did you pick 200? It happened to be

the size of his trial as well, not the same as this. But in any case,

he was an engineer. For what it’s worth, I broke

five generations of engineers to become a statistician,

much to the disappointment of my father. So anyway, I have these

conversations a lot. Essentially– and interestingly, I’m putting some words

in his mouth, but I think what was essentially

the expectation here is that I picked 200

because somehow I knew when the p-value

would be significant. That somehow I was going

to get a pattern like this. So at n equal 100,

the p-value’s 0.27. And as I go to

150 and 200, 200’s kind of the magical time where

the p-value gets under 0.025, and I picked 200 because I

knew that’s what would happen. And you certainly can

get data like this. But in general, data can

come in lots of forms. I didn’t cherry-pick

this particular example, I just basically sat on my

computer and flipped coins. And this is also a pretty

standard trial that comes up. p-value’s 0.00015 at 100,

I’m already significant, and it’s tiny at 0.000001

when I get to 200. So I’m in a position– I didn’t need all

those patients, I’m incredibly significant. And certainly what I did

is I flipped these coins under the assumption of 30/50. This is what I powered for; this is where all the statistical methods said you need 200 subjects. Well, in this case, I didn’t. I got it a lot

earlier than that. Here’s another one. This is one, it’s– again, I’m flipping

30 versus 50. This one never gets

to significance. We got the 0.1, 0.06, 0.19,

it’s going up and down. This is a trial that

we know that if it has 83% power, 17% of the time

you don’t get significance. This is one of those 17%. All right. So what happens in

lots of these trials? I just gave you

three and showed you what happened at 100 to the

p-value, at 150 to the p-value, and at 200 to the p-value. So I generated 50 trials. And all of these

lines, I have plotted each of these lines– this is

the p-value over time at 100, 120, 140, 160, 180, and 200. All of these are simulated

30% control, 50% in treatment. Now one thing that’s
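[Editor's aside: the 50-trial picture can be reproduced with a short Monte Carlo sketch. This is hypothetical code; it assumes the interim looks are total enrollment split 1:1 between arms, and it uses an unpooled-SE z-statistic.]

```python
import math
import random

random.seed(1)
LOOKS = (100, 120, 140, 160, 180, 200)  # total enrollment at each interim

def z_scores_one_trial(p_ctrl=0.30, p_trt=0.50, looks=LOOKS):
    """Simulate one 1:1 trial and return the z-score at each interim look."""
    per_arm = looks[-1] // 2
    ctrl = [random.random() < p_ctrl for _ in range(per_arm)]
    trt = [random.random() < p_trt for _ in range(per_arm)]
    zs = []
    for n_tot in looks:
        n = n_tot // 2  # patients per arm at this look
        pc, pt = sum(ctrl[:n]) / n, sum(trt[:n]) / n
        se = math.sqrt(pc * (1 - pc) / n + pt * (1 - pt) / n) or 1e-9
        zs.append((pt - pc) / se)
    return zs

# How often is the very first look (n = 100 total) already past z = 1.96?
first_look = [z_scores_one_trial()[0] > 1.96 for _ in range(2000)]
print(sum(first_look) / len(first_look))  # roughly half the trials
```

[Under 30% versus 50%, roughly half of simulated trials already clear the nominal 0.025 line at the first look, which is the "most trials are significant long before 200" point made here.]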

certainly going on is all of these

lines are going up. Not uniformly, so some

of them go up and down, but the general trend is up. The other thing

that’s going on here is these are the

lines of significance. That’s the– for

0.025, 0.01, and 0.001, I’m plotting the

z-score on this axis so I didn’t put lots of

tiny p-values in my graph. So what happens here? Well at 100, you can

see a lot of these are already significant, and

they become highly significant. At 150, even more

are significant. So most of these,

you can see there’s a whole bunch of trials,

they’re significant and they’re significant

far before I got to 200. So why did I get 200? What I’ve done is I’ve

bought an insurance policy. I’ve picked 200 not because

that’s when the p-value is going to get above 0.025. What I’m waiting for

is the slow ones. So like this red

one, I’m waiting for trials like

that to finally have gotten above that

line of significance. And I’ve chosen that 200– when I say I have 83%

power, 83% of the lines are in the significance region

above 0.025, and another 17% are below. And what I’m doing when I choose

a sample size of 200 is I’m waiting for that to occur. I’m waiting long enough

for all of these laggards, and I’m not taking

advantage of the fact that most of the trials

have been significant for a long time

before I got to look. All right, so a group sequential

is a way of handling this. We have to pick interim analyses, and you can pick a couple– 100, 150, 200. You can pick many– 100, 120, 140, 160, 180, 200. So I can pick lots of interim analyses. Generally the more you pick, the more efficiency you get. There are diminishing returns. So one thing we often

do is we basically calculate, if you do two, if

you do three, if you do four, if you do eight,

here are the results. And you need to

decide how complicated it is to run an interim versus

the statistical efficiency involved. We’ve had trials

where it’s incredibly easy to run an interim and you

may as well run tons of them. Other times, oh, we’ve got

to have a huge DSMB meeting, there’s all kinds of reviews,

and that may be an incentive to run fewer. So here is the kind of

example that you might see, and this is 120, 140, up to–

sorry, 100, 120, 140 up to 200. One thing that’s going on

here, you’ll see at 100, I’m going to declare

success if the p-value is less than 0.0031. I need a highly significant

result in order to get that. Now keep in mind, there are

a few trials that hit this. That’s 0.001. So some are going to make it. At 120, it gets a little bit

lighter, at 140, a little bit. And these are going up. And finally at the end, I

need a p-value of 0.0183. The way I tend to

think about this, if you go to like a carnival

game, you’re standing in front, and they’d say, OK, you’ve got to throw a ball through a basket in order to win a prize. There’s a certain

sized hoop that you’ve got to throw it into. Ordinarily if you’re

going to get one throw, I’m going to let you have– you need a p-value of 0.025. There’s a certain sized hoop

you have to get it into. If I’m going to give

you multiple throws, so I’m going to let you throw

it– how many is in this? 100, 120– that’s six different throws that you’re going to get. I can’t give you

the same sized hoop for every single one of

them, because what happens is that increases your

chance of getting one even if you’re a bad shot. So what all these are

aimed at, the reason that all these

p-values have changed, is because you don’t get

one throw at a big hoop, you’re getting six throws at smaller hoops. It’s designed so you

don’t lose any power, you don’t lose any

type I error, but it’s designed to allow you to

win this trial earlier. And again, the notion going back

to this graph, what I’m doing is I’m just figuring out

when this line finally gets high enough, I’m going to

go ahead and declare success. I’m not going to wait for 83%

of the trials to get there, I’m going to stop this trial

for success whenever a trial actually gets high enough. It turns out, in this particular

trial, 28% stop at 100. And that’s, again, 30 versus 50. This is what I’m powering for. So that I’m not– this is

not for a dramatic effect. These are just ones

that got lucky early. 12% at 120, 11%, 11%, 10%. 7.9% win at 200, and then here

is 18% that lost at the end. So these are ones that went

through the entire sequence and never actually made it

through any of these hoops. If I look at this,

my power is 81.8%. Now remember, I was at 83.3%. So the first trade-off

here is I’ve lowered power. By looking earlier

and splitting all– by throwing through six hoops,

I’ve lost a little bit of power by doing this. What I’ve gained is my expected

sample size is now 148. Some of these trials stop at

100, 120, 140, they stop early. A few of them go

all the way to 200. So I’m basically saving

25% of my sample size, and it has cost me 1.5% power. So I have to decide
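[Editor's aside: the 148 is just a weighted average over the stopping distribution. A quick sketch, using the rounded percentages quoted above, so the result lands near, not exactly on, 148.]

```python
# Stopping distribution quoted in the talk: wins at each early look, plus
# everything (wins and losses) that runs to the final look of 200.
looks = [100, 120, 140, 160, 180, 200]
p_stop = [0.28, 0.12, 0.11, 0.11, 0.10]  # probability of an early win at each look
p_stop.append(1 - sum(p_stop))           # remaining trials reach n = 200

expected_n = sum(n * p for n, p in zip(looks, p_stop))
print(round(expected_n))  # 149 with these rounded inputs, close to the quoted 148
```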

about that trade-off. I can actually pay that

trade-off in a different way. Suppose I say that I’m

just going to increase my maximum sample size. So now I’m going to run interims from 100 up to 220. So this is a possibility– I may need a larger trial. But what does this do? So again, I’ve

split this all out. You can see, here’s

20% of my trials, they’ve had to go to 200. No, this one. 21.5% reached 220. So I’ve made 21% of my

trials longer at 220. And in addition, a lot

of them are smaller, my expected sample size is 156. So if I have to fund a lot of

trials, if I’m doing this– how many of you have run more

than one trial in your life? How many people have run

more than five in your life? How many have run more

than 100 in your life? OK, there is a maximum– OK, just making sure. By the way, I know an

Alzheimer’s researcher who’s run like 100 trials

and never had a success, and you really ought to be

hitting the type I error rate at this point, so

that’s a little unlucky. But in any case, this is– so we’re saving sample

size on average. If I have to do this repeatedly,

I have to go up to 220 every once in a while, but I

get to go down to 140, 160 a lot more. So that’s why I get to save

sample size in doing this repeatedly. It’s also worthwhile

to consider futility. So suppose we were

140 patients in, and what we observe is 15 out

of 70, that’s 21% on control, and 19 out of 70,

that’s 27% on treatment. Where am I right now? I’ve got a p-value of 0.2147. OK, that’s not bad. Certainly not 0.025. One question to ask: is

this trial worth continuing? So certainly if you are

on a standard DSMB– I’m on a lot of

DSMBs, we would look at this– in the absence of

a rule that says to stop, what would the discussion be? Well, this isn’t going

as well as we hoped, but there’s no evidence

it causes harm, the trend’s in the

right direction. We have no reason

to stop this trial, we’re going to go

ahead and continue it. So one question to

ask is, well, how likely is it that we’re actually

going to win this thing? We currently have

a p-value of 0.21. We know that we need

to get that to 0.018. And if we– you could

imagine applying this to a non-group sequential

design where you have to get the p-value to 0.025. It would look very close. But here, we’ve got

to get to 0.018, and we’ve got to do it by 220. OK. In order to get a

p-value of 0.018, that requires about a

15% observed effect. Your friendly statistician can

back-calculate that for you. All right, so we

need a 15% effect. We’ve got about a

6% effect right now, and we’re 140 patients in. So at 220, in the next 80

patients, what do we need? Well we need a 32% effect

on those next 80 patients, and right now we’re

running at six. So something’s got to give here. Either we’ve been
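[Editor's aside: the back-calculation behind that "32%" can be sketched in a few lines. This is my arithmetic, using the per-arm counts implied in the talk: 70 per arm now, 110 per arm at the maximum of 220, and the roughly 15-point final difference quoted above.]

```python
# Futility arithmetic at the n = 140 interim (70 per arm), assuming a
# maximum of 220 patients (110 per arm) and a needed final difference
# of about 15 percentage points.
n_now, n_max = 70, 110            # per-arm counts now and at the maximum
resp_ctrl, resp_trt = 15, 19      # observed responders on each arm
needed_final_diff = 0.15

diff_now = resp_trt / n_now - resp_ctrl / n_now   # observed effect so far
needed_resp_gap = needed_final_diff * n_max       # responder gap needed at 220
gap_now = resp_trt - resp_ctrl                    # current responder gap: 4
remaining = n_max - n_now                         # 40 more patients per arm
needed_remaining_diff = (needed_resp_gap - gap_now) / remaining
print(round(100 * diff_now, 1), round(100 * needed_remaining_diff, 1))
# ~6% observed so far versus ~31% needed on the remaining patients
```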

unlucky to date and our luck is going to

have to change significantly, or this might be a trial that

we’re just not on pace to win. We often use predicted

probabilities in order to calculate this. So I can ask the question,

what is the likelihood that this is going to win? There are two pieces to

the uncertainty in that. In the next 80 patients,

there is sampling variability. If I told you– if mother

nature came out of the forest and said the true rate

is 40%, in 80 patients you’re going to get

40% plus or minus. So that’s sampling variability. The other thing is,

mother nature does not come out of the forest

and tell you it’s 40%. So you don’t really know

what the rate is either. So there’s uncertainty

in the parameter. What predicted

probabilities do– and I’m omitting the details here, but I can certainly

refer to papers, and I’ll be happy to

write blogs on this or whatever people

would like to see– but a predicted

probability incorporates both of those uncertainties. And you can calculate

the predicted probability of success is 6.9% here. It’s quite low. And I hope that actually

matches the intuition, 6% on the first 140

need 32% on what’s left, that’s a big shift. It’s not very likely to happen. All right, so I’m going to

add futility to the example that I just had. And at each interim I’m

going to stop for futility if the probability of eventual

success is less than 5%. Now that’s 6.9 I just

did, I would have let that go with this rule at 5%. I would have continued

on to the next interim. The value that you pick
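[Editor's aside: for readers curious how a number like 6.9% is produced, here is a hedged Monte Carlo sketch of a predictive probability: flat Beta priors, posterior draws for each arm's true rate (the parameter uncertainty), binomial completion of the remaining patients (the sampling variability), and a final test against the 0.018 boundary. The design's actual calculation may differ in priors and in how intermediate looks are handled, so this only lands in the same low ballpark.]

```python
import math
import random

random.seed(7)

def predictive_prob_success(x_c=15, x_t=19, n_now=70, n_max=110,
                            sims=4000):
    """Monte Carlo predictive probability of final success (sketch only)."""
    z_needed = 2.097  # approx. upper-tail z for one-sided p = 0.018
    rem = n_max - n_now
    wins = 0
    for _ in range(sims):
        # Parameter uncertainty: draw each arm's true rate from its
        # Beta(1 + successes, 1 + failures) posterior
        pc = random.betavariate(1 + x_c, 1 + n_now - x_c)
        pt = random.betavariate(1 + x_t, 1 + n_now - x_t)
        # Sampling variability: complete the remaining patients on each arm
        fc = x_c + sum(random.random() < pc for _ in range(rem))
        ft = x_t + sum(random.random() < pt for _ in range(rem))
        qc, qt = fc / n_max, ft / n_max
        se = math.sqrt(qc * (1 - qc) / n_max + qt * (1 - qt) / n_max) or 1e-9
        wins += (qt - qc) / se > z_needed
    return wins / sims

print(predictive_prob_success())  # low, in the neighborhood of the quoted 6.9%
```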

here, this is something, again, your friendly

neighborhood statistician ought to be computing. If I do 5%, if I do 10%, if I do

15%, what are the consequences? The things that you’re

trying to manage here, you certainly want to

stop poor treatments. If things aren’t going well

because your drug is not working, you would

like to stop the trial. You also– remember that even

in good scenarios, sometimes you get unlucky. That 17% of trials

that didn’t win, I don’t want to go

out to 220 for those. If I’m going to lose,

I’d rather lose at 140 than lose it to 220, it’s

just easier on everybody. The other thing that

I have to do here is I don’t want to declare futility too often, because every once

in a while, there’s a trial that looks bad early

that comes back and wins. And if I declare futility

too aggressively, what happens is I’m going to lower my

power because I’m going to eliminate that possibility. So here what I’ve done: the threshold on the probability of eventual success is at 5%. I incorporated that rule. And so now what you

can see here is these are the same p-values

up to 220, this is the probability of winning

at each of the interims, and then there’s a probability

of declaring futility. There’s a small

number of trials, they look really

bad really early. And so those 3% stop. And then just a few stop at each interim. I want to emphasize that

most of the trials that are stopping for futility,

if you would let them go, they would have lost. Again, we’re trying to

find that 17% that failed and get rid of them earlier. I have made a few mistakes. So remember, the power was 85%. Futility has cost me 1.7%

power by putting it in. 1.7% of trials I stopped

for futility that would have gone on to success. However, that 83.3, I’m

now exactly the same as the trial you started

with, n equal 200 that had no

adaptation whatsoever. I’ve increased the

sample size, I’ve put on the group sequential,

I’ve put on futility, my overall power from all

of those things combined is the same. Only 10% of my trials are reaching 220. So this little extra that I spent, I’m not doing it very often, and in almost 80% of trials I’m actually saving. So certainly I’m trying to get

rid of this insurance policy that I bought. And the expected

sample size is 150. So I gained a little

bit in this scenario. This is when the drug works. Futility is not designed

for when the drug works, it’s designed for when

the drug doesn’t work. So what I illustrated here was that it didn’t cost too much power. This is what happens

under the null. I actually can stop almost

60% of trials at 100. So if the drug does nothing,

30% versus 30%, I’m out early. I can get rid of 80% of the

trials at 140 or before. So I’ve still got 83%

power in the alternative, I’m stopping on the null,

the expected sample size here is 123. They’re all out really early. Now if I were having to

fund a lot of trials, what does this do

to my population? I took a quick survey on

NHLBI areas and kind of what the success rate is in trials,

imagine that you had to fund a bunch of trials and 30% of

the drugs work and 70% don’t. Remember, trials, it’s hard

to develop drugs and have everything work here. If you were in that

boat, what would happen– we’ve obtained the same power. The expected sample size here,

in the 30% of trials that work, we averaged 150 subjects. In the 70% that didn’t,

we averaged 123 subjects. Overall for this population, I

averaged 131 patients a trial. Some of them are trials

that work, some of them are trials that

fail, the futility works better than the

success stopping, actually, and you can fund 52% more

trials than you could otherwise by doing this kind of thing. So again, the flexible sample
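[Editor's aside: the funding arithmetic can be checked directly, a trivial sketch using the averages quoted above.]

```python
# Mixture of funded trials quoted in the talk: 30% of drugs work (expected
# sample size ~150), 70% don't (~123), versus a fixed-size trial of 200.
avg_n = 0.30 * 150 + 0.70 * 123
extra_trials = 200 / avg_n - 1
print(round(avg_n), round(extra_trials, 2))
# 131 patients per trial on average; ~0.53, within rounding of the quoted 52%
```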

sizes, the expected sample size is significantly reduced– numbers in that 25% to 40% range. I haven’t talked

about this, but I said it was so uncertain

when significance occurs, if you actually didn’t

know it was 30 versus 50, maybe it’s 30 versus 70,

maybe it’s 30 versus 50, maybe it’s 30 versus 40, that

increases the uncertainty all the more. You want to have more

interims to account for that. And finally, you do

have to pay for this. You have to account for

the multiple analysis, those p-values they

can’t all be 0.025. You may have to increase

the maximal sample size. And also, keep in mind, if

you have safety requirements, stopping early might

invalidate those. If I need 200 subjects to get

an adequate safety database, I can’t do this. If I need to make sure I have

adequate numbers of people in different ethnic

groups, for example, this may be difficult to do,

you have to make sure that those kind of

operational concerns are also handled in

this, so all of those are part of this

group sequential. All right. I’m going to switch gears. My colleague will– you wanna ask questions? Questions? Yes, sir? Oh, can we give you the

mic for the recording? Adam [INAUDIBLE]

from [INAUDIBLE]. So what we find is,

DSMBs, as you noted, do they listen to the

futility and efficacy rules that you put in? How binding are these, and

what type of information do you put in the

protocol to make sure that the interim

analysis and futility have a lot more context around it

so when they get to a meeting and he’s able–

should we stop or not, there’s more context behind

what went into the boundaries? So the question was

essentially, what goes into a DSMB, what goes

into the protocol, what goes into the charter, how

is this actually implemented? I’m thinking three

things, I’m going to start going through them,

I’m going to end up with two. There are three kinds of people,

those that can count and those that can’t. So anyway, one thing,

certainly in the protocol, it needs to be very explicit. Futility will be declared

if the predicted probability is less than 5% and

here is the calculation, either in the protocol or in the SAP referenced from the protocol. Usually at the kickoff meeting

to the DSMB if not prior, you also would have a

discussion with the DSMB. If you’re the sponsor, you want

to make your expectations clear here. No, this is the

rule for futility, and we want this to be

followed, so that everybody is clear on that. The other thing that you

would show is example trials. Here is the kind of data. If you’re at 140 and this

is the data that you see, these are our criteria for stopping. And so that’s a reason– effectively you’d end

up saying things, look, if we have less than a

6% effect at n equal 140, we want you to stop, and

that’s what this rule says. And at that point, if the

DSMB, if there’s uncertainty, you can have a long

conversation about it, and nobody’s unblinded yet because there isn’t any data. So you want to be upfront about

what the expectations are. I have heard horror stories of– and academic settings

are a little different. Private sponsors are really

picky about making sure that futility

rules are followed. I know of cases where the DSMB

went rogue and essentially said, we kind of

want to see what’s going on in this one subgroup. So they really ran the

trial twice as long after the futility rule had been hit. They have not been hired as a

DSMB member very often anymore. But anyway, you

want to make sure that those expectations–

really, it’s about, you don’t want this

to be embedded. This is part of the discussion

and laying it all out in advance, and

making sure everybody has bought in before anybody has to be separated. All right. So the next topic we’re

going to talk about is single-arm trials. This kind of falls

under the same banner as how do we minimize

our resources when we’re running a clinical trial. So one way, as

Kert described, is to have a flexible sample size. Another thing we sometimes

do is run a trial where all of our

participants receive the investigational therapy. So these are what we

call single-arm trials. There’s no placebo arm or

no control arm in the trial. So when we run a single-arm

trial, what that really means is we’re assuming that

we have some information external to the trial that we’re

going to bring into the trial. So we’ve essentially

made some assumption about what patients

might have looked like if we had a control arm. So we’ve made an assumption

about what our average rate is, about what our response rate is. Assuming that that’s

what the behavior is for patients that would have

been assigned to a control arm. So this might come from

historical data, for example. Maybe we’ve run another

trial in this population. It’s common to do

this in rare diseases. One of the big

reasons that we might consider a single-arm trial is

that patients are hard to find. So if we have a hard

time recruiting patients, finding patients, a single-arm

trial might be beneficial here. So we’ll talk in

detail a little bit about some of these

gains and losses. One of the main reasons

that we would consider a single-arm trial

is the sample size is typically going to

be much, much smaller than what you would

need if you were running a randomized trial,

because you’re only going to have one arm. So instead of having to

randomize subjects between two arms, you just have one

arm and you need about half as many subjects. It’s also– one

of the big reasons is it’s a lot easier to enroll

patients in a single-arm trial potentially. So if patients are unwilling

to be accrued to a trial where they know they have a

chance of getting a placebo arm, and maybe they would

prefer to enroll to a trial where they know they’re going

to get an experimental therapy, then it may be hard to

enroll to a randomized trial because patients just don’t

want to commit to that. So those things make the

single-arm trial beneficial, but some of the losses– and we’ll go through

in detail and describe these– one is that you need

that historical estimate that you’re going to compare to. So you need to have some

assumption about what your control arm rate is. And as we’ll see, if we’re not accurate about that, if we are off in that estimate, we can see some

severe losses in terms of our power and our

type I error rate. And one of the other big

issues with single-arm trials is, we’ve violated one

of those kind of pillars that Kert described at the

beginning of the session today. We’ve lost blinding. And because of that,

we can see some biases both from the

patients– they know they’re receiving the

experimental treatment; and also from their assessors,

their clinicians who also know that these are the patients that

received experimental therapy. So we lose that randomization,

we lose the blinding, and because of that,

we start to see biases. So one of the innovative

trials that we are going to talk about

now is, how can we use historical information

that we have but do it in a way that we’re not running a

single-arm trial and all of the caveats that

come along with that? And what we’re going to do is

borrow historical information in a statistical sense. Essentially what

we’re going to do here is augment a control arm

in a randomized trial with historical data. And what that allows

us to do is now instead of doing just a

one-to-one randomization– so half of our patients

get control, half of them get the treatment, because we’re

going to now start borrowing information from

the historical data, we can adjust that

randomization ratio. So instead of a

one-to-one, we can maybe do a two-to-one, a three-to-one,

or even more extreme so that more of our

patients are receiving the experimental

therapy, fewer of them are receiving the control arm. But because we’re using

that historical data, that helps to augment our

data on the control arm. So the benefits that we get
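[Editor's aside: one common way to formalize "augmenting the control arm" for a dichotomous endpoint is a discounted Beta prior, sometimes called a power prior. The sketch below is illustrative only: all the numbers, and the choice of discount weight, are hypothetical rather than from the talk.]

```python
# Hypothetical sketch of borrowing historical control data with a
# discounted ("power prior") Beta model.
hist_resp, hist_n = 60, 200  # historical control data (hypothetical)
w = 0.5                      # discount: count each historical patient as half
cur_resp, cur_n = 9, 30      # concurrent control arm (hypothetical)

# Beta(1, 1) prior, historical data downweighted by w, current data in full
a = 1 + w * hist_resp + cur_resp
b = 1 + w * (hist_n - hist_resp) + (cur_n - cur_resp)
post_mean = a / (a + b)
effective_n = a + b - 2      # information carried beyond the flat prior
print(round(post_mean, 3), round(effective_n))  # 0.303 130
```

[The weight w trades borrowing strength against robustness: w = 0 ignores the historical data entirely, w = 1 counts it in full, and intermediate values hedge against the historical and concurrent controls differing.]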

from this is we retain blinding. So we now still have that– we removed some of those

biases that we might see in a single-arm trial. The other thing is that the– since we are still

collecting data in our trial about

control subjects, that acts as a verification. So if we know what the

historical data was and we get data in the trial,

we can see how closely those two things align. So in a single-arm trial, we’re

just assuming that we know it. In a randomized trial, we get

to compare and see how similar or how different was the control

data in our current trial to the historical data. So historical

borrowing– and we’ll talk through a little bit of

the details of how this happens. The gains here is

that because we are using that

historical information, we require fewer subjects

in our randomized trial. So the sample size

benefits, compared to a standard one-to-one

randomized trial, are significantly smaller. So that lowers our risks. And we talked earlier about how

single-arm trials may be easier to enroll: if patients are averse

to receiving a placebo arm, they may not want to enroll

to a randomized trial. So here, if we’re able to

run a randomized trial that does borrow from

the historical data, it can potentially be

easier to accrue because we need fewer control subjects. Most of the subjects

that enroll to the trial are going to be assigned

to the treatment arm. So that helps to mitigate

that risk of our accrual being slower than expected. As before, if we were

doing a single-arm trial, we would need a

historical estimate. If we’re going to borrow

from historical data, well, we need the

historical data. So we’re relying on

having some information, either from a previous

trial, maybe a registry that we can use. And the other thing is– so what this design

is kind of built on is how closely the

current control data is to our historical data. And we’re kind of banking on

that being true that these two things are similar. If it turns out that our

current control data differs from what we’ve

seen in the past, we have a limited

backup plan for what to do if that doesn’t match. So we’ll talk through a

concrete version here. Suppose that we’re going

to run a single-arm trial and we’re looking at a

dichotomous outcome– this is maybe a responder rate kind of an outcome. And what we would like to show is that our responder rate is greater than some estimate p0. So p0 is essentially our historical rate that we're going to compare to. So where do we get that? So one option is that we have

an expert or a panel of experts who agree that 30% is the

right number just based on their expertise. It could be that

we’ve seen some data from a small natural

history study. So from patients

who weren’t treated or who were given kind

of a standard treatment. So maybe in this small

natural history study, we’ve seen three patients that

responded out of 10 patients, so we have about a

30% rate here too. Other ways that we

can get information about this control

rate, maybe we look at retrospective

chart reviews. Maybe we have some data

from a large clinical trial. So for example,

suppose there was a trial that had 200 patients

and 60 of them were successes. Again, that comes out

to about a 30% rate. But of course, if we

were able to enroll a large clinical trial,

there’s no reason why we should have to

run a small trial now. So now suppose that we’re

going to test the hypothesis that our response rate

is bigger than 30%, and our rule here was

we’re going to run a trial with 20 subjects. And if we see 10 or more

of our subjects that respond to the

treatment, we’re going to declare that trial a success. So if we apply this rule, it

turns out that our type I error rate– so if our

rate is truly at 30%, our type I error

rate is about 5%– 4.8%. And if we assume that our treatment has a 60% rate and we're comparing to that 30% on the control– so we have a 30% improvement– this rule of 10 out of 20 successes gives us a power of 87.2%. All right, so we're done, right? We have over 80% power, we

have 5% type I error rate. But the problem

here is that we’ve made an assumption that 30% is

the correct historical rate. And there’s not always a lot

of certainty around that. If we look at the

literature, we’ll see a lot of times

in clinical trials that are run very

similarly, there’s variability in

what that rate is. So in these characteristics
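As a concrete illustration of that variability (this is my own sketch, not numbers from the talk): the 3-out-of-10 natural history study and the 60-out-of-200 trial mentioned earlier both give a 30% point estimate, but exact Clopper-Pearson confidence intervals show how different the certainty is:

```python
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial rate."""
    lo = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi

print(clopper_pearson(3, 10))    # roughly (0.07, 0.65): 30% is barely pinned down
print(clopper_pearson(60, 200))  # roughly (0.24, 0.37): much tighter
```

So the same 30% point estimate can carry wildly different amounts of certainty depending on where it came from.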

that I just showed you, a 5% type I error

rate and an 87% power, this is all predicated on

that 30% being the right rate. So we’ve chosen that, then

we’ve assumed that it was true. But what happens if that 30%

is actually not the right rate? So we said on the previous

slide that our type I error rate was about 5%– 4.8%. So what we’ve done when we

calculated that type I error rate is we assumed that our true

response rate on the treatment and our true response rate

on the control arm is 30%. And it turns out that

if we run a trial and we count 10 out

of 20 successes, then that gives us a type

I error rate of 4.8%. But this definition never asks

whether 30% is the right rate. So suppose instead that the treatment and the control arm are both at 40% instead of 30%. So here again, there's no

difference between the control and treatment, it’s

a null scenario. But now instead of both being

at 30%, they’re both at 40%. What happens? We think that a type I

error rate is still 5%? So it actually turns out

that if the true rate is 40%, the probability

of a successful trial under that rule that we’ve

created is no longer 4.8%, it’s actually gone up to 24.5%. So this is our type I error

rate if in fact our treatment and control are both 40%

instead of 30% that we assumed. And it gets worse from there. So if we’re even more wrong– so if the truth is actually

a 50% rate on both control and treatment, our type I error

rate goes up to almost 59%. So we’re nearing 60%. So the graph here– what this is showing
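These numbers are straightforward to check directly from the binomial distribution. Here is a quick sketch (assuming the 10-or-more-out-of-20 success rule described above) that reproduces the 4.8% and 87.2% figures, and the inflation at 40% and 50%:

```python
from scipy.stats import binom

def prob_success(true_rate, n=20, threshold=10):
    """Chance of declaring the trial a success: `threshold` or more responders out of n."""
    return binom.sf(threshold - 1, n, true_rate)

print(prob_success(0.30))  # type I error if the assumed 30% rate is correct: ~0.048
print(prob_success(0.60))  # power if the treatment truly responds at 60%: ~0.872
print(prob_success(0.40))  # type I error if the true null rate drifted to 40%: ~0.245
print(prob_success(0.50))  # ...and if it drifted to 50%: ~0.588
```

Sweeping `true_rate` over a grid produces curves like the ones on these slides.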

is on the x-axis is the true control rate. So remember, we assumed

that we had a rate of 30%. So that's this horizontal

green line right here. And if in fact that is true,

our type I error rate is 4.8%. So that’s where this line

crosses that green threshold. So the green line

here is at 4.8%. So what happens is

as our truth starts to differ from

what we’ve assumed, as it starts to

get bigger, what we see is the type I error

rate starts to go up. And the further away

from 30% that we get, the more we increase

our type I error rate. What happens to power? So unfortunately, our

troubles aren’t limited just to type I error rate. We also have bad things

happen to our power. So remember, what

we were looking for was a 30% difference. So 30% on the control arm versus

a 60% on the treatment arm. And we had seen that

we had about 80% power if the truth is that

our control rate is 30%. So what happens if our control

rate is actually smaller? So what if the control

rate is actually 20% and the treatment gets us to an effective 50%? So we still have a 30%

difference from 20 to 50. But what happens is now our

control rate has drifted, and what happens, as we’ll

see on the next slide, is that our power

starts to go down. So now on the y-axis,

instead of type I error rate, we’re looking at power. Again, if we looked at our

30% truth on the control rate, what we’re seeing is 87% power. But as our true control

rate starts to go down and we still assume

a 30% improvement, our power is now less

than what we had assumed. So we have bad things

happen on both sides. We start to get increases

in our type I error, we start to get

decreases in our power. And this is all because

we’ve made an assumption that 30% was the right rate. And when we were wrong about

that, bad things happen. So why did we run a

single-arm trial in place of– yeah. OK. Sorry. Are there questions, comments? So this idea of our

type I error rate and our power changing as a

function of the control rate, why didn’t we talk about this

when we run a randomized trial? Why am I talking about this

only when we’re talking about single-arm trials? Well we can see,

what I’ve done now is added a line to our plot

that shows what happens if we did a randomized trial. So now we still have a

20-patient trial, but instead of all 20 of those going

to a treatment arm, I’ve now randomized

10 of those patients to treatment, 10

of them to control. So there’s a line– it

may be difficult to see, it’s an orange line right here– that shows the type I error

rate across the same range of control rates. And as you can see,

that orange line is always below our 5% rate. So in a randomized trial,

the type I error rate– in the tails, as we start to get really far away, we see that it actually goes down. But across the range of control rates, our type I error rate is controlled. So we've limited the

chance of a false positive here, unlike that red line for the single-arm trial that we saw just a minute ago. What happens for power– again, we've added

this orange line, this is now our

randomized trial– is again fairly

steady, fairly flat over the range of control rates. It does start to drop. But the other thing

that we see here is that over a wide range here, the single-arm

trial has higher power, and that’s really a

function of the sample size. So the randomized trial,

we have fewer subjects on our treatment arm. So the main point behind

this is that history– in this case, we’ve

assumed something about our control rate– leads us to the best

guess of our parameter, but we could be wrong. Sometimes history

leads us astray. What we see is that our best

properties for our trial occurs when the truth is

close to our best guess. So we have a sweet spot: if our assumption is

close to the truth, we get good behavior,

but when our truth starts to differ from

what we’ve assumed, we start to get biases. So if our truth is

actually a lot different, we can either see inflated

type I error rate, or we can see some

big losses in power. All right. So now what we’re going to

transition to talking about is how we can use

historical borrowing– so use an innovative

statistical method to run a randomized trial so that we get those benefits of randomization while not needing a large study. So now I'm going to

consider an example, we’re running a phase II trial. And our sample size, the

most that we can afford is about 210 subjects. I’m going to

allocate my subjects in a two-to-one randomization. So out of those 210

subjects, 140 of them are going to go to the

treatment arm and 70 of them are going to go

to a control arm. And then I’m going to

use historical data to kind of augment the data that I get on the control arm here. So suppose that we have

this historical data. So maybe this was a

previous study that was run on the control therapy. And out of those 120 subjects

in that previous study, 70 of them– 72% of them were responders. So that’s a 60% rate from

that historical study, and that’s what we saw

in the control arm. So I’m going to show

you two examples here. I’m going to show you a

trial that kind of ignores that historical

data, and then I’m going to show you a trial

that uses that historical data and just pools it in with the

data from our current trial. So we take those 72

responders that we got in the historical study, and

we add them to whatever we see in our new trial. So we go back to
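As a sketch of what "pooling" means mechanically (my own illustration — assuming a Beta(1,1) prior and a success rule of posterior probability P(treatment beats control) above 0.975, neither of which is specified in the talk), the 72/120 historical controls are simply added to the current control arm's counts:

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_treatment_better(y_t, n_t, y_c, n_c, y_hist=72, n_hist=120, draws=200_000):
    """Posterior P(p_t > p_c) when historical controls are pooled with current controls."""
    p_t = rng.beta(1 + y_t, 1 + n_t - y_t, draws)
    # Pooling: historical successes and failures enter the control posterior directly.
    p_c = rng.beta(1 + y_c + y_hist, 1 + (n_c - y_c) + (n_hist - y_hist), draws)
    return float((p_t > p_c).mean())

# The 2:1 design above: 140 on treatment, 70 on control. Suppose the treatment
# responds at 75% and the current controls land on the historical 60%:
print(prob_treatment_better(y_t=105, n_t=140, y_c=42, n_c=70))  # comfortably above 0.975
```

The hypothetical 105/140 and 42/70 counts are assumptions for illustration; the point is only that the historical counts flow straight into the control arm's posterior.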

this graph again. So the green line here, this

is a 5% type I error rate. What we see, this is our trial

that pools the data together. So we take those 72

responders that we got on the previous

trial, we add that data to what we get in the

new trial and what we again see is inflation of our

type I error rate. Remember that that historical

data had a rate of 60%. So if the truth is

60% and we get data around that same number

in our current trial, our type I error

rate is actually a little bit lower than 5%. But as the truth

starts to differ from what we saw in

that historical data, we again start to see this

inflation of our type I error. So what happens

to the borrowing? So here, again, there’s a

reference line here at 60%. The purple line is what happens

if we pool the data together– so we take those 72 patients

from our previous trial we add them in with

our new patients. What we see is that

power starts to increase when we do the

borrowing as compared to a randomized trial that

doesn’t do the borrowing. So the orange line

here is if we just do that two-to-one

randomization but we ignore those 72 patients

that were responders on the previous trial. So if in fact our true

control rate is 60%– so it's on point with what we

saw in that previous study, we see these big gains between

pooling that data together and ignoring it. If our true rate starts

to differ from that 60%, though, we see the

power for the pooling to start to go down

quite substantially, where it starts to plateau

if we ignore that data. So what we’ve done

in this example was we’ve just decided

to completely 100% use the data that we saw

in the previous trial. We can also think about rules

that allow us to use that data, but in some sense downplay

it so that we’re not using the data to the full extent. So in– what we call

these are power priors. What these are static

weights that basically say, hey, I’m going to take

that historical data and I’m going to weight

it by some amount. So a weight of zero here would

correspond to no borrowing, so I ignore that

historical data. A weight of one

corresponds to pooling, so that’s what we saw

in our previous example. But now we can allow

weights that are in between. So for example, you could

assign a weight of 50%. So what that means is

each of those subjects in the historical data set

essentially counts as half of a subject in the new trial. So you’re kind of

accounting for the fact that that historical

data is potentially different in some way than

in your new population. So this is an

attempt to recognize the potential for drift

in your population, and account for the fact that

subjects in the new trial might be a little bit

different from subjects in the past trial, and so

we’re going to weight them a little bit differently. So for our statisticians

in the room, the way that we handle

this in Bayesian methods is what’s called a power prior. So what we get is if we were

pooling the data together, what happens is that

historical data essentially just gets treated like it

comes from the current trial. So we take our prior times

our historical likelihood times our current likelihood. What happens when we do a weight

that’s in between zero and one is essentially we’re

taking that historical data and raising it to a power. So what happens is if the

weight is equal to one, then this becomes

equal to pooling. If weight is equal to zero,

then this historical likelihood goes away. And anytime we have

a weight that’s different from zero or one,

that essentially just adds some kind of a weight

to the historical data. So how do we choose
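For a binomial endpoint with a Beta prior, this power prior has a simple closed form: raising the historical likelihood to the power W just multiplies the historical success and failure counts by W before the conjugate update. A small sketch (assuming a Beta(1,1) baseline prior, the 72/120 historical data from the running example, and a hypothetical 42/70 on the current control arm):

```python
def power_prior_posterior(y_cur, n_cur, y_hist, n_hist, w, a=1.0, b=1.0):
    """Beta posterior for the control rate under a power prior with static weight w.

    w = 0 ignores the historical data, w = 1 pools it fully, and values in
    between count each historical subject as a fraction of a current subject.
    """
    alpha = a + y_cur + w * y_hist
    beta_param = b + (n_cur - y_cur) + w * (n_hist - y_hist)
    posterior_mean = alpha / (alpha + beta_param)
    effective_hist_n = w * n_hist  # historical information, measured in current subjects
    return posterior_mean, effective_hist_n

for w in (0.0, 0.5, 1.0):
    print(w, power_prior_posterior(42, 70, 72, 120, w))
```

With w = 0.5, the 120 historical subjects contribute the information of 60 current ones, which is exactly the "each counts as half a subject" reading above.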

a weight for W? So common choices here

might be half or a third. So again, this is an

attempt to recognize that patients in

the current trial are somehow different

than patients in the past. So we’re going to

weight them maybe a half or maybe a third of the

patients in our current trial. As we start to get

values near zero, that essentially becomes like

ignoring the historical data. It’s possible that we could

use values bigger than one. So what that does is

essentially add more weight to that historical data. So if we have a really large

weight, what that essentially does is act like a

single-arm trial– so it makes the current

data that you’ve collected not as

important, it gets overwhelmed by that

historical data. So essentially, assuming

weights bigger than one, assuming large values

of W, it is essentially equivalent to

assuming that you have a massive data set from the

past and essentially know the answer. All right. So now what I’ve

done on this graph is show different values

of W– so different ways that we could weight

the historical data. So let’s see. So what we saw in the

past is our purple line– this is right here–

that was if we just assume that the historical

data counts just like our current data. So a weight of one

is that purple line. So this is our type I error

graph, and as we saw before, we have this

inflation as we start to differ from our 60% rate. But now what we see is that this

inflation of our type I error rate is dependent on how much

we weight the historical data. The more weight that we

give to the historical data, the more type I

error inflation we see as our rate starts

to differ from the truth. But as we put less weight

on the historical data, we see less of that

type I error inflation. So this line right

here, for example, is weighting the historical

data at just 10%. And we see that we still

do have some type I error inflation as our rate starts

to drift away from 60%, but it’s much lower

than if we just treat our historical patients

the same as our current data. So what happens to our power? So again, I’ll point out that

the line we looked at before– it’s a little hard to

see, but here’s our rate if the weight is equal to one. So again, we see this kind

of continuum of these lines: the less we weight our historical data, the more we start to approach the randomized trial; and the more we weight

our historical data, the more we start to see

these drops in our power. But again, most of the

time, as long as we’re close to our historical rate– sorry, as long as the truth is

close to the historical rate, we see this spot where

our power is always increased by the borrowing

relative to doing a trial that ignores that historical data. So how do we choose

a weight for W? How do we know

what’s the right way to weight the historical data? What we would really like

to do is have some mechanism that if the true parameter is

close to the historical data, we borrow a lot and

we put a lot of weight on that historical data. But if our true

parameter is different from the historical

data, then we would like to

downgrade that data and not count it quite so much. The problem, of

course, is that we don’t know what the true

parameter is, so how do we decide? Well, as in most

cases, the current data is a good way for us to judge. So what we’re going

to do is let the data help us decide

how much we should weight the historical data. So if we’re running our trial,

and in the current study the data comes in around

85%, and remember, our historical

rate was 60%, well that tells us that something

has likely changed. The parameter is

likely different from that historical

rate and maybe that’s a situation

where we don’t want to weight our historical

data quite so high. So any method that

we can consider that assesses the agreement

between the historical data and our current data is what

we call dynamic borrowing. So what happens is we

assess that difference between the current

and historical data, and we decide on a weight

that’s based on that agreement. So as in most things,

no method is foolproof. So it’s possible that the

current data leads us astray. So what could happen is

that there’s high drift– so really the current– I’m sorry, really, the

historical data differs from the truth, but just randomly our

current data happens to agree with the historical data and

we decide to borrow a lot, even when we shouldn’t. So that’s possible. It’s also possible that

there’s really in truth no drift– the

historical data is very much on point with the truth. But just for some random reason,

the current data randomly deviates, and so we

decide to not borrow so much when we should. Question? Yes, if you don’t mind

me asking a question now. Mm-hmm. So when are you making the

decisions about the weighting? Is this being done in an

iterative manner or not? And then the second question

is, what about a contemporaneous cohort? Would that be similar to

historical borrowing– bothering? Sometimes it is bothering. Because it sounds like quite

a bit of work, actually, and I don’t know about that. Right. So the question is, when do

we decide how much to borrow? So what we’re going to

do is actually create a statistical model that kind

of does this automatically. So when we run the analysis,

the statistical model does this assessment

to see the agreement between the historical

data and the current data and automatically adjusts

the degree of weighting. Does that answer your question? On multiple occasions? You can do this borrowing

just once during the trial. You can do interims where

you assess it multiple times. Each time you do the analysis,

it would reassess that. Was there another

question in the back? [INAUDIBLE] Oh OK, thank you. OK. [INAUDIBLE] So I think

the follow-on question to that would be, though, you’re

designing the trial upfront with an n and trying to figure

out how many subjects you need, and at that point, you don’t

take the historical borrowing into account, right? When you’re preliminarily

designing the trial or do you? Because the question is,

if it doesn’t turn out as your interim analysis

or continuous analysis actually shows

you, then what does that do to the

initial trial design and how do you sort

of make up for that? I think that would be I think

the follow-on question to that. Yeah. So I think if you’re considering

a design where you’re going to potentially borrow from

this historical information, that should be part of the

pre-specification that you do. So you should be

considering this upfront when you design the trial and

write down the analysis plan. And what else was

I going to say? Yeah, Kert. Hi. I think I got one. So another aspect to this,

this is somewhat getting back to this insurance policy idea. What you’re gaining here

is essentially upfront, you’re making an assumption. That by using this

historical data– I’ve seen multiple

clinical trials, they all come at a 30% rate, I

expect to see another 30% rate. I’m going to save

30% of my sample size by doing it this way. What you would

want in your design is it to be prospective that

if I can go from 1,000 to 700, I use 700. But if in fact they don’t

agree, it does make sense to have this possible extension

prospectively into your design, where at the beginning,

the protocol says, I’m going to go to 700,

this is the analysis, if certain conditions

aren’t met, then the trial will

automatically go to 1,000. You could also–

we’ve designed trials where we’ve

randomized two-to-one, making use of the fact that

we’re borrowing these controls. And if in fact they

don’t agree, not only does the trial

potentially get larger, but it also switches to

one-to-one randomization because it’s not borrowing

those controls anymore as much. So all of this is, we’re

making an assumption, we’re trying to make

use of it, but you have this prospective

backup plan which makes the trial larger

if the assumption’s not met. Another question. So I’m going to ask this at

the risk of having been late, so I apologize if you’ve

already gone through this. But this makes tremendous

sense in terms of– sense to me in terms

of trial efficiency. What I worry about is if the

natural history of the disease or the care is

changing, then you are going to be slow

to recognize that by continuously

borrowing from prior data to inform your existing trial. Is there a way of handling that? Is there a philosophy

around that concern? Why don’t you show the

dynamic borrowing graphs? OK. So I think I think some

of this will be addressed in what we’re about to show. So the question is

really about the drift and if there is a drift. And potentially if it’s

slow to materialize, are you going to be

able to recognize that during the

course of the trial, I think that’s part

of the question. All right. So I’m going to go through

some of the details of how we actually implement

this borrowing. So the methodology

here is called Bayesian hierarchical models. What these models do

is they explicitly have a parameter that assesses

the variability from study to study. So in our case, if we

have our current study and we have the historical

study, what the model is doing is assessing the variability

and the response rate from the historical data

to our current control. And this parameter that actually

measures that variability has a direct relationship to

how much weight we assign. So if the variability is small– so if our current data looks a

lot like the historical data, we have small variability. The model’s going

to borrow a lot and we’re going to make heavy

weight on that historical data. But if that parameter

for the study variability, if it says that

these two studies differ very widely, then that means

we’re going to have a smaller weight on the historical data. So in the model, as we

estimate this parameter, and we do that through

Bayesian methods, we get a posterior distribution

that determines the weight. So as we saw, the more borrowing

happens when you agree, less borrowing happens

when you disagree. So we do have some

statisticians in the group, so here's the mechanism for this. So in general, we're going to

let PC be the historical rate– sorry, the control rate

in our current trial. So C here stands for current. And then we have

potentially more than one historical studies. So in the example we

just had one data set, but we could

potentially have more, maybe these are

several small data sets that we’ve

looked at that all have the same kind

of control treatment. So P1 through P big

H, these are the rates from our historical studies. What we do is we

model the responders. So Y is the number of responders

on our current control arm, and we assume that's binomial. Historical data also has

a binomial distribution. And then what we do is

we assume that this set of historical control

rates has a distribution. Here we do the modeling

on the logit scale. And then what we do is

we assume that this group of historical rates has

a normal distribution with some mean and some

standard deviation tau. And this parameter tau is

what we just talked about. That's the parameter

that measures the study-to-study

variability, and that’s what’s going to control the

degree of our borrowing. So tau, this is the
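Putting those pieces together, here is a minimal sketch of the model just described — binomial data, a shared normal distribution on the logit scale, and tau estimated from the data — fit with a simple random-walk Metropolis sampler. The priors on mu and tau, and the 42/70 current control counts, are illustrative assumptions, not values from any actual trial:

```python
import numpy as np

def fit_hierarchical(y_c, n_c, y_h, n_h, iters=6000, burn=2000, seed=7):
    """Posterior draws of the current control rate p_c under the hierarchical model:
    logit(p_c) and logit(p_h) are both drawn from Normal(mu, tau)."""
    rng = np.random.default_rng(seed)

    def log_post(lc, lh, mu, ltau):
        tau = np.exp(ltau)                       # sample tau on the log scale
        pc = 1.0 / (1.0 + np.exp(-lc))
        ph = 1.0 / (1.0 + np.exp(-lh))
        lp = y_c * np.log(pc) + (n_c - y_c) * np.log(1.0 - pc)   # current likelihood
        lp += y_h * np.log(ph) + (n_h - y_h) * np.log(1.0 - ph)  # historical likelihood
        for x in (lc, lh):                       # shared normal on the logit scale
            lp += -0.5 * ((x - mu) / tau) ** 2 - np.log(tau)
        lp += -0.5 * (mu / 10.0) ** 2            # vague prior on mu (assumed)
        lp += -0.5 * tau ** 2 + ltau             # half-normal(1) prior on tau + Jacobian
        return lp

    state = np.array([0.4, 0.4, 0.4, np.log(0.5)])  # logit(0.6) is about 0.405
    cur_lp = log_post(*state)
    draws = []
    for i in range(iters):
        prop = state + rng.normal(0.0, 0.2, size=4)  # random-walk proposal
        prop_lp = log_post(*prop)
        if np.log(rng.uniform()) < prop_lp - cur_lp:
            state, cur_lp = prop, prop_lp
        if i >= burn:
            draws.append(1.0 / (1.0 + np.exp(-state[0])))  # back to the rate scale
    return np.array(draws)

# Current controls at 42/70 agree with the historical 72/120, so tau is estimated
# small and the model borrows heavily.
p_c_draws = fit_hierarchical(42, 70, 72, 120)
print(p_c_draws.mean())  # near the shared 60% rate
```

In practice this would be fit with standard MCMC software rather than a hand-rolled sampler, but the mechanics are the same: small estimated tau means heavy borrowing, large estimated tau means the historical data gets downweighted.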

most important parameter for the borrowing. If we just assumed that

we knew tau in advance and we don’t estimate

it from the data but we just assume

that we know it, that basically

corresponds to giving a specific static weight. But in Bayesian methods,

what we’re going to do is actually estimate tau as

we start to collect our data. And then what happens is

as we start to estimate tau and we see that there’s a high

study-to-study variability, we start to downweight

that historical data, but if we see that there’s

agreement between our two studies, then we start to

increase our borrowing. So what that does is it creates

what we call dynamic borrowing. So the degree of borrowing is

not pre-specified in advance. Rather, the model is

pre-specified in advance, and then as we

accumulate the data, we estimate that

variability, and the model determines how much we borrow

and it assesses that drift. So I think we’ve got that. So what are the

properties of this model? So we previously

looked at graphs where we had a static weight,

either a weight of one where we fully pooled the data

together, or a weight of zero where we completely ignored it. Now when I'm showing this

bright lime green line, this is what happens if we

do a hierarchical model. So we now allow

the data to dictate how much of the borrowing we do. So on the left this is

our type I error graph. What we see is as

the truth starts to differ from that

historical data, we do start to see some

type I error inflation. But importantly, that type I

error inflation is bounded, and it eventually

starts to go down. So what happens is the

model starts to recognize– as you get way out here so far

away from your true parameter, it starts to recognize,

hey, these two things don’t look the same,

it starts to downweight that historical data and

your type I error rate starts to come down. And similarly, in the power,

we also see a mild power loss compared to pooling. So here's the

historical borrowing. And you can see that

compared to if we just pool, we do lose a little

bit of power, but if we compare

to a design that ignores that historical

data– that’s the orange, we do see improvements

in our power. In terms of power loss–

so if our true parameter differs from the data

that we’re borrowing from, again, if we just pooled

that data, what we see are these huge losses in power. If we ignore the data, that kind of plateaus; and the historical borrowing, again, starts to plateau. So as your current data

starts to differ too much from the historical

data, the model recognizes that, starts to

downweight the historical data more so you don’t see these

dramatic drops in the power as we did before. So I’ll quickly go through

an example of this. So we had a trial

that was originally designed as a

non-inferiority trial with a fixed sample

size of 750 subjects, one-to-one randomization. So 375 subjects per arm. There were some historical

data that was available. There were actually two

historical studies here. If we use the

historical data and we use the hierarchical

model, what we found is that we could do a trial that

had 600 subjects rather than 750– so about 20% fewer subjects. Here we changed the

randomization ratio to two to one– sorry– yeah. Instead of one-to-one, we

changed it to two-to-one. So put 400 patients on

treatment, 200 on control. And for the expected drift–

so within the range where we thought there might

be some difference from the historical rate,

the type I error rate was controlled, and we

actually had comparable power to our original design. So we were able to

enroll 20% fewer subjects for about the same power and

about the same type I error rate. Let me pause again for

questions and see– I wanted to make sure I

addressed your question. Questions? So I think– It may not be appropriate

for this discussion, it’s more of a philosophical

one that if the natural history of the disease– that is, the focus of a trial

is actually changing, then the assumption around

how close this trial is to very good, very well-run

multiple studies with event rates before, one shouldn’t

assume that this trial should be weighted less

in its new control rate than the prior ones. And if we all do that

with all of our trials, we’re going to miss the idea

that the biology is changing or that the treatment effect is

not what it was 10 years ago. I think what you– you raise I think

the key point here, is what is your expectation? There is a sweet spot

where there is a benefit, and outside of that sweet

spot there is a loss. And really, it’s an

expectation of what is our belief on how often we’re

going to hit that sweet spot and how often we’re going to

have excessive drift on there. And in a lot of cases,

we expect the sweet spot to be likely to be hit, and

that doesn’t mean every time but enough of the time that this

is worthwhile in repeated use. I think kind of the other

philosophical point, if I think the natural

history of the disease is changing quickly, I

also have the worry– if yesterday’s trial is

not relevant to today, then today’s trial may not

be relevant to tomorrow. And so I have that other issue

of going forward as well. So I think that becomes

just a really hard problem to deal with, and this isn’t

designed to solve that. I think– OK. So I think this is a more naive

question than Neil’s, but it comes down to the

same issue, because I think the issue that

maybe a lot of us would be struggling

with is when do you even consider doing this? Because it sure

sounds like you could save a lot of

time, and certainly from an NIH perspective,

a lot of money if someone was able to use

historical borrowing. But when do people

consider this one? Neil raised the question of

the changing natural history of the disease. Another question I would

have is, do you even consider it, for instance,

if the historical data is retrospective

historical cohorts. Do you even consider– versus, let’s say, a

prospective cohort. Do you consider

observational study data versus prior-randomized

clinical trial data? Because a lot of the examples

that you’ve been using have been other randomized

controlled trials. In some of our

rare diseases where we want to use a lot of

these innovative methods in design, the best

historical data we have may be retrospective cohort

data, it might even be if we’re lucky

some observational prospective cohort data. It’s less likely in some

of our rarer diseases to be another randomized trial. So I mean, do you even

think about using this in those kinds of situations? So it depends. One of the key issues– so if

you go to the FDA perspective on this, they’re very upfront

that they fully accept the statistical methodology. And the hangup is

essentially, do we trust the historical data

that’s being brought to bear, which is exactly the

right question to ask. It does depend on the setting. There’s a pharmaceutical

group called TransCelerate– essentially a bunch

of pharma companies that have agreed to pool their data into

a repository exactly for this purpose

among many others, but this is one of

their key work streams. And they’re writing

some white papers on under what conditions do

we think this is valuable? What that white paper

will say is, essentially you need to look at the

area, areas like glaucoma, for example, very

stable over time. pCR (pathologic complete response) rates in breast cancer,

very stable over time. Depression studies, you can get

lots and lots of variability, and essentially it’s so large

this might not be a good idea, so you really have

to know the area. In terms of rare

diseases, the smaller the trial you’re running today,

the broader that sweet spot. There’s so much

variability in the study that anything you can do

to reduce the variability is bigger than any risk you

have to creating a bias. And so in rare diseases, you

have the interesting trade-off that the data might

not be as nice, but because the sweet

spot is so large, it can be in your benefit. So one thing we want to

do is when we design this, we write out the sweet

spot is this big. You need to be comfortable

that your rate is within these bounds or this

isn’t going to work for you. And then it becomes– I mean, I do not want to call it

a gamble, but in the long-run, it’s, do you believe

the sweet spot? And if you’re going to

be right 80% of the time, then this is worth doing. If you’re going to be right 40%

of the time, then this is not. And it’s a matter of obtaining

that confidence in repeated use. And I think– I didn’t ask
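The borrowing trade-off being discussed can be made concrete with a small sketch. Below is a minimal power-prior calculation for a control event rate; the numbers and the fixed discount weight a0 are made up for illustration (they are not from any specific trial, and dynamic-borrowing methods would instead tune the weight based on how well the data agree).

```python
def borrowed_posterior(y_cur, n_cur, y_hist, n_hist, a0=0.5):
    """Beta posterior for a control event rate under a simple power prior.

    Historical events are down-weighted by a0 in [0, 1]:
    a0 = 0 ignores the historical data, a0 = 1 pools it fully.
    Starts from a flat Beta(1, 1) prior.
    """
    alpha = 1 + y_cur + a0 * y_hist
    beta = 1 + (n_cur - y_cur) + a0 * (n_hist - y_hist)
    return alpha, beta

# Made-up numbers: 10/50 events in today's control arm, 100/500 historically.
a, b = borrowed_posterior(10, 50, 100, 500, a0=0.5)
posterior_mean = a / (a + b)  # pulled toward the historical rate of 0.20
```

Inside the "sweet spot" (the true current rate close to the historical one), this behaves like a trial with roughly a0 times 500 extra control patients; if the rate has drifted outside it, the same weight buys bias instead.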

the second part of my question very clearly now that we’ve

gone through some of this, but to get away from the problem

of the historical patients as we’ve been discussing,

could you use this methodology for a contemporary

control that’s not included in the study,

but is an observational cohort or on another registry? So that’s what I was

trying to get to. Do you have something to say? I’ll say it. Yeah, I don’t want to hog– This is– sorry, this is

a dear to my heart area. So contemporary–

the general issue of drift, one part of that

is separation in time. Other parts of it

are going to be different sites, different

standard of care, other aspects. So I think the general

notion of how close they are, time is one aspect of that

and other factors also come into play. So contemporaneous gets rid of

the time aspect that may not get rid of different

sites, other differences between what’s going on. We’ve had a lot of use. We run platform trials which

are 10-year continuous trials. Those are nice because you

can see the whole time course, and you can get an idea how

much have they been changing over time, and you’ve got

almost a built-in mechanism for checking this assumption. So contemporaneous,

I think they’re good, but it doesn’t eliminate all

the possible discrepancies. Let’s see, I think we

need to take a break. Deraus is informing me that

we’ve got goodies in the back. We’re going to take 10

minutes, will that work? Let everybody get some food. If it takes 12 minutes,

we’ll take 12 minutes. So we’ll let everybody

get some food and rest. We’re going to go ahead

and get started back up, so feel free to grab your

coffee and your goodies. All right. So we’re going to get

started back up again. So this morning– sorry, the

previous session, we mostly talked about ways that we can

efficiently use our resources. So we’ve talked about ways

that we can make our sample size flexible, ways

that we can use data that we have in hand to help

make our trials more efficient. One of the other broad topics

that we want to talk about is how innovative

trials can help us to answer broader questions. So a standard randomized

clinical trial requires focus. So typically what

we see is a trial that is a one-to-one

randomization looking at one treatment versus control. We have a very specific

homogeneous patient population that we’re looking at. We’ve made an assumption

about our treatment effect. And a lot of times, our

standard randomized trial doesn’t look at a lot

of different things. Like we don’t look at

treatment combinations, we don’t look at different

what we call domains. So there’s reasons

historically for why we have this focus in

our clinical trials, but one of the

things that we often think about when we’re in the

process of designing a trial is we try to anticipate if

the trial doesn’t come out the way we hoped. If we have our p-value at

0.06, for example, what would we regret

and what would we think we would have done

differently in our trial? Would I think, oh,

I should have looked at a different population? Or I should have had

a bigger sample size? Or I should have looked

at a different dose? And so we try to

think about what are those things in

our trial that we might regret if we don’t do them. And if we can think about

those things in advance, then we can make those

changes to our design up front to help reduce

that anticipated regret. So innovative trials,

rather than focusing on a small question often

aimed at answering a bigger question or a set of questions. So because of this, we

can broaden our focus so that we’re not just maybe

looking at a single population or maybe not just

looking at a single dose. And this makes our

trials more robust. So robust meaning that

we’re robust to some of these uncertainties. So if we have a narrow

focus and we’re just designing a trial with one

dose versus our control arm, we’re making an assumption that

we’ve picked the right dose. Or if we just look

at a population, certain characteristics,

we’re making an assumption that that’s the right

population where the treatment benefit occurs. And if there’s uncertainty

about that, there’s uncertainty about what the right dose is

or what the right population is and we choose incorrectly, there

are obvious consequences to that. So if we can make our trial

more robust and looking at broader questions–

multiple doses, multiple populations,

not only does that help with accounting for

the uncertainty that we have and maybe giving us some

protection against making a wrong assumption,

but it also allows us to look for things like

interactions between treatments that you don’t get

to do if you just selected one single treatment. So this idea of looking

at more questions, this is not a new idea. So a lot of us

may have been told that when you’re running

a clinical trial, you can only change one

thing at a time, right? But actually, R.A.

Fisher, one of the giants in the statistical

world, recognized in 1926 even that asking a very

narrow set of questions is not necessarily

the best idea, and that if we ask

logical, well-thought-out

questions, we have a better chance

of answering those questions. So one of the designs

that we can talk about to ask broader questions

is a factorial design. So the idea here is

that instead of comparing just one single

treatment versus control, we may be interested

in combinations or different types of

treatments in combination with one another. So for example, we may have

a trial where we have maybe– what I call a domain. So Domain A, you can either

give big A or little a. So maybe this is– in the example

we’re going to use, maybe this is patients

who are on a ventilator. So you have patients that’s

either on a ventilator, big A, or not on a ventilator,

that’s little a. And maybe you’re also

interested in another treatment, antibiotics. Patients are either on

antibiotics or they’re not, so that’s Treatment

Domain B. Patients can either get big B, which means they’re

assigned to antibiotics, or little b, which

means they’re not. So a trial that just

looks at ventilator versus no ventilator, or a trial

that just looks at antibiotics versus no antibiotics,

that would be a trial with a

single narrow focus. And what that doesn’t

allow us to do is look at any interactions

between these things. So what we could

do instead is look at a trial that looks at the

combination of these things. So you could do big A in

combination with big B; big A in combination with little

b; little a with big B; or little a with little b. So in the example that we

use, we just had two domains. You can generalize this. So maybe our example is not just

a ventilator and antibiotics, but maybe it’s a ventilator,

antibiotics, and steroids. We now have three domains,

and each of those domains has two possible treatments

with it or without it. And we could run

a trial where we look at all of those

different combinations. So now we’ve got ventilator,

yes; antibiotics, yes; steroids, yes; versus

ventilator, yes; antibiotics, no; steroids, yes. And looking at all of

these things together allows us to answer

a broader question. So this design is called

a factorial design. And these types of

designs are always going to be better

than a separate trial for each variable

because it allows us to look at the interactions

between those two things. So for example, if

you were running a trial where you

were just looking at steroid use in combination

with antibiotics– so big A, big B, and versus neither of

those things– so a patient either gets both or

they get neither, that doesn’t allow you

to answer the question of whether the

antibiotics are important or whether the

steroids are important. So if you see a benefit

in both of them together, which one is it that

causes a benefit? Or is it really only

when they’re together? And you can’t

answer that question unless you do a factorial

design that allows you to look at that interaction. And let’s see– so

if the factors really do interact, only a trial

that looks at all of them– only a factorial

design– can detect that interaction. If these factors

don’t interact, you can still estimate the

effects of all of the factors, and the nice thing

about a factorial design is that the cost

is actually similar than if you just explored

all of them independently. So an example of

this type of design is called the PROSpect trial. This is a pediatric

respiratory trial. It’s an NHLBI-funded trial. This was looking at the

comparative effectiveness of two factors, or two domains. So the first one was the

positioning for ventilation. So patients were

either supine or prone while receiving the ventilation. And then the type

of ventilation was either conventional or a

high frequency ventilation. So now we have in combination,

there are four possibilities. So we have four

arms in our study. And the trial was designed

to have interaction– sorry, to have interims where we could

consider either dropping a row or dropping a

column of our table. So potentially we could

have a situation where we stopped all of the arms

that were in the prone position and continued the

supine position, or we could do the opposite

where we continued the prone and stopped the other. We could also drop

one of the rows. So for example,

you could stop all of the arms that had the

high frequency ventilation and continue the conventional,

or do the opposite and stop the conventional

and continue the other. So this is an example

of a factorial design that allowed us to look at

these different domains. So as we said, the

factorial design is really the only way that you

can detect these interactions if they exist, the

only way that we can account for which

of those factors contributes to our

treatment benefit. These trials can be

slightly harder to plan and implement when you

have multiple factors. But the nice thing is that if

you don’t have interactions, it’s almost free to

do this, so why not? And if the

interactions are there and you want to estimate

them, this does typically lead to a larger sample size. So that’s one way that we can

address broader questions, is a factorial design where

we look at multiple treatment domains. Another way we can look

at broader questions is by doing a dose

ranging study. So many of the trials

that we see today have a small number of treatment

arms, maybe one or two doses versus control. So we have often

made an assumption that those are the

right two doses that we should be looking at. But if we make a guess, it’s

a possibility that we’re wrong and it may be the best dose

isn’t actually one of those one or two that we’ve chosen. So a dose ranging trial,

what this is going to do is a trial that

incorporates more doses. So three doses, six

doses, we’ve looked at trials that have many

more doses than that. A lot of times when we

do a dose-ranging trial with multiple

doses, we want to have a model that

explicitly looks at the relationship between the

efficacy and the dose level. And a lot of times

we’ll often add adaptive features such

as maybe eliminating arms that aren’t performing well. So we can start to

drop arms or maybe change the allocation

to our doses– so if some doses start to look

like they’re performing better, we increase the

randomization to those arms. So we can have adaptive

features and we’re going to talk about those a

little later this afternoon. We’ll talk about something

called response adaptive randomization or RAR and

give an example of that. That was a little

teaser, so we’re going to talk more about

dose ranging trials, but wanted to talk about

response adaptive randomization before we go into that. So the gains of this is– well, first of all, we

have a better chance of picking the right dose. If we can look at a

broader set of doses instead of just

picking one or two, we have a better chance

that we’ve picked the right dose in our trial. So the losses, a lot

of times we think if we’re going to add more

doses, that’s more arms, isn’t that more patients? So if I’m going to

add another dose, does that mean I now have to

almost double my sample size? And the answer is

usually actually no. So with modern modeling, so

when we add these dose-response models and we start

to add features like adaptive

randomization, there may still be a small

increase to the sample size, but it’s actually

a marginal cost. So the cost of going from

three arms to four arms is actually not that large. So one of the other

things that happens with dose ranging

trials, we often pick a dose response model. Again, that’s an assumption. Basically we’re

looking at what’s the form of the relationship

between efficacy and dose. So that does rely

on an assumption, so that has to be

carefully considered. So we’ll circle back around
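As a concrete (and entirely generic) example of such a model, here is the Emax curve that is commonly used for this efficacy-versus-dose relationship; the parameter values below are arbitrary placeholders, not numbers from any trial discussed here.

```python
def emax(dose, e0=0.10, e_max=0.60, ed50=2.0):
    """Emax dose-response model: the response rises from e0 at dose 0
    toward a ceiling of e0 + e_max, reaching half of that gain at ed50."""
    return e0 + e_max * dose / (ed50 + dose)

# Because one smooth curve is fit across all arms, every dose informs
# the others -- the source of the small marginal cost of an extra arm.
doses = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
curve = [emax(d) for d in doses]
```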

to dose response modeling in just a moment. But first wanted to talk a

little bit about enrichment. It’s kind of a buzz word lately. So the idea is

sometimes we don’t know the right

population of patients to enroll in our trial. And enrichment trial

recognizes that there’s some uncertainty about the

population in which population actually has the benefit. So an enrichment trial, you can

do this maybe one of two ways. You can either start with

the largest population and then start to narrow

that down and start to focus on the population

where you’re seeing benefit, or you can go in the

opposite direction. So maybe start with

a small population, and then as the

trial moves on, you start to expand into a

larger, broader population to see if that

benefit still holds. Some examples of

this, we’ve seen this in a lot of oncology trials. In particular,

when we’re looking at a targeted therapy that’s

tested in multiple tumor types– so the idea is that maybe

it’s not as important where the tumor originated, but

rather the molecular target of the agent and

whether it’s going to be beneficial in patients

with those types of tumors. So in oncology, we see

these trials, we sometimes call them basket trials,

where instead of just running a trial in one histology

and then running a separate trial in lung

cancer and running one in breast cancer, what

we can often do instead is enroll a broad population. So we enroll patients in

different types of tumors, and then we use our methodology

to try and figure out which of those populations

is there a benefit. We’ve also seen an example– or some examples in other areas. Stroke, and we’re going to

walk through an example here, where the enrichment was

done to determine the time window between when

the stroke occurs and when the patient gets to

the hospital and the stroke severity. And the idea is, does the

effectiveness of the treatment differ across this population,

or is it effective everywhere? So enrichment trials, the

main idea here in the gain is that we can identify where

the treatment is effective and where it’s not. So one of the

downsides is that we do have some more complicated

statistical methodology to account for the fact that

we’ve got multiple groups and if we have multiple

chances to win. So we have to account for that. But one of the big gains

is that this is robust. Because we’re looking

at a broader population, there’s a better

chance that we’re going to ultimately end up with

the right population instead of just guessing at the

outset and potentially getting it wrong. These trials tend to be

larger than if we were just doing a trial in one

homogeneous focused population. But then, of course, the

potential benefit of that is it might be easier to enroll

because we’ve now got a broader inclusion/exclusion criteria. So an example of this is a trial

called DAWN run by Stryker. Jeff Saver was the

PI of this trial. And this is investigating the

use of a device called TREVO for the treatment of stroke. Thanks to Roger Lewis who

provided these slides for us. So before the trial

was started, there were some papers that came

out showing that endovascular therapy was highly

effective in some, but the population for

where that benefit happens was a little unclear. So which patients are

most likely to benefit from this treatment? So at the start of the

trial, the investigators, they sat down and

tried to think about where are their uncertainties

about this device? What are the things

that we would like to learn in the trial? So this slide shows kind of

this set of uncertainties that they listed, things

that they don’t know that they would like to learn. One of those was

where is the benefit? Where does this occur? So in stroke, the

primary endpoint here was the modified Rankin

scale, mRS disability scale. So this is a seven-point scale,

kind of an ordinal scale. So zero means the patient

has no symptoms at all, a 6 on this scale means

the patient has died. So the question was,

is the treatment most beneficial for

subjects who are maybe in the lower range of that scale

or more beneficial for patients who are in the upper range? Or is it beneficial

across the entire range? So that was one of the

things that we didn’t know going into the trial. Another question was

about who benefits. And one of the things

they wanted to look at was the core infarct size. So this is, as the

way I understand it, was a measure of the

damage to the brain tissue. And one of

the hypotheses was that only those patients with

smaller core infarct sizes would benefit from

the treatment, but those with the

larger infarct sizes were less likely to benefit. But that was an

uncertainty, we didn’t know, and that was one of

the things that we would like to learn

about during the trial. One of the other

uncertainties was the magnitude of the benefit

for this device called TREVO. Does it have no benefit,

does it have a large benefit, does it have a small benefit? And the implications

here were, well, if it has a large benefit,

we could run a small trial, but if the benefit is smaller,

we would need a larger trial. This has kind of– the idea that we might

need a flexible sample size because we have

some uncertainties about the size of

the treatment effect. So for all of these

questions, if we were to just make an

assumption and say, OK, we’re only going to

enroll patients who have the small core infarct

sizes, that would be making an assumption, and if that

assumption is incorrect, that could have

consequences for the trial that maybe the

device is beneficial and we just picked

the wrong population. So what we have done

instead is designed a trial that’s going to look

at all of those factors. First of all, we’re going to

talk about that first one, the mRS scale. So this is a seven-point scale. Typically in the analysis

of these types of trials, the scale is dichotomized. So for example, we

look at patients who had an outcome of 0, 1, or 2. So these are good

outcomes, patients have small-to-no symptoms. And we count the

rate of patients who have an outcome of

0, 1, or 2 versus those that have 3 through 6 in kind

of a responder rate analysis. So the downside of that is

that if the benefit occurs– if we move patients who

might have otherwise been a 4 and we move them

to a 3, this kind of analysis that just lumps

together the 3’s and 4’s doesn’t account for that. So we miss kind of the subtlety

in the treatment effect there. So one way to handle that is

to have an analysis method that looks at that scale. So it’s not just looking– does

a patient go from somewhere of 3 to 6 down to somewhere 0 to

2, but looking at how they move and what degree of

movement they have. So what we’ve done

here is constructed what we call a utility-weighted

mRS. So what this does is it takes each value of

that mRS score, 0 through 6, and assigns it a utility. So here, a utility of 0 essentially means

there’s no benefit, and a larger utility score means

that there is a bigger benefit. So there were a couple of

papers in the literature that had proposed

utility-weighted scores. So these top two

rows in the table show some utility weights

that had been proposed. So for example, these were

on a scale from 0 to 10– so a utility of 0 is no benefit

to the patient, a utility of 10 is a good benefit. So patients who end up

with a mRS score of 0 were assigned a

utility weight of 10– so that’s a good outcome. Patients who died were

assigned a utility of 0. An mRS of 5 is kind of an interesting one. So 5 on the mRS scale

is essentially a patient who is in a vegetative state. So one of these

proposed methods treats that– gives it a very

small utility score of 0.1. The other proposed

method actually gave this a negative

utility score, so that this

outcome was actually considered worse than the

outcome of 6, which was death. And then you can

see that across the rest of the

mRS range, the two proposals agreed pretty closely. So in the DAWN

trial, the new trial, they actually used the

weights that are shown here. So outcomes of 5 and 6, both

received a utility score of 0. mRS of 0 got the

good utility of 10, and then the various

degrees in between. So then when the

analysis is run, that allows us to look with a

little more subtlety at where the treatment effect is. So instead of just doing

this dichotomous looking at 3-to-6 versus

0-to-2, we can now have a better idea of the

magnitude of the benefit. So for example, a patient

going from a 4 to a 3, that’s a pretty large

increase in their utility. Whereas going from a 1 to

a 0 has a smaller change in their utility. All right, so that was one
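The utility-weighted analysis just described amounts to a weighted average over the seven mRS categories. In the sketch below, the endpoint weights follow the talk (mRS 0 maps to 10; mRS 5 and 6 map to 0), but the in-between values and the patient counts are illustrative placeholders, not the actual DAWN weights or data.

```python
# Utility weights indexed by mRS score 0..6 (illustrative; endpoints per
# the talk, in-between values are placeholders).
UTILITY = [10.0, 9.0, 7.5, 6.5, 3.0, 0.0, 0.0]

def mean_utility(mrs_counts):
    """Average utility-weighted mRS; mrs_counts[k] = patients with score k."""
    n = sum(mrs_counts)
    return sum(c * u for c, u in zip(mrs_counts, UTILITY)) / n

# Hypothetical counts over mRS 0..6 for each arm
treatment = [20, 25, 20, 15, 10, 5, 5]
control = [10, 15, 20, 20, 20, 7, 8]
benefit = mean_utility(treatment) - mean_utility(control)

# Shifting one patient from mRS 4 to 3 moves them 3.0 -> 6.5 in utility,
# a gain the 0-2 vs 3-6 dichotomy would score as zero.
```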

question that we addressed. The other question

was, who benefits? And this idea of, should

we enroll only patients who have small infarct

sizes, or should we enroll a broader population? So the infarct size,

this was thought to be something they could

identify as an eligibility criterion for

entering the study, and might define

different populations and how they respond. So the strategy could

be that we start with the small infarct

sizes and then expand if we see benefit

there and start to enroll the larger ones. The other strategy,

what we call enrichment, is we start by enrolling

the broad population, and then we start to restrict

the enrollment criteria if necessary. And the DAWN strategy

actually did the latter. So they started by enrolling up

to 50 cc core infarct volume. And then they allowed

enrichment rules– so they had interim

analyses that would allow you to then prune

off certain sections of that so they could potentially

decrease that from 50 to 45, from 45 to 40, et cetera. And the way that

that was done was by looking at the likelihood

of a positive trial. And if you were more likely

to get a positive trial from decreasing from 50

to 45, then that was done and you enriched to

a smaller population. And I think I have another

slide about that in a moment. So that the third question that

we had some uncertainty about was the magnitude

of the benefit. So ideally, if the truth

is that this device doesn’t work at all, if

there’s no benefit, we would like to be able to

stop the trial early and save that sample size. Or, we would like

to say, OK, if there are certain regions of the

population where there’s no benefit, we would like

to stop enrolling just in that population,

but if we see benefit in other areas of

the population, we would like to keep going. So if there’s no benefit,

we want to be able to stop. If there is a small benefit

but it’s clinically important, we would like the

trial to keep going. So they set a large

maximum sample size of up to 500

patients, but then had interim analyses that

allowed us to stop early. The other idea behind a

clinically-important difference and understanding

what that means was related to the

choice of the outcome. And instead of just using

the dichotomous measure, trying to pick an

outcome that’s going to be more sensitive

in recognizing where that benefit is. And finally, if we have a

large treatment benefit, that means we don’t really need

the full insurance policy, as it was described

earlier, of a large trial. So we’d like to build in

rules that allow our trial to stop early and potentially

as early as 200 patients if it’s likely that the trial

is going to be successful. So everything about

this design was pre-specified– they actually

wrote up a methods paper describing their methods. So the sample size

was a maximum of 500, and they had multiple

interim analyses. Starting at 150, 200,

all the way up to 450. And they pre-specified

what types of decisions could be made at each

interims– so some interims they had the possibility

of enriching, some interims they had the

possibility of stopping the trial. The rules for

early stopping were based on the predicted

probability of success. And then the rules for

stopping for futility were based on whether the

probability of success in all of those subpopulations

was less than 10%. The enrichment criteria

here was based, again, on the predicted

probability of success, and the idea was,

if restricting the population would increase

the predicted probability of winning

the trial by at least 10%,

then we would enrich. And we would

eliminate populations based on their core

infarct size if they had less than a 40% chance of

benefit in that population. And the enrichment here

was done from the top. So we started with infarct

sizes from 0 to 50, and then we would

calculate what’s the predicted

probability of success with the full population? What’s the predicted probability

of success if we restrict it to 45 and below? What’s the predicted

probability if we restrict it to 40 and below? And if by doing that,

enrichment could increase the probability

of a successful trial, then we would enrich. So some of the statistical
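The enrichment logic just described reduces to a simple rule over predicted probabilities of success (PPoS). The sketch below applies that rule; the PPoS numbers are invented for illustration (in the trial they come from simulating the remainder of the study under the current model), and the 10% gain threshold matches the rule described above.

```python
def choose_cutoff(ppos, current=50, min_gain=0.10):
    """Walk down the candidate infarct-volume cutoffs, restricting the
    population whenever doing so raises the predicted probability of
    success by at least min_gain over the current choice."""
    best = current
    for cutoff in sorted(ppos, reverse=True):
        if cutoff < best and ppos[cutoff] >= ppos[best] + min_gain:
            best = cutoff
    return best

# Invented PPoS values by infarct-volume cutoff (cc):
ppos = {50: 0.42, 45: 0.55, 40: 0.58}
new_cutoff = choose_cutoff(ppos)  # enriches 50 -> 45; 40 adds under 10% more
```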

details around this. So this, as you can see, became

kind of a complicated trial, the enrichment,

the early stopping, so we had to do a

lot of simulations and make sure that the

trial was going to behave the way we wanted it to. One of the aspects

that we looked at was, what is the criteria

for a successful trial? So we were calculating

the posterior probability of the mRS– utility-weighted

mRS score being better than the control arm. If no enrichment happens–

so if all through the trial we enroll up to 50 cc

[INAUDIBLE] infarct volume, then the probability of benefit

has to be at least 98.6% to be positive. And this threshold

takes into account the multiplicities of our early

interims for early stopping. And that’s if no

enrichment occurs. If enrichment

occurs, we actually need a more restrictive

criteria because of the fact that we’ve kind of

cherry-picked which population we’re

going to be analyzing in the final analysis. So we started out with 98.6,

but if enrichment occurs, there was a rule for how we

would adjust that threshold to account for the enrichment. We had some statistical

models kind of similar to the models that we’ve

talked about before that shared information across

infarct-sized populations. So they had a model looking

at the benefit as a function of the infarct size. And we ran before the trial

started massive computer simulations to show

that this all worked, that we controlled type I

error, that we could understand how the enrichment happens. So here’s the paper,

this was published in the Journal of Stroke

for the DAWN trial. And this is a trial

that has actually run. It actually finished

and reported out in January of this year

in the New England Journal of Medicine, and actually saw

quite significant efficacy across the wide

range of patients. So I believe the outcome was

it never actually enriched, but the trial did stop early and

was successful at an interim– and I don’t recall

which interim it was. It was the first one– OK, so they saw the

wildly positive results in the first interim, they

were able to stop early. Pause to see if there are

questions on the enrichments? All right. And to hand it over to Kert

to talk about platform trials. On a pacer. So I’ve had

instructions, I have– there’s actual

tape on the floor, I’m not meant to

go past the tape. So I’m going to sit

here very carefully. All right. So I want to talk

about platform trials. These are trials that

enroll multiple treatments within the same structure. So we’re investigating

lots of things at once. They’re motivated by the notion

that if you look at a disease– and I’ve got all

Alzheimer’s up here– there’s been lots of

Alzheimer’s trials, unfortunately

they’ve all failed. What’s happened is if you take– I have an example here, 25,

there’s been more than 25. But as a society, what we’ve

done is we’ve invested 50,000 to 100,000 patients

in Alzheimer’s. We’ve failed over and over and

over and over and over again, and we’ve invested 1,000 to

2,000 patients in each novel treatment, and we’ve

invested 25,000 to 50,000 patients

on control. The net result of that is we

know a ton about the natural course of Alzheimer’s. And do we need 25,000 to 50,000

patients enrolled on control in Alzheimer’s? Or would it have been

valuable to do a few less and maybe test a few more

drugs while we were at it? We could have had more

shots on goal, so to speak. So what platform

trials are aimed at is trying to make that happen. So this is just a graphic. This is really how we’ve

allocated patients. A whole bunch of

controls, and, I mean, certainly there’s a lot of

patients in each of these bars, but the relative

proportions, this is a little off from

what we would think would be the

natural thing to do. So a platform trial is a

special example of something called a master protocol. Master protocols are kind of another buzzword these days. Essentially they’re a

protocol that– imagine a protocol with no drug names. Sounds kind of odd. But if you’re investigating

Alzheimer’s, there are certain endpoints

that you’re interested in. Patients are visited

at particular times. They have certain medical

tasks, medical procedures. You can write a

lot of the protocol without really knowing what

the experimental drug is. We’re going to be

testing ADAS-Cog and we’re going to

do it at this rate, so we’re going to do this

kind of analysis at the end. So the master protocol

is really specifying all of that stuff–

the visit schedule, the endpoints, the

analysis, and so forth. A master protocol is

designed so that it can be run over

multiple treatments, and in particular, it

can be run perpetually. So you can say, I’m going to

start a trial in Alzheimer’s. There actually is something

called the EPAD trial in Alzheimer’s,

which is a platform. And at any given time, there

are three to six treatments in EPAD. It’s still getting

going at this point, but there will be three

to six treatments in EPAD. And drugs can go in

and out of that trial. And when a trial– when a drug comes

in, what happens, instead of having to write

an entire new protocol, you say, look,

most of this stuff, we’re going to

have the same visit schedule as all the other

drugs, the same endpoints, the same tests. But when a new drug

comes in, instead of writing an entirely

new protocol, what we’re going to do is we’re

going to add an appendix, and that appendix

says, this is what we’re going to do for

drug X within this. So usually that

appendix gives details of what the treatment is. So certainly all the

medical information, any specific safety risks,

any particular safety concerns if you have to

do extra kinds of testing. We have to do a certain kind

of blood test for this drug that we wouldn’t

ordinarily have to do. And if there’s any

patient exclusions. We ran an Ebola

trial, for example, and some of the drugs could be

used on pregnant women and some of them couldn’t. So the appendix would

say, oh, this one’s not applicable to pregnant women. A platform trial is a specific

form of master protocol. And it’s defined by

its being continuous. So certainly most

master protocols are intended this

way, but some of them we’re going to have six

drugs from the start. A platform trial

is usually intended to run for a long time,

and drugs are going to enter and exit over time. It’s also intended to have a

cohesive inferential structure. And so what do I mean by that? I’m not trying to run– platform

trials are often confused with cooperative networks. Where if you run a

cooperative network, you’ve got a large

patient infrastructure, but every trial that comes

in is really separate. So I’m going to run

this trial and I’m going to get patients

from this network, and I’m going to run

that trial and I’m going to get patients from

the network and so on. But the trials really aren’t

talking to each other. These are trials where they

aren’t trials stapled together. What happens is patients– when you come into

this trial, if there are six treatments

running in it, you can get randomized to

any of the six treatments. So it all has a single umbrella

that decides randomization and it has a single umbrella

that describes the analysis structure. One of the key

reasons for doing that is if you have multiple

arms and you’ve randomized across

everything, that allows you to pool controls

and it allows you to compare across different arms

because you had randomization as opposed to if you

didn’t, if you just ran this trial in this part

of the network and that trial in that part of

the network, you’d worry about potential

biases because you didn’t have this randomization. Yes? So Kert, that’s

very interesting. Are there– in these

master protocols as you’re describing them, do

the inclusion criteria for each of, let’s say, the

subtrials that are going on within the master protocol

have to be the same in order for it to work? Because I can envision

where you might be able to have a master

protocol for a trial, but the inclusion

criteria for trial A might be different than B, C,

or D, just maybe even slightly. So there are a couple different

directions you can go. And the efficiency that you

gain depends on the details. So I’m trying to remember–

what’s the European oncology trial? If you asked me any other time,

I’d think of it immediately. All right. Anyway, there is a trial

that for some reason I’m blanking on,

what it is designed to do is there are several

different oncology treatments, but every one of

them is intended for a different

molecular mutation. So they’ve got drug A is

intended for mutation A, drug B for mutation B, and so on. They are really using the platform as a screening mechanism. If you imagine these

trials run separately, trial A would screen out 80%

of the people they screen, trial B would do the same thing,

trial C– because they’re just looking– each one’s

looking for a small segment with their particular mutation. If you do all of these

together, then what happens is that screening

mechanism has gained you operational efficiency. Other trials are intended to

look at the same treatment and you’ve got

multiple drugs that work on the same population. That gives you

inferential efficiency because you can share

controls among that. So it really– you can

go both directions there, and the efficiency you gain– you get efficiency, but it’s a

very different kind depending on which direction you go. Oh, yes, sir? I have two questions. So one, how do you

get the companies to agree to a master protocol? Which, I don’t

know, maybe you can point to a reference

or something that you can point

for the lawyers to kind of debate about? So the second question is

more realistic is, is there a certain type of randomization

platform schema that you would recommend for a platform

trial where it’s more efficient to

bring in new drugs as other drugs

kind of get exited? How do you get

companies to agree? Very, very carefully. So the first thing

that you can always do is you can always get

a company to agree to be the second, the

third, or the fourth drug into the platform, because you

can effectively promise them the cost is cut in half. The thing that you can’t

ever do is convince someone to be the first drug

in the platform trial because they’re difficult

to get together. Now that’s a little

bit tongue-in-cheek, but that part’s absolutely true. The other sticking point,

which I think you probably were intending,

is companies often don’t want to play in

the same sandbox, period. One thing that I think is

fascinating about platform trials, there is a

direct comparison between drugs within a platform

and you randomize in-between them. Usually what has happened,

platform trials nowadays actually have, part

of their charter, is a statement that they will

never publish and compare the drugs together. And often that is enough. I-SPY 2– you asked for papers. The I-SPY 2 has been

running for about 10 years, it’s had lots of

companies go through. I don’t know if there’s

any publication that describes the interactions. Probably by design as

we never actually want to describe the

private interactions. There is a paper

that’s been submitted from the adaptive platform trial

coalition, which is self-named. But Derek Angus is

leading that, Derek Angus and Brian Alexander. And they I think

have some discussions on how to get

companies to buy in. So I’m not giving you a

good answer because it’s a delicate question,

but we have successfully done it for a number of trials. An example of this– so let me just give you a kind

of– this is the efficiency. So we’ve got an

emerging epidemic, this is inspired by our

work on the Ebola epidemic in West Africa. Multiple possible treatments

were suggested for Ebola. If you remember,

there was something called ZMapp, a number

of other antivirals, there were some suggestions

certain cancer drugs might be effective, and people were

thinking of combinations, all kinds of stuff going on. So one question was, if

you have this large space of possible treatments,

how do you look at those? What do you need to do– what’s the right way to

evaluate 20 different drugs? We’ve done this in Alzheimer’s as a society over the past 20 years. We’ve done drug A

failed, drug B failed, drug C failed, drug D failed– we’ve run a lot of

trials in sequence. So how do we look through

these 25 different drugs? In order to make the problem

a little bit more concrete, suppose that under

the standard of care there is a 30% survival rate. That’s consistent with

our Ebola example, it might have been closer to 40. But I will say 30 here. We’re going to

make the assumption this is a hard disease. So I’m going to assume that only

10% of novel treatments work. 90% don’t. And that depends on

what area you’re in. If you’re in Alzheimer’s,

nothing’s worked in Alzheimer’s. Probably assuming 2%

might be optimistic. Sepsis, an incredibly hard area. Certain kinds of

cancer immunotherapies right now fortunately are

in much better situations– you might expect a larger

number in certain areas. But I’m setting up this problem. So 30% chance of survival,

and of this large bucket of possible treatments,

only 10% work. How do I find the golden

nugget among that, those large numbers? That 10% of novel agents that

work, they have 50% survival. So working here means I

increased from 30 to 50. All the rest do nothing. So 30/30. So certainly things could be

harmful, I’m not assuming that. So how do we start? So this is what we might

do in a traditional trial. We’ll pick one. We’ll run a trial. I’m going to do 100 patients

per arm, so I do 200 patients. 0.025 type I error. If that first trial

is successful, I go, yay, I found

a treatment and I declare the process a success. If that trial fails, then I’m

going to pick another drug out of this bucket and I’m

going to run another trial. So this is just

intended in a sequence. Grab a drug; run a trial; if

it works, great; otherwise, grab another drug,

if that works, great; otherwise,

grab another drug– one after the other

after the other. Now this is actually

a pretty good strategy if you can pick what the

absolute most promising drug is in advance. Generally speaking,

that’s hard to do. If you actually knew

which drugs work, you probably wouldn’t

need to do the trials. So this– if we’re in this– if our bucket has 10%

good drugs, what happens? So on average, we’ve got to run through 9.8 treatments to find a successful drug. That’s about in line with

our 10% of the drugs work. Our n, we have to– so this n is the total number

of patients we had to treat– 1,966. And that’s effectively–

really, we’re running 10 different

trials on average in order to find that

one golden nugget. Let’s see– we assumed an accrual rate such that we found that one in 12 months here.
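[To make the arithmetic concrete, here is a rough Monte Carlo sketch of the one-at-a-time search– an editorial illustration, not the talk’s actual simulations– using the stated assumptions: 30% control survival, 10% of drugs effective at 50%, 100 patients per arm, one-sided 0.025 via a simple two-proportion z-test.]

```python
import math
import random

random.seed(2024)

# Illustrative assumptions from the talk: 30% survival on control,
# 10% of candidate drugs work (raising survival to 50%), 100 patients
# per arm, one trial after another until one succeeds.
P_CONTROL, P_EFFECTIVE, P_DRUG_WORKS = 0.30, 0.50, 0.10
N, Z_CRIT = 100, 1.96  # per-arm size; 1.96 ~ one-sided alpha of 0.025

def trial_wins(p_treatment):
    """Simulate one 100-vs-100 trial and apply a two-proportion z-test."""
    x_t = sum(random.random() < p_treatment for _ in range(N))
    x_c = sum(random.random() < P_CONTROL for _ in range(N))
    pooled = (x_t + x_c) / (2 * N)
    se = math.sqrt(2 * pooled * (1 - pooled) / N)
    return se > 0 and (x_t - x_c) / N / se > Z_CRIT

def search_until_success():
    """Grab a drug, run a trial, repeat; return (trials run, patients used)."""
    trials = 0
    while True:
        trials += 1
        p = P_EFFECTIVE if random.random() < P_DRUG_WORKS else P_CONTROL
        if trial_wins(p):
            return trials, trials * 2 * N

results = [search_until_success() for _ in range(2000)]
mean_trials = sum(t for t, _ in results) / len(results)
mean_patients = sum(p for _, p in results) / len(results)
print(f"average trials: {mean_trials:.1f}, average patients: {mean_patients:.0f}")
```

[Under these assumptions the averages land near the quoted figures of roughly 9.8 trials and 1,966 patients; the exact values depend on the test used and on whether false-positive “wins” are counted.]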

12 months on here. So here’s another way

to do this search. Instead of looking at one-to-one

over and over again, well what does that do? My first one-to-one trial,

I enrolled 100 controls. My second one-to-one trial, I

enrolled another 100 controls. The third one, I ran

another 100 controls. That’s the same

control over again. So that doesn’t

make a lot of sense, let me at least combine those. So I’m going to do a

shared control design. I’m going to do five

agents at a time. So still 100 per arm,

but now five agents are going to get compared

to a single control arm. And again, I’m going to do

pairwise independent analysis. So now I run 600 patients– 100 controls, 100 on

each of the five active arms. If I find a successful

drug, I go, yay, I won. Otherwise, I go

back and I enroll another five arms at a time. So I run this sequence. This is good. And I get savings here from

sharing the control resource. The fact that I haven’t

enrolled 100 controls over and over and over again is

what’s gaining me efficiency. So my average

sample size, it now takes me 1,528 patients

in order to finally find a successful treatment. And I can do this in eight

months rather than 12 months. I can do a little bit better. Remember, we talked

about futility earlier being able

to stop bad drugs? I’m going to do that. So those five active

arms, if some of them aren’t performing

well, I’m going to stop them for futility. That gets me an additional

benefit, a large one. So that 1,528 I can drop to 971. So now I’ve almost cut it in half. So sharing controls,
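[For comparison, the shared-control version of the same search can be sketched the same way. This is an illustration under the same assumptions, not the design’s actual simulation, so it lands in the ballpark of, rather than exactly at, the figures quoted; the futility stopping described here is omitted for brevity.]

```python
import math
import random

random.seed(7)

# Illustrative assumptions (same as before): 30% survival on control,
# 10% of drugs effective (50% survival), 100 patients per arm,
# batches of five active arms sharing a single control arm.
P_CONTROL, P_EFFECTIVE, P_DRUG_WORKS = 0.30, 0.50, 0.10
N, Z_CRIT = 100, 1.96  # per-arm size; one-sided alpha ~ 0.025

def arm_beats_control(x_arm, x_ctrl):
    """Pairwise two-proportion z-test of an active arm vs. the shared control."""
    pooled = (x_arm + x_ctrl) / (2 * N)
    se = math.sqrt(2 * pooled * (1 - pooled) / N)
    return se > 0 and (x_arm - x_ctrl) / N / se > Z_CRIT

def shared_control_search():
    """Run 600-patient batches (5 actives + 1 control) until some arm wins."""
    patients = 0
    while True:
        patients += 6 * N
        x_ctrl = sum(random.random() < P_CONTROL for _ in range(N))
        for _ in range(5):
            p = P_EFFECTIVE if random.random() < P_DRUG_WORKS else P_CONTROL
            x_arm = sum(random.random() < p for _ in range(N))
            if arm_beats_control(x_arm, x_ctrl):
                return patients

mean_patients = sum(shared_control_search() for _ in range(2000)) / 2000
print(f"average patients with shared controls: {mean_patients:.0f}")
```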

using futility, I can now find an effective

treatment twice as fast. What a platform does is it

adds another piece to this. So we’re still going to

investigate five agents at a time, but now

what we’re going to do is we’re going to run interims

every 150 total subjects. I was about to get

up, but I refrained. So the futility

for an arm here– so we’re still

declaring futility, what you can think

of this platform as doing is whenever an

arm stops for futility, I’m going to bring a

new arm in immediately. So I’m not running

a six-arm trial and waiting for it to finish and

then running another six arms. Whenever a slot opens, I’m

going to bring a new drug in absolutely immediately. So I’m just investing

things forward. Question? So is there a point in

this mechanism in which your controls can crossover? If it’s somebody

having repeated events? Can someone– oh,

so can somebody– so that depends. Some of these are designed

to allow crossovers. There is a platform

trial called PanCAN, which allows patients to be

enrolled on multiple therapies over time. In other ones– in

this one, we actually did not allow crossovers. The usual issues with crossovers– washout effects and carryover effects– would apply to this,

but in theory, it’s possible. So if you had

asthma, for example, and you wanted to

see what was the best abortive agent in the

middle of an asthma attack, and someone was on the

trial as a control, there was other agents,

after some period of time, the controls could

switch after they’d been controls for

X amount of time, because the events are

in themselves isolated although the patients

are the same? OK. I don’t know that

much about asthma, but it sounds plausible,

that sounds like the PanCAN. As long as you don’t think– if I’ve been on drug A– or control and I

switch to drug A, if you think that’s no

different than being on drug A, that usual

question, if the answer is– if you could do a crossover

in a regular trial, you could do a

crossover in a platform. So I just want to touch

on the– actually, go back to the eligibility

question for a second here. Because if you’re replacing

one of these arms, let’s say the new treatment

comes in with, for example, you can’t give it to pregnant

women who have Ebola, how does that modify the

overall exclusion criteria for kind of the platform trial? Does it have to

change or are they not eligible to be randomized

to one of the arms? How is that handled

operationally? Yeah. It’s a– oh,

there’s a follow-up. [LAUGHS] I’m a

little nervous now. So the– no, you’re good. So generally speaking,

it’s complex. What you would often do is have

to analyze, say, pregnant women separately. In the Ebola trial, the

two special populations were pregnant

women and children. And it was actually

an interesting debate because the pregnant women

issue, the obvious question, if you die of Ebola,

that’s bad for the baby. So it was actually–

it was a huge debate over which drugs

actually ought to be excluded for pregnant women. You’d have to make

a modeling decision. Do you want to combine

the pregnant women and the normal data– not normal, it’s the wrong

word– pregnant women and everyone else

together in the analysis and pool it when you’re

evaluating an arm? Do you want to view

pregnant women separately when you do it? You’d have to basically come

up with a statistical model to account for that, and I don’t

have an easy answer for you in that situation. From an enrollment perspective,

though, as they come in, they wouldn’t be eligible to

be randomized to that arm. Yeah. So the operational– sorry–

the operational [INAUDIBLE] is easier. If you had arms A,

B, C, D, E, and you said that C and E were

ineligible for pregnant women, then a pregnant

woman coming in would be randomized to A, B, or

D. And so potentially you’ve got biases in those

populations and that’s where the modeling

would come into play. And so then with kind of

changing exclusion/inclusion criteria potentially

for treatments and also different treatment

modalities, I mean, these essentially are

unblinded trials, then? Depends. So the EPAD in Alzheimer’s, for

example, some of the treatments are pills and some of the

treatments are injections. For the purposes of EPAD,

you can get a placebo pill or you can get a

placebo injection. So the way EPAD works

is you are randomized to arm A, B, C, D, or E.

If arm B is an injection, then you’re randomized, you have

a chance of getting a placebo injection or an active B.

If you’re randomized to C and it’s an oral, you

might get a placebo oral or you might get

active C. So it’s kind of a two-tier randomization. It’s a question of whether– in EPAD, they actually

pool the oral placebos with the injection

placebos, which was actually a controversial choice. What was interesting, one of the

compelling arguments that was made is, we’ve spent 20 years

and haven’t been able to move the needle on Alzheimer’s. Somehow if we could give an

injection as opposed to a pill, and that will affect

Alzheimer’s, that would be a little strange. But it is a choice

on whether you think that different

placebo modality matters, and it will make life

more complicated. So I haven’t given

you a direct answer, we try to deal with

it as best we can. All right. Let’s see. OK, so this open platform. So again, you can think of it as the adaptive design plus the interims every 150. The previous version was essentially the five-at-a-time, but then we waited for

the trial to finish. This is when we invest

the resources forward. So now you can see, we’re

down to 849 subjects. So we’re searching for

this needle in a haystack. We’ve gone from

1,966 down to 849. We’ve saved about

60% of the subjects, and we’ve really done

it in multiple ways. We’ve shared our control

arms, done multiple treatments at once, we’ve employed

adaptive features, and we’ve invested

any savings that we’ve gotten forward as best we

can by bringing in new drugs. This is in a paper– this reference in Clinical Trials– essentially this table. There are lots of

examples of this. They have different features. There’s a paper by Janet

Woodcock and Lisa LaVange in the New England Journal of

Medicine really touting these. These are the

examples in the paper. So some of these– I-SPY 2 has been running for, I think it’s eight years, but it’s been running a long time. Lung-MAP is another

oncology trial. ADAPT is really a concept trial

for resistant antibiotics. This is not the Ebola

trial I was describing, we did something for

the Gates Foundation. This is something with the NIH. This is another Alzheimer’s

trial separate from EPAD, and then this is

an antipsychotic. There’s actually a whole bunch

of– so these are some more examples. I don’t intend to go

through these slides. The only thing this slide

is meant to indicate is this is getting

back to your question. There are just lots of weird

complexities that happen. And really, some of these

are blinded and some are not; some of these are embedded in

a learning health care system and some are not;

and all of them involve quirks to

deal with the issues. And there are

occasionally trade-offs. We will pool the placebo arms. We don’t think it’s going to

make a big difference, but we can investigate

twice as many drugs, so we’ve made that

trade-off consciously. Certainly these are a

challenge to set up. So if you’re going to

do a platform trial, getting the first drug

in it is a long haul convincing companies to

play in the same sandbox, to convince researchers to

play in the same sandbox. There are operational aspects

that are more complex. Informed consent is

often two-tiered. Often there is an

informed consent– are you willing to be

randomized into a drug– and then separately, are you willing to be randomized to the placebo arm or the active arm of this drug. That is how EPAD does informed consent– so you have to do it

twice, for example. On the plus side, the

second and future agents, this is a no-brainer

thing to do. If you have these

platform trials set up, it would be ridiculous to do–

if the whole world were set up where here’s a breast

cancer platform, here’s an Alzheimer’s platform,

here’s a heart disease platform, here’s

a stroke platform, it would be

ridiculous for anybody not to enroll in these platforms

and go run their own trial with their own controls. What you’ve got

here is a warm base. Sites are up and running. There are larger

networks allowing you to find subsets of patients. Everything is better in

this particular world. And I think that is good. All right, so just as

a conclusion, again, these things, it’s

the startup cost versus the massive

savings at the end is really what the

trade-off is here. All right. Anna. All right. So the next topic is about

changing the randomization ratio during the

course of a trial. So the idea behind this is

instead of doing a one-to-one randomization, or if

we have multiple arms, one-to-one-to-one randomization,

the idea is that we’d like to efficiently focus our

resources to the arms that are most promising. So the place where we’ll

typically see these is dose ranging

trials, for example, that’s the example we’ll talk

about in a moment; platform trials as Kert just

described; factorial design– so if you can identify

among multiple arms which arms are the most promising

and then start to focus your resources on those arms. Different ways this

can be achieved. Kind of the simplest

method in many ways might be considered

arm dropping. So if you have

sufficient evidence that an arm is not performing

well, you get rid of it. Another way that

this could be done is a little more subtle

than arm dropping– that is adaptive randomization. So if you have an arm

that’s not performing well, you start to decrease the

allocation to that arm and start to increase the

allocation to the arms that are performing well. So you can see

that in the figure. Going to describe this by
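[One common way to set those allocation probabilities– a sketch, not the specific algorithm of any trial mentioned here– is to weight each arm by its posterior probability of being the best, computed from Beta posteriors at an interim. The interim counts below are hypothetical.]

```python
import random

random.seed(3)

# Hypothetical interim data: (successes, n) per active arm, Beta(1, 1) priors.
# The control arm's allocation is typically held fixed, so it is left out.
interim = {"low": (10, 40), "mid": (16, 40), "high": (12, 40)}

def prob_best(n_draws=10_000):
    """Monte Carlo estimate of P(arm has the highest response rate)."""
    wins = {arm: 0 for arm in interim}
    for _ in range(n_draws):
        draws = {arm: random.betavariate(1 + s, 1 + n - s)
                 for arm, (s, n) in interim.items()}
        wins[max(draws, key=draws.get)] += 1
    return {arm: wins[arm] / n_draws for arm in wins}

p_best = prob_best()
total = sum(p_best.values())  # equals 1 up to rounding; normalize anyway
weights = {arm: p_best[arm] / total for arm in p_best}
print(weights)  # new randomization weights: highest for the best-looking arm
```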

walking through an example. This is a trial that was

designed for smoking cessation. The final endpoint

is looking to see if a patient is still

smoking in weeks 3 through 6 of the treatment. So the idea is in weeks 1

or 2, they may not be fully weaned off of the

cigarettes yet, but by weeks 3, 4,

5, and 6, if they don’t smoke in any

of those weeks, they’re considered a

success, a failure otherwise. The other idea in this

trial was that patients who drop out of the

trial and maybe don’t complete the full six

weeks of follow-up are also considered failures. So this could be– we assume

that they dropped out, maybe that’s because

they are smoking again. So we consider those

patients a failure. This trial has multiple doses. So in addition to

the placebo, we have six dose arms ranging

from 1 milligram up to 100 milligrams. And what we would like to

do, the goal of our trial is to identify the dose that

has the maximal response. So the dose that has the

highest response rate. One of the quirks

of this is that we expect that a U-shaped

curve is possible. So the idea is that maybe

those top doses may have higher dropout rates, for example. Maybe those higher doses

come with more adverse events and patients are more likely

to drop out at those doses. And because those

doses, if you drop out, you’re counted as

a failure, it’s possible that the dose response

curve is not just increasing, but that actually

at some point, it starts to decrease because

those adverse events start to outweigh the benefits. So– oh, let’s see. So in this particular

case, we are assuming that U-shaped

curve is a possibility, and we’re looking for the

maximum effect, which may be at one of the middle doses. Other trials, instead of

looking for the maximum effect, they might be interested

in looking for something like an EDx, like ED90, which

would be the dose that gives you 90% of the maximal effect. Again, the idea

here is that as you start to get to the

higher doses, maybe those are doses that

have adverse effects. So if you can back

away from that and give a slightly

smaller dose and sacrifice only a minimal

amount of efficacy, maybe that’s the

optimal dose to give. So one way that we

often handle that is by looking for this ED90. So trying to find a dose that

has at least 90% of the benefit but maybe is not

the highest dose. Another way we sometimes

handle that idea, that trade-off between

efficacy and safety, is explicitly by using

a utility function, and the idea behind this is

to combine several endpoints. So maybe we can explicitly

define that trade-off between the safety

and the efficacy, and we do that with

a utility function, and then we try to

find the dose that has the maximum utility which

combines those endpoints. So our current trial is a

little bit simpler than that because we’re just

going to count patients who drop out as failures. And so that aspect

of that trade-off is just explicitly handled

in the primary endpoint. Some details about

the trial here. So we’re going to enroll

280 subjects into our trial. In addition to our goal of

identifying the best dose, we also would like

to calculate– this is a phase II trial,

we’re interested in what’s the probability that a

dose that we carry forward is going to be successful

in a future trial. So what we’re going to do

for each dose in the trial, is we’re going to

calculate– suppose that we choose this dose and

we carry this dose forward into a phase III, one-to-one

randomized trial with placebo, and in that future

trial, we plan to enroll 400 patients, 200 on each arm. And what I would like to

know is the probability that that dose is

going to be successful. And that could be used

as a futility rule, it could be used as part

of our success criteria. So I’ll describe several
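[That probability-of-future-success calculation can be sketched as a Bayesian predictive probability: draw plausible true rates from the phase II posteriors, then simulate the 200-per-arm phase III under each draw. The counts and the z-test success rule below are illustrative assumptions, not the trial’s actual model.]

```python
import math
import random

random.seed(11)

# Hypothetical phase II counts (successes, n); Beta(1, 1) priors assumed.
ctrl_s, ctrl_n = 8, 40    # ~20% on control, as in the talk
dose_s, dose_n = 16, 40   # ~40% on the candidate dose (made up)
N3 = 200                  # per-arm size of the future 400-patient phase III

def phase3_wins(p_dose, p_ctrl):
    """One simulated 200-vs-200 phase III with a one-sided z-test at 0.025."""
    x_d = sum(random.random() < p_dose for _ in range(N3))
    x_c = sum(random.random() < p_ctrl for _ in range(N3))
    pooled = (x_d + x_c) / (2 * N3)
    se = math.sqrt(2 * pooled * (1 - pooled) / N3)
    return se > 0 and (x_d - x_c) / N3 / se > 1.96

def predictive_probability(n_sims=2000):
    wins = 0
    for _ in range(n_sims):
        # Draw plausible "true" rates from the phase II posteriors...
        p_d = random.betavariate(1 + dose_s, 1 + dose_n - dose_s)
        p_c = random.betavariate(1 + ctrl_s, 1 + ctrl_n - ctrl_s)
        # ...then simulate the future trial under those rates.
        wins += phase3_wins(p_d, p_c)
    return wins / n_sims

pp = predictive_probability()
print(f"predictive probability of phase III success: {pp:.2f}")
```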

iterations of the design that we went through,

starting with what we call a basic trial. So in our basic

trial, we’re going to enroll 40 subjects to

each of the seven doses– so equal randomization. This may not be the

optimal strategy. Oftentimes when you

have multiple doses, it’s actually beneficial

to enroll more subjects to the control arm. But here, we’re going to

consider a simple example with just equal allocation

to all of the doses. In our basic trial,

we’re not going to do any modeling of the

efficacy across doses. Instead, we’re just going to

analyze each dose independently without any borrowing

between the doses. So in order to do

that, what we’re looking at here is the response

rate we call Pd for dose d. And we’re just going to give

independent prior distributions to those doses. So we’re doing this

in a Bayesian fashion, and what that means is

at the end of the trial, our drug is going to be

considered successful if our posterior probability

of beating the control is higher than 99.2%. So right now, we don’t

have any adaptations, we don’t have any

early stopping. This is just a fixed

sample size trial of 40 per arm analyzing the

doses independently. And the reason that we

have this high threshold is a multiplicity adjustment. The fact that we have

multiple arms in our trial, we have multiple

shots on goal, so that we need to have an

adjustment that it’s harder to win each dose. All right. So we’ve described

some scenarios that we’re interested in. We’d like to see how

well this type of design behaves under these

different scenarios. So in the table here,

I’ve got the doses along the top and a

few different scenarios along the rows. So first of all, we’d like

to see how well the design performs in the null scenario. Well, we think that the control

arm has about a 20% rate. So about 20% of patients that

get a placebo stopped smoking. And then the null

scenario assumes that there is no benefit

on any of the doses, it’s 20% across the board. We’ve got a scenario

we call Harm. Here is where the lower doses

are equivalent to control, and then as you start

to increase the dose, the response rate

actually goes down. We’ve got a set

of scenarios where we have a treatment benefit. In all of these scenarios, the

benefit increases with dose– so that the highest doses are

more beneficial than the lower doses to various degrees. And then finally, we

have a set of scenarios we’d like to investigate where

the maximum benefit actually occurs somewhere in the

middle of the dose range. So here in the

inverted-U scenario, the best dose is

the 25 milligrams, and then the effect

starts to decrease as you get higher doses. And then we looked at a

scenario where that peak occurs earlier in the curve– so it occurs at the

five-milligram dose. So we’d like to see across

this range of scenarios what happens to this trial. So in this table,

what we’re showing is essentially the

power and the type I error rate, which is

in this column here. So under the null scenario, our

type I error rate is about 5%. If we look at the scenarios

where we have a treatment benefit, this particular trial

has a pretty modest power here. So ranging from 14%

up to about 56%, and in our inverted-U scenario,

somewhere around 45% to 50%. So this is an

underpowered trial. We don’t have the

highest power to detect these kinds of scenarios

that we’re interested in. The far column on the

right shows the probability of picking the best dose, and

the best dose is described here in parentheses. So for example, in

the positive scenario, we had about a 40% probability

of choosing the best dose. And in these

inverted-U scenario, we had about a 48% probability. So some of that depends

on how many doses are good and what’s the magnitude of the

difference between your best dose and maybe the

second best dose. All right. And then that gives me 10 minutes– OK. All right. Moved the wrong way. All right. So now we’re going to

take that basic trial that we described and we’re

going to add some modeling, and we’re going to consider

a dose response curve. And what this is just a function

that describes the efficacy as a function of the dose. Here in our example,

we can’t consider a curve that assumes a

monotonic relationship, because we expect there

is a possibility that the curve

increases for a while and then plateaus or

maybe even comes down. So we can’t assume something

like a logistic curve or Emax curve because those have

a built-in assumption that the effect

increases with dose. So we’re going to use a model

called a second order Normal Dynamic Linear Model or NDLM. So this is kind of a

smoothing spline function. So now I’m going to
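[In practice the NDLM is fit fully Bayesianly, with a prior on the drift variance. As a simplified sketch, fixing the smoothing weight reduces the second-order model’s posterior mean to a penalized least-squares smoother over second differences– which is what makes it spline-like without forcing monotonicity. The observed rates below are made up.]

```python
# Hypothetical observed response rates for placebo + six doses.
observed = [0.20, 0.22, 0.30, 0.38, 0.42, 0.36, 0.30]
lam = 4.0  # smoothing weight; in NDLM terms, large lam = small drift variance

n = len(observed)
# Build A = I + lam * D2^T D2, where D2 takes second differences of the curve.
A = [[float(i == j) for j in range(n)] for i in range(n)]
for k in range(n - 2):  # row k of D2: theta[k] - 2*theta[k+1] + theta[k+2]
    coef = [0.0] * n
    coef[k], coef[k + 1], coef[k + 2] = 1.0, -2.0, 1.0
    for i in range(n):
        for j in range(n):
            A[i][j] += lam * coef[i] * coef[j]

# Solve A @ theta = observed by Gaussian elimination (A is small and SPD).
b = list(observed)
for col in range(n):
    piv = max(range(col, n), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, n):
        f = A[r][col] / A[col][col]
        for c in range(col, n):
            A[r][c] -= f * A[col][c]
        b[r] -= f * b[col]
theta = [0.0] * n
for r in range(n - 1, -1, -1):
    theta[r] = (b[r] - sum(A[r][c] * theta[c] for c in range(r + 1, n))) / A[r][r]

print([round(t, 3) for t in theta])  # smoothed curve, not forced monotone
```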

go back to that table where we’re looking at the

probability of success. So now in each one of these

cells, we have two numbers. So the number on the

left is our basic trial that we just looked at before. The number on the right is now

we’ve done nothing else except add a dose response model. So under the null scenario,

we have about the same type I error rate, about 5%. But in some of our

other scenarios, we start to see some pretty

big increases to our power. So take a look, for example,

at the positive scenario. Previously we had 56% power. Just by adding on a

dose response curve, we’ve increased

that to 70% power, so we can see some

pretty big gains. The idea here is that we’re

able to borrow information across these doses. Let’s see– inverted-U scenario,

we saw some jumps from 50% to about 64% by adding

the dose response model. The probability of

picking the right dose, we see some modest gains

here, about 40% to 46%. Here it didn’t change

very much, about 55%. Another way we can think

about choosing the right dose, or whether we've made the right

decision is, how many doses continue into the next trial? So what we’re

showing in this table is the probability that we

pick the right dose given that we can carry

one dose forward, the probability that we get

the right dose if we carry two doses forward, and the

probability that we get the right one if we

carry three doses forward. And here, this is conditional

on trials that were a success. So if we just take the subset

of trials that were successful, looked at which doses

we carried forward, the trend here is obviously

if we take more doses forward, we have a better chance

of choosing the right one in that step that

we take forward. So the next thing we’re

going to add to our trials– so we’ve looked at just going

from a basic trial to adding a dose response

model, now we’re going to start to change

the adaptation. So now instead of just

allocating patients equally across all the

doses, we’re going to do something different. So we’re going to

start the trial with what we call a

burn-in period where we have a fixed allocation. This could be one-to-one, this

could be some other ratio, but we’re going to have a

small portion of our trial that has this

allocation that’s fixed. And from that data, that

initial set of data, we’re going to estimate

the dose response curve. And then for the

next set of patients, we’re going to change

the allocation ratio and start to allocate more

patients to the doses that have the best efficacy. And we’ll continue

to iterate this. So we’ll have interims

that are frequently spaced, and each time we do

an interim, we’re going to update those

randomization probabilities. So how do we decide how to

change these allocation ratios? So one simple method is: for each dose, we're going to calculate the probability that that dose is the best. And we'll allocate in

proportion to that. So doses that have the

highest probability of being the best dose, those

get the highest randomization probability. And doses that have

lower probability of being the best dose are going

to have that smaller weight there. So to make this concrete: here, what we do is we

calculate this probability. What you can do is

actually also raise this to a power which helps you

control how aggressive you are. So if you calculate

this probability and then raise it to a

power of 2 or a power of 3, that allows you to be more

aggressive about going after the best dose. In our example we’re not

going to do that, we’re not going to put an exponent on it. So what we’re going to do
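That allocation rule can be sketched as follows. The Beta-posterior setup is an assumption for illustration (the talk doesn't specify its model), but the structure — weights proportional to P(best) raised to a power, with low-probability doses temporarily zeroed out — is the one described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def allocation_weights(successes, n, k=1.0, floor=0.05, n_draws=10_000):
    """Randomization weights proportional to P(dose is best)^k.

    P(best) is estimated by Monte Carlo from independent
    Beta(1 + successes, 1 + failures) posteriors on each dose's
    response rate (an illustrative conjugate model, not necessarily
    the talk's).  Doses whose P(best) falls below `floor` are
    temporarily zeroed out until the next interim.  k > 1 chases
    the leading dose more aggressively; here k = 1."""
    s = np.asarray(successes)
    n = np.asarray(n)
    draws = rng.beta(1 + s, 1 + n - s, size=(n_draws, len(n)))
    p_best = np.bincount(draws.argmax(axis=1), minlength=len(n)) / n_draws
    w = np.where(p_best < floor, 0.0, p_best ** k)
    return w / w.sum()

# after a burn-in of 10 patients per dose
w = allocation_weights(successes=[3, 6, 8, 4], n=[10, 10, 10, 10])
```

With these illustrative counts, the weakest doses typically fall below the 5% floor and receive no patients until the next interim re-estimates their probabilities.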

is start with a burn-in of 10 patients per dose. And after that burn-in period,

we’ll estimate our model and we’ll update our allocation

probabilities every four weeks. So the other kind of

bells and whistles– there’s lots of different

ways you can do this– is, if any of our doses has less

than a 5% probability of being the best, we’re going to

temporarily just zero out that dose so that

it’s going to receive no patients until

the next interim, and we’ll reassess

that probability at the next interim. So effectively what that does

is a temporary arm dropping, but it’s not a

permanent arm drop. So it’s just zeroing

out that dose, and then as we start to get

more data, we’ll re-evaluate it. In these trials where we

do adaptive allocation, how we treat the control arm turns

out to be really important. So remember here, our final

analysis is analyzing relative to control– so we’re comparing

the response rate of our best dose to control, and in order

to make a good comparison there, we need control patients. So we want to make sure

that we don’t allocate away from control. So for example, we don’t

want to just put the control arm in with the

adaptive randomization and say, well, if the control

arm starts to perform poorly, we’ll allocate away from it. If we start to allocate away

from control, what that does is it hinders our ability

to make comparisons to the control at the end of the

trial and that hurts our power. So here, what we’re going

to do is fix the allocation to the control arm. So two out of every

eight subjects are going to be allocated

to the control arm, and then the remainder of

those subjects in the block will be allocated adaptively. So here are our results. Now we’ve got the basic
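One simple way to implement that blocking — a hypothetical helper, with the block size taken from the two-out-of-every-eight rule just described — is to hard-code the control slots and fill the rest from the current adaptive weights:

```python
import numpy as np

rng = np.random.default_rng(2)

def next_block(dose_weights, block_size=8, n_control=2):
    """Build one randomization block: control slots are fixed (two of
    every eight, as in the talk) and the remaining slots are drawn
    from the current adaptive weights over the active doses.
    Arm 0 is control; doses are labeled 1..len(dose_weights)."""
    w = np.asarray(dose_weights, float)
    w = w / w.sum()
    arms = [0] * n_control + list(
        rng.choice(np.arange(1, len(w) + 1),
                   size=block_size - n_control, p=w))
    rng.shuffle(arms)    # random order within the block
    return arms

block = next_block([0.10, 0.45, 0.35, 0.10])   # current weights over 4 doses
```

Fixing the control slots is what protects the end-of-trial comparison: however the dose weights move, two of every eight patients still go to control.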

trial, the NDLM trial, and now we’ve added the

adaptive allocation. And we can start to see

some large benefits again. Especially look here, at the scenario where the low dose is the best one. By allocating adaptively, we've gone from about 47% power to over 60% power. We see some gains in

our other scenarios. The positive scenario

went from 70 to 79% power. So because we’re now making more

efficient use of our resources, putting the subjects on the

doses that we care about, the ones that are

performing the best, we see some gains in the power. So again, there are some ethical gains here. The idea behind

adaptive randomization is to avoid exposing patients

to arms that aren’t performing well, and instead,

give the patients who were enrolled in the trial

the best chance of getting allocated to a good arm. So these trials can be a little

more complicated to set up. Careful planning,

careful calibration, making sure you’re

not making decisions too early, and some

operational details about doing these interims and changing

the adaptive allocation ratios, so those have to be

considered as well. Questions? All right. Last topic. Everybody tired? OK. So I want to talk about rare

diseases and more informative endpoints. So I’m going to do this

by way of an example. This is GNE myopathy,

which many of you may know more about than I do. But essentially this is

a progressive disease where your muscles are

gradually replaced by fat. And it goes essentially

from your feet up. So what happens early in

the course of disease, you basically lose muscles

around your ankles. It moves up to your

knees, through your hips, through your arms, and

it is eventually fatal. Very rare. Worldwide prevalence, four

to 21 out of a million. There's no treatment that helps in this particular indication. This is a challenging disease to do research on. And the reason is, not only is the prevalence low, but in addition, people are at different

stages of the disease. So people early in the

disease, an endpoint like six-minute walk

might be relevant because they’re losing

function in their ankles and they’re having trouble

walking and so forth. People late in the disease

are not ambulatory at all. So a six-minute walk is

irrelevant to people like them. Late in the disease, you may be interested in something like grip strength instead. So it's difficult to come up

with a clinical trial design that accounts for all of that. So what we did is we came up

with a disease progression model. So we essentially came up with a

statistical model that describes the entire disease course. We had natural history

data on 38 patients. Each one of them is at a

different place in the disease. So for example, this person,

their strength is a lot lower. I'm not going to go through all the details, but each of these curves shows a different kind of strength: knee extension, elbow flexion, shoulder strength, knee flexion. These are different components of the disease. You can see this

patient has completely lost knee and dorsiflexion strength, but still has strength in other areas. So each of these

patients, what we can do– I’m going to skip this model. What we essentially did is

we took that disease course and we said, this is what

happens to dorsiflexion over time. This is what happens to

your knees over time, to your grip over time, to

your hip extension over time. And what we were able to do

is we could map each patient onto their disease age. So some patients

might be in this area where if we were going to

follow them– this is in years. So if they’re in

this area, you can see that dorsiflexion

is the thing that’s going to give us the

most information. In this area of

the disease, this is a situation where the

hip extension or the grip gives us more information

about the progression of their disease. So what we tried to do is

take each person’s data and tried to map it

onto each of these. So this is the kind of fits that

we got to the natural history data. So this is one subject

analyzed over several years, and you can see essentially

these are the observations and this model is

fitting it quite well. So this is saying, if we know

your five or six strength measures, we can

essentially map you to disease age 10 or 20

or 30 depending on where you are in the disease. This also includes
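That mapping step can be sketched as a grid search: given one mean trajectory per endpoint from the natural-history model, find the disease age whose predicted profile best matches a patient's observed measures. The logistic decline curves and onset ages below are purely illustrative stand-ins, not the fitted model.

```python
import numpy as np

def declining(onset, rate=0.5):
    """Illustrative strength trajectory: falls from 1 toward 0
    around the given disease age (stand-in for a fitted curve)."""
    return lambda a: 1.0 / (1.0 + np.exp(rate * (a - onset)))

def disease_age(measures, curves, ages=np.linspace(0.0, 40.0, 401)):
    """Least-squares mapping of one patient's strength measures onto
    a single disease age, using one mean curve per endpoint."""
    sse = [sum((f(a) - m) ** 2 for f, m in zip(curves, measures))
           for a in ages]
    return float(ages[int(np.argmin(sse))])

# dorsiflexion is lost first; hip extension and grip go later
curves = [declining(8), declining(15), declining(22), declining(28)]
age = disease_age([0.05, 0.30, 0.80, 0.95], curves)
```

A patient who has lost the early endpoints but retains the late ones lands in the middle of the curve; the actual model also carries uncertainty around the estimate rather than a single point, as the uncertainty bars here show.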

the uncertainty bars around each of the

different ones. You can see, knee is

an interesting one. The knee behaves

very differently than the other measures. This is more fits. I recognize we’re a

little short on time. As an advertisement,

Melanie Quintana is giving a webinar on

disease progression models sometime early in the spring. There will be more detail on

this in that webinar as well. But the point is that this is a model that fits very

well when all is said and done to the natural history data. And it fits well even

though the patients are at different stages

of their disease. So in the trial design, with those six endpoints, the endpoint for the primary

analysis is the disease age. So for each patient, what we did

is we took those six measures and mapped them back to a

disease age for each patient, and the primary analysis

is basically measuring, can we halt the

progression of the disease? If you come in at

baseline and you– if Anna’s at disease progression

age 10, two years later, has she gone to 12 or

is she still at 10? And if it were somebody

else, if Doray came in at disease age 30, two years

later, is she still at 30 or is she at 32? And so it's that change in disease age over time that we were measuring as the primary endpoint. The trial has 50 patients,

they're enrolled three-to-one, treatment to placebo. And again, if you think of this

as the disease progression– this is the overall disease

age, what we’re really trying to measure

is are we halting the progression of the disease? So it might be that Anna going

from 10 to 12 over two years is full progression, Anna going

from 10 to 11 over two years, we’ve cut the

progression in half. That’s our measure,

our primary endpoint. The purpose of this is to
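The arithmetic of that endpoint is simple enough to write down directly — the patient numbers below just mirror the examples in the talk, and the real analysis is model-based rather than this per-patient calculation:

```python
import numpy as np

def progression_rate(age_baseline, age_followup, years=2.0):
    """Disease-years accumulated per calendar year.  1.0 means full
    natural-history progression (disease age 10 -> 12 over two years),
    0.5 means progression cut in half, 0.0 means progression halted."""
    return (np.asarray(age_followup, float)
            - np.asarray(age_baseline, float)) / years

placebo = progression_rate([10, 30, 18], [12, 32, 20])   # 1.0 per year
treated = progression_rate([10, 30, 18], [11, 31, 19])   # 0.5 per year
effect = treated.mean() / placebo.mean()                 # 0.5: halved
```

Each patient contributes on their own stretch of the disease course, which is how the six strength measures get pooled into one endpoint.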

combine all this information in the six endpoints

on strength and get us an adequately powered

trial, which this does. So this ends up–

you can see, if we’ve got a treatment effect of

50%, we've slowed the disease progression by half. The trial power is 98%. In fact, if you slowed

the disease down to 75%– so it’s 25% slower, we

still have about 75% power. So this is, again,

a trial that tried to work on a very small

number of endpoints, and I recognize I haven’t given

it full description in order to– since we are running

out of time here, but there is more on this. This trial is what

came out of the NIH program on rare and neglected diseases. It is currently being

run by NeuroNEXT or about to be run by NeuroNEXT

as one of their trials. Let’s see. So let me end here. Make sure I have questions. Please sign up for

the ICTR website. When you download slides, you

can sign up for an account. Our email addresses are on here. If you have any questions,

we’d be happy to answer those. And certainly in close,

anybody have any questions, comments to end? [INAUDIBLE] recorded this

within the next two weeks or so. [INAUDIBLE] or you want to share

something with a colleague, you’ll have the whole

audio/visual presentation [INAUDIBLE] to share. Thanks. Thank you all. [APPLAUSE]