The stories of epacadostat and bempegaldesleukin are well known, and nearly identical. Both were promising new immunomodulatory drugs for advanced melanoma with sensible mechanistic rationales, both were trialed in combination with an anti-PD-1, and both posted promising response rates in small single-arm phase II trials, only to fail to beat anti-PD-1 monotherapy in larger randomized controlled phase III trials.

The PD-(L)1 class of immunotherapies has been extraordinarily successful at treating a wide range of cancers and, as the examples above illustrate, has proven challenging to improve upon when combined with other immunotherapy agents (chemotherapy or kinase inhibitor combinations aside). Just a few weeks ago there was yet another high-profile setback as tiragolumab in combination with the anti-PD-L1 atezolizumab failed to improve outcomes in non-small cell lung cancer at an interim assessment, although I should note that there is still hope that the final results will be positive for overall survival.

It’s not as though dual immunotherapy combinations don’t work; there have been successes. Nivolumab + ipilimumab is the archetypal example, and nivolumab + relatlimab a recent successor, both approved in advanced melanoma. One of the common criticisms of the failed combinations is that they were taken into pivotal trials with scant evidence of single-agent activity, and hence the failures were unsurprising. That said, tiragolumab + atezolizumab had positive randomized phase II data, and the relatlimab combination seems to work despite LAG3 antibodies having little single-agent activity[1].

This all serves to highlight the challenges associated with evaluating whether a combination is really better than the monotherapy alone based on data from early trials. But I do think the techniques of probability and causal inference can help to make informed judgments about response rate data in single-arm phase II oncology combination trials, which is what I want to explore in this post.

The overall response rate (ORR)

The overall (or objective) response rate (ORR) is usually the primary way in which the efficacy of a new oncology drug is judged in early stage clinical trials. The standard definition of ORR is the proportion of patients with a complete or partial response to therapy, which are defined per the RECIST 1.1 guidelines for solid tumours as:

  • Complete Response (CR): “Disappearance of all target lesions. Any pathological lymph nodes (whether target or non-target) must have reduction in short axis to <10 mm”
  • Partial Response (PR): “At least a 30% decrease in the sum of diameters of target lesions, taking as reference the baseline sum diameters”

Response criteria are somewhat different for liquid tumours and for immunotherapies vs. chemotherapeutic agents, but I’ll ignore that complication for now as it’s not strictly relevant for this post.

ORR is a surrogate for the more important outcomes of overall survival and quality of life. However, because it can be assessed relatively quickly, it’s commonly used as the primary endpoint for the short, small, and comparatively inexpensive phase I/II trials that are run to justify funding larger and longer phase III trials with survival-based endpoints. Generally, a 20% ORR for single agents[2] or a 20% relative improvement in ORR on top of standard of care (i.e., in combination) is the minimal proof-of-concept threshold that a drug should hit in phase II to justify advancement to phase III.
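To make the combination threshold concrete, here is a rough worked example. It assumes that “relative” means a multiplicative improvement over the standard of care ORR, and it borrows the 33% ORR reported for nivolumab alone in RELATIVITY-047 purely as an illustrative benchmark:

\[ORR_{combo} \geq 1.2 \times ORR_{SoC} = 1.2 \times 33\% \approx 40\%\]

Under that reading the bar is about 40%, whereas an absolute improvement of 20 percentage points would require 53%. The two interpretations diverge substantially, so it matters which one is meant.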

It’s a common scenario for companies to run a small single-arm phase II trial, tout an ORR value that appears superficially impressive when compared to a standard of care benchmark, then fail in phase III when the promising early efficacy doesn’t hold up in a larger trial. Sometimes these failures can be explained by ORR being an imperfect surrogate endpoint with a weak or absent correlation with overall survival[3][4][5]. However, in other cases the most parsimonious explanation is simply mean reversion as sample sizes increase.

Novel combination trials are particularly prone to false positives in small trials. When a single-agent drug is tested in isolation in a single-arm phase II trial, any responses are very likely to indicate true efficacy because there is no confounding with other known active agents and cancers rarely regress spontaneously[6]. However, in combination trials it can be hard, if not impossible, to disentangle the efficacy of the new drug from that of a combination partner with known activity in that setting. There is a fundamental credit assignment challenge in attributing a response to the established drug, the new drug, or “synergistic interactions” (the latter of which may not even exist).

Throughout this post I’m going to think about ways to assess ORRs in early trials, focusing on metastatic melanoma (the “proving ground” for new immunotherapy agents) and the case of epacadostat, one of the earliest and highest-profile immunotherapy combination failures. For background (and as I’ll be referencing these values throughout the post), I’ve compiled a non-exhaustive selection of ORR data from key trials of immunotherapy drugs in first-line advanced melanoma in the table below. There are some complexities with different criteria for ORRs which can confound inter-trial comparisons, but I’ll ignore that because it’s convenient to do so.

| Regimen | ORR | Trial | Phase | Trial start year |
| --- | --- | --- | --- | --- |
| Ipilimumab + dacarbazine | 17% (42/250) | NCT00324155 [7] | III | 2006 |
| Ipilimumab | 17% (31/181) | KEYNOTE-006 [8] | III | 2013 |
| Ipilimumab | 19% (60/315) | CheckMate 067 [9] | III | 2013 |
| Ipilimumab + epacadostat | 23% (9/39) | NCT01604889 [10] | II | 2012 |
| Nivolumab | 33% (117/359) | RELATIVITY-047 [11] | III | 2018 |
| Pembrolizumab | 32% (111/352) | ECHO-301/KN-252 [12] | III | 2016 |
| Pembrolizumab + epacadostat | 34% (121/354) | ECHO-301/KN-252 [12] | III | 2016 |
| Nivolumab + relatlimab | 43% (153/355) | RELATIVITY-047 [11] | III | 2018 |
| Nivolumab | 44% (138/316) | CheckMate 067 [9] | III | 2013 |
| Pembrolizumab | 46% (170/368) | KEYNOTE-006 [8] | III | 2013 |
| Nivolumab + bempegaldesleukin | 53% (20/38) | PIVOT-02 [13] | II | 2016 |
| Pembrolizumab + epacadostat | 56% (25/45) | ECHO-202 [14] | II | 2014 |
| Nivolumab + ipilimumab | 58% (181/314) | CheckMate 067 [9] | III | 2013 |
| Nivolumab + epacadostat | 65% (26/40) | ECHO-204 [15] | II | 2014 |
| Pembrolizumab + lifileucel | 67% (6/9) | NCT03645928 [16] | II | 2019 |

ORR as a binomially distributed variable

Because response is a binary outcome - either a patient achieves a PR or CR, or they do not - we can use the binomial distribution to model the number of responders in a trial, and hence the distribution of ORR outcomes. The probability of a given drug achieving a particular response rate in a trial is then given by:

\[P(ORR) = \binom{s}{r}p^{r}(1-p)^{s-r}\]

Where \(p\) is the probability of a PR or CR (i.e. the ORR), \(s\) is the sample size and \(r\) is the number of responses. An important assumption I make here is that all drugs have a “true” invariant response rate for a particular population (henceforth referred to as the \(tORR\)), and that the distribution of results in a given trial is randomly determined by this underlying probability (so \(p = tORR\) in the above formula). To calculate the chance of a drug achieving an ORR equal to or better than a specific value, we need to sum the results for every equal or greater value of \(r\), up to and including \(r = s\) (i.e., everyone in the trial responds).
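The tail sum described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names `binom_pmf` and `prob_at_least` are my own, and the choice of 32% as the \(tORR\) (the pembrolizumab monotherapy ORR from ECHO-301/KN-252) is an assumption made purely for the example.

```python
from math import comb


def binom_pmf(r, s, p):
    """Probability of exactly r responses among s patients at response rate p."""
    return comb(s, r) * p**r * (1 - p) ** (s - r)


def prob_at_least(r, s, p):
    """Probability of r or more responses: sum the pmf from r up to and including s."""
    return sum(binom_pmf(k, s, p) for k in range(r, s + 1))


# ECHO-202 reported 25 responses in 45 patients (see the table above).
# Illustrative assumption: the combination is no better than pembrolizumab
# alone, so tORR = 0.32.
chance = prob_at_least(25, 45, 0.32)
print(f"P(at least 25/45 responses | tORR = 0.32) = {chance:.4f}")
```

Swapping in a different \(tORR\) shows how sensitive the answer is to the benchmark chosen, which is the crux of the problem with cross-trial comparisons.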

Using the binomial distribution is a helpful starting point for analyzing single-arm trials because it allows us to quantify how likely a particular result is at specific assumptions of \(tORR\). For example, I’ve plotted results from a number of example trials below (all melanoma except for magrolimab, which is in myelodysplastic syndrome). Each line shows the probability of achieving the result in the legend if the \(tORR\) was the value shown on the x-axis - naturally the most likely result is the one that was actually achieved.
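For readers who want to reproduce this kind of curve, the sketch below evaluates the probability of a fixed observed result across a grid of assumed \(tORR\) values. The helper names (`likelihood`, `likelihood_curve`) and the grid resolution are my own choices for illustration, and I use the ECHO-204 result (26/40) from the table as the example.

```python
from math import comb


def likelihood(r, s, t_orr):
    """Probability of exactly r responses in s patients if the true ORR is t_orr."""
    return comb(s, r) * t_orr**r * (1 - t_orr) ** (s - r)


def likelihood_curve(r, s, steps=100):
    """Evaluate the likelihood on an evenly spaced grid of tORR values in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return [(t, likelihood(r, s, t)) for t in grid]


# ECHO-204: 26 responses in 40 patients.
curve = likelihood_curve(26, 40)

# The curve peaks where the assumed tORR matches the observed ORR (26/40).
peak_torr, peak_prob = max(curve, key=lambda point: point[1])
print(peak_torr)  # 0.65
```

The list of `(tORR, probability)` pairs can be passed to any plotting library; the width of the resulting curve gives a feel for how many different \(tORR\) values are reasonably compatible with a result from a trial of that size.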