If you’ve ever seen (or built) a patient-based prevalence model to forecast revenue for a pharmaceutical product, “proportion of patients diagnosed” was probably one of the inputs used to estimate the number of potential patients likely to take that drug. However, in my experience good information on the true prevalent diagnosed proportion of a particular condition is rarely available, and in lieu of better data some token assumption in the range of 80-95% is used. Sometimes you can use published cross-sectional prospective screening studies to find a solid estimate of the true proportion of diagnosed patients, but more often than not you’re forced to make an educated guess.

I don’t want to use unjustified and arbitrary assumptions in forecasts if I can avoid it, so I tried to explore the topic and learn a bit more about it, guided by three main questions:

  • If I don’t have access to the true value of the proportion of diagnosed patients, what other variables can I use to estimate that value and how important are they?
  • Average or median time to diagnosis is much easier to come by in the literature than “proportion of patients diagnosed”. What impact does a faster or slower diagnosis have on the proportion of patients diagnosed?
  • What do I need to believe to find a particular value of “proportion of patients diagnosed” in a model plausible?

My approach was to break the problem down into a few variables that seem influential in determining the “proportion of patients diagnosed”, then sketch a simple model and use values for those variables from the literature to identify likely bounds on plausible values for the “proportion of patients diagnosed”. I walk through my approach over the next few sections, but if you want to skip to my takeaways they’re at the end of the post.

A few definitions to start

  • “Patients” are people with a specific medical condition, whether or not they have a confirmed diagnosis
  • For brevity I will refer to the “proportion of patients with a confirmed diagnosis” as \(p(Dx+)\), which is equivalent to the probability that a randomly chosen patient has a confirmed diagnosis at the time of selection. \(p(Dx+)\) is calculated by dividing the number of living patients with a confirmed diagnosis by the total number of living patients, whether or not they have a confirmed diagnosis
  • \(p(Dx-)\), the “proportion of patients without a confirmed diagnosis”, is equal to \(1 - p(Dx+)\)
  • Diagnostic delay is the “time interval between the onset of symptoms and confirmed diagnosis of a disease”1

A simple model of the path to diagnosis

If you take a cross-sectional sample of a population with a given chronic disease, patients will either have a confirmed diagnosis at that specific time, or they won’t. But a simple value of \(p(Dx-)\) captures two very different groups of people:

  • Patients who are in the process of getting a diagnosis, and will get it some time in the future if they survive long enough to complete the diagnostic process
  • Patients who will never get a confirmed diagnosis because they are not seeking one and/or not being screened (perhaps they have an asymptomatic condition and have no particular reason to visit a doctor)

The question I then had was “Which is a bigger contributor to low rates of diagnosis: a slow diagnostic pathway or non-care-seeking behaviour?”

I’ve sketched out a toy model to help outline the contributors to a particular value of \(p(Dx+)\) below, assuming that patients can be in one of four distinct states:

  • Undiagnosed, seeking treatment (\(Seeking\ Dx\))
  • Undiagnosed, not seeking treatment (\(Not\ seeking\ Dx\))
  • Diagnosed (\(Dx+\))
  • Dead

In this toy model the proportion diagnosed depends on three factors: the treatment seeking rate (\(S\)), the diagnostic delay (\(t\)) and the mortality rate (\(M\)). We can treat the \(Seeking\ Dx\) group as a sort of queue; if patients are able to remain in that group for \(t\) time steps without succumbing they will enter the \(Dx+\) group. So the toy model reveals a natural opposition between the mortality rate and the rate of diagnosis, and shows at a high level how \(p(Dx+)\) depends on diagnostic delay.

The longer the diagnostic delay, the lower the chance that someone is diagnosed before they die (whether or not the death is directly related to the disease), and the more people who are in the diagnostic pathway relative to the diagnosed population.
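
To make the queue idea concrete, here is a minimal sketch of the toy model as a deterministic cohort simulation. It rests on assumptions that are not stated above: a constant number of new patients at every time step, a fixed whole-number delay, and identical mortality in every state. The function name, parameter names and input values are all hypothetical and chosen only for illustration.

```python
def simulate_proportion_diagnosed(seeking_rate, delay, mortality_rate,
                                  steps=2000, new_patients=1000.0):
    """Run the four-state toy model and return diagnosed / living patients.

    Assumptions (not from the post): a constant number of new patients per
    step, a fixed integer delay >= 1, and the same mortality in every state.
    """
    if delay < 1:
        raise ValueError("delay must be a whole number of steps >= 1")
    survive = 1.0 - mortality_rate
    queue = [0.0] * delay   # queue[i]: seeking patients who have waited i steps
    not_seeking = 0.0
    diagnosed = 0.0
    for _ in range(steps):
        # Every living patient faces the same per-step mortality.
        queue = [n * survive for n in queue]
        not_seeking *= survive
        diagnosed *= survive
        # Patients at the end of the queue have now waited `delay` steps.
        diagnosed += queue.pop()
        # New patients either join the front of the queue or never seek care.
        queue.insert(0, new_patients * seeking_rate)
        not_seeking += new_patients * (1.0 - seeking_rate)
    living = sum(queue) + not_seeking + diagnosed
    return diagnosed / living


# Hypothetical inputs: 5% mortality per step, 90% of patients seek a diagnosis.
short_delay = simulate_proportion_diagnosed(0.9, delay=2, mortality_rate=0.05)
long_delay = simulate_proportion_diagnosed(0.9, delay=8, mortality_rate=0.05)
print(f"delay=2: {short_delay:.3f}, delay=8: {long_delay:.3f}")
```

Running the two cases side by side shows the direction of the effect: with everything else held fixed, the longer delay leaves a smaller share of living patients in the diagnosed state.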

An equation for p(Dx+)

If you make a few (likely untrue, albeit useful) simplifying assumptions you can make use of a fairly simple equation to calculate \(p(Dx+)\) at steady state, based on the structure of the toy model above:

  • First, I assume that time to diagnosis is symmetrically distributed (like a normal distribution). With this assumption I can represent the diagnosis rate as the average or median time to diagnosis (\(t\)), because in a symmetric distribution a patient is just as likely to be diagnosed earlier than average as they are to be diagnosed later
  • Second, I assume that all patients have the same mortality rate, regardless of whether or not they have a diagnosis

The equation is then:

\[p(Dx+) = (1 - M)^t \times S\]

What this equation is effectively saying is that at each time step a fraction \(M\) of patients in the undiagnosed population will die, and the remaining fraction \(1 - M\) will survive. By the time you have waited for \(t\) time steps, the proportion of patients that remain, i.e. \((1 - M)^t\), will be the diagnosed population (since everyone dies at the same rate in our model). Then, you remove people who aren’t seeking treatment by multiplying by \(S\) to get to a final estimate of the proportion diagnosed.
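
As a quick worked example, the equation can be wrapped in a small function. This is only a sketch: the function and parameter names are hypothetical, and the inputs (5% mortality per time step, a delay of 2 or 5 time steps, 90% of patients seeking a diagnosis) are made-up values for illustration rather than figures from the literature.

```python
def proportion_diagnosed(mortality_rate, delay, seeking_rate=1.0):
    """p(Dx+) = (1 - M)^t * S under the simplifying assumptions above.

    mortality_rate (M): share of patients who die per time step
    delay (t): average or median time to diagnosis, in the same time steps
    seeking_rate (S): share of patients who seek a diagnosis
    """
    if not 0.0 <= mortality_rate <= 1.0:
        raise ValueError("mortality_rate must be between 0 and 1")
    if not 0.0 <= seeking_rate <= 1.0:
        raise ValueError("seeking_rate must be between 0 and 1")
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return (1.0 - mortality_rate) ** delay * seeking_rate


# Illustrative inputs only; none of these numbers come from a real condition.
fast = proportion_diagnosed(mortality_rate=0.05, delay=2, seeking_rate=0.9)
slow = proportion_diagnosed(mortality_rate=0.05, delay=5, seeking_rate=0.9)
print(f"t=2: {fast:.3f}")  # 0.9025 * 0.9, about 0.812
print(f"t=5: {slow:.3f}")  # 0.7738 * 0.9, about 0.696
```

With these made-up inputs, stretching the delay from 2 to 5 time steps lowers the estimate from roughly 81% to roughly 70%, while \(S\) caps the result at 90% no matter how fast diagnosis is.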

If you set \(S\) to 1 and plot \(p(Dx+) = (1 - M)^t\) you get a graph like the one below, which visualizes the opposition between the mortality rate and rate of diagnosis I mentioned earlier.