Dr. Taylor is doing a study to determine the mean duration of flu symptoms in patients in her practice during the yearly flu season. She calls each of the 50 patients in her practice known to have had flu symptoms in the last week to ask whether they are currently experiencing flu symptoms.

If a patient has symptoms, she asks how long he or she has been experiencing them and follows up to determine the total length of time the patient had symptoms. Otherwise, she records no data. The following dataset represents the simulated data that were collected:

**Note**: Start and end dates are relative to day "0," durations are in days.

Only 30 of the 50 patients were still experiencing symptoms and so appear in the dataset. The other 20 are unaccounted for, since Dr. Taylor was doing a prevalence study and did not want to collect retrospective data. The following is a plot of the longitudinal lines of each of the 50 patients:

Dr. Taylor calculates the mean of the sampled durations to be 5.6 days, and wants to conclude that the mean duration of flu symptoms in her patients is higher than the average duration that the CDC predicts (suppose arbitrarily that the predicted range for the particular year is 2-5 days). However, to do so would be a mistake because there is a bias in the data collected.
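To see where the bias comes from, the sampling scheme can be simulated. The following is only a minimal sketch and not the attached simulation program: it assumes onsets spread uniformly over a 7-day week, a phone survey at the end of that week, and durations drawn uniformly from 1 to 8 days. The names `simulate_practice` and `mean` are hypothetical.

```python
import random


def simulate_practice(n_patients=50, week=7.0, max_duration=8, seed=0):
    """Simulate one week of flu onsets and a survey at the end of the week.

    Assumptions (not taken from the attached program): onsets are uniform
    over [0, week] and durations are uniform integers 1..max_duration.
    """
    rng = random.Random(seed)
    all_durations, sampled_durations = [], []
    for _ in range(n_patients):
        start = rng.uniform(0.0, week)
        duration = rng.randint(1, max_duration)
        all_durations.append(duration)
        # A patient enters the sample only if still symptomatic when called.
        if start + duration > week:
            sampled_durations.append(duration)
    return all_durations, sampled_durations


def mean(values):
    return sum(values) / len(values)


# A large practice makes the gap between the two means easy to see.
all_d, sampled_d = simulate_practice(n_patients=20_000)
print(f"mean over all patients:     {mean(all_d):.2f}")
print(f"mean over sampled patients: {mean(sampled_d):.2f}")
```

Under these assumptions a patient with a short illness has often recovered before the call, so short durations are under-represented and the sampled mean comes out larger than the mean over all patients.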

Suppose we are trying to catch a bus that comes at roughly 10-minute intervals throughout the day (with some variation due to compounding factors). Further suppose that the intervals between buses are independent of one another; that is, one bus being late or early does not imply that the next bus will be late or early. If we arrive at the bus station at a random time, how long should we expect to wait?

The simple answer seems to be that we should expect to wait 5 minutes, as it's likely we'll arrive in the middle of some interval. However, this is not the case.

If we arrive at a random time, we are more likely to arrive during a longer interval than a shorter one. This is like throwing a dart at a dartboard without looking: we would be much more likely to hit one of the larger partitions than one of the smaller ones (in fact, this is the physical counterpart to length bias and is called size bias). Within the interval we land in, we wait on average half its length, but that interval tends to be longer than a typical one. In general the expected wait is E[X^2] / (2 E[X]), where X is the gap between consecutive buses. If the gaps are exponentially distributed with a mean of 10 minutes (buses arriving as a Poisson process), the expected wait time for a bus is, quite strangely, ten minutes!
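The waiting-time result can be checked numerically. The sketch below assumes exponentially distributed gaps with a mean of 10 minutes and compares them with perfectly regular 10-minute gaps. The function name `mean_wait` and its parameters are hypothetical.

```python
import bisect
import random


def mean_wait(draw_gap, n_buses=50_000, n_riders=50_000, seed=0):
    """Average wait of riders who arrive at uniformly random times.

    draw_gap(rng) returns the gap, in minutes, until the next bus.
    """
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_buses):
        t += draw_gap(rng)
        arrivals.append(t)
    total = 0.0
    for _ in range(n_riders):
        r = rng.uniform(0.0, arrivals[-1])
        # First bus that arrives at or after the rider.
        next_bus = arrivals[bisect.bisect_left(arrivals, r)]
        total += next_bus - r
    return total / n_riders


# Assumed gap distributions: exponential with mean 10, and exactly 10.
irregular = mean_wait(lambda rng: rng.expovariate(1 / 10.0))
regular = mean_wait(lambda rng: 10.0)
print(f"exponential gaps: {irregular:.2f} min, regular gaps: {regular:.2f} min")
```

With regular gaps the intuitive answer of about 5 minutes holds. The extra waiting time appears only when the gaps vary, because random arrivals land in long gaps more often than in short ones.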

A more detailed mathematical explanation of the bus waiting-time "paradox" can be found here: http://mahalanobis.twoday.net/stories/3486587/

Another example of length-biased sampling in real studies comes in testing the effectiveness of a drug or procedure in prolonging the life of cancer patients. In these studies, if a cross-section of patients on a certain drug or procedure is taken as the sample, we are more likely to sample individuals with less severe cancers, who naturally have a longer survival time. Thus, the effectiveness of the drug or treatment can be overestimated in these cases.

An example of such a problem is discussed in the paper:

"Biological heterogeneity and length-biased sampling in asymptomatic neurosurgical patients," Y. Yoshimoto & Y. Tanaka

The first indicator of length bias is that, on average, the number of patients in the categories with later start dates is higher than in the categories with earlier start dates. This is a problem because we expect the start dates of patients to be relatively uniform throughout the 7 days. It means we are probably sampling only the longer durations from the categories with earlier start dates.

Also, as we suspected, the mean durations per category seem to be higher (on average) for the sampled patients whose symptoms started closer to day zero. If this were not clear, we could verify it by linear regression, looking for a decreasing trend in the plot of start date versus mean duration. In fact, for this case the slope of the regression line is -0.5009. Our conclusion is that there is some length bias in our sampling procedure.
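The regression check can be sketched as follows. The per-category values here are made up for illustration and are not Dr. Taylor's data, so the slope will not match the -0.5009 reported above. The function `ols_slope` and the variable names are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical categories: start day and mean sampled duration (days).
start_day = [0, 1, 2, 3, 4, 5, 6]
mean_duration = [8.0, 7.0, 7.0, 6.0, 5.5, 5.0, 4.5]

slope = ols_slope(start_day, mean_duration)
print(f"slope: {slope:.4f}")  # a negative slope points to length bias
```

A clearly negative slope means that patients sampled from earlier start dates have longer durations on average, which is the pattern length bias produces.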

The next step is to come up with the weights on our measurements. In our case, it is convenient to weight our measurements in terms of the largest measurement (8 days). For example, we are eight times as likely to observe a measurement of length "8" as a measurement of length "1." Thus, we weight the number of measurements of length 1 day (in our case, none) by a factor of 8. The generalized form of our weights in this study is weight = 8/duration.

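So in order to find our corrected mean, we use the weighted mean of the sampled durations $d_i$, with weights $w_i = 8/d_i$:

$$\bar{d}_{\text{corrected}} = \frac{\sum_i w_i d_i}{\sum_i w_i}$$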

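The idea behind this corrected mean is that it up-weights the number of short observations and therefore the total duration per category. Another way to think of it is that in our weighted mean calculation we are increasing the proportion of shorter observations. We then divide by the weighted number of observations to calculate the weighted average duration per patient. In this particular case, since our weights are based on the length of the durations, our formula happens to simplify to

$$\bar{d}_{\text{corrected}} = \frac{n}{\sum_{i=1}^{n} 1/d_i}$$

where $n$ is the number of sampled patients and $d_i$ is the duration observed for patient $i$. Each weighted duration (8/duration times duration) equals 8, and the factor of 8 cancels between numerator and denominator.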

This simplification relies on the weights being inversely proportional to the durations. With a different choice of weights, the simplified formula may not hold.
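The weighted mean with weight = 8/duration can be sketched in Python as below. The sample durations are hypothetical and chosen only to keep the arithmetic easy to follow. The function names `corrected_mean` and `harmonic_mean` are likewise illustrative.

```python
def corrected_mean(durations, longest=None):
    """Weighted mean with weight = longest / duration for each observation."""
    if longest is None:
        longest = max(durations)
    weights = [longest / d for d in durations]
    weighted_total = sum(w * d for w, d in zip(weights, durations))
    return weighted_total / sum(weights)


def harmonic_mean(durations):
    """Simplified form: number of observations over the sum of reciprocals."""
    return len(durations) / sum(1 / d for d in durations)


# Hypothetical sampled durations in days (not Dr. Taylor's data).
sample = [2, 4, 4, 8, 8, 8]

plain = sum(sample) / len(sample)
print(f"uncorrected mean: {plain:.2f}")
print(f"corrected mean:   {corrected_mean(sample):.2f}")
print(f"simplified form:  {harmonic_mean(sample):.2f}")
```

The corrected mean and the simplified form agree, and both are smaller than the uncorrected mean. The simplified form is the harmonic mean, which Python's standard library also provides as `statistics.harmonic_mean`.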

Now it is time to compare the uncorrected mean, the corrected mean, and the true mean durations of the patients' flu symptoms in Dr. Taylor's sample (the data used to calculate these quantities and a program to calculate them are attached).

The uncorrected mean was already noted to be 5.6 days, the corrected mean turns out to be 4.62 days, and the true mean (which we know since this study is a simulation and we have access to the data from all of the patients not in the sample) is 4.70 days. Thus, our corrected mean is much closer to the true mean of the population and we have a mean duration of flu symptoms in Dr. Taylor's practice which is within the CDC's predicted range.

| Type | Attachment | Size | Date | Who | Comment |
|---|---|---|---|---|---|
| txt | allpatients.txt | 1.6 K | 21 Jul 2010 - 15:05 | ErikGregory | Data from all 50 patients |
| png | categories.png | 3.8 K | 21 Jul 2010 - 12:03 | ErikGregory | Categories based on start dates |
| png | durcategories.png | 2.5 K | 21 Jul 2010 - 12:24 | ErikGregory | Categories based on durations |
| png | longitudinallines.png | 11.3 K | 21 Jul 2010 - 10:34 | ErikGregory | Longitudinal lines chart |
| png | patientslongitude.png | 11.5 K | 21 Jul 2010 - 10:51 | ErikGregory | |
| png | sampled.png | 13.6 K | 20 Jul 2010 - 16:59 | ErikGregory | Data from all sampled patients (30) |
| txt | simulationprogram.txt | 2.6 K | 21 Jul 2010 - 15:02 | ErikGregory | Simulation Program for data and analysis |

Topic revision: r3 - 01 Aug 2011 - 12:59:28 - MaryBanach

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding CTSPedia? Send feedback
