LogTransformation (12 Jun 2013, PeterBacchetti)

Statistical models are sometimes more meaningful and accurate if outcome or predictor variables are transformed, and a common choice for transforming variables is to apply logarithmic transformation. This may be appropriate when the variable only takes on positive values, and results are easier to interpret than with most other types of transformations. Considerations for whether and how to transform differ for outcome and predictor variables.

For example, CD4 count among persons with HIV infection does not reflect health implications accurately without transformation, because a 100 cell difference between 10 and 110 cells/mm3 is much more important than a 100 cell difference between 700 and 800 cells/mm3--110 is bad but 10 is dire, while both 700 and 800 are good and have very similar prognostic implications. After logarithmic transformation, the difference between 10 and 110 is 18 times larger than the difference between 700 and 800, which may better reflect the importance of the differences. This is true regardless of what base is used for the logarithm (e.g., natural log or log base 10), and the base generally does not matter for outcome variables, because results will be back-transformed.
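The arithmetic above can be checked directly; a minimal sketch (the specific CD4 values are those from the example):

```python
import math

# Raw 100-cell differences at opposite ends of the CD4 range (cells/mm3)
low_diff = math.log(110) - math.log(10)    # 10 vs 110: dire vs bad
high_diff = math.log(800) - math.log(700)  # 700 vs 800: both good

# On the log scale, the low-end difference is about 18 times larger
print(round(low_diff / high_diff, 1))  # 18.0

# The ratio is the same for any base, e.g., log base 10
low10 = math.log10(110) - math.log10(10)
high10 = math.log10(800) - math.log10(700)
print(round(low10 / high10, 1))  # 18.0
```

Because changing the base multiplies every logarithm by the same constant, the ratio of differences is base-invariant, which is why the base does not matter for outcome variables.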

Secondary issues for transforming outcome variables are 1) more closely approximating the assumption for linear regression modeling that the residuals are normally distributed, and 2) preventing a few observations from being extremely influential. Issue 1) is secondary because linear regression may still work reasonably well for some departures from normality and bootstrapping or robust standard errors can be used to obtain valid confidence intervals despite non-normal residuals. Issue 2) is secondary because the influence of a few large values can be assessed by performing sensitivity analyses with them deleted or by resetting them to smaller values (sometimes known as "Winsorizing").

Moreover, it may be appropriate for some observations to be very influential. For example, in studies of costs of medical care, patients' total costs are often highly skewed and a few patients who experience complications may have much higher costs. Logarithmically transforming costs, however, is often inappropriate, because very expensive patients really do have a large influence on total health care expenditures and downweighting their influence is inaccurate. Regression modeling of the untransformed costs will estimate the additive effects of predictors on mean costs, which will accurately reflect impact. In contrast, logarithmically transforming costs would estimate the multiplicative effects of predictors on the geometric mean, which is not directly relevant for total expenditures--the total cost for a group of N patients is the raw mean times N, not the geometric mean times N, so the bottom line for hospitals and ultimately for society as a whole is better reflected by a focus on the raw mean.
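The Winsorizing idea mentioned above can be sketched as follows; the cost values and the cap of 10,000 are arbitrary illustrative choices, and in practice the cap would be chosen from the data (e.g., a high percentile):

```python
def winsorize_upper(values, cap):
    """Reset values above `cap` to `cap` (upper Winsorizing).

    `cap` is an analysis-specific threshold; the value used below
    is purely illustrative.
    """
    return [min(v, cap) for v in values]

# Hypothetical cost data with one extreme observation
costs = [1200, 3400, 2100, 98000, 2600]
print(winsorize_upper(costs, 10000))  # [1200, 3400, 2100, 10000, 2600]
```

A sensitivity analysis would then compare results from the raw and Winsorized versions of the outcome to gauge how much the extreme observations drive the estimates.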

When the outcome has been logarithmically transformed, the coefficients from a regression model should usually be back-transformed to show either percentage effects or fold-effects (these are two different ways of describing multiplicative effects). The percentage effect is calculated from the regression coefficient as

pctEstimate = 100*(antilog(coefficient) - 1).

This is interpreted as: each 1 unit increase in the predictor is estimated to be associated with a pctEstimate% change in the outcome variable. The fold-effect is calculated as:

fldEstimate = antilog(coefficient).

This is interpreted as: each 1 unit increase in the predictor is estimated to be associated with a fldEstimate-fold change in the outcome variable.

Because very large percentages can be confusing to some readers, such as a 200% increase being mistakenly interpreted as doubling, a reasonable practice is to report percentage effects when all are <100% (2-fold), and to report fold-effects when some are >=100%. With either approach, the antilog used should match the base of the logarithm used to transform the outcome variable: if natural logs were used, then the antilog is exp(coefficient); if log base 10 was used, then the antilog is 10^coefficient.
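The two back-transformations above can be written out directly; a minimal sketch, with the coefficient values chosen purely for illustration (the `base` argument matches the base of the logarithm used to transform the outcome):

```python
import math

def pct_estimate(coefficient, base=math.e):
    # Percentage effect per 1-unit increase in the predictor:
    # 100 * (antilog(coefficient) - 1)
    return 100 * (base ** coefficient - 1)

def fold_estimate(coefficient, base=math.e):
    # Fold-effect per 1-unit increase in the predictor:
    # antilog(coefficient)
    return base ** coefficient

# Natural-log-transformed outcome, illustrative coefficient of 0.25:
print(round(pct_estimate(0.25), 1))  # 28.4 -- a 28.4% increase per unit

# Illustrative coefficient of 1.1: the percentage effect exceeds 100%,
# so reporting the fold-effect is clearer
print(round(fold_estimate(1.1), 2))  # 3.0 -- about a 3-fold increase
```

If the outcome had instead been transformed with log base 10, the same functions would be called with `base=10`.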

Logarithmic transformation makes differences between large values less important and differences between small values more important. Sometimes, the importance of differences between very small values may be exaggerated by logarithmic transformation. In the case of CD4 counts, the difference between counts of 1 and 4 cells/mm3 is not very important because both are very dire, but logarithmic transformation would treat this as being comparable to the difference between counts of 100 and 400, which is a very substantial difference. To prevent this, one can modify the transformation by first adding a small amount. For example, log(10 + CD4) is a more reasonable transformation when some counts are very small. This treats the difference between 1 and 4 cells/mm3 as being comparable to the difference between 100 and 130 cells, which is more reasonable than treating it as comparable to the difference between 100 and 400 cells.

Exactly what amount to add before taking logarithms is an important decision, and the best choice will differ for different variables. As with the choice of whether to transform or not, the primary consideration is to best reflect what is important about the outcome. The added amount should be large enough that differences between very small values are not given exaggerated importance. A secondary consideration is that interpretation is easier if the amount is small enough that the percentage or fold-effect interpretation of the back-transformed coefficients is still approximately correct over most of the range of the outcome's values. In the CD4 count example, the differences between transformed counts of 100 versus 200 or 300 versus 600 still correspond fairly well to the logarithm of 2, reflecting a 2-fold difference, so the simple interpretation of back-transformed coefficients remains about correct for most CD4 counts. (Note that the back-transformation remains as above; do not subtract 10 after taking antilogs.) For situations where 1 is a small value of the outcome, the transformation log(1 + outcome) is a common choice.
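The shifted transformation from the CD4 example can be verified numerically; a minimal sketch (the shift of 10 is the value from the example, and would be chosen per variable):

```python
import math

def shifted_log(x, shift=10):
    # log(shift + x); the shift of 10 follows the CD4 example above
    return math.log(shift + x)

# With the shift, the 1-vs-4 gap matches the 100-vs-130 gap,
# because (10+4)/(10+1) = (10+130)/(10+100) = 14/11
print(round(shifted_log(4) - shifted_log(1), 3))      # 0.241
print(round(shifted_log(130) - shifted_log(100), 3))  # 0.241

# The simple fold interpretation still roughly holds for larger counts:
# a true 2-fold difference transforms to nearly log(2)
print(round(shifted_log(200) - shifted_log(100), 3))  # 0.647, vs log(2) = 0.693
```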

Adding a small amount also makes it possible to apply logarithmic transformation to a variable that is zero for some observations.

For predictor variables, the main purpose of transformation is to more accurately model the association of the predictor with the outcome. In regression models, numeric predictors are assumed to have a linear relationship with the outcome, and this assumption may be more accurate for a logarithmically transformed predictor than for the raw untransformed predictor. For example, a Cox proportional hazards model of mortality risk among persons with HIV infection with a predictor being the untransformed CD4 count will assume that each 100 cell difference in CD4 is associated with a reduction in the hazard by the same factor, whether it is a difference between 10 and 110 cells or between 700 and 800 cells. This is unlikely to be accurate. With logarithmic transformation of CD4, the assumption would be that a 2-fold difference in CD4 is associated with a reduction in the hazard by the same factor, whether it is a difference between 100 and 200 cells or between 500 and 1000 cells. This may be more accurate, and whether it is can be assessed empirically by seeing whether logarithmic transformation improves the fit to the data. As with outcome transformation, modification by adding a small amount may be desirable.

Unlike for outcome transformation, the base of the logarithm used does make some difference. Although the fitted model will be identical regardless of what base is used, the scaling of the estimated regression coefficients is altered by the base. If log base 2 is used, then estimated effects are per 2-fold increase in the predictor, and if log base 10 is used, then estimated effects are per 10-fold increase in the predictor. The base should therefore be chosen so that estimates will correspond to understandable amounts and be easy to interpret. In general, natural logarithms are a poor choice, because the estimated effect per 2.718-fold increase is an odd amount to try to interpret.
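The rescaling effect of the base can be seen by computing how far a doubling of the predictor moves the transformed value; a rough illustration (the CD4 values are arbitrary):

```python
import math

def log_base(x, base):
    # Logarithm of x in an arbitrary base
    return math.log(x) / math.log(base)

# With log base 2, a doubling of the predictor moves the transformed
# value by exactly 1, so the fitted coefficient is the effect per
# 2-fold increase
print(round(log_base(400, 2) - log_base(200, 2), 6))    # 1.0

# With log base 10, the same doubling moves it by only ~0.301, so the
# coefficient is rescaled to the (larger) effect per 10-fold increase
print(round(log_base(400, 10) - log_base(200, 10), 3))  # 0.301
```

The fitted model and its predictions are identical either way; only the units in which the coefficient is reported change.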

Alternatives for modeling non-linear relationships include linear splines and adding quadratic or higher order polynomial terms. These require estimating more parameters and result in more complicated models, but they are also more flexible and can sometimes provide enough improvement in the fit to the data to justify the extra complexity.
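As a minimal sketch of the linear-spline alternative: entering both the raw predictor and a "hinge" term as regression predictors lets the fitted slope change at a chosen knot. The knot at 200 cells/mm3 below is an arbitrary illustrative choice:

```python
def linear_spline_terms(x, knot=200):
    """Return the two predictor terms for a one-knot linear spline.

    Entering both terms in a regression allows the slope to differ
    below and above `knot` (the knot value here is illustrative).
    """
    return x, max(0.0, x - knot)

print(linear_spline_terms(150))  # (150, 0.0) -- below the knot
print(linear_spline_terms(350))  # (350, 150) -- slope can change above the knot
```

This uses one extra parameter per knot, which is the added complexity the text refers to; quadratic terms work analogously, with x and x**2 entered together.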

|                          | Outcome | Predictor |
| Purpose                  | Reflect what is most meaningful about the outcome measure | Accurately model the association with the outcome |
| How to decide?           | A priori | Empirically |
| What base for logarithm? | Irrelevant, because results will be back-transformed | Choose so estimates are interpretable, e.g., log base 2 -- effect per 2-fold increase in predictor; log base 10 -- effect per 10-fold increase |
| Back-transform results?  | Yes, to obtain multiplicative effects (percent or fold) | Not because of predictor transformation; back-transformation is determined by the outcome transformation, regardless of predictor transformation |


Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding CTSPedia? Send feedback
