After a couple of requests, I am posting replies from Scott Foster to my questions from last week.
Thanks also to Simon, Thierry & David for their responses.
> I am using glms. Could someone please explain the difference
> between (a) using a Gaussian family distribution with a log link
> function and (b) log-transforming the response variable and using a
> normal distribution (Gaussian family distribution with identity link
> function)? The outputs differ, and clearly one option or the other
> will result in better fits depending on the dataset (everything else
> equal), but I want to understand why this is so.
> Thanks in advance,
> Tomas Easdale
This is easiest to understand with the aid of expectations.
Under a log link we have
log( E( y)) = a + bx
=> E( y) = exp( a + bx)
=> E( y) = exp( a) exp( bx)
introducing the normal errors gives
y = exp( a) exp( bx) + e.
Under a log transformation we have
log( y) = c + dx + e
=> y = exp( c) exp( dx) exp( e).
Note what has happened to the errors in the two models. In the
log-linked model the errors are additive on the original scale. In the
log-transformed model the errors are additive on the log scale, which
makes them multiplicative on the original scale.
The choice of model (for normal data) is not immediately clear and
depends on your specific situation and its resulting data. Loosely
speaking, choose the log link if you have homoscedastic residuals on
the original scale, and choose the log transform if you have
homoscedastic residuals on the log( y) scale.
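To make this concrete, here is a small simulation sketch (my own
illustration, not from Scott's reply; the parameters a, b, sd and the
covariate x are made up). It generates data under each model and shows
where the residual spread is constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0.0, 2.0, n)
a, b, sd = 0.5, 1.0, 0.3

# Log-link model: additive normal errors on the original scale.
y_link = np.exp(a + b * x) + rng.normal(0.0, sd, n)

# Log-transform model: additive normal errors on the log scale,
# i.e. multiplicative errors on the original scale.
y_trans = np.exp(a + b * x + rng.normal(0.0, sd, n))

lo, hi = x < 0.5, x > 1.5  # low-mean vs high-mean regions

# Log-link data: residual spread is constant on the original scale.
r_link = y_link - np.exp(a + b * x)
print(r_link[lo].std(), r_link[hi].std())  # similar

# Log-transform data: residual spread is constant on the log scale...
r_trans = np.log(y_trans) - (a + b * x)
print(r_trans[lo].std(), r_trans[hi].std())  # similar

# ...but grows with the mean on the original scale.
print((y_trans - np.exp(a + b * x))[lo].std(),
      (y_trans - np.exp(a + b * x))[hi].std())  # second much larger
```

So a residual-versus-fitted plot on each scale is the practical way to
decide between the two.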
Be careful with interpretation if you do decide that the log transformed
model is the bee's knees for your data. It is not sufficient to simply
take the exponential of the predictions on the log scale as predictions
on the original scale.
Hope this helps
> That response is very useful and it all makes sense now, as
> log-transforming the response works better for heteroscedastic data.
> But your response also brings me to a second question that has been
> circulating around. What can/should I do if I want to use my
> regression to predict responses on the original scale? Exponentiate
> the errors? That doesn't seem very feasible, does it?
You are right, exponentiating the errors is not a good option.
There is really only one option (I think) for log-normal data (data
that is normal after a log transformation): the back-transform
E( y) = exp( mu + sigma^2 / 2)
where mu is the mean of the log-transformed data and sigma^2 is its
variance. See the Wikipedia page on the log-normal distribution for a
description of why this works. There is a similar back-transformation
for the variance.
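A quick numerical check of the back-transform (my own sketch, with
made-up values for mu and sigma): naively exponentiating the log-scale
mean undershoots, while the log-normal correction recovers the mean on
the original scale.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.8                  # mean and sd of log(y)
y = np.exp(rng.normal(mu, sigma, 1_000_000))

naive = np.exp(mu)                    # just exponentiating the log-scale mean
corrected = np.exp(mu + sigma**2 / 2) # log-normal back-transform

print(y.mean())                       # close to corrected, well above naive
print(naive, corrected)
```

The gap between the two grows with sigma, so the correction matters
most for noisy data.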
Note that log-transforming is not your only option. You could try
another error distribution (e.g. gamma), where the variance increases
with the mean (it increases with the square of the mean). If this is
not the desired relationship then you could even try quasi-likelihood
methods, which allow you to specify an arbitrary variance-mean
relationship.
If you ever get the time, try reading McCullagh and Nelder
(Generalized Linear Models). It really is a very good book.