comparison of binned rates across treatment


Jeremy Chacon
Hello all,

I'm trying to figure out the correct way (or at least *a* correct way) to
(1) determine rates and error in rates and (2) statistically compare these
rates in the following experimental design.

The experiment in brief:

One Categorical variable: two different genotypes

One Ordinal variable: continuous variable ('distance') put into bins to
allow calculation of rates per bin

One Dependent Variable: number of events observed during the observation
period (i.e. a rate)

Or in paragraph form:

I am interested in one main relationship and its interaction with another
variable: First, the relationship between distance and the rate of an
event. Second, how the rate, as a function of distance, is different in two
different genotypes.

To measure this, I have experimental arenas (==replicates) with my two
different genotypes. I observe the arenas *as long as I can* (this is
important because it results in different replicates having different
observation times). During the observations, every second, I count whether
the event has occurred, and I also record the value of the continuous
variable (distance).

Thus, I end up with a data.frame containing raw data that looks like this:

replicate  genotype  time  distance  event_occurred
1          A         1     10        0
1          A         2     13        0
1          A         3     17        1
1          A         4     17        0
1          B         1     8         0
1          B         2     10        0
2          A         1     10        1
2          A         2     7         0
2          A         3     4         0
2          B         1     10        1
2          B         2     11        0
2          B         3     19        1
. . .

Since I'm interested in the rate of the event occurring as a function of
distance, what I am currently doing is (somewhat arbitrarily) binning the
data into distance bins and summarizing the data accordingly, such that the
time observed per bin per genotype per rep is summed, as are the number of
events. Doing this on the above mock data.frame results in this, if I bin
the distances into 0-10 and 11-20:

replicate  genotype  total_time  distance  events
1          A         1           0-10      0
1          A         3           11-20     1
1          B         2           0-10      0
2          A         3           0-10      1
2          B         1           0-10      1
2          B         2           11-20     1
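
The binning and summing step described above can be sketched in base R as follows. This is a minimal sketch, not necessarily how the original summary was produced: the data frame name `raw`, the helper columns `distance_bin`, `total_time` and `events`, and the choice of `cut()` plus `aggregate()` are all assumptions made for illustration.

```r
# Mock raw data copied from the example table (one row per second observed).
raw <- data.frame(
  replicate      = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  genotype       = c("A", "A", "A", "A", "B", "B", "A", "A", "A", "B", "B", "B"),
  time           = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 3),
  distance       = c(10, 13, 17, 17, 8, 10, 10, 7, 4, 10, 11, 19),
  event_occurred = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1)
)

# Bin distance into 0-10 and 11-20. Intervals are closed on the right,
# so a distance of exactly 10 falls in the first bin.
raw$distance_bin <- cut(raw$distance, breaks = c(0, 10, 20),
                        labels = c("0-10", "11-20"), include.lowest = TRUE)

# Each row is one second of observation, so summing a column of ones
# gives the seconds observed in each cell.
raw$total_time <- 1
raw$events <- raw$event_occurred

# Sum observation time and events per replicate, genotype and distance bin.
binned <- aggregate(cbind(total_time, events) ~ replicate + genotype + distance_bin,
                    data = raw, FUN = sum)
binned <- binned[order(binned$replicate, binned$genotype, binned$distance_bin), ]
print(binned)
```

Cells that were never observed (for example replicate 1, genotype B, bin 11-20) simply do not appear in the result, which matches the six-row summary table.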

At this point, I would like to be able to see if the rate of events
/ total_time varies with distance and with genotype.

I've thought of a few ways to do it, none of which I'm sure make sense.

Method 1: Find a rate for each replicate-genotype-distance combination and
do a two-way ANOVA where the factors are genotype and distance:

replicate  genotype  total_time  distance  events  rate
1          A         1           0-10      0       0/1
1          A         3           11-20     1       1/3
1          B         2           0-10      0       0/2
2          A         3           0-10      1       1/3
2          B         1           0-10      1       1/1
2          B         2           11-20     1       1/2

aov(rate ~ genotype*distance)
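
Spelled out on the six mock rows, Method 1 might look like the sketch below. The data frame name `binned` and the object name `fit` are assumptions for illustration. With so few rows the ANOVA table is not meaningful; the point is only to show the mechanics.

```r
# Binned mock data copied from the summary table (one row per
# replicate-genotype-distance combination).
binned <- data.frame(
  replicate  = c(1, 1, 1, 2, 2, 2),
  genotype   = c("A", "A", "B", "A", "B", "B"),
  total_time = c(1, 3, 2, 3, 1, 2),
  distance   = c("0-10", "11-20", "0-10", "0-10", "0-10", "11-20"),
  events     = c(0, 1, 0, 1, 1, 1)
)

# One rate per row; every row then counts equally in the ANOVA,
# whether it rests on 1 second or 3 seconds of observation.
binned$rate <- binned$events / binned$total_time

# Two-way ANOVA with genotype, distance bin and their interaction.
fit <- aov(rate ~ genotype * distance, data = binned)
print(summary(fit))
```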

There is one main problem with this method (and some other smaller ones):
the total observation time for each replicate-genotype-distance combination
is different. Computing a rate (events/total_time) and ignoring the fact
that some combinations were observed for longer than others means that data
points with hardly any observation time are treated as just as reliable as
data points with extremely long observation times. This seems wrong to me
because the precision of data points that had long observation periods
should be much higher than that of data points with short observation
times.

Of course, the big plus here is that this is easy.

Method 2: Do an ANCOVA by rounding down the distance bins (which were
treated as a categorical variable in the ANOVA above) and using this as a
continuous covariate. In other words, make the data.frame like this:

replicate  genotype  total_time  distance  events
1          A         1           0         0
1          A         3           11        1
1          B         2           0         0
2          A         3           0         1
2          B         1           0         1
2          B         2           11        1

And then do this test in R:

lm(events ~ 0 + genotype*distance*total_time )

I feel like this *should* make sense, because the slope for a specific
genotype:distance combination, as total_time increases, should reflect the
rate of the event. I made the intercept zero because no events can occur in
zero time. I came to this method because it works for a single genotype and
a single distance bin. In other words, if all I had was this:

replicate  genotype  time_s  distance  events
1          A         1       0         0
2          A         3       0         1
. . .


and did this:    lm(events ~ 0 + time_s)

the slope gives me a reasonable estimate of the rate and better yet I can
use the standard error of the coefficient when plotting rate + error.
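
As a minimal sketch of the single-genotype, single-bin case, using only the two rows shown (the real data would have more): the object names `one_cell`, `fit`, `rate_hat`, `rate_se`, `rate_by_hand` and `rate_pooled` are assumptions for illustration. For a regression through the origin the least-squares slope is sum(time_s * events) / sum(time_s^2), so it generally differs from the pooled rate sum(events) / sum(time_s).

```r
# The two example rows for genotype A, distance bin 0.
one_cell <- data.frame(time_s = c(1, 3), events = c(0, 1))

# Regression through the origin: events = rate * time_s + error.
fit <- lm(events ~ 0 + time_s, data = one_cell)

# Slope (rate estimate) and its standard error from the coefficient table.
rate_hat <- coef(fit)[["time_s"]]
rate_se  <- coef(summary(fit))["time_s", "Std. Error"]

# Closed form of the through-origin least-squares slope, for comparison.
rate_by_hand <- sum(one_cell$time_s * one_cell$events) / sum(one_cell$time_s^2)

# Pooled rate: total events divided by total observation time.
rate_pooled <- sum(one_cell$events) / sum(one_cell$time_s)

print(c(slope = rate_hat, se = rate_se, by_hand = rate_by_hand, pooled = rate_pooled))
# slope = 0.3, se = 0.1, by_hand = 0.3, pooled = 0.25
```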

The truth is I don't know if this idea generalizes to the bigger dataset,
with a categorical variable and the multiple distance bins.  Additionally,
even if it does, I'm not positive how to pull out the relevant rates and
errors from the different coefficients using summary().

Method 3: Do the exact same thing as Method 2, but use a Poisson regression
because there are going to be a lot of zeros when the observation time is
small:

glm(events ~ 0 + genotype*distance*time_s, family = 'poisson')

I'm not sure if this makes sense though and I also don't know how I'd get
rates from the model.

Method 4: Ignore the fact that this was a replicated experiment, calculate
rates for each genotype-distance combination across all replicates,
and. . .

One way I've seen rates calculated by others is to ignore the fact that the
data came from multiple replicates, and just count the total events in a
genotype-distance combination and divide that by the total observation
time. Then, since counts are Poisson-distributed, obtain a standard
deviation by sqrt(total events) / total observation time.
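
On the binned mock data, this pooled calculation can be sketched as below. The names `binned`, `pooled`, `rate` and `rate_sd` are assumptions for illustration; the formulas are the ones just described.

```r
# Binned mock data copied from the summary table.
binned <- data.frame(
  replicate  = c(1, 1, 1, 2, 2, 2),
  genotype   = c("A", "A", "B", "A", "B", "B"),
  total_time = c(1, 3, 2, 3, 1, 2),
  distance   = c("0-10", "11-20", "0-10", "0-10", "0-10", "11-20"),
  events     = c(0, 1, 0, 1, 1, 1)
)

# Pool over replicates: total time and total events per genotype-distance cell.
pooled <- aggregate(cbind(total_time, events) ~ genotype + distance,
                    data = binned, FUN = sum)

# Rate = total events / total time.
pooled$rate <- pooled$events / pooled$total_time

# Poisson-based standard deviation of the rate = sqrt(total events) / total time.
pooled$rate_sd <- sqrt(pooled$events) / pooled$total_time

print(pooled)
```

For genotype A in the 0-10 bin this gives 1 event in 4 seconds, so a rate of 0.25 with a standard deviation of 0.25.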

This method is along the lines of what I find here:

http://cran.r-project.org/web/packages/rateratio.test/

which treats each second of observation as a data point (which seems
strange to me due to the clear non-independence of serial observations) and
does allow statistical comparison of two rates. But it does not appear to
be able to be used for more than a two-group comparison as far as I can
tell.

Calculating rates this way is nice and easy and gives me plottable data,
but it doesn't leave me with an obvious way to test for significant
differences and also seems odd because it ignores the replication.

So what to do?

I apologize for this wall of text. I've been banging my head on this
dataset for a while now and figured it might be easier to share everything
I've tried when asking for help. I should note that it would be even
more wonderful if any solutions could be generalized to add a second binned
variable, because in truth I'm interested in how distance, variable 2, and
genotype all affect the rate of this event. Complicated!

I greatly appreciate any help or insight you may have.

Thanks much!

Jeremy
--

*Jeremy M. Chacon, Ph.D.*

*Post-Doctoral Associate, Gardner Lab*
*University of Minnesota*
*Genetics, Cell Biology, and Development*
*6-160 Jackson Hall*
*321 Church St. SE*
*Minneapolis, MN 55455*
*USA*

*[hidden email]*
*Lab telephone: 612-626-6906*


_______________________________________________
R-sig-ecology mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology