Hello all,
I'm trying to figure out the correct way (or at least *a* correct way) to (1) determine rates and the error in those rates and (2) statistically compare these rates in the following experimental design.

The experiment in brief:

One categorical variable: two different genotypes
One ordinal variable: a continuous variable ('distance') put into bins to allow calculation of rates per bin
One dependent variable: number of events observed during the observation period (i.e. a rate)

Or in paragraph form: I am interested in one main relationship and its interaction with another variable. First, the relationship between distance and the rate of an event. Second, how the rate, as a function of distance, differs between two genotypes.

To measure this, I have experimental arenas (== replicates) with my two different genotypes. I observe the arenas *as long as I can* (this is important because it results in different replicates having different observation times). During the observations, every second, I record whether the event has occurred, and I also record the value of the continuous variable (distance). Thus, I end up with a data.frame containing raw data that looks like this:

replicate  genotype  time  distance  event_occurred
1          A         1     10        0
1          A         2     13        0
1          A         3     17        1
1          A         4     17        0
1          B         1     8         0
1          B         2     10        0
2          A         1     10        1
2          A         2     7         0
2          A         3     4         0
2          B         1     10        1
2          B         2     11        0
2          B         3     19        1
.
.
.

Since I'm interested in the rate of the event occurring as a function of distance, what I am currently doing is (somewhat arbitrarily) binning the data into distance bins and summarizing the data accordingly, such that the time observed per bin per genotype per replicate is summed, as is the number of events.
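For concreteness, the binning-and-summing step described above could be sketched in base R roughly as follows. This is only a minimal sketch under stated assumptions: the object names `raw` and `binned`, the helper column `seconds`, and the bin edges (0-10 and 11-20) are chosen for illustration, and it assumes every row represents exactly one second of observation.

```r
# Mock per-second observations, copied from the raw-data table above
raw <- data.frame(
  replicate      = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  genotype       = c("A", "A", "A", "A", "B", "B", "A", "A", "A", "B", "B", "B"),
  time           = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 3),
  distance       = c(10, 13, 17, 17, 8, 10, 10, 7, 4, 10, 11, 19),
  event_occurred = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1)
)

# Assign each observation to a distance bin: [0, 10] and (10, 20]
# (the bin edges are an arbitrary, illustrative choice)
raw$distance_bin <- cut(raw$distance, breaks = c(0, 10, 20),
                        labels = c("0-10", "11-20"), include.lowest = TRUE)

# Assumption: each row is one second, so summing this column gives time observed
raw$seconds <- 1

# Sum observation time and events per replicate, genotype and distance bin
binned <- aggregate(cbind(seconds, event_occurred) ~ replicate + genotype + distance_bin,
                    data = raw, FUN = sum)
names(binned)[names(binned) == "seconds"] <- "total_time"
names(binned)[names(binned) == "event_occurred"] <- "events"

# Sort for easier reading
binned <- binned[order(binned$replicate, binned$genotype, binned$distance_bin), ]
print(binned)
```

Note that with these breaks a non-integer distance such as 10.5 would fall into the 11-20 bin, and that combinations never observed (e.g. replicate 1, genotype B, 11-20) simply do not appear in the result.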
Doing this on the above mock data.frame results in this, if I bin the distances into 0-10 and 11-20:

replicate  genotype  total_time  distance  events
1          A         1           0-10      0
1          A         3           11-20     1
1          B         2           0-10      0
2          A         3           0-10      1
2          B         1           0-10      1
2          B         2           11-20     1

At this point, I would like to be able to see whether the rate (events / total_time) varies with distance and with genotype. I've thought of a few ways to do it, none of which I'm sure make sense.

Method 1: Find a rate for each replicate-genotype-distance combination and do a two-way ANOVA where the factors are genotype and distance:

replicate  genotype  total_time  distance  events  rate
1          A         1           0-10      0       0/1
1          A         3           11-20     1       1/3
1          B         2           0-10      0       0/2
2          A         3           0-10      1       1/3
2          B         1           0-10      1       1/1
2          B         2           11-20     1       1/2

    aov(rate ~ genotype*distance)

There is one main problem with this method (and some smaller ones): the total observation time for each replicate-genotype-distance combination is different. Computing a rate as events/total_time ignores the fact that some combinations were observed for much longer than others, so data points with hardly any observation time are treated as just as reliable as data points with extremely long observation times. This seems wrong to me, because the precision of data points with long observation periods should be much higher than that of data points with short observation times. Of course, the big plus here is that this is easy.

Method 2: Do an ANCOVA by rounding down the distance bins (which were treated as a categorical variable in the ANOVA above) and using the result as a continuous covariate. In other words, make the data.frame like this:

replicate  genotype  total_time  distance  events
1          A         1           0         0
1          A         3           11        1
1          B         2           0         0
2          A         3           0         1
2          B         1           0         1
2          B         2           11        1

And then fit this model in R:

    lm(events ~ 0 + genotype*distance*total_time)

I feel like this *should* make sense, because the slope for a specific genotype:distance combination, as total_time increases, should reflect the rate of the event?
I made the intercept zero because there cannot be any events in zero observation time. I came to this method because it works for a single genotype and a single distance bin. In other words, if all I had was this:

replicate  genotype  time_s  distance  events
1          A         1       0         0
2          A         3       0         1
.
.
.

and did this:

    lm(events ~ 0 + time_s)

the slope gives me a reasonable estimate of the rate and, better yet, I can use the standard error of the coefficient when plotting rate + error. The truth is I don't know if this idea generalizes to the bigger dataset, with a categorical variable and multiple distance bins. Additionally, even if it does, I'm not sure how to pull out the relevant rates and errors from the different coefficients using summary().

Method 3: Do the exact same thing as Method 2, but use a Poisson regression, because there are going to be a lot of zeros when the observation time is small:

    glm(events ~ 0 + genotype*distance*time_s, family = 'poisson')

I'm not sure if this makes sense, though, and I also don't know how I'd get rates from the model.

Method 4: Ignore the fact that this was a replicated experiment, calculate rates for each genotype-distance combination across all replicates, and...

One way I've seen rates calculated by others is to ignore the fact that the data came from multiple replicates, and just count the total events in a genotype-distance combination and divide that by the total observation time. Then, since counts are Poisson-distributed, obtain a standard deviation as sqrt(total events) / total observation time. This method is along the lines of what I find here: http://cran.r-project.org/web/packages/rateratio.test/ which treats each second of observation as a data point (which seems strange to me due to the clear non-independence of serial observations) and does allow statistical comparison of two rates. But as far as I can tell it cannot be used for more than a two-group comparison.
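As a small illustration of the pooled calculation described in Method 4, here is a minimal base R sketch applied to the binned mock data from earlier in this post. The object names `binned` and `pooled` and the column names `rate` and `se` are assumptions made for illustration; the error term is simply the sqrt(total events) / total observation time formula quoted above, and replication is deliberately ignored.

```r
# Binned mock data (one row per replicate-genotype-distance combination)
binned <- data.frame(
  replicate  = c(1, 1, 1, 2, 2, 2),
  genotype   = c("A", "A", "B", "A", "B", "B"),
  total_time = c(1, 3, 2, 3, 1, 2),
  distance   = c("0-10", "11-20", "0-10", "0-10", "0-10", "11-20"),
  events     = c(0, 1, 0, 1, 1, 1)
)

# Pool over replicates: total events and total time per genotype-distance combination
pooled <- aggregate(cbind(events, total_time) ~ genotype + distance,
                    data = binned, FUN = sum)

# Pooled rate (events per second) and the Poisson-based error described above
pooled$rate <- pooled$events / pooled$total_time
pooled$se   <- sqrt(pooled$events) / pooled$total_time

print(pooled)
```

With only one event in each pooled cell of the mock data, the error equals the rate in every row, which shows how little information such sparse cells carry.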
Calculating rates this way is nice and easy and gives me plottable data, but it doesn't leave me with an obvious way to test for significant differences, and it also seems odd because it ignores the replication.

So what to do? I apologize for this wall of text. I've been banging my head on this dataset for a while now and figured it would be easier to share everything I've tried when asking for help.

I should note that it would be even more wonderful if any solution could be generalized to add a second binned variable, because in truth I'm interested in how distance, variable 2, and genotype all affect the rate of this event. Complicated!

I greatly appreciate any help or insight you may have.

Thanks much!
Jeremy

--
Jeremy M. Chacon, Ph.D.
Post-Doctoral Associate, Gardner Lab
University of Minnesota
Genetics, Cell Biology, and Development
6-160 Jackson Hall
321 Church St. SE
Minneapolis, MN 55455
USA
Lab telephone: 612-626-6906

_______________________________________________
R-sig-ecology mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology