Hi Listers,
Further to my last post, I am seeking more insight into how to interpret the effects of Factors in an RDA analysis.

I think of a factor as a single variable whose influence on observed fish distribution patterns I would like to quantify, along with a bunch of other numerical variables.

To do the analyses this Factor (called 'geom') is turned into a number of dummy variables (seven actually).

The conceptual problem I am having is that when I do a call to vif.cca to check on collinearity, for example, the output suggests I should remove some of the dummy variables making up the levels of my Factor. Doing this would leave me with only 3 of the dummy variables out of the original 8 to put into the RDA. I then don't see that I am actually testing the Factor 'geom' anymore, but rather just individual variables representing a couple of the different levels of the original Factor. How do I proceed with this?

The same conundrum arises when I run the forward.sel command to look at the most efficient number of explanatory variables to have in the final model. The process selects only some of the dummy variables to include in the model. Again I struggle to see how I am testing or including the full effects of the Factor 'geom' if only a few of the dummy variables are actually included in the model.

# here is the model run with all the potential explanatory variables ('geom' is the FACTOR with 7 levels)

> fish.env <- rda(fish.h ~ coral_cover + macroalgae + turf_algae_sqrt + ccc_4thrt +
      rubble_sqrt + reef_slope_sqrt + rugosity + exposure + min_d_sqrt + d_range +
      chl_a_log + popn_density_4throot + fp + protection + geom, data = env.factor)

# collinearity assessment - the results favour dropping 3 of the 'geo' dummy variables leaving only 2 for the model.

> vif.cca(fish.env)
         coral_cover           macroalgae      turf_algae_sqrt            ccc_4thrt          rubble_sqrt      reef_slope_sqrt
            2.688656             3.099972             2.849219             1.771637             2.411291             2.418953
            rugosity             exposure           min_d_sqrt              d_range            chl_a_log popn_density_4throot
            2.967752             3.433961             2.587696             2.643991             3.626107             4.571781
                  fp           protection           geomgeo_bl         geomgeo_cbrc        geomgeo_isefr        geomgeo_isprc
            4.059624             3.195210             4.329852            12.657270             9.570015            12.052385
        geomgeo_lefr         geomgeo_oefr
            7.347812            17.090731

# here I have kept all the 'geom' dummy variables to submit to forward selection and it only selects 3 of them, hence it doesn't feel that I am actually including a Factor 'geom' in the model but rather just a few individual dummy variables?

> forward.sel(fish.h, env.dummy3, adjR2thresh = R2a.all_fish_env)
Testing variable 1
Testing variable 2
Testing variable 3
Testing variable 4
Testing variable 5
Testing variable 6
Testing variable 7
Testing variable 8
Procedure stopped (alpha criteria): pvalue for variable 8 is 0.092000 (superior to 0.050000)
        variables order         R2     R2Cum   AdjR2Cum        F  pval
1        exposure     8 0.07086240 0.0708624 0.04925455 3.279475 0.001
2              fp    13 0.04756799 0.1184304 0.07645089 2.266248 0.002
3       geo_isefr    16 0.04571706 0.1641474 0.10298750 2.242500 0.003
4       chl_a_log    11 0.03812686 0.2022743 0.12250174 1.911778 0.009
5          geo_bl    17 0.03423972 0.2365140 0.13863121 1.749016 0.008
6       geo_isprc    21 0.03384005 0.2703541 0.15514682 1.762391 0.012
7 reef_slope_sqrt     6 0.03466752 0.3050216 0.17353919 1.845666 0.007

Any advice appreciated

cheers

Andy

--
Andrew Halford Ph.D
Research Scientist (Kimberley Marine Parks)
Dept. Parks and Wildlife
Western Australia

Ph: +61 8 9219 9795
Mobile: +61 (0) 468 419 473

[[alternative HTML version deleted]]

_______________________________________________
R-sig-ecology mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
I said earlier that you can use either dummy variables or factors with little difference. However, there are some cases where the difference becomes important: that is the case when you start playing with individual dummy variables as if they were real variables. Don't do that! Do not use forward.sel with dummy variables: it makes no sense. Do not look at the VIF values of single factor levels: that rarely makes sense, and they can never suggest removing single levels of a factor. It is the whole factor, in or out. Never partly in, partly out. If you want to work like that, you really should switch to the standard R way of defining your factor as a factor. The highish VIF values for some levels are probably triggered by correlations with some continuous variables in your model.
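A minimal sketch of the "whole factor, in or out" advice above, using vegan's built-in dune data as a stand-in for the poster's fish data (the dataset and the model terms are assumptions for illustration only). When the factor is given to rda() as a factor in a formula, vegan's permutation tests and its term-wise selection function ordiR2step treat it as a single term:

```r
library(vegan)

# Example data shipped with vegan; 'Management' is a factor with 4 levels
data(dune)
data(dune.env)

# Keep the factor as ONE term in the formula instead of hand-made dummy columns
mod <- rda(dune ~ A1 + Management, data = dune.env)

# Marginal permutation test: one row per term, so 'Management' is tested
# as a whole (3 degrees of freedom), never level by level
margin_test <- anova(mod, by = "margin", permutations = 999)
print(margin_test)

# Term-wise selection: a factor enters the model completely or not at all
mod0 <- rda(dune ~ 1, data = dune.env)
mod1 <- rda(dune ~ A1 + Management + Moisture, data = dune.env)
sel <- ordiR2step(mod0, scope = formula(mod1), permutations = 199, trace = FALSE)
print(formula(sel))
```

The selected formula can only contain whole terms such as Management, so the situation in the question, where three of the 'geom' dummy columns are picked and the rest dropped, cannot arise.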
There should be a generalized VIF in vegan that gives one statistic for the whole factor. Contributions are welcome at http://github.com/vegandevs/vegan.

cheers, Jari Oksanen
________________________________________
From: R-sig-ecology <[hidden email]> on behalf of Andrew Halford <[hidden email]>
Sent: 08 February 2017 10:14
To: [hidden email]
Subject: [R-sig-eco] Factors in partial RDA part 2
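For readers wondering what a single VIF-type statistic for a whole factor could look like, here is a rough base-R sketch of the generalized VIF in the sense of Fox and Monette (the definition that car::vif reports for multi-column terms). This is not a vegan function: the name gvif_terms is hypothetical, and the iris data only stand in for a real environmental table.

```r
# Hypothetical helper (not part of vegan): one generalized VIF per model TERM.
# GVIF = det(R11) * det(R22) / det(R), where R is the correlation matrix of all
# model-matrix columns, R11 the block for the term, and R22 the block for the rest.
gvif_terms <- function(formula, data) {
  X <- model.matrix(formula, data)
  assign <- attr(X, "assign")            # maps each column to its term
  keep <- assign > 0                     # drop the intercept column
  X <- X[, keep, drop = FALSE]
  assign <- assign[keep]
  labels <- attr(terms(formula, data = data), "term.labels")
  R <- cor(X)
  det_R <- det(R)
  out <- t(sapply(seq_along(labels), function(j) {
    idx <- which(assign == j)            # all dummy columns of term j
    gvif <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det_R
    df <- length(idx)
    # GVIF^(1/(2*Df)) makes terms with different Df roughly comparable
    c(GVIF = gvif, Df = df, "GVIF^(1/(2*Df))" = gvif^(1 / (2 * df)))
  }))
  rownames(out) <- labels
  out
}

# One row per term: 'Species' (a 3-level factor) gets a single statistic
res <- gvif_terms(~ Sepal.Length + Sepal.Width + Species, data = iris)
print(res)
```

For a term with one column the statistic reduces to the ordinary VIF, so continuous variables are unaffected, while a factor is summarised by one value instead of one value per dummy variable.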
Thanks Jari,
My education continues!

cheers

Andy

On 8 February 2017 at 17:32, Jari Oksanen <[hidden email]> wrote: