5 Introduction to Statistics

In this lesson you will be introduced to the process of conducting statistical tests in R, specifically chi-square, t-tests, and correlation tests.

First, to access the dataset(s) you will be using today install the remotes package, and then install the lterdatasampler package (remotes is needed because lterdatasampler has to be installed from GitHub as opposed to CRAN).

install.packages("remotes")
remotes::install_github("lter/lterdatasampler")

Now load in all libraries needed for this lesson:

library(tidyverse)
library(lterdatasampler)

Then run the following line of code to retrieve the and_vertebrates data set and bring it into your R session:

data(and_vertebrates)

5.1 Explore the dataset

Do a little exploration of this data first to understand its structure, variables and data types:

# View the data structure
glimpse(and_vertebrates)

# Explore the metadata in the Help pane
?and_vertebrates

This data set contains length and weight observations for three aquatic species in clear cut and old growth coniferous forest sections of Mack Creek in HJ Andrews Experimental Forest in Oregon. The three species are Cutthroat trout, Coastal giant salamander and Cascade torrent salamander.

5.2 Chi-square - Categorical Analysis

When you are working with two categorical variables, the statistical test you would use is a Chi-square test. This test can tell you if there is a relationship between your two categorical variables.

For example, we have two categorical variables in the and_vertebrates data set:

section = two forest sections, clear cut (CC) and old growth (OG)
unittype = channel unit classification type (C = cascade, I = riffle, IP = isolated pool (not connected to channel), P = pool, R = rapid, S = step (small falls), SC = side channel, NA = not sampled by unit)

Lets focus this question on Cutthroat trout. First explore the abundance of cutthroat trout in different channel types, using the n() function to return the total count/number of observations in each group.

and_vertebrates %>% 
  filter(species == "Cutthroat trout") %>% 
  group_by(unittype) %>% 
  summarise(abundance = n())

This output tells us that there are quite a few observations with the NA category, meaning channel type was unknown or not recroded. Let’s edit the workflow above slightly, using two new functions: drop_na() and count(). drop_na() will remove any rows within a specified column (or columns) that have NA values, and we can use count() as an alternative to group_by() and summarise() when we just want number of observations for a single variable (in this case unittype).

and_vertebrates %>% 
  filter(species == "Cutthroat trout") %>% 
  drop_na(unittype) %>% 
  count(unittype)

This returns just about the same data frame as the first method, but now with the NA category removed because it dropped any observations that were NA for unittype.

From this we also observe that the highest Cutthroat trout abundances are found in cascade (C), pool (P), and side channel (SC) habitats.

Now, our question expands beyond this one categorical variable (channel type) and we want to know if abundance is affected by both channel and and forest type (section). Here, our null hypothesis is that forest and channel type are independent. To test this, we use the chisq.test() function to carry out a chi-square test, but first we have to reformat our data into a contingency table.

A contingency table is in matrix format, where each cell is the frequency (in this case seen as abundance) of Cutthroat trout in each combination of categorical variables (forest type and channel unit). We can create a contingency table with the table() function. For this analysis, lets also filter out just the 3 most abundant unit types for Cutthroat trout (C, P and SC).

# First clean the dataset to create the contingency table from
trout_clean <- and_vertebrates %>% 
  #filter Cutthroat trout
  filter(species == "Cutthroat trout") %>% 
  # lets test using just the 3 most abundant unittypes
  filter(unittype %in% c("C", "P", "SC")) %>% 
  # drop NAs for both unittype and section
  drop_na(unittype, section)


cont_table <- table(trout_clean$section, trout_clean$unittype)

To execute the Chi-square test does not take that much code, but it is important to note that by default, chisq.test() assumes the null hypothesis is that all frequencies have equal probability. If you have different pre-conceived frequency probabilities for your data you have to define those within the chisq.test() function.

chisq.test(cont_table)

Looking at these results, we have an extremely small p-value. This tells us that there is a significant relationship between forest type and channel unit (i.e., we rejected our null hypothesis).

Lets look at the abundance distribution visually:

trout_clean %>% 
  count(unittype, section) %>% 
  ggplot(aes(x = unittype, y = n))+
  geom_col(aes(fill = section))+
  scale_fill_manual(values = c("orange", "darkgreen"))+
  theme_minimal()

5.3 t-test - Compare two means

Previous work has shown that forest harvesting can impact aquatic vertebrate biomass (Kaylor & Warren 2017). With this and_vertebrates data set we can investigate this hypothesis, by comparing weight to forest type (clear cut or old growth). This therefore involves a test comparing the means (average weight) among two groups (clear cut and old growth forests), which then requires a t-test.

Lets focus on conducting this test for just Cutthroat trout (to reduce species-level variances in weight), so we can use the same trout_clean data set we made earlier, but let’s also drop all NAs in weight_g. Then, lets first visualize the differences in weight among forest type with a boxplot:

trout_clean %>% 
  drop_na(weight_g) %>% 
  ggplot(aes(x = section, y = weight_g))+
  geom_boxplot()

We don’t see too much of a difference based on this visual, but lets conduct the statistical test to really verify if our hypothesis is supported.

First however we need to check our test assumptions, which for t-tests assumes the variance of the groups is equal. We can test for equal variances with the function var.test(), where the null hypothesis is that the variances is equal. In this step we need two vectors of the weights in each separate forest section. You can use pull() to convert a single column of a data frame/tibble to a vector, and we want to do this for clear cut and old growth forests separately. We then put both of those vectors in the var.test() function to assess their equal variances.

cc_weight <- trout_clean %>% 
  filter(section == "CC") %>% 
  pull(weight_g)

og_weight <- trout_clean %>% 
  filter(section == "OG") %>% 
  pull(weight_g)

var.test(cc_weight, og_weight)

Looks like our variances are not equal. We have two options now, we can either transform our weight variable or use the Welch t-test which does not assume equal variances.

Variable transformation

If we look at the distribution of weight (our continuous variable), it is pretty right skewed. Therefore, we’d likely want to do a log transformation on the data, which works well the data is skewed like this:

hist(trout_clean$weight_g)

Lets perform the variances check like we did before, but on the log transformed values, which you can do with log

var.test(log(cc_weight), log(og_weight))

Now we have a high p-value, indicating support for the null that the variances are equal. So. we can use the default t.test() test which assumes equal variances, but on a log transformed weight variable.

The t.test() function in R takes in your dependent (in our case trout weight) and independent (forest type) variables as vectors (instead of just column names like you can do in the Tidyverse). Remember how we can index single columns of data frames with the $ operator. The order of the variables in the t.test() function is {dependent variable} ~ {independent variable}. We use the ~ to specify a model, telling the test we want to know if weight varies by forest section.

Remember we also want to log transform the weight values and then specify that our variances are equal since we confirmed that with var.test() above, so the final t.test() call would be this:

t.test(log(trout_clean$weight_g) ~ trout_clean$section, var.equal = TRUE)

The output of this test gives us the test statistics, p-value, and the means for each of our forest groups. Given the p-value of 0.0043, we can conclude that we reject the null hypothesis that mean Cutthroat weight is the same in clear cut and old growth forest sections, and looking at our results (specifically the means) we can conclude that Cutthroat trout weight was observed to be significantly higher in clear cut forests compared to old growth forests. Remember though that now these mean weight values are log transformed, and not the raw weight in grams. The relationship can still be interpreted the same.

How does this relate to your original hypothesis?

Welch Two Sample t-test

Alternatively, instead of transforming our variable we can actually change the default t.test() argument by specifying var.equal = FALSE, which will then conduct a Welch t-test, which does not assume equal variances among groups.

t.test(trout_clean$weight_g ~ trout_clean$section, var.equal = FALSE)

While we used a slightly different method, our conclusions are still the same, finding that Cutthroat trout had significantly higher weights in clear cut forests than old growth.

Note: In the t.test() function you can add paired = TRUE to conduct a paired t-test. These are for cases when the groups are ‘paired’ for each observation, meaning each group/treatment was applied to the same individual, such as before and after experiments.

5.4 Correlation - Assess relationships

When you want to assess the relationship between two continuous variables, the test you would use is a correlation test. Correlation tests asses both the presence of a significant relationship along with the strength of that relationship (i.e., the correlation coefficient).

For our and_vertebrates data set, we can test length-mass relationships for our species with our length and weight continuous variables. Lets test the hypothesis that body length is positively correlated with weight, such that longer individuals will also weigh more, specifically looking at the Coastal Giant salamander.

First let’s clean our data set to just include the Coastal giant salamander and remove missing values for length and weight. Let’s focus on the variable ‘length_2_mm’ for snout to tail length.

sally_clean <- and_vertebrates %>% 
  filter(species == "Coastal giant salamander") %>% 
  drop_na(length_2_mm, weight_g)

Now we can perform the correlation test with the cor.test() function. There are multiple correlation methods you can use with this function, by default it uses the Pearson correlation method. However this test assumes that your data is normally distributed and there is a linear relationship, so if that is not the case you can specify spearman for method = to use a Spearman Rank correlation test, a non-parametric test that is not sensitive to the variable distribution.

Let’s look at the distribution of these variables first:

hist(sally_clean$length_2_mm)

hist(sally_clean$weight_g)

They both look pretty skewed, therefore likely not normally distributed. We can statistically test if a variable fits a normal distribution with the shapiro.test() function. However note that this function only runs for 5000 observations or less, so lets test for normally of the first 5000 obs of our sally_clean data set:

shapiro.test(sally_clean$length_2_mm[1:5000])

shapiro.test(sally_clean$weight_g[1:5000])

The null hypothesis of the Shapiro-Wilk normality test is that the variable is normally distributed, so a significant p-value less than 0.05 (as we see for both of our variables here) tells use that our data does not fit a normal distribution.

Therefore we have two options as we did with our t-test example: transform the variables or use the non-parametric test.

Variable transformation

Lets try the first option by log transforming our variables (since we saw they both had pretty skewed distributions), first viewing the new distribution and then performing the Pearson’s correlation test (the default for cor.test()).

hist(log(sally_clean$length_2_mm))

hist(log(sally_clean$weight_g))

All we need to add to the cor.test() argument is the two variables of our sally_clean data set we want to test a relationship for, and let’s keep them log-transformed since those distributions looked closer to a normal distribution (visually at least).

cor.test(log(sally_clean$length_2_mm), log(sally_clean$weight_g))

Okay, from these results we see a very small p-value, meaning there is a significant association between the two, and a correlation coefficient of 0.98, representing a very strong, positive correlation.

Let’s look at this correlation visually:

sally_clean %>% 
  ggplot(aes(x = log(length_2_mm), y = log(weight_g)))+
  geom_point()

We can use geom_smooth() to add a line of best fit using a linear model equation (which you will learn more about next week)

sally_clean %>% 
  ggplot(aes(x = log(length_2_mm), y = log(weight_g)))+
  geom_point()+
  geom_smooth(method = "lm")

Spearman Correlation Test

Let’s now perform the correlation test again but keeping our raw data and instead specifying method = 'spearman', as the Spearman test is better for non-parametric and non-linear data sets.

cor.test(sally_clean$length_2_mm, sally_clean$weight_g, method = "spearman")

These results also represent a significant, positive relationship between length and weight for the Coastal Giant salamander, with a very high correlation coefficient.

5.5 Exercises

Each question requires you to carry out a statistical analysis to test some hypothesis related to the and_vertebrates data set. To answer each question fully:

Include the code you used to clean the data and conduct the appropriate statistical test. (Including the steps to assess and address your statistical test assumptions).
Report the findings of your test in proper scientific format (with the p-value in parentheses).

1. Conduct a chi-square test similar to the one we carried out earlier in this lesson plan, but test for a relationship between forest type (section) and channel unit (unittype) for Coastal giant salamander abundance. Keep all unittypes instead of filtering any like we did for the Cutthroat trout (9 pts.)

2. Test the hypothesis that there is a significant difference in species biomass between clear cut and old growth forest types for the Coastal Giant salamander. (8 pts.)

3. Test the correlation between body length (snout to fork length) and body mass for Cutthroat trout. (Hint: run ?and_vertebrates to find which length variable represents snout to fork length) (8 pts.)

5.5.1 Acknowledgements

Thanks to the developers of lterdatasampler for providing the data set and vignettes that helped guide the creation of this lesson plan.

5.5.2 Citations

Data Source: Gregory, S.V. and I. Arismendi. 2020. Aquatic Vertebrate Population Study in Mack Creek, Andrews Experimental Forest, 1987 to present ver 14. Environmental Data Initiative. https://doi.org/10.6073/pasta/7c78d662e847cdbe33584add8f809165

Kaylor, M.J. and D.R. Warren. 2017. Linking riparian shade and the legacies of forest management to fish and vertebrate biomass in forested streams. Ecosphere 8(6). https://doi.org/10.1002/ecs2.1845