7 PCA and R Markdown Intro
Before we get started with conducting a Principal Component Analysis (PCA), we are going to carry out this lesson in an R Markdown document and see how it can be used to 1) organize workflows with text, code, and figures/results and 2) render a nicely formatted report, which is what you will submit for this week's assignment.
7.1 Intro to R Markdown
R Markdown is a notebook style interface integrating text and code, allowing you to create fully reproducible documents and render them to various elegantly formatted static or dynamic outputs.
You can learn more about R Markdown at their website, which has really informative lessons on its Getting Started page, and you can see the range of outputs you can create at its Gallery page.
7.1.1 Getting started with R Markdown
Let’s create a new document by going to File -> New File -> R Markdown. You will be prompted to add information like title and author. Give it the title “Week 7 Assignment- PCA” and change the output to Word document. Click OK to create the document.
This creates an outline of an R Markdown document, and you see the title, author, and date you gave the prompt at the top of the document in what is called the YAML header.
Notice that the file contains three types of content:
An (optional) YAML header surrounded by ---s
R code chunks surrounded by ```s
Text mixed with simple text formatting
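Putting these pieces together, a minimal .Rmd file might look something like the sketch below (the title, author, and chunk contents here are just placeholders):

---
title: "Week 7 Assignment- PCA"
author: "Your Name"
output: word_document
---

Text written in the body of the document, with simple Markdown formatting.

```{r}
# R code goes inside chunks like this
summary(cars)
```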
Since this is a notebook style document, you run the code chunks by clicking the green play button, and then the output is returned either directly below the chunk or in the console depending on settings in Tools -> Global Options -> R Markdown.
When you want to create a report from your notebook, you render it by hitting the 'Knit' button at the top of the Source pane, and it will render to the format you have specified in the YAML header. In order to do so, though, you need to have the rmarkdown package installed (which you will do below).
You can delete the rest of the code/text below the YAML header, and insert a new code chunk at the top. You can insert code chunks by clicking the green C with the '+' sign at the top of the source editor, or with the keyboard shortcut (Ctrl+Alt+I on Windows, Option+Command+I on Macs). For the rest of the lesson you will be writing and executing code through code chunks, and you can type any notes in the main body of the document.
The first chunk is almost always your setup code, where you read in libraries and any necessary data sets.
We need three new packages for this lesson. rmarkdown is necessary in order to render your R Markdown (or .Rmd) files to specified formats (e.g., HTML, Word, PDF). plotly is a graphics library that we will use to demonstrate how you can make ggplot figures interactive, and ggfortify is needed to plot PCA objects with ggplot2.
Since you only need to install packages once, you do not need to include this in your R Markdown document; you can just run it directly in the console:
install.packages("rmarkdown")
install.packages("plotly")
install.packages("ggfortify")
This chunk should be placed at the beginning of your document to set up your environment for carrying out the analysis for the lesson.
library(tidyverse)
library(lterdatasampler)
library(plotly)
library(ggfortify)
# retrieve data
data("hbr_maples")
The hbr_maples data set consists of sugar maple seedling traits measured in calcium-treated and non-treated sites to study the response of seedlings to calcium addition at the Hubbard Brook LTER.
Let’s learn a little more about this data set:
?hbr_maples
We have a lot of continuous variables representing leaf characteristics, and a few categorical variables (calcium-treated and non-treated (reference) sites, low- and mid-elevation sites).
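If you want a quick overview of every variable and its type, one option (using dplyr, which is loaded with the tidyverse in the setup chunk) is glimpse():

# quick overview of each column and its type
glimpse(hbr_maples)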
With a large set of quantitative variables that are likely inter-correlated, this is a highly multivariate data space. This is where Principal Component Analysis (PCA) comes into play. PCA is a type of ordination technique, which is a way of organizing things based on how similar they are. A PCA finds a new coordinate system by defining principal component (PC) axes that best account for the variation in this multivariate space, essentially reducing a large set of variables down to two variables that represent the two PC axes that (often) explain the most variance in the data. PCA is a common EDA (exploratory data analysis) technique for seeing patterns in multi-variable data sets, such as clusters among sites or samples, and/or you can use the PC variables in other analyses such as linear regression.
7.2 Principal Component Analysis (PCA)
First, let's check the temporal scale of this data set:
unique(hbr_maples$year)
Data was collected for two years, 2003 and 2004. Let's run through this lesson with just the 2003 data. You will notice that if you filter to just the 2004 samples, no data were collected on leaf area, so let's just analyze the 2003 data, which has 6 different trait variables. It's important to note that PCA does not handle NA values well, so let's use drop_na() here to drop any additional observations that may have NAs for the quantitative variables.
Note: using the : operator will select the range of columns from the one on the left of : to the one on the right, which works well for this data set since all of our quantitative variables of interest are ordered together.
maples03 <- hbr_maples %>%
  filter(year == 2003) %>%
  drop_na(stem_length:corrected_leaf_area)
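As an optional sanity check (not required for the assignment), you can confirm that no NAs remain in the trait columns after dropping them:

# count remaining NAs in each quantitative trait column; all should be 0
maples03 %>%
  summarise(across(stem_length:corrected_leaf_area, ~ sum(is.na(.))))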
Let's first see if our quantitative variables are inter-correlated at all, as a PCA does not work the best with entirely uncorrelated data. To do this, we can use the cor() function you've seen before, and we need to reduce our maples03 data to just the quantitative variables we are performing the PCA on.
vars <- maples03 %>%
  select(stem_length:corrected_leaf_area)
cor(vars)
From this we see quite a few correlated variables (coefficients ~0.7 or greater). This tells us a PCA is a good analysis to use to summarize this data set.
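If the full correlation matrix is hard to scan, a couple of optional base R helpers can make it easier to read (neither is required for the lesson):

# round the coefficients for readability
round(cor(vars), 2)

# pairwise scatterplots of the same variables
pairs(vars)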
There are a few different functions to conduct a PCA in R (many available in additional R packages such as the vegan package for community analyses), but we are going to use the prcomp() function from base R, a reliable and stable function for conducting PCA. prcomp() takes in just the quantitative variables you want to perform the PCA on, so we can use the vars object we just created above.
PCA is influenced by the magnitude of each variable, so scaling all your variables is often necessary. The prcomp() function can do this for us if we add scale = TRUE.
maples03_pca <- prcomp(vars, scale = TRUE)
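If you are curious what the PCA object contains, you can peek at its elements (we will use summary(), the rotation element, and the x element below):

# the prcomp object stores the standard deviations of the PCs (sdev),
# the variable loadings (rotation), the centering/scaling values, and
# the sample scores on each axis (x)
names(maples03_pca)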
7.3 Variance explained
First we want to view the PC axes created and assess the variance explained by each. We can do this with summary() of our PCA object.
summary(maples03_pca)
We can also view these results visually using a plot called a screeplot.
screeplot(maples03_pca)
Looks like the first 2 axes explain the large majority of the variance of the data space (>80%, which is often the desired outcome). Notice that this function plots 'Variances' on the y-axis, i.e., the eigenvalues, which reflect the total amount of variance explained by each axis rather than the proportion. The proportions are still interpretable from the relative heights of the bars, however.
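If you want the proportions themselves rather than the variances, you can compute them from the standard deviations stored in the PCA object (a quick sketch of the relationship between the two):

# eigenvalue (variance) of each PC divided by the total variance
# gives the proportion of variance explained, matching summary()
maples03_pca$sdev^2 / sum(maples03_pca$sdev^2)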
7.4 Variable loadings
We can next view the individual loadings of each variable (i.e., how much each variable contributes to each axis) by indexing the rotation element of the PCA object with the $ operator.
maples03_pca$rotation
Note that when we look at these individual loadings, variables with opposite +/- signs are negatively correlated. From this we see that on the first axis, PC1 (which explains the vast majority of the variance in the data set), all these variables are positively correlated. On PC2 we see a few negatively correlated variables, such as stem length and leaf dry mass.
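If it helps to see which variables dominate the first axis, you can pull out and sort the PC1 loadings (an optional step, not part of the core workflow):

# loadings on PC1, sorted from largest to smallest
sort(maples03_pca$rotation[, "PC1"], decreasing = TRUE)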
7.5 Visualize patterns
Now let's visualize some patterns with a biplot. We can create a biplot in base R with the biplot() function:
biplot(maples03_pca)
This, however, is pretty messy and hard to interpret. With the ggfortify package we can create biplots with ggplot2 using the autoplot() function. We add loadings = TRUE and loadings.label = TRUE to add the variable loadings to the plot, on top of the sample scores (denoted with points).
autoplot(maples03_pca, loadings = TRUE, loadings.label = TRUE)
This is still a little messy. To better view and interact with this visualization, we can leverage interactive plotting with the plotly package, and make this ggplot object interactive by putting it inside the ggplotly() function:
ggplotly(autoplot(maples03_pca, loadings = TRUE, loadings.label = TRUE))
Notice now you can hover over the vector lines and points to see the raw values, and also zoom in to the plot to see the names of the clustered variable loadings better.
We can view more patterns in the data by coloring points (i.e., seedling samples) by one of the other variables in our data set. This project involved assessing the impacts of calcium addition on sugar maple seedlings by comparing seedling traits among calcium-treated (W1) and untreated (Reference) sites. Therefore, let's color points by watershed treatment (the watershed variable) to see if there is any clustering in seedling traits among treatments.
To do so with the autoplot() function, we also specify the original data set we used in prcomp() and the name of the column we want to color by. (Note the colour argument is spelled the British way.) Let's make this interactive as well by using ggplotly().
ggplotly(autoplot(maples03_pca, data = maples03, colour = "watershed"))
There appears to be some visual separation/clustering among watershed types along PC1 (the x-axis). This pattern may lead you to statistically test for differences among watershed treatments, which you will carry out in Exercise 3 below.
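If you want to see the treatment colors and the variable loadings in the same view, you can combine the arguments from the two previous plots (optional):

# colored sample scores plus variable loadings in one interactive biplot
ggplotly(autoplot(maples03_pca, data = maples03, colour = "watershed",
                  loadings = TRUE, loadings.label = TRUE))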
7.6 A Note on Rendering:
When knitting/rendering an R Markdown document to Word or PDF (anything other than HTML), it will fail if there is any code executing interactive visualizations. This includes anything created with ggplotly(). Therefore, before rendering your assignment to a Word doc, you should add eval = FALSE to any code chunk using ggplotly(). Put this argument in the chunk header at the top of the code chunk; it should look like this: {r eval = FALSE}.
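For example, a ggplotly() chunk in your .Rmd file might look like this once the argument is added:

```{r eval = FALSE}
ggplotly(autoplot(maples03_pca, loadings = TRUE, loadings.label = TRUE))
```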
7.7 Exercises
To complete the assignment this week, you will write your entire workflow (i.e., the entire PCA lesson above) in an R Markdown document. At the bottom of your R Markdown document (after carrying out the PCA) please add an ‘Exercises’ section, write (or paste) questions 1-3 below and write your responses directly below them, and finally knit your completed R Markdown file as a Word document. For the assignment you will only need to submit the rendered Word document, which should contain all the code you ran AND your responses to the questions below (some of which also include code!).
1. Looking at the biplot, interpret the variable loadings based on vector angles. Which variables are positively, negatively, and uncorrelated with each other? (5 pts.)

2. Make a biplot (using the autoplot() function) and color samples by elevation. Include the static plot (i.e., do not use ggplotly()). Do you notice any clustering? (8 pts.)

3. We notice that there seems to be a visual separation among watershed treatments along our PC1 axis (which represents the entire set of seedling traits). Since we now have a single quantitative variable (PC1) and a categorical predictor (watershed), we can perform a t-test between watershed treatments to see if this difference in seedling traits is significant. Run the following chunk of code to add the PC1 variable to the original maples03 data. Then perform a t-test and report your findings. Include all the code you ran (remember how to test for t-test assumptions and properly address them if needed) and format your interpretation as you would in a scientific paper/report. (12 pts.)
# the PCA operates row-wise, so we can bind the columns since they have the
# same number of rows, all paired up. Recall that the 'x' element of the PCA
# shows each individual sample's score on each axis
maples03 <- bind_cols(maples03, maples03_pca$x)
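As a starting point for Exercise 3, the general form of a two-sample t-test with a formula looks like the sketch below; checking the t-test assumptions (and adjusting the test if they are violated) is left to you.

# compare PC1 scores between the two watershed treatments
t.test(PC1 ~ watershed, data = maples03)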