Chapter 1 Visualizing data
1.1 Introduction
Good statistical analyses begin and end with visualizations. By visualizing your data before conducting statistical analyses, you will discover patterns and identify interesting observations in your data. Many of the techniques involved in exploring and acquainting yourself with data have been pioneered in the field of exploratory data analysis (Tukey 1977).
There are several parts to the data analysis life cycle, and visualizing data should be done frequently throughout this process.
In this chapter we will learn how to describe the different components of data. This will make visualizing the data much easier because we will use terminology that is appropriate for the variables we are working with. Visualizing the data usually gives us a good indication of which kinds of statistical analyses can be done with them.
To implement the visualizations in the chapter, we will use the ggplot2 package, one of the core packages found in the tidyverse megapackage. The gg in ggplot2 is an abbreviation of “grammar of graphics,” a language that provides insight into the deep structure behind visualizations. Resources developed by Wilkinson (2005) and Wickham (2010) provide much more on the theory and application of this language.
If this is your first time using the tidyverse package, you will need to install it:
install.packages("tidyverse")
Then, you will need to load the package using the library() function:
library(tidyverse)
We will use the ggplot2 functions extensively throughout this chapter to learn about data through visualizations.
Visualizing and graphing data are grouped in the area of descriptive statistics. It is important to learn these skills so that we can inform our statistical tasks in later chapters. And because all good analyses end with visualization, our last chapter will focus on how we design graphs and visualizations to clearly convey our statistical results.
1.2 Components of data
In the simplest sense, data are a collection of facts. In natural resources disciplines, data are often collected across long time periods and at multiple resolutions. Increasingly, natural resources professionals are integrating data they collect with other data. For example, a study about the productivity of forests might integrate both forest inventory data with current and future climate data to understand the effects of global change patterns on tree growth (e.g., Weiskittel et al. 2011).
To be effective in visualizing data, it helps to understand how to describe the various components of data. First, data need to be organized in a structured format to get the most value from them. A tidy format accomplishes this: every variable is a column, every observation is a row, and each type of observational unit is contained in a table (Wickham 2014). Keeping data tidy often involves creating “rectangles” of data within a spreadsheet-like format (Broman and Woo 2018). Natural resources data can also be unstructured, such as documents with large amounts of text from transcripts or environmental remote sensing data with voluminous pixels, but such data need to be wrangled into a usable format before they can be analyzed.
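As a minimal sketch of what tidying can look like, the code below reshapes a small table of tree heights from a “wide” layout (one column per year) into a tidy layout. The data frame heights_wide, its column names, and its values are hypothetical and serve only as an illustration. The pivot_longer() function comes from the tidyr package, which is part of the tidyverse.

```r
library(tidyr)

# Hypothetical "wide" data: one row per tree, one column per measurement year
heights_wide <- data.frame(
  tree_id = c("A", "B", "C"),
  ht_2019 = c(21, 30, 18),
  ht_2020 = c(23, 31, 20)
)

# Tidy layout: every variable (tree_id, year, height) is a column and
# every observation (one tree measured in one year) is a row
heights_tidy <- pivot_longer(
  heights_wide,
  cols = c(ht_2019, ht_2020),
  names_to = "year",
  names_prefix = "ht_",
  values_to = "height"
)

heights_tidy
```

In the tidy version, year can be used directly as a variable in a graph, which is not possible when the years are spread across column names.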
In R, most natural resources data are best analyzed by importing and working with data as data frames. Within these data frames, variables are stored as columns and observations (also defined as cases or records) as rows. When we visualize data, the variable names are plotted along the axes of graphs and the observations make up the elements of a graph.
1.2.1 Categorical variables
A categorical variable (also known as a factor) places an observation into a group or category. Examples include season of the year (i.e., spring, summer, fall, or winter) and plant and animal taxonomy (e.g., genus and species). Categorical variables use a nominal approach to label observations that fit into a group.
A categorical variable can have many categories. Generally, as the number of categories increases, it becomes more difficult to visualize and test for differences across them. At the other extreme, a categorical variable can be binary, taking only one of two possible outcomes. Examples include presence or absence, alive or dead, and positive or negative. Categorical variables are not necessarily quantifiable, but some are ordinal: the order of their values is meaningful, even though the difference between adjacent values cannot be quantified.
1.2.2 Quantitative variables
Quantitative variables take numerical values. These variables can be discrete, where data are based on integers or counts. An example of discrete data includes the number of plant species found within a genus. Discrete variables are common in natural resources where they are often referred to as count data.
Quantitative variables can also be continuous where they take on any value within an interval. An example is the current air temperature. Generally, if you can add, subtract, multiply, and divide a variable by another variable, it has the properties of a quantitative variable and is not categorical.
DATA ANALYSIS TIP: Oftentimes categorical variables are coded as quantitative variables in a data set. For example, a study on the biology of deer may code each animal’s sex as 1 (male) or 0 (female). This may not matter much in the data exploration stage (although you’ll need to know which sex each number represents), but it will have tremendous impacts at the stage of statistical analysis. It is a good practice to recode these “categorical variables in disguise” so that R recognizes them as categorical variables, e.g., “Male” and “Female.”
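One way to carry out this kind of recoding is with the base R factor() function. The sketch below assumes a small, made-up data frame named deer with a numeric sex_code column; both names and all values are hypothetical.

```r
# Hypothetical deer records where sex is stored as a number
deer <- data.frame(
  deer_id = 1:5,
  sex_code = c(1, 0, 0, 1, 0)
)

# Recode the numeric codes as a labeled categorical variable (a factor)
deer$sex <- factor(deer$sex_code,
                   levels = c(0, 1),
                   labels = c("Female", "Male"))

# summary() reports a mean for the numeric codes but counts for the factor
summary(deer$sex_code)
summary(deer$sex)
```

Comparing the two summary() calls shows why the recoding matters: a mean of the 0/1 codes is reported for the numeric column, whereas the factor is summarized as counts per category.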
1.2.3 The elm data set
The elm data set in the stats4nr package contains several variables that are useful for understanding the components of data. These data contain observations on 333 cedar elm trees (Ulmus crassifolia Nutt.) measured in Austin, Texas (Russell 2020). We will load the data as a data frame and then print out to the R console:
library(stats4nr)
elm
## # A tibble: 333 x 8
## STATUSCD SPCD DIA HT CROWN_HEIGHT CROWN_DIAM_WIDE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 838 5 32 19.2 19
## 2 1 838 5 25 11.3 11
## 3 1 838 5.1 21 6.3 10
## 4 1 838 5.1 27 18.9 13
## 5 1 838 5.1 22 18.7 6
## 6 1 838 5.1 27 18.9 11
## 7 1 838 5.2 29 11.6 12
## 8 1 838 5.2 20 7 9
## 9 1 838 5.2 18 17.8 12
## 10 1 838 5.2 17 6 17
## # ... with 323 more rows, and 2 more variables:
## # UNCOMP_CROWN_RATIO <dbl>, CROWN_CLASS_CD <dbl>
By default, R will print the first 10 rows of a data frame in the tibble format. Tibbles are essentially data frames used within the tidyverse package. Note that R reads in the variables in the col_double() format (shown as <dbl>), which stores a quantitative variable that is not restricted to integer values.
The variables in elm include:
- DIA, the tree’s diameter at breast height, measured in inches,
- HT, the total height of the tree, measured in feet,
- CROWN_HEIGHT, the height at the base of the crown, measured in feet,
- CROWN_DIAM_WIDE, the width of the live crown at the widest point, measured in feet, and
- CROWN_CLASS_CD, the tree’s crown class code that indicates the relative crown position of the tree: Open grown (1), Dominant (2), Co-dominant (3), Intermediate (4), or Suppressed (5).
A few handy functions allow you to inspect the contents of any data frame. The dim() function returns the number of observations and variables, or dimensions, in a data frame:
dim(elm)
## [1] 333 8
The head() and tail() functions return the first and last six observations, respectively:
head(elm)
## # A tibble: 6 x 8
## STATUSCD SPCD DIA HT CROWN_HEIGHT CROWN_DIAM_WIDE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 838 5 32 19.2 19
## 2 1 838 5 25 11.3 11
## 3 1 838 5.1 21 6.3 10
## 4 1 838 5.1 27 18.9 13
## 5 1 838 5.1 22 18.7 6
## 6 1 838 5.1 27 18.9 11
## # ... with 2 more variables: UNCOMP_CROWN_RATIO <dbl>,
## # CROWN_CLASS_CD <dbl>
tail(elm)
## # A tibble: 6 x 8
## STATUSCD SPCD DIA HT CROWN_HEIGHT CROWN_DIAM_WIDE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 838 23.7 37 27.8 53
## 2 1 838 24.7 52 44.2 42
## 3 1 838 28.1 39 37.1 48
## 4 1 838 28.7 62 55.8 55
## 5 1 838 29 46 27.6 45
## 6 1 838 43 70 66.5 50
## # ... with 2 more variables: UNCOMP_CROWN_RATIO <dbl>,
## # CROWN_CLASS_CD <dbl>
The summary() function provides summary statistics for all quantitative variables in the data, including the mean, median, and quantile values:
summary(elm)
## STATUSCD SPCD DIA HT
## Min. :1 Min. :838 Min. : 5.00 Min. :15.00
## 1st Qu.:1 1st Qu.:838 1st Qu.: 6.60 1st Qu.:24.00
## Median :1 Median :838 Median : 8.80 Median :30.00
## Mean :1 Mean :838 Mean :10.42 Mean :31.53
## 3rd Qu.:1 3rd Qu.:838 3rd Qu.:12.70 3rd Qu.:37.00
## Max. :1 Max. :838 Max. :43.00 Max. :70.00
## CROWN_HEIGHT CROWN_DIAM_WIDE UNCOMP_CROWN_RATIO CROWN_CLASS_CD
## Min. : 4.8 Min. : 4.00 Min. :15.0 Min. :1.000
## 1st Qu.:15.0 1st Qu.:15.00 1st Qu.:50.0 1st Qu.:3.000
## Median :18.9 Median :20.00 Median :65.0 Median :3.000
## Mean :20.3 Mean :23.41 Mean :64.8 Mean :3.174
## 3rd Qu.:24.7 3rd Qu.:30.00 3rd Qu.:80.0 3rd Qu.:3.000
## Max. :66.5 Max. :57.00 Max. :99.0 Max. :5.000
You might notice that the summary() function does not provide much useful detail for the categorical variable CROWN_CLASS_CD because it is stored as a number. For categorical variables, the table() function works well and provides the number of observations for each category. You can follow this by using prop.table() to calculate each category’s proportion of observations relative to the entire data set.
We can call any variable from a data frame by typing dataframe$variable. To calculate the number and proportion of observations in each crown class of the elm data, we can use:
n_Crowns <- table(elm$CROWN_CLASS_CD)
n_Crowns
##
## 1 2 3 4 5
## 4 6 269 36 18
prop.table(n_Crowns)
##
## 1 2 3 4 5
## 0.01201201 0.01801802 0.80780781 0.10810811 0.05405405
As the data show, over 80% of the cedar elm trees in the data set have a co-dominant crown class.
1.2.4 Exercises
1.1 In your own discipline, find a data set or experiment that you’re familiar with and reflect on the variables contained in the data. For categorical variables, list which variables are binary or ordinal. For quantitative variables, list which ones are discrete or continuous.
1.2 R has several built-in data sets that can be explored. Load CO2, a data set containing carbon dioxide uptake in grass plants, by typing CO2 <- tibble(CO2). Learn about the variables in the data by typing ?CO2. Inspect the data and report the minimum and maximum values for the uptake variable. Determine how many plants were measured in the experiment and the number of chilled observations in the Treatment variable.
1.3 We can create new variables in existing data frames by using the mutate() function from the dplyr package. To add a new variable to an existing data frame, we can type the name of the data frame followed by “the pipe,” written as %>% (or sometimes |>). The pipe is shorthand for saying “then.” In other words, use my data frame “then” make a new variable in it.
As an example, we might be interested in making a new variable in the elm data that converts the diameter in inches to centimeters. To accomplish this, we would type elm %>% mutate(DIA_cm = DIA * 2.54). The result is a new column called DIA_cm that contains the tree’s diameter at breast height in centimeters. Multiple pipes can be written in a block of code, which we’ll see later in this book.
Add a new variable to the CO2 data frame by collapsing the conc variable into a binary one. Research the ifelse() function and use it within the mutate() function to label all observations with an ambient carbon dioxide concentration of 500 mL/L or greater as “HIGH” and all others as “LOW.”
1.3 Graphics for visualizing data
Run the following code to plot data from the elm data set using the ggplot2 package. In this example, we will create a scatter plot showing the diameter of elm trees on the x-axis and their height on the y-axis:
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_point()
What results is a trend that we would expect—trees that are larger in diameter are also taller. This is helpful because a tree’s diameter is relatively easy to measure, but a tree’s height often requires more time and effort to obtain.
Now, we’ll step through the code that produced the scatter plot above:
- The ggplot() function tells R that we want to produce a plot, and we need to tell it which data set and variables to use.
- The data = statement specifies that we want to plot variables from the elm data set.
- We specify the variables DIA and HT within the aes() statement. The aes is the abbreviation for aesthetics, which allows us to change the properties of how the data are shown in the graph. As it turns out, we have not done anything special with the current aesthetics in the scatter plot, but in the future we can add to the aesthetics by specifying different colors, shapes, and sizes for the data points.
- The geom_point() statement tells R that we want to produce a geometric object with points, i.e., a scatter plot. We will learn there are many “geom” types that can plot different layers of objects depending on the nature of the data and what you want to see.
Note that we add a + at the end of the line when we use ggplot(). This indicates we have more instructions for R before it creates our scatter plot.
The way that we created the elm scatter plot will be the same style we’ll use for all of our graphs. That is, the first line will tell ggplot() the data set and variables to plot (along with any additional aesthetics). The second line will specify which kind of graph to create (i.e., which “geom”).
1.3.1 Visualizing categorical data
1.3.1.1 Bar plots
Bar plots are one of the most effective ways to display categorical data. These plots represent the categories as bars and their length shows the counts or percentages within each category. Before creating a bar plot, it is useful to investigate the values of the data in the plot. As an example, we may wish to analyze the number of trees in the elm data set by crown class codes. We can find the number of tree observations within each CROWN_CLASS_CD by using the count() function with a pipe:
elm %>%
  count(CROWN_CLASS_CD)
## # A tibble: 5 x 2
## CROWN_CLASS_CD n
## <dbl> <int>
## 1 1 4
## 2 2 6
## 3 3 269
## 4 4 36
## 5 5 18
We observe that the co-dominant crown class has the most trees and the open grown crown class has the fewest. The count() function makes this easy, but we can also obtain the counts with different code, and we might also want to determine the percentage of trees within each crown class category by grouping the data and then summarizing it.
To do this, we will start by grouping the data by CROWN_CLASS_CD using the group_by() statement. Then, we use the summarize() function to make a new variable called n_trees that counts the number of observations in each CROWN_CLASS_CD. This step produces the same output that the count() function provided in the previous step, so it’s not particularly novel. In the last line, we use the mutate() function to calculate a new variable Pct that provides the percentage of observations in the data within each crown class.
We’ll also want to use this new data set in the future, so we will assign it the name elm_summ. These operations are handy tools found in the dplyr package, a core package in the tidyverse that allows you to transform and reshape data:
elm_summ <- elm %>%
  group_by(CROWN_CLASS_CD) %>%
  summarize(n_trees = n()) %>%
  mutate(Pct = n_trees / sum(n_trees) * 100)
elm_summ
## # A tibble: 5 x 3
## CROWN_CLASS_CD n_trees Pct
## <dbl> <int> <dbl>
## 1 1 4 1.20
## 2 2 6 1.80
## 3 3 269 80.8
## 4 4 36 10.8
## 5 5 18 5.41
We observe that the co-dominant and open grown crown classes represent 80.8% and 1.2% of the data, respectively.
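With the counts in hand, a bar plot can be drawn directly from the summarized values. The following is a minimal sketch, not necessarily how the plot would be built from the elm data themselves: the data frame crown_counts is a hypothetical stand-in whose values are copied from the output above so that the code runs on its own.

```r
library(ggplot2)

# Counts copied from the elm_summ output above so the sketch is self-contained
crown_counts <- data.frame(
  CROWN_CLASS_CD = factor(1:5),
  n_trees = c(4, 6, 269, 36, 18)
)

# geom_col() draws bars whose heights are the values already in the data;
# geom_bar() would instead count the rows in each category
p_bar <- ggplot(data = crown_counts, aes(x = CROWN_CLASS_CD, y = n_trees)) +
  geom_col()

p_bar
```

The choice between the two geoms depends on the shape of the data: geom_col() suits a table that already holds one row per category, while geom_bar() suits raw data with one row per observation.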
1.3.1.2 Pie charts
Pie charts show the distribution of a variable as a “pie” whose slices are sized by the counts or percentages for the categories. We can plot the elm tree crown class data as a pie chart using the elm_summ data we created in an earlier step. Because we are plotting specific values in that data set, we’ll specify stat = "identity" in the geom_bar() layer. The key to creating a pie chart is to specify coord_polar() in the last line. A pie chart shows the same information as a bar plot, but in polar coordinates:
ggplot(data = elm_summ, aes(x = "", y = n_trees,
fill = CROWN_CLASS_CD)) +
geom_bar(stat = "identity") +
coord_polar("y")
Pie charts should be used with caution. Early work by Cleveland and McGill (1984, 1985) points out several reasons for favoring other kinds of graphs over pie graphs:
- The human eye does a poor job in judging angles. This makes it difficult to discern the values within pie graphs.
- Pie graphs contain color, and the human eye does a very poor job in judging color in graphs. This is especially true for individuals who are color blind or have other visual impairments.
Regardless, because of the widespread use of pie graphs it is helpful to learn how to construct and interpret them. As an alternative, place more priority in developing graphs such as bar plots when displaying categorical data because the human eye does a good job of discerning position and lengths of objects.
1.3.1.3 Polar area graphs
Another use of polar coordinates in graphing is a polar area graph, also called a coxcomb plot or rose diagram. Florence Nightingale, a nurse and pioneer in statistical graphics, popularized the use of the polar area graphs in the mid-19th century. They are similar to pie charts, but they have identical angles and extend from the plot’s center depending on the magnitude of the values that are plotted.
Think of the polar area diagram as a pie chart meets a histogram. A few advantages of polar area diagrams include:
- They are useful for plotting cyclical data. For example, the counts of a phenomenon in each of the 12 calendar months of a year.
- They are easy to read around the “rose” because data are presented chronologically.
- Multiple layers can be added within a diagram. Nightingale presented these kinds of layers in her visualizations of soldier deaths and wounds during the Crimean War.
In R, a polar area graph can be created by combining the geom_bar() and coord_polar() layers:
ggplot(data = elm, aes(x = CROWN_CLASS_CD)) +
geom_bar() +
coord_polar()
For the elm data, the polar area graph reveals the large number of co-dominant trees. As you can see, polar area graphs have an advantage over pie charts because they allow the reader to see the depth within each category.
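Polar area graphs are perhaps most natural for cyclical data such as monthly counts. The sketch below uses a made-up data frame named monthly with hypothetical counts of some event in each calendar month; it uses geom_col() because the counts are already summarized.

```r
library(ggplot2)

# Hypothetical counts of some event in each calendar month
monthly <- data.frame(
  month = factor(month.abb, levels = month.abb),
  n_events = c(3, 5, 9, 14, 20, 26, 30, 27, 18, 10, 6, 4)
)

# Each month gets the same angle; the radius of a wedge reflects its count
p_rose <- ggplot(data = monthly, aes(x = month, y = n_events)) +
  geom_col(width = 1) +
  coord_polar()

p_rose
```

Setting the factor levels to month.abb keeps the months in calendar order around the circle instead of alphabetical order.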
1.3.2 Visualizing quantitative data
1.3.2.1 Stem plots
Stem plots, also termed stem-and-leaf plots, are useful to separate each observation into a stem (all but the rightmost digit) and a leaf (the remaining digit). The process involves first positioning the stems in a vertical column, then drawing a vertical line to the right of the stems. The last step positions each leaf in the row to the right of its stem.
In R, the stem() function produces a stem plot and sorts its leaves in ascending order. Here is an example of a stem plot of the HT variable from the elm data set:
stem(elm$HT, width = 50)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 1 | 55566777777778888888999999
## 2 | 00000000001111111111111222222222222223+11
## 2 | 55555555555555555566666666666677777777+27
## 3 | 00000000000000000011111112222222222222
## 3 | 55555555555555555666666666666667777777+1
## 4 | 00000001122222333344444
## 4 | 555555555666666677777888899
## 5 | 00011122333
## 5 | 55568
## 6 | 2
## 6 |
## 7 | 0
We can quickly observe that the greatest number of observations are between 25 and 29 feet tall.
1.3.2.2 Histograms and density plots
Histograms divide the possible values of a variable into classes or intervals of equal widths. The histogram shows how many observations fall within each interval. The height of each bar is equal to the number (or percent) of observations in its interval. The following code creates a histogram for HT:
ggplot(data = elm, aes(HT)) +
geom_histogram()
Any histogram can be reshaped depending on how many bins you specify in the geom_histogram() layer. For most applications, between 20 and 30 bins works well, but the appropriate number of bins will depend on the number of observations and the distribution of the data. Here is an example with 10 bins where you can see the coarser resolution that the histogram provides:
ggplot(data = elm, aes(HT)) +
geom_histogram(bins = 10)
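The resolution of a histogram can be controlled either by the number of bins or by the width of each bin. The sketch below contrasts the two arguments using the uptake variable in the built-in CO2 data set; the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# uptake is a continuous variable in the built-in CO2 data set
# bins = sets how many intervals to use
p_bins <- ggplot(data = CO2, aes(uptake)) +
  geom_histogram(bins = 10)

# binwidth = sets the width of each interval, in the units of the variable
p_width <- ggplot(data = CO2, aes(uptake)) +
  geom_histogram(binwidth = 5)

p_bins
p_width
```

Specifying binwidth can be easier to interpret because it is expressed in the units of the variable, whereas bins is unitless.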
Density plots are similar to histograms but show the distribution of a variable using a smoothed curve. They may be advantageous over histograms because they show a detailed distribution and are not affected by the number of bins you select (although their shape does depend on the amount of smoothing applied). The geom_density() layer provides density plots:
ggplot(data = elm, aes(HT)) +
geom_density(color = "blue")
DATA ANALYSIS TIP: We can describe the density or distribution of a variable as symmetric, left-skewed, or right-skewed. A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other. A distribution is right-skewed if the right side of the graph is much longer than the left side. A distribution is left-skewed if the left side of the graph is much longer than the right side. For the elm heights we can say that the distribution is right-skewed, indicating that there are a few tall trees relative to many shorter trees.
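A quick numerical check that often accompanies a visual assessment of skewness is to compare the mean and the median. The sketch below uses a simulated, hypothetical right-skewed sample (not the elm data); the seed, sample size, and rate are arbitrary choices.

```r
# Simulated, hypothetical right-skewed sample (not the elm data)
set.seed(123)
x <- rexp(500, rate = 1 / 30)

# In a right-skewed distribution, the long right tail typically pulls
# the mean above the median
mean(x)
median(x)
mean(x) > median(x)
```

The summary() output for the elm data shows the same pattern for HT, with a mean of 31.53 feet and a median of 30 feet. This is only a rule of thumb, so it is best used alongside a graph rather than in place of one.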
1.3.2.3 Box plots and violin plots
The median of a variable and its quartiles divide the distribution roughly into quarters. The median is the middle value when the values of a quantitative variable are sorted. Three quartiles separate a variable into four parts, where the second quartile is the median that separates the upper and lower halves of the observations. The first and third quartiles have 25% and 75% of values below them, respectively. A box plot shows the minimum, first quartile (Q1), median, third quartile (Q3), and maximum values of a variable. Here is a box plot for the height of elm trees:
ggplot(data = elm, aes(x = 1, y = HT)) +
geom_boxplot()
The Q1 and Q3 values make up the ends of the “box” in the box plot, while the “whiskers” extend toward the minimum and maximum values, leading to the alternative “box-and-whisker” plot name. In ggplot2, a whisker reaches no farther than 1.5 times the interquartile range (Q3 minus Q1) beyond the box, and observations outside that range are shown as points and categorized as outliers. You will note that three observations, all greater than 55 feet tall, are categorized as outliers here. We will discuss how to deal with outliers in our statistical analyses later, but for now, you can understand that these tree observations are considered outliers because their magnitude sets them apart from the range of all the other observations.
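The outliers in a ggplot2 box plot are the points that lie more than 1.5 times the interquartile range beyond the box. As a worked example, the arithmetic below applies that rule to the quartiles of HT reported by summary(elm) earlier in the chapter. It is a sketch: geom_boxplot() computes its own quartiles from the raw data, which may differ slightly from the summary() values.

```r
# Quartiles of HT as reported by summary(elm) earlier in the chapter
q1 <- 24
q3 <- 37

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr   # points below this are drawn as outliers

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The upper fence works out to 56.5 feet. In the stem plot of HT, the only heights above that value are 58, 62, and 70 feet, which agrees with the three outliers in the box plot.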
Violin plots are similar to a box plot, but also show a kernel probability density of the data. Violin plots are quite similar to density plots. In a violin plot that shows the heights of the elm trees, we can see that the height peaks between 25 and 30 feet:
ggplot(data = elm, aes(x = 1, y = HT)) +
geom_violin()
One of the greatest attributes of the ggplot2 package is the relative ease with which you can add additional variables to existing plots. For example, we can add the variable CROWN_CLASS_CD to the box plots and violin plots along the x-axis to investigate the trends in height across different tree crown classes. Note that although the crown class codes are numeric, we use factor(CROWN_CLASS_CD) to tell R to treat them as a categorical variable:
ggplot(data = elm, aes(x = factor(CROWN_CLASS_CD), y = HT)) +
geom_boxplot()
ggplot(data = elm, aes(x = factor(CROWN_CLASS_CD), y = HT)) +
geom_violin()
1.3.2.4 Scatter plots
Scatter plots, like we saw with the elm data previously, are some of the best tools for visualizing bivariate numerical data. The power of the aesthetics and geom features in ggplot() allows us to present the same general scatter plot of elm diameter and height that we saw previously, but “supercharged” with some advanced features.
To start, we can add the tree’s crown class code as a mapping variable in the aes() statement to add color to our scatter plot. We see that most of the tallest trees are in the co-dominant crown class (CROWN_CLASS_CD = 3):
ggplot(data = elm, aes(x = DIA, y = HT,
color = factor(CROWN_CLASS_CD))) +
geom_point()
Adding a trend line can easily reveal a relationship between two continuous variables. This is helpful if you need to make a quick approximation between two variables. Adding a trend line can also reveal whether a linear or nonlinear relationship exists in the data. The geom_smooth() layer can be added to the code to reveal the trend between the two variables. This function fits a smoothed conditional mean to the data (in blue), along with a confidence interval surrounding the estimate (in gray). After fitting a trend line to the elm data you could say “A 20-inch diameter elm tree will be approximately 45 feet tall.”
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_point() +
geom_smooth()
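To judge whether a relationship is roughly linear, it can help to overlay a straight-line fit on the default smoother. The sketch below does this with the built-in mtcars data set (car weight versus fuel economy) so that it runs without the elm data; the red color is an arbitrary choice.

```r
library(ggplot2)

# Built-in mtcars data: car weight (wt) versus miles per gallon (mpg)
p_trend <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Default smoother (a flexible curve) with its confidence band
  geom_smooth() +
  # Straight-line fit for comparison, drawn without a confidence band
  geom_smooth(method = "lm", se = FALSE, color = "red")

p_trend
```

Where the flexible curve departs noticeably from the straight line, a nonlinear relationship may be worth considering in later analyses.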
Another way to easily see the differences in the ranges of two numerical variables in a scatter plot is to plot each level of a categorical variable in its own panel. The facet_wrap() statement allows you to do this.
In this case we easily see that co-dominant trees span the full range of diameters and heights, while the other crown classes have a narrower range with fewer observations:
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_point() +
facet_wrap(~CROWN_CLASS_CD)
The facet_wrap() function works well when you have a single categorical variable to facet. The facet_grid() function allows you to plot two categorical variables simultaneously. We don’t have another categorical variable in the elm data that could serve as a second variable. But you could imagine that if we had multiple tree species in the data set, we could plot the five crown classes vertically and the species horizontally.
A “hexagonal scatter plot” can also be produced with ggplot(). It divides the plotting area defined by the x- and y-axes into hexagons, and the color of each hexagon reflects the number of observations that fall within it. The geom_hex() layer draws the hexagons and fills them according to these counts. You can think of this as a scatter plot meets a histogram, where the colors indicate how many observations are contained within each bin.
The hexbin package in R provides functions to plot hexagonal scatter plots. Install the package first, and then load it to use the functions. Here’s an example with the number of bins along the x- and y-axis set to 20:
install.packages("hexbin")
library(hexbin)
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_hex(bins = 20)
The hexagonal scatter plot shows that most of the observations in the elm data set are less than 10 inches in diameter and are shorter than 30 feet tall. In the original scatter plot, due to overlapping points in “busy” areas of the graph, this finding can’t really be observed. Knowing that this pattern exists can be insightful for future data analysis.
The number of bins in geom_hex() can be increased to see a finer resolution (with fewer observations grouped into each hexagon). Or it can be decreased to see a coarser resolution (with more observations grouped into each hexagon).
1.3.3 Exercises
1.4 Run the code ggplot(data = elm, aes(x = DIA, y = HT)). What is the result and why do you see what you see?
1.5 Using the CO2 data set, write code using ggplot() that creates a bar plot displaying the number of measurements for each plant. Note that because the number of measurements is the same for each plant, the plot will look uniform.
1.6 Create a series of hexagonal scatter plots using the uptake variable in the CO2 data set. Change the number of bins to 5, 20, 40, and 60. (Remember to load the hexbin package.) What do you notice as the number of bins increases?
1.7 Create a grid of scatter plots by specifying the facet_grid() statement in ggplot(). This statement creates a matrix of panels using two faceting variables. Plot the conc and uptake variables from the CO2 data set within the scatter plot. Then, specify Type and Treatment as the faceting variables. Which Type, i.e., the location of the plants in the data set, contains the greater values of uptake?
1.4 Enhancing graphs
1.4.1 Adding text elements
Up to now, we’ve been using some of the basic plotting features available in ggplot2. Oftentimes these techniques are all that we need to complete our stage of exploratory data analysis. However, you will likely need to produce publication-quality graphs and figures to share with a broader audience as you continue your journey in data analysis. This section describes some additional code you can add to a ggplot() call to improve the quality of your graphs.
To start, we have already committed the cardinal sin in designing figures: we have not added units to the x- and y-axes of our plots. The labs() statement can be added to do this, as we see here in the diameter and height trends from the elm data:
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_point() +
labs( x = "Tree diameter (inches)",
y = "Tree height (feet)")
With labs() you can also add a title, subtitle, and caption to a graph. This can be helpful for describing the contents of the graph, a key result that the graph displays, and the source of the data:
ggplot(data = elm, aes(x = DIA, y = HT)) +
geom_point() +
labs( x = "Tree diameter (inches)",
y = "Tree height (feet)",
title = "Cedar elm trees in Austin, Texas",
subtitle = "Trees that have a larger diameter are taller.",
caption = "Source: USDA Forest Inventory and Analysis")
You can also arrange the elements of a graph so that it is easier for the reader to understand. One of the key messages from the work of Cleveland and McGill (1984, 1985) is that trends are easily distinguished when they are arranged as objects representing length along a common scale. This is helpful when displaying categorical variables in graphs such as bar plots. You can order the elements of a graph from the most to the least frequent category (i.e., in descending order) using the fct_infreq() statement. This function is available in the forcats package, another core set of functions in the tidyverse that deals with factor variables (also known as categorical variables). Here is a bar plot of the number of trees by different crown classes from the elm data set with the values descending:
ggplot(data = elm,
aes(x = fct_infreq(factor(CROWN_CLASS_CD)))) +
geom_bar()
You can also use the fct_rev() function within forcats combined with fct_infreq() to sort the values in ascending order:
ggplot(data = elm, aes(
x = fct_rev(fct_infreq(factor(CROWN_CLASS_CD))))) +
geom_bar()
Natural resources data contain many ordinal variables. If a variable is ordinal in its design, it is effective to order the graph manually according to the values of the variable. For example, because the tree crown classes depict how much sunlight a tree receives, we could order the categories in the elm data as open grown, dominant, co-dominant, intermediate, and suppressed.
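One way to set such an order manually is to list the levels explicitly when creating a factor. The sketch below applies the crown class labels to a short, hypothetical vector of codes named codes; ggplot2 then plots the categories in the order of the factor levels.

```r
# Hypothetical crown class codes for eight trees
codes <- c(3, 3, 5, 1, 4, 3, 2, 3)

# Attach labels and fix the order from most to least sunlight
crown_class <- factor(
  codes,
  levels = 1:5,
  labels = c("Open grown", "Dominant", "Co-dominant",
             "Intermediate", "Suppressed")
)

levels(crown_class)
table(crown_class)
```

The same factor() call could be used inside mutate() to add a labeled crown class variable to a data frame before plotting.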
After exploring your data to understand trends, spending time to arrange the values and graphs so that they make more sense from a biological or numerical perspective will help the reader to better understand your analysis.
1.4.2 Adjusting plot layouts and themes
We have also been using the default ggplot output when looking at our graphs. A number of additional components can be adjusted to change the layout of a plot. A few examples include:
- The scale of the x- and y-axes can be adjusted with the statements scale_x_continuous() and scale_y_continuous(). The limits = statement can be specified here and allows you to change the upper and lower bounds of the axis.
- Legends can be repositioned to appear on the top, bottom, left, or right of the graph within the theme() statement.
- You can change the default style to a white background using panel.background = element_rect(fill = "NA")
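These adjustments can be combined in a single plot. The following is a minimal sketch using the mpg data set that ships with ggplot2; the axis limits, the legend position, and the explicit white fill are arbitrary choices for illustration.

```r
library(ggplot2)

p_layout <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # Set the lower and upper bounds of each axis
  scale_x_continuous(limits = c(0, 8)) +
  scale_y_continuous(limits = c(0, 50)) +
  # Move the legend and switch to a white panel background
  theme(legend.position = "bottom",
        panel.background = element_rect(fill = "white"))

p_layout
```

Keep in mind that observations falling outside the limits are dropped from the plot with a warning, so limits narrower than the data will remove points rather than simply hide them.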
You can continue to modify elements within the theme() statement. However, there are also several complete themes you can select. For example, here are the scatter plots for the elm data with the theme_bw(), theme_classic(), and theme_void() themes available in ggplot2:
ggplot(data = elm, aes(x = DIA, y = HT)) +
  geom_point() +
  labs(title = "theme_bw()") +
  theme_bw()

ggplot(data = elm, aes(x = DIA, y = HT)) +
  geom_point() +
  labs(title = "theme_classic()") +
  theme_classic()

ggplot(data = elm, aes(x = DIA, y = HT)) +
  geom_point() +
  labs(title = "theme_void()") +
  theme_void()
You may also want to save your plot to disk to use in a report or to share on the web. The ggsave() function writes a plot to a file in common image formats such as jpeg, tiff, png, and pdf. By default, ggsave() saves the most recently displayed plot. You can also change the width and height of the figure with arguments to ggsave(). Here’s an example that saves a scatter plot of the elm data as a JPEG image named elm_scatter.jpg that is 3 inches in height and 5 inches in width:
ggplot(data = elm, aes(x = DIA, y = HT)) +
  geom_point()

ggsave("elm_scatter.jpg", height = 3, width = 5, units = "in")
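It can be safer to store the plot in an object and pass it to ggsave() through the plot argument, so that the saved file does not depend on which plot was displayed last. Here is a minimal sketch, again with a hypothetical data frame (elm_demo) in place of the elm data, that writes a PNG to a temporary folder. It assumes a PNG graphics device is available in your R installation.

```r
library(ggplot2)

# Hypothetical stand-in for the elm data (values made up for illustration)
elm_demo <- data.frame(DIA = c(5.2, 8.1, 10.4, 12.7),
                       HT = c(22, 31, 35, 38))

p_scatter <- ggplot(data = elm_demo, aes(x = DIA, y = HT)) +
  geom_point()

# Write to a temporary folder; replace with your own path in practice
out_file <- file.path(tempdir(), "elm_scatter.png")

# The image format is taken from the file extension
ggsave(filename = out_file, plot = p_scatter,
       height = 3, width = 5, units = "in", dpi = 300)

# Check that the file was written
file.exists(out_file)
```

The dpi argument controls the resolution of raster formats such as PNG and JPEG; 300 is a common choice for print, but the right value depends on where the figure will be used.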
1.4.3 Exercises
1.8 Make a series of box plots for the uptake variable from the CO2 data set. There should be one box plot for each origin of the plants in the data set, i.e., Quebec or Mississippi. Label the axes with the appropriate units and use the theme_bw() theme for your graph.
1.9 As you might have guessed, there is a “geom” for a bar plot and it’s called geom_bar()
. We discussed how the CROWN_CLASS_CD
variable is an ordinal variable representing how much sunlight a tree receives. Create a bar plot showing the crown class codes on the x-axis and the number of observations on the y-axis using ggplot()
.
1.10 Use ggsave()
to save the graph you made in exercise 1.9 as a png image to your computer. Set the height and width of the figure to 10 cm and 10 cm, respectively.
1.11 Use scale_x_continuous()
and scale_y_continuous()
to “zoom in” on the scatter plot of elm diameter and height. Use the limits =
statement to set the lower and upper bounds from 10 to 15 inches along the x-axis for diameter and 30 to 40 feet along the y-axis for height. Are there any trends in the data when looking at this region of the scatter plot?
1.5 Summary
Visualizing your data should be one of the first steps you take before doing any statistical analysis. Visualizing data allows you to better understand the data, spot any unusual trends or observations within your data, and find inspiration for deciding which kinds of statistical analyses might be appropriate for your data. It is important to be able to characterize the variables in your data, because whether data are categorical or quantitative determines which types of visualizations are appropriate. For categorical data, bar charts, pie graphs, and polar area diagrams are tools to help visualize patterns. For quantitative data, histograms, box plots, and scatter plots are examples that work well. The ggplot2 package is a core part of the tidyverse and provides a framework for visualizing data. There are a lot of options to supercharge your code for effective data visualization, but the components from this chapter will allow you to quickly visualize data so that we’re better prepared to run statistical analyses in the following chapters.
1.6 References
Broman, K.W., and K.H. Woo. 2018. Data organization in spreadsheets. American Statistician 72(1): 2–10.
Cleveland, W.S., and R. McGill. 1984. Graphical perception: theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association 79: 531–554.
Cleveland, W.S., and R. McGill. 1985. Graphical perception and graphical methods for analyzing scientific data. Science 229: 828–833.
Russell, M.B. 2020. Nine tips to improve your everyday forest data analysis. Journal of Forestry 118(6): 636–643.
Tukey, J.W. 1977. Exploratory data analysis. Addison-Wesley, Reading, MA. 712 p.
Weiskittel, A.R., N.L. Crookston, and P.J. Radtke. 2011. Linking climate, gross primary productivity, and site index across forests of the western United States. Canadian Journal of Forest Research 41: 1710–1721.
Wickham, H. 2010. A layered grammar of graphics. Journal of Computational and Graphical Statistics 19: 3–28.
Wickham, H. 2014. Tidy data. Journal of Statistical Software 59: 1–24. Available at: http://dx.doi.org/10.18637/jss.v059.i10
Wilkinson, L. 2005. The grammar of graphics (2nd ed.). Statistics and Computing. New York: Springer. 688 p.