The primary goal of this book is to help you learn and apply common statistical methods used in natural resources with the R programming language. The book covers both applied and theoretical techniques commonly used in the natural resources disciplines. A key component is learning how to make inferences from diverse data sources using R.
To manage our environment sustainably, professionals must understand the quality and quantity of our natural resources. Statistical analysis provides information that supports management decisions and is universally used across scientific disciplines. This book focuses on the application of statistical analyses in the environmental, agricultural, and natural resources disciplines.
Chapters 1 through 10 form the basis of most semester-long introductory statistics classes at the undergraduate level. Chapters 11 through 15 could also be included in an introductory statistics class for graduate students that moves at an accelerated pace.
If you dedicate considerable time to this book, you will do the following:
- Develop analytical and visualization skills for investigating the behavior of agricultural and natural resources data.
- Become competent in importing, analyzing, and visualizing complex data sets in the R environment.
- Recode, combine, and restructure data sets for statistical analysis and visualization.
- Appreciate probability concepts as they apply to environmental problems.
- Understand common distributions used in statistical applications and inference.
- Summarize data effectively and efficiently for reporting purposes.
- Learn the tasks required to perform a variety of statistical hypothesis tests and interpret their results.
- Understand which modeling frameworks are appropriate for your data and how to interpret predictions.
This book does not cover how natural resources data are collected or sampled. For the purposes of learning, we will mostly work with “tidy” data sets that have been collected by others. Many texts on environmental sampling and experimental design cover these topics in more depth. You would do well to make a course in one of these subjects a part of your quantitative expertise.
The discipline of data science has emerged in the last decade with a number of associated quantitative methods, such as machine learning, support vector machines, and unsupervised learning. Data science is an interdisciplinary field that often uses statistics in combination with computer science and a deep knowledge within a domain of interest. This book will focus instead on the inference one can make by performing statistical tests. As Blei and Smyth (2017) describe, the field of data science differs from statistics in many ways, but also has some similarities.
You will get the most out of this book if you have already completed an undergraduate course in statistics or have comparable quantitative skills. If you can explain the difference between a mean and median, understand the basic concepts behind linear regression, and can describe what a two-sample t-test seeks to accomplish, you will be well prepared for the concepts in this book. There are a large number of textbooks, blogs, and online resources that cover these topics in depth. This book will also cover some of those topics, but will present them in more depth.
To a lesser degree, experience in programming is a plus but not required. A member of my PhD committee once commented to me that, in the late 20th century, many universities counted programming courses toward a foreign language requirement for students. While that practice may sell the world’s spoken languages short, the observation reflects the value of programming skills in today’s workforce. As it relates to R, books such as Hands-On Programming with R by Garrett Grolemund (2014) and R for Data Science by Hadley Wickham and Grolemund (2017) are excellent places to start.
Historically, many students learned the concepts of statistics in a lecture format by doing calculations on paper with small data sets. Often, in a separate lab component of a course, students learned how to use software to apply the theoretical concepts learned through lectures. So, to excel in a statistics class, students were required to learn both the theory and application of statistical concepts. Today, software such as R (and more specifically the tidyverse suite of packages) provides a more seamless integration of both the theory and application of statistical concepts.
To get started, you will need to do three things:
Download R. R is one of the most popular programs for learning statistics. In short, R is a programming language and software environment for statistical computing and graphics. Download R at the Comprehensive R Archive Network (CRAN) and choose your operating system.
Download RStudio. RStudio is an integrated development environment that provides an interface to R. In a car analogy, R would be the engine that moves the car and RStudio would be the dashboard, where the driver controls the wheel and sees how the car is performing. In short, RStudio allows us to work efficiently and effectively with data using the R system. Download RStudio Desktop at the RStudio products webpage. If you are hesitant to jump right into the book or have issues with installing programs, consider using RStudio Cloud, a web-based interface to RStudio. A free version allows you to begin 15 projects with up to 15 usage hours per month.
Install packages. In R, a package is a collection of functions and data sets written by R users. You can perform many analyses using base R, or what may be considered “off the shelf” R. There are tens of thousands of packages archived by CRAN. For perspective, 31 of these packages were developed by forestry professionals for specific applications within that discipline (Russell 2020).
The tidyverse package is a “megapackage” that includes several packages that import, reshape, and visualize data in a consistent manner, among other tasks. Install the tidyverse package with the following line of code:
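A minimal sketch of that installation step, assuming an internet connection and access to a CRAN mirror:

```r
# Install the tidyverse from CRAN (needed only once per R installation)
install.packages("tidyverse")
```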
After installing the package, you will need to load it using the `library()` command to use its functions:
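Assuming the tidyverse has already been installed, loading it is typically a single call at the start of each R session:

```r
# Load the tidyverse so its functions are available in the current session
library(tidyverse)
```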
Interestingly, few statistical functions are available in the tidyverse package. So why spend the time learning about it? The tidyverse has emerged as an excellent foundation to learn about statistics through its philosophy of organizing and manipulating data (Wickham et al. 2019). We will use the tidyverse heavily to import, wrangle, and visualize data.
All of the data sets used in this book can be accessed by installing stats4nr, an R package developed for this book. First, you’ll need to install the devtools package:
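As a sketch, again assuming access to a CRAN mirror, devtools can be installed like any other CRAN package:

```r
# Install devtools from CRAN; it provides tools for installing packages from GitHub
install.packages("devtools")
```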
You can then install the package that contains the data sets for this book by installing it through GitHub:
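A sketch of the GitHub installation, assuming the repository path `mbrussell/stats4nr` taken from the GitHub address given in this book (https://github.com/mbrussell/stats4nr):

```r
# Install stats4nr from its GitHub repository
# (repository path assumed from the URL listed in this book)
devtools::install_github("mbrussell/stats4nr")
```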
You can load the package using `library()` and load a data set by typing the name of it. For example, ant is the name of one of the data sets containing information on ant species richness in bogs and forests in New England, USA. We can use the `head()` function in R to print the first six rows in the data:
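Assuming stats4nr has been installed, the two steps just described would look something like this:

```r
# Load the book's data package, then print the first six rows of the ant data
library(stats4nr)
head(ant)
```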
    ##   site ecotype spprich   lat elev
    ## 1  TPB  Forest       6 41.97  389
    ## 2  HBC  Forest      16 42.00    8
    ## 3  CKB  Forest      18 42.03  152
    ## 4  SKP  Forest      17 42.05    1
    ## 5   CB  Forest       9 42.05  210
    ## 6   RP  Forest      15 42.17   78
A brief description and sources of all data sets in this book can be found on GitHub: https://github.com/mbrussell/stats4nr
There are many formats to store data in R, such as vectors, matrices, or lists. Nearly all of the data sets used in this book will be stored as data frames, or what the tidyverse terms “tibbles.” The data frame is analogous to how you would view data organized in a spreadsheet: each column contains the values of one variable and each row contains the values of every variable for one observation.
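To make the spreadsheet analogy concrete, here is a small sketch that rebuilds the six rows of the ant data printed earlier as a base R data frame. The object names `ant_six` and `ant_tbl` are hypothetical and used only for illustration; the conversion to a tibble assumes the tibble package (installed with the tidyverse) is available.

```r
# Rebuild the first six rows of the ant data by hand as a base R data frame
ant_six <- data.frame(
  site    = c("TPB", "HBC", "CKB", "SKP", "CB", "RP"),
  ecotype = rep("Forest", 6),
  spprich = c(6, 16, 18, 17, 9, 15),
  lat     = c(41.97, 42.00, 42.03, 42.05, 42.05, 42.17),
  elev    = c(389, 8, 152, 1, 210, 78)
)

# Each column is a variable; each row is one observation (here, one site)
dim(ant_six)        # number of rows and number of columns
ant_six$spprich     # one column, returned as a vector
ant_six[2, ]        # one row: all variables for the second site

# If the tibble package is installed, the same data can be stored as a tibble
if (requireNamespace("tibble", quietly = TRUE)) {
  ant_tbl <- tibble::as_tibble(ant_six)
  print(ant_tbl)
}
```

A tibble prints the type of each column beneath its name, which can help when checking that data were imported as expected.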
At appropriate times in the book we will install and use other data sets and packages to perform specific statistical tasks.
The collection of statistical functions available in base R is extensive, and we will rely on it heavily. An excellent introduction to many of these base R functions can be found in Dalgaard (2008). We will also use the tidyverse suite of functions to help recode and structure data into a format that can be analyzed. The Wickham and Grolemund text is an excellent source for learning these tidyverse functions. In our application of these methods, it will not be uncommon to use a base R function within a tidyverse function to perform a statistical analysis.
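As one illustration of a base R function used inside a tidyverse function, the sketch below calls `mean()` and `sd()` within `summarize()` from the dplyr package (part of the tidyverse). It assumes dplyr is installed; the object names `ant_six` and `rich_summary` are hypothetical, and the values are the six rows of the ant data printed earlier.

```r
library(dplyr)

# Six rows of the ant data, entered by hand for illustration
ant_six <- data.frame(
  site    = c("TPB", "HBC", "CKB", "SKP", "CB", "RP"),
  ecotype = rep("Forest", 6),
  spprich = c(6, 16, 18, 17, 9, 15)
)

# group_by() and summarize() are tidyverse (dplyr) functions;
# mean() and sd() are the standard functions that ship with R
rich_summary <- ant_six %>%
  group_by(ecotype) %>%
  summarize(
    n_sites      = n(),
    mean_spprich = mean(spprich),
    sd_spprich   = sd(spprich)
  )

rich_summary
```

The tidyverse functions handle the grouping and reshaping, while the statistical calculation itself is done by the functions that come with R.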
Understanding how certain text and fonts appear in the book can aid you in understanding the concepts:
- Words that are in bold indicate new concepts, names of R packages, or names of data sets.
- Words that appear in `constant width` indicate elements of a data set, such as a variable name.
- Words that appear in `constant width` followed by parentheses `()` indicate a function that performs an operation.
This version of the book was built with R version 4.1.0 (2021-05-18).
I am indebted to the countless colleagues, professors, and students who have been with me along my journey in learning statistics. This includes faculty, staff, and students from the University of Minnesota, University of Maine, and Virginia Tech. I am also thankful to the community of officers and members of the Society of American Foresters A1-Inventory and Biometrics Working Group.
I am particularly indebted to the graduate students of the University of Minnesota’s Natural Resources Science and Management program for their desire for a graduate-level statistics class offered through the program. Were it not for the time I spent with graduate students during my faculty interview in January of 2014, at a lunch-hour pizza session with Q&A, I may not have known the true importance of offering quantitative classes from within the discipline. Thank you, students.
Finally, a special appreciation to my wife who lent her support and encouragement throughout the pandemic while I toiled on this project. Thanks, Annie.
Blei, D.M., and P. Smyth. 2017. Science and data science. Proceedings of the National Academy of Sciences 114(33): 8689–8692.
Grolemund, G. 2014. Hands-on programming with R: write your own functions and simulations. O’Reilly Media. 230 p. Available at: https://rstudio-education.github.io/hopr/index.html
Wickham, H., and G. Grolemund. 2017. R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media. 520 p. Available at: https://r4ds.had.co.nz/
Russell, M.B. 2020. Nine tips to improve your everyday forest data analysis. Journal of Forestry 118: 636–643.
Wickham, H., et al. 2019. Welcome to the tidyverse. Journal of Open Source Software 4(43): 1686. Available at: https://doi.org/10.21105/joss.01686