Workshop Goal:

Answer the research question: What are the densities and distribution of redband parrotfish and white grunt off the coast of South Florida? Provide maps to help readers visualize the results.

Workshop Objectives:

  • Gain a working knowledge of data-cleaning steps using the tidyverse in R.
  • Perform computations and basic statistical analyses using various base functions and packages in R.
  • Produce publishable content using the ggplot package in R.

Install and Load Packages

We are going to do this in bulk by putting all of the package names into a character vector.

packages.load = c("tidyverse","ggplot2","ggmap","maps","maptools","raster","ggsn","cowplot")

#install.packages(packages.load) # run this line once if any packages are not yet installed

lapply(packages.load, require, character.only = TRUE)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE

Note: If you have quick questions about how to use the basic packages we will be using today, ggplot2 and the tidyverse, check out the cheat sheets:

ggplot2 cheat sheet

dplyr cheat sheet

… or Google it!

Import Data

Read in all of the data from GitHub using read.csv(), giving each object a name ending in ".dat" and using = (or <-) as the assignment operator. You can right-click and save the files to a local folder, or you can pull the data directly from the GitHub site using the code below.

# Set an object to point to the data folder
data.folder = "https://raw.githubusercontent.com/OfficialBweems/DirtyData-CleanMaps/main/Data/"

# Read in the different data sources using paste0().
site.dat=read.csv(paste0(data.folder,"Site_metadata.csv"))
species.dat=read.csv(paste0(data.folder,"Species.csv"))
transects.dat=read.csv(paste0(data.folder,"Transects.csv"))

Data Exploration

Make sure the data was read in correctly using the functions head(), colnames(), dim(), str(), summary(), and table().

head(species.dat) # shows the first 6 rows
##   Transect_id            Species TL_cm
## 1           1 redband parrotfish    18
## 2           1 redband parrotfish     7
## 3           1 redband parrotfish     9
## 4           1 redband parrotfish     4
## 5           2 redband parrotfish    14
## 6           2 redband parrotfish     4
colnames(species.dat)
## [1] "Transect_id" "Species"     "TL_cm"
dim(species.dat)  # shows the dimensions of your dataframe in rows and columns
## [1] 9927    3
table(species.dat$Species) # See how many of each species you have
## 
## redband parrotfish        white grunt 
##               2757               7170
table(!is.na(species.dat$Transect_id)) # Are there any NAs in the data?
## 
## TRUE 
## 9927
str(species.dat) # displays the internal structure of your object
## 'data.frame':    9927 obs. of  3 variables:
##  $ Transect_id: int  1 1 1 1 2 2 2 2 2 2 ...
##  $ Species    : chr  "redband parrotfish" "redband parrotfish" "redband parrotfish" "redband parrotfish" ...
##  $ TL_cm      : int  18 7 9 4 14 4 5 2 4 5 ...
summary(species.dat)
##   Transect_id      Species              TL_cm       
##  Min.   :    1   Length:9927        Min.   :  2.00  
##  1st Qu.: 9620   Class :character   1st Qu.:  7.00  
##  Median : 9903   Mode  :character   Median : 10.00  
##  Mean   : 9014                      Mean   : 11.38  
##  3rd Qu.:10036                      3rd Qu.: 15.00  
##  Max.   :10099                      Max.   :121.00

ProTip: Not sure what a function does? Type ?head into the console and see what pops up in the help menu. Scroll down to the bottom; there are often examples of how to use the function that you can practice with.

Answering the Question

Goal: Answer the question, does invasive species removal impact native species density and distribution?

What do we need to answer this question?

Density and the location of those densities.

Density = number of occurrences at a site/area of the site
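To make the formula concrete, here is the arithmetic on made-up numbers (not from the workshop data):

```r
# Hypothetical values for a single site
occurrences <- 25    # fish of one species counted at the site
site_area   <- 500   # area of the site in square meters

density <- occurrences / site_area
density  # 0.05 fish per square meter
```

The rest of this section is about assembling those two ingredients, occurrences and area, from the three data sheets.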

Problems:

  1. We don’t know the area of these sites.
  2. We don’t know how many occurrences of a species are at each site.
  3. The data is spread across three data sheets.

Simple Solutions:

  1. Do some pre-calculations
  2. Join the data
  3. Do some final manipulations/calculations.

Step 1

Calculate the area of each site from Site_metadata.csv. We will need to create a new column named 'Site_area'.

colnames(site.dat) # first messy data issue: the column names read in weird, so let's change them (i.e., rename columns and remove the spaces)
## [1] "Region"          "Location"        "Site_name"       "Treatment"      
## [5] "Lat"             "Long"            "Type"            "Site.Length..m."
## [9] "Site.Width..m."
# Here we use the tidyverse package 
# which allows us to do piping (%>%) and use the dplyr package
site.dat = site.dat%>%rename(Site_Length_m=Site.Length..m.,
                                 Site_Width_m =Site.Width..m.)

# Now we can do the area calculation
site.dat$Site_area = site.dat$Site_Length_m*site.dat$Site_Width_m

That was easy! Now for the weird part!

Step 2

Joining the data! Here are some steps to work through that typically can help you get started.

A. Determine the base file. The base file establishes the unit of analysis. Some Hints:

Normally, when the relationship is that of one to many [1:0-N], the base file is the one with the many entities.

If the relationship is that of one to one [1:0-1], the base file is normally the one with the least number of cases.
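A tiny base-R illustration of the 1:N case, using toy data frames (hypothetical names and values, not the workshop files):

```r
# One row per site on the "one" side, several observations per site on the "many" side
sites.toy <- data.frame(Site = c("A", "B"), Area = c(100, 200))
obs.toy   <- data.frame(Site = c("A", "A", "B"), Count = c(3, 5, 2))

# The many side (obs.toy) is the base file: a left join onto it keeps
# one row per observation and attaches the site-level Area to each.
joined.toy <- merge(obs.toy, sites.toy, by = "Site", all.x = TRUE)
nrow(joined.toy)  # 3, the same row count as the base file
```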

str(site.dat)
## 'data.frame':    42 obs. of  10 variables:
##  $ Region       : chr  "Florida" "Florida" "Florida" "Florida" ...
##  $ Location     : chr  "Biscayne" "Biscayne" "Biscayne" "Biscayne" ...
##  $ Site_name    : chr  "Biscayne1" "Biscayne2" "Biscayne3" "Biscayne4" ...
##  $ Treatment    : chr  "non-removal" "Removal" "non-removal" "Removal" ...
##  $ Lat          : num  25.4 25.4 25.4 25.4 25.3 ...
##  $ Long         : num  -80.2 -80.2 -80.2 -80.2 -80.2 ...
##  $ Type         : chr  "patch" "patch" "patch" "patch" ...
##  $ Site_Length_m: int  48 56 29 88 56 43 48 50 50 50 ...
##  $ Site_Width_m : int  34 39 26 70 35 42 35 40 40 40 ...
##  $ Site_area    : int  1632 2184 754 6160 1960 1806 1680 2000 2000 2000 ...
str(transects.dat)
## 'data.frame':    431 obs. of  3 variables:
##  $ Transect_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site_name  : chr  "KL19" "KL19" "Ten2" "Ten2" ...
##  $ Date       : chr  "10/23/2015" "10/23/2015" "12/11/2015" "12/11/2015" ...
str(species.dat)
## 'data.frame':    9927 obs. of  3 variables:
##  $ Transect_id: int  1 1 1 1 2 2 2 2 2 2 ...
##  $ Species    : chr  "redband parrotfish" "redband parrotfish" "redband parrotfish" "redband parrotfish" ...
##  $ TL_cm      : int  18 7 9 4 14 4 5 2 4 5 ...
# Species.dat is our base file (it is the one with the many entities)
B. Determine the common identifiers. Which csv file has a common identifier with the species csv file? Looking back at the str() output above, transects.dat shares the column "Transect_id" with species.dat, and it should be unique to each transect.

C. Determine how you wish to join your data. The options are inner, left, right, and full joins. See the data wrangling cheat sheet for details.

We are trying to add the information from transects.dat to the species data. Each observation of a species should then have a site name and a date sampled.

species_trans = left_join(species.dat, transects.dat, by = "Transect_id")

dim(species_trans) # There's a problem here! You have 10321 records instead of 9927
## [1] 10321     5
# You can also look at the Global Environment to check this

Do some exploring using n_distinct(), unique(), which(), table(), and %in%. Because the number of unique Transect_ids is smaller than the total number of rows in transects.dat, we can tell there are some duplicates in the data.

n_distinct(species.dat$Transect_id)
## [1] 413
n_distinct(transects.dat$Transect_id) #427
## [1] 427
dim(transects.dat) # 431, so we need to remove the duplicates
## [1] 431   3
# Now create a new transect.dat dataframe without any duplicates
trans.clean = unique(transects.dat)

# Rejoin!
species_trans.clean = left_join(species.dat,trans.clean)
## Joining, by = "Transect_id"
# Check the dimensions
dim(species_trans.clean) # Good to go!
## [1] 9927    5
# Additional exploration code.
# table(unique(transects.dat$Transect_id) %in% unique(species.dat$Transect_id))
# trans = unique(transects.dat$Transect_id)
# spec = unique(species.dat$Transect_id)
# used.id = spec[which(!trans%in%spec )]
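To see why duplicate keys inflate a join, here is a minimal reproduction with made-up data frames (using base R merge() instead of left_join(), but the behavior is the same):

```r
# Hypothetical data: the lookup table has a duplicated key (Transect_id 1)
obs.toy    <- data.frame(Transect_id = c(1, 2), Species = c("x", "y"))
lookup.toy <- data.frame(Transect_id = c(1, 1, 2), Site_name = c("S1", "S1", "S2"))

# The duplicated key matches twice, so the left join gains a row...
nrow(merge(obs.toy, lookup.toy, by = "Transect_id", all.x = TRUE))          # 3
# ...and dropping the duplicate rows first restores the expected count.
nrow(merge(obs.toy, unique(lookup.toy), by = "Transect_id", all.x = TRUE))  # 2
```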

Add site information to the data.

full_data = left_join(species_trans.clean,site.dat)
## Joining, by = "Site_name"
dim(full_data)
## [1] 9927   14
names(full_data) # all columns are here
##  [1] "Transect_id"   "Species"       "TL_cm"         "Site_name"    
##  [5] "Date"          "Region"        "Location"      "Treatment"    
##  [9] "Lat"           "Long"          "Type"          "Site_Length_m"
## [13] "Site_Width_m"  "Site_area"

Step 3

We want the number of occurrences of each species per site, which we can compute using dplyr.

data_sum = full_data %>% # data to manipulate
  group_by(Site_name, Species) %>% # which columns to group by; can be any columns
  tally() # count them
# Note: adding more columns to group_by() will increase your number of rows/the amount of detail you retain

colnames(data_sum)
## [1] "Site_name" "Species"   "n"
# Notice you have removed some of the other identifiers (anything at a finer level than site)

# Now apply this new count data back to the original site data using a left join
full_summary = left_join(data_sum, site.dat)
## Joining, by = "Site_name"
colnames(full_summary) # make sure everything is there (n, area, lat, lon, etc.)
##  [1] "Site_name"     "Species"       "n"             "Region"       
##  [5] "Location"      "Treatment"     "Lat"           "Long"         
##  [9] "Type"          "Site_Length_m" "Site_Width_m"  "Site_area"
# Note: Date and TL_cm are gone since you have aggregated the number of fish over multiple days

# Let's play with renaming more columns ... 
full_summary= full_summary%>%rename(Abundance = n)

# And finally calculate the density. Make a new column and calculate density.
full_summary$Density = full_summary$Abundance / full_summary$Site_area
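The same tally-then-join-then-divide pattern can be sanity-checked in base R on a toy data set (hypothetical site names, species, and areas):

```r
# One row per observed fish (made-up observations)
toy <- data.frame(Site_name = c("A", "A", "A", "B"),
                  Species   = c("grunt", "grunt", "parrotfish", "grunt"))

# Count occurrences per site and species: a base-R stand-in for group_by() + tally()
# (note table() also keeps zero-count combinations, which tally() drops)
counts <- as.data.frame(table(Site_name = toy$Site_name, Species = toy$Species))
names(counts)[names(counts) == "Freq"] <- "Abundance"

# Join on the site areas and compute density
areas  <- data.frame(Site_name = c("A", "B"), Site_area = c(100, 50))
counts <- merge(counts, areas, by = "Site_name")
counts$Density <- counts$Abundance / counts$Site_area

counts$Density[counts$Site_name == "A" & counts$Species == "grunt"]  # 0.02
```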

Basic Boxplot and ANOVA

Now that we have the data together, let's look at the impact of the removal Treatment on density using a box plot and an ANOVA.

# box plot in ggplot split by species and treatment
ggplot(full_summary, aes(x= Treatment, y=Density)) + 
  geom_boxplot()+facet_wrap(~Species)+ ylim(0,1)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

We missed something! We have another messy data issue. The Treatment "non-removal" was entered two different ways. We can use dplyr's recode() to fix this.

full_summary = full_summary %>%
  mutate(Treatment = recode(Treatment, "non-removal" = "Non-Removal",
                          "Non-removal" = "Non-Removal"))
# Try the boxplot again
ggplot(full_summary, aes(x= Treatment, y=Density)) + 
  geom_boxplot()+facet_wrap(~Species)+ ylim(0,1)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

Looks good! Now let's statistically explore the effect of Treatment on each species' density. An ANOVA is a quick way to identify statistical differences between groups. You can use aov() from base R for this. Use summary() to assess the influence of your factors.

Full.aov = aov(Density~Treatment*Species, data = full_summary)
summary(Full.aov)
##                   Df Sum Sq Mean Sq F value Pr(>F)   
## Treatment          1  0.644  0.6438   3.955 0.0503 . 
## Species            1  1.228  1.2285   7.547 0.0075 **
## Treatment:Species  1  0.789  0.7890   4.847 0.0307 * 
## Residuals         76 12.371  0.1628                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Species and the Treatment:Species interaction significantly affect density; Treatment alone is marginal (p = 0.0503)

Mapping

Now for the part you came here for, MAPPING! We are going to produce one map in three steps. The first step is to plot the species densities at the different study sites. The second step is to provide a general map of the study area. And finally we will inset one map in the other to make one nice big figure! We will use the ggplot2 package with some additional mapping packages, which we will point out as we go along.

# First let's pull the raster data for the United States from the internet using the raster package.
us <- getData("GADM",country="USA",level=1)
# 'GADM' is a database of global administrative boundaries
# you must also provide the level of administrative subdivision (0=country, 1=first level subdivision)

# We want to keep only our study area, which is in Florida. You can add states by adding them to this vector.
states <- c('Florida')
us.states <- us[us$NAME_1 %in% states,] # find the NAME_1 that matches your states

1 - Site Map

OK, now let's map our sites!

# the keys are your shape polygon for Florida and your data full_summary
sites_map = ggplot()+ 
  geom_polygon(data = us.states,aes(x=long,y=lat,group=group), colour = 'grey40', size = 0.01,fill = 'grey90')+
  geom_jitter(data = full_summary, aes(x=Long,y=Lat, color = Species, size=Density), height = 0.05)
## Regions defined for each Polygons
# Just like this, the map looks crazy; let's zoom in. It's messy data, so we also have some weird data points. Think about how you can filter those out. That's a hint! 
sites_map =sites_map+ coord_cartesian(xlim = c(-81.1,-80.0), ylim = c(24.65,25.70)) 

# Every good map has a compass and a scale bar, so let's add them. The functions north() and scalebar() are both from the package ggsn.
sites_map =sites_map+north(location = 'topright', scale = 0.9, symbol = 12, x.min = -80.09, x.max = -80.0, y.min = 25.6, y.max = 25.7)+
  scalebar(x.min = -80.80, x.max = -81.1, y.min = 24.66, y.max = 24.72, dist = 10, dist_unit = 'km', transform = TRUE, model = 'WGS84', location = "bottomleft", st.dist = 0.49, height = 0.18) #transform = TRUE assumes coordinates are in decimal degrees 

# the ocean isn't grey, let's change that
sites_map =sites_map+ theme_bw()+
  theme(panel.border = element_rect(colour = 'black'))

# We also may want to play with where this legend is later 
sites_map = sites_map +  theme(legend.position = c(0.85, 0.20))


sites_map

Great! The next one should be easy now.

2 - Study Area Map

Now we are making the larger map, to provide some context for the site map.

# Start the map off similarly to the site map, but instead of our data we will plot the whole US shape file and highlight Florida on top
continent = ggplot()+
  geom_polygon(data = us,aes(x=long,y=lat,group=group), colour = 'grey40', 
               size = 0.01, fill = 'grey90')+
  geom_polygon(data = us.states,aes(x=long,y=lat,group=group), colour = 'midnightblue', size = 0.01,fill = 'grey70')
## Regions defined for each Polygons
## Regions defined for each Polygons
# now reduce the viewing area and delimit where the research took place
continent = continent+coord_cartesian(xlim = c(-93.9,-75.85), ylim = c(24.1,32.5)) 
 
# Add in the scale bar again
continent = continent+scalebar(x.min = -93, x.max = -85, y.min = 24.5, y.max = 25.5, dist = 250, dist_unit = 'km', st.size = 3.6, #add scalebar
                    transform = TRUE, model = 'WGS84', location = 'bottomleft', st.dist = 0.42, height = 0.18) #transform = TRUE assumes coordinates are in decimal degrees

# New and specific to this study area map - add a shaded study area box
continent = continent+ annotate("rect", xmin = -79, xmax = -82, ymin = 24.5, ymax = 25.5, alpha = .7) 

# Add an arrow to point to the study region  
continent = continent + annotate('segment',x=-87, y=27.28, xend=-82.2, yend=25.1, arrow=arrow(length = unit(0.04, "npc")), #arrow
           alpha = 0.8, size=1.1, color="black")

# add some notation of where you are
continent = continent+ annotate('text', x = -91, y = 25.8, label = 'Gulf of Mexico', size = 4)+ 
  annotate('text', x = -78, y = 31.2, label = 'Atlantic Ocean', size = 4)+
  annotate('text', x = -87, y = 27.8, label = 'Study Area', size = 5)

# Change the background of things
continent = continent+ theme_bw(base_size = 9) + # black and white theme, gets rid of grey background
  theme(panel.background = element_rect(fill= 'white',color = 'white')) +
  theme(panel.border = element_rect(colour = 'black'))

continent

## You can add these arguments to your plot to clean up the edges of the map for final print
continent + 
  theme(axis.ticks = element_blank())+
  theme(axis.text.x = element_blank()) + 
  theme(axis.text.y = element_blank()) + 
  theme(axis.title.x = element_blank()) +
  theme(axis.title.y = element_blank())

3 - Inset the Maps

# this just draws the maps on top of one another
insetmap = ggdraw()+
  draw_plot(sites_map) + 
  draw_plot(continent) 

# doesn't look right, right? Now you have to work with the dimensions of the 2nd map. See below: x, y, width, and height for the continent/study area map

insetmap = ggdraw()+
  draw_plot(sites_map) + 
  draw_plot(continent, x=0.102, y=0.62, width=0.38, height=0.35) 

# Lets see what it looks like 
insetmap # You can look at it, but it won't look good until you print it

# Use the function ggsave() to output your map. Make sure you know where it is being saved.
ggsave('study_map.png', plot = insetmap,
       width = 8, height = 7.5,
       dpi = 300)

That’s all folks!

For more practice, attempt to map the answer to the question of how reef type influences fish density.

Final ProTips

  1. Comment your code! (Use the hash sign - #)
  2. R is case-sensitive. For example, "head()" is not the same as "Head()."
  3. Be smart when naming files, folders, etc.; you will be looking for them later. Adding the date or sequential numbers to a folder name is helpful.
  4. Debugging is the hardest part. Copy the error you get and Google it. The Stack Overflow community will be your best friend.
  5. There are a million ways to skin a cat! What we showed you here is only one way of using R to do the things we want to do. Find what works for you.