Checking fish data exported from EventMeasure

Introduction

This script takes fish annotation data (either exported from EventMeasure or in “Generic” format, see format requirements here), checks the annotations for errors, and then formats the data into a tidy format.

You also need a formatted sample metadata file “*_Metadata.csv” (see format requirements here).

R set up

First you will need to load the necessary libraries. If you haven’t installed CheckEM before, install it using the install_github function from the devtools package.

library(devtools)
# devtools::install_github("GlobalArchiveManual/CheckEM") # Use this to install the CheckEM package if you have not already done so
library(CheckEM)
library(tidyverse)
library(googlesheets4)
library(sf)
library(terra)
library(here)
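
If you run this script on several machines, a guard like the one below avoids reinstalling CheckEM every time. This is a minimal sketch rather than part of the CheckEM workflow: the helper name `is_missing` is hypothetical, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper (not part of CheckEM): TRUE when a package is not installed.
is_missing <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

if (is_missing("CheckEM")) {
  message("CheckEM is not installed; uncomment the next line to install it.")
  # devtools::install_github("GlobalArchiveManual/CheckEM")
}
```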

Next, set the study name. This can be any name you like; all files saved by this script will be prefixed with it. Avoid names that are too long. We recommend a short project name that includes the method, e.g. “2020_ningaloo_stereo-BRUVs”.

name <- "example-bruv-workflow"

Metadata

Now we load and tidy the metadata. If you have already completed this step while checking your habitat data, then you can skip these next two chunks of code and simply read in the metadata (see below).

metadata <- read_metadata(here::here("r-workflows/data/raw/"), method = "BRUVs") %>% # Change the method to "DOVs" if required
  dplyr::select(campaignid, sample, longitude_dd, latitude_dd, date_time, location, site, depth_m, successful_count, successful_length, successful_habitat_forward, successful_habitat_backward) %>%
  glimpse()

## reading metdata file: /home/runner/work/CheckEM/CheckEM/r-workflows/data/raw//2022-05_PtCloates_stereo-BRUVS_metadata.csv

## reading metdata file: /home/runner/work/CheckEM/CheckEM/r-workflows/data/raw//2023-03_SwC_stereo-BRUVs_Metadata.csv

## Rows: 94
## Columns: 12
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9…
## $ longitude_dd                <chr> "113.5447", "113.5628", "113.5515", "113.5…
## $ latitude_dd                 <chr> "-22.7221", "-22.6957", "-22.7379", "-22.7…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "77.3", "78.3", "73.9", "81.9", "7…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Save the metadata as an R data file. This creates a lighter file than saving as a .csv or similar, and it also maintains any column formatting.

saveRDS(metadata, file = here::here(paste0("r-workflows/data/tidy/",
                                name, "_Metadata.rds")))
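
To illustrate the point about column formatting, the sketch below round-trips a small made-up data frame through both formats using temporary files. The `demo` object and its values are illustrative only and are not part of the workflow data.

```r
# Made-up example data: sample IDs stored as text, plus a date-time column.
demo <- data.frame(sample = c("01", "02"),
                   depth_m = c(93.9, 77.3),
                   stringsAsFactors = FALSE)
demo$date_time <- as.POSIXct(c("2022-05-22 10:03:24", "2022-05-22 11:15:00"),
                             tz = "UTC")

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(demo, rds_file)
write.csv(demo, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

# Column classes survive the .rds round trip...
sapply(from_rds, function(x) class(x)[1])
# ...but read.csv() re-guesses them (the text sample IDs become integers here).
sapply(from_csv, function(x) class(x)[1])
```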

If you have already exported the metadata in the ‘check-habitat’ script, then load it here. If you have just loaded and saved the metadata, then you can skip this chunk of code.

metadata <- readRDS(here::here(paste0("r-workflows/data/tidy/",
                                name, "_Metadata.rds")))

Marine Parks

Load marine park shape files, and extract fishing status (e.g. ‘Fished’ or ‘No-take’) for use in modelling. The data set used here is the 2022 Collaborative Australian Protected Areas Database, which you can download for free here.

You may change this shape file to any suitable data set that is available for your study area.

marine_parks <- st_read(here::here("r-workflows/data/spatial/shapefiles/Collaborative_Australian_Protected_Areas_Database_(CAPAD)_2022_-_Marine.shp"))  %>%
  dplyr::select(geometry, ZONE_TYPE) %>%
  st_transform(4326) %>%
  st_make_valid()

## Reading layer `Collaborative_Australian_Protected_Areas_Database_(CAPAD)_2022_-_Marine' from data source `/home/runner/work/CheckEM/CheckEM/r-workflows/data/spatial/shapefiles/Collaborative_Australian_Protected_Areas_Database_(CAPAD)_2022_-_Marine.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3775 features and 26 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 70.71702 ymin: -58.44947 xmax: 170.3667 ymax: -8.473407
## Geodetic CRS:  WGS 84

metadata_sf <- st_as_sf(metadata, coords = c("longitude_dd", "latitude_dd"), crs = 4326)

metadata <- metadata_sf %>%
  st_intersection(marine_parks) %>%
  bind_cols(st_coordinates(.)) %>%
  as.data.frame() %>%
  dplyr::select(-c(geometry)) %>%
  dplyr::rename(longitude_dd = X, latitude_dd = Y) %>%
  dplyr::mutate(status = if_else(str_detect(ZONE_TYPE, "National|Sanctuary"),
                                "No-take", "Fished")) %>%
  clean_names() %>%
  glimpse()

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

## Rows: 94
## Columns: 14
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "77.3", "78.3", "73.9", "81.9", "7…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ longitude_dd                <dbl> 113.5447, 113.5628, 113.5515, 113.5555, 11…
## $ latitude_dd                 <dbl> -22.7221, -22.6957, -22.7379, -22.7337, -2…
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
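
The status rule used above can be tried on its own before applying it to real data. The sketch below uses made-up zone names; the values in your own ZONE_TYPE column may differ, so the "National|Sanctuary" pattern should be treated as an assumption to check against your shapefile.

```r
library(dplyr)
library(stringr)

# Made-up zone types; real values come from the ZONE_TYPE column of your shapefile.
zones <- tibble(zone_type = c("National Park Zone (IUCN II)",
                              "Sanctuary Zone (IUCN II)",
                              "Multiple Use Zone (IUCN VI)",
                              "Habitat Protection Zone (IUCN IV)"))

# Any zone whose name contains "National" or "Sanctuary" is treated as no-take.
classified <- zones %>%
  mutate(status = if_else(str_detect(zone_type, "National|Sanctuary"),
                          "No-take", "Fished"))

classified
```

Because st_intersection() keeps only the points that fall inside a polygon, it is also worth confirming that the number of rows in the metadata is unchanged after this step (94 before and after in the example above).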

Find nearest Marine Region

Now we need to find the nearest marine region for each sample in the metadata. Then we can use the life history lists to find species that have not been observed in that marine region before.

metadata_sf <- st_as_sf(metadata, coords = c("longitude_dd", "latitude_dd"), crs = 4326)
regions <- st_as_sf(CheckEM::aus_regions, crs = st_crs(4326))

## Loading required package: sp

regions <- st_transform(regions, 4326) %>%
  dplyr::select(REGION)

metadata <- st_join(metadata_sf, regions, join = st_nearest_feature) %>%
  dplyr::rename(marine_region = REGION) %>%
  dplyr::mutate(sample = as.character(sample)) %>%
  as.data.frame() %>%
  dplyr::select(-c(geometry)) %>%
  glimpse()

## Rows: 94
## Columns: 13
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "77.3", "78.3", "73.9", "81.9", "7…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "North-west", "North-west", "North-west", …
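
The nearest-feature join can be seen on a toy example. The sketch below is illustrative only: the two square "regions" and the two sample points are made up, and the helper `square` is hypothetical. Neither point falls inside a polygon, yet each is still assigned the closest region.

```r
library(sf)
library(dplyr)

# Hypothetical helper: a 1 x 1 degree square with its lower-left corner at (xmin, 0).
square <- function(xmin) {
  st_polygon(list(rbind(c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1),
                        c(xmin, 1), c(xmin, 0))))
}

toy_regions <- st_sf(REGION = c("West", "East"),
                     geometry = st_sfc(square(0), square(3), crs = 4326))

toy_samples <- st_as_sf(data.frame(sample = c("a", "b"),
                                   lon = c(1.2, 2.9), lat = c(0.5, 0.5)),
                        coords = c("lon", "lat"), crs = 4326)

# Each sample receives the attributes of the nearest region.
joined <- st_join(toy_samples, toy_regions, join = st_nearest_feature) %>%
  st_drop_geometry() %>%
  rename(marine_region = REGION)

joined
```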

Fish annotation data

There are two types of fish annotation data formats that you can check and format using this script: EventMeasure database outputs or Generic files.

For EventMeasure database outputs you will need to export the database outputs using the EventMeasure software and the .EMObs files. This will give you the ‘_Points.txt’, ‘_Lengths.txt’ and ‘_3DPoints.txt’ files.
Generic data is a much simpler format and allows users who haven’t used EventMeasure to format and QC their annotation data. You will need a _Count.csv file and a _Length.csv file.

For more information on the format of these files please see the CheckEM user guide.

We recommend using the EventMeasure database outputs if they are available and up to date. There are more checks possible with EventMeasure data than Generic data. Note: If you have used EventMeasure software to annotate your samples BUT have made corrections on the exported data (e.g. in Excel), this corrected data is now the true copy of the data and you should import your data as Generic annotation files (e.g. count and length).

Load any EventMeasure Points.txt files

This section of code will read in any Points.txt files that you have saved in the directory you set. It will combine all of the files into one data frame, and get the campaignid name from the name of the file. It is important that you consistently name your files with the same campaignid (look out for different separators, e.g. ‘-’, ‘_’, or ‘.’).

points <- read_points(here::here("r-workflows/data/raw/")) %>%
  glimpse()

## Rows: 17,425
## Columns: 22
## $ sample      <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "10"…
## $ pointindex  <chr> "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1…
## $ filename    <chr> "Left_MAH00355.MP4", "Left_MAH00355.MP4", "Left_MAH00355.M…
## $ frame       <chr> "11347", "11403", "11403", "11473", "11473", "11473", "127…
## $ time        <chr> "25.71152", "25.72709", "25.72709", "25.74655", "25.74655"…
## $ period      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ periodtime  <chr> "0.33422", "0.34979", "0.34979", "0.36926", "0.36926", "0.…
## $ imagecol    <chr> "463.20000", "474.00000", "1495.20000", "1562.40000", "157…
## $ imagerow    <chr> "669.60000", "655.20000", "541.20000", "543.60000", "571.2…
## $ rectwidth   <chr> "0.00000", "0.00000", "0.00000", "0.00000", "0.00000", "0.…
## $ rectheight  <chr> "0.00000", "0.00000", "0.00000", "0.00000", "0.00000", "0.…
## $ family      <chr> "Labridae", "Labridae", "Labridae", "Labridae", "Labridae"…
## $ genus       <chr> "Ophthalmolepis", "Ophthalmolepis", "Ophthalmolepis", "Pse…
## $ species     <chr> "lineolata", "lineolata", "lineolata", "biserialis", "bise…
## $ code        <chr> "37384040", "37384040", "37384040", "37384149", "37384149"…
## $ number      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ stage       <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD"…
## $ activity    <chr> "Passing", "Passing", "Passing", "Passing", "Passing", "Pa…
## $ comment     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ attribute9  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ attribute10 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ campaignid  <chr> "2023-03_SwC_stereo-BRUVs", "2023-03_SwC_stereo-BRUVs", "2…
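
The sketch below shows why consistent file naming matters. It is not how read_points() is implemented; the file names and the suffix pattern are assumptions made for illustration. A single changed separator produces a campaignid that no longer matches the metadata.

```r
# Hypothetical file names; note the "." instead of "-" in the second one.
files <- c("2023-03_SwC_stereo-BRUVs_Points.txt",
           "2023-03_SwC_stereo.BRUVs_Lengths.txt")

# Strip the trailing "_Points.txt" / "_Lengths.txt" / "_3DPoints.txt"
# to leave the campaignid (assumed naming convention).
campaignid <- sub("_(Points|Lengths|3DPoints)\\.txt$", "", files)
campaignid

# Campaign names that do not match the metadata exactly will not join later on.
metadata_campaigns <- c("2023-03_SwC_stereo-BRUVs")
mismatched <- setdiff(campaignid, metadata_campaigns)
mismatched
```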

Load any Generic count files

This section of code will read in any Count.csv files that you have saved in the directory you set. It will combine all of the files into one data frame, and get the campaignid name from the name of the file.

It is important that you consistently name your files with the same campaignid (look out for different separators, e.g. ‘-’, ‘_’, or ‘.’).

counts <- read_counts(here::here("r-workflows/data/raw/")) %>%
  glimpse()

## reading count file: /home/runner/work/CheckEM/CheckEM/r-workflows/data/raw//2022-05_PtCloates_stereo-BRUVS_count.csv

## Rows: 802
## Columns: 6
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ opcode     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2",…
## $ family     <chr> "Balistidae", "Carangidae", "Carangidae", "Lethrinidae", "L…
## $ genus      <chr> "Abalistes", "Decapterus", "Turrum", "Gymnocranius", "Lethr…
## $ species    <chr> "filamentosus", "spp", "fulvoguttatum", "sp1", "rubriopercu…
## $ count      <chr> "3", "4", "18", "2", "1", "9", "3", "2", "20", "1", "1", "4…
## Rows: 802
## Columns: 7
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ opcode     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2",…
## $ family     <chr> "Balistidae", "Carangidae", "Carangidae", "Lethrinidae", "L…
## $ genus      <chr> "Abalistes", "Decapterus", "Turrum", "Gymnocranius", "Lethr…
## $ species    <chr> "filamentosus", "spp", "fulvoguttatum", "sp1", "rubriopercu…
## $ count      <chr> "3", "4", "18", "2", "1", "9", "3", "2", "20", "1", "1", "4…
## $ sample     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2",…

Combine EventMeasure and Generic count data

Create and tidy the MaxN file. MaxN is the maximum number of individuals of a species visible in a single frame of a sample.

# Only run this if there is data in the points data frame
if(nrow(points) > 1){
  maxn_points <- points %>% 
    dplyr::group_by(campaignid, sample, filename, periodtime, frame, family, genus, species) %>% # If you have MaxN'd by stage (e.g. Adult, Juvenile) add stage here
    dplyr::mutate(number = as.numeric(number)) %>%
    dplyr::summarise(maxn = sum(number)) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(campaignid, sample, family, genus, species) %>%
    dplyr::slice(which.max(maxn)) %>%
    dplyr::ungroup() %>%
    dplyr::filter(!is.na(maxn)) %>%
    dplyr::select(-frame) %>%
    tidyr::replace_na(list(maxn = 0)) %>%
    dplyr::mutate(maxn = as.numeric(maxn)) %>%
    dplyr::filter(maxn > 0) %>%
    dplyr::inner_join(metadata, by = join_by(campaignid, sample)) %>%
    dplyr::filter(successful_count %in% c("Yes")) %>% 
    dplyr::filter(maxn > 0) %>%
    dplyr::select(campaignid, sample, family, genus, species, maxn) %>%
    dplyr::glimpse()
}

## `summarise()` has grouped output by 'campaignid', 'sample', 'filename',
## 'periodtime', 'frame', 'family', 'genus'. You can override using the `.groups`
## argument.

## Rows: 462
## Columns: 6
## $ campaignid <chr> "2023-03_SwC_stereo-BRUVs", "2023-03_SwC_stereo-BRUVs", "20…
## $ sample     <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "10",…
## $ family     <chr> "Dasyatidae", "Gerreidae", "Heterodontidae", "Labridae", "L…
## $ genus      <chr> "Dasyatis", "Parequula", "Heterodontus", "Austrolabrus", "C…
## $ species    <chr> "brevicaudata", "melbournensis", "portusjacksoni", "maculat…
## $ maxn       <dbl> 1, 1, 1, 1, 71, 9, 8, 1, 5, 2, 9, 1, 1, 2, 2, 1, 5, 12, 3, …
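
The two grouping steps in the points pipeline above can be followed on a small made-up example: first sum the individuals annotated in each frame, then keep the frame with the largest total for each sample and species. The toy data and object names are illustrative only.

```r
library(dplyr)

# Made-up annotations: one species seen in three frames of one sample.
toy_points <- tibble(
  sample  = "10",
  frame   = c(100, 100, 250, 400),
  species = "lineolata",
  number  = c(2, 1, 5, 1)
)

toy_maxn <- toy_points %>%
  group_by(sample, species, frame) %>%
  summarise(n_in_frame = sum(number), .groups = "drop") %>%  # total per frame
  group_by(sample, species) %>%
  slice(which.max(n_in_frame)) %>%                            # busiest frame
  ungroup() %>%
  select(sample, species, maxn = n_in_frame)

toy_maxn
```

Here the frame totals are 3, 5 and 1, so the MaxN for this species in this sample is 5.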

# Only run this if there is data in the counts data frame
if(nrow(counts) > 1){
  maxn_counts <- counts %>%
    dplyr::group_by(campaignid, sample, family, genus, species) %>%
    dplyr::mutate(number = as.numeric(count)) %>%
    dplyr::summarise(maxn = sum(number)) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(campaignid, sample, family, genus, species) %>%
    dplyr::slice(which.max(maxn)) %>%
    dplyr::ungroup() %>%
    dplyr::filter(!is.na(maxn)) %>%
    tidyr::replace_na(list(maxn = 0)) %>%
    dplyr::mutate(maxn = as.numeric(maxn)) %>%
    dplyr::filter(maxn > 0) %>%
    dplyr::inner_join(metadata, by = join_by(campaignid, sample)) %>%
    dplyr::filter(successful_count %in% c("Yes")) %>%
    dplyr::filter(maxn > 0) %>%
    dplyr::select(campaignid, sample, family, genus, species, maxn) %>%
    dplyr::glimpse()
}

## `summarise()` has grouped output by 'campaignid', 'sample', 'family', 'genus'.
## You can override using the `.groups` argument.

## Rows: 802
## Columns: 6
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ sample     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "11", "11", "11", "…
## $ family     <chr> "Balistidae", "Carangidae", "Carangidae", "Lethrinidae", "L…
## $ genus      <chr> "Abalistes", "Decapterus", "Turrum", "Gymnocranius", "Lethr…
## $ species    <chr> "filamentosus", "spp", "fulvoguttatum", "sp1", "rubriopercu…
## $ maxn       <dbl> 3, 4, 18, 2, 1, 9, 3, 2, 2, 1, 6, 3, 1, 3, 1, 2, 1, 1, 1, 1…

# If only EventMeasure data then MaxN only includes Points data
# If only Generic data then MaxN only includes Count data
# If both exist, then MaxN includes both Points and Count data
maxn <- bind_rows(get0("maxn_points"), get0("maxn_counts")) # this works even if you only have one type of data
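
The reason this line works with only one data type is that get0() returns NULL for an object that does not exist, and bind_rows() skips NULL inputs. A minimal sketch with made-up object names:

```r
library(dplyr)

# Only the points-based table exists in this made-up session.
demo_maxn_points <- tibble(campaignid = "campaign-a", sample = "1", maxn = 3)

# get0() returns NULL for a missing object instead of throwing an error...
is.null(get0("demo_maxn_counts"))

# ...and bind_rows() ignores NULL, so the result is just the existing table.
demo_maxn <- bind_rows(get0("demo_maxn_points"), get0("demo_maxn_counts"))
demo_maxn
```

One caveat: get0() will pick up any object with that name in the session, so an object left over from an earlier run with a different data set would be included as well. Restarting R between projects avoids this.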

Load any EventMeasure Lengths.txt and/or 3DPoints.txt files

This section of code will read in any Lengths.txt and/or 3DPoints.txt files that you have saved in the directory you set. It will combine all of the files into one data frame, and get the campaignid name from the name of the file. It is important that you consistently name your files with the same campaignid (look out for different separators, e.g. ‘-’, ‘_’, or ‘.’).

em_length3dpoints <- read_em_length(here::here("r-workflows/data/raw/")) %>%                   
  dplyr::select(-c(comment))%>% # there is a comment column in metadata, so you will need to remove this column from EM data
  dplyr::inner_join(metadata, by = join_by(sample, campaignid)) %>%
  dplyr::filter(successful_length %in% "Yes") %>%
  dplyr::rename(length_mm = length) %>%
  glimpse()

## Rows: 2,428
## Columns: 47
## $ opcode                      <chr> "10", "10", "10", "10", "10", "12", "12", …
## $ imageptpair                 <chr> "72", "73", "75", "81", "93", "0", "11", "…
## $ filenameleft                <chr> "Left_MAH00355.MP4", "Left_MAH00355.MP4", …
## $ frameleft                   <chr> "65608", "74525", "21436", "33446", "46833…
## $ filenameright               <chr> "Right_MAH00344.MP4", "Right_MAH00344.MP4"…
## $ frameright                  <chr> "64140", "73057", "20028", "32038", "45425…
## $ time                        <chr> "40.79909", "43.27851", "51.07269", "54.41…
## $ period                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ periodtime                  <chr> "15.42180", "17.90122", "25.69539", "29.03…
## $ x                           <chr> "-430.67367", "-278.33750", "-997.34125", …
## $ y                           <chr> "-140.98709", "-257.03677", "-186.86483", …
## $ z                           <chr> "-868.53490", "-762.92425", "-1522.45323",…
## $ sx                          <chr> "0.64880", "0.48001", "1.91708", "0.42177"…
## $ sy                          <chr> "0.46100", "0.47613", "0.79836", "0.44960"…
## $ sz                          <chr> "1.11439", "0.88751", "3.13793", "0.83996"…
## $ rms                         <chr> "0.67020", "0.47105", "0.63932", "0.10770"…
## $ range                       <chr> "979.64791", "851.81752", "1829.60976", "7…
## $ direction                   <chr> "27.65589", "26.62953", "33.74391", "22.82…
## $ family                      <chr> "Labridae", "Urolophidae", "Labridae", "Tr…
## $ genus                       <chr> "Austrolabrus", "Trygonoptera", "Ophthalmo…
## $ species                     <chr> "maculatus", "ovalis", "lineolata", "dumer…
## $ code                        <chr> "37384025", "37038016", "37384040", "37027…
## $ number                      <chr> "1", "1", "1", "1", "1", "40", "1", "1", "…
## $ stage                       <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", …
## $ activity                    <chr> "Passing", "Passing", "Passing", "Passing"…
## $ attribute9                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ attribute10                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ campaignid                  <chr> "2023-03_SwC_stereo-BRUVs", "2023-03_SwC_s…
## $ length_mm                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ precision                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ horzdir                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ vertdir                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ midx                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ midy                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ midz                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sample                      <chr> "10", "10", "10", "10", "10", "12", "12", …
## $ date_time                   <chr> "18/03/2023 2:44", "18/03/2023 2:44", "18/…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "44.3", "44.3", "44.3", "44.3", "44.3", "4…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_backward <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "South-west", "South-west", "South-west", …

Load any Generic length files

This section of code will read in any Length.csv files that you have saved in the directory you set. It will combine all of the files into one data frame, and get the campaignid name from the name of the file. It is important that you consistently name your files with the same campaignid (look out for different separators, e.g. ‘-’, ‘_’, or ‘.’).

gen_length <- read_gen_length(here::here("r-workflows/data/raw/")) %>%                   
  dplyr::full_join(metadata, by = join_by(campaignid, sample)) %>%
  dplyr::filter(successful_length %in% "Yes") %>%
  glimpse()

## reading length file: /home/runner/work/CheckEM/CheckEM/r-workflows/data/raw//2022-05_PtCloates_stereo-BRUVS_length.csv

## Rows: 2,130
## Columns: 11
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ opcode     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",…
## $ family     <chr> "Balistidae", "Balistidae", "Balistidae", "Carangidae", "Ca…
## $ genus      <chr> "Abalistes", "Abalistes", "Abalistes", "Decapterus", "Decap…
## $ species    <chr> "filamentosus", "filamentosus", "filamentosus", "spp", "spp…
## $ length_mm  <chr> "336.77888", "324.90775", "281.17035", "310.83692", "336.70…
## $ number     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",…
## $ range      <chr> "3775.30808", "3918.72008", "2559.81966", "3356.76229", "35…
## $ rms        <chr> "11.10507", "14.14936", "7.13303", "16.3268", "15.62525", "…
## $ precision  <chr> "21.13063", "9.07687", "5.16125", "13.59407", "18.2343", "1…
## $ code       <chr> "37465089", "37465089", "37465089", NA, NA, NA, NA, "373370…
## Rows: 2,162
## Columns: 23
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ opcode                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ family                      <chr> "Balistidae", "Balistidae", "Balistidae", …
## $ genus                       <chr> "Abalistes", "Abalistes", "Abalistes", "De…
## $ species                     <chr> "filamentosus", "filamentosus", "filamento…
## $ length_mm                   <chr> "336.77888", "324.90775", "281.17035", "31…
## $ number                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ range                       <chr> "3775.30808", "3918.72008", "2559.81966", …
## $ rms                         <chr> "11.10507", "14.14936", "7.13303", "16.326…
## $ precision                   <chr> "21.13063", "9.07687", "5.16125", "13.5940…
## $ code                        <chr> "37465089", "37465089", "37465089", NA, NA…
## $ sample                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "93.9", "93.9", "93.9", "93.9", "9…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "North-west", "North-west", "North-west", …
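
Note that the Generic lengths are joined with full_join(), whereas the EventMeasure lengths use inner_join(). The sketch below, on made-up data, shows the difference: a full join also keeps metadata samples that have no length rows, which is consistent with the row count rising from 2,130 to 2,162 in the output above.

```r
library(dplyr)

toy_lengths <- tibble(campaignid = "c1", sample = c("1", "1", "2"),
                      species = c("a", "b", "a"),
                      length_mm = c(310, 295, 120))
toy_metadata <- tibble(campaignid = "c1", sample = c("1", "2", "3"),
                       successful_length = "Yes")

# inner_join() keeps only samples present in both tables.
inner <- inner_join(toy_lengths, toy_metadata, by = join_by(campaignid, sample))

# full_join() also keeps sample "3", with NA for species and length_mm.
full <- full_join(toy_lengths, toy_metadata, by = join_by(campaignid, sample))

nrow(inner)
nrow(full)
```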

Combine EventMeasure and Generic length data

# If only EventMeasure data then length only includes Length and 3D points data
# If only Generic data then length only includes generic length data
# If both exist, then length includes both Length and 3D points and generic length data
length <- bind_rows(get0("em_length3dpoints"), get0("gen_length")) # this works even if you only have one type of data

Format and add zeros where a species isn’t present

In the count data

Tidy and “complete” MaxN data (e.g. add zeros in the data where a species wasn’t observed). The final data set will have a row for each species in every sample (deployment).

count <- maxn %>%
  dplyr::mutate(family = ifelse(family %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "Unknown", as.character(family))) %>%
  dplyr::mutate(genus = ifelse(genus %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "Unknown", as.character(genus))) %>%
  dplyr::mutate(species = ifelse(species %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "spp", as.character(species))) %>%
  dplyr::select(campaignid, sample, family, genus, species, maxn) %>%
  tidyr::complete(nesting(campaignid, sample), nesting(family, genus, species)) %>%
  tidyr::replace_na(list(maxn = 0)) %>%
  group_by(campaignid, sample, family, genus, species) %>%
  dplyr::summarise(count = sum(maxn)) %>%
  ungroup() %>%
  mutate(scientific = paste(family, genus, species, sep = " "))%>%
  dplyr::select(campaignid, sample, scientific, count)%>%
  spread(scientific, count, fill = 0)

## `summarise()` has grouped output by 'campaignid', 'sample', 'family', 'genus'.
## You can override using the `.groups` argument.

count_families <- maxn %>%
  dplyr::mutate(scientific = paste(family, genus, species, sep = " ")) %>%
  filter(!(family %in% "Unknown")) %>%
  dplyr::select(c(family, genus, species, scientific)) %>%
  distinct()

complete_count <- count %>%
  pivot_longer(names_to = "scientific", values_to = "count",
               cols = 3:ncol(.)) %>%
  inner_join(count_families, by = c("scientific")) %>%
  full_join(metadata)%>%
  glimpse()

## Joining with `by = join_by(campaignid, sample)`

## Rows: 17,766
## Columns: 18
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ scientific                  <chr> "Acanthuridae Naso brachycentron", "Acanth…
## $ count                       <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, …
## $ family                      <chr> "Acanthuridae", "Acanthuridae", "Acanthuri…
## $ genus                       <chr> "Naso", "Naso", "Naso", "Albula", "Aplodac…
## $ species                     <chr> "brachycentron", "fageni", "hexacanthus", …
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "93.9", "93.9", "93.9", "93.9", "9…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "North-west", "North-west", "North-west", …
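
The zero-filling step relies on tidyr::complete() with nesting(). The sketch below uses made-up data to show what it does: every sample is paired with every species that occurs anywhere in the data, and the new combinations receive a count of zero. Using nesting() restricts the species side to family, genus and species combinations that actually occur, rather than every possible cross of the three columns.

```r
library(dplyr)
library(tidyr)

toy_maxn <- tibble(
  campaignid = "c1",
  sample  = c("1", "1", "2"),
  family  = c("Labridae", "Carangidae", "Labridae"),
  genus   = c("Ophthalmolepis", "Decapterus", "Ophthalmolepis"),
  species = c("lineolata", "spp", "lineolata"),
  maxn    = c(3, 4, 1)
)

toy_complete <- toy_maxn %>%
  complete(nesting(campaignid, sample), nesting(family, genus, species)) %>%
  replace_na(list(maxn = 0))

# Two samples x two species = four rows; sample "2" gains a zero for Decapterus spp.
toy_complete
```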

In the length data

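Tidy and complete the length data (i.e. add zeros where a species wasn’t measured). The final data set will have at least one row for each species in every sample (deployment): one row per measured individual, or a single row with number = 0 where the species wasn’t measured in that sample.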

complete_length <- length %>%
  dplyr::mutate(family = ifelse(family %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "Unknown", as.character(family))) %>%
  dplyr::mutate(genus = ifelse(genus %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "Unknown", as.character(genus))) %>%
  dplyr::mutate(species = ifelse(species %in% c("NA", "NANA", NA, "unknown", "", NULL, " ", NA_character_), "spp", as.character(species))) %>%
  dplyr::filter(!family %in% "Unknown")%>%
  # First make one row for every length measurement
  dplyr::mutate(number = as.numeric(number)) %>%
  tidyr::uncount(number) %>%
  dplyr::mutate(number = 1) %>% 
  # Add in missing samples
  dplyr::right_join(metadata) %>%
  dplyr::filter(successful_length %in% "Yes") %>%
  # Complete the data (add in zeros for every species)
  dplyr::select(campaignid, sample, family, genus, species, length_mm, number, any_of(c("range", "rms", "precision"))) %>% # this will keep EM only columns
  tidyr::complete(nesting(campaignid, sample), nesting(family, genus, species)) %>%
  replace_na(list(number = 0)) %>%
  ungroup() %>%
  dplyr::filter(!is.na(number)) %>%
  dplyr::mutate(length_mm = as.numeric(length_mm)) %>%
  left_join(., metadata) %>%
  glimpse()

## Joining with `by = join_by(campaignid, sample, date_time, location, site,
## depth_m, successful_count, successful_length, successful_habitat_forward,
## successful_habitat_backward, zone_type, status, marine_region)`
## Joining with `by = join_by(campaignid, sample)`

## Rows: 23,119
## Columns: 21
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ family                      <chr> "Acanthuridae", "Acanthuridae", "Acanthuri…
## $ genus                       <chr> "Naso", "Naso", "Naso", "Albula", "Aplodac…
## $ species                     <chr> "brachycentron", "fageni", "hexacanthus", …
## $ length_mm                   <dbl> NA, NA, NA, NA, NA, NA, NA, 336.7789, 324.…
## $ number                      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
## $ range                       <chr> NA, NA, NA, NA, NA, NA, NA, "3775.30808", …
## $ rms                         <chr> NA, NA, NA, NA, NA, NA, NA, "11.10507", "1…
## $ precision                   <chr> NA, NA, NA, NA, NA, NA, NA, "21.13063", "9…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "93.9", "93.9", "93.9", "93.9", "9…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "North-west", "North-west", "North-west", …

Quality Control Checks

Now that we have somewhat tidy data, we can begin to run some checks.

Number of unique samples in the metadata

This is the total number of unique samples in the sample metadata (it should also be the number of rows in the metadata data frame).

number_of_samples <- metadata %>%
  dplyr::distinct(campaignid, sample)

message(paste(nrow(number_of_samples), "unique samples in the metadata"))

## 94 unique samples in the metadata

Check for duplicate sample names

If you have any duplicate samples within a campaign, they will be displayed here.

duplicate_samples <- metadata %>%
  dplyr::group_by(campaignid, sample) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::ungroup() %>%
  dplyr::filter(n > 1)

## `summarise()` has grouped output by 'campaignid'. You can override using the
## `.groups` argument.

message(paste(nrow(duplicate_samples), "samples duplicated in the metadata"))

## 0 samples duplicated in the metadata
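
As a minimal sketch of the same idea on toy data (the `toy_metadata` data frame is made up for illustration and is not part of the example dataset), `dplyr::count()` can stand in for the `group_by()`/`summarise()` pair, which also avoids the grouping message shown above:

```r
library(dplyr)

# Hypothetical metadata: sample "2" appears twice in campaign "A"
toy_metadata <- tibble(
  campaignid = c("A", "A", "A", "B"),
  sample     = c("1", "2", "2", "1")
)

# count() groups, tallies and ungroups in one step
toy_duplicates <- toy_metadata %>%
  dplyr::count(campaignid, sample) %>%
  dplyr::filter(n > 1)

message(paste(nrow(toy_duplicates), "samples duplicated in the toy metadata"))
```

Sample "1" occurs in both campaigns but is not flagged, because duplicates are only checked within a campaign.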

Number of sample(s) without points or count data

This could be due to samples not observing fish (not an error) or a sample that should be marked as successful_count = No. It could also be due to a sample name spelt incorrectly in the count/points or sample metadata file.

metadata_samples <- metadata %>%
  dplyr::select(campaignid, sample, dplyr::any_of(c("opcode", "period")), 
                successful_count, successful_length) %>%
  distinct()

samples <- maxn %>%
  distinct(campaignid, sample)

missing_count <- anti_join(metadata_samples, samples, by = join_by(campaignid, sample))

message(paste(nrow(missing_count), "samples in the metadata missing count data"))

## 0 samples in the metadata missing count data

Samples in the count data missing metadata

This next chunk checks for any samples that are in the count data but do not have a match in the sample metadata.

missing_metadata <- anti_join(samples, metadata_samples, by = join_by(campaignid, sample))
message(paste(nrow(missing_metadata), "samples in count data missing metadata"))

## 0 samples in count data missing metadata
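
The two checks above are the same `anti_join()` run in opposite directions. A small sketch with made-up sample lists (the `toy_*` objects are hypothetical) shows what each direction returns when a sample name is misspelt:

```r
library(dplyr)

# Hypothetical sample lists: the metadata has sample "3" that was never annotated,
# and the count data has "4", a name with no match in the metadata
toy_metadata_samples <- tibble(campaignid = "A", sample = c("1", "2", "3"))
toy_count_samples    <- tibble(campaignid = "A", sample = c("1", "2", "4"))

# Rows in the metadata with no match in the count data
toy_missing_count <- anti_join(toy_metadata_samples, toy_count_samples,
                               by = join_by(campaignid, sample))

# Rows in the count data with no match in the metadata
toy_missing_metadata <- anti_join(toy_count_samples, toy_metadata_samples,
                                  by = join_by(campaignid, sample))
```

When a sample shows up in both results under slightly different names, a spelling mistake in one of the files is the likely cause.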

Number of sample(s) without length or 3D point data

This could be due to samples not observing fish (not an error) or a sample that should be marked as successful_length = No. It could also be due to a sample name spelt incorrectly in the EMObs or Length file or the sample metadata file.

metadata_samples <- metadata %>%
  dplyr::select(campaignid, sample, dplyr::any_of(c("opcode", "period")), 
                successful_count, successful_length) %>%
  distinct()

samples <- length %>%
  distinct(campaignid, sample)

missing_length <- anti_join(metadata_samples, samples, by = join_by(campaignid, sample))

message(paste(nrow(missing_length), "samples in metadata missing length data"))

## 0 samples in metadata missing length data

Samples in the length data missing metadata

This next chunk checks for any samples that are in the length data but do not have a match in the sample metadata.

missing_metadata <- anti_join(samples, metadata_samples, by = join_by(campaignid, sample))

message(paste(nrow(missing_metadata), "samples in length data missing metadata"))

## 0 samples in length data missing metadata

Periods without an end (EM only)

This check is only important if you have used periods to define your sampling duration. It looks for any periods in the EMObs that do not have an end time. This is important if you want to check the duration of each period.

periods <- read_periods(here::here("r-workflows/data/raw/")) %>%
  glimpse()

## Rows: 32
## Columns: 11
## $ sample        <chr> "10", "12", "14", "15", "16", "17", "19", "2", "21", "22…
## $ periodindex   <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "…
## $ period        <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ filenamestart <chr> "Left_MAH00355.MP4", "L031_MAH01383.MP4", "L031_14_MAH01…
## $ framestart    <chr> "10145", "3746", "24760", "76519", "67465", "71918", "46…
## $ time_start    <dbl> 25.37730, 23.59802, 29.42440, 21.27653, 18.75902, 19.997…
## $ filenameend   <chr> "Left_MAH00357.MP4", "L031_MAH01385.MP4", "L031_14_MAH01…
## $ frameend      <chr> "63748", "57409", "78422", "49060", "40006", "44459", "1…
## $ time_end      <dbl> 85.37724, 83.59796, 89.42433, 81.27647, 78.75896, 79.997…
## $ has_end       <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ campaignid    <chr> "2023-03_SwC_stereo-BRUVs", "2023-03_SwC_stereo-BRUVs", …

periods_without_end <- periods %>%
  dplyr::filter(has_end == 0)

message(paste(nrow(periods_without_end), "periods without an end"))

## 0 periods without an end

glimpse(periods_without_end)

## Rows: 0
## Columns: 11
## $ sample        <chr> 
## $ periodindex   <chr> 
## $ period        <chr> 
## $ filenamestart <chr> 
## $ framestart    <chr> 
## $ time_start    <dbl> 
## $ filenameend   <chr> 
## $ frameend      <chr> 
## $ time_end      <dbl> 
## $ has_end       <chr> 
## $ campaignid    <chr>

Samples without periods (EM only)

This check is only important if you have used periods to define your sampling duration. You can use it to find any samples that are missing periods.

metadata_samples <- metadata %>%
  dplyr::select(campaignid, sample, dplyr::any_of(c("opcode", "period")), successful_count, successful_length) %>%
  dplyr::distinct() %>%
  dplyr::mutate(sample = as.factor(sample))

periods_samples <- periods %>%
  dplyr::select(campaignid, sample, dplyr::any_of(c("opcode", "period"))) %>%
  distinct()

missing_periods <- anti_join(metadata_samples, periods_samples) %>%
  dplyr::select(!sample)

## Joining with `by = join_by(campaignid, sample)`

message(paste(nrow(missing_periods), "samples missing period"))

## 62 samples missing period

glimpse(missing_periods)

## Rows: 62
## Columns: 3
## $ campaignid        <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates…
## $ successful_count  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ successful_length <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…

Points outside Periods (EM only)

This check identifies any points that have been annotated outside of a period.

points_outside_periods <- points %>%
  dplyr::filter(period %in% c("NA", NA, NULL, "")) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species, number, frame)

message(paste(nrow(points_outside_periods), "points outside a period"))

## 0 points outside a period

glimpse(points_outside_periods)

## Rows: 0
## Columns: 7
## $ campaignid <chr> 
## $ period     <chr> 
## $ family     <chr> 
## $ genus      <chr> 
## $ species    <chr> 
## $ number     <chr> 
## $ frame      <chr>
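
The filter above uses `%in%` rather than `==` because a comparison with a missing value returns `NA`, and `filter()` drops rows where the condition is `NA`. The following sketch uses a made-up vector (`toy_period` is hypothetical) to show the difference:

```r
# Hypothetical period values: a real period, a missing value, an empty string and the text "NA"
toy_period <- c("1", NA, "", "NA")

# `==` returns NA for the missing value, so filter() would silently drop that row
toy_period == ""
# FALSE NA TRUE FALSE

# `%in%` never returns NA, so missing values, empty strings and the text "NA" are all caught
is_outside <- toy_period %in% c("NA", NA, "")
is_outside
# FALSE TRUE TRUE TRUE

# NULL is dropped when a vector is built, so including it in the list has no effect
length(c("NA", NA, NULL, ""))
# 3
```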

Length measurement(s) or 3D point(s) outside periods (EM only)

This check identifies any length measurements or 3D points that have been annotated outside of a period.

lengths_outside_periods <- em_length3dpoints %>%
  dplyr::filter(period %in% c("NA", NA, NULL, "")) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species, number)

message(paste(nrow(lengths_outside_periods), "lengths/3D points outside period"))

## 1 lengths/3D points outside period

glimpse(lengths_outside_periods)

## Rows: 1
## Columns: 7
## $ campaignid <chr> "2023-03_SwC_stereo-BRUVs"
## $ opcode     <chr> "32"
## $ period     <chr> NA
## $ family     <chr> NA
## $ genus      <chr> NA
## $ species    <chr> NA
## $ number     <chr> NA

Period(s) that are not the correct duration (EM only)

In this check you define the correct sampling duration (e.g. 60 minutes for stereo-BRUVs) and then identify any periods that are not that length.

period_length <- 60 # in minutes

periods_wrong <- periods %>%
        dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), time_start, time_end, has_end) %>%
        dplyr::distinct() %>%
        dplyr::mutate(period_time = round(time_end - time_start)) %>%
        dplyr::filter(!period_time %in% period_length)

message(paste(nrow(periods_wrong), "periods not", period_length, "minutes long"))

## 1 periods not 60 minutes long

glimpse(periods_wrong)

## Rows: 1
## Columns: 6
## $ campaignid  <chr> "2023-03_SwC_stereo-BRUVs"
## $ period      <chr> "1"
## $ time_start  <dbl> 12.19858
## $ time_end    <dbl> 65.25241
## $ has_end     <chr> "1"
## $ period_time <dbl> 53
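
The arithmetic behind this check can be traced by hand. The sketch below uses two pairs of start and end times copied from the outputs shown in this section (the first row of the periods data and the flagged period):

```r
period_length <- 60 # in minutes

# Start and end times in minutes
time_start <- c(25.37730, 12.19858)
time_end   <- c(85.37724, 65.25241)

# Rounding to the nearest minute absorbs small differences such as 59.99994
period_time <- round(time_end - time_start)
period_time
# 60 53

# TRUE marks a period that would be flagged
flagged <- !period_time %in% period_length
flagged
# FALSE TRUE
```

Because the duration is rounded to a whole minute, a period has to be at least half a minute too short or too long before it is flagged.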

Total number of individuals observed

This is the total number of individuals observed in the count data:

total_count <- sum(complete_count$count)
message(paste(total_count, "fish counted in the count data"))

## 6411 fish counted in the count data

This is the total number of individuals observed in the length data:

total_length <- sum(complete_length$number) # the length data stores abundance in `number`, not `count`

message(paste(total_length, "fish counted in the length data"))

Points without a number (EM only)

This is a check for EventMeasure data. Sometimes analysts will add points of interest that are not fish/sharks and remove the number so they are not summed in total abundance metrics. You should check to make sure no fish species accidentally had their number deleted.

points_without_number <- points %>%
  filter(number %in% c("NA", NA, 0, NULL, "", " "))

message(paste(nrow(points_without_number), "points in the _Points.txt file that do not have a number"))

## 11 points in the _Points.txt file that do not have a number

glimpse(points_without_number)

## Rows: 11
## Columns: 22
## $ sample      <chr> "10", "10", "10", "21", "21", "21", "26", "26", "26", "34"…
## $ pointindex  <chr> "166", "485", "486", "238", "254", "256", "66", "67", "68"…
## $ filename    <chr> "Left_MAH00355.MP4", "Left_MAH00356.MP4", "Left_MAH00356.M…
## $ frame       <chr> "17138", "31042", "31042", "32458", "35214", "45239", "311…
## $ time        <chr> "27.32174", "53.74369", "53.74369", "54.10433", "54.87065"…
## $ period      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"
## $ periodtime  <chr> "1.94444", "28.36639", "28.36639", "24.85844", "25.62477",…
## $ imagecol    <chr> "1861.20000", "289.20000", "408.00000", "973.20000", "728.…
## $ imagerow    <chr> "243.60000", "292.80000", "414.00000", "67.20000", "50.400…
## $ rectwidth   <chr> "0.00000", "0.00000", "0.00000", "0.00000", "0.00000", "0.…
## $ rectheight  <chr> "0.00000", "0.00000", "0.00000", "0.00000", "0.00000", "0.…
## $ family      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ genus       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ species     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ code        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ number      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ stage       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ activity    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ comment     <chr> "squid", "squid", "squid", "Squid", "Squid", "Squid", "squ…
## $ attribute9  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ attribute10 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ campaignid  <chr> "2023-03_SwC_stereo-BRUVs", "2023-03_SwC_stereo-BRUVs", "2…

Length measurements or 3D points without a number (EM only)

This is a check for EventMeasure data. Sometimes analysts will add 3D points to record the sync point. These can remain in the data but you should double check that no fish are accidentally missing a number.

lengths_without_number <- em_length3dpoints %>%
  filter(number %in% c("NA", NA, 0, NULL, "", " "))

message(paste(nrow(lengths_without_number), "lengths or 3D points in the EMObs that do not have a number"))

## 1 lengths or 3D points in the EMObs that do not have a number

glimpse(lengths_without_number)

## Rows: 1
## Columns: 47
## $ opcode                      <chr> "32"
## $ imageptpair                 <chr> "0"
## $ filenameleft                <chr> "L039_MAH00349.MP4"
## $ frameleft                   <chr> "58009"
## $ filenameright               <chr> "R040_MAH00356.MP4"
## $ frameright                  <chr> "57032"
## $ time                        <chr> "16.12972"
## $ period                      <chr> NA
## $ periodtime                  <chr> "-1.00000"
## $ x                           <chr> "1.53837"
## $ y                           <chr> "-121.01451"
## $ z                           <chr> "-714.43170"
## $ sx                          <chr> "0.38485"
## $ sy                          <chr> "0.38601"
## $ sz                          <chr> "0.84527"
## $ rms                         <chr> "0.01491"
## $ range                       <chr> "724.60992"
## $ direction                   <chr> "10.04716"
## $ family                      <chr> NA
## $ genus                       <chr> NA
## $ species                     <chr> NA
## $ code                        <chr> NA
## $ number                      <chr> NA
## $ stage                       <chr> NA
## $ activity                    <chr> NA
## $ attribute9                  <chr> NA
## $ attribute10                 <chr> NA
## $ campaignid                  <chr> "2023-03_SwC_stereo-BRUVs"
## $ length_mm                   <chr> NA
## $ precision                   <chr> NA
## $ horzdir                     <chr> NA
## $ vertdir                     <chr> NA
## $ midx                        <chr> NA
## $ midy                        <chr> NA
## $ midz                        <chr> NA
## $ sample                      <chr> "32"
## $ date_time                   <chr> "15/03/2023 0:23"
## $ location                    <chr> NA
## $ site                        <chr> NA
## $ depth_m                     <chr> "44.4"
## $ successful_count            <chr> "Yes"
## $ successful_length           <chr> "Yes"
## $ successful_habitat_forward  <chr> "Yes"
## $ successful_habitat_backward <chr> "Yes"
## $ zone_type                   <chr> "National Park Zone (IUCN II)"
## $ status                      <chr> "No-take"
## $ marine_region               <chr> "South-west"

Species names that have been updated

Taxonomic advances mean that species names are updated all the time, and analysts sometimes spell species names incorrectly. This check uses a data frame saved inside the CheckEM package to identify species names that have been updated or spelt incorrectly.

You can choose whether to find the synonyms using an Australia-specific list (CheckEM::aus_synonyms) or a Global list (TO ADD).

Synonyms in the count data

synonyms_in_count <- dplyr::left_join(complete_count, CheckEM::aus_synonyms) %>%
      dplyr::filter(!is.na(genus_correct)) %>%
      dplyr::mutate('old name' = paste(family, genus, species, sep = " ")) %>%
      dplyr::mutate('new name' = paste(family_correct, genus_correct, species_correct, sep = " ")) %>%
      dplyr::select('old name', 'new name') %>%
      dplyr::distinct()

## Joining with `by = join_by(family, genus, species)`

message(paste(nrow(synonyms_in_count), "synonyms used in the count data"))

## 12 synonyms used in the count data

glimpse(synonyms_in_count)

## Rows: 12
## Columns: 2
## $ `old name` <chr> "Carangidae Carangoides spp", "Dasyatidae Dasyatis brevicau…
## $ `new name` <chr> "Carangidae Unknown spp", "Dasyatidae Bathytoshia brevicaud…

Synonyms in the length data

synonyms_in_length <- dplyr::left_join(complete_length, CheckEM::aus_synonyms) %>%
      dplyr::filter(!is.na(genus_correct)) %>%
      dplyr::mutate('old name' = paste(family, genus, species, sep = " ")) %>%
      dplyr::mutate('new name' = paste(family_correct, genus_correct, species_correct, sep = " ")) %>%
      dplyr::select('old name', 'new name') %>%
      dplyr::distinct()

## Joining with `by = join_by(family, genus, species)`

message(paste(nrow(synonyms_in_length), "synonyms used in the length data"))

## 12 synonyms used in the length data

glimpse(synonyms_in_length)

## Rows: 12
## Columns: 2
## $ `old name` <chr> "Carangidae Carangoides spp", "Dasyatidae Dasyatis brevicau…
## $ `new name` <chr> "Carangidae Unknown spp", "Dasyatidae Bathytoshia brevicaud…

Change synonym names in the data

Now that we have identified species names that have been updated or spelt incorrectly, you need to decide whether to change the names in your data or continue using the old names. If you want to update the names, use the next two chunks. If you would like to retain the old names, skip the next two chunks.

NOTE: this does not change your original annotation files, only the data you save at the end of the script.

complete_count <- dplyr::left_join(complete_count, CheckEM::aus_synonyms) %>%
  dplyr::mutate(genus = ifelse(!is.na(genus_correct), genus_correct, genus)) %>%
  dplyr::mutate(species = ifelse(!is.na(species_correct), species_correct, species)) %>%
  dplyr::mutate(family = ifelse(!is.na(family_correct), family_correct, family)) %>%
  dplyr::select(-c(family_correct, genus_correct, species_correct)) %>%
  dplyr::mutate(scientific = paste(family, genus, species))

## Joining with `by = join_by(family, genus, species)`

complete_length <- dplyr::left_join(complete_length, CheckEM::aus_synonyms) %>%
  dplyr::mutate(genus = ifelse(!is.na(genus_correct), genus_correct, genus)) %>%
  dplyr::mutate(species = ifelse(!is.na(species_correct), species_correct, species)) %>%
  dplyr::mutate(family = ifelse(!is.na(family_correct), family_correct, family)) %>%
  dplyr::select(-c(family_correct, genus_correct, species_correct)) %>%
  dplyr::mutate(scientific = paste(family, genus, species))

## Joining with `by = join_by(family, genus, species)`
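
The replacement logic can be seen on a small made-up example. In this sketch the taxon names and the `toy_*` objects are hypothetical placeholders, and `dplyr::coalesce()` is swapped in for the `ifelse()` calls above; it returns the first non-missing value, so the corrected name is used where one exists and the original name is kept otherwise:

```r
library(dplyr)

# Hypothetical synonym lookup with the same column layout as the chunks above
toy_synonyms <- tibble(
  family = "Familyidae", genus = "Oldgenus", species = "example",
  family_correct = "Familyidae", genus_correct = "Newgenus", species_correct = "example"
)

# Hypothetical count data: only the first row uses an outdated name
toy_count <- tibble(
  family  = c("Familyidae", "Familyidae"),
  genus   = c("Oldgenus", "Othergenus"),
  species = c("example", "example"),
  count   = c(2, 5)
)

toy_updated <- left_join(toy_count, toy_synonyms, by = join_by(family, genus, species)) %>%
  mutate(
    family  = coalesce(family_correct, family),
    genus   = coalesce(genus_correct, genus),
    species = coalesce(species_correct, species)
  ) %>%
  select(-c(family_correct, genus_correct, species_correct)) %>%
  mutate(scientific = paste(family, genus, species))
```

Rows without a match in the lookup keep their original names, because the `_correct` columns are `NA` after the join.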

Species not observed in the region before

This check compares your data against a list of species and their known geographical ranges to identify any species recorded outside its range. There are two life history data sets to choose from: the Australia list and the Global list.

To use the Australia life history list, use CheckEM::australia_life_history; to use the Global life history list, use CheckEM::global_life_history.

We run the check on both the count and length data, as sometimes a species will be in the count and not the length. All species identified in these checks are present in your chosen life history list. Range data is sometimes limited, so a species flagged by this check may genuinely occur in that area. You need to think critically about each species flagged by this check.

Species out of range in the count data

Check for any species that are out of range in the count data.

count_species_not_observed_region <- complete_count %>%
  dplyr::distinct(campaignid, sample, family, genus, species, marine_region, count) %>%
  dplyr::anti_join(., expand_life_history(CheckEM::australia_life_history), by = c("family", "genus", "species", "marine_region")) %>%
  dplyr::filter(count > 0) %>%
  dplyr::left_join(metadata) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species, marine_region) %>%
  dplyr::distinct() %>%
  dplyr::rename('marine region not observed in' = marine_region) %>%
  dplyr::semi_join(., CheckEM::australia_life_history, by = c("family", "genus", "species"))

## Joining with `by = join_by(campaignid, sample, marine_region)`

message(paste(nrow(count_species_not_observed_region), "species not observed in the region before"))

## 0 species not observed in the region before

glimpse(count_species_not_observed_region)

## Rows: 0
## Columns: 5
## $ campaignid                      <chr> 
## $ family                          <chr> 
## $ genus                           <chr> 
## $ species                         <chr> 
## $ `marine region not observed in` <chr>

Species out of range in the length data

Check for any species that are out of range in the length data.

length_species_not_observed_region <- complete_length %>%
  dplyr::distinct(campaignid, sample, family, genus, species, marine_region, number) %>%
  dplyr::anti_join(., expand_life_history(CheckEM::australia_life_history), by = c("family", "genus", "species", "marine_region")) %>%
  dplyr::filter(number > 0) %>%
  dplyr::left_join(metadata) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species, marine_region) %>%
  dplyr::distinct() %>%
  dplyr::rename('marine region not observed in' = marine_region) %>%
  dplyr::semi_join(., CheckEM::australia_life_history, by = c("family", "genus", "species"))

## Joining with `by = join_by(campaignid, sample, marine_region)`

message(paste(nrow(length_species_not_observed_region), "species not observed in the region before"))

## 0 species not observed in the region before

glimpse(length_species_not_observed_region)

## Rows: 0
## Columns: 5
## $ campaignid                      <chr> 
## $ family                          <chr> 
## $ genus                           <chr> 
## $ species                         <chr> 
## $ `marine region not observed in` <chr>
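
The two region checks combine an `anti_join()` with a `semi_join()`. The sketch below illustrates that pattern on made-up data; `toy_life_history` is a hypothetical table with one row per species and region, which is assumed here to resemble what `expand_life_history()` returns (its internals are not shown in this workflow):

```r
library(dplyr)

# Hypothetical life history table: one row per species and region
toy_life_history <- tibble(
  family = "Familyidae", genus = "Genus",
  species = c("alpha", "alpha", "beta"),
  marine_region = c("North-west", "South-west", "South-west")
)

# Hypothetical counts, all from the North-west
toy_count <- tibble(
  campaignid = "A", sample = "1",
  family = "Familyidae", genus = "Genus",
  species = c("alpha", "beta", "gamma"),
  marine_region = "North-west", count = c(3, 1, 2)
)

toy_out_of_range <- toy_count %>%
  filter(count > 0) %>%
  # drop species that have a known record in the region
  anti_join(toy_life_history, by = join_by(family, genus, species, marine_region)) %>%
  # keep only species that are on the list at all
  semi_join(toy_life_history, by = join_by(family, genus, species))
```

Only "beta" is returned: "alpha" is known from the North-west, and "gamma" is not in the list at all, so it is left for the next check rather than being flagged here.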

Species not listed in the life history list

This next check identifies any species that are not listed in your chosen life history list. It could be that you have misspelt the family/genus/species name, the species name is invalid, the name has been updated, or the species should be included in the life history list but is missing. Again, you will need to think critically about the species that are flagged to determine whether each one is an error.

To use the Australia life history list, use CheckEM::australia_life_history; to use the Global life history list, use CheckEM::global_life_history.

NOTE: if you believe that a species flagged by this check should be included in the life history list, please email brooke.gibbons@uwa.edu.au with the full species name and which list it should be added to (Global or Australia).

Species in the count data that are not listed

If you chose to update the names that have changed (synonyms) then this check won’t include the previously used names.

count_species_not_in_list <- complete_count %>%
  dplyr::anti_join(., CheckEM::australia_life_history, by = c("family", "genus", "species")) %>%
  dplyr::filter(count > 0) %>%
  dplyr::left_join(metadata) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species) %>%
  dplyr::distinct()

## Joining with `by = join_by(campaignid, sample, date_time, location, site,
## depth_m, successful_count, successful_length, successful_habitat_forward,
## successful_habitat_backward, zone_type, status, marine_region)`

message(paste(nrow(count_species_not_in_list), "species not in chosen life history list"))

## 11 species not in chosen life history list

glimpse(count_species_not_in_list)

## Rows: 11
## Columns: 4
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ family     <chr> "Lethrinidae", "Lutjanidae", "Sus", "Rhinidae", "Scaridae",…
## $ genus      <chr> "Gymnocranius", "Pristipomoides", "Sus", "Rhynchobatus", "S…
## $ species    <chr> "sp1", "sp1", "sus", "laevis", "sp3", "sp10", "sp1", "SUS",…

Species in the length data that are not listed

length_species_not_in_list <- complete_length %>%
  dplyr::anti_join(., CheckEM::australia_life_history, by = c("family", "genus", "species")) %>%
  dplyr::filter(number > 0) %>%
  dplyr::left_join(metadata) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species) %>%
  dplyr::distinct()

## Joining with `by = join_by(campaignid, sample, date_time, location, site,
## depth_m, successful_count, successful_length, successful_habitat_forward,
## successful_habitat_backward, zone_type, status, marine_region)`

message(paste(nrow(length_species_not_in_list), "species not in chosen life history list"))

## 12 species not in chosen life history list

glimpse(length_species_not_in_list)

## Rows: 12
## Columns: 4
## $ campaignid <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates_stereo…
## $ family     <chr> "Lethrinidae", "Lutjanidae", "Sus", "Rhinidae", "Scaridae",…
## $ genus      <chr> "Gymnocranius", "Pristipomoides", "Sus", "Rhynchobatus", "S…
## $ species    <chr> "sp1", "sp1", "sus", "laevis", "sp3", "sp10", "sp1", "SUS",…

Length measurements smaller or bigger than the FishBase minimum and maximum

Length data is extremely valuable, so it is important that fish are measured precisely and accurately. You can use this check to identify any measurements that are bigger than the maximum size listed on FishBase. However, we think it is important that you also check any measurements that are nearing the maximum size (e.g. 85% of the maximum). We also think it is important that you check for any measurements that are very small. FishBase does not list a minimum size, so we recommend checking the length measurements against 15% of the maximum size.

In the function below you can edit these cut-offs (15% and 85%) as you see fit. If you believe the size limit for a particular species is incorrect, you can fill out the feedback form on the CheckEM web-based app (see the tab Edit maximum lengths) to supply a new maximum size limit.

incorrect_lengths <- left_join(complete_length, create_min_max(CheckEM::australia_life_history, minimum = 0.15, maximum = 0.85)) %>%
  dplyr::filter(length_mm < min_length_mm | length_mm > max_length_mm) %>%
  mutate(reason = ifelse(length_mm < min_length_mm, "too small", "too big")) %>%
  dplyr::select(campaignid, sample, family, genus, species, length_mm, min_length_mm, max_length_mm, length_max_mm, reason, any_of(c("em_comment", "frame_left"))) %>%
  mutate(difference = ifelse(reason %in% c("too small"), (min_length_mm - length_mm), (length_mm - max_length_mm))) %>%
  dplyr::mutate(percent_of_fb_max = (length_mm/length_max_mm)*100) %>%
  dplyr::left_join(metadata) %>%
  dplyr::select(campaignid, dplyr::any_of(c("opcode", "period")), family, genus, species, length_mm, min_length_mm, max_length_mm, length_max_mm, reason, any_of(c("em_comment", "frame_left")), difference, percent_of_fb_max)

## Joining with `by = join_by(genus)`
## Joining with `by = join_by(family, genus, species)`
## Joining with `by = join_by(campaignid, sample)`

too_small <- incorrect_lengths %>%
  dplyr::filter(reason %in% "too small")

too_big <- incorrect_lengths %>%
  dplyr::filter(reason %in% "too big")

message(paste(nrow(too_small), "lengths are too small"))

## 184 lengths are too small

glimpse(too_small)

## Rows: 184
## Columns: 11
## $ campaignid        <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates…
## $ family            <chr> "Monacanthidae", "Echeneidae", "Echeneidae", "Echene…
## $ genus             <chr> "Unknown", "Echeneis", "Echeneis", "Echeneis", "Eche…
## $ species           <chr> "spp", "naucrates", "naucrates", "naucrates", "naucr…
## $ length_mm         <dbl> 32.95594, 116.45890, 141.68877, 133.69953, 70.35057,…
## $ min_length_mm     <dbl> 44.98729, 165.00000, 165.00000, 165.00000, 165.00000…
## $ max_length_mm     <dbl> 254.9280, 935.0000, 935.0000, 935.0000, 935.0000, 38…
## $ length_max_mm     <dbl> 299.9153, 1100.0000, 1100.0000, 1100.0000, 1100.0000…
## $ reason            <chr> "too small", "too small", "too small", "too small", …
## $ difference        <dbl> 12.0313480, 48.5411000, 23.3112300, 31.3004700, 94.6…
## $ percent_of_fb_max <dbl> 10.988417, 10.587173, 12.880797, 12.154503, 6.395506…

message(paste(nrow(too_big), "lengths are too big"))

## 399 lengths are too big

glimpse(too_big)

## Rows: 399
## Columns: 11
## $ campaignid        <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05_PtCloates…
## $ family            <chr> "Balistidae", "Balistidae", "Balistidae", "Labridae"…
## $ genus             <chr> "Abalistes", "Abalistes", "Abalistes", "Suezichthys"…
## $ species           <chr> "filamentosus", "filamentosus", "filamentosus", "cya…
## $ length_mm         <dbl> 336.7789, 324.9078, 281.1703, 90.9945, 404.3352, 428…
## $ min_length_mm     <dbl> 49.5, 49.5, 49.5, 15.0, 67.5, 67.5, 49.5, 195.0, 195…
## $ max_length_mm     <dbl> 280.5, 280.5, 280.5, 85.0, 382.5, 382.5, 280.5, 1105…
## $ length_max_mm     <dbl> 330, 330, 330, 100, 450, 450, 330, 1300, 1300, 450, …
## $ reason            <chr> "too big", "too big", "too big", "too big", "too big…
## $ difference        <dbl> 56.27888, 44.40775, 0.67035, 5.99450, 21.83520, 45.9…
## $ percent_of_fb_max <dbl> 102.05421, 98.45689, 85.20314, 90.99450, 89.85227, 9…
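
The internals of `create_min_max()` are not shown in this workflow, so the following is only a sketch that assumes the limits are simple proportions of the FishBase maximum. Under that assumption it reproduces three rows of the `too_big` output above:

```r
# FishBase maximum lengths (mm) and measured lengths (mm) copied from the too_big output
length_max_mm <- c(330, 100, 450)
length_mm     <- c(336.7789, 90.9945, 404.3352)

# Cut-offs as proportions of the maximum, matching minimum = 0.15 and maximum = 0.85
min_length_mm <- length_max_mm * 0.15
max_length_mm <- length_max_mm * 0.85

# Same classification and summary columns as the check above
reason            <- ifelse(length_mm < min_length_mm, "too small", "too big")
difference        <- length_mm - max_length_mm
percent_of_fb_max <- (length_mm / length_max_mm) * 100

data.frame(length_mm, min_length_mm, max_length_mm, reason, difference, percent_of_fb_max)
```

Only the first measurement exceeds the FishBase maximum itself (about 102%); the other two are flagged because they fall between 85% and 100% of it.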

Number of 3D points and length measurements over the RMS limit (EM only)

In this check you can set the RMS limit, and then identify any measurements that have a larger RMS.

rms_limit <- 20 # in mm

over_rms <- complete_length %>%
  dplyr::filter(as.numeric(rms) > rms_limit)

message(paste(nrow(over_rms), "lengths over RMS limit"))

## 0 lengths over RMS limit

glimpse(over_rms)

## Rows: 0
## Columns: 22
## $ campaignid                  <chr> 
## $ sample                      <chr> 
## $ family                      <chr> 
## $ genus                       <chr> 
## $ species                     <chr> 
## $ length_mm                   <dbl> 
## $ number                      <dbl> 
## $ range                       <chr> 
## $ rms                         <chr> 
## $ precision                   <chr> 
## $ date_time                   <chr> 
## $ location                    <chr> 
## $ site                        <chr> 
## $ depth_m                     <chr> 
## $ successful_count            <chr> 
## $ successful_length           <chr> 
## $ successful_habitat_forward  <chr> 
## $ successful_habitat_backward <chr> 
## $ zone_type                   <chr> 
## $ status                      <chr> 
## $ marine_region               <chr> 
## $ scientific                  <chr>

Number of length measurements over the precision limit (EM only)

In this check you can set the precision limit, and then identify any measurements that have a larger precision.

precision_limit <- 10 # in %

over_precision <- complete_length %>%
  dplyr::filter(as.numeric(precision) > precision_limit)

message(paste(nrow(over_precision), "lengths over precision limit"))

## 430 lengths over precision limit

glimpse(over_precision)

## Rows: 430
## Columns: 22
## $ campaignid                  <chr> "2022-05_PtCloates_stereo-BRUVS", "2022-05…
## $ sample                      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ family                      <chr> "Balistidae", "Carangidae", "Carangidae", …
## $ genus                       <chr> "Abalistes", "Decapterus", "Decapterus", "…
## $ species                     <chr> "filamentosus", "spp", "spp", "spp", "fulv…
## $ length_mm                   <dbl> 336.7789, 310.8369, 336.7015, 303.8198, 26…
## $ number                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ range                       <chr> "3775.30808", "3356.76229", "3562.62956", …
## $ rms                         <chr> "11.10507", "16.3268", "15.62525", "11.232…
## $ precision                   <chr> "21.13063", "13.59407", "18.2343", "16.221…
## $ date_time                   <chr> "2022-05-22T10:03:24+08:00", "2022-05-22T1…
## $ location                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ site                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ depth_m                     <chr> "93.9", "93.9", "93.9", "93.9", "93.9", "9…
## $ successful_count            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_length           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ successful_habitat_forward  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ successful_habitat_backward <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ zone_type                   <chr> "National Park Zone (IUCN II)", "National …
## $ status                      <chr> "No-take", "No-take", "No-take", "No-take"…
## $ marine_region               <chr> "North-west", "North-west", "North-west", …
## $ scientific                  <chr> "Balistidae Abalistes filamentosus", "Cara…
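
Once the measurements over the limit have been identified, it can help to see which species they are concentrated in. The following is a minimal sketch rather than part of the CheckEM workflow: `toy_length` and its values are made up for illustration, and only reuse the column names and types shown in the `glimpse()` output above (note that `precision` is stored as character).

```r
library(dplyr)

# Hypothetical toy data mimicking the columns of complete_length used in this check.
# The values are invented; precision is character, as in the real data.
toy_length <- tibble::tibble(
  sample     = c("1", "1", "2", "2", "3"),
  scientific = c("Balistidae Abalistes filamentosus",
                 "Carangidae Decapterus spp",
                 "Carangidae Decapterus spp",
                 "Carangidae Decapterus spp",
                 "Balistidae Abalistes filamentosus"),
  length_mm  = c(336.8, 310.8, 336.7, 303.8, 290.1),
  precision  = c("21.13063", "13.59407", "4.2", NA, "9.9")
)

precision_limit <- 10

# Same filter as in the check above.
# Rows with a missing precision evaluate to NA and are dropped by filter().
toy_over_precision <- toy_length %>%
  dplyr::filter(as.numeric(precision) > precision_limit)

# Count the flagged measurements per species, most affected first
toy_summary <- toy_over_precision %>%
  dplyr::count(scientific, name = "n_over_limit", sort = TRUE)

print(toy_summary)
```

Because `filter()` silently drops rows where the comparison is `NA`, measurements without a precision value never appear in this summary; they would need to be checked separately if that matters for your data.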

Number of 3D points and length measurements over the range limit (EM only)

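In this check you set a range limit in metres and then identify any measurements with a range greater than that limit. The limit is multiplied by 1000 in the filter so that it is compared with the range values on the same scale.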

range_limit <- 10 # in metres

over_range <- complete_length %>%
  dplyr::filter(as.numeric(range) > (range_limit * 1000))

message(paste(nrow(over_range), "lengths over range limit"))

## 0 lengths over range limit

glimpse(over_range)

## Rows: 0
## Columns: 22
## $ campaignid                  <chr> 
## $ sample                      <chr> 
## $ family                      <chr> 
## $ genus                       <chr> 
## $ species                     <chr> 
## $ length_mm                   <dbl> 
## $ number                      <dbl> 
## $ range                       <chr> 
## $ rms                         <chr> 
## $ precision                   <chr> 
## $ date_time                   <chr> 
## $ location                    <chr> 
## $ site                        <chr> 
## $ depth_m                     <chr> 
## $ successful_count            <chr> 
## $ successful_length           <chr> 
## $ successful_habitat_forward  <chr> 
## $ successful_habitat_backward <chr> 
## $ zone_type                   <chr> 
## $ status                      <chr> 
## $ marine_region               <chr> 
## $ scientific                  <chr>
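
The range check mixes two units: the limit is given in metres, while the multiplication by 1000 implies that the `range` column is in millimetres. One way to make this explicit is to convert the range to metres in its own column and flag rows instead of filtering straight away. This is only a sketch under that assumption: `toy_length`, `range_m` and `over_range` are hypothetical names and the values are invented.

```r
library(dplyr)

# Hypothetical toy data; range is assumed to be in millimetres and is stored
# as character, as in the glimpse() output above
toy_length <- tibble::tibble(
  sample    = c("1", "1", "2"),
  length_mm = c(336.8, 310.8, 303.8),
  range     = c("3775.30808", "10250.5", "9999.9")
)

range_limit <- 10 # in metres

toy_flagged <- toy_length %>%
  dplyr::mutate(range_m    = as.numeric(range) / 1000, # millimetres to metres
                over_range = range_m > range_limit)    # TRUE where the limit is exceeded

# Keep only the measurements beyond the limit
toy_over_range <- toy_flagged %>%
  dplyr::filter(over_range)

print(toy_over_range)
```

Keeping the flag as a column means the same table can be used both to count the measurements over the limit and to inspect how far beyond the limit they are.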

Samples where the MaxN does not equal the number of length measurements

Percent of MaxN Measured

Save the checked data

Save MaxN as an R data file.

saveRDS(complete_count,
        file = here::here(paste0("r-workflows/data/staging/",
                                 name, "_complete-count.rds")))

Save lengths as an R data file.

saveRDS(complete_length,
        file = here::here(paste0("r-workflows/data/staging/",
                                 name, "_complete-length.rds")))
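
To confirm that a saved file can be read back unchanged, a quick round trip with `readRDS()` is usually enough. The sketch below uses a temporary file and a made-up data frame (`toy_data`) as a stand-in for `complete_count` or `complete_length`, so it does not touch the project folders.

```r
# Hypothetical stand-in for complete_count or complete_length
toy_data <- data.frame(sample = c("1", "2"),
                       count  = c(3, 0),
                       stringsAsFactors = FALSE)

# Write to a temporary file instead of the staging directory
tmp_file <- tempfile(fileext = ".rds")
saveRDS(toy_data, file = tmp_file)

# Read the file back and compare it with the object that was saved
toy_reloaded <- readRDS(tmp_file)
round_trip_ok <- identical(toy_data, toy_reloaded)

print(round_trip_ok)
```

Note that `saveRDS()` does not create missing folders, so the staging directory needs to exist before the files above are written.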

Claude Spencer & Brooke Gibbons

2024-01-19
