Draws a subsample from the samples collected in a given sampling event and season.

Usage

generate_subsample(con, sampling_event, season)

Arguments

con

A connection to the database, e.g. as returned by gr_db_connect().

sampling_event

A numeric value identifying the sampling event you are processing (e.g. one of 1:10).

season

The season for which you want to pull subsamples. A season consists of all sampling events from October 1st of the previous year through September 30th of the given year.
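The season definition can be sketched as a date window. This is only an illustration, not the package's implementation: season_window() and in_season() are hypothetical helper names, and the sketch assumes both boundary dates (October 1st and September 30th) are inclusive.

```r
# Hypothetical helper: date window for a season (not part of the package).
# Assumes October 1st and September 30th are both inclusive.
season_window <- function(season) {
  stopifnot(is.numeric(season), length(season) == 1)
  list(
    start = as.Date(sprintf("%d-10-01", as.integer(season) - 1L)),
    end   = as.Date(sprintf("%d-09-30", as.integer(season)))
  )
}

# Hypothetical helper: TRUE for dates that fall inside the season window.
in_season <- function(dates, season) {
  w <- season_window(season)
  dates >= w$start & dates <= w$end
}

# Example: which of these collection dates belong to season 2024?
dates <- as.Date(c("2023-09-30", "2023-10-01", "2024-09-30", "2024-10-01"))
in_season(dates, 2024)
```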

Value

A named list containing a subsample_for_sherlock table, a subsample_summary table, and a remainders_for_gt_seq table. The subsample_for_sherlock table contains all samples selected by the process with the following columns:

  • sample_id

  • datetime_collected

  • stream_name

  • sample_bin_code

  • sample_event_number

  • scenario

The subsample_summary table summarizes the number of samples for a given subsampling scenario by stream, event, and bin. It contains the following columns:

  • stream

  • event

  • bin

  • scenario

  • subsamples
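As a rough illustration of how such a summary relates to the selected samples, the sketch below counts rows of a made-up table by stream, event, bin, and scenario using base R. The data values, the scenario label, and the mapping from stream_name, sample_event_number, and sample_bin_code to stream, event, and bin are assumptions for the example, not taken from the package.

```r
# Made-up selected samples; all values are illustrative only.
selected <- data.frame(
  sample_id           = c("S1", "S2", "S3", "S4"),
  stream_name         = c("stream_x", "stream_x", "stream_x", "stream_y"),
  sample_bin_code     = c("A", "A", "B", "A"),
  sample_event_number = c(1, 1, 1, 1),
  scenario            = rep("hypothetical_scenario", 4),
  stringsAsFactors    = FALSE
)

# Count selected samples per stream, event, bin, and scenario.
summary_tbl <- aggregate(
  list(subsamples = selected$sample_id),
  by = list(
    stream   = selected$stream_name,
    event    = selected$sample_event_number,
    bin      = selected$sample_bin_code,
    scenario = selected$scenario
  ),
  FUN = length
)

summary_tbl
```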

The remainders_for_gt_seq table contains all samples not selected by the subsampling process with the following columns:

  • sample_id

  • datetime_collected

  • stream_name

  • sample_bin_code

  • sample_event_number

  • scenario

Details

This function takes a subsample from all samples returned from the field for a given sampling event and a given season. It uses the following subsampling logic:

  • At least 50% of samples per site per event will be sampled.

  • If the number of samples is odd, divide that number by 2 and round the resulting number up to the nearest integer.

  • If the total number of samples for a given site in a given event is less than 20, process all samples for that site/event.

  • If multiple bins are represented in a set of samples for a given site and event, select 50% of the samples from each bin for processing.

  • If the total number of samples in a bin is less than or equal to 5, process all of the samples for that bin.

  • If this rule contradicts the “less than 20” rule (above), this rule should be prioritized. For example, if we receive a sample set from a given site and event where Bins A, B, C, D, and E are each represented by five samples (total sample size = 25), process all of the samples for that site/event.

  • Subsampling should be random.
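The rules above can be sketched for a single site and event as follows. This is a minimal base R illustration under stated assumptions, not the function's actual implementation: sketch_subsample() is a hypothetical name, the input is assumed to be a data frame with one row per sample and a sample_bin_code column, and the output flags selected rows in a new selected column.

```r
# Hypothetical sketch of the subsampling rules for ONE site and event.
# `samples` is assumed to have one row per sample and a sample_bin_code column.
sketch_subsample <- function(samples) {
  # Fewer than 20 samples for the site/event: process everything.
  if (nrow(samples) < 20) {
    samples$selected <- TRUE
    return(samples)
  }
  samples$selected <- FALSE
  for (bin in unique(samples$sample_bin_code)) {
    idx <- which(samples$sample_bin_code == bin)
    n <- length(idx)
    # Bins with 5 or fewer samples are processed in full;
    # otherwise take 50%, rounding up when the count is odd.
    n_keep <- if (n <= 5) n else ceiling(n / 2)
    # Random selection within the bin; sample.int avoids the
    # length-one pitfall of sample().
    chosen <- idx[sample.int(n, n_keep)]
    samples$selected[chosen] <- TRUE
  }
  samples
}

# Example: 20 samples in bin A and 3 in bin B (23 in total).
mixed <- data.frame(
  sample_id = 1:23,
  sample_bin_code = rep(c("A", "B"), times = c(20, 3))
)
result <- sketch_subsample(mixed)
sum(result$selected)  # 13: 10 from bin A plus all 3 from bin B
```

In this sketch both the "less than 20" rule and the small-bin rule only ever add samples to the selection, so the worked case of five bins with five samples each (25 in total) ends with every sample processed.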

Examples

# connect with database
con <- gr_db_connect()
#>  refreshing Azure auth token
#>  refreshing Azure auth token ... done

# subsample from sampling event 1 in season 2024
subsamples <- generate_subsample(con, 1, 2024)