Generate Subsample
generate_subsample.Rd
Draws a subsample from the samples collected in a given sampling event within a season.
Arguments
- con
A connection to the database.
- sampling_event
A numeric value identifying the sampling event you are processing (for example, one of 1:10).
- season
The season for which you want to pull subsamples. A season consists of all sampling events from the previous year after October 1st and from the given year up to September 30th.
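The season boundary can be sketched as a small helper. This is an illustration only: season_for_date is a hypothetical name, not a function in the package, and it assumes that October 1st itself belongs to the following season, which the description above does not state explicitly.

```r
# Hypothetical helper (not part of the package): map a collection date to a season.
# Assumption: dates on or after October 1st count toward the next year's season.
season_for_date <- function(date) {
  date <- as.Date(date)
  year <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  ifelse(month >= 10, year + 1L, year)
}

season_for_date(c("2023-11-15", "2024-03-02", "2024-10-05"))
#> [1] 2024 2024 2025
```

Under this assumption, a sample collected in November 2023 belongs to season 2024, while one collected in October 2024 already belongs to season 2025.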
Value
A named list containing a subsample_for_sherlock table, a subsample_summary table, and a remainders_for_gt_seq table.
The subsample_for_sherlock table contains all samples selected by the process, with the following columns:
sample_id
datetime_collected
stream_name
sample_bin_code
sample_event_number
scenario
The subsample_summary table summarizes the number of samples for a given subsampling scenario by stream, event, and bin. It contains the following columns:
stream
event
bin
scenario
subsamples
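As a rough illustration of how a summary with these columns relates to the selected samples, the sketch below counts rows of a toy table by stream, event, bin, and scenario using base R. The data and the scenario label are made up, and this is not necessarily how the package builds its summary.

```r
# Toy stand-in for a subsample_for_sherlock table (all values are invented).
selected <- data.frame(
  sample_id = sprintf("S%02d", 1:6),
  stream_name = c("A", "A", "A", "B", "B", "B"),
  sample_event_number = 1,
  sample_bin_code = c("X", "X", "Y", "X", "Y", "Y"),
  scenario = "hypothetical_scenario"  # placeholder label, not a real scenario name
)

# Count selected samples per stream, event, bin, and scenario.
subsample_summary <- aggregate(
  sample_id ~ stream_name + sample_event_number + sample_bin_code + scenario,
  data = selected,
  FUN = length
)

# Rename to the column names listed above.
names(subsample_summary) <- c("stream", "event", "bin", "scenario", "subsamples")
subsample_summary
```

Each row of the result gives the number of selected samples for one stream/event/bin combination.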
The remainders_for_gt_seq table contains all samples not selected by the subsampling process, with the following columns:
sample_id
datetime_collected
stream_name
sample_bin_code
sample_event_number
scenario
Details
This function takes a subsample from all samples returned from the field for a given sampling event and a given season. It uses the following subsampling logic:
At least 50% of samples per site per event will be sampled.
If the number of samples is odd, divide that number by 2 and round the result up to the nearest integer.
If the total number of samples for a given site in a given event is less than 20, process all samples for that site/event.
If multiple bins are represented in a set of samples for a given site and event, select 50% of the samples from each bin for processing.
If the total number of samples in a bin is less than or equal to 5, process all of the samples for that bin.
If this rule contradicts the “less than 20” rule (above), this rule should be prioritized. For example, if we receive a sample set from a given site and event where Bins A, B, C, D, and E are each represented by five samples (total sample size = 25), process all of the samples for that site/event.
Subsampling should be random.
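The rules above can be sketched in base R for a single site and event. This is one reading of the logic, not the package's implementation: n_to_process and select_subsample are hypothetical helper names, and the input is assumed to be a vector of sample IDs with a matching vector of bin codes.

```r
# Hypothetical helper (not the package's code): number of samples to process
# per bin for one site/event, given a named vector of sample counts per bin.
n_to_process <- function(bin_counts) {
  # Fewer than 20 samples for the site/event: process everything.
  if (sum(bin_counts) < 20) {
    return(bin_counts)
  }
  # Otherwise keep small bins (5 or fewer) whole and take half of larger bins,
  # rounding up when the bin count is odd.
  ifelse(bin_counts <= 5, bin_counts, ceiling(bin_counts / 2))
}

# Hypothetical helper: randomly pick sample IDs within each bin.
select_subsample <- function(sample_id, bin) {
  ids_by_bin <- split(sample_id, bin)
  targets <- n_to_process(lengths(ids_by_bin))
  picked <- Map(
    function(ids, n) ids[sample.int(length(ids), n)],
    ids_by_bin, targets
  )
  unlist(picked, use.names = FALSE)
}

# Five bins of five samples each (25 total): every bin is kept whole.
n_to_process(c(A = 5, B = 5, C = 5, D = 5, E = 5))

# Two larger bins (23 total): half of each bin, rounded up.
n_to_process(c(A = 11, B = 12))

# Random selection for the second case; set.seed() only makes the sketch repeatable.
set.seed(1)
picked <- select_subsample(sprintf("S%02d", 1:23), rep(c("A", "B"), c(11, 12)))
length(picked)
```

Indexing with sample.int() rather than calling sample() on the IDs directly avoids the base R behavior where sample() treats a single number as the range 1:n.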
Examples
# connect with database
con <- gr_db_connect()
# subsample from sampling event 1 in season 2024
subsamples <- generate_subsample(con, 1, 2024)