Regression Analysis for Water Temperature Data

The goal of this analysis is to fill gaps where temperature data are missing or the time series is incomplete, making the dataset more useful for SR JPE modeling. Temperature is an important covariate for understanding juvenile production, but the completeness of these data varies by location.

Currently, this analysis relies on linear regression models and is performed for the Feather River and Yuba River. The resulting dataset, which includes the predicted values, is saved and used in developing a water temperature dataset.

Data used to build models: Butte Creek

Butte Creek is used to build the regression models because the time series is complete and the data are high quality.

  • The Butte Creek temperature data cover 1999 to 2024

Overall approach for water temperature regression:

  1. Prepare datasets for regression analysis (the dataset with no missing data is used to train the model; values in the dataset with missing data are predicted using the model)

  2. Fit and evaluate linear regression models for mean, min, and max temperatures

  3. Make predictions for missing data using the fitted models

  4. Combine predictions with actual measurements

  5. Visualize the predicted and actual temperatures over time to assess model performance trends
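
The steps above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code used for this report: the data frame `dat`, the `source` column, and the simulated series are assumptions made for the example. Only the formula shape `temp ~ date + butte_temp` is taken from the model summaries shown in this report. The plotting step is left out.

```r
# Minimal sketch of the workflow with simulated data (all names hypothetical).
set.seed(1)
n <- 200
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = n)

# Simulated reference series (complete) and target series (with a gap).
butte_temp <- 12 + 6 * sin(2 * pi * seq_len(n) / 365) + rnorm(n, sd = 0.5)
temp <- 4 + 0.5 * butte_temp + rnorm(n, sd = 0.3)
temp[51:80] <- NA  # pretend the target gage was offline for 30 days

dat <- data.frame(date = dates, butte_temp = butte_temp, temp = temp)

# 1. Split into rows that can train the model and rows that need prediction.
train <- dat[!is.na(dat$temp), ]
to_predict <- dat[is.na(dat$temp), ]

# 2. Fit a linear regression with date and the reference temperature.
fit <- lm(temp ~ date + butte_temp, data = train)

# 3. Predict the missing values.
to_predict$temp <- predict(fit, newdata = to_predict)

# 4. Combine predictions with observed measurements and flag their origin.
train$source <- "observed"
to_predict$source <- "predicted"
full <- rbind(train, to_predict)
full <- full[order(full$date), ]
```

Keeping a `source` flag on every row makes it easy to distinguish measured from modeled values in later plots and summaries.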

Feather River

Data Preparation and Approach

  1. Pull in gage data from CDEC (GRL will represent the High Flow Channel (HFC) and FRA will represent the Low Flow Channel (LFC))
     • GRL (2003-03-05 to 2007-06-01 H; 2020-01-04 to present): located after Thermalito Afterbay
     • FRA (2002-01-01 to present): located between Lake Oroville and Thermalito Afterbay

  2. Prepare datasets for regression analysis
     • Dataset with no missing data to train and test the model (Butte Creek and Feather River)
     • Dataset with missing data to make predictions (Feather River)

  3. Use data where there are no missing data from either dataset for regression modeling

  4. Use the regression model to make predictions from the testing dataset and evaluate

  5. Use the model to make predictions for missing data
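
The dataset preparation described above might look like the following sketch. The data are simulated, and the object names (`butte`, `feather`, `joined`, `gaps`) and the 80/20 train/test split are assumptions for illustration. The split used in the actual analysis is not stated here.

```r
# Sketch of preparing training, testing, and gap datasets (hypothetical names).
set.seed(42)
dates <- seq(as.Date("2021-01-01"), by = "day", length.out = 100)
butte <- data.frame(date = dates, butte_temp = rnorm(100, mean = 12, sd = 3))
feather <- data.frame(date = dates, temp = rnorm(100, mean = 11, sd = 2))
feather$temp[c(5, 20:29)] <- NA  # simulated gaps in the Feather River record

# Join the two series on date, keeping every Butte Creek date.
joined <- merge(butte, feather, by = "date", all.x = TRUE)

# Rows with both temperatures present feed the model; the rest are gaps to fill.
complete <- joined[complete.cases(joined), ]
gaps <- joined[!complete.cases(joined), ]

# Hold out part of the complete rows for evaluation (an assumed 80/20 split).
train_idx <- sample(nrow(complete), size = floor(0.8 * nrow(complete)))
train <- complete[train_idx, ]
test <- complete[-train_idx, ]
```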

Low Flow Channel (LFC)

Exploratory analysis

Before we developed any models, we explored the relationship between water temperature at each location. There is a linear correlation between mean, min, and max water temperature on Butte Creek and Feather River LFC. For example, the plot below suggests a strong linear relationship between the mean water temperatures of Butte Creek and Feather River LFC. The positive slope of the linear trend line implies that higher water temperatures in Butte Creek are associated with higher water temperatures in Feather River LFC. These visual representations support the results of the linear regression analysis, which identified a statistically significant relationship between the mean, max and min water temperatures of these two locations.

Plot of mean temp for Feather River LFC and Butte Creek
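
A check of this kind can be sketched as below. The vectors `butte_temp` and `lfc_temp` are simulated stand-ins, not the report's data, so the numbers they produce say nothing about the real relationship. The sketch only shows how the correlation, its significance, and the sign of the trend line can be computed.

```r
# Sketch of the exploratory check with simulated temperatures.
set.seed(7)
butte_temp <- runif(150, min = 5, max = 22)
lfc_temp <- 3 + 0.45 * butte_temp + rnorm(150, sd = 1)

# Pearson correlation and its significance test.
r <- cor(butte_temp, lfc_temp)
ct <- cor.test(butte_temp, lfc_temp)

# Slope of the trend line that a scatter plot would show.
trend <- lm(lfc_temp ~ butte_temp)
slope <- unname(coef(trend)["butte_temp"])

print(round(r, 2))
print(ct$p.value < 0.05)
print(slope > 0)
```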

Building regression models

We built three regression models for Feather River LFC, one each for mean, min, and max water temperature. We evaluated the models using the Mean Absolute Percentage Error (MAPE). MAPE for the three models was between roughly 8% and 9%, indicating good predictive accuracy.

  • MAPE for the mean model: 0.08488823, which means the model’s predictions are off by about 8.49% on average.
  • MAPE for the min model: 0.08800298, which means the model’s predictions are off by about 8.80% on average.
  • MAPE for the max model: 0.07784735, which means the model’s predictions are off by about 7.78% on average.
## 
## Call:
## lm(formula = temp ~ date + butte_temp, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9322 -0.8321  0.0508  0.7829  5.8784 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.108e+00  9.996e-01  -7.111 1.69e-12 ***
## date         7.677e-04  5.306e-05  14.469  < 2e-16 ***
## butte_temp   4.171e-01  5.933e-03  70.298  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.363 on 1720 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7573, Adjusted R-squared:  0.757 
## F-statistic:  2684 on 2 and 1720 DF,  p-value: < 2.2e-16
## [1] 0.08457775
## 
## Call:
## lm(formula = temp ~ date + butte_temp, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7646 -0.8095  0.0347  0.7484  6.1049 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.752e+00  9.754e-01  -4.872 1.21e-06 ***
## date         6.408e-04  5.179e-05  12.373  < 2e-16 ***
## butte_temp   3.941e-01  6.345e-03  62.111  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.329 on 1721 degrees of freedom
## Multiple R-squared:  0.7071, Adjusted R-squared:  0.7067 
## F-statistic:  2077 on 2 and 1721 DF,  p-value: < 2.2e-16
## [1] NA
## 
## Call:
## lm(formula = temp ~ date + butte_temp, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4480 -0.7760  0.0212  0.8377  5.8177 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.321e+00  1.039e+00  -8.006 2.16e-15 ***
## date         8.257e-04  5.514e-05  14.975  < 2e-16 ***
## butte_temp   4.413e-01  5.538e-03  79.692  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.423 on 1720 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7982, Adjusted R-squared:  0.7979 
## F-statistic:  3401 on 2 and 1720 DF,  p-value: < 2.2e-16
## [1] 0.08051571
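
For reference, MAPE can be computed with a small helper like the one below. The function `mape()` and its `na.rm` argument are hypothetical; the helper used in this analysis is not shown. The sketch also illustrates one possible reason a MAPE line prints `NA`: if any actual or predicted value in the evaluation data is missing, the mean is `NA` unless missing values are dropped. Whether that is the cause here is an assumption.

```r
# Mean absolute percentage error as a proportion (0.08 means about 8%).
# Note: undefined when an actual value is zero.
mape <- function(actual, predicted, na.rm = FALSE) {
  mean(abs((actual - predicted) / actual), na.rm = na.rm)
}

actual    <- c(10.0, 12.5, NA, 15.0)
predicted <- c(11.0, 12.0, 13.0, 13.5)

mape(actual, predicted)                # NA, because one actual value is missing
mape(actual, predicted, na.rm = TRUE)  # about 0.08
```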
Predictions
  • Combine predictions (mean, min, max data frames into one)
  • Reshape data
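
The combine and reshape steps might be sketched as follows. The three small data frames and their values are invented for illustration, and the direction of the reshape is an assumption: the example stacks the predictions into long format (one row per date and statistic) and then shows the wide form as well.

```r
dates <- as.Date("2000-01-01") + 0:2

# Hypothetical per-statistic prediction data frames.
pred_mean <- data.frame(date = dates, value = c(10.2, 10.4, 10.1))
pred_min  <- data.frame(date = dates, value = c(9.1, 9.3, 9.0))
pred_max  <- data.frame(date = dates, value = c(11.5, 11.8, 11.4))

# Tag each data frame with its statistic, then stack into long format.
pred_mean$statistic <- "mean"
pred_min$statistic  <- "min"
pred_max$statistic  <- "max"
long <- rbind(pred_mean, pred_min, pred_max)

# Reshape to wide format: one row per date, one column per statistic.
wide <- reshape(long, idvar = "date", timevar = "statistic", direction = "wide")
```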

The plot shows the predicted mean temperature of the Feather River LFC over time; the min and max predictions follow a similar pattern. The line traces how the predicted mean changes from date to date, which helps identify patterns or trends in mean water temperature over the prediction period.

Full dataset
  • Join with original data (merge combined predictions with the original dataset that includes gage agency and gage number)
  • Visualize combined data

The plot below shows the mean, min, and max temperatures for the Feather River LFC over time. Interpolated values fill in wherever observed data are missing, producing a continuous temperature dataset. Each water temperature statistic (mean, min, max) is shown in a different color to make the trends and fluctuations easier to distinguish.

## Rows: 26,946
## Columns: 7
## Groups: stream, date, statistic, gage_agency, gage_number, site_group [26,946]
## $ date        <date> 1999-12-31, 1999-12-31, 1999-12-31, 2000-01-01, 2000-01-0…
## $ stream      <chr> "feather river", "feather river", "feather river", "feathe…
## $ site_group  <chr> "upper feather lfc", "upper feather lfc", "upper feather l…
## $ gage_agency <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ gage_number <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ statistic   <chr> "mean", "max", "min", "mean", "max", "min", "mean", "max",…
## $ value       <dbl> 3.304975, 2.844276, 4.160482, 3.375258, 3.286411, 3.964067…
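
A dataset with this structure could be assembled roughly as shown below. This is a sketch under assumptions: the `observed` and `predicted` data frames, their values, and the `pred_value` column are hypothetical. Only the column names `gage_agency`, `gage_number`, `statistic`, and `value`, and the `"interpolated"` label, come from the output above.

```r
dates <- as.Date("2000-01-01") + 0:3

# Hypothetical observed records; the gage has data only for the last two days.
observed <- data.frame(
  date = dates[3:4],
  statistic = "mean",
  value = c(10.8, 11.0),
  gage_agency = "CDEC",
  gage_number = "FRA",
  stringsAsFactors = FALSE
)

# Hypothetical model predictions for every date.
predicted <- data.frame(
  date = dates,
  statistic = "mean",
  pred_value = c(10.1, 10.3, 10.6, 10.9),
  stringsAsFactors = FALSE
)

# Left join: keep every predicted date, attach observations where they exist.
full <- merge(predicted, observed, by = c("date", "statistic"), all.x = TRUE)

# Use the observation when present; otherwise fall back to the prediction
# and mark the gage fields as "interpolated".
is_gap <- is.na(full$value)
full$value[is_gap] <- full$pred_value[is_gap]
full$gage_agency[is_gap] <- "interpolated"
full$gage_number[is_gap] <- "interpolated"
full$pred_value <- NULL
```

Observed values take precedence in this sketch, so predictions appear only where no measurement exists.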

High Flow Channel (HFC)

Exploratory analysis

There is a linear correlation between mean, min, and max water temperature on Butte Creek and Feather River HFC. For example, the plot below suggests a strong linear relationship between the mean water temperatures of Butte Creek and Feather River HFC. The positive slope of the linear trend line implies that higher water temperatures in Butte Creek are associated with higher water temperatures in Feather River HFC. These visual representations support the results of the linear regression analysis, which identified a statistically significant relationship between the mean, max and min water temperatures of these two locations.

Plot of mean temp for Feather River HFC and Butte Creek

Building regression models

We built three regression models for Feather River HFC, one each for mean, min, and max water temperature. We evaluated the models using the Mean Absolute Percentage Error (MAPE). MAPE for the three models ranged from about 10% to 14%, indicating good predictive accuracy.

  • MAPE for the mean model: 0.1252, which means the model’s predictions are off by about 12.52% on average.
  • MAPE for the min model: 0.0973, which means the model’s predictions are off by about 9.73% on average.
  • MAPE for the max model: 0.1364, which means the model’s predictions are off by about 13.64% on average.
Predictions
  • Combine predictions (mean, min, max data frames into one)
  • Reshape data

The plot shows the predicted mean temperature of the Feather River HFC over time; the min and max predictions follow a similar pattern. The line traces how the predicted mean changes from date to date, which helps identify patterns or trends in mean water temperature over the prediction period.

Full dataset
  • Join with original data (merge combined predictions with the original dataset that includes gage agency and gage number)
  • Visualize combined data

The plot below shows the mean, min, and max temperatures for the Feather River HFC over time. Interpolated values fill in wherever observed data are missing, producing a continuous temperature dataset. Each water temperature statistic (mean, min, max) is shown in a different color to make the trends and fluctuations easier to distinguish.

## Rows: 26,982
## Columns: 7
## Groups: stream, date, statistic, gage_agency, gage_number, site_group [26,982]
## $ date        <date> 1999-12-31, 1999-12-31, 1999-12-31, 2000-01-01, 2000-01-0…
## $ stream      <chr> "feather river", "feather river", "feather river", "feathe…
## $ site_group  <chr> "upper feather hfc", "upper feather hfc", "upper feather h…
## $ gage_agency <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ gage_number <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ statistic   <chr> "mean", "max", "min", "mean", "max", "min", "mean", "max",…
## $ value       <dbl> 10.617213, 11.279177, 10.312868, 10.725223, 11.881711, 9.9…

Yuba River

Data Preparation and Approach

  1. Pull in gage data from the YR7 CDEC gage.
     • Note that this gage only contains data from 2020 onward. We originally included temperature data collected during RST data collection in this analysis. However, the RST data include only mean temperatures, and mixing the two data sources produced inconsistencies: the resulting predicted mean values were lower than the min values. We therefore rely on the gage data alone, despite the short time period.

  2. Prepare datasets for regression analysis
     • Dataset with no missing data to train (Butte Creek and Yuba River)
     • Dataset with missing data to predict (Yuba River)

  3. Combine datasets with no missing data and missing data

  4. Identify gaps to predict

  5. Use data where there are no missing data for either dataset for regression modeling
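
The inconsistency noted in step 1, where predicted means fell below predicted minimums, is the kind of problem a simple ordering check can surface. The sketch below is illustrative only: the data frame `wide` and its values are invented, and this check is not necessarily part of the original workflow.

```r
# Hypothetical predictions in wide format, one row per date.
wide <- data.frame(
  date = as.Date("2021-06-01") + 0:3,
  min  = c(14.0, 14.2, 15.1, 13.9),
  mean = c(15.0, 13.8, 15.6, 14.5),
  max  = c(16.1, 16.0, 15.4, 15.2)
)

# A row is consistent when min <= mean <= max.
wide$consistent <- wide$min <= wide$mean & wide$mean <= wide$max

# Dates that need a closer look.
flagged <- wide$date[!wide$consistent]
print(flagged)
```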

Exploratory analysis

There is a linear correlation between mean, min, and max water temperature on Butte Creek and Yuba River. For example, the plot below suggests a strong linear relationship between the mean water temperatures of Butte Creek and Yuba River. The positive slope of the linear trend line implies that higher water temperatures in Butte Creek are associated with higher water temperatures in Yuba River. These visual representations support the results of the linear regression analysis, which identified a statistically significant relationship between the mean, max and min water temperatures of these two locations.

Plot of mean temp for Yuba River and Butte Creek

Building regression models

We built three regression models for Yuba River, one each for mean, min, and max water temperature. We evaluated the models using the Mean Absolute Percentage Error (MAPE). MAPE for all three models was about 8%, indicating good predictive accuracy.

  • MAPE for the mean model: 0.0844, which means the model’s predictions are off by about 8.44% on average.
  • MAPE for the min model: 0.0838, which means the model’s predictions are off by about 8.38% on average.
  • MAPE for the max model: 0.078, which means the model’s predictions are off by about 7.80% on average.
Predictions
  • Combine predictions (mean, min, max data frames into one)
  • Reshape data

The plot shows the predicted mean temperature of the Yuba River over time; the min and max predictions follow a similar pattern. The line traces how the predicted mean changes from date to date, which helps identify patterns or trends in mean water temperature over the prediction period.

Full dataset
  • Join with original data (merge combined predictions with the original dataset that includes gage agency and gage number)
  • Visualize combined data

The plot below shows the mean, min, and max temperatures for the Yuba River over time. Interpolated values fill in wherever observed data are missing, producing a continuous temperature dataset. Each water temperature statistic (mean, min, max) is shown in a different color to make the trends and fluctuations easier to distinguish.

## Rows: 26,946
## Columns: 6
## Groups: stream, date, statistic, gage_agency, gage_number [26,946]
## $ date        <date> 1999-12-31, 1999-12-31, 1999-12-31, 2000-01-01, 2000-01-0…
## $ stream      <chr> "yuba river", "yuba river", "yuba river", "yuba river", "y…
## $ gage_agency <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ gage_number <chr> "interpolated", "interpolated", "interpolated", "interpola…
## $ statistic   <chr> "mean", "max", "min", "mean", "max", "min", "mean", "max",…
## $ value       <dbl> 15.98101, 14.79745, 16.52032, 16.07775, 15.43969, 16.25013…