mlo_smo_prep module
- class ginput.priors.mlo_smo_prep.InsituMonthlyAverager(clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)
Abstract class that handles most of the logic for updating MLO or SMO monthly average files.
To implement a site-specific concrete averager, only two methods need to be implemented: class_site() (a class method that returns the site code, e.g. “MLO” or “SMO”) and select_background(), which takes in an hourly dataframe and returns one with only background data left as rows.
To update a monthly file, use the convert() method.
- abstract classmethod class_site()
Returns the site code (e.g. “MLO”) for the class.
- convert(noaa_hourly_file: str, previous_monthly_file: str, output_monthly_file: str, allow_alt_sites: bool = False, site_id_override: Optional[str] = None, is_seed_file: bool = False) None
Convert a NOAA hourly file to monthly averages and append to the end of an existing file.
- Parameters
noaa_hourly_file – Path to the NOAA hourly file to use to update previous_monthly_file.
previous_monthly_file – Path to the previous monthly averages file. The output file will copy its existing data and append the new monthly average(s) to the end.
output_monthly_file – Path to write the output file
allow_alt_sites – Set to True to allow the input hourly file to contain data with a site ID different from the one defined by self.class_id(). Otherwise, that would raise an InsituProcessingError. Changing to True also changes how mismatches between the input monthly file and the self.class_id() value are reported, but such mismatches do not raise an error in either case.
site_id_override – If given, then this value will be used as the site ID for the new row(s) added to the output monthly file. When this is given, allow_alt_sites has no effect. Mismatches between the ID in the input hourly file and the ID given to this argument are reported, but do not raise an exception.
Note
Passing a string with more than 3 characters for site_id_override should work, but is not officially tested/supported.
is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)
- Returns
Writes to output_monthly_file
- Return type
None
- static get_new_hourly_data(monthly_df: DataFrame, hourly_df: DataFrame, last_expected_month: Timestamp, allow_missing_times: bool = False, creation_month: Optional[Timestamp] = None, limit_to_avail_data: bool = True) Tuple[DataFrame, DatetimeIndex]
Get the subset of hourly_df that has new data to append to the end of monthly_df
- Parameters
monthly_df – A dataframe of monthly-averaged data from the previous monthly-average file that is being updated.
hourly_df – A dataframe of new hourly data, not yet filtered for good-quality data. It must contain all hours for the months it is to add.
last_expected_month – The last month required to have data in the hourly file. Note that this data may be all fill values; the only requirement is that this month and all months after the end of the previous monthly data be present.
allow_missing_times – By default, an error is raised if any of the expected data is not present in the hourly file. Setting this to True reduces that error to a warning.
- Returns
pd.DataFrame – The subset of hourly_df that is new and good quality.
pd.DatetimeIndex – A sequence of dates that are the first of the month for every month to be added by the first return value.
- Raises
InsituProcessingError – If the hourly file is missing any of the expected data between the end of the previous monthly data and the last expected month. If allow_missing_times is True, then this is not raised and a warning is logged instead.
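The month bookkeeping described for get_new_hourly_data() can be sketched with the standard library alone. This is a minimal illustration, not ginput's implementation: the helper names expected_new_months and find_missing_months are hypothetical, plain datetime.date values stand in for pandas timestamps, and the check only looks for months with no rows at all rather than for every individual hour.

```python
from datetime import date


def expected_new_months(last_monthly: date, last_expected: date) -> list:
    """List the first day of every month after `last_monthly`, up to and
    including `last_expected` (both given as first-of-month dates)."""
    months = []
    year, month = last_monthly.year, last_monthly.month
    while True:
        # Step forward one calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        current = date(year, month, 1)
        if current > last_expected:
            break
        months.append(current)
    return months


def find_missing_months(expected: list, hourly_dates: list) -> list:
    """Return the expected months that have no rows at all in the hourly data."""
    present = {date(d.year, d.month, 1) for d in hourly_dates}
    return [m for m in expected if m not in present]


# The previous monthly file ends in November 2021; data is required through February 2022.
new_months = expected_new_months(date(2021, 11, 1), date(2022, 2, 1))
hourly = [date(2021, 12, 5), date(2022, 1, 17)]  # toy hourly timestamps (dates only)
missing = find_missing_months(new_months, hourly)
```

Here missing contains only February 2022. In the terms used above, that is the situation that either raises an error or, with allow_missing_times set, is reduced to a warning.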
- abstract select_background(hourly_df: DataFrame) DataFrame
Select only background data from an hourly dataframe.
- Parameters
hourly_df – The dataframe with all good-quality hourly data
- Returns
A dataframe with only background data kept.
- Return type
pd.DataFrame
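The two-method contract described for this class (class_site() plus select_background()) follows the usual abstract base class pattern. The sketch below shows that pattern in isolation. AveragerSketch and ToyMloAverager are hypothetical stand-ins, not ginput classes, lists of dicts replace the dataframes, and the 0.3 threshold in the toy subclass is only for illustration.

```python
from abc import ABC, abstractmethod


class AveragerSketch(ABC):
    """Hypothetical stand-in for InsituMonthlyAverager; not the ginput class."""

    @classmethod
    @abstractmethod
    def class_site(cls) -> str:
        """Return the site code for the class."""

    @abstractmethod
    def select_background(self, hourly_rows: list) -> list:
        """Return only the rows that count as background data."""

    def monthly_mean(self, hourly_rows: list) -> float:
        """Shared logic: average the background rows chosen by the subclass."""
        background = self.select_background(hourly_rows)
        return sum(row["value"] for row in background) / len(background)


class ToyMloAverager(AveragerSketch):
    @classmethod
    def class_site(cls) -> str:
        return "MLO"

    def select_background(self, hourly_rows: list) -> list:
        # Toy criterion (an assumption for illustration): keep low-variability hours.
        return [row for row in hourly_rows if row["std_dev"] < 0.3]


rows = [
    {"value": 415.0, "std_dev": 0.1},
    {"value": 417.0, "std_dev": 0.2},
    {"value": 430.0, "std_dev": 1.5},  # rejected as non-background
]
averager = ToyMloAverager()
site = averager.class_site()
mean = averager.monthly_mean(rows)
```

The base class keeps the shared averaging logic, while each site supplies only its code and its background filter. Trying to instantiate the abstract base directly raises a TypeError.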
- classmethod write_monthly_insitu(output_file: str, monthly_df: DataFrame, previous_monthly_file: str, new_hourly_file: str, new_months: DatetimeIndex, is_seed_file: bool = False, clobber: bool = False) None
Write a new monthly average file
- Parameters
output_file – Path to write to
monthly_df – Dataframe with the monthly data to write; must have four columns: site, year, month, value.
previous_monthly_file – Path to the previous monthly file
new_hourly_file – Path to the hourly file used for this update
new_months – A sequence of datetimes giving the first of each month added
is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)
clobber – Whether to allow overwriting the output file if it already exists.
- exception ginput.priors.mlo_smo_prep.InsituProcessingError
- class ginput.priors.mlo_smo_prep.MloBackgroundMode(value)
An enumeration.
- class ginput.priors.mlo_smo_prep.MloMonthlyAverager(background_method=MloBackgroundMode.TIME_AND_PRELIM, clobber=False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)
- classmethod class_site()
Returns the site code (e.g. “MLO”) for the class.
- select_background(hourly_df)
Select only background data from an hourly dataframe.
- Parameters
hourly_df – The dataframe with all good-quality hourly data
- Returns
A dataframe with only background data kept.
- Return type
pd.DataFrame
- class ginput.priors.mlo_smo_prep.MloPrelimMode(value)
An enumeration.
- class ginput.priors.mlo_smo_prep.SmoMonthlyAverager(smo_wind_file: str, clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False)
- classmethod class_site()
Returns the site code (e.g. “MLO”) for the class.
- select_background(hourly_df)
Select only background data from an hourly dataframe.
- Parameters
hourly_df – The dataframe with all good-quality hourly data
- Returns
A dataframe with only background data kept.
- Return type
pd.DataFrame
- ginput.priors.mlo_smo_prep.compute_wind_for_times(wind_file: str, times: ~pandas.core.indexes.datetimes.DatetimeIndex, wind_alt: int = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) DataFrame
Compute winds for specific times from a file already interpolated to a specific lat/lon
- Parameters
wind_file –
Either:
A file containing a list of GEOS FP-IT surface files that span the times in the times input, or
A file summarizing the GEOS FP-IT surface variables at the SMO lat/lon. It must have UxM and VxM variables, where “x” is the wind altitude (see wind_alt)
times – Times to interpolate to.
wind_alt – Which surface wind altitude (2, 10, or 50 meters usually) to use. This will look for variables named e.g. U10M and V10M in the GEOS file(s), with the number changing based on the altitude.
run_settings – A RunSettings instance that carries configuration.
allow_missing_geos_files – Set to True to allow this function to complete if any of the expected GEOS files were missing. By default an error is raised.
- Returns
Data frame with the U and V wind vectors, wind velocity, and wind direction indexed by time. The vectors and velocity will have the same units as in the wind_file (usually meters/second) and the wind direction uses the convention of what direction the wind is coming FROM in degrees clockwise from north.
- Return type
pd.DataFrame
Notes
10 is the default wind_alt because Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829) indicates in the “Air intake and topography” section that sampling heights between 6 and 18 meters were suitable.
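The direction convention described in the return value (the direction the wind is coming FROM, in degrees clockwise from north) can be derived from the U and V components as follows. This is a minimal sketch of the standard meteorological conversion; the function name wind_speed_and_direction is hypothetical, and the source does not confirm that ginput computes it in exactly this form.

```python
import math


def wind_speed_and_direction(u: float, v: float) -> tuple:
    """Return (speed, direction), where direction is the direction the wind
    comes FROM, in degrees clockwise from north."""
    speed = math.hypot(u, v)
    # atan2(-u, -v) is the compass bearing of the reversed wind vector, i.e. the
    # direction the wind is coming from; the modulo folds negative angles into
    # the 0-360 range.
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction


# A wind blowing toward the southwest (u < 0, v < 0) comes from the northeast.
speed, direction = wind_speed_and_direction(-3.0, -4.0)
```

With u = -3 and v = -4 the speed is 5 in the input units and the direction is roughly 37 degrees, i.e. a wind from the north-northeast.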
- ginput.priors.mlo_smo_prep.get_smo_winds_from_file(winds_file: str, wind_alt: int = 10) Dataset
Get surface winds interpolated to SMO lat/lon from GEOS surface files.
- Parameters
winds_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in insitu_df must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.
wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times())
- Returns
An xarray dataset containing ‘u’ and ‘v’ variables indexed by time.
- Return type
xr.Dataset
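The "day floor" and "day ceiling" requirement on the GEOS file list can be checked with a few lines of standard-library Python. This is a sketch only; day_floor and day_ceiling are hypothetical helper names, not ginput functions.

```python
from datetime import datetime, timedelta


def day_floor(t: datetime) -> datetime:
    """Round down to midnight at the start of the same day."""
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


def day_ceiling(t: datetime) -> datetime:
    """Round up to the next midnight, unless t is already exactly midnight."""
    floored = day_floor(t)
    return floored if floored == t else floored + timedelta(days=1)


# The example from the parameter description above.
first_required = day_floor(datetime(2021, 8, 2, 16))
last_required = day_ceiling(datetime(2021, 8, 28, 19))
```

For data spanning 2021-08-02 16:00 to 2021-08-28 19:00 this gives 2021-08-02 00:00 and 2021-08-29 00:00, matching the range of GEOS files stated in the description.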
- ginput.priors.mlo_smo_prep.make_geos_2d_file_list(path_pattern: str, start_date, end_date, geos_version: Optional[str] = None) Sequence[str]
Helper function to make a list of GEOS 2D files for the SMO prep.
- Parameters
path_pattern – A pattern that can use strftime formatting (%Y, %m, etc.) to give the correct path to each GEOS 2D file. If geos_version is None, this must give the full path to the file. Otherwise, the file name indicated by geos_version is appended to the end. Note that this will append a “/” to the end of the path if it needs to be concatenated with geos_version and no “/” is present, so this will not work well on Windows.
start_date – First time to include in the list. Any type acceptable to pandas.date_range() will do.
end_date – Last time to include in the list.
geos_version – If this is one of the strings “fpit” or “it”, it will append the appropriate file name pattern to path_pattern for that GEOS version. Any other string is treated as a file name pattern and is directly appended. If this is None, path_pattern must include the file name.
- Returns
A list of file paths as strings.
- Return type
paths
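The way a strftime-style path_pattern expands into a file list can be sketched as below. This is not the ginput implementation: make_file_list_sketch is a hypothetical name, the 3-hour spacing between files is an assumption, and the file name pattern geos_%Y%m%d_%H%M.nc4 is invented for illustration rather than being the pattern selected by “fpit” or “it”.

```python
from datetime import datetime, timedelta


def make_file_list_sketch(path_pattern, start, end, file_pattern=None, step_hours=3):
    """Expand strftime codes in a path pattern for each time step (hypothetical)."""
    if file_pattern is not None:
        # Mirror the documented behavior: add a "/" before the file name if needed.
        if not path_pattern.endswith("/"):
            path_pattern += "/"
        path_pattern += file_pattern
    paths = []
    current = start
    while current <= end:  # the end time is included in the list
        paths.append(current.strftime(path_pattern))
        current += timedelta(hours=step_hours)  # assumed file cadence
    return paths


paths = make_file_list_sketch(
    "/data/geos/%Y/%m",
    datetime(2021, 8, 2, 0),
    datetime(2021, 8, 2, 6),
    file_pattern="geos_%Y%m%d_%H%M.nc4",  # invented name pattern
)
```

The result is three paths under /data/geos/2021/08, one each for 00:00, 03:00, and 06:00. Writing the list one path per line produces the kind of file that the wind_file arguments elsewhere in this module expect.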
- ginput.priors.mlo_smo_prep.merge_insitu_with_wind(insitu_df: ~pandas.core.frame.DataFrame, wind_file: str, wind_alt: float = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) DataFrame
Merge an in situ hourly dataframe of SMO data with GEOS surface winds
- Parameters
insitu_df – The SMO hourly dataframe.
wind_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in insitu_df must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.
wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times())
run_settings – A RunSettings instance that carries configuration for extra outputs.
- Returns
The insitu_df with columns for wind speed, velocity, u-component, and v-component added, each interpolated to the times in the insitu_df.
- Return type
pd.DataFrame
- ginput.priors.mlo_smo_prep.mlo_background_selection(mlo_df: DataFrame, method: MloBackgroundMode) DataFrame
Limit a Mauna Loa hourly dataframe to background data.
- Parameters
mlo_df – The MLO hourly dataframe.
method –
How to do the background selection. The two enum variants are:
TIME_AND_SIGMA: limit to midnight to 7a local time and where the standard deviation is less than 0.3 ppm.
TIME_AND_PRELIM: limit to midnight to 7a local time and where noaa_prelim_flagging() would keep the data.
- Returns
mlo_df with non-background rows removed.
- Return type
pd.DataFrame
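The TIME_AND_SIGMA variant can be illustrated with a small standard-library sketch. Several things here are assumptions rather than facts from the source: the function name time_and_sigma_sketch, the row keys time_utc and std_dev, the premise that the hourly timestamps are in UTC, and the choice to treat "midnight to 7a" as local hours 0 through 6.

```python
from datetime import datetime, timedelta

HST_OFFSET = timedelta(hours=-10)  # Hawaii Standard Time is UTC-10, with no DST


def time_and_sigma_sketch(rows, max_sigma=0.3, first_hour=0, last_hour=7):
    """Keep rows in the local night-time window with a small hourly std. dev."""
    kept = []
    for row in rows:
        local_hour = (row["time_utc"] + HST_OFFSET).hour
        # Assumption: the window covers local hours 0..6 (7:00 itself excluded).
        if first_hour <= local_hour < last_hour and row["std_dev"] < max_sigma:
            kept.append(row)
    return kept


rows = [
    {"time_utc": datetime(2021, 8, 2, 12), "std_dev": 0.1},  # 02:00 local, kept
    {"time_utc": datetime(2021, 8, 2, 13), "std_dev": 0.5},  # 03:00 local, too variable
    {"time_utc": datetime(2021, 8, 2, 22), "std_dev": 0.1},  # 12:00 local, outside window
]
background = time_and_sigma_sketch(rows)
```

Only the first row survives both tests. TIME_AND_PRELIM would use the same time window but replace the standard deviation test with the result of noaa_prelim_flagging().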
- ginput.priors.mlo_smo_prep.monthly_avg_rapid_data(df: DataFrame, year_field: Optional[str] = None, month_field: Optional[str] = None) DataFrame
Compute monthly averages from an hourly dataframe
- Parameters
df – The hourly dataframe to compute from
year_field – Which column in the dataframe gives the year. If this is None, will try to find a column containing “year”.
month_field – Which column in the dataframe gives the month. If this is None, will try to find a column containing “month”.
- Returns
The input dataframe averaged to months, with the index set to datetimes at the start of each month.
- Return type
pd.DataFrame
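The monthly averaging step amounts to grouping rows by year and month and keying each mean by the first day of that month. The sketch below uses a dict in place of a dataframe. The name monthly_average_sketch and the value_field argument are hypothetical, and the sketch assumes fill values have already been removed.

```python
from collections import defaultdict
from datetime import date


def monthly_average_sketch(rows, year_field="year", month_field="month", value_field="value"):
    """Average values per (year, month), keyed by the first day of each month."""
    groups = defaultdict(list)
    for row in rows:
        groups[date(row[year_field], row[month_field], 1)].append(row[value_field])
    # Sort by month so the result reads chronologically.
    return {month: sum(vals) / len(vals) for month, vals in sorted(groups.items())}


rows = [
    {"year": 2021, "month": 8, "value": 410.0},
    {"year": 2021, "month": 8, "value": 412.0},
    {"year": 2021, "month": 9, "value": 414.0},
]
monthly = monthly_average_sketch(rows)
```

The two August rows average to 411.0 under the key 2021-08-01, and the single September row gives 414.0 under 2021-09-01.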
- ginput.priors.mlo_smo_prep.noaa_prelim_flagging(noaa_df: DataFrame, hr_std_dev_max: float = 0.2, hr2hr_diff_max: float = 0.25, mode: MloPrelimMode = MloPrelimMode.TIME_RELAXED_DIFF_EITHER, full_output: bool = False) DataFrame
Do preliminary selection of background data for NOAA hourly in situ data
- Parameters
noaa_df – A NOAA hourly dataframe read by read_hourly_insitu().
hr_std_dev_max – The maximum standard deviation allowed within one hourly data point for it to be kept.
hr2hr_diff_max – Maximum allowed difference between adjacent hourly points for them to be retained. How this is interpreted depends on mode.
mode –
Controls how the hour-to-hour differences are used to reject data points. The different MloPrelimMode variants mean:
TIME_RELAXED_DIFF_EITHER: Keep points when the difference with either adjacent point is less than hr2hr_diff_max. If the time difference is >1 hour (due to the standard deviation or flagging), do not count that VMR difference.
TIME_RELAXED_DIFF_BOTH: Only keep points where the VMR differences are smaller than hr2hr_diff_max or the time difference is >1 hour on both sides.
TIME_STRICT_DIFF_EITHER: Keep a point only if it has at least one neighbor with a VMR difference smaller than hr2hr_diff_max and a time difference < 1 hr.
TIME_STRICT_DIFF_BOTH: Keep a point only if the VMR difference with both neighbors is smaller than hr2hr_diff_max and both time differences are < 1 hr.
full_output – Set to True to output two additional logical vectors that indicate which points pass the hourly standard deviation and hour-to-hour difference criteria. If False, only the dataframe limited to good points is returned.
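One possible reading of the four MloPrelimMode variants is sketched below. It is an interpretation of the descriptions above, not the ginput code. The name prelim_keep_sketch and the relaxed/both flags are hypothetical, a missing neighbor at either end of the series is treated like a time gap, and a spacing of exactly one hour counts as "no gap".

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def prelim_keep_sketch(points, diff_max=0.25, relaxed=True, both=False):
    """points: time-sorted (datetime, value) pairs that already passed the
    hourly standard deviation test. Returns one keep/reject flag per point."""
    keep = []
    for i in range(len(points)):
        sides = []
        for j in (i - 1, i + 1):
            if 0 <= j < len(points):
                gap = abs(points[j][0] - points[i][0]) > ONE_HOUR
                small = abs(points[j][1] - points[i][1]) < diff_max
            else:
                gap, small = True, False  # assumption: missing neighbor acts as a gap
            strict_ok = small and not gap
            # Relaxed modes do not hold a time gap against the point.
            sides.append(strict_ok or (relaxed and gap))
        keep.append(all(sides) if both else any(sides))
    return keep


t0 = datetime(2021, 8, 2, 0)
points = [
    (t0, 410.00),
    (t0 + timedelta(hours=1), 410.10),
    (t0 + timedelta(hours=2), 410.15),
    (t0 + timedelta(hours=3), 411.00),  # jump of 0.85 from the previous hour
    (t0 + timedelta(hours=6), 411.05),  # follows a 3-hour gap
]
relaxed_either = prelim_keep_sketch(points, relaxed=True, both=False)
relaxed_both = prelim_keep_sketch(points, relaxed=True, both=True)
strict_either = prelim_keep_sketch(points, relaxed=False, both=False)
strict_both = prelim_keep_sketch(points, relaxed=False, both=True)
```

Under this reading the relaxed "either" mode keeps all five toy points, while the strict "both" mode keeps only the second point, the one with close neighbors one hour away on each side.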
- ginput.priors.mlo_smo_prep.read_hourly_insitu(hourly_file: str) DataFrame
Read and standardize an hourly in situ file
- Parameters
hourly_file – Path to the NOAA hourly file
- Returns
A dataframe with the hourly data, a pandas.DatetimeIndex, and its columns standardized to “site”, “year”, “month”, “day”, “hour”, “minute”, “value”, “uncertainty”, and “flag”.
- Return type
pd.DataFrame
- ginput.priors.mlo_smo_prep.read_surface_file(surface_file: str, datetime_index: Optional[Tuple[str, str]] = ('year', 'month'), match_v1_columns: bool = True, drop_fills: bool = False, fills_to_nan: bool = True, v3_known_site_codes=('mlo', 'smo')) DataFrame
Read a text file with NOAA surface data, either a monthly average or hourly file
- Parameters
surface_file – Path to the surface file to read
datetime_index – If not None, then must be a tuple giving the names of the year and month columns in the file, to be converted to a monthly datetime index. (Only supported for monthly files.)
- Returns
The data from the surface file as a dataframe. If datetime_index was not None, the index will be a pandas.DatetimeIndex.
- Return type
pd.DataFrame
- ginput.priors.mlo_smo_prep.smo_wind_filter(smo_df: DataFrame, first_wind_dir: float = 330.0, last_wind_dir: float = 160.0, min_wind_speed: float = 2.0) DataFrame
Subset an SMO CO2 dataframe to just rows with specific wind conditions
- Parameters
smo_df – The dataframe of SMO CO2 DMFs with wind data included (see merge_insitu_with_wind())
first_wind_dir, last_wind_dir – These set the range of wind directions permitted; only data with a wind direction in the clockwise slice between first_wind_dir and last_wind_dir are retained.
min_wind_speed – The slowest wind speed allowed; only rows with a wind speed greater than or equal to this are retained.
- Returns
A data frame that has a subset of the rows in smo_df.
- Return type
pd.DataFrame
Notes
The default wind limits come from Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829). In the section “Data Processing,” they give two different criteria for wind direction. Although they found that the looser constraints kept much more data and did not introduce significant numbers of non-background measurements, I am using the stricter criteria, since I am filtering on GEOS FP-IT winds, which likely have some error compared to the surface winds measured at SMO.
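With the default limits the "clockwise slice" from 330 to 160 degrees wraps through north, so a simple first <= direction <= last test is not enough. The sketch below shows one way to handle the wraparound. The names in_clockwise_slice and wind_filter_sketch and the row keys wind_dir and wind_speed are hypothetical, and treating the two direction limits as inclusive is an assumption.

```python
def in_clockwise_slice(direction, first, last):
    """True if direction lies in the clockwise arc from first to last (degrees)."""
    direction %= 360.0
    if first <= last:
        return first <= direction <= last
    # The arc wraps through north, e.g. 330 -> 0 -> 160.
    return direction >= first or direction <= last


def wind_filter_sketch(rows, first_wind_dir=330.0, last_wind_dir=160.0, min_wind_speed=2.0):
    """Keep rows with sufficient wind speed and a direction inside the slice."""
    return [
        row for row in rows
        if row["wind_speed"] >= min_wind_speed
        and in_clockwise_slice(row["wind_dir"], first_wind_dir, last_wind_dir)
    ]


rows = [
    {"wind_dir": 350.0, "wind_speed": 5.0},  # kept: inside the wrapped slice
    {"wind_dir": 90.0, "wind_speed": 1.0},   # rejected: too slow
    {"wind_dir": 200.0, "wind_speed": 6.0},  # rejected: outside the slice
    {"wind_dir": 45.0, "wind_speed": 2.0},   # kept: speed equal to the minimum
]
kept = wind_filter_sketch(rows)
```

Only the first and last rows pass. The last one shows the "greater than or equal to" wind speed rule from the parameter description.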