mlo_smo_prep module

class ginput.priors.mlo_smo_prep.InsituMonthlyAverager(clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)

Abstract class that handles most of the logic for updating MLO or SMO monthly average files.

To implement a site-specific concrete averager, only two methods need to be implemented: class_site() (a class method that returns the site code, e.g. “MLO” or “SMO”) and select_background(), which takes an hourly dataframe and returns one with only the background rows kept.

To update a monthly file, use the convert() method.
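The two-method contract described above can be illustrated without ginput installed. This is a minimal sketch under stated assumptions: AveragerSketch is a hypothetical stand-in for InsituMonthlyAverager, NightOnlyAverager is an invented subclass, and hourly rows are plain dictionaries rather than a dataframe. Only the requirement to implement class_site() and select_background() comes from the documentation.

```python
from abc import ABC, abstractmethod


class AveragerSketch(ABC):
    """Hypothetical stand-in for InsituMonthlyAverager (not the ginput class)."""

    @classmethod
    @abstractmethod
    def class_site(cls) -> str:
        """Return the site code, e.g. "MLO" or "SMO"."""

    @abstractmethod
    def select_background(self, hourly_rows):
        """Return only the rows that count as background data."""


class NightOnlyAverager(AveragerSketch):
    """Invented concrete averager that keeps rows with hour in [0, 7)."""

    @classmethod
    def class_site(cls) -> str:
        return "MLO"

    def select_background(self, hourly_rows):
        # Assumption: each row is a dict with an integer "hour" key.
        return [row for row in hourly_rows if 0 <= row["hour"] < 7]
```

Because both methods are abstract, instantiating AveragerSketch directly raises a TypeError, while NightOnlyAverager can be instantiated once both are defined.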

abstract classmethod class_site(): Returns the site code (e.g. “MLO”) for the class.

convert(noaa_hourly_file: str, previous_monthly_file: str, output_monthly_file: str, allow_alt_sites: bool = False, site_id_override: Optional[str] = None, is_seed_file: bool = False) → None

Convert a NOAA hourly file to monthly averages and append to the end of an existing file.

Parameters

noaa_hourly_file – Path to the NOAA hourly file to use to update previous_monthly_file.
previous_monthly_file – Path to the previous monthly averages file. The output file will copy its existing data and append the new monthly average(s) to the end.
output_monthly_file – Path to write the output file.
allow_alt_sites – Set to True to allow the input hourly file to contain data with a site ID different from the one returned by self.class_site(); otherwise, such data raises an InsituProcessingError. Setting this to True also changes how mismatches between the input monthly file and the self.class_site() value are reported, but such mismatches do not raise an error in either case.
site_id_override –
If given, then this value will be used as the site ID for the new row(s) added to the output monthly file. When this is given, allow_alt_sites has no effect. Mismatches between the ID in the input hourly file and the ID given to this argument are reported, but do not raise an exception.

Note

Passing a string with more than 3 characters for site_id_override should work, but is not officially tested/supported.
is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)

Returns

Writes to output_monthly_file

Return type

None

static get_new_hourly_data(monthly_df: DataFrame, hourly_df: DataFrame, last_expected_month: Timestamp, allow_missing_times: bool = False, creation_month: Optional[Timestamp] = None, limit_to_avail_data: bool = True) → Tuple[DataFrame, DatetimeIndex]

Get the subset of hourly_df that has new data to append to the end of monthly_df

Parameters

monthly_df – A dataframe of monthly-averaged data from the previous monthly-average file that is being updated.
hourly_df – A dataframe of new hourly data, not yet filtered for good-quality data. It must contain all hours for the months it is to add.
last_expected_month – The last month required to have data in the hourly file. This data may be all fill values; the only requirement is that this month and all months after the end of the previous monthly data be present.
allow_missing_times – By default, an error is raised if any of the expected data is not present in the hourly file. Setting this to True reduces that error to a warning.

Returns

pd.DataFrame – The subset of hourly_df that is new and good quality.
pd.DatetimeIndex – A sequence of dates that are the first of the month for every month to be added by the first return value.

Raises

InsituProcessingError – Raised if the hourly file is missing any of the expected data between the end of the previous monthly data and the last expected month. If allow_missing_times is True, then this is not raised and a warning is logged instead.
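The second return value is described as the first of the month for every month to be added. A minimal sketch of that bookkeeping follows, using only the standard library; months_to_add is a hypothetical helper, not part of ginput, and it assumes the new months run from the month after the last existing one through last_expected_month inclusive.

```python
from datetime import date
from typing import List


def months_to_add(last_existing_month: date, last_expected_month: date) -> List[date]:
    """List the first day of each month after last_existing_month,
    up to and including the month of last_expected_month."""
    months = []
    year, month = last_existing_month.year, last_existing_month.month
    while True:
        # Step forward one calendar month, rolling over the year if needed.
        month += 1
        if month > 12:
            year, month = year + 1, 1
        current = date(year, month, 1)
        if current > last_expected_month:
            break
        months.append(current)
    return months
```

For a monthly file ending in November 2021 and a last expected month of February 2022, this gives the first days of December, January, and February.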

abstract select_background(hourly_df: DataFrame) → DataFrame

Select only background data from an hourly dataframe.

Parameters: hourly_df – The dataframe with all good-quality hourly data
Returns: A dataframe with only background data kept.
Return type: pd.DataFrame

classmethod write_monthly_insitu(output_file: str, monthly_df: DataFrame, previous_monthly_file: str, new_hourly_file: str, new_months: DatetimeIndex, is_seed_file: bool = False, clobber: bool = False) → None

Write a new monthly average file

Parameters

output_file – Path to write to
monthly_df – Dataframe with the monthly data to write; must have four columns: site, year, month, value.
previous_monthly_file – Path to the previous monthly file
new_hourly_file – Path to the hourly file used for this update
new_months – A sequence of datetimes giving the first of each month added
is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)
clobber – Whether to allow overwriting the output file if it already exists.

exception ginput.priors.mlo_smo_prep.InsituProcessingError

class ginput.priors.mlo_smo_prep.MloBackgroundMode(value): An enumeration.

class ginput.priors.mlo_smo_prep.MloMonthlyAverager(background_method=MloBackgroundMode.TIME_AND_PRELIM, clobber=False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)

classmethod class_site(): Returns the site code (e.g. “MLO”) for the class.

select_background(hourly_df)

Select only background data from an hourly dataframe.

Parameters: hourly_df – The dataframe with all good-quality hourly data
Returns: A dataframe with only background data kept.
Return type: pd.DataFrame

class ginput.priors.mlo_smo_prep.MloPrelimMode(value): An enumeration.

class ginput.priors.mlo_smo_prep.SmoMonthlyAverager(smo_wind_file: str, clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False)

classmethod class_site(): Returns the site code (e.g. “MLO”) for the class.

select_background(hourly_df)

Select only background data from an hourly dataframe.

Parameters: hourly_df – The dataframe with all good-quality hourly data
Returns: A dataframe with only background data kept.
Return type: pd.DataFrame

ginput.priors.mlo_smo_prep.compute_wind_for_times(wind_file: str, times: ~pandas.core.indexes.datetimes.DatetimeIndex, wind_alt: int = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) → DataFrame

Compute winds for specific times from a file already interpolated to a specific lat/lon

Parameters

wind_file –
Either:
1. A file containing a list of GEOS FP-IT surface files that span the times in the times input, or
2. A file summarizing the GEOS FP-IT surface variables at the SMO lat/lon. It must have UxM and VxM variables, where “x” is the wind altitude (see wind_alt)
times – Times to interpolate to.
wind_alt – Which surface wind altitude (2, 10, or 50 meters usually) to use. This will look for variables named e.g. U10M and V10M in the GEOS file(s), with the number changing based on the altitude.
run_settings – A RunSettings instance that carries configuration.
allow_missing_geos_files – Set to True to allow this function to complete if any of the expected GEOS files were missing. By default an error is raised.

Returns

Data frame with the U and V wind vectors, wind velocity, and wind direction indexed by time. The vectors and velocity will have the same units as in the wind_file (usually meters/second), and the wind direction uses the convention of the direction the wind is coming FROM, in degrees clockwise from north.

Return type

pd.DataFrame

Notes

10 is the default wind_alt because Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829) indicates in the “Air intake and topography” section that sampling heights between 6 and 18 meters were suitable.
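The direction convention in the return value (the direction the wind is coming FROM, in degrees clockwise from north) can be made concrete with a short sketch. wind_speed_and_direction is a hypothetical helper, not the ginput implementation; it assumes u is the eastward component and v the northward component, as is conventional for U10M/V10M-style variables.

```python
import math


def wind_speed_and_direction(u: float, v: float):
    """Return (speed, direction) for eastward (u) and northward (v) components.

    Direction is where the wind comes FROM, in degrees clockwise from north,
    so a wind blowing toward the south (u=0, v<0) has direction 0.
    """
    speed = math.hypot(u, v)
    # Negating both components flips the vector from "blowing toward" to
    # "coming from"; atan2(x, y) then measures the angle clockwise from north.
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction
```

With this convention, a wind blowing toward the west (u negative, v zero) comes from the east and has a direction of 90 degrees.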

ginput.priors.mlo_smo_prep.get_smo_winds_from_file(winds_file: str, wind_alt: int = 10) → Dataset

Get surface winds interpolated to SMO lat/lon from GEOS surface files.

Parameters

winds_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in insitu_df must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.
wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times()).

Returns

An xarray dataset containing ‘u’ and ‘v’ variables indexed by time.

Return type

xr.Dataset

ginput.priors.mlo_smo_prep.make_geos_2d_file_list(path_pattern: str, start_date, end_date, geos_version: Optional[str] = None) → Sequence[str]

Helper function to make a list of GEOS 2D files for the SMO prep.

Parameters

path_pattern – A pattern that can use strftime formatting (%Y, %m, etc.) to give the correct path to each GEOS 2D file. If geos_version is None, this must give the full path to the file. Otherwise, the file name indicated by geos_version is appended to the end. Note that this will append a “/” to the end of the path if it needs to be concatenated with geos_version and no “/” is present, so this will not work well on Windows.
start_date – First time to include in the list. Any type acceptable to pandas.date_range() will do.
end_date – Last time to include in the list.
geos_version – If this is one of the strings “fpit” or “it”, it will append the appropriate file name pattern to path_pattern for that GEOS version. Any other string is treated as a file name pattern and is directly appended. If this is None, path_pattern must include the file name.

Returns

A list of file paths as strings.

Return type

Sequence[str]
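The way a strftime-style path_pattern expands into one path per time step can be sketched with the standard library alone. expand_path_pattern is a hypothetical helper, not the ginput function: the 3-hour step, the example directory layout, and the example file name pattern are all assumptions chosen for illustration. The handling of the trailing “/” and the inclusive end time follow the parameter descriptions above.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def expand_path_pattern(
    path_pattern: str,
    start: datetime,
    end: datetime,
    file_pattern: Optional[str] = None,
    step: timedelta = timedelta(hours=3),  # assumption: 3-hourly files
) -> List[str]:
    """Format path_pattern with strftime for each time from start to end inclusive."""
    if file_pattern is not None:
        # Mirror the documented behavior: add a "/" before the file name if missing.
        if not path_pattern.endswith("/"):
            path_pattern += "/"
        path_pattern += file_pattern
    paths = []
    current = start
    while current <= end:
        paths.append(current.strftime(path_pattern))
        current += step
    return paths
```

For instance, calling it with the pattern "/data/%Y/%m", a hypothetical file pattern "geos_%Y%m%d_%H%M.nc", and times from 00:00 to 06:00 on 2021-08-02 yields three paths under /data/2021/08/.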

ginput.priors.mlo_smo_prep.merge_insitu_with_wind(insitu_df: ~pandas.core.frame.DataFrame, wind_file: str, wind_alt: float = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) → DataFrame

Merge an SMO in situ hourly dataframe with GEOS surface winds

Parameters

insitu_df – The SMO hourly dataframe.
wind_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in insitu_df must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.
wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times()).
run_settings – A RunSettings instance that carries configuration for extra outputs.

Returns

The insitu_df with columns for wind speed, direction, u-component, and v-component added, each interpolated to the times in the insitu_df.

Return type

pd.DataFrame
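The “day floor and day ceiling” requirement on the GEOS files listed in wind_file can be checked with a small calculation. required_geos_range is a hypothetical helper, not part of ginput; it assumes naive datetimes in UTC and reproduces the example from the wind_file description.

```python
from datetime import datetime, timedelta


def required_geos_range(first_time: datetime, last_time: datetime):
    """Return (day floor of first_time, day ceiling of last_time)."""
    start = first_time.replace(hour=0, minute=0, second=0, microsecond=0)
    end = last_time.replace(hour=0, minute=0, second=0, microsecond=0)
    if end < last_time:
        # last_time is not exactly midnight, so round up to the next day.
        end += timedelta(days=1)
    return start, end


start, end = required_geos_range(
    datetime(2021, 8, 2, 16, 0), datetime(2021, 8, 28, 19, 0)
)
# start is 2021-08-02 00:00 and end is 2021-08-29 00:00, matching the text above.
```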

ginput.priors.mlo_smo_prep.mlo_background_selection(mlo_df: DataFrame, method: MloBackgroundMode) → DataFrame

Limit a Mauna Loa hourly dataframe to background data.

Parameters

mlo_df – The MLO hourly dataframe.
method –
How to do the background selection. The two enum variants are:
- TIME_AND_SIGMA: limit to midnight to 7 a.m. local time and where the standard deviation is less than 0.3 ppm.
- TIME_AND_PRELIM: limit to midnight to 7 a.m. local time and where noaa_prelim_flagging() would keep the data.

Returns

mlo_df with non-background rows removed.

Return type

pd.DataFrame
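The TIME_AND_SIGMA rule can be sketched for a single hourly point as follows. This is an illustration, not the ginput code: is_background is a hypothetical helper, the input time is assumed to be a naive UTC datetime, local time is taken as Hawaii Standard Time (UTC-10), and treating 7 a.m. as an exclusive upper bound is an assumption.

```python
from datetime import datetime, timedelta

# Hawaii Standard Time is 10 hours behind UTC and has no daylight saving time.
HST_OFFSET = timedelta(hours=-10)


def is_background(utc_time: datetime, std_dev: float, max_std_dev: float = 0.3) -> bool:
    """Keep points between midnight and 7 a.m. local time with a small std. dev."""
    local_hour = (utc_time + HST_OFFSET).hour
    in_night_window = 0 <= local_hour < 7  # assumption: 7 a.m. itself is excluded
    return in_night_window and std_dev < max_std_dev
```

A point at 12:00 UTC falls at 02:00 local time and is kept if its standard deviation is below 0.3 ppm, while a point at 22:00 UTC (12:00 local) is rejected regardless.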

ginput.priors.mlo_smo_prep.monthly_avg_rapid_data(df: DataFrame, year_field: Optional[str] = None, month_field: Optional[str] = None) → DataFrame

Compute monthly averages from an hourly dataframe

Parameters

df – The hourly dataframe to compute from
year_field – Which column in the dataframe gives the year. If this is None, will try to find a column containing “year”.
month_field – Which column in the dataframe gives the month. If this is None, will try to find a column containing “month”.

Returns

The input dataframe averaged to months, with the index set to datetimes at the start of each month.

Return type

pd.DataFrame
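The averaging step amounts to grouping hourly values by year and month and keying each mean by the first day of that month. A minimal stdlib sketch of that idea is below; monthly_means is a hypothetical helper that works on a list of dictionaries with "year", "month", and "value" keys, standing in for the dataframe operation, and it uses a simple unweighted mean, which may differ from what ginput does.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(rows):
    """Average "value" per (year, month), keyed by the first day of each month."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["month"])].append(row["value"])
    # Sort so the result is in chronological order.
    return {date(y, m, 1): mean(vals) for (y, m), vals in sorted(groups.items())}
```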

ginput.priors.mlo_smo_prep.noaa_prelim_flagging(noaa_df: DataFrame, hr_std_dev_max: float = 0.2, hr2hr_diff_max: float = 0.25, mode: MloPrelimMode = MloPrelimMode.TIME_RELAXED_DIFF_EITHER, full_output: bool = False) → DataFrame

Do preliminary selection of background data for NOAA hourly in situ data

Parameters

noaa_df – A NOAA hourly dataframe read by read_hourly_insitu().
hr_std_dev_max – The maximum standard deviation allowed within one hourly data point for it to be kept.
hr2hr_diff_max – Maximum allowed difference between adjacent hourly points for them to be retained. How this is interpreted depends on mode.
mode –
Controls how the hour-to-hour differences are used to reject data points. The different MloPrelimMode variants mean:
- TIME_RELAXED_DIFF_EITHER: Keep points when the difference with either adjacent point is less than hr2hr_diff_max. If the time difference is >1 hour (due to the standard deviation or flagging), do not count that VMR difference.
- TIME_RELAXED_DIFF_BOTH: Only keep points where, on both sides, the VMR difference is smaller than hr2hr_diff_max or the time difference is >1 hour.
- TIME_STRICT_DIFF_EITHER: Keep a point only if it has at least one neighbor with a VMR difference smaller than hr2hr_diff_max and a time difference < 1 hr.
- TIME_STRICT_DIFF_BOTH: Keep a point only if the VMR difference with both neighbors is smaller than hr2hr_diff_max and both time differences are < 1 hr.
full_output – Set to True to output two additional logical vectors that indicate which points pass the hourly standard deviation and hour-to-hour difference criteria. If False, only the dataframe limited to good points is returned.
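To make the hour-to-hour criterion concrete, here is a sketch of one reading of the TIME_STRICT_DIFF_BOTH variant. It is not the ginput implementation: strict_both_keep is a hypothetical helper, "adjacent in time" is interpreted as no more than max_gap (one hour) apart, and a point with a missing neighbor (the first or last point) is assumed to fail. The default hr2hr_diff_max of 0.25 matches the signature above.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def strict_both_keep(
    times: Sequence[datetime],
    values: Sequence[float],
    hr2hr_diff_max: float = 0.25,
    max_gap: timedelta = timedelta(hours=1),  # assumption: neighbors <= 1 hour apart
) -> List[bool]:
    """Flag points whose two neighbors are both close in time and in value."""
    keep = []
    for i in range(len(values)):
        both_ok = True
        for j in (i - 1, i + 1):
            if not 0 <= j < len(values):
                both_ok = False  # assumption: a missing neighbor fails the test
                continue
            close_in_time = abs(times[j] - times[i]) <= max_gap
            close_in_value = abs(values[j] - values[i]) < hr2hr_diff_max
            both_ok = both_ok and close_in_time and close_in_value
        keep.append(both_ok)
    return keep
```

The EITHER variants would replace the "both neighbors" requirement with "at least one neighbor", and the RELAXED variants would treat a time gap larger than one hour as passing rather than failing.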

ginput.priors.mlo_smo_prep.read_hourly_insitu(hourly_file: str) → DataFrame

Read and standardize an hourly in situ file

Parameters: hourly_file – Path to the NOAA hourly file
Returns: A dataframe with the hourly data, a pandas.DatetimeIndex, and its columns standardized to “site”, “year”, “month”, “day”, “hour”, “minute”, “value”, “uncertainty”, and “flag”.
Return type: pd.DataFrame

ginput.priors.mlo_smo_prep.read_surface_file(surface_file: str, datetime_index: Optional[Tuple[str, str]] = ('year', 'month'), match_v1_columns: bool = True, drop_fills: bool = False, fills_to_nan: bool = True, v3_known_site_codes=('mlo', 'smo')) → DataFrame

Read a text file with NOAA surface data, either a monthly average or hourly file

Parameters

surface_file – Path to the surface file to read
datetime_index – If not None, then must be a tuple giving the names of the year and month columns in the file, to be converted to a monthly datetime index. (Only supported for monthly files.)

Returns

The data from the surface file as a dataframe. If datetime_index was not None, the index will be a pandas.DatetimeIndex.

Return type

pd.DataFrame

ginput.priors.mlo_smo_prep.smo_wind_filter(smo_df: DataFrame, first_wind_dir: float = 330.0, last_wind_dir: float = 160.0, min_wind_speed: float = 2.0) → DataFrame

Subset an SMO CO2 dataframe to just rows with specific wind conditions

Parameters

smo_df – The dataframe of SMO CO2 DMFs with wind data included (see merge_insitu_with_wind())
first_wind_dir, last_wind_dir – These set the range of wind directions permitted; only data with a wind direction in the clockwise slice from first_wind_dir to last_wind_dir are retained.
min_wind_speed – The slowest wind speed allowed; only rows with a wind speed greater than or equal to this are retained.

Returns

A data frame that has a subset of the rows in smo_df.

Return type

pd.DataFrame

Notes

The default wind limits come from Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829). In the section “Data Processing,” they give two different criteria for wind direction. Although they found that the looser constraints kept much more data and did not introduce significant numbers of non-background measurements, I am using the stricter criteria, since I am filtering on GEOS FP-IT winds, which likely have some error compared to the surface winds measured at SMO.
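With the defaults, the permitted slice runs clockwise from 330 degrees through north to 160 degrees, so it wraps past 0. A minimal sketch of that test for a single row follows; in_clockwise_slice and passes_wind_filter are hypothetical helpers rather than the ginput code, and treating both direction endpoints as inclusive is an assumption. The speed comparison follows the "greater than or equal to" wording above.

```python
def in_clockwise_slice(direction: float, first_dir: float = 330.0, last_dir: float = 160.0) -> bool:
    """True if direction lies in the clockwise slice from first_dir to last_dir."""
    direction %= 360.0
    if first_dir <= last_dir:
        # Slice does not cross north, e.g. 90 to 180.
        return first_dir <= direction <= last_dir
    # Slice wraps through 0 degrees, e.g. 330 to 160.
    return direction >= first_dir or direction <= last_dir


def passes_wind_filter(
    direction: float,
    speed: float,
    first_dir: float = 330.0,
    last_dir: float = 160.0,
    min_speed: float = 2.0,
) -> bool:
    """Combine the direction slice with the minimum wind speed."""
    return in_clockwise_slice(direction, first_dir, last_dir) and speed >= min_speed
```

A wind from 350 degrees at 3 m/s passes, while one from 200 degrees, or one from 350 degrees at only 1 m/s, does not.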