mlo_smo_prep module

class ginput.priors.mlo_smo_prep.InsituMonthlyAverager(clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)

Abstract class that handles most of the logic for updating MLO or SMO monthly average files.

To implement a site-specific concrete averager, only two methods need to be implemented: class_site() (a class method that returns the site code, e.g. “MLO” or “SMO”) and select_background(), which takes in an hourly dataframe and returns one with only the background rows kept.

To update a monthly file, use the convert() method.
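The two-method pattern described above can be sketched without ginput installed. In the sketch below, AveragerSketch is a hypothetical stand-in for the abstract base class (it is not the ginput class), the site code "XYZ" is made up, and plain lists of dicts take the place of hourly dataframes.

```python
from abc import ABC, abstractmethod


class AveragerSketch(ABC):
    """Hypothetical stand-in for the abstract averager (not the ginput class)."""

    @classmethod
    @abstractmethod
    def class_site(cls) -> str:
        """Return the site code for the class."""

    @abstractmethod
    def select_background(self, hourly_rows):
        """Return only the rows that count as background data."""


class NightOnlyAverager(AveragerSketch):
    """Concrete subclass: only the two abstract methods are filled in."""

    @classmethod
    def class_site(cls) -> str:
        return "XYZ"  # made-up site code, for illustration only

    def select_background(self, hourly_rows):
        # Made-up rule: keep rows whose hour is before 07:00.
        return [row for row in hourly_rows if row["hour"] < 7]


rows = [{"hour": 3, "value": 410.2}, {"hour": 14, "value": 412.8}]
averager = NightOnlyAverager()
background = averager.select_background(rows)
```

A real subclass would inherit from InsituMonthlyAverager and work on pandas dataframes; the shape of the override is the same.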

abstract classmethod class_site()

Returns the site code (e.g. “MLO”) for the class.

convert(noaa_hourly_file: str, previous_monthly_file: str, output_monthly_file: str, allow_alt_sites: bool = False, site_id_override: Optional[str] = None, is_seed_file: bool = False) None

Convert a NOAA hourly file to monthly averages and append to the end of an existing file.

Parameters
  • noaa_hourly_file – Path to the NOAA hourly file to use to update previous_monthly_file.

  • previous_monthly_file – Path to the previous monthly averages file. The output file will copy its existing data and append the new monthly average(s) to the end.

  • output_monthly_file – Path to write the output file

  • allow_alt_sites – Set to True to allow the input hourly file to contain data with a site ID different from the one defined by self.class_site(). Otherwise, that would raise an InsituProcessingError. Setting this to True also changes how mismatches between the input monthly file and the self.class_site() value are reported, but such mismatches do not raise an error in either case.

  • site_id_override

    If given, then this value will be used as the site ID for the new row(s) added to the output monthly file. When this is given, allow_alt_sites has no effect. Mismatches between the ID in the input hourly file and the ID given to this argument are reported, but do not raise an exception.

    Note

    Passing a string with more than 3 characters for site_id_override should work, but is not officially tested/supported.

  • is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)

Returns

Writes to output_monthly_file

Return type

None

static get_new_hourly_data(monthly_df: DataFrame, hourly_df: DataFrame, last_expected_month: Timestamp, allow_missing_times: bool = False, creation_month: Optional[Timestamp] = None, limit_to_avail_data: bool = True) Tuple[DataFrame, DatetimeIndex]

Get the subset of hourly_df that has new data to append to the end of monthly_df

Parameters
  • monthly_df – A dataframe of monthly-averaged data from the previous monthly-average file that is being updated.

  • hourly_df – A dataframe of new hourly data, not yet filtered for good-quality data. It must contain all hours for the months it is to add.

  • last_expected_month – The last month required to have data in the hourly file. Note that this data may be all fill values; the only requirement is that this month and all months after the end of the previous monthly data be present.

  • allow_missing_times – By default, an error is raised if any of the expected data is not present in the hourly file. Setting this to True reduces that error to a warning.

Returns

  • pd.DataFrame – The subset of hourly_df that is new and good quality.

  • pd.DatetimeIndex – A sequence of dates that are the first of the month for every month to be added by the first return value.

Raises

InsituProcessingError – Raised if the hourly file is missing any of the expected data between the end of the previous monthly data and the last expected month. If allow_missing_times is True, then this is not raised and a warning is logged instead.
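As a rough illustration of the second return value and the missing-data check, the sketch below builds the first-of-month index for the months to be appended and finds which of them are absent from an hourly index. All dates and variable names are hypothetical, and this is only one way to express the idea with pandas, not the function's actual code.

```python
import pandas as pd

# Hypothetical inputs: the previous monthly file ends in May 2021 and the
# hourly file is expected to reach August 2021.
last_month_in_previous_file = pd.Timestamp("2021-05-01")
last_expected_month = pd.Timestamp("2021-08-01")

# First-of-month timestamps for every month that should be appended.
first_new_month = last_month_in_previous_file + pd.offsets.MonthBegin(1)
new_months = pd.date_range(first_new_month, last_expected_month, freq="MS")

# A hypothetical hourly index that stops at the end of July 2021.
hourly_times = pd.date_range(
    "2021-06-01", "2021-07-31 23:00", freq=pd.Timedelta(hours=1)
)

# Months that are expected but have no hourly rows at all.
present = set(zip(hourly_times.year, hourly_times.month))
missing = [m for m in new_months if (m.year, m.month) not in present]
```

Here missing holds August 2021. By the description above, that is the situation that raises InsituProcessingError unless allow_missing_times is True.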

abstract select_background(hourly_df: DataFrame) DataFrame

Select only background data from an hourly dataframe.

Parameters

hourly_df – The dataframe with all good-quality hourly data

Returns

A dataframe with only background data kept.

Return type

pd.DataFrame

classmethod write_monthly_insitu(output_file: str, monthly_df: DataFrame, previous_monthly_file: str, new_hourly_file: str, new_months: DatetimeIndex, is_seed_file: bool = False, clobber: bool = False) None

Write a new monthly average file

Parameters
  • output_file – Path to write to

  • monthly_df – Dataframe with the monthly data to write; must have four columns: site, year, month, value.

  • previous_monthly_file – Path to the previous monthly file

  • new_hourly_file – Path to the hourly file used for this update

  • new_months – A sequence of datetimes giving the first of each month added

  • is_seed_file – Whether this is the first monthly average file built from a NOAA monthly average file, rather than a previous ginput-managed monthly average file. (This controls how the header is constructed.)

  • clobber – Whether to allow overwriting the output file if it already exists.

exception ginput.priors.mlo_smo_prep.InsituProcessingError
class ginput.priors.mlo_smo_prep.MloBackgroundMode(value)

An enumeration.

class ginput.priors.mlo_smo_prep.MloMonthlyAverager(background_method=MloBackgroundMode.TIME_AND_PRELIM, clobber=False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>)
classmethod class_site()

Returns the site code (e.g. “MLO”) for the class.

select_background(hourly_df)

Select only background data from an hourly dataframe.

Parameters

hourly_df – The dataframe with all good-quality hourly data

Returns

A dataframe with only background data kept.

Return type

pd.DataFrame

class ginput.priors.mlo_smo_prep.MloPrelimMode(value)

An enumeration.

class ginput.priors.mlo_smo_prep.SmoMonthlyAverager(smo_wind_file: str, clobber: bool = False, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False)
classmethod class_site()

Returns the site code (e.g. “MLO”) for the class.

select_background(hourly_df)

Select only background data from an hourly dataframe.

Parameters

hourly_df – The dataframe with all good-quality hourly data

Returns

A dataframe with only background data kept.

Return type

pd.DataFrame

ginput.priors.mlo_smo_prep.compute_wind_for_times(wind_file: str, times: ~pandas.core.indexes.datetimes.DatetimeIndex, wind_alt: int = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) DataFrame

Compute winds for specific times from a file already interpolated to a specific lat/lon

Parameters
  • wind_file

    Either:

    1. A file containing a list of GEOS FP-IT surface files that span the times in the times input, or

    2. A file summarizing the GEOS FP-IT surface variables at the SMO lat/lon. It must have UxM and VxM variables, where “x” is the wind altitude (see wind_alt)

  • times – Times to interpolate to.

  • wind_alt – Which surface wind altitude (2, 10, or 50 meters usually) to use. This will look for variables named e.g. U10M and V10M in the GEOS file(s), with the number changing based on the altitude.

  • run_settings – A RunSettings instance that carries configuration.

  • allow_missing_geos_files – Set to True to allow this function to complete if any of the expected GEOS files were missing. By default an error is raised.

Returns

Data frame with the U and V wind vectors, wind velocity, and wind direction indexed by time. The vectors and velocity will have the same units as in the wind_file (usually meters/second) and the wind direction uses the convention of what direction the wind is coming FROM in degrees clockwise from north.

Return type

pd.DataFrame

Notes

10 is the default wind_alt because Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829) indicates in the “Air intake and topography” section that sampling heights between 6 and 18 meters were suitable.
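The direction convention in the return value (the direction the wind is coming FROM, in degrees clockwise from north) can be sketched as below. The function name is hypothetical, and this shows only the standard conversion from U/V components; it is not taken from the ginput source.

```python
import math


def wind_speed_and_direction(u, v):
    """Convert eastward (u) and northward (v) wind components to a speed and
    a 'coming from' direction in degrees clockwise from north."""
    speed = math.hypot(u, v)
    # Negating both components turns the 'blowing toward' vector into the
    # 'coming from' vector; atan2(x, y) measures clockwise from north.
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction


# A wind blowing toward the south (v < 0) comes from the north.
speed_n, dir_n = wind_speed_and_direction(0.0, -5.0)
# A wind blowing toward the west (u < 0) comes from the east.
speed_e, dir_e = wind_speed_and_direction(-5.0, 0.0)
```

With these inputs the directions are 0 degrees (from the north) and 90 degrees (from the east), both with a speed of 5 in the input units.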

ginput.priors.mlo_smo_prep.get_smo_winds_from_file(winds_file: str, wind_alt: int = 10) Dataset

Get surface winds interpolated to SMO lat/lon from GEOS surface files.

Parameters
  • winds_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in the in situ dataframe (insitu_df in merge_insitu_with_wind()) must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.

  • wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times())

Returns

An xarray dataset containing ‘u’ and ‘v’ variables indexed by time.

Return type

xr.Dataset
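The "day floor" and "day ceiling" requirement on the GEOS file list can be checked with a short standard-library sketch. The helper names are hypothetical; the dates are the ones used in the parameter description.

```python
from datetime import datetime, timedelta


def day_floor(t):
    """Round a datetime down to midnight of the same day."""
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


def day_ceiling(t):
    """Round a datetime up to the next midnight (unchanged if already midnight)."""
    floored = day_floor(t)
    return floored if floored == t else floored + timedelta(days=1)


first_obs = datetime(2021, 8, 2, 16)   # first in situ time (UTC)
last_obs = datetime(2021, 8, 28, 19)   # last in situ time (UTC)

geos_start = day_floor(first_obs)
geos_end = day_ceiling(last_obs)
```

For these inputs the required GEOS range runs from 2021-08-02 00:00 to 2021-08-29 00:00 UTC, matching the example in the description.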

ginput.priors.mlo_smo_prep.make_geos_2d_file_list(path_pattern: str, start_date, end_date, geos_version: Optional[str] = None) Sequence[str]

Helper function to make a list of GEOS 2D files for the SMO prep.

Parameters
  • path_pattern – A pattern that can use strftime formatting (%Y, %m, etc.) to give the correct path to each GEOS 2D file. If geos_version is None, this must give the full path to the file. Otherwise, the file name indicated by geos_version is appended to the end. Note that this will append a “/” to the end of the path if it needs to be concatenated with geos_version and no “/” is present, so this will not work well on Windows.

  • start_date – First time to include in the list. Any type acceptable to pandas.date_range() will do.

  • end_date – Last time to include in the list.

  • geos_version – If this is one of the strings “fpit” or “it”, it will append the appropriate file name pattern to path_pattern for that GEOS version. Any other string is treated as a file name pattern and is directly appended. If this is None, path_pattern must include the file name.

Returns

A list of file paths as strings.

Return type

paths
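The strftime-based expansion described for path_pattern can be sketched as follows. Everything here is illustrative: make_file_list is a hypothetical helper, the directory layout and file name pattern are made up, and the 3-hour step is an assumption about the file spacing rather than something stated above.

```python
from datetime import datetime, timedelta


def make_file_list(path_pattern, start, end, step_hours=3, file_name_pattern=None):
    """Expand a strftime-style pattern into one path per time step.

    step_hours=3 is an assumed file spacing; file_name_pattern plays the role
    of an explicit file name pattern appended to the directory pattern.
    """
    if file_name_pattern is not None:
        # Mirror the documented behavior: add a "/" only if one is missing.
        if not path_pattern.endswith("/"):
            path_pattern += "/"
        path_pattern += file_name_pattern

    paths = []
    current = start
    while current <= end:
        paths.append(current.strftime(path_pattern))
        current += timedelta(hours=step_hours)
    return paths


files = make_file_list(
    "/data/geos/%Y/%m",                        # made-up directory layout
    datetime(2021, 8, 2, 0),
    datetime(2021, 8, 2, 9),
    file_name_pattern="surf_%Y%m%d_%H00.nc4",  # made-up file name pattern
)
```

Because the separator is hard-coded as "/", this sketch has the same Windows limitation that the parameter description warns about.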

ginput.priors.mlo_smo_prep.merge_insitu_with_wind(insitu_df: ~pandas.core.frame.DataFrame, wind_file: str, wind_alt: float = 10, run_settings: ~ginput.priors.mlo_smo_prep.RunSettings = <ginput.priors.mlo_smo_prep.RunSettings object>, allow_missing_geos_files: bool = False) DataFrame

Merge an hourly in situ dataframe of SMO data with GEOS surface winds

Parameters
  • insitu_df – The SMO hourly dataframe.

  • wind_file – Path to a file that lists GEOS surface files, one per line. All GEOS files between the day floor and day ceiling of the times in insitu_df must be included. That is, if insitu_df has data from 2021-08-02 16:00 to 2021-08-28 19:00 UTC, then GEOS files from 2021-08-02 00:00 to 2021-08-29 00:00 UTC are required.

  • wind_alt – Which altitude above the surface to draw winds from. 10 meters is the default as that is close to the altitude of the NOAA sampling intake (see compute_wind_for_times())

  • run_settings – A RunSettings instance that carries configuration for extra outputs.

Returns

The insitu_df with columns for wind speed, direction, u-component, and v-component added, each interpolated to the times in the insitu_df.

Return type

pd.DataFrame

ginput.priors.mlo_smo_prep.mlo_background_selection(mlo_df: DataFrame, method: MloBackgroundMode) DataFrame

Limit a Mauna Loa hourly dataframe to background data.

Parameters
  • mlo_df – The MLO hourly dataframe.

  • method

    How to do the background selection. The two enum variants are:

    • TIME_AND_SIGMA: limit to midnight to 7 a.m. local time and where the standard deviation is less than 0.3 ppm.

    • TIME_AND_PRELIM: limit to midnight to 7 a.m. local time and where noaa_prelim_flagging() would keep the data.

Returns

mlo_df with non-background rows removed.

Return type

pd.DataFrame
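The TIME_AND_SIGMA rule can be illustrated with a small standard-library sketch. It assumes a fixed UTC offset of -10 hours for local time at Mauna Loa and treats the 7 a.m. limit as exclusive; both are assumptions for illustration, as are the function name and the row layout.

```python
from datetime import datetime, timedelta

MLO_UTC_OFFSET_HOURS = -10  # assumption: Hawaii standard time, no daylight saving


def select_night_low_sigma(rows, max_sigma=0.3):
    """Keep rows between midnight and 7 a.m. local time whose hourly standard
    deviation is below max_sigma (ppm)."""
    kept = []
    for row in rows:
        local_time = row["time_utc"] + timedelta(hours=MLO_UTC_OFFSET_HOURS)
        is_night = 0 <= local_time.hour < 7  # assumption: 7 a.m. itself excluded
        if is_night and row["std_dev"] < max_sigma:
            kept.append(row)
    return kept


rows = [
    {"time_utc": datetime(2021, 8, 2, 12), "value": 414.1, "std_dev": 0.10},  # 02:00 local
    {"time_utc": datetime(2021, 8, 2, 13), "value": 414.9, "std_dev": 0.45},  # 03:00 local, noisy
    {"time_utc": datetime(2021, 8, 2, 22), "value": 412.0, "std_dev": 0.05},  # 12:00 local
]
background = select_night_low_sigma(rows)
```

Only the first row survives: the second fails the standard deviation limit and the third falls outside the night-time window.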

ginput.priors.mlo_smo_prep.monthly_avg_rapid_data(df: DataFrame, year_field: Optional[str] = None, month_field: Optional[str] = None) DataFrame

Compute monthly averages from an hourly dataframe

Parameters
  • df – The hourly dataframe to compute from

  • year_field – Which column in the dataframe gives the year. If this is None, will try to find a column containing “year”.

  • month_field – Which column in the dataframe gives the month. If this is None, will try to find a column containing “month”.

Returns

The input dataframe averaged to months, with the index set to datetimes at the start of each month.

Return type

pd.DataFrame
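The return value described above (monthly means indexed by the first of each month) can be approximated with a pandas groupby. This is a minimal sketch with made-up values and hard-coded column names, not the function's implementation.

```python
import pandas as pd

# Made-up hourly values with explicit year and month columns.
hourly = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021],
    "month": [7, 7, 8, 8],
    "value": [410.0, 412.0, 411.0, 415.0],
})

# Average within each (year, month) group.
monthly = hourly.groupby(["year", "month"])["value"].mean().reset_index()

# Index the result by the first day of each month.
monthly.index = pd.DatetimeIndex(
    pd.to_datetime(monthly[["year", "month"]].assign(day=1))
)
```

The result has one row per month, so monthly.loc["2021-07-01", "value"] is the July mean of the two made-up values.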

ginput.priors.mlo_smo_prep.noaa_prelim_flagging(noaa_df: DataFrame, hr_std_dev_max: float = 0.2, hr2hr_diff_max: float = 0.25, mode: MloPrelimMode = MloPrelimMode.TIME_RELAXED_DIFF_EITHER, full_output: bool = False) DataFrame

Do preliminary selection of background data for NOAA hourly in situ data

Parameters
  • noaa_df – A NOAA hourly dataframe read by read_hourly_insitu().

  • hr_std_dev_max – The maximum standard deviation allowed within one hourly data point for it to be kept.

  • hr2hr_diff_max – Maximum allowed difference between adjacent hourly points for them to be retained. How this is interpreted depends on mode.

  • mode

    Controls how the hour-to-hour differences are used to reject data points. The different MloPrelimMode variants mean:

    • TIME_RELAXED_DIFF_EITHER: Keep points when the difference with either adjacent point is less than hr2hr_diff_max. If the time difference is >1 hour (due to the standard deviation or flagging), do not count that VMR difference.

    • TIME_RELAXED_DIFF_BOTH: Only keep points where, on both sides, either the VMR difference is smaller than hr2hr_diff_max or the time difference is >1 hour.

    • TIME_STRICT_DIFF_EITHER: Keep a point only if it has at least one neighbor with a VMR difference smaller than hr2hr_diff_max and a time difference < 1 hr.

    • TIME_STRICT_DIFF_BOTH: Keep a point only if the VMR difference with both neighbors is smaller than hr2hr_diff_max and both time differences are < 1 hr.

  • full_output – Set to True to output two additional logical vectors that indicate which points pass the hourly standard deviation and hour-to-hour difference criteria. If False, only the dataframe limited to good points is returned.
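To make the hour-to-hour criterion concrete, here is one reading of the TIME_RELAXED_DIFF_BOTH description as a standard-library sketch. It is an interpretation of the text above, not the ginput code: the function name is hypothetical, times are given as plain hour numbers, and a point with no neighbor on one side is assumed to pass on that side, like a gap.

```python
def relaxed_diff_both(times_hr, values, hr2hr_diff_max=0.25):
    """Return indices of points that pass on both sides, where a side passes
    if the value difference is below hr2hr_diff_max or the time gap is >1 hour."""
    n = len(values)

    def side_ok(i, j):
        if j < 0 or j >= n:
            return True  # assumption: a missing neighbor is treated like a gap
        if abs(times_hr[i] - times_hr[j]) > 1:
            return True  # gap of more than one hour: difference is not counted
        return abs(values[i] - values[j]) < hr2hr_diff_max

    return [i for i in range(n) if side_ok(i, i - 1) and side_ok(i, i + 1)]


# Made-up series: a 0.8 ppm jump between hours 1 and 2, and a gap after hour 3.
times_hr = [0, 1, 2, 3, 6, 7]
values = [410.0, 410.1, 410.9, 411.0, 411.1, 411.2]
kept = relaxed_diff_both(times_hr, values)
```

The points at hours 1 and 2 are rejected because of the jump between them, while the points on either side of the three-hour gap are kept.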

ginput.priors.mlo_smo_prep.read_hourly_insitu(hourly_file: str) DataFrame

Read and standardize an hourly in situ file

Parameters

hourly_file – Path to the NOAA hourly file

Returns

A dataframe with the hourly data, a pandas.DatetimeIndex, and its columns standardized to “site”, “year”, “month”, “day”, “hour”, “minute”, “value”, “uncertainty”, and “flag”.

Return type

pd.DataFrame

ginput.priors.mlo_smo_prep.read_surface_file(surface_file: str, datetime_index: Optional[Tuple[str, str]] = ('year', 'month'), match_v1_columns: bool = True, drop_fills: bool = False, fills_to_nan: bool = True, v3_known_site_codes=('mlo', 'smo')) DataFrame

Read a text file with NOAA surface data, either a monthly average or hourly file

Parameters
  • surface_file – Path to the surface file to read

  • datetime_index – If not None, then must be a tuple giving the names of the year and month columns in the file, to be converted to a monthly datetime index. (Only supported for monthly files.)

Returns

The data from the surface file as a dataframe. If datetime_index was not None, the index will be a pandas.DatetimeIndex.

Return type

pd.DataFrame

ginput.priors.mlo_smo_prep.smo_wind_filter(smo_df: DataFrame, first_wind_dir: float = 330.0, last_wind_dir: float = 160.0, min_wind_speed: float = 2.0) DataFrame

Subset an SMO CO2 dataframe to just rows with specific wind conditions

Parameters
  • smo_df – The dataframe of SMO CO2 DMFs with wind data included (see merge_insitu_with_wind())

  • first_wind_dir, last_wind_dir – These set the range of wind directions permitted; only data with a wind direction in the clockwise slice between first_wind_dir and last_wind_dir are retained.

  • min_wind_speed – The slowest wind speed allowed; only rows with a wind speed greater than or equal to this are retained.

Returns

A data frame that has a subset of the rows in smo_df.

Return type

pd.DataFrame

Notes

The default wind limits come from Waterman et al. 1989 (JGR, vol. 94, pp. 14817–14829). In the section “Data Processing,” they give two different criteria for wind direction. Although they found that the looser constraints kept much more data and did not introduce significant numbers of non-background measurements, I am using the stricter criteria, since I am filtering on GEOS FP-IT winds, which likely have some error compared to the surface winds measured at SMO.
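With the defaults, the permitted "clockwise slice" runs from 330 degrees through north to 160 degrees, so it wraps past 360. The sketch below shows one way to test membership in such a slice. The helper names are hypothetical and treating both endpoints as inclusive is an assumption; the speed test follows the "greater than or equal" wording above.

```python
def in_clockwise_slice(direction, first_dir, last_dir):
    """True if direction lies in the clockwise arc from first_dir to last_dir
    (endpoints included, which is an assumption of this sketch)."""
    direction %= 360.0
    if first_dir <= last_dir:
        return first_dir <= direction <= last_dir
    # The arc wraps through north (e.g. 330 -> 360/0 -> 160).
    return direction >= first_dir or direction <= last_dir


def keep_row(direction, speed, first_dir=330.0, last_dir=160.0, min_speed=2.0):
    """Apply both the direction slice and the minimum wind speed."""
    return in_clockwise_slice(direction, first_dir, last_dir) and speed >= min_speed


examples = {
    "north-northwest, brisk": keep_row(350.0, 3.0),
    "northeast, brisk": keep_row(45.0, 3.0),
    "south-southwest, brisk": keep_row(200.0, 5.0),
    "northeast, too slow": keep_row(45.0, 1.5),
}
```

Only the first two cases pass: 200 degrees lies outside the slice, and 1.5 is below the minimum speed.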