3.1. The input data validation stage#

In this section, the steps of the input data validation stage are presented through an example.

%load_ext autoreload
%autoreload 2
import pandas as pd

from eensight.config import ConfigLoader
from eensight.methods.preprocessing.validation import (
    check_column_exists, 
    check_column_type_datetime, 
    check_column_values_increasing,
    check_column_values_unique,
    remove_duplicate_dates,
    validate_dataset,
)
from eensight.settings import PROJECT_PATH
from eensight.utils import load_catalog

3.1.1. Load dataset#

First, we load the catalog for one of the available datasets (the one with site_id="b03"):

catalog = load_catalog(store_uri="../../../data", site_id="b03", namespace="train")

Get the raw input data:

features = catalog.load("train.input-features")
labels = catalog.load("train.input-labels")

3.1.2. Validate labels#

Let’s work first with the label data:

type(labels)
<class 'dict'>

Raw input data is always treated as a partitioned dataset. This means that both features and labels are loaded by eensight as dictionaries with file names as keys and load functions as values.

consumption = []

for load_fn in labels.values():
    consumption.append(load_fn())

consumption = pd.concat(consumption, axis=0)
consumption.head()
   Unnamed: 0            timestamp  consumption
0           0  2015-11-18 18:45:00  4671.259736
1           1  2015-11-18 19:00:00  4211.127331
2           2  2015-11-18 19:15:00  4393.182131
3           3  2015-11-18 19:30:00  4562.470893
4           4  2015-11-18 19:45:00  4535.828727

Check that a column named timestamp exists:

assert check_column_exists(consumption, "timestamp").success
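
For comparison, the same check fails for a column that does not exist (not_a_column is just a hypothetical name):

assert not check_column_exists(consumption, "not_a_column").success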

Parse the contents of the timestamp column as dates:

date_format = "%Y-%m-%d %H:%M:%S"

if not check_column_type_datetime(consumption, "timestamp").success:
    try:
        consumption["timestamp"] = pd.to_datetime(
            consumption["timestamp"], format=date_format
        )
        consumption = consumption.dropna(subset="timestamp")
    except ValueError as exc:
        raise ValueError("Column `timestamp` must be in datetime format") from exc
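
After the conversion, the datetime type check should pass:

assert check_column_type_datetime(consumption, "timestamp").success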

Sort the contents of the timestamp column if they are not already in increasing order:

if not check_column_values_increasing(consumption, "timestamp").success:
    consumption = consumption.sort_values(by=["timestamp"])
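
Likewise, the monotonicity check should pass after sorting:

assert check_column_values_increasing(consumption, "timestamp").success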

Check that the values of the timestamp column are unique and, if they are not, remove the duplicate dates:

We can test this functionality by adding some duplicate rows to the data. Half of these rows are altered so that their consumption values differ from the original ones by more than 0.25 (the default value of threshold) times the standard deviation of the data. These values should be replaced by NaN at the end of this task.

n_duplicate = 100
n_out_of_range = 50
nan_before = consumption["consumption"].isna()

consumption_with_dup = pd.concat(
    (consumption, consumption[~nan_before].sample(n=n_duplicate, replace=False)),
    axis=0,
    ignore_index=True,
)
assert not check_column_values_unique(consumption_with_dup, "timestamp").success
data_std = consumption_with_dup["consumption"].std()

# For the first n_out_of_range duplicated timestamps, shift the consumption
# value of the duplicate row by two standard deviations, i.e. well beyond
# the 0.25 * std threshold:
duplicated = consumption_with_dup[consumption_with_dup.duplicated(subset="timestamp")]

for i, (_, grouped) in enumerate(duplicated.groupby("timestamp")):
    if i < n_out_of_range:
        consumption_with_dup.loc[grouped.index[0], "consumption"] = (
            grouped["consumption"].iloc[0] + 2 * data_std
        )

consumption_no_dup = remove_duplicate_dates(consumption_with_dup, "timestamp", threshold=0.25)

assert check_column_values_unique(consumption_no_dup, "timestamp").success
assert consumption_no_dup["consumption"].isna().sum() == n_out_of_range + nan_before.sum()
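
To make the behaviour concrete, here is a toy example with hypothetical values: the two rows that share a timestamp differ by far more than 0.25 times the standard deviation of the data, so the row that survives for that timestamp should end up with a NaN consumption value.

toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2015-01-01 00:00:00", "2015-01-01 00:00:00", "2015-01-01 00:15:00"]
        ),
        "consumption": [100.0, 500.0, 110.0],
    }
)

toy_no_dup = remove_duplicate_dates(toy, "timestamp", threshold=0.25)
assert check_column_values_unique(toy_no_dup, "timestamp").success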

Finally, the timestamp column becomes the dataframe’s index, and columns whose name contains “Unnamed” are dropped:

consumption = consumption.set_index("timestamp")

to_drop = consumption.filter(like="Unnamed", axis=1).columns
if len(to_drop) > 0:
    consumption = consumption.drop(to_drop, axis=1)

All the aforementioned tasks are carried out by the eensight.methods.preprocessing.validation.validate_dataset function.
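
As an illustration, the whole procedure for the label data can be condensed as follows (a minimal sketch, assuming that validate_dataset accepts the raw dataframe and returns the validated one, just as in the feature examples below):

labels = catalog.load("train.input-labels")
consumption = pd.concat([load_fn() for load_fn in labels.values()], axis=0)
consumption = validate_dataset(consumption)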

3.1.3. Validate features#

The features of this dataset include two files:

features.keys()
dict_keys(['holidays.csv', 'temperature.csv'])

Each file is separately validated:

load_fn = features["holidays.csv"]
holidays = load_fn()
holidays = validate_dataset(holidays)
holidays.head()
                               holiday
timestamp
2015-01-01                    New year
2015-01-06                    Epiphany
2015-04-06               Easter Monday
2015-04-25              Liberation Day
2015-05-01  International Workers' Day

assert check_column_type_datetime(holidays.reset_index(), "timestamp").success
assert check_column_values_increasing(holidays.reset_index(), "timestamp").success
assert check_column_values_unique(holidays.reset_index(), "timestamp").success

load_fn = features["temperature.csv"]
temperature = load_fn()
temperature = validate_dataset(temperature)
temperature.head()
                     temperature
timestamp
2015-12-07 12:00:00         14.3
2015-12-07 13:00:00         15.2
2015-12-07 14:00:00         16.0
2015-12-07 15:00:00         16.2
2015-12-07 16:00:00         15.8

assert check_column_type_datetime(temperature.reset_index(), "timestamp").success
assert check_column_values_increasing(temperature.reset_index(), "timestamp").success
assert check_column_values_unique(temperature.reset_index(), "timestamp").success

3.1.4. Parameters#

The parameters of the input data validation stage, as defined in the eensight/conf/base/parameters/preprocess.yml file, are:

params = ConfigLoader(PROJECT_PATH / "conf").get(
    "parameters*", "parameters*/**", "**/parameters*"
)
{
    "rebind_names": params["rebind_names"],
    "date_format": params["date_format"],
    "validation": params["validation"],
}
{
    'rebind_names': {'consumption': None, 'temperature': None, 'timestamp': None},
    'date_format': '%Y-%m-%d %H:%M:%S',
    'validation': {'threshold': 0.25}
}
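
These parameters map onto the steps shown above: date_format is the format string used when parsing the timestamp column, validation.threshold is the threshold passed to remove_duplicate_dates, and rebind_names presumably maps the column names found in the raw input files to the names that eensight expects (all None here, meaning that no renaming is needed). As a minimal sketch, the values can be pulled out of params and reused directly:

date_format = params["date_format"]
threshold = params["validation"]["threshold"]

# e.g. parse a raw timestamp string and re-run the duplicate-date removal
# with the configured values:
pd.to_datetime("2015-11-18 18:45:00", format=date_format)
consumption_no_dup = remove_duplicate_dates(consumption_with_dup, "timestamp", threshold=threshold)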