3.1. The input data validation stage#
In this section, the input data validation steps are presented through an example.
%load_ext autoreload
%autoreload 2
import pandas as pd
from eensight.config import ConfigLoader
from eensight.methods.preprocessing.validation import (
    check_column_exists,
    check_column_type_datetime,
    check_column_values_increasing,
    check_column_values_unique,
    remove_duplicate_dates,
    validate_dataset,
)
from eensight.settings import PROJECT_PATH
from eensight.utils import load_catalog
3.1.1. Load dataset#
First, we load the catalog for one of the available datasets (the one with site_id="b03"):
catalog = load_catalog(store_uri="../../../data", site_id="b03", namespace="train")
Get the raw input data:
features = catalog.load("train.input-features")
labels = catalog.load("train.input-labels")
3.1.2. Validate labels#
Let’s work first with the label data:
type(labels)
<class 'dict'>
Raw input data is always treated as a partitioned dataset. This means that both features and labels are loaded by `eensight` as dictionaries with file names as keys and load functions as values.
consumption = []
for load_fn in labels.values():
    consumption.append(load_fn())

consumption = pd.concat(consumption, axis=0)
consumption.head()
|   | Unnamed: 0 | timestamp | consumption |
|---|---|---|---|
| 0 | 0 | 2015-11-18 18:45:00 | 4671.259736 |
| 1 | 1 | 2015-11-18 19:00:00 | 4211.127331 |
| 2 | 2 | 2015-11-18 19:15:00 | 4393.182131 |
| 3 | 3 | 2015-11-18 19:30:00 | 4562.470893 |
| 4 | 4 | 2015-11-18 19:45:00 | 4535.828727 |
Check if a column with the name `timestamp` exists:
assert check_column_exists(consumption, "timestamp").success
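Each `check_*` function returns a result object with a boolean `success` attribute. A minimal sketch of this pattern (illustrative only; the real result objects may carry more information than shown here):

from dataclasses import dataclass

@dataclass
class CheckResult:
    success: bool  # the only attribute this guide relies on

def check_column_exists_sketch(data, column):
    # Report whether `column` is among the dataframe's columns.
    return CheckResult(success=column in data.columns)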
Parse the contents of the `timestamp` column as dates:
date_format = "%Y-%m-%d %H:%M:%S"

if not check_column_type_datetime(consumption, "timestamp").success:
    try:
        consumption["timestamp"] = pd.to_datetime(
            consumption["timestamp"], format=date_format
        )
        consumption = consumption.dropna(subset=["timestamp"])
    except ValueError:
        raise ValueError("Column `timestamp` must be in datetime format")
Sort the contents of the `timestamp` column if they are not already in increasing order:
if not check_column_values_increasing(consumption, "timestamp").success:
    consumption = consumption.sort_values(by=["timestamp"])
Check that the values of the `timestamp` column are unique and, if they are not, remove duplicate dates.
We can test this functionality by adding some duplicate rows to the data. Half of these rows correspond to consumption values that differ by more than 0.25 (the default value of `threshold`) times the standard deviation of the data. These should be replaced by `NaN` at the end of this task.
n_duplicate = 100
n_out_of_range = 50

nan_before = consumption["consumption"].isna()

consumption_with_dup = pd.concat(
    (consumption, consumption[~nan_before].sample(n=n_duplicate, replace=False)),
    axis=0,
    ignore_index=True,
)

assert not check_column_values_unique(consumption_with_dup, "timestamp").success

data_std = consumption_with_dup["consumption"].std()

for i, (_, grouped) in enumerate(
    consumption_with_dup[consumption_with_dup.duplicated(subset="timestamp")].groupby(
        "timestamp"
    )
):
    if i < n_out_of_range:
        consumption_with_dup.loc[grouped.index[0], "consumption"] = (
            grouped["consumption"].iloc[0] + 2 * data_std
        )

consumption_no_dup = remove_duplicate_dates(
    consumption_with_dup, "timestamp", threshold=0.25
)

assert check_column_values_unique(consumption_no_dup, "timestamp").success
assert consumption_no_dup["consumption"].isna().sum() == n_out_of_range + nan_before.sum()
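For intuition, here is a minimal sketch of the duplicate-resolution logic verified above (illustrative only; the actual `remove_duplicate_dates` implementation in `eensight` may differ):

import numpy as np

def remove_duplicate_dates_sketch(data, date_col, threshold=0.25):
    # Illustrative sketch: keep one row per timestamp. Where duplicate rows
    # disagree by more than `threshold` times the standard deviation of the
    # data, the kept value is replaced by NaN.
    std = data["consumption"].std()

    def resolve(values):
        if values.max() - values.min() > threshold * std:
            return np.nan
        return values.iloc[0]

    return data.groupby(date_col, as_index=False).agg(
        consumption=("consumption", resolve)
    )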
Finally, the `timestamp` column becomes the dataframe's index, and columns including "Unnamed" in their name are dropped:
consumption = consumption.set_index("timestamp")

to_drop = consumption.filter(like="Unnamed", axis=1).columns
if len(to_drop) > 0:
    consumption = consumption.drop(to_drop, axis=1)
All the aforementioned tasks are carried out by the `eensight.methods.preprocessing.validation.validate_dataset` function.
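Based on the usage shown for the feature files below, the whole sequence reduces to a single call on the raw data (a sketch; only the one-argument form of `validate_dataset` appears in this guide):

# Rebuild the raw label data and validate it in one step. `validate_dataset`
# returns a dataframe with a unique, increasing `timestamp` index, conflicting
# duplicates replaced by NaN, and "Unnamed" columns dropped.
consumption = pd.concat([load_fn() for load_fn in labels.values()], axis=0)
consumption = validate_dataset(consumption)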
3.1.3. Validate features#
The features of this dataset consist of two files:
features.keys()
dict_keys(['holidays.csv', 'temperature.csv'])
Each file is separately validated:
load_fn = features["holidays.csv"]
holidays = load_fn()
holidays = validate_dataset(holidays)
holidays.head()
| timestamp | holiday |
|---|---|
| 2015-01-01 | New year |
| 2015-01-06 | Epiphany |
| 2015-04-06 | Easter Monday |
| 2015-04-25 | Liberation Day |
| 2015-05-01 | International Workers' Day |
assert check_column_type_datetime(holidays.reset_index(), "timestamp").success
assert check_column_values_increasing(holidays.reset_index(), "timestamp").success
assert check_column_values_unique(holidays.reset_index(), "timestamp").success
load_fn = features["temperature.csv"]
temperature = load_fn()
temperature = validate_dataset(temperature)
temperature.head()
| timestamp | temperature |
|---|---|
| 2015-12-07 12:00:00 | 14.3 |
| 2015-12-07 13:00:00 | 15.2 |
| 2015-12-07 14:00:00 | 16.0 |
| 2015-12-07 15:00:00 | 16.2 |
| 2015-12-07 16:00:00 | 15.8 |
assert check_column_type_datetime(temperature.reset_index(), "timestamp").success
assert check_column_values_increasing(temperature.reset_index(), "timestamp").success
assert check_column_values_unique(temperature.reset_index(), "timestamp").success
3.1.4. Parameters#
The parameters of the input data validation stage, as they can be found in the `eensight/conf/base/parameters/preprocess.yml` file, are:
params = ConfigLoader(PROJECT_PATH / "conf").get(
    "parameters*", "parameters*/**", "**/parameters*"
)

{
    "rebind_names": params["rebind_names"],
    "date_format": params["date_format"],
    "validation": params["validation"],
}
{'rebind_names': {'consumption': None, 'temperature': None, 'timestamp': None},
 'date_format': '%Y-%m-%d %H:%M:%S',
 'validation': {'threshold': 0.25}}
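The `rebind_names` mapping tells `eensight` how to translate the column names it expects to the names actually found in the raw files (here all `None`, since this dataset already uses the expected names). As an illustration with hypothetical column names, the renaming step could look like this (a sketch, not `eensight`'s code):

import pandas as pd

# Hypothetical raw file whose consumption column is named `kWh`.
df = pd.DataFrame({"timestamp": ["2015-11-18 18:45:00"], "kWh": [4671.26]})

# With `rebind_names: {consumption: kWh}` in preprocess.yml, the renaming
# step would reduce to something like:
rebind_names = {"consumption": "kWh", "temperature": None, "timestamp": None}
rename_map = {raw: expected for expected, raw in rebind_names.items() if raw is not None}
df = df.rename(columns=rename_map)  # columns are now: timestamp, consumption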