{
"cells": [
{
"cell_type": "markdown",
"id": "4e9aefe7",
"metadata": {},
"source": [
"## The input data validation stage"
]
},
{
"cell_type": "markdown",
"id": "657d13ab",
"metadata": {},
"source": [
"This section walks through the input data validation steps using an example dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "274afaef",
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bfda4368",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from eensight.config import ConfigLoader\n",
"from eensight.methods.preprocessing.validation import (\n",
" check_column_exists, \n",
" check_column_type_datetime, \n",
" check_column_values_increasing,\n",
" check_column_values_unique,\n",
" remove_duplicate_dates,\n",
" validate_dataset,\n",
")\n",
"from eensight.settings import PROJECT_PATH\n",
"from eensight.utils import load_catalog"
]
},
{
"cell_type": "markdown",
"id": "28b01073",
"metadata": {},
"source": [
"### Load dataset\n",
"\n",
"First, we load the catalog for one of the available datasets (the one with `site_id=\"b03\"`):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3ec999db",
"metadata": {},
"outputs": [],
"source": [
"catalog = load_catalog(store_uri=\"../../../data\", site_id=\"b03\", namespace=\"train\")"
]
},
{
"cell_type": "markdown",
"id": "78259b1a",
"metadata": {},
"source": [
"Get the raw input data:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6a4911f4",
"metadata": {},
"outputs": [],
"source": [
"features = catalog.load(\"train.input-features\")\n",
"labels = catalog.load(\"train.input-labels\")"
]
},
{
"cell_type": "markdown",
"id": "ab33d284",
"metadata": {},
"source": [
"### Validate labels\n",
"\n",
"Let's work first with the label data:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c97e89f5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u001b[1m<\u001b[0m\u001b[1;95mclass\u001b[0m\u001b[39m \u001b[0m\u001b[32m'dict'\u001b[0m\u001b[1m>\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"type(labels)"
]
},
{
"cell_type": "markdown",
"id": "c77215cb",
"metadata": {},
"source": [
"Raw input data is always treated as a [partitioned dataset](https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset). This means that both features and labels are loaded by `eensight` as dictionaries with file names as keys and load functions as values."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "99f0a6ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   Unnamed: 0            timestamp  consumption\n",
"0           0  2015-11-18 18:45:00  4671.259736\n",
"1           1  2015-11-18 19:00:00  4211.127331\n",
"2           2  2015-11-18 19:15:00  4393.182131\n",
"3           3  2015-11-18 19:30:00  4562.470893\n",
"4           4  2015-11-18 19:45:00  4535.828727"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"consumption = []\n",
"\n",
"for load_fn in labels.values():\n",
" consumption.append(load_fn())\n",
"\n",
"consumption = pd.concat(consumption, axis=0)\n",
"consumption.head()"
]
},
{
"cell_type": "markdown",
"id": "5b96be0e",
"metadata": {},
"source": [
"**Check if a column with the name `timestamp` exists**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8931c57f",
"metadata": {},
"outputs": [],
"source": [
"assert check_column_exists(consumption, \"timestamp\").success"
]
},
{
"cell_type": "markdown",
"id": "35528048",
"metadata": {},
"source": [
"**Parse the contents of the `timestamp` column as dates**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "09be234d",
"metadata": {},
"outputs": [],
"source": [
"date_format = \"%Y-%m-%d %H:%M:%S\"\n",
"\n",
"if not check_column_type_datetime(consumption, \"timestamp\").success:\n",
" try:\n",
" consumption[\"timestamp\"] = pd.to_datetime(consumption[\"timestamp\"], format=date_format)\n",
"        consumption = consumption.dropna(subset=[\"timestamp\"])\n",
"    except ValueError as exc:\n",
"        raise ValueError(\"Column `timestamp` must be in datetime format\") from exc"
]
},
{
"cell_type": "markdown",
"id": "73d46ec3",
"metadata": {},
"source": [
"**Sort the contents of the `timestamp` column if they are not already in increasing order**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "163671eb",
"metadata": {},
"outputs": [],
"source": [
"if not check_column_values_increasing(consumption, \"timestamp\").success:\n",
" consumption = consumption.sort_values(by=[\"timestamp\"])"
]
},
{
"cell_type": "markdown",
"id": "4d202f0b",
"metadata": {},
"source": [
"**Check that the values of the `timestamp` column are unique, and if they are not, remove duplicate dates**\n",
"\n",
"We can test this functionality by adding some duplicate rows to the data. Half of these rows correspond to consumption values that differ by more than 0.25 (the default value of `threshold`) times the standard deviation of the data. These should be replaced by `NaN` at the end of this task."
]
},
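{
"cell_type": "markdown",
"id": "b7c4e2f1",
"metadata": {},
"source": [
"As a minimal pandas sketch of this rule (an assumption about the behaviour of `remove_duplicate_dates`, not its actual implementation): duplicate timestamps whose values stay within `threshold` times the standard deviation of the data collapse to a single value, while duplicates that spread further apart become `NaN`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\n",
"    'timestamp': pd.to_datetime(\n",
"        ['2015-01-01 00:00'] * 2 + ['2015-01-01 00:15'] * 2\n",
"    ),\n",
"    'consumption': [100.0, 100.2, 100.0, 500.0],\n",
"})\n",
"threshold = 0.25\n",
"std = df['consumption'].std()\n",
"\n",
"def collapse(values):\n",
"    # duplicates that agree to within threshold * std keep their mean,\n",
"    # disagreeing duplicates are replaced by NaN\n",
"    if values.max() - values.min() > threshold * std:\n",
"        return np.nan\n",
"    return values.mean()\n",
"\n",
"deduped = df.groupby('timestamp')['consumption'].apply(collapse)\n",
"```"
]
},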
{
"cell_type": "code",
"execution_count": 10,
"id": "30d5f5a1",
"metadata": {},
"outputs": [],
"source": [
"n_duplicate = 100\n",
"n_out_of_range = 50\n",
"nan_before = consumption[\"consumption\"].isna()\n",
"\n",
"consumption_with_dup = pd.concat(\n",
" (consumption, consumption[~nan_before].sample(n=n_duplicate, replace=False)),\n",
" axis=0,\n",
" ignore_index=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "68596867",
"metadata": {},
"outputs": [],
"source": [
"assert not check_column_values_unique(consumption_with_dup, \"timestamp\").success"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "946b9e3b",
"metadata": {},
"outputs": [],
"source": [
"data_std = consumption_with_dup[\"consumption\"].std()\n",
"\n",
"for i, (_, grouped) in enumerate(\n",
" consumption_with_dup[consumption_with_dup.duplicated(subset=\"timestamp\")].groupby(\n",
" \"timestamp\"\n",
" )\n",
"):\n",
" if i < n_out_of_range:\n",
" consumption_with_dup.loc[grouped.index[0], \"consumption\"] = (\n",
" grouped[\"consumption\"].iloc[0] + 2 * data_std\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d0ff10c2",
"metadata": {},
"outputs": [],
"source": [
"consumption_no_dup = remove_duplicate_dates(consumption_with_dup, \"timestamp\", threshold=0.25)\n",
"\n",
"assert check_column_values_unique(consumption_no_dup, \"timestamp\").success\n",
"assert consumption_no_dup[\"consumption\"].isna().sum() == n_out_of_range + nan_before.sum()"
]
},
{
"cell_type": "markdown",
"id": "73da0349",
"metadata": {},
"source": [
"Finally, the `timestamp` column becomes the dataframe's index, and columns whose names contain \"Unnamed\" are dropped:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4af5595e",
"metadata": {},
"outputs": [],
"source": [
"consumption = consumption.set_index(\"timestamp\")\n",
"\n",
"to_drop = consumption.filter(like=\"Unnamed\", axis=1).columns\n",
"if len(to_drop) > 0:\n",
" consumption = consumption.drop(to_drop, axis=1)"
]
},
{
"cell_type": "markdown",
"id": "1d1c2be5",
"metadata": {},
"source": [
"All of the aforementioned tasks are carried out by the `eensight.methods.preprocessing.validation.validate_dataset` function."
]
},
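{
"cell_type": "markdown",
"id": "e4a8d6b2",
"metadata": {},
"source": [
"Pulled together, the steps above amount to roughly the following pandas sketch (an illustrative re-implementation under simplifying assumptions, not the actual `validate_dataset` source; in particular, exact duplicates are simply dropped here instead of applying the threshold rule):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def validate_sketch(df, col='timestamp', date_format='%Y-%m-%d %H:%M:%S'):\n",
"    # 1. the timestamp column must exist\n",
"    assert col in df.columns, f'column `{col}` is missing'\n",
"    # 2. parse it as datetime, dropping rows that fail to parse\n",
"    df = df.copy()\n",
"    df[col] = pd.to_datetime(df[col], format=date_format, errors='coerce')\n",
"    df = df.dropna(subset=[col])\n",
"    # 3. sort so that timestamps increase monotonically\n",
"    df = df.sort_values(by=[col])\n",
"    # 4. drop duplicated timestamps (the real pipeline is smarter here)\n",
"    df = df.drop_duplicates(subset=[col], keep='first')\n",
"    # 5. index by timestamp and drop 'Unnamed' artifact columns\n",
"    df = df.set_index(col)\n",
"    return df.drop(columns=df.filter(like='Unnamed').columns)\n",
"```"
]
},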
{
"cell_type": "markdown",
"id": "7e5be1ab",
"metadata": {},
"source": [
"### Validate features\n",
"\n",
"The features of this dataset include two files:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "5567a58b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\u001b[1;35mdict_keys\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'holidays.csv'\u001b[0m, \u001b[32m'temperature.csv'\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"features.keys()"
]
},
{
"cell_type": "markdown",
"id": "63b15730",
"metadata": {},
"source": [
"Each file is separately validated:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "1efcf8ab",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                               holiday\n",
"timestamp                             \n",
"2015-01-01                    New year\n",
"2015-01-06                    Epiphany\n",
"2015-04-06               Easter Monday\n",
"2015-04-25              Liberation Day\n",
"2015-05-01  International Workers' Day"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"load_fn = features[\"holidays.csv\"]\n",
"holidays = load_fn()\n",
"holidays = validate_dataset(holidays)\n",
"holidays.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "bf73bb05",
"metadata": {},
"outputs": [],
"source": [
"assert check_column_type_datetime(holidays.reset_index(), \"timestamp\").success\n",
"assert check_column_values_increasing(holidays.reset_index(), \"timestamp\").success\n",
"assert check_column_values_unique(holidays.reset_index(), \"timestamp\").success"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "632a14a0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                     temperature\n",
"timestamp                       \n",
"2015-12-07 12:00:00         14.3\n",
"2015-12-07 13:00:00         15.2\n",
"2015-12-07 14:00:00         16.0\n",
"2015-12-07 15:00:00         16.2\n",
"2015-12-07 16:00:00         15.8"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"load_fn = features[\"temperature.csv\"]\n",
"temperature = load_fn()\n",
"temperature = validate_dataset(temperature)\n",
"temperature.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "2f99b35b",
"metadata": {},
"outputs": [],
"source": [
"assert check_column_type_datetime(temperature.reset_index(), \"timestamp\").success\n",
"assert check_column_values_increasing(temperature.reset_index(), \"timestamp\").success\n",
"assert check_column_values_unique(temperature.reset_index(), \"timestamp\").success"
]
},
{
"cell_type": "markdown",
"id": "f2908fc0",
"metadata": {},
"source": [
"### Parameters"
]
},
{
"cell_type": "markdown",
"id": "e700128e",
"metadata": {},
"source": [
"The parameters of the input data validation stage, as defined in the `eensight/conf/base/parameters/preprocess.yml` file, are:"
]
},
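{
"cell_type": "markdown",
"id": "c9d5a3e7",
"metadata": {},
"source": [
"Presumably, the relevant section of that file looks roughly like the following YAML (reconstructed from the parameter values printed in this notebook, not copied from the file itself):\n",
"\n",
"```yaml\n",
"rebind_names:\n",
"  consumption: null\n",
"  temperature: null\n",
"  timestamp: null\n",
"date_format: '%Y-%m-%d %H:%M:%S'\n",
"validation:\n",
"  threshold: 0.25\n",
"```"
]
},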
{
"cell_type": "code",
"execution_count": 20,
"id": "aa47298e",
"metadata": {},
"outputs": [],
"source": [
"params = ConfigLoader(PROJECT_PATH / \"conf\").get(\"parameters*\", \"parameters*/**\", \"**/parameters*\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "26e04a52",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"\u001b[1m{\u001b[0m\n",
" \u001b[32m'rebind_names'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'consumption'\u001b[0m: \u001b[3;35mNone\u001b[0m, \u001b[32m'temperature'\u001b[0m: \u001b[3;35mNone\u001b[0m, \u001b[32m'timestamp'\u001b[0m: \u001b[3;35mNone\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[32m'date_format'\u001b[0m: \u001b[32m'%Y-%m-%d %H:%M:%S'\u001b[0m,\n",
" \u001b[32m'validation'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'threshold'\u001b[0m: \u001b[1;36m0.25\u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[1m}\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"{\n",
" \"rebind_names\": params[\"rebind_names\"],\n",
" \"date_format\": params[\"date_format\"],\n",
" \"validation\": params[\"validation\"],\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "9f92ef69",
"metadata": {},
"source": [
"-----------------"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}