{ "cells": [ { "cell_type": "markdown", "id": "4e9aefe7", "metadata": {}, "source": [ "## The input data validation stage" ] }, { "cell_type": "markdown", "id": "657d13ab", "metadata": {}, "source": [ "In this section, the input validation steps are presented through an example." ] }, { "cell_type": "code", "execution_count": 1, "id": "274afaef", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "bfda4368", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from eensight.config import ConfigLoader\n", "from eensight.methods.preprocessing.validation import (\n", " check_column_exists, \n", " check_column_type_datetime, \n", " check_column_values_increasing,\n", " check_column_values_unique,\n", " remove_duplicate_dates,\n", " validate_dataset,\n", ")\n", "from eensight.settings import PROJECT_PATH\n", "from eensight.utils import load_catalog" ] }, { "cell_type": "markdown", "id": "28b01073", "metadata": {}, "source": [ "### Load dataset\n", "\n", "First, we load the catalog for one of the available datasets (the one with `site_id=\"b03\"`):" ] }, { "cell_type": "code", "execution_count": 3, "id": "3ec999db", "metadata": {}, "outputs": [], "source": [ "catalog = load_catalog(store_uri=\"../../../data\", site_id=\"b03\", namespace=\"train\")" ] }, { "cell_type": "markdown", "id": "78259b1a", "metadata": {}, "source": [ "Get the raw input data:" ] }, { "cell_type": "code", "execution_count": 4, "id": "6a4911f4", "metadata": {}, "outputs": [], "source": [ "features = catalog.load(\"train.input-features\")\n", "labels = catalog.load(\"train.input-labels\")" ] }, { "cell_type": "markdown", "id": "ab33d284", "metadata": {}, "source": [ "### Validate labels\n", "\n", "Let's work first with the label data:" ] }, { "cell_type": "code", "execution_count": 5, "id": "c97e89f5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<class 'dict'>\n",
       "
\n" ], "text/plain": [ "\u001b[1m<\u001b[0m\u001b[1;95mclass\u001b[0m\u001b[39m \u001b[0m\u001b[32m'dict'\u001b[0m\u001b[1m>\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "type(labels)" ] }, { "cell_type": "markdown", "id": "c77215cb", "metadata": {}, "source": [ "Raw input data is always treated as a [partitioned dataset](https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset). This means that both features and labels are loaded by `eensight` as dictionaries with file names as keys and load functions as values." ] }, { "cell_type": "code", "execution_count": 6, "id": "99f0a6ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0timestampconsumption
002015-11-18 18:45:004671.259736
112015-11-18 19:00:004211.127331
222015-11-18 19:15:004393.182131
332015-11-18 19:30:004562.470893
442015-11-18 19:45:004535.828727
\n", "
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "consumption = []\n", "\n", "for load_fn in labels.values():\n", " consumption.append(load_fn())\n", "\n", "consumption = pd.concat(consumption, axis=0)\n", "consumption.head()" ] }, { "cell_type": "markdown", "id": "5b96be0e", "metadata": {}, "source": [ "**Check if a column with the name `timestamp` exists**" ] }, { "cell_type": "code", "execution_count": 7, "id": "8931c57f", "metadata": {}, "outputs": [], "source": [ "assert check_column_exists(consumption, \"timestamp\").success" ] }, { "cell_type": "markdown", "id": "35528048", "metadata": {}, "source": [ "**Parse the contents of the `timestamp` column as dates**" ] }, { "cell_type": "code", "execution_count": 8, "id": "09be234d", "metadata": {}, "outputs": [], "source": [ "date_format = \"%Y-%m-%d %H:%M:%S\"\n", "\n", "if not check_column_type_datetime(consumption, \"timestamp\").success:\n", " try:\n", " consumption[\"timestamp\"] = pd.to_datetime(consumption[\"timestamp\"], format=date_format)\n", " consumption = consumption.dropna(subset=\"timestamp\")\n", " except ValueError:\n", " raise ValueError(f\"Column `timestamp` must be in datetime format\")" ] }, { "cell_type": "markdown", "id": "73d46ec3", "metadata": {}, "source": [ "**Sort the the contents of the `timestamp` column if they are not already in an increasing order**" ] }, { "cell_type": "code", "execution_count": 9, "id": "163671eb", "metadata": {}, "outputs": [], "source": [ "if not check_column_values_increasing(consumption, \"timestamp\").success:\n", " consumption = consumption.sort_values(by=[\"timestamp\"])" ] }, { "cell_type": "markdown", "id": "4d202f0b", "metadata": {}, "source": [ "**Check that the values of the `timestamp` column are unique, and if they are not, remove duplicate dates**\n", "\n", "We can test this functionality by adding some duplicate rows to the data. 
Half of these rows correspond to consumption values that differ by more than 0.25 (the default value of `threshold`) times the standard deviation of the data. These should be replaced by `NaN` at the end of this task." ] }, { "cell_type": "code", "execution_count": 10, "id": "30d5f5a1", "metadata": {}, "outputs": [], "source": [ "n_duplicate = 100\n", "n_out_of_range = 50\n", "nan_before = consumption[\"consumption\"].isna()\n", "\n", "consumption_with_dup = pd.concat(\n", " (consumption, consumption[~nan_before].sample(n=n_duplicate, replace=False)),\n", " axis=0,\n", " ignore_index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "68596867", "metadata": {}, "outputs": [], "source": [ "assert not check_column_values_unique(consumption_with_dup, \"timestamp\").success" ] }, { "cell_type": "code", "execution_count": 12, "id": "946b9e3b", "metadata": {}, "outputs": [], "source": [ "data_std = consumption_with_dup[\"consumption\"].std()\n", "\n", "for i, (_, grouped) in enumerate(\n", " consumption_with_dup[consumption_with_dup.duplicated(subset=\"timestamp\")].groupby(\n", " \"timestamp\"\n", " )\n", "):\n", " if i < n_out_of_range:\n", " consumption_with_dup.loc[grouped.index[0], \"consumption\"] = (\n", " grouped[\"consumption\"].iloc[0] + 2 * data_std\n", " )" ] }, { "cell_type": "code", "execution_count": 13, "id": "d0ff10c2", "metadata": {}, "outputs": [], "source": [ "consumption_no_dup = remove_duplicate_dates(consumption_with_dup, \"timestamp\", threshold=0.25)\n", "\n", "assert check_column_values_unique(consumption_no_dup, \"timestamp\").success\n", "assert consumption_no_dup[\"consumption\"].isna().sum() == n_out_of_range + nan_before.sum()" ] }, { "cell_type": "markdown", "id": "73da0349", "metadata": {}, "source": [ "Finally, the `timestamp` column becomes the dataframe's index, and columns whose name contains \"Unnamed\" are dropped:" ] }, { "cell_type": "code", "execution_count": 14, "id": "4af5595e", "metadata": {}, "outputs": [], 
"source": [ "consumption = consumption.set_index(\"timestamp\")\n", "\n", "to_drop = consumption.filter(like=\"Unnamed\", axis=1).columns\n", "if len(to_drop) > 0:\n", " consumption = consumption.drop(to_drop, axis=1)" ] }, { "cell_type": "markdown", "id": "1d1c2be5", "metadata": {}, "source": [ "All, the aforementioned tasks are carried out by the `eensight.methods.preprocessing.validation.validate_dataset` function." ] }, { "cell_type": "markdown", "id": "7e5be1ab", "metadata": {}, "source": [ "### Validate features\n", "\n", "The features of this dataset include two files:" ] }, { "cell_type": "code", "execution_count": 15, "id": "5567a58b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
dict_keys(['holidays.csv', 'temperature.csv'])\n",
       "
\n" ], "text/plain": [ "\u001b[1;35mdict_keys\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[32m'holidays.csv'\u001b[0m, \u001b[32m'temperature.csv'\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "features.keys()" ] }, { "cell_type": "markdown", "id": "63b15730", "metadata": {}, "source": [ "Each file is separately validated:" ] }, { "cell_type": "code", "execution_count": 16, "id": "1efcf8ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
holiday
timestamp
2015-01-01New year
2015-01-06Epiphany
2015-04-06Easter Monday
2015-04-25Liberation Day
2015-05-01International Workers' Day
\n", "
" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "load_fn = features[\"holidays.csv\"]\n", "holidays = load_fn()\n", "holidays = validate_dataset(holidays)\n", "holidays.head()" ] }, { "cell_type": "code", "execution_count": 17, "id": "bf73bb05", "metadata": {}, "outputs": [], "source": [ "assert check_column_type_datetime(holidays.reset_index(), \"timestamp\").success\n", "assert check_column_values_increasing(holidays.reset_index(), \"timestamp\").success\n", "assert check_column_values_unique(holidays.reset_index(), \"timestamp\").success" ] }, { "cell_type": "code", "execution_count": 18, "id": "632a14a0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperature
timestamp
2015-12-07 12:00:0014.3
2015-12-07 13:00:0015.2
2015-12-07 14:00:0016.0
2015-12-07 15:00:0016.2
2015-12-07 16:00:0015.8
\n", "
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "load_fn = features[\"temperature.csv\"]\n", "temperature = load_fn()\n", "temperature = validate_dataset(temperature)\n", "temperature.head()" ] }, { "cell_type": "code", "execution_count": 19, "id": "2f99b35b", "metadata": {}, "outputs": [], "source": [ "assert check_column_type_datetime(temperature.reset_index(), \"timestamp\").success\n", "assert check_column_values_increasing(temperature.reset_index(), \"timestamp\").success\n", "assert check_column_values_unique(temperature.reset_index(), \"timestamp\").success" ] }, { "cell_type": "markdown", "id": "f2908fc0", "metadata": {}, "source": [ "### Parameters" ] }, { "cell_type": "markdown", "id": "e700128e", "metadata": {}, "source": [ "The parameters of the input data validation stage - as they can be found in the `eensight/conf/base/parameters/preprocess.yml` file - are:" ] }, { "cell_type": "code", "execution_count": 20, "id": "aa47298e", "metadata": {}, "outputs": [], "source": [ "params = ConfigLoader(PROJECT_PATH / \"conf\").get(\"parameters*\", \"parameters*/**\", \"**/parameters*\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "26e04a52", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
       "{\n",
       "    'rebind_names': {'consumption': None, 'temperature': None, 'timestamp': None},\n",
       "    'date_format': '%Y-%m-%d %H:%M:%S',\n",
       "    'validation': {'threshold': 0.25}\n",
       "}\n",
       "
\n" ], "text/plain": [ "\n", "\u001b[1m{\u001b[0m\n", " \u001b[32m'rebind_names'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'consumption'\u001b[0m: \u001b[3;35mNone\u001b[0m, \u001b[32m'temperature'\u001b[0m: \u001b[3;35mNone\u001b[0m, \u001b[32m'timestamp'\u001b[0m: \u001b[3;35mNone\u001b[0m\u001b[1m}\u001b[0m,\n", " \u001b[32m'date_format'\u001b[0m: \u001b[32m'%Y-%m-%d %H:%M:%S'\u001b[0m,\n", " \u001b[32m'validation'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'threshold'\u001b[0m: \u001b[1;36m0.25\u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "{\n", " \"rebind_names\": params[\"rebind_names\"],\n", " \"date_format\": params[\"date_format\"],\n", " \"validation\": params[\"validation\"],\n", "}" ] }, { "cell_type": "markdown", "id": "9f92ef69", "metadata": {}, "source": [ "-----------------" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }