{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ba420ddc",
   "metadata": {},
   "source": [
    "# The `preprocess` pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20e3cab2",
   "metadata": {},
   "source": [
    "The preprocessing of the input data includes three (3) main tasks:\n",
    "\n",
    "1. Data validation - *Do the datasets have the expected structure?*\n",
    "\n",
    "\n",
    "2. Data alignment and merging - *Do the datasets have the same indices? If they do, merge them. If not, first align their indices and then merge them.*\n",
    "\n",
    "\n",
    "3. Evaluation of data adequacy - *Do we have enough data to understand and model the energy consumption of the building?*\n",
    "\n",
    "\n",
    "The input data preprocessing tasks are summarized below:\n",
    "\n",
    "![Preprocessing steps](../../images/preprocessing.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb796245",
   "metadata": {},
   "source": [
    "**Data validation**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4c180fe",
   "metadata": {},
   "source": [
    "Raw input data is always treated as a [partitioned dataset](https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset). This means that both features and labels are loaded by `eensight` as dictionaries with file names as keys and load functions as values. Each load function returns a pandas DataFrame. If there are more than one (1) feature and/or label files, `eensight` validates each file's data separately.\n",
    "\n",
    "The input data validation has the following goals:\n",
    "\n",
    "**1. Change feature names to ones that are expected by `eensight`.** \n",
    "\n",
    "`eensight` expects tabular features where the datetime information is provided by a column named `timestamp`, and the outdoor dry bulb temperature information is provided by a column named `temperature`. In addition, label data (energy consumption) is also expected in tabular form, where there is one column named `consumption`.     \n",
    "\n",
    "As an example, the dataset with `site_id=\"b01\"` has information about both plug and hvac loads: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a6ea23c2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>plugs</th>\n",
       "      <th>hvac</th>\n",
       "      <th>total</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-01-01 00:00:00</td>\n",
       "      <td>157.34</td>\n",
       "      <td>11.20</td>\n",
       "      <td>168.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-01-01 01:00:00</td>\n",
       "      <td>138.35</td>\n",
       "      <td>10.92</td>\n",
       "      <td>149.27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-01-01 02:00:00</td>\n",
       "      <td>116.55</td>\n",
       "      <td>11.80</td>\n",
       "      <td>128.35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-01-01 03:00:00</td>\n",
       "      <td>101.08</td>\n",
       "      <td>10.92</td>\n",
       "      <td>112.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-01-01 04:00:00</td>\n",
       "      <td>84.92</td>\n",
       "      <td>11.98</td>\n",
       "      <td>96.90</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from eensight.utils import load_catalog\n",
    "\n",
    "catalog = load_catalog(store_uri=\"../../data\", site_id=\"b01\", namespace=\"train\")\n",
    "\n",
    "labels = catalog.load(\"train.input-labels\")\n",
    "load_fn = labels[\"consumption.csv\"]\n",
    "labels = load_fn()\n",
    "labels.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e320faa",
   "metadata": {},
   "source": [
    "If `eensight` users want to apply its functionality on the HVAC load, they can pass the following option to the command line: \n",
    "\n",
    "```\n",
    "--param rebind-names.consumption=hvac\n",
    "```\n",
    "\n",
    "The result is that the `hvac` column will be used in all places in the `eensight` code where a `consumption` column is expected (everywhere labels are expected). \n",
    "\n",
    "**Note**: \n",
    "\n",
    "> Passing either `name-of-parameter` or `name_of_parameter` to the command line, will update the `name_of_parameter` parameter in the `eensight` code.  \n",
    "\n",
    "Similarly, if a dataset includes outdoor temperature in a column named `Temp`, passing the following option to the command line will guide `eensight` to use this column in all places where a `temperature` column is expected:  \n",
    "\n",
    "```\n",
    "--param rebind-names.temperature=Temp\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c12e7d64",
   "metadata": {},
   "source": [
    "**2. Make sure that a column named `timestamp` exists.** \n",
    "\n",
    "The `timestamp` column will be parsed as datetime. The default date format is `%Y-%m-%d %H:%M:%S`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5247f0ea",
   "metadata": {},
   "source": [
    "**3. Check that the values of the `timestamp` column are unique, and if they are not, remove duplicate dates.**\n",
    "\n",
    "This function is controlled by a `threshold` parameter. If the range of the values that share a timestamp is less than `threshold` times the standard deviation of the data, they are replaced by their average. Otherwise, they are treated as missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2815b530",
   "metadata": {},
   "source": [
    "**Index alignment and data merging**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "854cd63e",
   "metadata": {},
   "source": [
    "First, `eensight` checks that the labels dataset has only one time step with duration that is less than one (1) day. This is needed because energy consumption is a cumulative variable: the value at any given timestamp corresponds to the consumption during the period between this timestamp and the preceding one. When more than one time steps with duration that is less than one (1) day exist, the meaning of the consumption values becomes ambiguous as a target for any predictive model.\n",
    "\n",
    "If the labels dataset has indeed only one time step, its index becomes the primary index for all features. Alignment is needed only when features have different time steps than the labels (for instance, hourly temperature data, but 15-minute interval consumption data). \n",
    "\n",
    "For data that is *numerical*, `eensight` provides two (2) approaches for alignment. The first approach interpolates the data to match the primary index, the second matches the primary index on the nearest key of the feature's index. \n",
    "\n",
    "For data that is both *numerical* and *cumulative* (temperature is non-cumulative, whereas consumption is cumulative), alignment is done by first calculating the cumulative sum of the feature to align, then applying the alignment method (by interpolation or by distance), and finally calculating the first discrete differences so that to reverse the cumulative sum transformation. \n",
    "\n",
    "Alignment of categorical features is always done by timestamp distance.\n",
    "\n",
    "`eensight` can align daily data to sub-daily indices (for instance, daily holiday data to hourly consumption data). \n",
    "\n",
    "Finally, aligning sub-daily data to a daily index is not supported. This should be part of a feature engineering process and not a data preprocessing one."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d8ba1c1",
   "metadata": {},
   "source": [
    "**Data adequacy evaluation**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60047c2d",
   "metadata": {},
   "source": [
    "The final step of the `preprocess` pipeline is to check that there is enough data available for the energy consumption of the building under study. Baseline energy consumption data must cover at least one (1) full year before any energy efficiency intervention. \n",
    "\n",
    "In addition, data must be available for over a certain percentage of hours in each calendar month. The default value of this percentage is 10%, but it is a user-defined parameter, so it can be adjusted."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4dc46d5",
   "metadata": {},
   "source": [
    "Before evaluating the adequacy of the available data, `eensight` screens for non-physically plausible values in the consumption data. Long streaks of constant values are filtered out as well (here *long* is defined in hours by `no_change_window`). However, long streaks of constant values will not be filtered out if they represent more than the `max_pct_constant` percentange of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90dd0f0d",
   "metadata": {},
   "source": [
    "-----------------"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "vscode": {
   "interpreter": {
    "hash": "959d2dd615dbea0ec23cfa5d24469412d511971322994f213dbde8f84b85ea73"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}