{ "cells": [ { "cell_type": "markdown", "id": "ba420ddc", "metadata": {}, "source": [ "# The `preprocess` pipeline" ] }, { "cell_type": "markdown", "id": "20e3cab2", "metadata": {}, "source": [ "Preprocessing the input data involves three main tasks:\n", "\n", "1. Data validation - *Do the datasets have the expected structure?*\n", "\n", "\n", "2. Data alignment and merging - *Do the datasets have the same indices? If they do, merge them. If not, first align their indices and then merge them.*\n", "\n", "\n", "3. Evaluation of data adequacy - *Do we have enough data to understand and model the energy consumption of the building?*\n", "\n", "\n", "The input data preprocessing tasks are summarized below:\n", "\n", "" ] }, { "cell_type": "markdown", "id": "cb796245", "metadata": {}, "source": [ "**Data validation**" ] }, { "cell_type": "markdown", "id": "b4c180fe", "metadata": {}, "source": [ "Raw input data is always treated as a [partitioned dataset](https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset). This means that `eensight` loads both features and labels as dictionaries with file names as keys and load functions as values. Each load function returns a pandas DataFrame. If there is more than one feature or label file, `eensight` validates each file's data separately.\n", "\n", "The input data validation has the following goals:\n", "\n", "**1. Rename features to the names expected by `eensight`.** \n", "\n", "`eensight` expects tabular features in which a column named `timestamp` provides the datetime information and a column named `temperature` provides the outdoor dry-bulb temperature. Label data (energy consumption) is also expected in tabular form, with one column named `consumption`. 
\n", "\n", "As an example, the dataset with `site_id=\"b01\"` contains data for both plug and HVAC loads: " ] }, { "cell_type": "code", "execution_count": 2, "id": "a6ea23c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | timestamp | plugs | hvac | total |\n",
"|---|---|---|---|---|\n",
"| 0 | 2016-01-01 00:00:00 | 157.34 | 11.20 | 168.54 |\n",
"| 1 | 2016-01-01 01:00:00 | 138.35 | 10.92 | 149.27 |\n",
"| 2 | 2016-01-01 02:00:00 | 116.55 | 11.80 | 128.35 |\n",
"| 3 | 2016-01-01 03:00:00 | 101.08 | 10.92 | 112.00 |\n",
"| 4 | 2016-01-01 04:00:00 | 84.92 | 11.98 | 96.90 |\n", "