{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Regression: Jupyter Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `pandas.DataFrame`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `read_csv()` from pandas to read the data into a DataFrame. If your data happens to be in a Microsoft Excel file, there is also a `read_excel()` function." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "dataset = pd.read_csv('./history_data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationship Between `pandas.DataFrame` and `numpy.ndarray` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A DataFrame exposes its values as a `numpy.ndarray` through the `values` attribute." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([['New York', 'New York', nan, ..., 0.0, nan, 'Clear'],\n", " ['New York', 'New York', nan, ..., 0.0, nan, 'Clear'],\n", " ['New York', 'New York', nan, ..., 0.0, 87.77, 'Clear'],\n", " ...,\n", " ['New York', 'New York', nan, ..., 23.3, 74.96, 'Clear'],\n", " ['New York', 'New York', nan, ..., 14.3, 70.33, 'Clear'],\n", " ['New York', 'New York', nan, ..., 0.0, 84.26, 'Clear']],\n", " dtype=object)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.values" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(dataset.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, `pandas.DataFrame` provides many attributes from the underlying 
`numpy.ndarray`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A two-dimensional array ..." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.ndim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... with this many rows and columns ..." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(72, 16)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.values.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(72, 16)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame.describe()` is convenient for interactive use in a Jupyter notebook, just like many other methods." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
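<table>\n",
"<thead><tr><th></th><th>Resolved Address</th><th>Maximum Temperature</th><th>Minimum Temperature</th><th>Temperature</th><th>Wind Chill</th><th>Heat Index</th><th>Precipitation</th><th>Snow Depth</th><th>Wind Speed</th><th>Wind Gust</th><th>Cloud Cover</th><th>Relative Humidity</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>count</th><td>0.0</td><td>72.000000</td><td>72.000000</td><td>72.000000</td><td>0.0</td><td>22.000000</td><td>72.000000</td><td>0.0</td><td>72.000000</td><td>65.000000</td><td>72.000000</td><td>38.000000</td></tr>\n",
"<tr><th>mean</th><td>NaN</td><td>82.995833</td><td>70.688889</td><td>76.287500</td><td>NaN</td><td>91.327273</td><td>0.012222</td><td>NaN</td><td>12.690278</td><td>20.470769</td><td>3.369444</td><td>72.513421</td></tr>\n",
"<tr><th>std</th><td>NaN</td><td>5.946106</td><td>5.574023</td><td>5.313256</td><td>NaN</td><td>5.994536</td><td>0.070695</td><td>NaN</td><td>6.025980</td><td>8.302423</td><td>7.214825</td><td>12.665492</td></tr>\n",
"<tr><th>min</th><td>NaN</td><td>69.900000</td><td>53.300000</td><td>61.500000</td><td>NaN</td><td>82.400000</td><td>0.000000</td><td>NaN</td><td>2.200000</td><td>4.700000</td><td>0.000000</td><td>46.090000</td></tr>\n",
"<tr><th>25%</th><td>NaN</td><td>79.050000</td><td>68.900000</td><td>74.300000</td><td>NaN</td><td>87.050000</td><td>0.000000</td><td>NaN</td><td>9.100000</td><td>15.000000</td><td>0.000000</td><td>66.377500</td></tr>\n",
"<tr><th>50%</th><td>NaN</td><td>83.750000</td><td>71.950000</td><td>76.650000</td><td>NaN</td><td>89.750000</td><td>0.000000</td><td>NaN</td><td>12.800000</td><td>19.700000</td><td>0.000000</td><td>72.330000</td></tr>\n",
"<tr><th>75%</th><td>NaN</td><td>87.950000</td><td>74.325000</td><td>79.850000</td><td>NaN</td><td>95.950000</td><td>0.000000</td><td>NaN</td><td>15.425000</td><td>25.300000</td><td>2.825000</td><td>81.715000</td></tr>\n",
"<tr><th>max</th><td>NaN</td><td>92.900000</td><td>80.700000</td><td>85.800000</td><td>NaN</td><td>101.600000</td><td>0.470000</td><td>NaN</td><td>38.000000</td><td>50.600000</td><td>34.500000</td><td>96.970000</td></tr>\n",
"</tbody>\n",
"</table>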
\n", "
" ], "text/plain": [ " Resolved Address Maximum Temperature Minimum Temperature \\\n", "count 0.0 72.000000 72.000000 \n", "mean NaN 82.995833 70.688889 \n", "std NaN 5.946106 5.574023 \n", "min NaN 69.900000 53.300000 \n", "25% NaN 79.050000 68.900000 \n", "50% NaN 83.750000 71.950000 \n", "75% NaN 87.950000 74.325000 \n", "max NaN 92.900000 80.700000 \n", "\n", " Temperature Wind Chill Heat Index Precipitation Snow Depth \\\n", "count 72.000000 0.0 22.000000 72.000000 0.0 \n", "mean 76.287500 NaN 91.327273 0.012222 NaN \n", "std 5.313256 NaN 5.994536 0.070695 NaN \n", "min 61.500000 NaN 82.400000 0.000000 NaN \n", "25% 74.300000 NaN 87.050000 0.000000 NaN \n", "50% 76.650000 NaN 89.750000 0.000000 NaN \n", "75% 79.850000 NaN 95.950000 0.000000 NaN \n", "max 85.800000 NaN 101.600000 0.470000 NaN \n", "\n", " Wind Speed Wind Gust Cloud Cover Relative Humidity \n", "count 72.000000 65.000000 72.000000 38.000000 \n", "mean 12.690278 20.470769 3.369444 72.513421 \n", "std 6.025980 8.302423 7.214825 12.665492 \n", "min 2.200000 4.700000 0.000000 46.090000 \n", "25% 9.100000 15.000000 0.000000 66.377500 \n", "50% 12.800000 19.700000 0.000000 72.330000 \n", "75% 15.425000 25.300000 2.825000 81.715000 \n", "max 38.000000 50.600000 34.500000 96.970000 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting Input and Output Features from a `pandas.DataFrame`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beware that scikit-learn algorithms expect a two-dimensional array as the set of inputs. Indexing the DataFrame with a single column header (\"Minimum Temperature\") gives a one-dimensional `pandas.Series`. 
**Wrong!!**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "inputfeatures = dataset['Minimum Temperature']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 53.3\n", "1 58.7\n", "2 60.2\n", "3 66.8\n", "4 68.3\n", " ... \n", "67 70.1\n", "68 72.2\n", "69 72.1\n", "70 75.5\n", "71 78.2\n", "Name: Minimum Temperature, Length: 72, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputfeatures" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(inputfeatures)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Correction**: index the DataFrame *with a list of column headers*, which gives a two-dimensional DataFrame." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "inputfeatures = dataset[['Minimum Temperature']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A sign of correctness: the notebook renders a DataFrame as a nicely formatted table:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
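<table>\n",
"<thead><tr><th></th><th>Minimum Temperature</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>53.3</td></tr>\n",
"<tr><th>1</th><td>58.7</td></tr>\n",
"<tr><th>2</th><td>60.2</td></tr>\n",
"<tr><th>3</th><td>66.8</td></tr>\n",
"<tr><th>4</th><td>68.3</td></tr>\n",
"<tr><th>...</th><td>...</td></tr>\n",
"<tr><th>67</th><td>70.1</td></tr>\n",
"<tr><th>68</th><td>72.2</td></tr>\n",
"<tr><th>69</th><td>72.1</td></tr>\n",
"<tr><th>70</th><td>75.5</td></tr>\n",
"<tr><th>71</th><td>78.2</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>72 rows × 1 columns</p>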
\n", "
" ], "text/plain": [ " Minimum Temperature\n", "0 53.3\n", "1 58.7\n", "2 60.2\n", "3 66.8\n", "4 68.3\n", ".. ...\n", "67 70.1\n", "68 72.2\n", "69 72.1\n", "70 75.5\n", "71 78.2\n", "\n", "[72 rows x 1 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputfeatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, the output features." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "outputfeatures = dataset[['Maximum Temperature']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting with `matplotlib`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately (well, that was on purpose), our feature sets are one-dimensional, so plotting the dataset in two dimensions makes sense. Multidimensional data analysis is not so straightforward; this is why they call it data **science**." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.DataFrame` interacts nicely with `matplotlib`. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "dataset.plot(x='Minimum Temperature', y='Maximum Temperature', style='o')\n", "plt.title('Min/Max Temperature')\n", "plt.xlabel('Min')\n", "plt.ylabel('Max')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Splitting: Split into Training and Test Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before creating the model (from an algorithm and a dataset), we split the dataset:\n", "* 80% for training\n", "* 20% for testing/verification" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "input_train, input_test, output_train, output_test = \\\n", " train_test_split(inputfeatures, outputfeatures, test_size=0.2, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the Model: Algorithm + Training Data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially, the model *is* the algorithm." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "model = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we feed it the training data." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "model = model.fit(input_train, output_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model is complete. Its parameters are the slope (`coef_`) and the intercept (`intercept_`) of the fitted regression line (the theory behind them is needed to understand them better):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"array([[0.80189231]])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.coef_" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([25.95355086])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Verify the Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saved 20% of the dataset for verification.\n", "* Use the model to predict the output for the input test data.\n", "* Compare the prediction to the actual output test set." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "output_predicted = model.predict(input_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we (ab)use a `pandas.DataFrame` to nicely format the actual output test data and the predicted output side by side.\n", "\n", "**Note** that `input_test` is a `pd.DataFrame`, but `output_predicted` is a `numpy.ndarray`.\n", "\n", "**Reason**: `model.predict()` accepts any array-like input (thanks to duck typing; we gave it a `DataFrame`), but its output is always a `numpy.ndarray`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
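<table>\n",
"<thead><tr><th></th><th>Actual</th><th>Predicted</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>80.0</td><td>83.609608</td></tr>\n",
"<tr><th>1</th><td>84.9</td><td>84.571879</td></tr>\n",
"<tr><th>2</th><td>91.1</td><td>86.736988</td></tr>\n",
"<tr><th>3</th><td>80.0</td><td>84.170933</td></tr>\n",
"<tr><th>4</th><td>78.2</td><td>78.798254</td></tr>\n",
"<tr><th>5</th><td>92.0</td><td>84.170933</td></tr>\n",
"<tr><th>6</th><td>78.2</td><td>75.189739</td></tr>\n",
"<tr><th>7</th><td>92.0</td><td>88.180394</td></tr>\n",
"<tr><th>8</th><td>85.4</td><td>83.449230</td></tr>\n",
"<tr><th>9</th><td>80.1</td><td>88.661530</td></tr>\n",
"<tr><th>10</th><td>92.9</td><td>87.057745</td></tr>\n",
"<tr><th>11</th><td>85.4</td><td>83.850176</td></tr>\n",
"<tr><th>12</th><td>87.2</td><td>81.284120</td></tr>\n",
"<tr><th>13</th><td>90.0</td><td>83.850176</td></tr>\n",
"<tr><th>14</th><td>83.6</td><td>81.685067</td></tr>\n",
"</tbody>\n",
"</table>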
\n", "
" ], "text/plain": [ " Actual Predicted\n", "0 80.0 83.609608\n", "1 84.9 84.571879\n", "2 91.1 86.736988\n", "3 80.0 84.170933\n", "4 78.2 78.798254\n", "5 92.0 84.170933\n", "6 78.2 75.189739\n", "7 92.0 88.180394\n", "8 85.4 83.449230\n", "9 80.1 88.661530\n", "10 92.9 87.057745\n", "11 85.4 83.850176\n", "12 87.2 81.284120\n", "13 90.0 83.850176\n", "14 83.6 81.685067" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'Actual': output_test.values.ravel(),\n", " 'Predicted': output_predicted.ravel()})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the actual and predicted values, we can see that they are \"not far off\", whatever that means. In real-world data science (this only scratches the surface), we would now have to use statistical error metrics to actually quantify \"not far off\".\n", "\n", "But this is left to data scientists. Our job is to create correct programs, and to keep those *maintainable*." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" } }, "nbformat": 4, "nbformat_minor": 4 }