Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
np.random.seed(1234) # for reproducibility

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     956.6
Date:                Sun, 21 Apr 2024   Prob (F-statistic):           1.96e-41
Time:                        19:55:32   Log-Likelihood:                 1.2217
No. Observations:                  50   AIC:                             5.557
Df Residuals:                      46   BIC:                             13.20
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9654      0.084     59.175      0.000       4.796       5.134
x1             0.5088      0.013     39.314      0.000       0.483       0.535
x2             0.5651      0.051     11.109      0.000       0.463       0.668
x3            -0.0206      0.001    -18.144      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        0.840   Durbin-Watson:                   2.269
Prob(Omnibus):                  0.657   Jarque-Bera (JB):                0.577
Skew:                          -0.263   Prob(JB):                        0.749
Kurtosis:                       2.972   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
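
The individual numbers in the summary are also available as attributes and methods of the results object. A small illustrative snippet using the fitted olsres from above:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors
print(olsres.conf_int())  # 95% confidence intervals
print(olsres.rsquared)    # R-squared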

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.44997408  4.96265684  5.43161584  5.82605136  6.12627912  6.32696437
  6.4379984   6.4828734   6.49482276  6.51136093  6.56811991  6.69299498
  6.90156162  7.1945165   7.55756297  7.96376004  8.37794864  8.76252818
  9.08363424  9.31670235  9.45050393  9.48899104  9.45064714  9.36545023
  9.26994764  9.20125134  9.19094055  9.25987342  9.41476003  9.64705998
  9.93438555 10.24417991 10.53906618 10.78298825 10.9471348  11.01467283
 10.98351335 10.86665452 10.69004615 10.4883262  10.29912985 10.15690615
 10.08725815 10.10273638 10.20077685 10.36412228 10.56365745 10.76319271
 10.92540989 11.01799353]
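
For in-sample prediction there is a shortcut: calling predict with no argument reuses the estimation sample, and the result equals the fitted values. A quick illustrative check:

print(np.allclose(ypred, olsres.fittedvalues))
print(np.allclose(ypred, olsres.predict()))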

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.00539682 10.84456472 10.55765996 10.1951886   9.82363439  9.50918122
  9.30150901  9.2216303   9.25674569  9.36337756]
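
The point predictions above carry no uncertainty information. The results object also offers get_prediction, which returns standard errors together with confidence and prediction intervals for the new observations; summary_frame collects them in a DataFrame. A sketch using the Xnew built above:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean_ci_*, obs_ci_* columns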

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0xadde5de1e8ed>
[Figure: data points, true curve, and OLS prediction over the in-sample and out-of-sample range]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we do not want any expansion magic from using **2 in the formula.
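
One way to confirm which design columns the formula produced is to inspect the fitted model's exog_names (an illustrative check):

print(res.model.exog_names)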

[9]:
res.params
[9]:
Intercept           4.965353
x1                  0.508755
np.sin(x1)          0.565142
I((x1 - 5) ** 2)   -0.020615
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.005397
1    10.844565
2    10.557660
3    10.195189
4     9.823634
5     9.509181
6     9.301509
7     9.221630
8     9.256746
9     9.363378
dtype: float64
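
As a sanity check, the formula-based out-of-sample predictions should agree with the earlier array-based ones up to floating point:

print(np.allclose(ynewpred, res.predict(exog=dict(x1=x1n))))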