Over the past couple of weeks, I have begun my journey into machine learning with linear and logistic regression. While I have only dipped my toes in to test the water so far, I have already seen how strongly outliers can affect the predictive ability of any given regression. One useful metric I discovered for identifying pesky outliers is the studentized residual. A studentized residual is calculated by dividing each residual by an estimate of its standard deviation, a denominator that accounts for the observation's leverage. Observations whose studentized residuals exceed 3 in absolute value should draw a wary eye, and those beyond 6 should definitely be considered toss-out candidates.
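To see where the metric comes from, here is a minimal sketch that computes externally studentized residuals by hand with NumPy. It uses synthetic data rather than the data from this post, and all variable names are my own:

```python
import numpy as np

# Synthetic regression data: design matrix with an intercept column
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Ordinary least squares fit and raw residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

# Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Leave-one-out variance estimate for each observation,
# so a gross outlier does not inflate its own denominator
sse = e @ e
s2_loo = (sse - e**2 / (1 - h)) / (n - p - 1)

# Externally studentized residuals: residual over its estimated std. dev.
t = e / np.sqrt(s2_loo * (1 - h))
```

In practice there is no need to write this out; it is only meant to show that "residual divided by its standard error" hides a leverage term in the denominator.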
Thankfully, we do not have to calculate this on our own: statsmodels includes a method on its standard linear regression results object that finds studentized residuals for us. OLSResults.outlier_test() calculates these studentized residuals, making outlier identification a breeze!
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Predictors span the entire set of observations
X_train_sm = sm.add_constant(X, prepend=True)

# Instantiating and fitting statsmodels ordinary least squares
lr_model = sm.OLS(y_log, X_train_sm).fit()

# Performing the statsmodels outlier test, which calculates studentized residuals
outliers = lr_model.outlier_test()
# Gathering 5 largest outliers according to leveraged residuals
outliers['student_resid'].abs().sort_values(ascending=False).head(n=5)
# Calculating predicted prices for ordinary least squares
sm_pred_prices = np.exp(lr_model.predict(X_train_sm))

# Plotting studentized residuals versus the predicted prices
student_res = outliers['student_resid']
# Most data points are concentrated between -6 and 6 studentized residuals,
# so outliers are those greater than 6 or less than -6
plt.axhline(y=-6, color='red')
plt.axhline(y=6, color='red')
plt.scatter(x=sm_pred_prices, y=student_res)
plt.title('Outlier Identification')
plt.ylabel('Studentized Residuals')
plt.xlabel('Predicted Price')
plt.savefig('./figures/outlier_identification.png', format='png')