Ethan Kuehl
2 min read · Jan 20, 2021

Over the past couple of weeks, I have begun my journey into machine learning with linear and logistic regression. While I have only dipped my toes in the water so far, I have already realized how much outliers can affect the predictive ability of a regression. One useful metric I discovered for identifying pesky outliers is the studentized residual (which I also call the leveraged residual, since the scaling accounts for each observation's leverage). A studentized residual is calculated by dividing a residual by an estimate of its standard error. Observations whose studentized residuals are greater than 3 in absolute value should draw a wary eye, and observations beyond 6 in absolute value should definitely be considered candidates for removal.
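To make the "residual divided by its standard error" idea concrete, here is a minimal sketch for the simplest case of a single predictor, written in plain Python. The function name studentized_residuals and the toy data are my own for illustration and are not part of any library. It computes the externally studentized form, where the error variance is re-estimated with each observation left out; as far as I know, that is the variant statsmodels reports.

```python
import math


def studentized_residuals(x, y):
    """Externally studentized residuals for a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Raw residuals and their sum of squares
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)

    result = []
    for xi, e in zip(x, resid):
        # Leverage of this observation in simple linear regression
        h = 1 / n + (xi - x_bar) ** 2 / sxx
        # Error variance re-estimated with this observation left out
        s2_loo = (sse - e ** 2 / (1 - h)) / (n - p - 1)
        # Residual divided by its estimated standard error
        result.append(e / math.sqrt(s2_loo * (1 - h)))
    return result


# Toy data (made up): roughly y = 2x + 1, with one suspicious point at x = 6
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 20.0, 15.0, 16.9]
t = studentized_residuals(x, y)
worst = max(range(len(t)), key=lambda i: abs(t[i]))
print(worst, round(t[worst], 2))
```

The point at x = 6 should stand out far above the rest, because leaving it out shrinks the estimated error variance dramatically.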

Thankfully, we do not have to calculate this on our own. The results object returned by fitting a statsmodels OLS model (OLSResults) has an outlier_test() method that finds the studentized residuals for us. When the model is fit on pandas data, it returns a DataFrame whose student_resid column holds the studentized residual for each observation. Outlier identification becomes a breeze with outlier_test()!

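import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# X (pandas DataFrame of predictors) and y_log (log of the target price)
# come from earlier data preparation that is not shown here
# Predictors span the entire set of observations
X_train_sm = sm.add_constant(X, prepend=True)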

# Instantiating and fitting statsmodels ordinary least squares
lr_model = sm.OLS(y_log, X_train_sm).fit()
# Performing statsmodels outliers test, calculates leveraged (AKA studentized) residuals
outliers = lr_model.outlier_test()

# Gathering 5 largest outliers according to leveraged residuals
outliers['student_resid'].abs().sort_values(ascending=False).head(n=5)

# Calculating predicted prices for ordinary least squares
sm_pred_prices = np.exp(lr_model.predict(X_train_sm))
# Plotting leveraged residuals versus the predicted prices
student_res = outliers['student_resid']

# Most data points are concentrated between -6 and 6 studentized residuals
# Outliers are thus greater than 6 or less than -6
plt.axhline(y=-6, color='red')
plt.axhline(y=6, color='red')
plt.scatter(x=sm_pred_prices, y=student_res)

plt.title('Outlier Identification')
plt.ylabel('Studentized Residuals')
plt.xlabel('Predicted Price')

plt.savefig('./figures/outlier_identification.png', format='png')
