Borderline ML

Lasso Regression, Simply Explained

Let's pretend for a moment, dear reader, that you are hired at Best Real Estate Company, Inc. as their first data scientist. They are looking to use their massive database to build a model that predicts the price of a home given its features. How would you approach this? Would you build a neural network? Would you ask ChatGPT to do it? Well, I suppose you could, but I'd recommend you start with something a little more straightforward: perhaps something like linear regression.

What is linear regression? Linear regression is a model that assigns a weight to every input variable such that the sum of the input variables multiplied by their respective weights adds up, as closely as possible, to the target variable. For example, I could fit a linear regression model to predict the price of a home by finding proper weights for things like the number of rooms, number of bathrooms, square footage, lot size, etc. A linear regression model attempts to find a weight for each variable such that the error between predicted and actual values is minimized. So, with this example, the final model might say that: (40 * sqft) + (50 * number of rooms) + (60 * lot size) + (100 * school district rating) = Predicted Price of Home ($). Here, those values (40, 50, 60, and 100) are the weights applied to each of those variables such that the output is as accurate as possible across all items in our dataset.
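
If you'd like to see what that looks like in code, here's a minimal sketch using scikit-learn. The file name (homes.csv) and the column names are hypothetical stand-ins for whatever your real dataset actually contains:

```python
# A minimal sketch of fitting a linear regression with scikit-learn.
# The file name and column names here are hypothetical stand-ins for
# whatever your real housing dataset looks like.
import pandas as pd
from sklearn.linear_model import LinearRegression

homes = pd.read_csv("homes.csv")  # hypothetical training data
features = ["sqft", "rooms", "lot_size", "school_rating"]

model = LinearRegression()
model.fit(homes[features], homes["price"])

# Each coefficient is the learned weight for one feature, i.e. the
# 40, 50, 60, and 100 in the toy formula above.
for name, weight in zip(features, model.coef_):
    print(f"{name}: {weight:.2f}")
```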

Now, let's say that Best Real Estate Company has the greatest Data Engineering team on the planet (Woo! Go DEs!). They have been able to create a remarkable training dataset that has over 150 features for every home. This is far too many features, and when trained using all of them, the model can't seem to find weights for all 150+ variables that make realistic and accurate predictions. On our training dataset, our model predicts wonderfully. But on our testing dataset, the model goes haywire with its predictions. In short, we discover that we are overfitting our model.
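
One quick way to catch this for yourself is to hold out a test set and compare scores. A rough sketch, again assuming the same hypothetical homes.csv as above:

```python
# Rough sketch of spotting overfitting: compare how the model scores on
# the data it trained on vs. data it has never seen.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

homes = pd.read_csv("homes.csv")        # hypothetical dataset
X = homes.drop(columns=["price"])       # all 150+ features
y = homes["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))  # looks wonderful
print("test R^2: ", model.score(X_test, y_test))    # goes haywire
```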

How can we reduce the dimensionality of this dataset to help our model learn and generalize better? Do we look at all 150+ features and try to decide which features we think would be the most predictive?

One method we could use to solve this conundrum would be to use Lasso Regression. Lasso, in this case, stands for Least Absolute Shrinkage and Selection Operator. This is great, but big words are always hard for me. In plain English, Lasso regression is a version of linear regression that, on top of minimizing prediction error, also penalizes the sum of the absolute values of the weights, and it is allowed to shrink a weight all the way down to exactly 0. What does this mean? Referring back to our small example above: (40 * sqft) + (50 * number of rooms) + (60 * lot size) + (100 * school district rating) = Predicted Price of Home ($). Lasso regression is going to try to keep 40 + 50 + 60 + 100 (technically, the sum of their absolute values) as small as it can, even if it has to push some of those values all the way to 0, while still maintaining a solid prediction.
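
Here's what that looks like as a sketch with scikit-learn's Lasso. The alpha knob controls how hard the weights get squeezed toward zero; the file and columns are the same hypothetical dataset as before:

```python
# Minimal sketch of Lasso regression. Roughly, it minimizes:
#   (prediction error) + alpha * (sum of the absolute values of the weights)
# so a bigger alpha pushes more weights all the way down to exactly 0.
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

homes = pd.read_csv("homes.csv")        # hypothetical dataset
X = homes.drop(columns=["price"])       # all 150+ features
y = homes["price"]

# Scaling first matters: the penalty compares weights against each other,
# so the features need to be on comparable scales.
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
model.fit(X, y)

lasso = model.named_steps["lasso"]
kept = [name for name, w in zip(X.columns, lasso.coef_) if w != 0]
print(f"Lasso kept {len(kept)} of {X.shape[1]} features:", kept)
```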

This will allow our model to, in a sense, be a cowboy and say:

I've wrangled up all of the bad variables with my lasso (get it?). Things like plumbing material, sprinkler head type, and the size of the nails used to frame the home don't matter when predicting the price of a home. I've also lasso'ed up all of the best variables: square footage, amount of grass in the backyard, and number of bedrooms really matter when predicting the price of a home. *Dramatically tips hat and rides off into the sunset*

By using Lasso Regression instead of regular Linear Regression, we can now understand what the model is finding in terms of important features, thereby reducing the dimensionality (i.e. number of variables) in our dataset. This will hopefully allow our model to learn better as well as enable us to create more efficient training and prediction pipelines that use less memory. Huge, right? With one simple algorithm, we have trimmed our feature set, learned which variables actually drive the price of a home, and set ourselves up for leaner, faster pipelines.
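
And if you want to physically shrink the dataset that your downstream pipelines have to haul around, one way to sketch it is scikit-learn's SelectFromModel wrapped around that same Lasso (same hypothetical data as above):

```python
# Sketch of using Lasso as a feature selector, so later training and
# prediction steps only ever touch the columns that survived.
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

homes = pd.read_csv("homes.csv")        # hypothetical dataset
X = homes.drop(columns=["price"])
y = homes["price"]

# Scale, fit a Lasso, and keep only the columns with non-zero weights.
X_scaled = StandardScaler().fit_transform(X)
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_scaled, y)
X_small = selector.transform(X_scaled)

print("went from", X.shape[1], "features down to", X_small.shape[1])
```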

So put your cowboy hat and boots on. Get a big belt buckle, and go use Lasso Regression to solve your next regression problem, cowpoke!

#computers