Looking at this nap time model, *Beta 0* is the **intercept**, the value the target takes when all features are zero.

The remaining betas are the unknown **coefficients** which, along with the intercept, are the missing pieces of the model. You can observe the result of combining the different features, but you don’t know all the details about the impact of each feature on the target.

Once you determine the value of each coefficient, you know the direction, positive or negative, and the magnitude of the impact of each feature on the target.

With a linear model, you assume that all the features are independent of each other; for example, the fact that you received a delivery has no impact on the number of treats your dog receives per day.

Additionally, you assume there is a linear relationship between each feature and the target.

So, on days when you play with your dog more, he will be more tired and want to nap longer. Or, on days when there are no squirrels outside, your dog won’t need to nap as much, because he hasn’t expended so much energy staying alert and watching the squirrels’ every move.

## How long will your dog take a nap tomorrow?

With the general idea of the model in mind, you collected data for a few days. You now have real observations of the features and target of your model.

But there are still some critical pieces missing: the values of the coefficients and the intercept.

One of the most common methods for finding the coefficients of a linear model is ordinary least squares.

The premise of ordinary least squares (OLS) is that you will choose the coefficients that minimize the **residual sum of squares**, that is, the total squared difference between your predictions and the observed data[1].
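Written out, the residual sum of squares is simply the squared differences between each observed value and its prediction, summed over all observations:

```latex
\text{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad \text{where } \hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```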

With the residual sum of squares, not all residuals are treated equally. You want to penalize most heavily the times when the model generated predictions that fell far from the observed values.

What matters is not so much whether a prediction is above or below the observed value, but the magnitude of the error. By squaring the residuals, you penalize predictions that are far off while making sure you only deal with positive values.

That way, when the RSS is zero, it truly means the predictions and observed values are equal, and not just a byproduct of positive and negative errors canceling out.

In Python, you can use scikit-learn to fit a linear model to the data using ordinary least squares.

Since you want to test the model with data it has not been trained on, you want to set aside a percentage of your original dataset as a test set. In this case, 20% of the original dataset is set aside at random for testing.
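A minimal sketch of this workflow with scikit-learn; the data here is synthetic stand-in data, since the article’s original dataset isn’t available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the nap dataset: 4 features
# (e.g. play time, squirrels, treats, deliveries)
features = rng.normal(size=(100, 4))
target = 60 + features @ np.array([10.0, 5.0, 2.0, 1.0]) + rng.normal(scale=2.0, size=100)

# Hold out 20% of the data at random as a test set
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training set only
model = LinearRegression().fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)  # one per feature, in dataset order
```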

After fitting a linear model to the training set, you can inspect its parameters.

The coefficients and the intercept are the last pieces you need to define your model and make predictions. The coefficients in the output array follow the order of the features in the dataset, so your model can be written as follows:

It is also useful to calculate some metrics to assess the quality of the model.

The R-squared, also called the coefficient of determination, gives an idea of how well the model describes the training data, with values ranging from 0 to 1. It represents the proportion of the target’s variability that is explained by the features[1].

For example, if you fit a linear model to the data but there is no linear relationship between the target and the features, the R-squared will be very close to zero.
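You can see this with a quick synthetic experiment (an illustration, not from the original article), where the features are pure noise with no relationship to the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Features with no relationship whatsoever to the target
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

model = LinearRegression().fit(X, y)
# With no linear relationship, the features explain almost nothing
print("R-squared:", model.score(X, y))
```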

Bias and variance are measures that help balance the two sources of error that a model can have:

- The bias relates to the training error, that is, the error of the predictions on the training set.
- The variance relates to the generalization error, the error of the predictions on the test set.

This linear model has a relatively high variance. Let’s use regularization to reduce the variance while trying to keep the bias as low as possible.

## Regularization of the model

Regularization is a set of techniques that improve a linear model in terms of:

- Prediction accuracy, by reducing the variance of the model’s predictions.
- Interpretability, by *shrinking* or reducing to zero the coefficients that are not as relevant to the model[2].

With ordinary least squares, you want to minimize the residual sum of squares (RSS).

But, in a regularized version of ordinary least squares, you want to *shrink* some of its coefficients to reduce the overall variance of the model. You do this by applying a penalty to the residual sum of squares[1].

In the *regularized* version of OLS, you try to find the coefficients that minimize:
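In general form, the regularized objective is the RSS from before plus a penalty term scaled by the tuning parameter *lambda* (the exact penalty depends on the technique, as described below):

```latex
\underset{\beta}{\text{minimize}}
\;\;
\underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}}
\;+\;
\underbrace{\lambda \cdot \text{penalty}(\beta_1, \dots, \beta_p)}_{\text{shrinkage penalty}}
```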

The *shrinkage penalty* is the product of a tuning parameter and the regression coefficients, so it becomes smaller as the coefficient portion of the penalty decreases. The tuning parameter controls the impact of the *shrinkage penalty* on the residual sum of squares.

The *shrinkage penalty* is never applied to beta 0, the intercept, because you only want to control the effect of the coefficients on the features, and the intercept does not have an associated feature. If all features have a coefficient of zero, the target will be equal to the value of the intercept.

There are two different regularization techniques that can be applied to OLS:

## Ridge Regression

Ridge regression penalizes the sum of the squares of the coefficients.
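So the Ridge objective takes the general form above with a squared-coefficient penalty:

```latex
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\;+\;
\lambda \sum_{j=1}^{p} \beta_j^2
```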

This penalty is also called the L2 norm because, as the tuning parameter *lambda* increases, the L2 norm of the coefficient vector always decreases.

Even though it *shrinks* each coefficient of the model in the same proportion, Ridge regression will never actually *shrink* any of them to zero.

The very aspect which makes this regularization more stable is also one of its drawbacks. You end up reducing the variance of the model, but the model retains its original level of complexity because none of the coefficients have been reduced to zero.

You can fit a model with Ridge Regression by running the following code.

`fit_model(features, targets, type="Ridge")`

Here *lambda*, i.e. alpha in scikit-learn, has been arbitrarily set to 0.5, but in the next section you will go through the process of tuning this parameter.
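Under the hood, a helper like `fit_model` might wrap scikit-learn’s `Ridge`; a minimal sketch on synthetic stand-in data, with alpha set to 0.5 as in the text:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (the original dataset isn't available)
features = rng.normal(size=(100, 4))
target = 60 + features @ np.array([10.0, 5.0, 2.0, 1.0]) + rng.normal(scale=2.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# alpha in scikit-learn plays the role of lambda in the text
ridge = Ridge(alpha=0.5).fit(X_train, y_train)

print("Intercept:", ridge.intercept_)
print("Coefficients:", ridge.coef_)  # shrunk toward zero, but none exactly zero
```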

Based on the output of the Ridge regression, your dog’s nap duration can be modeled as follows:

By looking at other metrics of the model, such as the R-squared, bias, and variance, you can see that all of them are lower than in the OLS output.

Ridge regression has been very effective at *shrinking* the value of the coefficients and, as a result, the variance of the model has been reduced considerably.

However, the complexity and interpretability of the model remained the same. You still have four features that impact the length of your dog’s daily nap.

Let’s turn to Lasso and see how it works.

## Lasso

Lasso is the abbreviation of *Least Absolute Shrinkage and Selection Operator* [2], and it penalizes the sum of the absolute values of the coefficients.

This is very similar to Ridge regression but, instead of the L2 norm, it uses the L1 norm as part of the *shrinkage penalty*. This is why Lasso is also called L1 regularization.

What’s powerful about Lasso is that it can *shrink* some of the coefficients all the way to zero, thus reducing both the variance and the complexity of the model.

The Lasso uses a technique called soft thresholding[1]. It *shrinks* each coefficient by a constant amount, such that when a coefficient’s value is below the *shrinkage constant*, it is reduced to zero.
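In its simplest form, with standardized features, the soft-thresholding operator can be written as (the subscript plus denotes the positive part, i.e. the result is zero whenever the quantity in parentheses is negative):

```latex
\hat{\beta}_j = \operatorname{sign}(\beta_j)\,\big( |\beta_j| - \lambda \big)_{+}
```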

Again, with an arbitrary *lambda* of 0.5, you can fit the Lasso to the data.
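A minimal sketch with scikit-learn’s `Lasso`, again on synthetic stand-in data where one feature (playing the role of the squirrels) has no real effect on the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Synthetic stand-in data: the last feature has no real effect on the target
X = rng.normal(size=(100, 4))
y = 60 + X @ np.array([10.0, 5.0, 2.0, 0.0]) + rng.normal(scale=1.0, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)

# The irrelevant feature's coefficient is driven to exactly zero
print("Coefficients:", lasso.coef_)
```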

In this case, you can see the *squirrels* feature was removed from the model because its coefficient is zero.

With Lasso, the duration of your dog’s nap can be described by a model with three features:

The advantage here over Ridge regression is that you end up with a more interpretable model, because it has fewer features.

Going from four to three features isn’t a big deal in terms of interpretability, but you can see how extremely useful it could be in datasets with hundreds of features!

## Find your optimal lambda

So far, the *lambda* you used to see Ridge regression and Lasso in action was completely arbitrary. But there is a way to tune the value of *lambda* to ensure that you reduce the overall variance of the model.

If you plot the mean squared error against a continuous range of *lambda* values, you can use the *elbow technique* to find the optimal value.
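A sketch of that sweep on synthetic stand-in data, recording the test-set MSE for each candidate *lambda* and keeping the best one:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data
X = rng.normal(size=(200, 4))
y = 60 + X @ np.array([10.0, 5.0, 2.0, 0.0]) + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Sweep a range of lambda (alpha) values and record the test MSE for each
alphas = np.linspace(0.01, 2.0, 50)
mses = [
    mean_squared_error(y_test, Lasso(alpha=a).fit(X_train, y_train).predict(X_test))
    for a in alphas
]

best_alpha = alphas[int(np.argmin(mses))]
print("Best lambda:", best_alpha)
```

In practice, scikit-learn’s `LassoCV` and `RidgeCV` automate this kind of search with cross-validation instead of a single held-out test set.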

**Note:** The content and images used in this article are rewritten and sourced from thenextweb.com