Change in gears! Time for a math post.

I've been working through various online machine learning courses over the last eighteen months, beginning with the Stanford/Coursera ML course taught by Andrew Ng. It opens with three weeks on linear and logistic regression, covering the structure of the linear/logistic regression models, the specific loss functions involved in each, and how to minimize said functions via gradient descent.

The lectures are well executed, but I could have gone for more background on the loss functions, which are sort of handed down from above. Where did they come from? Why do they produce good regression coefficients? I think the answers to these questions are pretty neat—there's actually a straightforward statistical interpretation of what's happening.

## Problem Setting

We're presented with a collection of data points \((\mathbf x_1, \mathbf x_2, \dotsc, \mathbf x_n), \mathbf x_i \in \mathbb R^m\) and the corresponding value of some dependent variable \(y_i\) for each point. We'd like to model how \(y\) depends on \(\mathbf x\) in order to predict its value for previously-unseen \(\mathbf x\).

This setting is general enough to capture a wide variety of real-world problems. As a holiday-inspired example, perhaps we want to predict the density of pumpkin pie custard based on the volume of milk used (other ingredient quantities held constant) and the brand (one of *Annie's Animals*, *Bob's Barnyard*, or *Chuck's Cowpasture*). Dairy brand is not a continuous variable, so we'll encode it as three indicator variables which assume the value one (for custards using a particular brand) or zero (for custards using the other two brands). So! Each vector \(\mathbf x_i \in \mathbb R^4\) will represent a single custard, with the first element \(x_i^1\) representing the volume of milk used in liters and the remaining three elements representing the milk brand. Each \(y_i\) will represent the density of the set custard in grams per liter. If we record these values every time we make custard, we'll soon have lots of data that we can use to predict the properties of future custards.

Data in hand, we're going to model the density of our custard by assuming that density is a linear function of our input variables. Our model will have the form

\[y = w^0 + x^1 w^1 + \dotsb + x^4 w^4 \]

where \(w^0\) is a bias variable representing the “default” density in the absence of any data. Our task, then, is to find a weight vector \(\mathbf w\) such that for each \(\mathbf x_i\), the vector product \(\mathbf x_i \mathbf w^T\) is close to \(y_i\). (You can think of each \(\mathbf x\) as including an element \(x^0 = 1\) for the sole purpose of being multiplied by the bias. Note that the superscripts here indicate element indices rather than exponents.)

Just for fun, we're also going to model a discrete dependent variable: whether the surface of our custard cracks during baking. Let \(z_i \in \{0, 1\}\) indicate whether the i-th custard was cracked (\(1\)) or smooth (\(0\)). Predicting the value of \(z\) directly using a weighted sum as we did for density is a little bit awkward. But we can adapt that model by instead predicting the log-odds of a crack. If \(h(\mathbf x, \mathbf w)\) is the hypothesized probability of cracking, our second model will look like

\[\log \frac {h(\mathbf x, \mathbf w)} {1-h(\mathbf x, \mathbf w)} = \mathbf x \mathbf w^T \]

or equivalently

\[h(\mathbf x, \mathbf w) = \frac 1 {1 + e^{- \mathbf x \mathbf w^T}} \]

The model should yield probabilities closer to one when \(z_i = 1\) and closer to zero otherwise. Equivalently, when \(z_i = 1\), we want \(\mathbf x_i \mathbf w^T > 0\).

To differentiate between the weights for the density model and the weights for the crack-probability model, let's call them \(\mathbf w_y\) and \(\mathbf w_z\) respectively. To recap, we want to find \(\mathbf w_y\) and \(\mathbf w_z\) such that \(\mathbf x \mathbf w_y^T\) predicts density and \(\mathbf x \mathbf w_z^T\) predicts the log-odds of a crack. These are examples of linear regression and logistic regression models respectively.

## Selecting a Model

In reality, our models won't be perfect. A density model that's both linear and accurate requires a linear relationship between density and milk volume which obviously doesn't hold in real life. (Think about what happens as our milk volume approaches infinity!) The entire premise of our model is arguably flawed, but we're going to hope that the density-volume relationship is *locally* linear in the neighborhood of reasonable custard recipes, and try to discover the coefficients governing that local relationship.

Given that we're not going to nail it, we need a way to choose among imperfect models.

We do this by defining and then optimizing a **loss function**, also known as a **cost function**, which takes in our data and a set of weights and tells us how well we've modeled the data. You can think of a loss function as measuring how many mistakes a model makes: bad mistakes are expensive and drive up loss, while better models that make smaller mistakes result in low loss. Our goal is to minimize the loss function with regard to the weights.

The linear regression loss function looks like this:

\[\mathcal L_y(X, \mathbf y, \mathbf w) = \sum_{i=1}^n (\mathbf x_i \mathbf w^T - y_i)^2 = \| X \mathbf w^T- \mathbf y \|_2^2 \]

…where \(X\) is the matrix created by stacking the row-vector data points \(\mathbf x_i\) on top of each other, \(\mathbf y\) is the column vector comprised of individual density measurements \(y_i\), and \(\| \mathbf v \|_2^2\) denotes the squared L2 norm of some vector \(\mathbf v\). Looking at the middle of the equation, it's clear that the loss is simply the sum of squared distances between the model's prediction and the observed value (over all prediction-observation pairs). On the right-hand side, this is expressed using a matrix product.

The logistic regression loss function looks like this:

\[\begin{equation} \begin{split} \mathcal L_z(X, \mathbf z, \mathbf w) & = -\sum_{i=1}^n \bigl[ z_i \log h(\mathbf x_i, \mathbf w) + (1 - z_i) \log (1 - h(\mathbf x_i, \mathbf w)) \bigr] \\& = -\bigl\| \mathbf z \log h(X, \mathbf w)^T + (1 - \mathbf z) \log (1 - h(X, \mathbf w))^T \bigr\|_1 \end{split} \end{equation} \]

where \(h(X, \mathbf w)\) is our crack probability function from before, applied to all data points at once to produce a vector of results. Note that \(z_i\) and \(1 - z_i\) are opposites in the sense that the former is equal to zero when the latter is equal to one and vice vera. That means that the logistic loss is just a matter of calculating how “off” the model's probability estimate was, taking the negative log of (one minus off-ness), and summing the resulting values over all data points. Large mistakes are penalized more heavily than small mistakes, as in linear regression.

How can we find weights that minimize these loss functions? Conveniently, both functions are differentiable and convex with regard to the weights, which makes them excellent candidates for optimization algorithms like gradient descent. In fact, we can be even more efficient in the case of linear regression, which admits a closed form solution.

Today, though, we're not focused on minimizing loss. We want to know where those loss functions came from in the first place.

## Loss Function Derivation

Setting the loss functions aside for a moment, let's think about our models from a statistical perspective.

One nice feature of our logistic regression model is that it gives us probabilities rather than an inflexible yes/no answer. It might give one custard a 5% chance of developing cracks, while a second case might be a toss-up. In comparison, our linear regression model is quite rigid in its density estimations. We can make it more flexible by assuming that observed densities include some normally-distributed error. We'll say that each \(y_i\) includes an error term \(\varepsilon\), independently drawn from a normal distribution with zero mean and unknown variance \(\sigma^2\) (which is common to all data points). Symbolically:

\[ y_i \sim \mathcal N(\mathbf x_i \mathbf w_y^T, \sigma^2) \]

We now have a way to express the probability of individual observations \(y_i\) and \(z_i\) under weights \(\mathbf w_y\) and \(\mathbf w_z\). Multiplying those together, we can calculate \(P(\mathbf y | \mathbf w_y)\) and \(P(\mathbf z | \mathbf w_z)\), the probabilities of the entire series of observations conditioned on the weights.

What happens if we try to maximize those probabilities?

\[ \begin{equation} \begin{split} \mathbf w_y^* & = \operatorname*{\arg\!\max}_{\mathbf w_y} P(\mathbf y | \mathbf w_y) \\& = \operatorname*{\arg\!\max}_{\mathbf w_y} \prod_{i=1}^n P(y_i | \mathbf w_y) \\& = \operatorname*{\arg\!\max}_{\mathbf w_y} \prod_{i=1}^n \Biggl[ \frac 1 {\sqrt{2\pi\sigma^2}} \,\exp\Biggl(\frac {-(\mathbf x_i \mathbf w_y^T - y_i)^2} {2\sigma^2}\Biggr) \Biggr] \end{split} \end{equation} \]

Because probabilities are nonnegative, we can take the log of the entire right-hand side of the equation. Maximizing the log of probability is equivalent to maximizing probability and the resulting equation is easier to work with.

\[ \begin{equation} \begin{split} \mathbf w_y^* & = \operatorname*{\arg\!\max}_{\mathbf w_y} \sum_{i=1}^n \Biggl[ - \log \sqrt{2\pi\sigma^2} - \frac {(\mathbf x_i \mathbf w_y^T - y_i)^2} {2\sigma^2} \Biggr] \\& = \operatorname*{\arg\!\min}_{\mathbf w_y} \sum_{i=1}^n \Biggl[ \log \sqrt{2\pi\sigma^2} + \frac {(\mathbf x_i \mathbf w_y^T - y_i)^2} {2\sigma^2} \Biggr] \\& = \operatorname*{\arg\!\min}_{\mathbf w_y} \sum_{i=1}^n (\mathbf x_i \mathbf w_y^T - y_i)^2 \end{split} \end{equation} \]

Hey, this is the same as minimizing our loss function! Let's see if the same thing happens with the second model.

\[ \begin{equation} \begin{split} \mathbf w_z^* & = \operatorname*{\arg\!\max}_{\mathbf w_z} P(\mathbf z | \mathbf w_z) \\& = \operatorname*{\arg\!\max}_{\mathbf w_z} \prod_{i=1}^n \bigl[ z_i h(\mathbf x_i, \mathbf w_z) + (1 - z_i)(1 - h(\mathbf x_i, \mathbf w_z)) \bigr] \\& = \operatorname*{\arg\!\max}_{\mathbf w_z} \sum_{i=1}^n \log \bigl[ z_i h(\mathbf x_i, \mathbf w_z) + (1 - z_i)(1 - h(\mathbf x_i, \mathbf w_z)) \bigr] \\& = \operatorname*{\arg\!\max}_{\mathbf w_z} \sum_{i=1}^n \bigl[ z_i \log h(\mathbf x_i, \mathbf w_z) + (1 - z_i) \log(1 - h(\mathbf x_i, \mathbf w_z)) \bigr] \\& = \operatorname*{\arg\!\min}_{\mathbf w_z} - \sum_{i=1}^n \bigl[ z_i \log h(\mathbf x_i, \mathbf w_z) + (1 - z_i) \log(1 - h(\mathbf x_i, \mathbf w_z)) \bigr] \end{split} \end{equation} \]

Nice. It's now clear that our loss functions have a firm basis in statistics: when we minimize the loss functions, we're actually choosing weights that maximize the probability of our observed results. This is known as the **maximum-likelihood estimate (MLE)** of the weight values.

## Regularized Regression

In scenarios with a large number of features (that is, high-dimensional data points \(\mathbf x\)) and therefore a large number of parameters, but relatively few data points, a common problem is **overfitting**. This means that the model picks up on details in the training data that are not representative of the distribution from which the data was generated. The model will be quite accurate given points from the training data, but may fail to generalize to previously-unseen inputs.

As an extreme example, consider what happens when there are more model parameters than data points. If we use only three custards to train our density-estimation model, each made with a quarter-liter of milk from one of the three brands, our input data will look like this:

\[ X = \begin{bmatrix} 0.25 & 1 & 0 & 0 \\ 0.25 & 0 & 1 & 0 \\ 0.25 & 0 & 0 & 1 \end{bmatrix} \]

…and we can perfectly match that data by setting the brand weights to the observed densities, i.e. \(\mathbf w_y^* = (0, y_1, y_2, y_3)\). This model is terrible—it says that custard density is completely independent of milk volume—but we wouldn't know it based on the stellar training-data performance.

Even with more data points than parameters, it's possible for regression models to fit too closely to the training data. The problem in a nutshell is that lots of parameters can provide too much modeling flexibility. One way to counteract this is with **regularization**, which penalizes complicated models having heavier weights in favor of simpler ones, all else being equal. We can express a preference for simpler models by adding a penalty term to our loss functions. Many penalty terms are possible; our penalty term will be based on the L2 norm of the weight vector, a strategy that's variously known as L2 regularization, Tikhonov regularization, or ridge regression.

\[ \begin{align} \mathcal L_{y\_reg}(X, \mathbf y, \mathbf w, \lambda) & = \mathcal L_y(X, \mathbf y, \mathbf w) + \lambda \| \mathbf w \|_2^2 \\ \mathcal L_{z\_reg}(X, \mathbf z, \mathbf w, \lambda) & = \mathcal L_z(X, \mathbf z, \mathbf w) + \lambda \| \mathbf w \|_2^2 \end{align} \]

By penalizing larger weights based on the norm of the weight vector, we require the model to make a tradeoff between simplicity and training accuracy. The nonnegative hyperparameter \(\lambda\) dictates the “exchange rate” between the two, with larger values favoring simplicity and smaller values favoring accuracy. Values approaching infinity result in a very simple model with zero weight, while \(\lambda = 0\) returns us to unregularized regression.

But where does regularization fit in our statistical interpretation of regression? Glad you asked.

Previously, we chose weights to maximize \(P(\mathbf y | \mathbf w_y)\) and \(P(\mathbf z | \mathbf w_z)\), the observation probabilities conditioned on weights. As an alternative, we can try to maximize \(P(\mathbf w_y | \mathbf y)\) and \(P(\mathbf w_z | \mathbf z)\): Out of all the weights that could have produced the data we observed, we want the most probable weights. We can relate these new probabilities to the previous ones using Bayes' law.

\[\begin{align} \mathbf w_y^{**} = \operatorname*{\arg\!\max}_{\mathbf w_y} P(\mathbf w_y | \mathbf y) = \operatorname*{\arg\!\max}_{\mathbf w_y} \frac {P(\mathbf y | \mathbf w_y)P(\mathbf w_y)} {P(\mathbf y)} \end{align}\]

The observation probability \(P(\mathbf y)\) is independent of the weights and therefore plays no role in optimization. Note also that when \(P(\mathbf w_y)\) is constant—when we assume all weights are equally likely—we have \(\mathbf w_y^{**} = \mathbf w_y^* \) and we're back to unregularized regression again.

In contrast, since regularization prefers smaller weights, it can be viewed as an assumption about the prior probability distribution of the weights \(P(\mathbf w_y)\) where smaller weights are more probable. The most probable weight vector taking prior probability into account is known as the **maximum a posteriori (MAP) estimate**.

A regularization term of \(\lambda \| \mathbf w \|_2^2\) translates to a prior proportional to \(e^{- \lambda \| \mathbf w \|_2^2}\). This is just a multivariate normal distribution with zero mean and variance \((2\lambda)^{-1} I\), which makes sense—as we increase \(\lambda\), we're narrowing the prior weight distribution that's centered around zero.

And with that, we're done! We've got a compelling statistical justification for both linear and logistic regression, with or without regularization. None of this background is strictly necessary for performing regression analysis, but it's nice to have some extra intuition for what's going on.