Let’s first take a look at simple linear regression.

Let’s start with an open question: imagine you can get all the information you need. How would you predict the selling price of a house?

I believe you might consider factors such as:

  • House size
  • House location
  • Year the house was built
  • Other features.

General machine learning steps

If we can gather this information for many houses, then we are able to predict the selling price of a new house. Simple linear regression does exactly that: it takes the feature information of existing houses as input, and then tries to predict the price of another house.

There are a few general steps in a prediction task like this, including:

  • Feature engineering: find or create features that make the machine learning algorithm work.
  • Model selection: find the model that best explains the given problem.
  • Training: once you pick a model, train it so that it is customized for your problem.
  • Testing: check whether the model is over-customized (overfit) by evaluating it on data beyond the training set.
  • Prediction: use the trained model to make predictions for the problem you care about.

Let’s go back to the house price prediction problem. Assume we consider only one feature, the square footage of the house, and assume the function from size to price is y = ax + b.
The next question is how to determine a and b.
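As a quick sketch, the model is just a straight line. The function name predict and the numbers below are my own, used only for illustration:

```python
def predict(x, a, b):
    """Predicted price for a house of size x, using the line y = a*x + b."""
    return a * x + b

# With a = 0.2 (price per square foot) and b = 50 (base price), both made up:
print(predict(1500, a=0.2, b=50))  # -> 350.0
```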

Cost function

One principle that helps us determine a and b is the following:
If we define the prediction error for one example as (yi - f(xi))^2, then over the whole training data set, a and b should minimize SUM((yi - f(xi))^2). We call (yi - f(xi))^2 the cost function; it is used to gauge the cost of using a particular model.
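Here is a minimal sketch of that cost function in Python; the training data below is made up purely for illustration:

```python
# Made-up training data: house sizes (square feet) and prices (in $1000s).
train_x = [850, 900, 1200, 1500, 1800]
train_y = [200, 210, 270, 320, 370]

def cost(a, b, xs=train_x, ys=train_y):
    """Sum of squared prediction errors of the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# A line that fits the data reasonably well has a lower cost than one that doesn't.
print(cost(0.18, 50), "vs", cost(1.0, 0))
```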

That leads to the following questions:

  • What initial values should a and b take?
  • How should we change a and b once we know the cost of the current a and b?
  • How do we know whether we can find the optimal a and b?

To answer the first question: it does not matter which initial values of a and b we choose.

Gradient descent

For the second question: how should we move to reach the optimal destination?
Here we need the concept of the gradient: the gradient points in the direction of the greatest rate of increase of the function. This means we can find the min or max of a function by following the gradient direction. For example, take a single-variable convex or concave function. For a concave function, we can find the maximum through hill climbing, while for a convex function, we can find the minimum through hill descent. For multi-variable functions it is essentially the same, so we are able to find the min and max there as well.
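As a quick sketch, consider the convex function f(x) = x^2, whose gradient (derivative) is 2x; stepping against the gradient moves us toward the minimum at x = 0. The starting point and step size here are arbitrary choices of mine:

```python
x = 5.0          # arbitrary starting point
step_size = 0.1  # arbitrary fixed step size

for _ in range(50):
    gradient = 2 * x              # derivative of f(x) = x**2
    x = x - step_size * gradient  # move against the gradient (hill descent)

print(x)  # very close to 0, the minimum of f(x) = x**2
```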

When we do hill climbing or hill descent, one thing we need to consider is the step size: how far to move for the next step if we have not yet reached the max/min value.

There are two common approaches for choosing the step size (sketched below):

  • Fixed step size
  • Decreasing step size.
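A minimal sketch of the two options, where the decreasing schedule eta_t = eta_0 / (1 + t) is just one common choice I am assuming, not the only one:

```python
initial_step = 0.1

def fixed_step(t):
    """Same step size at every iteration t."""
    return initial_step

def decreasing_step(t):
    """Step size shrinks as the iteration count t grows."""
    return initial_step / (1 + t)

print([round(fixed_step(t), 4) for t in range(4)])       # [0.1, 0.1, 0.1, 0.1]
print([round(decreasing_step(t), 4) for t in range(4)])  # [0.1, 0.05, 0.0333, 0.025]
```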

For the third question: when should we stop moving? In theory, at convergence the gradient should be zero; in practice, we stop moving once the magnitude of the gradient falls below a small threshold.
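Putting the pieces together, here is a sketch of gradient descent for our line y = ax + b, stopping once the gradient magnitude drops below a threshold. The data, step size, and threshold are assumptions of mine for illustration; the gradient formulas are the partial derivatives of the squared-error cost:

```python
# Made-up training data: house sizes (in 1000s of sq ft) and prices (in $100,000s),
# scaled so that a fixed step size behaves well.
xs = [0.85, 0.90, 1.20, 1.50, 1.80]
ys = [2.00, 2.10, 2.70, 3.20, 3.70]

a, b = 0.0, 0.0      # initial values (the choice does not matter much)
step_size = 0.05     # fixed step size
threshold = 1e-6     # stop when the gradient is (almost) zero

for _ in range(100_000):  # safety cap on iterations
    # Partial derivatives of SUM((y - (a*x + b))**2) with respect to a and b.
    grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(xs, ys))
    grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys))

    if (grad_a ** 2 + grad_b ** 2) ** 0.5 < threshold:
        break  # gradient is effectively zero: we have converged

    a -= step_size * grad_a  # move against the gradient
    b -= step_size * grad_b

print(a, b)  # slope and intercept of the fitted line
```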

Then it becomes a math question:

  • How do we compute the gradient of a multi-variable function?
  • How do we compute the gradient with respect to a multi-dimensional vector?
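As a preview, for our two-variable cost function the gradient is simply the pair of partial derivatives with respect to a and b; this is a standard calculus result, written out below for the cost defined earlier (the same expressions used as grad_a and grad_b in the sketch above):

```latex
\frac{\partial}{\partial a} \sum_i \bigl(y_i - (a x_i + b)\bigr)^2
  = \sum_i -2\, x_i \bigl(y_i - (a x_i + b)\bigr),
\qquad
\frac{\partial}{\partial b} \sum_i \bigl(y_i - (a x_i + b)\bigr)^2
  = \sum_i -2 \bigl(y_i - (a x_i + b)\bigr).
```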

Later we will talk about all of the general steps, both mentioned and unmentioned, including the details of gradient descent.