Stochastic Gradient Descent Algorithm Tutorial

 Stochastic Gradient Descent Algorithm Tutorial

Algorithm

Here we will calculate the heuristic function, which is basically an approximation of output value.

y h x i

And the output value is a multi-variable linear function of x.

y = f x

For example f(x) can be

f ( x ) = x 2

Linear regression with stochastic gradient descent

Below is the tested code for Gradient Descent Algorithm. I have designed this code based on Andrew Ng's Notes and lecture. This code follows linear regression model of iterating till convergence is achieved. The code here tries to minimize the error by iteratively updating the coefficient of the equation

Algorithm

In each iteration, We need to perform 3 operations.

  1. Calculate the heuristic.

The heuristic function is an approximation of the output value. The closer it is to the output value(y) the better it is.

h x i = j = 0 n θ j x j
  1. Calculate Error Value.

The difference between the heuristic and the actual value is the error value. Our aim is to minimize this error.

δ i = y i - h x i
  1. Adjust the co-efficient values.

Using the actual value, heuristic and learning rate we adjust all the values of the coefficient of variables.

θ j := θ j + α δ i x i j

Repeat these 3 steps till convergence is achieved.


while (J(y, h) > 0.1){

    for (int i = 1; i < y.length; i++) {

        //1. calculate heuristic
        h[i] = c[0];
        for (int j = 1; j < c.length; j++) {
            h[i] += c[j] * x[i][j-1];
        }

        //2. Calculate Error
        double delta = (y[i] - h[i]);

        //3. update constants
        c[0] += a * delta;
        for (int j = 1; j < c.length; j++) {
            //update cj for xj
            c[j] += a * delta * x[i][j-1];                    
        }

    }

}

Learning rate

The learning rate (α) needs to be small. Here we are taking it 10-6. Since we have a smaller dataset. You can take it between 0.1 to 0.01 if you have a large dataset otherwise, it will take a lot to converge. If it's larger than this value, it reaches the bottom and misses and starts moving up again. If its smaller than this value, it tends to never converge.

α = 10 -6

Convergence

Now, let's talk about convergence. We want the δ value to become 0 for each dataset. So we try to minimize the sum of squares of errors

J θ = 1 2 i = 0 n δ i 2

Convergence will be achieved when J(θ) is very small or zero.

@Field def prevSum=0
//convergence test is true if the error is less than 0.1
def J(y, h){

    double sum = 0.0    
    //why we are minimizing the sum of squares
    for(i=0; i<y.length; i++){ //for each training set
        sum += (y[i] - h[i])**2
    }
    sum = sum/2

    //negligible difference between errors in two consecutive iterations
    //assign the error to be minimized to 0
    if((prevSum - sum).abs() < 10**-10) sum = 0;

    prevSum = sum //for taking difference in iteration    

    //converge if error
    return sum 
}

If J(θ) does not change over iterations, we must stop the iterations. That means the solution is not going to go further down.

Tutorial Code

The below performs supervised learning on a given test.csv file. You can change the code to add more features. I haved added 2 features.


import groovy.transform.Field

/**
* Given x values of training set and corresponding y values
*/  

def stochastic(xlist, ylist) {
    x = xlist
    y = ylist

    def a = 0.000001; //value of alpha
    def h = new double[y.length]; //heuristics same as y
    def c = new double[x[0].length + 1]; //constants initliazed as 0.0

    while (J(y, h) > 0.1){

        for (int i = 1; i < y.length; i++) {

            //1. calculate heuristic
            h[i] = c[0];
            for (int j = 1; j < c.length; j++) {
                h[i] += c[j] * x[i][j-1];
            }

            //2. Calculate Error
            double delta = (y[i] - h[i]);

            //3. update constants
            c[0] += a * delta;
            for (int j = 1; j < c.length; j++) {
                //update cj for xj
                c[j] += a * delta * x[i][j-1];                    
            }

        }

    }

    printCoefficients(c) 

}

@Field def prevSum=0
//convergence test is true if the error is less than 0.1
def J(y, h){

    // println h

    double sum = 0.0    
    //why we are minimizing the sum of squares
    for(i=0; i<y.length; i++){ //for each training set
        sum += (y[i] - h[i])**2
    }
    sum = sum/2

    //negligible difference between errors in two consecutive iterations
    //assign the error to be minized to 0
    if((prevSum - sum).abs() < 10**-10) sum = 0;

    prevSum = sum //for taking difference in iteration    
    println sum

    //converge if error
    return sum < 0.1
}

def printCoefficients(c){

    print("y = ${c[0].round(2)}")
    for (int j = 1; j < c.length; j++) {
        print(" + ${c[j].round(2)}*x${j}");
    }

}

def x = new double[201][2];
def y = new double[201];

int i=0;

new File("data.csv").eachLine{ line ->

   def tokens = line.split(",")

   if(tokens[0]?.trim()){

       x[i][0] = tokens[0] as double
       x[i][1] = tokens[1] as double

       y[i] = tokens[2] as double

       assert 2*x[i][0] + x[i][1] + 18   == y[i]
       i++
    }
}

stochastic(x, y);