# Get started with Machine Learning by building a simple project

Photo by Franck V. on Unsplash

Machine learning is an application of artificial intelligence that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. The image above shows how a machine can learn to play a musical instrument, given the correct training. This article covers some of the basic concepts, functions, and algorithms to help beginners get started with machine learning. So, let’s start our journey by building a simple classification project.

#### What are we building?

We are building a classification project to classify three species of Iris flowers, namely Iris setosa (1), Iris versicolor (2), and Iris virginica (3). You can look them up to get a feel for what we are classifying.

We have been provided with a dataset of 150 training examples (i.e., 50 for each species) with 4 features for each flower. The features of the flowers are as follows:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm

We have to train our algorithm in such a way that it can classify any new example into one of the 3 species. You can have a look at the dataset provided in the Iris_flower.txt file. I hope you now have an idea of what we are going to build.

#### Some prerequisites before we start

You need Octave or MATLAB to implement this project. I recommend Octave because it is free and well suited to this project. You can download it from the GNU Octave website.

Secondly, you need a basic understanding of matrices, Octave commands, and a few algorithms. With that said, let’s dive into the fun stuff… yeah, the code!

#### Files included in the project

We are going to create a handful of files (don’t worry, it’s quite easy) for the project. Let’s list them out to keep everything structured:

1. Iris_main.m
2. lrCostFunction.m
3. oneVsAll.m
4. predictOneVsAll.m
5. sigmoid.m
6. Iris_flower.txt

#### Iris_flower.txt

Let’s begin by looking at the dataset given to us:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

You can see there are 4 features (already mentioned above) for each training example, separated by commas, followed by the name of the respective species. Here the file uses names for the flowers, but in practice we are going to replace them with the numbers 1, 2, and 3 respectively.
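For illustration, here is a hypothetical Python sketch of that relabeling step (the name-to-number mapping is the one described above; the helper function and its name are my own, not part of the Octave project):

```python
# Map the species names from iris.data to the numeric labels 1, 2, 3.
label_map = {
    "Iris-setosa": "1",
    "Iris-versicolor": "2",
    "Iris-virginica": "3",
}

def relabel_line(line):
    """Replace the species name at the end of a CSV line with its number."""
    parts = line.strip().split(",")
    if parts == [""]:
        return None  # skip blank lines
    parts[-1] = label_map[parts[-1]]
    return ",".join(parts)

print(relabel_line("5.1,3.5,1.4,0.2,Iris-setosa"))  # -> 5.1,3.5,1.4,0.2,1
```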

#### Iris_main.m

This file is the skeleton of our project because it contains all the function calls. Let’s look at the file and try to understand the full code from scratch.

```matlab
% This is the iris flower detection project.
% We are given to identify 3 types of flowers
% based on a training set.
% The functions included are:
%   lrCostFunction.m (logistic regression cost function)
%   oneVsAll.m
%   predictOneVsAll.m
%   sigmoid.m
%   fmincg.m
clear; close all; clc;

% Load the data from Iris_flower.txt
fprintf('loading the data...\n');

% The whole dataset is stored in a matrix 'data'
data = load('Iris_flower.txt'); % data is a 150x5 matrix

% Store all the features of each training example in the
% matrix X, and the corresponding labels in the vector y
X = data(:, 1:4); % 150x4 matrix
y = data(:, 5);   % 150x1 vector
fprintf('Program paused. Press enter to continue.\n');
pause;

% Note: we add a column of ones to X to balance
% the bias term in theta.
X = [ones(size(X, 1), 1) X];
fprintf('Running oneVsAll algorithm...\n');

% The number of classes is the number of flower species
num_of_classes = 3;
lambda = 0.1;
all_theta = oneVsAll(X, y, num_of_classes, lambda);
fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('==== Running prediction ====\n');

p = predictOneVsAll(X, all_theta);

% Calculate the accuracy of our hypothesis
accuracy = mean(p == y) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);

% Check the predictions vs. the original labels
fprintf('P OVal\n\n');
for i = 1:size(X, 1)
  fprintf('%d %d\n', p(i), y(i));
end

fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('END\n');
```

We start by loading our dataset into the ‘data’ matrix and then splitting it into the matrices ‘X’ and ‘y’. We add an extra column of ones to X to balance the bias term in the ‘theta’ matrix. Once we are done with the initialization, let’s look at our cost function.
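As a quick illustration of the bias-column trick, here is a Python/NumPy sketch (not part of the Octave project; the toy matrix stands in for the real 150x4 X):

```python
import numpy as np

# Toy stand-in for the 150x4 feature matrix X (just 3 examples here).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])

# Prepend a column of ones so the bias term theta(1) has a feature to multiply.
X_bias = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_bias.shape)  # (3, 5)
```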

#### Cost Function (lrCostFunction.m)

The cost function uses the logistic regression model. The hypothesis for this model is h(x) = g(theta' * x), where g is the sigmoid (or logistic) function g(z) = 1 / (1 + e^(-z)). No matter what the input is, the sigmoid outputs a value between 0 and 1.

We have implemented this cost function in our lrCostFunction.m. Let’s see what it does.

```matlab
function [J, grad] = lrCostFunction(theta, X, y, lambda)
  m = size(X, 1);

  % h is our hypothesis function
  h = sigmoid(X * theta);

  % Unregularized cost, then add the regularization term
  % (excluding the bias term theta(1))
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  J_reg = (lambda / (2*m)) * (sum(theta.^2) - theta(1)^2);
  J = J + J_reg;

  % Regularized gradient; the bias gradient is computed
  % separately so it is not regularized
  grad = (1/m) .* (X' * (h - y)) + (lambda/m) .* theta;
  grad(1) = (1/m) * sum(h - y);
end
```

We take theta, X, y, and lambda as inputs to the function. We have to return ‘J’, the cost, and ‘grad’, the gradient. lambda is the regularization parameter that we use to avoid overfitting the dataset. We are not going to discuss regularization in depth here; if you wish to know more, look up regularized logistic regression.

We have used regularization in the cost function as well as in the gradient. If you remember the concept of regularization, we don’t regularize the bias term (theta(1)) in either the cost function or the gradient. Now, let’s take a deeper look at the code. We first calculate the hypothesis ‘h’ and use it to further calculate J, our cost. To exclude the bias term from the regularization, we subtract the square of the bias term from the sum of the squares of all theta values. Then we calculate the gradient (all theta values are updated simultaneously), and to exclude the bias term from regularization we overwrite grad(1) as shown above.
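The same computation can be sketched in Python/NumPy for comparison (my own translation, not part of the project; with theta set to zero the hypothesis is 0.5 everywhere, so the cost is log(2) ≈ 0.693):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_cost_function(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient (bias not regularized)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    # Unregularized cost, then add the penalty minus the bias term's share
    J = (1.0 / m) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h))
    J += (lam / (2 * m)) * (np.sum(theta ** 2) - theta[0] ** 2)
    # Regularized gradient; the bias gradient is recomputed without the penalty
    grad = (1.0 / m) * (X.T @ (h - y)) + (lam / m) * theta
    grad[0] = (1.0 / m) * np.sum(h - y)
    return J, grad
```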

#### Sigmoid function

It calculates the element-wise sigmoid of its argument (a scalar, vector, or matrix), which always results in values between 0 and 1.

```matlab
function [ret_z] = sigmoid(z)
  % Takes z (a scalar, vector, or matrix) and returns
  % its element-wise sigmoid.
  % Note: here we pass z = X * theta
  ret_z = 1 ./ (1 + exp(-z));
end
```
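A quick numeric check of the same function (a Python sketch to illustrate the 0–1 bound; not part of the project):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 exactly at z = 0
print(sigmoid(10))   # close to 1 for large positive z
print(sigmoid(-10))  # close to 0 for large negative z
```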

#### oneVsAll algorithm

This algorithm splits the given dataset (here, 3 classes) into 2 sets, one class versus all the others, and finds the parameters theta that best fit that binary split. It does this for each of the different classes (each species of flower), so for our training set we end up with 3 different hypothesis functions, which we store row-wise in the ‘all_theta’ matrix. With that said, let’s look at the oneVsAll code to see how it works.

```matlab
% The oneVsAll algorithm splits the dataset into
% 2 classes (one species vs. the rest) and finds the
% optimum theta values for each class
function [all_theta] = oneVsAll(X, y, num_of_classes, lambda)
  m = size(X, 1);
  n = size(X, 2);
  all_theta = zeros(num_of_classes, n);

  % For this example we have 3 classes
  for i = 1:num_of_classes
    % Set the parameters for our advanced optimization
    initial_theta = zeros(n, 1);
    options = optimset('GradObj', 'on', 'MaxIter', 50);

    % (y == i) splits the training set into 2 distinct classes;
    % fmincg then finds the optimum theta values to fit them
    costfunc = @(t) lrCostFunction(t, X, y == i, lambda);
    [theta] = fmincg(costfunc, initial_theta, options);

    all_theta(i, :) = theta;
  end
end
```

Now, let’s look at some details of this function. The arguments of oneVsAll are X (150×5), y (150×1), num_of_classes (3), and lambda. We create the ‘all_theta’ matrix to store the best-fit theta values that ‘fmincg’ returns. Inside the loop, we first set initial_theta (a zero vector), which acts as our starting point for minimizing J; you can think of it as coordinates set to zero, with the end goal of reaching the best-fit coordinates. Then we build the options object that we pass to the advanced optimization routine: setting ‘GradObj’ to ‘on’ tells it to use the gradient we return when minimizing the cost function, and ‘MaxIter’ sets the maximum number of iterations it may perform. Next, we pass our cost function, with arguments initial_theta (as ‘t’), X, (y == i), and lambda, into fmincg, which minimizes it. Minimizing the cost function returns an optimum theta (the best-fit values), which we store in our all_theta matrix.
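Since fmincg is Octave-specific, here is a hypothetical Python/NumPy sketch of the same one-vs-all loop, with plain gradient descent standing in for fmincg (the function name and the learning-rate/iteration settings are my own choices, not part of the project):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, num_classes, lam, iters=500, alpha=0.1):
    """Train one regularized logistic classifier per class (rows of all_theta)."""
    m, n = X.shape
    all_theta = np.zeros((num_classes, n))
    for c in range(1, num_classes + 1):
        theta = np.zeros(n)
        yc = (y == c).astype(float)  # current class vs. all the others
        for _ in range(iters):       # gradient descent instead of fmincg
            h = sigmoid(X @ theta)
            grad = (1.0 / m) * (X.T @ (h - yc)) + (lam / m) * theta
            grad[0] = (1.0 / m) * np.sum(h - yc)  # bias not regularized
            theta -= alpha * grad
        all_theta[c - 1] = theta
    return all_theta
```

Each pass through the outer loop relabels the data as “class c vs. everything else” via `(y == c)`, exactly as `(y == i)` does in the Octave code.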

#### predictOneVsAll.m

In this function we pass our all_theta and X matrices to predict the species of every training example. Let’s look at the code for the function and discuss some important concepts.

```matlab
% Use all the training examples and find their respective
% classes via the sigmoid hypothesis function.
% The class whose hypothesis returns the maximum value is
% our prediction, which we store in p.
% Note: p is a column vector
function [p] = predictOneVsAll(X, all_theta)
  m = size(X, 1);
  h = sigmoid(X * all_theta'); % m x num_of_classes matrix
  [max_val, max_ind] = max(h, [], 2);
  p = max_ind;
end
```

We define a prediction vector ‘p’ with one entry per training example to store the predicted class for each one. To do this we calculate the hypothesis matrix, which gives a hypothesis value for each of the classes (each of the 3 species) for every example. The class with the maximum value is taken as the prediction. The max function returns both the maximum value (max_val) and its index (max_ind); we need the index because the prediction must be one of the values 1, 2, or 3. Our ‘p’ vector thus contains 150 predictions, which we use to calculate the accuracy.
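The same prediction step in Python/NumPy (again a sketch of my own, not part of the project; the toy matrices below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(X, all_theta):
    """Pick the class whose classifier is most confident for each row of X."""
    h = sigmoid(X @ all_theta.T)     # one column of scores per class
    return np.argmax(h, axis=1) + 1  # +1 because our labels start at 1
```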

#### Accuracy

```matlab
p = predictOneVsAll(X, all_theta);
accuracy = mean(p == data(:, 5)) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);
```

The accuracy is the percentage of correct predictions, which comes out to around 97%. This means our algorithm classifies about 97% of the examples correctly.
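The accuracy formula itself is just the mean of an element-wise comparison; here is a Python sketch with hypothetical toy labels:

```python
import numpy as np

p = np.array([1, 2, 3, 3])  # hypothetical predictions
y = np.array([1, 2, 3, 1])  # hypothetical true labels

# (p == y) is a vector of 0/1 matches; its mean is the fraction correct.
accuracy = np.mean(p == y) * 100
print(accuracy)  # 75.0
```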

#### fmincg

This is an advanced optimization routine used to minimize a function given its value and gradient. You don’t really need to understand this code right now; it’s quite advanced for a beginner, so just focus on what it does.

```matlab
function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
  if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
    length = options.MaxIter;
  else
    length = 100;
  end

  RHO = 0.01;   % a bunch of constants for line searches
  SIG = 0.5;    % RHO and SIG are the constants in the Wolfe-Powell conditions
  INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
  EXT = 3.0;    % extrapolate maximum 3 times the current bracket
  MAX = 20;     % max 20 function evaluations per line search
  RATIO = 100;  % maximum allowed slope ratio

  argstr = ['feval(f, X']; % compose string used to call function
  for i = 1:(nargin - 3)
    argstr = [argstr, ',P', int2str(i)];
  end
  argstr = [argstr, ')'];

  if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
  S = ['Iteration '];
  i = 0;                    % zero the run length counter
  ls_failed = 0;            % no previous line search has failed
  fX = [];
  [f1 df1] = eval(argstr);  % get function value and gradient
  i = i + (length<0);       % count epochs?!
  s = -df1;                 % search direction is steepest
  d1 = -s'*s;               % this is the slope
  z1 = red/(1-d1);          % initial step is red/(|s|+1)

  while i < abs(length)           % while not finished
    i = i + (length>0);           % count iterations?!
    X0 = X; f0 = f1; df0 = df1;   % make a copy of current values
    X = X + z1*s;                 % begin line search
    [f2 df2] = eval(argstr);
    i = i + (length<0);           % count epochs?!
    d2 = df2'*s;
    f3 = f1; d3 = d1; z3 = -z1;   % initialize point 3 equal to point 1
    if length>0, M = MAX; else M = min(MAX, -length-i); end
    success = 0; limit = -1;      % initialize quantities
    while 1
      while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)
        limit = z1; % tighten the bracket
        if f2 > f1
          z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3); % quadratic fit
        else
          A = 6*(f2-f3)/z3+3*(d2+d3); % cubic fit
          B = 3*(f3-f2)-z3*(d3+2*d2);
          z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A; % numerical error possible - ok!
        end
        if isnan(z2) || isinf(z2)
          z2 = z3/2; % if we had a numerical problem then bisect
        end
        z2 = max(min(z2, INT*z3),(1-INT)*z3); % don't accept too close to limits
        z1 = z1 + z2; % update the step
        X = X + z2*s;
        [f2 df2] = eval(argstr);
        M = M - 1; i = i + (length<0); % count epochs?!
        d2 = df2'*s;
        z3 = z3-z2; % z3 is now relative to the location of z2
      end
      if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1
        break; % this is a failure
      elseif d2 > SIG*d1
        success = 1; break; % success
      elseif M == 0
        break; % failure
      end
      A = 6*(f2-f3)/z3+3*(d2+d3); % make cubic extrapolation
      B = 3*(f3-f2)-z3*(d3+2*d2);
      z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3)); % num. error possible - ok!
      if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0 % num prob or wrong sign?
        if limit < -0.5 % if we have no upper limit
          z2 = z1 * (EXT-1); % then extrapolate the maximum amount
        else
          z2 = (limit-z1)/2; % otherwise bisect
        end
      elseif (limit > -0.5) && (z2+z1 > limit) % extrapolation beyond max?
        z2 = (limit-z1)/2; % bisect
      elseif (limit < -0.5) && (z2+z1 > z1*EXT) % extrapolation beyond limit
        z2 = z1*(EXT-1.0); % set to extrapolation limit
      elseif z2 < -z3*INT
        z2 = -z3*INT;
      elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT)) % too close to limit?
        z2 = (limit-z1)*(1.0-INT);
      end
      f3 = f2; d3 = d2; z3 = -z2; % set point 3 equal to point 2
      z1 = z1 + z2; X = X + z2*s; % update current estimates
      [f2 df2] = eval(argstr);
      M = M - 1; i = i + (length<0); % count epochs?!
      d2 = df2'*s;
    end % end of line search
    if success % if line search succeeded
      f1 = f2; fX = [fX' f1]';
      fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
      s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2; % Polack-Ribiere direction
      tmp = df1; df1 = df2; df2 = tmp; % swap derivatives
      d2 = df1'*s;
      if d2 > 0   % new slope must be negative
        s = -df1; % otherwise use steepest direction
        d2 = -s'*s;
      end
      z1 = z1 * min(RATIO, d1/(d2-realmin)); % slope ratio but max RATIO
      d1 = d2;
      ls_failed = 0; % this line search did not fail
    else
      X = X0; f1 = f0; df1 = df0; % restore point from before failed line search
      if ls_failed || i > abs(length) % line search failed twice in a row
        break; % or we ran out of time, so we give up
      end
      tmp = df1; df1 = df2; df2 = tmp; % swap derivatives
      s = -df1; % try steepest
      d1 = -s'*s;
      z1 = 1/(1-d1);
      ls_failed = 1; % this line search failed
    end
    if exist('OCTAVE_VERSION')
      fflush(stdout);
    end
  end
  fprintf('\n');
end
```

This function returns the best-fit theta parameters into the ‘theta’ variable in oneVsAll.m, as used above. You can alternatively use the built-in ‘fminunc’ function, which does the same job as ‘fmincg’ but is typically used when there are fewer features in the training examples.

#### Conclusion

This project hardly takes any time to complete once you know the concepts behind the algorithms. I hope you enjoyed building it; if you did, feel free to click the ‘clap’ button.