Photo by Franck V. on Unsplash

**Machine learning** is an application of **artificial intelligence** that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. The image above shows how a machine can learn to play a musical instrument given the right training. This article covers some of the basic concepts, functions, and algorithms to help beginners get started with machine learning. So, let’s start our journey by building a simple classification project.

**What are we building**

We are building a classification project to classify three species of Iris flowers, namely Iris setosa (1), Iris versicolor (2), and Iris virginica (3). You can look the species up to get a better grasp of the project.

We have been provided with a dataset of 150 training examples (i.e., 50 for each species), with 4 features per flower. The features are as follows:

- Sepal length in cm
- Sepal width in cm
- Petal length in cm
- Petal width in cm

We have to train our algorithm in such a way that it can classify any new example into one of the 3 species. You can have a look at the dataset provided in the Iris_flower.txt file. I hope you now have an idea of what we are going to build.

**Some prerequisites before we start**

You need Octave or MATLAB to implement this project. I recommend Octave because it’s free and well suited to this project. You can download it from the GNU Octave website.

Secondly, you need a basic understanding of matrices, Octave commands, and a few algorithms. With that said, let’s dive into the fun stuff… yeah, the code!

**Files Included in the project**

We are going to create a handful of files (don’t worry, it’s quite easy) for the project. Let’s list them out to keep all of this structured:

- Iris_main.m
- lrCostFunction.m
- oneVsAll.m
- predictOneVsAll.m
- sigmoid.m
- Iris_flower.txt

**Iris_flower.txt**

Let’s begin by looking at the dataset given to us :

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

You can see there are 4 features (already mentioned above) for each training example, separated by commas, followed by the name of the respective species. The dataset uses the names of the flowers, but in practice we are going to replace them with the numbers 1, 2, and 3 respectively.
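Since the raw rows end with a species name, you need to convert them to numeric labels before loading the file in Octave. Here is a minimal, hypothetical preprocessing sketch in Python (the names `SPECIES` and `convert_row` are ours, not part of the project files):

```python
# Map each species name from the UCI iris file to the numeric label 1, 2, or 3.
SPECIES = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

def convert_row(line):
    """Turn '5.1,3.5,1.4,0.2,Iris-setosa' into [5.1, 3.5, 1.4, 0.2, 1]."""
    *features, name = line.strip().split(",")
    return [float(f) for f in features] + [SPECIES[name]]

print(convert_row("5.1,3.5,1.4,0.2,Iris-setosa"))  # [5.1, 3.5, 1.4, 0.2, 1]
```

Running this over every line of the raw file and writing the results out gives you the numeric iris_flower.txt used below.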

**Iris_main.m**

This file is like the skeleton of our project because it contains all the function calls. Let’s look at the file and try to understand the full code from scratch.

```matlab
% This is the iris flower detection project.
% We are given a training set and must identify 3 types of flowers.
% The functions included are:
%   lrCostFunction.m (logistic regression cost function)
%   oneVsAll.m
%   predictOneVsAll.m
%   sigmoid.m
%   fmincg.m

clear; close all; clc;

% Load the data from iris_flower.txt.
fprintf('loading the data...\n');

% The whole dataset is stored in a matrix 'data'.
data = load('iris_flower.txt'); % data is a 150x5 matrix

% Store all the features of each training example in the
% matrix X, and the corresponding result in the vector y.
X = data(:, 1:4); % 150x4 matrix
y = data(:, 5);   % 150x1 vector

fprintf('Program paused. Press enter to continue.\n');
pause;

% Note: we add a column of ones to X to balance
% the bias term in theta.
X = [ones(size(X, 1), 1) X];

fprintf('Running oneVsAll algorithm...\n');

% The number of classes is the number of flower species.
num_of_classes = 3;
lambda = 0.1;
[all_theta] = oneVsAll(X, y, num_of_classes, lambda);

fprintf('Program paused. Press enter to continue.\n');
pause;

fprintf('==== Running prediction ====\n');
p = predictOneVsAll(X, all_theta);

% Calculate the accuracy of our hypothesis.
accuracy = mean(p == y) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);

% Check the predictions vs the original results.
fprintf('P OVal\n\n');
for i = 1:size(X, 1)
  fprintf('%d %d\n', p(i), y(i));
end

fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('END\n');
```

We start by loading our dataset into the ‘data’ matrix and then split it into the matrices ‘X’ and ‘y’. We add an extra column of ones to X to account for the bias term in the ‘theta’ matrix. Once we are done with the initialization, let’s look at our cost function.
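For readers who prefer NumPy, the same split-and-add-bias steps look like this (the tiny inline `data` array is a stand-in for the real 150×5 matrix):

```python
import numpy as np

# A miniature stand-in for the 150x5 matrix loaded from iris_flower.txt.
data = np.array([[5.1, 3.5, 1.4, 0.2, 1],
                 [7.0, 3.2, 4.7, 1.4, 2],
                 [6.3, 3.3, 6.0, 2.5, 3]])

X = data[:, :4]  # features, one row per training example
y = data[:, 4]   # labels 1, 2, 3

# Prepend a column of ones so theta(1) acts as the bias term.
X = np.hstack([np.ones((X.shape[0], 1)), X])
print(X.shape)  # (3, 5)
```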

#### Cost Function(lrCostFunction.m)

The cost function uses the logistic regression model, whose hypothesis is h(x) = g(theta' * x), where g is the sigmoid (or logistic) function g(z) = 1 / (1 + e^(-z)).

No matter what the input is, the sigmoid function outputs a value between 0 and 1.
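A quick NumPy check of that squashing behaviour (illustrative only; the project’s own implementation is the Octave sigmoid.m shown later):

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function: maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and g(0) is exactly 0.5.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```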

We have implemented this cost function in our lrCostFunction.m. Let’s see what it does.

```matlab
function [J, grad] = lrCostFunction(theta, X, y, lambda)

m = size(X, 1);
n = size(X, 2);

J = 0;
grad = zeros(size(theta));

% h is our hypothesis function
h = sigmoid(X * theta);

J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));

% Regularization term; the bias term theta(1) is excluded.
J_reg = (lambda/(2*m)) * (sum(theta.^2) - theta(1)^2);
J = J + J_reg;

grad = (1/m) .* (X' * (h - y)) + (lambda/m) .* theta;
grad(1) = (1/m) * sum(h - y); % no regularization on the bias term

end
```

The function takes theta, X, y, and lambda as input and returns ‘J’, the cost, and ‘grad’, the gradient. Lambda is the regularization parameter that we use to avoid overfitting the dataset. We are not going to discuss regularization in depth here; there are plenty of resources available if you wish to learn more.

We have used regularization in the cost function as well as in the gradient. If you remember the concept of regularization, we don’t regularize the bias term (theta(1)) in either the cost function or the gradient. Now let’s take a deeper look at the code. We first calculate the hypothesis ‘h’ and use it to further calculate J, our cost. To exclude the bias term from the regularization, we subtract the square of the bias term from the sum of the squares of all theta values. Then we calculate the gradient (all theta values are computed simultaneously), and to exclude the bias term from regularization we overwrite grad(1) as shown above.
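As a sanity check, here is a NumPy translation of the same regularized cost and gradient (the name `lr_cost_function` is ours). With theta set to zero the hypothesis is 0.5 everywhere, so on any dataset the unregularized cost should be -log(0.5) ≈ 0.6931:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_cost_function(theta, X, y, lam):
    m = X.shape[0]
    h = sigmoid(X @ theta)
    # Cross-entropy cost, then the regularization term without theta[0].
    J = (1/m) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h))
    J += (lam / (2*m)) * np.sum(theta[1:] ** 2)
    grad = (1/m) * (X.T @ (h - y)) + (lam/m) * theta
    grad[0] = (1/m) * np.sum(h - y)  # no regularization on the bias term
    return J, grad

X = np.array([[1.0, 2.0], [1.0, -1.0]])  # first column is the bias column
y = np.array([1.0, 0.0])
J, grad = lr_cost_function(np.zeros(2), X, y, 0.1)
print(round(J, 4))  # 0.6931
```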

#### Sigmoid function

It calculates the sigmoid of its argument (a scalar, vector, or matrix), which yields values between 0 and 1.

```matlab
function [ret_z] = sigmoid(z)

% Takes z as an argument and returns its sigmoid.
% Note: here we pass z = X * theta.
ret_z = 1 ./ (1 + exp(-z));

end
```

#### oneVsAll algorithm

This algorithm splits the given data (here 3 classes) into 2 sets: the examples of one class versus all the others. It then finds the best fit, i.e., the parameters theta of a hypothesis function that separates those 2 sets. It does this once for each class (each species of flower), so for our training set we end up with 3 different hypothesis functions, whose parameters we store in the ‘all_theta’ matrix. With that said, let’s look at the oneVsAll code to see how it works.

```matlab
% The oneVsAll algorithm repeatedly splits the dataset into 2 classes
% and finds the optimum theta values for each class.
function [all_theta] = oneVsAll(X, y, num_of_classes, lambda)

m = size(X, 1);
n = size(X, 2);

% For this example we have 3 classes.
all_theta = zeros(num_of_classes, n);

for i = 1:num_of_classes
  % Set the parameters for our advanced optimization.
  initial_theta = zeros(n, 1);
  options = optimset('GradObj', 'on', 'MaxIter', 50);

  % (y == i) splits the training set into 2 distinct classes;
  % fmincg then finds the optimum theta values to fit them.
  costfunc = @(t)(lrCostFunction(t, X, y == i, lambda));
  [theta] = fmincg(costfunc, initial_theta, options);

  all_theta(i, :) = theta;
end

end
```

Now, let’s look at some details of this function. The arguments of oneVsAll are X (150×5), y (150×1), num_of_classes (3), and lambda. The ‘all_theta’ matrix stores the best-fit theta values that the function ‘fmincg’ returns. Inside the loop, we first set initial_theta (a zero vector), which acts as our starting point for minimizing J; you can think of it as coordinates set to zero, with the end goal of reaching the best-fit coordinates. Then we build the options object that we pass to our advanced optimization routine: it sets ‘GradObj’ to ‘on’, meaning our cost function also supplies a gradient, and ‘MaxIter’ to the number of iterations we want to perform. Next, we pass our cost function, with arguments initial_theta (as ‘t’), X, (y == i), and lambda, into fmincg, which minimizes it. Minimizing the cost function returns the optimum (best-fit) theta values, which we store into the corresponding row of all_theta.
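To make the structure concrete, here is a rough NumPy sketch of the one-vs-all loop. We substitute plain gradient descent for fmincg (a deliberate simplification; `alpha` and `iters` are hypothetical hyperparameters), but the per-class `(y == i)` relabelling is the same idea:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, num_classes, lam, alpha=0.1, iters=500):
    m, n = X.shape
    all_theta = np.zeros((num_classes, n))
    for i in range(1, num_classes + 1):
        theta = np.zeros(n)
        yi = (y == i).astype(float)  # 1 for class i, 0 for everything else
        for _ in range(iters):       # gradient descent instead of fmincg
            h = sigmoid(X @ theta)
            grad = (1/m) * (X.T @ (h - yi)) + (lam/m) * theta
            grad[0] = (1/m) * np.sum(h - yi)  # unregularized bias term
            theta -= alpha * grad
        all_theta[i - 1] = theta     # one row of parameters per class
    return all_theta
```

On a tiny separable dataset this loop recovers one binary classifier per class, exactly as oneVsAll.m does with fmincg.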

#### predictOneVsAll.m

In this function we pass our all_theta and X matrices to predict the species of every training example. Let’s look at the code for the function and discuss some important concepts.

```matlab
% Here we use all the training examples and find their respective
% classes using the sigmoid hypothesis function. The class for which
% the hypothesis returns the maximum value is our predicted value,
% which we store in p.
% Note: p is a column vector.
function [p] = predictOneVsAll(X, all_theta)

m = size(X, 1);
p = zeros(m, 1);

h = sigmoid(X * all_theta');
[max_val, max_ind] = max(h, [], 2);
p = max_ind;

end
```

We define a prediction vector ‘p’ with one entry per training example to store the predicted values. To perform this task we compute the hypothesis matrix, which gives one hypothesis value per class (per species) for each example. The class with the maximum predicted value is taken as the prediction. The max function returns the maximum value (max_val) as well as its index (max_ind); we need the index because the prediction must be one of the values 1, 2, or 3. Our ‘p’ vector thus contains 150 predictions, which we use to calculate the accuracy.
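The same argmax logic in NumPy (the toy `all_theta` and `X` values are chosen by us just to show the index-to-label step):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(X, all_theta):
    h = sigmoid(X @ all_theta.T)     # m x num_classes matrix of scores
    return np.argmax(h, axis=1) + 1  # indices are 0-based, labels are 1-based

# Toy check: row 0 scores highest for class 1, row 1 for class 3.
all_theta = np.array([[2.0, 0.0], [0.0, 0.0], [-2.0, 1.0]])
X = np.array([[1.0, 0.0], [1.0, 5.0]])
print(predict_one_vs_all(X, all_theta))  # [1 3]
```

Because the sigmoid is monotonic, taking the argmax of the sigmoid scores gives the same answer as taking the argmax of the raw X·thetaᵀ values.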

#### Accuracy

```matlab
p = predictOneVsAll(X, all_theta);
accuracy = mean(p == data(:,5)) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);
```

The accuracy is the percentage of correct predictions, which comes out to around 97%. This means our algorithm classifies about 97% of the training examples correctly.

#### fmincg

This is an advanced optimization routine used to minimize a function given its value and gradient. You don’t really need to understand this code right now; it’s quite advanced for a beginner, so just focus on what it does.

```matlab
function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)

if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
  length = options.MaxIter;
else
  length = 100;
end

RHO = 0.01;   % a bunch of constants for line searches
SIG = 0.5;    % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;    % extrapolate maximum 3 times the current bracket
MAX = 20;     % max 20 function evaluations per line search
RATIO = 100;  % maximum allowed slope ratio

argstr = ['feval(f, X'];  % compose string used to call function
for i = 1:(nargin - 3)
  argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];

if max(size(length)) == 2, red = length(2); length = length(1); else red = 1; end
S = ['Iteration '];

i = 0;                      % zero the run length counter
ls_failed = 0;              % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr);    % get function value and gradient
i = i + (length<0);         % count epochs?!
s = -df1;                   % search direction is steepest
d1 = -s'*s;                 % this is the slope
z1 = red/(1-d1);            % initial step is red/(|s|+1)

while i < abs(length)                 % while not finished
  i = i + (length>0);                 % count iterations?!
  X0 = X; f0 = f1; df0 = df1;         % make a copy of current values
  X = X + z1*s;                       % begin line search
  [f2 df2] = eval(argstr);
  i = i + (length<0);                 % count epochs?!
  d2 = df2'*s;
  f3 = f1; d3 = d1; z3 = -z1;         % initialize point 3 equal to point 1
  if length>0, M = MAX; else M = min(MAX, -length-i); end
  success = 0; limit = -1;            % initialize quantities
  while 1
    while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)
      limit = z1;                     % tighten the bracket
      if f2 > f1
        z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3);   % quadratic fit
      else
        A = 6*(f2-f3)/z3+3*(d2+d3);               % cubic fit
        B = 3*(f3-f2)-z3*(d3+2*d2);
        z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A;          % numerical error possible - ok!
      end
      if isnan(z2) || isinf(z2)
        z2 = z3/2;                    % if we had a numerical problem then bisect
      end
      z2 = max(min(z2, INT*z3), (1-INT)*z3);  % don't accept too close to limits
      z1 = z1 + z2;                   % update the step
      X = X + z2*s;
      [f2 df2] = eval(argstr);
      M = M - 1; i = i + (length<0);  % count epochs?!
      d2 = df2'*s;
      z3 = z3 - z2;                   % z3 is now relative to the location of z2
    end
    if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1
      break;                          % this is a failure
    elseif d2 > SIG*d1
      success = 1; break;             % success
    elseif M == 0
      break;                          % failure
    end
    A = 6*(f2-f3)/z3+3*(d2+d3);       % make cubic extrapolation
    B = 3*(f3-f2)-z3*(d3+2*d2);
    z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3));  % num. error possible - ok!
    if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0  % num prob or wrong sign?
      if limit < -0.5                 % if we have no upper limit
        z2 = z1 * (EXT-1);            % then extrapolate the maximum amount
      else
        z2 = (limit-z1)/2;            % otherwise bisect
      end
    elseif (limit > -0.5) && (z2+z1 > limit)    % extrapolation beyond max?
      z2 = (limit-z1)/2;              % bisect
    elseif (limit < -0.5) && (z2+z1 > z1*EXT)   % extrapolation beyond limit
      z2 = z1*(EXT-1.0);              % set to extrapolation limit
    elseif z2 < -z3*INT
      z2 = -z3*INT;
    elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT))  % too close to limit?
      z2 = (limit-z1)*(1.0-INT);
    end
    f3 = f2; d3 = d2; z3 = -z2;       % set point 3 equal to point 2
    z1 = z1 + z2; X = X + z2*s;       % update current estimates
    [f2 df2] = eval(argstr);
    M = M - 1; i = i + (length<0);    % count epochs?!
    d2 = df2'*s;
  end                                 % end of line search

  if success                          % if line search succeeded
    f1 = f2; fX = [fX' f1]';
    fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
    s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2;  % Polack-Ribiere direction
    tmp = df1; df1 = df2; df2 = tmp;  % swap derivatives
    d2 = df1'*s;
    if d2 > 0                         % new slope must be negative
      s = -df1;                       % otherwise use steepest direction
      d2 = -s'*s;
    end
    z1 = z1 * min(RATIO, d1/(d2-realmin));  % slope ratio but max RATIO
    d1 = d2;
    ls_failed = 0;                    % this line search did not fail
  else
    X = X0; f1 = f0; df1 = df0;       % restore point from before failed line search
    if ls_failed || i > abs(length)   % line search failed twice in a row
      break;                          % or we ran out of time, so we give up
    end
    tmp = df1; df1 = df2; df2 = tmp;  % swap derivatives
    s = -df1;                         % try steepest
    d1 = -s'*s;
    z1 = 1/(1-d1);
    ls_failed = 1;                    % this line search failed
  end
  if exist('OCTAVE_VERSION')
    fflush(stdout);
  end
end
fprintf('\n');
```


This function returns the best-fit theta parameters into the [theta] matrix in the ‘oneVsAll.m’ function used above. You can alternatively use the built-in ‘fminunc’ function, which does the same job as ‘fmincg’ but is typically used when the training examples have fewer features.

#### Conclusion

This project hardly takes any time to complete once you know the concepts behind the algorithms. I hope you enjoyed building it; if you did, feel free to click the ‘clap’ button.

read original article at https://medium.com/@kumar.bits009/get-started-with-machine-learning-by-building-a-simple-project-82c9c99a01aa?source=rss——artificial_intelligence-5