Get started with Machine Learning by building a simple project

Photo by Franck V. on Unsplash

Machine learning is an application of artificial intelligence that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. The image above shows how a machine could learn to play a musical instrument given the right training. This article covers some of the basic concepts, functions, and algorithms to help beginners get started with machine learning. So let's start our journey by building a simple classification project.

What are we building?

We are building a classification project to classify three species of Iris flower, namely Iris setosa (1), Iris versicolor (2), and Iris virginica (3). You can check them out to get a grasp of the project:

We have been provided with a dataset of 150 training examples (i.e. 50 for each flower) with 4 features per flower. The features of the flowers are as follows:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm

We have to train our algorithm in such a way that it can classify any new example into one of the 3 species. You can have a look at the dataset provided in the Iris_flower.txt file. I hope you now have an idea of what we are going to build.

Some prerequisites before we start

You must have Octave or MATLAB installed to implement this project. I recommend using Octave because it's free and works well for this project. You can download Octave here:

Secondly, you must have a basic understanding of matrices, Octave commands, and a few algorithms. With that said, let's dive into the fun stuff... yeah, the code!

Files Included in the project

We are going to create a bunch of files (don't worry, it's quite easy) for the project. Let's list them out to keep everything structured:

1. Iris_main.m
2. lrCostFunction.m
3. oneVsAll.m
4. predictOneVsAll.m
5. sigmoid.m
6. Iris_flower.txt

Iris_flower.txt

Let’s begin by looking at the dataset given to us :

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

You can see there are 4 features (already mentioned above) for each training example, separated by commas, followed by the name of the respective species of the flower. The original file uses the flower names, but in practice we replace them with the numbers 1, 2, and 3 respectively.
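For example, the first row of the UCI file reads `5.1,3.5,1.4,0.2,Iris-setosa`; after replacing the name with its label, the corresponding line of Iris_flower.txt looks like this:

```
5.1,3.5,1.4,0.2,1
```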

Iris_main.m

This file is the skeleton of our project because it contains all the function calls. Let's look at the file and try to understand the full code from scratch.

```
% This is the iris flower detection project
% we are given to identify 3 types of flowers
% based on a training set
% The functions included are:
% lrCostFunction.m (logistic regression cost function)
% oneVsAll.m
% predictOneVsAll.m
% sigmoid.m
% fmincg.m

clear; close all; clc;

% loading the data from iris_flower.txt
fprintf('loading the data...\n');

% the whole dataset is stored in a matrix 'data'
data = load('iris_flower.txt'); % data is a 150x5 matrix

% store all the features of each training
% example in the matrix X, and the corresponding
% result in the vector y
X = data(:,1:4); % 150x4 matrix
y = data(:,5);   % 150x1 vector

fprintf('Program paused. Press enter to continue.\n');
pause;

% Note: We add a column of ones to X to balance
% the bias term in theta.
X = [ones(size(X,1),1) X];

fprintf('Running oneVsAll Algorithm...\n');

% number of classes signifies different types of flowers
num_of_classes = 3;
lambda = 0.1;
[all_theta] = oneVsAll(X, y, num_of_classes, lambda);

fprintf('Program paused. Press enter to continue.\n');
pause;

fprintf('==== Running prediction ====\n');
p = predictOneVsAll(X, all_theta);

% We calculate the accuracy of our hypothesis
accuracy = mean(p == y) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);

% check the predictions vs original results
fprintf('P OVal\n\n');
for i = 1:size(X,1)
    fprintf('%d %d\n', p(i), y(i));
end

fprintf('Program paused. Press enter to continue.\n');
pause;
fprintf('END\n');
```

We start by loading our dataset into the 'data' matrix and then split it into the matrix 'X' and the vector 'y'. We add an extra column of ones to X to balance the bias term in the 'theta' matrix. Once we are done with the initialization, let's look at our cost function.

Cost Function (lrCostFunction.m)

The cost function uses the logistic regression model. The hypothesis for this model is h(x) = g(theta' * x), where g(z) = 1 / (1 + e^(-z)) is the sigmoid (or logistic) function. No matter what the input is, the sigmoid outputs a value between 0 and 1. We have implemented this cost function in our lrCostFunction.m. Let's see what it does.

```
function [J, grad] = lrCostFunction (theta, X, y, lambda)

m = size(X, 1);
n = size(X, 2);
J = 0;
grad = zeros(size(theta));

% here h is our hypothesis function
h = sigmoid(X*theta);

J = (1/m)*(-y'*log(h) - (1 - y)'*log(1 - h));
J_reg = (lambda/(2*m))*(sum(theta.^2) - theta(1)^2);
J = J + J_reg;

grad = (1/m).*(X'*(h - y)) + (lambda/m).*theta;
grad(1) = (1/m)*(sum(h - y));

end
```

We take theta, X, y, and lambda as inputs to the function, and we have to return 'J', the cost, and 'grad', the gradient. Lambda is the regularization parameter that we use to avoid overfitting the dataset. We are not going to discuss regularization here; if you wish to know more about it, follow the link given below:

We have used regularization in both the cost function and the gradient. If you remember the concept of regularization, we don't regularize the bias term (theta(1)), either in the cost function or in the gradient. Now let's take a deeper look at the code. We first calculate the hypothesis 'h' and use it to further calculate J, our cost. To exclude the bias term from regularization, we subtract the square of the bias term from the sum of squares of all theta values. Then we calculate the gradient (all theta values are updated simultaneously), and to exclude the bias term from regularization we overwrite the value of grad(1) as shown above.
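If you want to sanity-check this math outside Octave, here is an equivalent NumPy sketch of the same cost and gradient (illustrative only; the function and variable names are my own, not part of the project files):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def lr_cost_function(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient;
    the bias term theta[0] is left unregularized."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    J = (1/m) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h))
    J += (lam/(2*m)) * (np.sum(theta**2) - theta[0]**2)
    grad = (1/m) * (X.T @ (h - y)) + (lam/m) * theta
    grad[0] = (1/m) * np.sum(h - y)  # bias column of X is all ones
    return J, grad
```

With theta initialized to zeros, h is 0.5 for every example, so the cost starts at log(2) regardless of the data, which is a handy check that the implementation is wired up correctly.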

Sigmoid function

It calculates the sigmoid of the argument (a matrix or vector), which results in values between 0 and 1.

```
function [ret_z] = sigmoid (z)
% takes z as an argument and returns its sigmoid
ret_z = 1./(1 + exp(-z));
% Note: Here, we have passed z = X*theta
end
```
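As a quick numerical check of the claim that the sigmoid always lands between 0 and 1, here is a throwaway Python sketch (not part of the project files):

```python
import math

def sigmoid(z):
    # logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # exactly 0.5: the decision boundary
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```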

oneVsAll algorithm

This algorithm repeatedly divides the dataset (here, 3 classes) into 2 sets: the examples of one class versus the examples of all the other classes. For each split it finds the best fit, or more precisely the parameters theta of a hypothesis function that separates the two sets. It does this for every class (every species of flower), so for our training set we end up with 3 different hypothesis functions, which we store in the 'all_theta' matrix. With that said, let's look at the oneVsAll code to see how it works.

```
% oneVsAll algorithm divides the dataset into
% 2 classes and finds optimum theta
% values for the respective classes

function [all_theta] = oneVsAll (X, y, num_of_classes, lambda)

m = size(X, 1);
n = size(X, 2);
all_theta = zeros(num_of_classes, n);

% for this example we have 3 classes
for i = 1:num_of_classes
    % Set the parameters for our advanced optimization
    initial_theta = zeros(n, 1);
    options = optimset('GradObj', 'on', 'MaxIter', 50);

    % (y == i) divides the training set into 2 distinct classes
    % and fmincg then tries to find optimum theta
    % values to fit the dataset
    costfunc = @(t)(lrCostFunction(t, X, y == i, lambda));
    [theta] = fmincg(costfunc, initial_theta, options);

    all_theta(i,:) = theta;
end

end
```

Now, let's look at some of the details of this function. The arguments of oneVsAll are X (150×5, after adding the bias column), y (150×1), num_of_classes (3), and lambda. The 'all_theta' matrix stores the best-fit theta values that the function 'fmincg' returns. Inside the loop, we first have initial_theta, our starting point for minimizing J (taken as a zero vector). You can think of it as coordinates set to zero, with the end goal of reaching the best-fit coordinates. Then we have the options object that we pass into our advanced optimization as an argument: it sets 'GradObj' to 'on', which means we want the gradient to be used while minimizing the cost function, and 'MaxIter' to the maximum number of iterations we want to perform. Next, we pass our cost function, with arguments initial_theta (as 't'), X, (y == i), and lambda, into fmincg, which minimizes it. Minimizing the cost function returns an optimum theta (the best-fit values), which we store in row i of the all_theta matrix.
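The (y == i) trick is the heart of one-vs-all: it turns the multi-class labels into a binary yes/no vector for class i. Here is the same idea in NumPy (a standalone illustration with made-up labels):

```python
import numpy as np

# multi-class labels, as in our y vector (values 1..3)
y = np.array([1, 1, 2, 3, 2, 3])

# (y == 2) relabels the data as "class 2" vs "everything else",
# exactly what (y == i) does inside the Octave loop
binary_labels = (y == 2).astype(int)
print(binary_labels)  # -> [0 0 1 0 1 0]
```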

predictOneVsAll.m

In this function, we pass our all_theta and X matrices to predict the species for all the training examples. Let's look at the code for the function and discuss some important concepts.

```
% here we are going to use all the training examples
% and find their respective classes by using
% the hypothesis function sigmoid
% The class for which the hypothesis returns the max
% value is our predicted value, which we store in p
% Note: p is a column vector

function [p] = predictOneVsAll (X, all_theta)

m = size(X, 1);
p = zeros(m, 1);

h = sigmoid(X*all_theta');
[max_val, max_ind] = max(h, [], 2);
p = max_ind;

end
```

We define a prediction vector 'p' with one entry for each row of X to store the predicted value for every training example. To perform this task we calculate the hypothesis matrix, which contains one hypothesis value per class (per species) for each example. The class with the maximum hypothesis value is taken as the prediction. The max function returns both the maximum value (max_val) and its index (max_ind); we need the index because the prediction must be one of the labels 1, 2, or 3. Our 'p' vector thus contains 150 predictions, which we use to calculate the accuracy.
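The same pick-the-largest-hypothesis step looks like this in NumPy (a sketch with my own names; the project's Octave version is above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(X, all_theta):
    # one hypothesis value per class for every example,
    # then pick the class whose hypothesis is largest
    h = sigmoid(X @ all_theta.T)
    return np.argmax(h, axis=1) + 1  # +1 because classes are labelled 1..K

# toy example: one training row, three classes
X = np.array([[1., 2.]])
all_theta = np.array([[0., 1.],
                      [0., -1.],
                      [1., 0.]])
print(predict_one_vs_all(X, all_theta))  # -> [1]
```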

Accuracy

```
p = predictOneVsAll(X, all_theta);
accuracy = mean(p == data(:,5)) * 100;
fprintf('Accuracy of the algorithm = %f\n', accuracy);
```

The accuracy variable contains the percentage of correct predictions, which comes out to around 97%. This means our algorithm classifies about 97% of the examples correctly.

fmincg

This is an advanced optimization routine that minimizes a function using the supplied cost and gradient. You don't really need to understand this code right now, and it's quite advanced for a beginner, so just focus on what it does.

```
function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)

if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
  length = options.MaxIter;
else
  length = 100;
end

RHO = 0.01;    % a bunch of constants for line searches
SIG = 0.5;     % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1;     % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;     % extrapolate maximum 3 times the current bracket
MAX = 20;      % max 20 function evaluations per line search
RATIO = 100;   % maximum allowed slope ratio

argstr = ['feval(f, X'];   % compose string used to call function
for i = 1:(nargin - 3)
  argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];

if max(size(length)) == 2, red = length(2); length = length(1); else red = 1; end
S = ['Iteration '];

i = 0;                        % zero the run length counter
ls_failed = 0;                % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr);      % get function value and gradient
i = i + (length<0);           % count epochs?!
s = -df1;                     % search direction is steepest
d1 = -s'*s;                   % this is the slope
z1 = red/(1-d1);              % initial step is red/(|s|+1)

while i < abs(length)         % while not finished
  i = i + (length>0);         % count iterations?!

  X0 = X; f0 = f1; df0 = df1; % make a copy of current values
  X = X + z1*s;               % begin line search
  [f2 df2] = eval(argstr);
  i = i + (length<0);         % count epochs?!
  d2 = df2'*s;
  f3 = f1; d3 = d1; z3 = -z1; % initialize point 3 equal to point 1
  if length>0, M = MAX; else M = min(MAX, -length-i); end
  success = 0; limit = -1;    % initialize quantities
  while 1
    while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)
      limit = z1;             % tighten the bracket
      if f2 > f1
        z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3);           % quadratic fit
      else
        A = 6*(f2-f3)/z3+3*(d2+d3);                       % cubic fit
        B = 3*(f3-f2)-z3*(d3+2*d2);
        z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A;                  % numerical error possible - ok!
      end
      if isnan(z2) || isinf(z2)
        z2 = z3/2;            % if we had a numerical problem then bisect
      end
      z2 = max(min(z2, INT*z3),(1-INT)*z3);               % don't accept too close to limits
      z1 = z1 + z2;           % update the step
      X = X + z2*s;
      [f2 df2] = eval(argstr);
      M = M - 1; i = i + (length<0);                      % count epochs?!
      d2 = df2'*s;
      z3 = z3-z2;             % z3 is now relative to the location of z2
    end
    if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1
      break;                  % this is a failure
    elseif d2 > SIG*d1
      success = 1; break;     % success
    elseif M == 0
      break;                  % failure
    end
    A = 6*(f2-f3)/z3+3*(d2+d3);                           % make cubic extrapolation
    B = 3*(f3-f2)-z3*(d3+2*d2);
    z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3));              % num. error possible - ok!
    if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0    % num prob or wrong sign?
      if limit < -0.5         % if we have no upper limit
        z2 = z1 * (EXT-1);    % then extrapolate the maximum amount
      else
        z2 = (limit-z1)/2;    % otherwise bisect
      end
    elseif (limit > -0.5) && (z2+z1 > limit)              % extrapolation beyond max?
      z2 = (limit-z1)/2;      % bisect
    elseif (limit < -0.5) && (z2+z1 > z1*EXT)             % extrapolation beyond limit
      z2 = z1*(EXT-1.0);      % set to extrapolation limit
    elseif z2 < -z3*INT
      z2 = -z3*INT;
    elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT))  % too close to limit?
      z2 = (limit-z1)*(1.0-INT);
    end
    f3 = f2; d3 = d2; z3 = -z2;                           % set point 3 equal to point 2
    z1 = z1 + z2; X = X + z2*s;                           % update current estimates
    [f2 df2] = eval(argstr);
    M = M - 1; i = i + (length<0);                        % count epochs?!
    d2 = df2'*s;
  end                         % end of line search

  if success                  % if line search succeeded
    f1 = f2; fX = [fX' f1]';
    fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
    s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2;           % Polack-Ribiere direction
    tmp = df1; df1 = df2; df2 = tmp;                      % swap derivatives
    d2 = df1'*s;
    if d2 > 0                 % new slope must be negative
      s = -df1;               % otherwise use steepest direction
      d2 = -s'*s;
    end
    z1 = z1 * min(RATIO, d1/(d2-realmin));                % slope ratio but max RATIO
    d1 = d2;
    ls_failed = 0;            % this line search did not fail
  else
    X = X0; f1 = f0; df1 = df0;       % restore point from before failed line search
    if ls_failed || i > abs(length)   % line search failed twice in a row
      break;                  % or we ran out of time, so we give up
    end
    tmp = df1; df1 = df2; df2 = tmp;  % swap derivatives
    s = -df1;                 % try steepest
    d1 = -s'*s;
    z1 = 1/(1-d1);
    ls_failed = 1;            % this line search failed
  end
  if exist('OCTAVE_VERSION')
    fflush(stdout);
  end
end
fprintf('\n');
```

This function returns the best-fit theta parameters, which we store in the [theta] vector in the 'oneVsAll.m' function used above. You can alternatively use the 'fminunc' function, which does the same job as 'fmincg' but is generally used when the training examples have fewer features.

Conclusion

This project hardly takes any time to complete once you know the concepts behind the algorithms. I hope you enjoyed building it; if you did, feel free to click the 'clap' button.