Machine Learning Webinar for Beginners

Webinar Contents

  • Intro to Machine Learning
    • What is Machine Learning / Machine Intelligence?
    • A few interesting applications of Machine Learning
    • Supervised, Unsupervised and Reinforcement Learning

  • Hands on Session
    • Some Python Basics
    • Working with Numpy & Pandas
    • Steps involved in Machine Learning
    • Building your first Machine Learning Algorithm
    • K-Nearest Neighbours
    • Working with Datasets (MNIST Dataset) - Handwritten Digit Recognition
    • LIVE Project

Machine Intelligence

Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. In most cases, this means that an algorithm is given a set of data and infers information about the properties of that data, and that information allows it to make predictions about other data it might see in the future. In simple terms, it gives predictive power to computers!

Machine Learning vs Artificial Intelligence

  • An Artificial Intelligence (AI) system is one that interacts with its surroundings
  • AI systems have sensors to collect data from their surroundings
  • Machine Learning can be considered the “brain of AI”, which processes the input data
  • The ML algorithm frames an appropriate answer, which is sent back to the surroundings

Example of Self-Driving Car

  • Input data - A set of images captured by the sensor

  • Processing by Machine Learning Algorithm - A model trained on images processes them and looks for any obstacles

  • Output - A ‘float’ value giving the required acceleration of the car

Car Image

Why the Hype?

  • Every minute up to 300 hours of video are uploaded to YouTube.

  • Facebook users send an average of 31.25 million messages and view 2.77 million videos every minute.
  • More data has been created in the past two years than in the entire previous history of the human race.
  • At the moment, less than 0.5% of all data is ever analyzed and used. Just imagine the potential here.

Machine Learning In Industry

Google Page Ranking

Google - Natural Language Search Queries

Netflix Suggestions.

Tinder, for you to “chill”

Tesla Self Driving Cars

Political Campaigns (Sentiments of People)

Spam Filtering

Google AdSense (ads based upon your history)

Bioinformatics (predicting cancers, IBM Watson)

Apple Siri - Speech Recognition and Talking

Chatbots like Tay, Ruuh

Machine Learning at Home

Google “Allo”

Snapchat Filters

Google Home and Amazon “Alexa”

Facebook Photo Tagging


Recommendations on Amazon, Flipkart

and many more…

Different Machine Learning Approaches

Supervised Learning

  • The algorithm is given a set of labeled data called training data
  • Predictions are made on a set of unlabeled data called testing data
  • Examples - spam filtering in emails, obstacle detection in images, classifying fruits

  • The algorithm trains a model on measurable properties of the data, called features (here, color and sweetness).
Color(X) Sweetness(Y) Label
0.80 0.90 Apple
0.80 0.84 Apple
0.10 0.27 Lemon
0.30 0.47 Lemon
0.83 0.83 Apple
0.60 0.97 Apple
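
To make the idea of learning from labeled data concrete, here is a minimal sketch that uses the table above as training data. It is not the method used later in this webinar (that is K-Nearest Neighbours); it uses a simpler nearest-centroid rule, where each class is summarised by the mean of its features. The function names `fit_centroids` and `predict` and the query fruit are made up for this illustration.

```python
import numpy as np

# Training data from the table above: [color, sweetness] per fruit.
X_train = np.array([
    [0.80, 0.90], [0.80, 0.84], [0.10, 0.27],
    [0.30, 0.47], [0.83, 0.83], [0.60, 0.97],
])
y_train = np.array(["Apple", "Apple", "Lemon", "Lemon", "Apple", "Apple"])

def fit_centroids(X, y):
    """Return a dict mapping each label to the mean feature vector of its class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

centroids = fit_centroids(X_train, y_train)

# Hypothetical unlabeled fruit: pale and sour.
prediction = predict(centroids, np.array([0.20, 0.30]))
print(prediction)
```

The "training" step here is just computing two averages, which is enough to show the split between fitting on labeled data and predicting on unseen data.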

Unsupervised Learning

  • The algorithm doesn’t get a set of labeled data
  • The algorithm automatically extracts hidden patterns from the data.
  • Mostly used to group data into sets; similar data points are clubbed into a single set called a cluster.
  • Example: clustering algorithms that group related data points into a single cluster

Color(X) Sweetness(Y)
0.80 0.90
0.80 0.84
0.10 0.27
0.30 0.47
0.83 0.83
0.60 0.97
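
As a rough illustration of clustering, the sketch below runs a plain K-Means loop (one of the techniques listed later) on the unlabeled table above. The choice of two clusters, the starting centroids (the first and third rows) and the helper name `kmeans` are assumptions made for this example; a real implementation would also handle empty clusters and check for convergence.

```python
import numpy as np

# Unlabeled data from the table above: [color, sweetness] per fruit.
X = np.array([
    [0.80, 0.90], [0.80, 0.84], [0.10, 0.27],
    [0.30, 0.47], [0.83, 0.83], [0.60, 0.97],
])

def kmeans(X, init_centroids, n_iters=10):
    """Plain K-Means: alternate between assigning points and updating centroids."""
    centroids = init_centroids.copy()
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[assignments == j].mean(axis=0)
                              for j in range(len(centroids))])
    return assignments, centroids

# Assumption: start from the first and third rows as initial centroids.
assignments, centroids = kmeans(X, X[[0, 2]])
print(assignments)
print(centroids)
```

No labels are used anywhere: the two groups emerge only from how close the points are to each other.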

Reinforcement Learning

  • It has a feedback element to improve its performance
  • Based upon the idea of “reward”, the algorithm moves in the direction that achieves maximum reward.
  • Good application: teaching a machine to play games like Tic-Tac-Toe, Chess, etc.
  • The algorithm reuses moves tried in the past that led to successful results

Gaming AI
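
The reward idea can be sketched with a much smaller problem than a board game. The example below is an assumption-laden toy, not something covered in the webinar: an "epsilon-greedy" agent repeatedly picks one of three actions, each paying a reward with a different (made-up) probability, and keeps a running average of the reward seen for each action. Most of the time it repeats the action that has worked best so far, and occasionally it tries a random one.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit; returns estimated values."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # running average reward per action
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore a random action
        else:
            action = estimates.index(max(estimates))   # exploit the best so far
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

# Hypothetical reward probabilities for three actions.
estimates = run_bandit([0.2, 0.5, 0.8])
best_action = estimates.index(max(estimates))
print(best_action)
```

After enough steps the agent's estimate for the best-paying action should dominate, which is the "move toward maximum reward" behaviour described above.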

Popular Techniques

  • K-Means
  • K-Nearest Neighbours
  • Regression
  • Decision Trees
  • Naive Bayes
  • Neural Networks
  • Support Vector Machines (SVM)
  • Deep Learning

Open Source Packages

  • scikit-learn
  • TensorFlow
  • PyTorch

Developer Checklist

  • Basic Python 2.7+, pip and Jupyter Notebook installed
  • NumPy - Mathematical operations
  • Pandas - Working with CSVs (Excel sheets), data reading and writing
  • Matplotlib - Plotting graphs

Python Basics

  • Lists
  • Dictionaries
  • Sorting
  • Lambda Function
  • Range Function
# Working with Lists
a = [ 1,2,4,5,6,"Hello"]

print a

# Slicing in Lists
print a[1:2]
print a[2:]
print a[:4]
print a[:]

# Dictionaries in Python (hash maps in C++, objects in JavaScript)
prices = {"mango": 100, "apple": 120, "banana": [10, 20, 30]}

print type(prices)

print prices["mango"]

# List all the keys and all the values
print prices.keys()
print prices.values()

# Loops
i = 1
while i<=10:
    print i
    i += 1
# Range Fn: range(s, e, jump) counts from s up to (but not including) e in steps of jump

x = "Python"
if x=="Python":
    print "yes"
else:
    print "no"

[1, 2, 4, 5, 6, 'Hello']
[4, 5, 6, 'Hello']
[1, 2, 4, 5]
[1, 2, 4, 5, 6, 'Hello']
<type 'dict'>
['mango', 'apple', 'banana']
[100, 120, [10, 20, 30]]
# Sorting in Lists
a = [5,4,3,1,2]

a = sorted(a,reverse=True)
print a
[5, 4, 3, 2, 1]
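
Lambda functions are listed in the basics above and come up again when sorting neighbours in the KNN code later on. Here is a small sketch of sorting with a lambda as the `key`; the fruit names and prices are placeholder values, and the snippet uses `print(...)` calls so it runs on Python 3.

```python
# Hypothetical (name, price) pairs used only for illustration.
fruits = [("mango", 100), ("apple", 120), ("banana", 30)]

# key=lambda pair: pair[1] tells sorted() to compare items by price.
by_price = sorted(fruits, key=lambda pair: pair[1])
print(by_price)

# The same key with reverse=True sorts from most to least expensive.
by_price_desc = sorted(fruits, key=lambda pair: pair[1], reverse=True)
print(by_price_desc)
```

A lambda is just a small unnamed function, so `lambda pair: pair[1]` behaves like a one-line `def` that returns the second element of each tuple.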

Math in Python | Packages & Imports

import math

print math.log10(100)
print math.sqrt(2)
from math import sqrt as sq


Scientific Computation

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# NumPy is a library written mostly in C, with a Python interface
# Arrays in Numpy

arr = np.array([1,2,3,4])
# It is of fixed size; it is not dynamic
print type(arr)
<type 'numpy.ndarray'>
# 2-D arrays in Numpy

a = np.zeros((4,4))
print a

a[ : ,0] = 2
a[1, :] = 3
print a
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
[[ 2.  0.  0.  0.]
 [ 3.  3.  3.  3.]
 [ 2.  0.  0.  0.]
 [ 2.  0.  0.  0.]]
# Unique and Argmax Functions
arr = np.asarray([1,2,3,5,3,7,4,2,1,7,7])

b = np.unique(arr,return_counts=True)
print b
index = b[1].argmax()
print b[0][index]
(array([1, 2, 3, 4, 5, 7]), array([2, 2, 2, 1, 1, 3]))

Plotting Graphs using Matplotlib

from jupyterthemes import jtplot

a = np.asarray(range(100))





# Scatter Plots

# Random Values in the Range 0 and 1
arr = np.random.random((100,2)) 
print arr.shape

print arr 
# First parameter is the X coordinate, second parameter is Y
# We are using the Scatter Plot
plt.scatter(arr[:, 0], arr[:, 1])
plt.show()
(100, 2)
[[ 0.00291436  0.01448286]
 [ 0.51031075  0.65767026]
 [ 0.97074722  0.83606565]
 ...
 [ 0.50274979  0.51384979]
 [ 0.96437541  0.14919605]]


Scatter Plot

Probability Distribution

Random Variable

A random variable is a variable whose possible values are numerical outcomes of a random experiment.

For example, a random variable could denote:

  • The number of characters in a book picked at random from all the books in the world
  • The length of the name of a movie picked at random from all the movies released so far
  • The outcome of a dice throw

Mean and Expectation

μ = E(X)
μ is the mean
E(X) is the expected value of X.
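
As a worked example, take the dice throw mentioned earlier. For a discrete random variable the expected value is the sum of each outcome multiplied by its probability. The sketch below computes that sum for a fair six-sided die and compares it with the average of many simulated throws; the seed and the number of throws are arbitrary choices.

```python
import numpy as np

# Outcomes of a fair six-sided die and their probabilities.
outcomes = np.arange(1, 7)
probs = np.full(6, 1.0 / 6.0)

# E(X) = sum over all outcomes x of x * P(X = x)
expected_value = float((outcomes * probs).sum())
print(expected_value)

# The average of many simulated throws should land close to E(X).
rng = np.random.RandomState(0)  # fixed seed, chosen arbitrarily
throws = rng.randint(1, 7, size=100000)
print(throws.mean())
```

The expected value works out to 3.5, which is not a value the die can actually show: the mean describes the long-run average, not a typical single outcome.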

Normal/Gaussian Distribution

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

Standard Normal Distribution
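
The standard normal distribution is the normal distribution with mean 0 and standard deviation 1. A minimal sketch of its density function follows; the helper name `normal_pdf` is chosen for this example, and only the standard library is used.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# With the defaults (mu=0, sigma=1) this is the standard normal density.
peak = normal_pdf(0.0)
print(peak)

# The density is symmetric around the mean.
print(normal_pdf(1.0) == normal_pdf(-1.0))
```

The density peaks at the mean, where its value is 1/sqrt(2π), roughly 0.399, and falls off symmetrically on both sides.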

Multivariate Normal Distribution

Example in 2 dimensions

KNN Algorithm

  • The K-nearest-neighbor (KNN) algorithm measures the distance between a query scenario and a set of scenarios in the data set.
  • KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x,y) and would like to capture the relationship between x and y.
  • This method is used for both classification and regression.

Training Data

Color(X) Sweetness(Y) Label
0.80 0.83 Apple
0.80 0.85 Apple
0.10 0.27 Lemon
0.30 0.47 Lemon
0.83 0.87 Apple
0.60 0.97 Apple

Test Data

Color(X) Sweetness(Y) Actual Label Predicted Label
0.91 0.75 Apple Apple
0.11 0.25 Lemon Lemon
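
The two tables above can be checked with a few lines of NumPy. The sketch below is one possible way to do it, not the implementation built later in this session: it finds the `k` closest training fruits for each test fruit and takes a majority vote. The helper name `knn_predict` and the choice `k=3` are assumptions; with only six training rows (two of them lemons), a larger `k` such as 5 would always let the apples outvote the lemons.

```python
import numpy as np
from collections import Counter

# Training data from the table above: [color, sweetness] and labels.
X_train = np.array([
    [0.80, 0.83], [0.80, 0.85], [0.10, 0.27],
    [0.30, 0.47], [0.83, 0.87], [0.60, 0.97],
])
y_train = ["Apple", "Apple", "Lemon", "Lemon", "Apple", "Apple"]

def knn_predict(X, y, query, k=3):
    """Majority label among the k training points closest to query."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Test data from the table above.
X_test = np.array([[0.91, 0.75], [0.11, 0.25]])
predictions = [knn_predict(X_train, y_train, q) for q in X_test]
print(predictions)
```

Both predictions agree with the "Predicted Label" column of the test table.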


# Red values are higher, yellow lower
# Sweetness is higher, sourness lower

# Apples: higher color and sweetness values
mean_01 = np.array([3.0,4.0])

# 2 x 2 covariance matrix
cov_01 = np.array([[1.0,-0.5],[-0.5,1.0]])

# Lemons are sour, so avg sweetness will be low; they also have a low value for color
mean_02 = np.array([0.0,0.0])

cov_02 = np.array([[1.0,0.5],[0.5,0.6]])

dist_01 = np.random.multivariate_normal(mean_01,cov_01,200)
dist_02 = np.random.multivariate_normal(mean_02,cov_02,200)

print dist_01.shape
print dist_02.shape
# print dist_01
(200, 2)
(200, 2)
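# Try to make a scatter plot of these points
# One scatter call per class; the colors are arbitrary
plt.scatter(dist_01[:, 0], dist_01[:, 1], color='red')
plt.scatter(dist_02[:, 0], dist_02[:, 1], color='yellow')
plt.show()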

# Training Data Preparation

# 400 Samples - 200 Apples, 200 for Lemons

labels = np.zeros((400,1))
labels[200:] = 1.0

X_data = np.zeros((400,2))
X_data[:200,:] = dist_01
X_data[200: ,:] = dist_02

# print X_data
# print labels

KNN Algorithm :)

# Distance of the query point to all other points in the space takes O(N) time for every query point, plus sorting
# For Q query points, the complexity is O(Q.N)

#Euclidean Distance 
def dist(x1,x2):
    return np.sqrt(((x1-x2)**2).sum())

x1 = np.array([0.0,0.0])
x2 = np.array([1.0,1.0])

print dist(x1,x2)

def knn(X_train,query_point,y_train,k=5):
    vals = []
    for ix in range(X_train.shape[0]):
        v = [ dist(query_point,X_train[ix,:]), y_train[ix]]
        vals.append(v)
    # vals is a list containing distances and their labels
    updated_vals = sorted(vals)
    # Let us pick the top K values
    pred_arr = np.asarray(updated_vals[:k])
    pred_arr = np.unique(pred_arr[:,1],return_counts = True)
    # Largest occurrence
    index = pred_arr[1].argmax() # Index of largest freq
    return pred_arr[0][index]

q = np.array([0.0,4.0])

predicted_label  = knn(X_data,q,labels)
print predicted_label

## Run a Loop over a testing data(Split the original data into 2 sets - Training, Testing)

# Find predictions for Q Query points

# If predicted outcome = actual outcome -> Success else Failure

# Accuracy =  (Successes)/ (Total no of testing points) * 100
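
The evaluation steps listed above can be sketched as follows. This is a self-contained example under stated assumptions rather than the webinar's own code: it regenerates two synthetic classes (using identity covariance for simplicity), shuffles them, holds out 25% for testing, and uses a vectorized `knn_predict` helper written for this sketch. The seed, the split ratio and the helper name are all arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(42)  # fixed seed, chosen arbitrarily

# Synthetic two-class data (assumed means; identity covariance for simplicity).
class_0 = rng.multivariate_normal([3.0, 4.0], np.eye(2), 200)
class_1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), 200)
X = np.vstack([class_0, class_1])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Shuffle, then hold out the last 25% of the samples for testing.
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.75 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

def knn_predict(X_train, y_train, query, k=5):
    """Majority label among the k nearest training points."""
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest_labels = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[counts.argmax()]

# Count a success whenever the predicted label equals the actual label.
successes = sum(knn_predict(X_train, y_train, q) == label
                for q, label in zip(X_test, y_test))
accuracy = 100.0 * successes / len(X_test)
print(accuracy)
```

Shuffling before the split matters here because the data was stacked class by class; without it, the test set would contain only one class.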

Project Work - Handwritten Digit Recognition on MNIST Dataset(using KNN)

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
ds = pd.read_csv('./train.csv')
print ds.shape

data = ds.values
print data.shape
(42000, 785)
(42000, 785)
y_train = data[:, 0]
X_train = data[:, 1:]

# X_train = (X_train - X_train.mean(axis=0))/(X_train.std(axis=0) + 1e-03)

print y_train.shape, X_train.shape

idx = 104
print y_train[idx]
plt.imshow(X_train[idx].reshape((28, 28)), cmap='gray')
(42000,) (42000, 784)
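
The commented-out line above standardizes each pixel column before computing distances. The sketch below shows what that operation does on a tiny made-up matrix, so the effect is easy to see; the values in `X` are placeholders and `eps` mirrors the `1e-03` constant from the commented-out line.

```python
import numpy as np

# Hypothetical 3-sample, 2-feature matrix with very different feature scales.
X = np.array([[0.0, 100.0],
              [5.0, 200.0],
              [10.0, 300.0]])

# Subtract each column's mean and divide by its standard deviation.
# The small constant keeps the division safe for constant columns
# (such as MNIST pixels that are always 0).
eps = 1e-3
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After standardization every column has a mean of zero and a standard deviation close to one, so no single feature dominates the Euclidean distance just because of its scale.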

def dist(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum())

def knn(X_train, x, y_train, k=5):
    vals = []
    for ix in range(X_train.shape[0]):
        v = [dist(x, X_train[ix, :]), y_train[ix]]
        vals.append(v)
    # Sort by distance and keep the k nearest neighbours
    updated_vals = sorted(vals, key=lambda x: x[0])
    pred_arr = np.asarray(updated_vals[:k])
    # Count how often each label occurs among the neighbours
    pred_arr = np.unique(pred_arr[:, 1], return_counts=True)
    pred = pred_arr[1].argmax()
    # return pred_arr[0][pred]
    return pred_arr, pred_arr[0][pred]
idq = int(np.random.random() * X_train.shape[0])
q = X_train[idq]

res = knn(X_train[:10000], q, y_train[:10000], k=7)
print res
print y_train[idq]

plt.imshow(q.reshape((28, 28)), cmap='gray')
((array([ 3.]), array([7])), 3.0)

Subscribe to us on YouTube for more such tutorials.

Download Project

Data files and complete code can be downloaded from GitHub.