This is a very light introduction to machine learning. I will demonstrate how to solve one specific problem using scikit-learn, a machine learning framework, in just 3 lines of code.
I created a synthetic problem where the data is very simple, so we don’t have to struggle with distilling it.
Based on very simple rules, we have labeled points on a coordinate plane. If x > 0 and y < 0, the point is labeled ‘x’. If x < 0 and y > 0, the point is labeled ‘o’.
   o      ^ y
     o    |
  o       |
          |
----------+----------> x
          |   x
          |     x
          |  x
Our task is to label any given point.
Input: (x, y)
Output: ‘x’ or ‘o’
E.g.
Input: (1, -7)
Output: ‘x’
In this case, implementing the classification manually using if statements is really easy:
def label_for(x, y):
    if x < 0 and y > 0:
        return 'o'
    if x > 0 and y < 0:
        return 'x'
    return '?'
Anyway, let’s compare it with the machine learning implementation.
from sklearn.neighbors import KNeighborsClassifier
import coords
clf = KNeighborsClassifier()
clf.fit(*training_data(coords.make_n_random(100)))
print(clf.predict([[1, -7], [-1, 7]]))
Basically, the last 3 lines of code create the classifier, train it with sample data, and predict the labels for the given sample coordinates. The full solution code is at https://github.com/povilasb/machinelearning/blob/master/labeledcoordinates/x_o.py. If we run it, we get the output:
['x' 'o']
Meaning (1, -7) was labeled as ‘x’ and (-1, 7) as ‘o’, which is correct.
Let’s take a deeper look at how this actually works.
To make our classifier understand which coordinates have which labels, we must train it with sample data.
We will not get the training data from anywhere. Instead, we will generate it. We’ll use the make_n_random() function from the coords module to generate N coordinates that comply with the rules given above. Basically, this function returns a list of labeled coordinates:
[(1, -5, 'x'), (-4, 3, 'o'), ...]
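The coords module is not listed in this post. As a rough sketch only, a generator of this kind could look like the code below. The function body, the 1..10 coordinate range and the seed parameter are assumptions made for illustration, not the actual implementation.

```python
import random


def make_n_random(n, seed=None):
    """Return n labeled points that follow the two rules of the problem.

    Hypothetical stand-in for coords.make_n_random(); the coordinate
    range and the seed parameter are assumptions.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        a = rng.randint(1, 10)
        b = rng.randint(1, 10)
        if rng.random() < 0.5:
            points.append((a, -b, 'x'))   # x > 0 and y < 0
        else:
            points.append((-a, b, 'o'))   # x < 0 and y > 0
    return points


sample = make_n_random(5, seed=42)
print(sample)
```

Passing a seed makes the generated data reproducible, which helps when comparing runs.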
coords.make_n_random() generates labeled coordinates in a different format than the scikit-learn classifier expects. So we need to reformat our training data.
That’s where we use the training_data() function:
def training_data(coordinates):
    """Converts coordinates to classifier acceptable format."""
    return (
        [[x, y] for x, y, _ in coordinates],
        [label for _, _, label in coordinates]
    )
It separates coordinates and labels into two lists: one with [x, y] pairs and one with the matching labels.
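To make the conversion concrete, here is the same function applied to three hand-written points. The sample points are made up for this illustration.

```python
def training_data(coordinates):
    """Converts coordinates to classifier acceptable format."""
    return (
        [[x, y] for x, y, _ in coordinates],
        [label for _, _, label in coordinates],
    )


# Made-up sample in the (x, y, label) format described above.
labeled = [(1, -5, 'x'), (-4, 3, 'o'), (2, -1, 'x')]

features, labels = training_data(labeled)
print(features)  # [[1, -5], [-4, 3], [2, -1]]
print(labels)    # ['x', 'o', 'x']
```

The two lists stay aligned by position, so features[i] always belongs to labels[i].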
Once we have preprocessed the data, we can continue with training the model and classifying new coordinates:
clf = KNeighborsClassifier()
clf.fit(*training_data(coords.make_n_random(100)))
print(clf.predict([[1, -7], [-1, 7]]))
KNeighborsClassifier is a Python class that implements the k-nearest neighbors algorithm.
clf.fit() trains the classifier with the given labeled data.
clf.predict() returns predicted labels for the specified coordinates.
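To get a feel for what k-nearest neighbors does, here is a deliberately simplified pure-Python sketch of the idea: find the k training points closest to the query and take a majority vote over their labels. This is not how scikit-learn implements it internally; knn_predict and the sample points are hypothetical, and k defaults to 5 only to mirror the default n_neighbors of KNeighborsClassifier.

```python
from collections import Counter
import math


def knn_predict(points, labels, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the (point, label) pairs by Euclidean distance to the query.
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training points that follow the rules of the problem.
points = [[1, -2], [3, -1], [2, -5], [-1, 2], [-3, 1], [-2, 5]]
labels = ['x', 'x', 'x', 'o', 'o', 'o']

print(knn_predict(points, labels, [1, -7], k=3))  # x
print(knn_predict(points, labels, [-1, 7], k=3))  # o
```

Note that "training" here is nothing more than keeping the labeled points around; all the work happens at prediction time.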
In this case it’s obvious that solving the problem with if statements is way easier: you don’t need to gather any data, scrub it, etc.
But what if our input data changes as time goes by?
Let’s say now every coordinate where x > 0 and y > 0 is labeled as ‘s’:
   o      ^ y
     o    |   s
  o       |     s
          |  s
----------+----------> x
          |   x
          |     x
          |  x
In a machine learning-based implementation we don’t need to change any code. We just have to retrain the model with new data.
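A small sketch of what retraining could look like: the same train() helper is called twice, once with the old data and once with data that also contains ‘s’ points. The train() helper and the hand-written points are assumptions for illustration; n_neighbors=3 is chosen only because the sample is tiny.

```python
from sklearn.neighbors import KNeighborsClassifier


def train(labeled_points):
    """Fit a fresh classifier on (x, y, label) tuples (hypothetical helper)."""
    features = [[x, y] for x, y, _ in labeled_points]
    labels = [label for _, _, label in labeled_points]
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(features, labels)
    return clf


# Made-up data: the original two labels, then the same data plus 's' points.
old_data = [(1, -2, 'x'), (3, -1, 'x'), (2, -5, 'x'),
            (-1, 2, 'o'), (-3, 1, 'o'), (-2, 5, 'o')]
new_data = old_data + [(1, 2, 's'), (3, 1, 's'), (2, 5, 's')]

old_clf = train(old_data)
new_clf = train(new_data)

# The old model has never seen 's', so it can only answer 'x' or 'o'.
print(old_clf.predict([[2, 3]]))
# The retrained model knows the new label; the code itself did not change.
print(new_clf.predict([[2, 3]]))
```

Only the data passed to train() differs between the two models.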
If we implemented the classification manually, we would have to program a new rule:
 def label_for(x, y):
     if x < 0 and y > 0:
         return 'o'
     if x > 0 and y < 0:
         return 'x'
+    if x > 0 and y > 0:
+        return 's'
     return '?'
When we use a framework, it can be really easy to solve problems using machine learning.
The example problem was easy to implement using if statements. But any change in the input data requires adapting the algorithm. Also, in real-life scenarios problems are not that simple. For example, if we wanted to recognize digits in an image, we would have to program 10 different cases for the different digits. And if the digit font changed, we would have to adapt the code again, etc. Using machine learning, all we need to do is train our model with new data. And this is way more scalable.
So machine learning helps to solve a lot of problems that would otherwise be impractical to solve by hand. And we don’t really need to understand the maths behind it to get started, because there are great tools that do the job for us.
I used the scikit-learn framework with Python 3. It depends on a lot of other packages. So if you don’t have scikit-learn installed on your machine, I created a Docker image.
Now all you have to do to run your python script in this environment is:
$ docker run -it --rm=true -v `pwd`:/tmp/ml povilasb/scikitlearn python3 /tmp/ml/x_o.py
This command will download the Docker image, create a container, run the specified script in it, and finally destroy the container.