The Ones and Zeros of Numpy as np

As with any good project nowadays, there eventually came a point when some machine-learning-based classification "needed" to come into play (this is a joke; you should always use the right tool for the job).

There are many toolkits for working with datasets for data mining, feature extraction, and classification. scikit-learn is a Python package that began life as a Google Summer of Code project and is now maintained by a large community of contributors. It is easily installed using pip install -U scikit-learn.
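
A quick way to check that the install worked is to import the package and print its version:

import sklearn
print(sklearn.__version__)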

Support Vector Classifier (SVC) is a common classifier that supports multiple classes. It belongs to the family of machine learning algorithms known as Support Vector Machines (SVMs).
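
As a quick illustration of the multi-class point (with made-up data), an SVC fit on three groups works the same way as one fit on two:

from sklearn import svm

# three made-up clusters, one per class label
X = [[0, 0], [0, 1], [5, 5], [5, 6], [-5, 0], [-5, 1]]
y = [0, 0, 1, 1, 2, 2]

clf = svm.SVC(kernel='linear')
clf.fit(X, y)
print(clf.predict([[5, 5.5]]))  # -> [1], the cluster around (5, 5)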

Let's say we wanted to classify a set of X, Y coordinates into two groups: positive and negative numbers. We could train an SVC on a dataset and use it to predict the group of future inputs.

  • 0 will represent the group of positive coordinates
  • 1 will represent the group of negative coordinates

import numpy as np
from sklearn import svm

# define positive and negative coordinates
positive = np.array([[1, 1], [5, 5], [2, 3], [2, 4]])
negative = np.array([[-1, -1], [-3, -4], [-5, -1], [-2, -2]])

Positive and negative pairs of coordinates have now been created, representing groups 0 and 1 respectively. We will use them to train a classifier and then test some new coordinates to see how well it predicts. Real datasets are usually much larger; this one is kept small for clarity.

# generate X data for the classifier by combining the positive and negative coordinates
X = np.vstack((positive, negative))

# generate y group labels for the data
y = [0, 0, 0, 0, 1, 1, 1, 1]
# or using the ones and zeros of np!
# might not seem like much here, but on larger datasets this helps a lot!
# (.astype(int) keeps the labels integers, matching the list above)
y = np.append(np.zeros(positive.shape[0]), np.ones(negative.shape[0])).astype(int)

The classifier trains on samples from both groups (X), paired with an array of binary group labels (y) that marks each sample as 0 or 1.
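
To see why the np version pays off, imagine a (hypothetical) larger dataset with a thousand samples per group; np.zeros and np.ones build the whole label array in one line instead of a two-thousand-element literal:

# hypothetical larger dataset: 1000 samples per group
n = 1000
y_big = np.append(np.zeros(n), np.ones(n)).astype(int)
print(y_big.shape)  # -> (2000,)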

The model can then be fit to the data using sklearn.svm.

clf = svm.SVC(kernel='linear')
fit = clf.fit(X, y)

Printing the fitted model shows the classifier's parameters; the output is probably similar to:

>>> print(fit)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Yay! We have now trained our classifier.
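
As a side note, the fitted model exposes some of what it learned; for example, the training points it kept as support vectors:

print(fit.support_vectors_)  # the training points that define the decision boundary
print(fit.n_support_)        # number of support vectors per class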

Let's see how well we did by trying a few test cases.

Points are passed as a 2-D array, whether it holds a single coordinate pair or many:

predict = fit.predict([[1, 4]])
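
Several points can also be predicted in one call; these extra test points are made up for illustration:

predictions = fit.predict([[1, 4], [-2, -3], [3, 3]])  # -> array([0, 1, 0])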

A score (the mean accuracy on the given test data) can be derived from a set of labeled test points, such as

score = fit.score([[2, 3], [4, 1], [5, 4], [2, 1], [4, 3]], np.zeros(5))

for testing new positive coordinates. The same can be done with negative coordinates or any combination of both.
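
A mixed test set works the same way; pair each test coordinate with its expected group label (these test points are also made up):

X_test = [[3, 4], [-1, -2], [2, 2], [-4, -3]]
y_test = [0, 1, 0, 1]
mixed_score = fit.score(X_test, y_test)  # fraction of correct predictions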

Print out the results.

print "predict ", predict # -> [0]
print "score ", score # -> [1.0]

np.ones and np.zeros. A new appreciation. Perfect classification.