Getting Data from a CSV into Numpy


While working at an office, you may often come across a spreadsheet with some data. Python makes it easy to load numerical data into performant NumPy arrays for applications in data science and numerical computing... or just for fun!

A spreadsheet organizes data into delimited columns, with one record per row. A popular plain-text format for this kind of data is comma-separated values (CSV), in which each line is a row and commas separate the columns.
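To make the format concrete, here is a minimal sketch that parses two CSV lines with Python's standard csv module. The in-memory text is a hypothetical stand-in for a real file, and every field comes back as a string.

```python
import csv
import io

# Hypothetical in-memory stand-in for a small CSV file.
text = "Nico Knack,high,21,145\nMarco Bolo,low,45,180\n"

# csv.reader splits each line on commas and yields one list of strings per row.
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    print(row)
```

Because the fields are plain strings, the numeric columns still need to be converted before doing any math on them.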


An example of such a file (data.csv), perhaps for health insurance, is

Name,Risk,Age,Weight
Nico Knack,high,21,145
Marco Bolo,low,45,180
Betty Boop,low,36,160

Represented as a table:

|   Name        |   Risk        |  Age  |    Weight         |
| ------------- |:-------------:| -----:|------------------:|
| Nico Knack    | high          |    21 |                145|
| Marco Bolo    | low           |    45 |                180|
| Betty Boop    | low           |    36 |                160|

Numpy it is!

The numerical data from this .csv file could be read into a numpy array using np.genfromtxt.

import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', usecols=(2,3), skip_header=1)

The usecols argument selects which columns to read, which is useful for leaving out textual columns. Those textual columns may instead serve as labels that classify the data, for example when training a classification algorithm to determine whether a new person is high or low risk. The skip_header argument skips the specified number of lines at the top of the file [1].
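As a sketch of how the textual Risk column could become class labels, the example below reads the same data twice: once for the numeric columns and once for the labels. It assumes the file starts with a one-line header, uses an in-memory string in place of data.csv, and relies on a reasonably recent NumPy that accepts dtype=str in np.genfromtxt.

```python
import io
import numpy as np

# Hypothetical in-memory copy of data.csv, assuming a one-line header.
csv_text = """Name,Risk,Age,Weight
Nico Knack,high,21,145
Marco Bolo,low,45,180
Betty Boop,low,36,160
"""

# Numeric features: columns 2 (Age) and 3 (Weight), header skipped.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     usecols=(2, 3), skip_header=1)

# Textual labels: column 1 (Risk), read as strings.
risk = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     usecols=(1,), skip_header=1, dtype=str)

# Encode high risk as -1 and low risk as 1.
labels = np.where(risk == 'high', -1, 1)

print(data.shape)
print(labels)
```

Reading the labels from the file avoids typing them by hand, which matters once the data set grows beyond a few rows.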

The result is

>>> print(data)
[[  21.  145.]
 [  45.  180.]
 [  36.  160.]]

Classifying New People

One Step Further...

Let's import a support vector machine (SVM) from the Python package scikit-learn. Here we will use a linear kernel.

from sklearn import svm

# Create the feature vectors for training
# -1 is a high risk person
# 1 is a low risk person
X = data
y = [-1,1,1]

# Train the classifier
clf = svm.SVC(kernel='linear')
fit = clf.fit(X, y)

# Predict new values
# predict expects a 2D array: one row per person
fit.predict([[19, 135]])
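With a linear kernel, the fitted classifier amounts to a weight vector and an intercept, and a prediction is the sign of a weighted sum of the features. The sketch below re-creates the three training rows by hand (an assumption, so that it runs on its own) and checks a manual computation against scikit-learn's own prediction.

```python
import numpy as np
from sklearn import svm

# Training data copied from the example above: [age, weight] per person.
X = np.array([[21.0, 145.0], [45.0, 180.0], [36.0, 160.0]])
y = [-1, 1, 1]  # -1 is high risk, 1 is low risk

clf = svm.SVC(kernel='linear').fit(X, y)

# For a linear kernel, coef_ holds the weights and intercept_ the offset.
w = clf.coef_[0]
b = clf.intercept_[0]

# Score a new person by hand: a positive score means class 1, otherwise -1.
X_new = np.array([[19.0, 135.0]])
score = X_new @ w + b
manual = np.where(score > 0, 1, -1)

print(manual, clf.predict(X_new))
```

The coef_ attribute is only available for the linear kernel, so this check does not carry over to other kernels.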

This tiny data set is only an example of how to get right into processing data with introductory tools in machine learning.