Getting Data from a CSV into Numpy
While working at a computer in an office, you will often come across a spreadsheet full of data. Python makes it easy to load numerical data into performant numpy arrays for data science and numerical computing, or just for fun!
A spreadsheet stores data in delimited columns, with one entry per row. A common plain-text format for this kind of data is comma-separated values (CSV), where each line is a row and commas separate the columns. CSV is a popular file format for spreadsheet information.
An example of such a file (data.csv), perhaps for health insurance, is

    Name,Risk,Age,Weight
    Nico Knack,high,21,145
    Marco Bolo,low,45,180
    Betty Boop,low,36,160

represented as a table

| Name       | Risk | Age | Weight |
| ---------- |:----:| ---:| ------:|
| Nico Knack | high |  21 |    145 |
| Marco Bolo | low  |  45 |    180 |
| Betty Boop | low  |  36 |    160 |

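To make the row and column structure concrete, here is a minimal sketch that parses the same contents with Python's standard `csv` module. The `CSV_TEXT` string and the `read_rows` helper are illustrative names, not part of the original example, and the data is held in memory so the snippet runs without a file on disk.

```python
import csv
import io

# Same contents as the example data.csv, kept in memory for a self-contained sketch.
CSV_TEXT = """Name,Risk,Age,Weight
Nico Knack,high,21,145
Marco Bolo,low,45,180
Betty Boop,low,36,160
"""


def read_rows(text):
    """Split CSV text into a header list and a list of data rows (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)   # first line holds the column names
    rows = list(reader)     # remaining lines are the data rows
    return header, rows


header, rows = read_rows(CSV_TEXT)
print(header)
for row in rows:
    print(row)
```

Note that every field comes back as a string, so the Age and Weight values would still need converting before any numerical work.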
Numpy it is!
The numerical data from this .csv file could be read into a numpy array using

    import numpy as np
    data = np.genfromtxt('data.csv', delimiter=',', usecols=(2, 3), skip_header=1)

The `usecols` argument selects which columns to read. Here it keeps only the
numeric Age and Weight columns and leaves out the textual data, which may
instead be used to classify the data. This would be useful when applying a
classification algorithm to determine if a new person is high or low risk.
The `skip_header` argument skips the specified number of lines at the top of the file, here the header row.
The result is

    >>> print(data)
    [[ 21. 145.]
     [ 45. 180.]
     [ 36. 160.]]

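The Risk column left out by `usecols` can be loaded separately as text and turned into numeric class labels. The following is a rough sketch, assuming the same file contents held in an in-memory string (`CSV_TEXT` is an illustrative name) and assuming a numpy version recent enough to support the `encoding` argument of `genfromtxt`.

```python
import io

import numpy as np

# Same contents as the example data.csv, kept in memory for a self-contained sketch.
CSV_TEXT = """Name,Risk,Age,Weight
Nico Knack,high,21,145
Marco Bolo,low,45,180
Betty Boop,low,36,160
"""

# Read only the Risk column (index 1) as text instead of floats.
risk = np.genfromtxt(
    io.StringIO(CSV_TEXT),
    delimiter=',',
    usecols=(1,),
    skip_header=1,
    dtype=str,
    encoding='utf-8',
)
risk = np.atleast_1d(risk).ravel()  # make sure we have a flat 1-D array

# Map the text to labels: -1 for high risk, 1 for low risk.
y = np.where(risk == 'high', -1, 1)

print(risk)
print(y)
```

Deriving the labels from the file this way avoids typing them in by hand and keeps them in the same row order as the numeric data.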
Classifying New People
One Step Further...
Let's import a support vector machine (SVM) from the Python package
scikit-learn. Here we will use a linear kernel.

    from sklearn import svm

    # Create the feature vectors for training
    # -1 is a high risk person
    # 1 is a low risk person
    X = data
    y = [-1, 1, 1]

    # Train the classifier
    clf = svm.SVC(kernel='linear')
    fit = clf.fit(X, y)

    # Predict new values (predict expects a 2-D array, one row per person)
    fit.predict([[19, 135]])

This tiny data set is just an example of how to get right into processing data with introductory machine learning tools.