
Census Income Classification Benchmark with the Knowm API

The first classification task we are going to tackle with the Knowm API is predicting whether an individual is rich or poor by looking at some of the attributes provided on their census filing. The Census Income benchmark dataset is commonly used to build and test relatively simple classifiers: the entire dataset is small, the feature vector for each record is short, and it is a binary classification task rather than a multi-label one, i.e. is this person rich or poor?

Classification tasks like these are very useful for census organizations and ad companies who are constantly attempting to gather additional information on individuals from a population. For instance, Google might classify you based on your internet activity and market ads to you directly.

Image by kenteegardin

We will not be performing ad-placement categorization, but rather attempting to predict the income level (> $50K or <= $50K) of an individual from a predefined list of 14 attributes (the feature vector), as shown directly below.

Attribute Name    Attribute Data Type
Age               Integer
Workclass         String
Fnlwgt            Floating point
Education         String
Education-num     Integer
Marital-status    String
Occupation        String
Relationship      String
Race              String
Sex               String
Capital-gain      Integer
Capital-loss      Integer
Hours-per-week    Integer
Native-country    String

Here are a few examples from the original raw data CSV file:
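In the raw file, each record is a single comma-separated line ending with the income label. The rows below are representative of the raw format (drawn from the public UCI adult.data file):

    39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
    50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
    38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K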

We will be using the open source Java Datasets project to access the raw data, as it provides an extremely convenient way to query the data in the form of POJOs (Plain ol’ Java Objects). To see what I mean, here is the CensusIncome class:
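The full class in the Datasets project also carries an id and a complete set of getters and setters, but a simplified sketch of the POJO looks roughly like this (field names follow the dataset attributes and are illustrative):

    public class CensusIncome {

      private int age;
      private String workClass;
      private double fnlwgt;
      private String education;
      private int educationNum;
      private String maritalStatus;
      private String occupation;
      private String relationship;
      private String race;
      private String sex;
      private int capitalGain;
      private int capitalLoss;
      private int hoursPerWeek;
      private String nativeCountry;
      private boolean incomeGreaterThan50k; // the label: rich (> $50K) or poor (<= $50K)

      public int getAge() {
        return age;
      }

      public void setAge(int age) {
        this.age = age;
      }

      // ... analogous getters and setters for the remaining fields
    }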

And we can query CensusIncome objects via methods such as:
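The Datasets project exposes these as DAO-style queries; the method names below illustrate the style rather than the exact API:

    // Illustrative DAO-style queries (method names are assumptions):
    List<CensusIncome> trainData = CensusIncomeDAO.selectTrainData(); // training split
    List<CensusIncome> testData = CensusIncomeDAO.selectTestData();   // held-out test split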

If you haven’t already, you can access this classifier example by signing up for the Knowm Developer Community and downloading the Java code. If not, you can still follow along to see how it’s done.

Building a Classifier with the Knowm API

For the remainder of these tutorials we will be building our classifiers in Java, and each will inherit from ClassifierApp.java. This wrapper is responsible for the generic code that calls the training and testing loops and plots the results in a presentable way. It also does all the interfacing with the Linear Classifier.

In order to define the specifics, our subclasses will implement some of this class’s methods, which we will go over later. First, let’s look at the main class, CensusIncomeAppKtRam.java, found here:

Before we can train our classifier we need to load our dataset and build the classifier. This is all done in the main method.
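A minimal sketch of that main method, assuming the DAO-style loader calls shown earlier (only go() is named in the text):

    public static void main(String[] args) {

      // Load the dataset into local memory (the DAO calls are illustrative).
      List<CensusIncome> trainData = CensusIncomeDAO.selectTrainData();
      List<CensusIncome> testData = CensusIncomeDAO.selectTestData();

      // Instantiate our ClassifierApp subclass and kick off the train/test loops.
      CensusIncomeAppKtRam app = new CensusIncomeAppKtRam(trainData, testData);
      app.go();
    }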

Here we load the dataset into local memory, instantiate an instance of our ClassifierApp, and finally call the go() method. Loading the dataset might fail if you haven’t yet downloaded the Census Income dataset to your local repository.

Once we call the go() method, our superclass will build the kT-RAM object and start the training and testing loops. However, before it can do this we need to set the classifier specifics.

First off, the classifier app will need to instantiate an instance of kT-RAM, and to do that it needs to know the core type, which we provide by overriding getCoreType(). The core type defines the precision of each synapse component on kT-RAM. If you’re a KDC member, you can learn more about the different core types from lesson 1: kt-synapse.

Every app will also need to set the initial conductances of each synapse. The initial conductance defines how conductive the synapses on kT-RAM are when we first initialize them.

During training we are going to loop through each example a certain number of times (epochs). We will do 1 epoch.

We will also need to define the labels we are classifying each example under. These are provided by overriding getEvaluationLables(). For Census Income, these are “rich” and “poor”.

When we read the outputs of our AHaH nodes, we can restrict our classifier to only make a prediction once it has passed a certain confidence threshold; this will affect the primary metrics of our classifier. The ClassifierApp will try all the thresholds in a given range and record how they change the results, in this case 0 to 0.2.

We also need to define our benchmark name and the results we are going to display.
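Putting those settings together, the overrides might look roughly like the sketch below. Only getCoreType() and getEvaluationLables() are named above; the remaining method names, the enum value and the initial conductance are illustrative placeholders.

    @Override
    public CoreType getCoreType() {
      return CoreType.BYTE; // synapse precision of the emulated kT-RAM core
    }

    @Override
    public float getInitialConductance() {
      return 0.001f; // start the synapses in a low-conductance state (illustrative value)
    }

    @Override
    public int getNumTrainingEpochs() {
      return 1; // a single pass over the training set
    }

    @Override
    public String[] getEvaluationLables() {
      return new String[] { "rich", "poor" }; // the two class labels
    }

    @Override
    public double getMinConfidenceThreshold() {
      return 0.0; // lowest confidence threshold to sweep
    }

    @Override
    public double getMaxConfidenceThreshold() {
      return 0.2; // highest confidence threshold to sweep
    }

    @Override
    public String getBenchmarkName() {
      return "Census Income";
    }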

The last three methods we will need to override are test(), learn() and getNewEncoder(). These methods will be unique for each classifier app we build. In the remaining tutorials we will jump directly to how we encode, train and test.

Spike Encoding

One of the features of learning with the Knowm API is that all inputs have to be spike encoded, because the chip we are emulating only accepts spike-encoded values. Lesson 1: kt-synapse goes into more detail if you’re interested. This is accomplished by an encoder, and in this case we’ve written a special CensusIncomeSpikeEncoder.java for converting each example from the Census Income dataset into spikes.

This encode method works on single examples by encoding each feature individually and joining the results together. To do this, the encoder needs to handle all the different types of data that a single Census Income example is made of. In our case, each example is composed of Strings and Integers, so our encoder is built on top of two datatype encoders: one for encoding the strings, and one for the base-ten integers.

If you’re part of the KDC you can learn more about how these individual encoders work from the Standard Spike Encoders tutorial. Otherwise, it’s sufficient to understand that these encoders take each datatype and convert it into an array of integers which later map to particular synapses on the kT-RAM chip.

In the following method, the encoder works on one CensusIncome example at a time by using the appropriate encoder for each data type and then joining them all into a single array of spikes.
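A sketch of that encode method is shown below; the stringEncoder and integerEncoder fields stand in for the two datatype encoders, and their names and return types are illustrative.

    public int[] encode(CensusIncome example) {

      List<Integer> spikes = new ArrayList<>();

      // String attributes go through the string encoder. Prefixing each value
      // with its attribute name keeps, say, "education=Masters" distinct from
      // other attributes that might share the same value.
      spikes.addAll(stringEncoder.encode("workclass=" + example.getWorkClass()));
      spikes.addAll(stringEncoder.encode("education=" + example.getEducation()));
      spikes.addAll(stringEncoder.encode("occupation=" + example.getOccupation()));
      // ... and so on for the remaining String attributes

      // Integer attributes go through the base-ten integer encoder.
      spikes.addAll(integerEncoder.encode(example.getAge()));
      spikes.addAll(integerEncoder.encode(example.getHoursPerWeek()));
      // ... and so on for the remaining Integer attributes

      // Join everything into a single spike array; in practice each
      // sub-encoder's output is offset so that the spike IDs of different
      // attributes map to different synapses on the kT-RAM chip.
      int[] joined = new int[spikes.size()];
      for (int i = 0; i < spikes.size(); i++) {
        joined[i] = spikes.get(i);
      }
      return joined;
    }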

Training Phase

With that in place, let’s look at how we will use these spike encoded values to train.
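The full learn() implementation ships with the downloadable code; a minimal sketch of it, assuming the illustrative field names used earlier and a classify(spikes, label, evaluate) signature on the superclass, is shown here:

    @Override
    public void learn() {

      for (CensusIncome example : trainData) {

        // spike-encode the example
        int[] spikes = spikeEncoder.encode(example);

        // set the corresponding truth label
        String truthLabel = example.isIncomeGreaterThan50k() ? "rich" : "poor";

        // classify with evaluation turned off: this is the training step
        classify(spikes, truthLabel, false);
      }
    }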

This method is responsible for feeding the appropriate examples into the classifier so that the underlying kT-RAM chip can learn. This will happen for each example in our training set for as many epochs as we’ve previously decided. This is a standard procedure in Machine Learning called the “training phase”.

It’s a fairly simple method. We iterate through each example from our training set, encode it using our spike encoder, set the corresponding truth label, and then finally call the classify method from our superclass.

The false evaluation flag tells the parent ClassifierApp class not to evaluate the output of the classifier, which we don’t need until the testing phase.

As we iterate through, the classifier will train the emulated kT-RAM chip so that it learns to associate the spike features of the training set with the appropriate labels.

Great! We’ve built all the necessary parts of our classifier; now let’s test it.

Testing Phase and Primary Performance

Once we’ve gone through all of our epochs of training, the classifier will call the test() method:
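A sketch of that test() method, mirroring the training loop (names and the exact classify() signature are illustrative):

    @Override
    public void test() {

      for (CensusIncome example : testData) {

        int[] spikes = spikeEncoder.encode(example);
        String truthLabel = example.isIncomeGreaterThan50k() ? "rich" : "poor";

        // evaluation turned on: the label is used only to score the prediction
        classify(spikes, truthLabel, true);
      }
    }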

This will run through all the remaining test data and collect metrics on its performance. Our implementation is almost exactly like the training phase, except that we pass true for the evaluation flag. This ensures that the classifier does not cheat by using the label to make the prediction; instead, the label is used only to evaluate the prediction.

Let’s run the app and see what results we get.

Knowm API Census Income Performance

This output shows how the classifier performs on this task for a range of confidence values. Recall that the classifier can not only predict labels but also report how confident it is in its prediction. On physical kT-RAM, its confidence is proportional to the voltage magnitude during a read operation.

The different evaluation metrics were previously explained on the page Primary and Secondary Performance Metrics. We see that, as expected, the metrics change as the confidence threshold is varied. In order to compare apples to apples and see how our Knowm API classifier stacks up against other published Census Income benchmark studies, we will use the peak accuracy score from the plot, because most studies report the accuracy or the “error”, which is (1 – accuracy). Looking at the plot, the accuracy is 0.846, or equivalently the error is 0.154, i.e. 15.4%.

Simple Tricks to Increase Primary Performance

What we have shown is a typical classification work-flow, and as one can see below, the results are about average. If one wants to boost performance, there are a few tricks to keep in mind.

Check Synaptic Initialization

There are a few ways to initialize the memristors that compose the adaptive synapses in kT-RAM. Different initializations result in different learning behavior. For most classification tasks, it’s best to initialize the memristors to their low conductance state with minimal noise. A more detailed explanation is available in the Knowm API Tutorials.

Use Feature Learning

In many classification problems it is helpful to form features that represent regularities over a few of the input spike streams. One way to do this with minimal fuss is to use the GenericBeanEncoder class. This class handles most spike-encoding automatically and includes an option to turn on a simple form of feature learning consisting of a collection of AHaH ‘partitioners’. Each partitioner outputs a spike stream that becomes selective to regularities in its input and, via unsupervised AHaH plasticity, adapts to minimize the influence of noise.

Use Unsupervised Drop-Out

As we first disclosed in our Cortical Computing with kT-RAM paper, the performance of the classifier can be improved through a technique similar to one in machine learning called “Drop-Out”, although in our case we operate in a strictly unsupervised manner. The idea is to use unsupervised AHaH learning, which we get for free simply by reading the devices, and couple that to spike patterns where some of the input spikes have been eliminated.

Increase the Number of Training Epochs

Provided the spike-stream is stable, the AHaH Linear Classifier can typically generate acceptable performance in only one epoch of training, particularly when using the Byte or Nibble cores. However, adding another training epoch will modestly increase performance. Increasing the number of training epochs becomes more important as the spike space gets larger and the resolution of your feature encoding increases. Don’t go too far, or you will notice a slight drop in performance due to over-fitting.

As we include all of these tricks, we can usually boost the performance from “average” to “good” or “great”, as compared to other machine learning methods. In this case, we squeezed out another percent at the cost of additional computational time and energy. It is also worth keeping in mind that what we have shown is just one method to use kT-RAM to solve such problems. We continue to discover new methods. Remember, kT-RAM is not an algorithm — it is a general-purpose adaptive computational substrate. There are many ways to use it!

Knowm API Census Income Performance With Tricks

Comparison

Before you get into using the Knowm API to do more tasks like this, it’s important to highlight an important concept in machine learning: performance benchmarking. When developing a machine learning technology, you need to compare your results to other algorithms! If you do not do this, you run the risk of wasting your time developing a bad idea. It doesn’t matter how conceptually beautiful your algorithm is. It only matters that it solves the problem. So how does this method compare against the results of other algorithms?

Algorithm Error
FSS Naive Bayes 0.1405
NBTree 0.1410
C4.5-auto 0.1446
IDTM (Decision table) 0.1446
KnowmAPI – ByteCore – With Simple Tricks 0.145
HOODG 0.1482
C4.5 rules 0.1494
OC1 0.1504
KnowmAPI – ByteCore – No Tricks 0.154
C4.5 0.1554
Voted ID3 (0.6) 0.1564
CN2 0.1600
Naive-Bayes 0.1612
Voted ID3 (0.8) 0.1647
T2 0.1684
1R 0.1954
Nearest-neighbor (3) 0.2035
Nearest-neighbor (1) 0.2142

All benchmark results were pulled from the UCI Machine Learning Repository Census Income Data Set page.

Looking at the above table, we can see that our results compare nicely with other methods. While it is not the absolute best classifier, the no-tricks version ranks amongst the top performers despite the fact that only one training epoch was run and no optimization tricks were used to try to squeeze out more performance. As we start adding optimization tricks, we can squeeze out more primary performance, but usually at the cost of secondary metrics. In addition, we’ve built our classifier so that it will one day be able to run on physical kT-RAM, and once this happens these algorithms will benefit from massive improvements in speed and efficiency.

ROC Curve

Looking at the ROC curve, we see that the equal error rate or crossover error rate (EER or CER), the rate at which acceptance and rejection errors are equal, is roughly 0.2. The value of the EER can easily be obtained from the ROC curve by looking at where the curve crosses the EER line. In general, the classifier with the lowest EER is the most accurate.

Knowm API Census Income ROC

Secondary Metrics

As stated in Primary and Secondary Performance Metrics, most machine learning benchmark studies only report primary performance metrics. Here, we also report the secondary metrics when the benchmark is run on a 2015 MacBook Pro Retina. Wattage is a rough estimate obtained from the iStat Menus app.

Measurement Value
Power Consumption 19.5 Watts
Speed 6 Seconds
Volume 600 cubic centimeters

Reverse Lookup

Another feature of the kT-RAM linear classifier is that we can look at our AHaH nodes and see which spikes cause that AHaH node to evaluate positively and by how much. If we do this with our current classifier we can rank each attribute in order of its effect on being rich or poor.

For instance, you are very likely to be poor if your attributes include the following (ordered by relevance):

  1. education = 7th-8th
  2. education = 9th
  3. occupation = Farming-fishing
  4. workclass = Self-emp-not-inc
  5. occupation = Transport-moving

and you are likely to be rich when:

  1. education = Masters
  2. relationship = Wife
  3. age = 40
  4. occupation = Exec-managerial
  5. hoursperweek = 50

To see our CEO, Alex Nugent, talk about this “reverse lookup” in a presentation given at RIT, please watch the video: Introduction to AHaH Computing

Conclusion

In this article we stepped through the Census Income dataset, the kT-RAM linear classifier, spike encoding as well as primary and secondary performance metrics. If you made it this far, we thank you for taking the time to make it through the entire article. If you have any comments or questions please leave them in the comment section below!

Further Reading

TOC: Table of Contents
Previous: Primary and Secondary Performance Metrics
Next: Wisconsin Breast Cancer Classification Benchmark

