
Image Classification

Image classification is a task in computer vision that has seen a tremendous amount of research across academia and the private sector, a consequence of its central role in the computer vision systems pursued by large organizations such as Google, NASA, and many others.

Image classifiers must be able to infer the presence of objects from the visual information of an image, usually in the form of grayscale or RGB pixel intensities.

Often the task is made challenging by the sheer number of these points, as descriptive patterns may be embedded in high-dimensional manifolds and out of reach of a simple classifier. This problem is commonly called “the curse of dimensionality,” and solving it often requires preprocessing, dimensionality reduction, and feature learning before an image can be classified. To this end, recent advancements in deep learning and manifold learning have been successfully applied to image classification: finding brush-stroke features through consecutive layers of a Deep Belief Network, and projecting handwritten digits onto lower-dimensional manifolds, respectively.

MNIST Learned Manifold (image by L. van der Maaten)

Although recently very successful, these methods suffer from speed and power-consumption issues due to their need to update millions of parameters on a standard processing unit. The problem is compounded when dealing with large images or attempting to process many of them in real time. A neuromemristive chip, on the other hand, can perform a synaptic integration (sum) and update its parameter space (learn) in parallel and ‘for free’, a result of the physics of electronic currents and memristors. These properties make neuromemristive processors ideally suited for image and video classification vision systems.

The MNIST Data Set

MNIST Digits

MNIST is a canonical and historically significant image classification benchmark, and a considerable amount of research on MNIST image classification has been published.

The MNIST database was constructed out of the original NIST database; hence, modified NIST or MNIST. There are 60,000 training images and 10,000 test images, both drawn from the same distribution.

All of these black-and-white digits are size-normalized and centered in a fixed-size 28 × 28 pixel image, with the center of gravity of the pixel intensity lying at the center of the image. Thus, the dimensionality of each image sample vector is 28 × 28 = 784.
Each image is assigned a single truth label from [0, 9], making this a supervised multi-class classification problem.

Just like in the previous tutorials, we will be using the open-source Java Datasets project to access the raw data, as it provides an extremely convenient way to query the data in the form of POJOs (Plain Ol’ Java Objects). Here, each Mnist object contains its pixel information, id, label, and accessor methods (e.g. getImageMatrix()) for reading the grayscale information.
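As a rough sketch of the data objects involved, the simplified stand-in below mirrors the Mnist POJO described above. The fields and the toVector() helper are illustrative assumptions, not the actual Datasets API:

```java
// Simplified stand-in for the Mnist POJO described above (hypothetical fields).
public class Mnist {
    private final int id;
    private final int label;            // truth label, a digit in [0, 9]
    private final int[][] imageMatrix;  // 28 x 28 grayscale intensities

    public Mnist(int id, int label, int[][] imageMatrix) {
        this.id = id;
        this.label = label;
        this.imageMatrix = imageMatrix;
    }

    public int getId() { return id; }
    public int getLabel() { return label; }
    public int[][] getImageMatrix() { return imageMatrix; }

    // Flatten the 28 x 28 matrix into the 784-dimensional sample vector.
    public int[] toVector() {
        int[] v = new int[28 * 28];
        for (int r = 0; r < 28; r++)
            for (int c = 0; c < 28; c++)
                v[r * 28 + c] = imageMatrix[r][c];
        return v;
    }
}
```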

Building our Classifier

The MNIST classifier we are going to build can be found at org.knowm.knowmj.classifier.mnist.

It contains the usual methods for training and testing that we have seen in the previous tutorials on classification.

Encoding

The first task in classifying images is converting the grayscale pixel information into the spike format required by our linear classifier. To do this, we’ve built an MnistSpikeEncoder for translating between grayscale image vectors and spike encodings.

There are two important features of image classification that pose issues for our normal spike encoding pipeline. First, the input dimensionality is quite large (in this case, 784) and, second, rotations and translations in the image frequently occur between digits of the same category. These problems call for an encoding method that reduces dimensionality and is invariant to translation. We highlight one simple approach in the following steps.

First, we introduce some spike-encoding invariance between translated digits of the same category by re-encoding each pixel from a variety of different reference points. This convolution is accomplished by moving a fixed window across the image and thresholding each value, creating a new set of smaller images, each representing a thresholded window of the original.

Secondly, we reduce the dimensionality of each patch feature through a set of decision trees. These decision trees use kT-RAM instructions to partition spike spaces and learn a basis set. Multiple methods exist: some are fast but less accurate, some are more accurate but slower, and some can integrate errors. Many methods we have yet to invent. Remember, kT-RAM is a computing substrate, not an algorithm.
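To illustrate just the partitioning idea, a depth-d binary tree can map a binary patch feature to one of 2^d leaves by reading one feature bit per level. The bit-selection rule below is a made-up stand-in for the kT-RAM instructions the real trees use, not the actual method:

```java
// Illustration of the partitioning idea only: descend a binary tree of the
// given depth, branching on one feature bit per level, and return the leaf
// index reached. (Bit-selection rule is a hypothetical stand-in.)
public class SpikeTreeSketch {

    public static int leafIndex(boolean[] feature, int depth) {
        int leaf = 0;
        for (int level = 0; level < depth; level++) {
            boolean bit = feature[level % feature.length]; // pick one bit per level
            leaf = (leaf << 1) | (bit ? 1 : 0);            // descend left or right
        }
        return leaf; // in [0, 2^depth)
    }
}
```

A depth-12 tree, as used later in this tutorial, thus yields at most 2^12 = 4096 leaves per tree.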

Encoding MNIST images is carried out in our spike encoder’s encode() method below.

Zooming in, we can see that the outer loop performs the progression across the image,

and the two inner loops apply a threshold to each pixel in the patch.
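Putting those two steps together, a minimal sketch of the windowed thresholding might look as follows. This is not the actual encode() implementation; the window, stride, and threshold values are illustrative:

```java
// Sketch of the convolution step: slide a window x window patch across the
// image with the given stride, thresholding each pixel to 0/1.
public class PatchEncoder {

    public static int[][][] extractPatches(int[][] image, int window,
                                           int stride, int threshold) {
        int n = image.length;
        int steps = (n - window) / stride + 1;
        int[][][] patches = new int[steps * steps][window][window];
        int p = 0;
        for (int r = 0; r + window <= n; r += stride) {      // progress across rows
            for (int c = 0; c + window <= n; c += stride) {  // progress across columns
                for (int i = 0; i < window; i++) {           // threshold each pixel
                    for (int j = 0; j < window; j++) {
                        patches[p][i][j] = image[r + i][c + j] >= threshold ? 1 : 0;
                    }
                }
                p++;
            }
        }
        return patches;
    }
}
```

On a 28 × 28 image with an 8 × 8 window and a stride of 4, this produces 6 × 6 = 36 thresholded patches.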

Next, we feed our patchFeatures through our set of decision trees, where additional bias spikes are attached.

Finally, a least-recently-used cache ensures our spike space doesn’t exceed the capacity of our chip and further reduces the spike space.
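A fixed-capacity least-recently-used cache can be sketched directly on top of Java’s LinkedHashMap with access-order eviction. This is only an illustration of the idea; the cache in the actual classifier may be implemented differently:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A fixed-capacity LRU cache: capping the number of live spike entries keeps
// the spike space within the chip's capacity (capacity value is illustrative).
public class SpikeLruCache extends LinkedHashMap<Integer, Integer> {
    private final int capacity;

    public SpikeLruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, so eldest = least recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
        return size() > capacity; // evict the LRU entry once over capacity
    }
}
```

With a capacity of 3, for example, inserting a fourth spike evicts the least recently used one.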

Training

We will be following the same training and testing loops we’ve seen in the previous tutorials. We set the number of training examples to 60,000 and train each epoch with the following method.
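The per-epoch loop can be sketched as follows. The SpikeEncoder and Classifier interfaces and their method names are illustrative placeholders, not the actual Knowm API:

```java
// Sketch of the per-epoch training loop: encode each image into spikes and
// perform one supervised update per example. Interfaces are hypothetical.
public class TrainingLoopSketch {

    public interface SpikeEncoder {
        int[] encode(int[][] image); // grayscale image -> spike encoding
    }

    public interface Classifier {
        void learn(int[] spikes, int label); // supervised parameter update
    }

    // Encode and learn from every training example once; returns the count.
    public static int trainEpoch(int[][][] images, int[] labels,
                                 SpikeEncoder encoder, Classifier classifier) {
        int trained = 0;
        for (int i = 0; i < images.length; i++) {
            int[] spikes = encoder.encode(images[i]);
            classifier.learn(spikes, labels[i]);
            trained++;
        }
        return trained;
    }
}
```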

Testing

We test on the remaining 10,000 MNIST images during the call to test(). Here we additionally maintain an mnistErrorEvaluator object for recording misclassified digits.
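A sketch of the corresponding test loop, classifying each held-out digit with a single best guess and recording misclassifications the way the error evaluator does (the Classifier interface and names are illustrative):

```java
import java.util.List;

// Sketch of the test loop: one predicted label per sample; count and record
// every misclassified digit. Interface and method names are hypothetical.
public class TestLoopSketch {

    public interface Classifier {
        int bestGuess(int[] spikes); // single predicted label per sample
    }

    public static int countErrors(int[][] spikeVectors, int[] labels,
                                  Classifier classifier, List<Integer> misclassified) {
        int errors = 0;
        for (int i = 0; i < spikeVectors.length; i++) {
            if (classifier.bestGuess(spikeVectors[i]) != labels[i]) {
                errors++;
                misclassified.add(i); // index of the falsely classified digit
            }
        }
        return errors;
    }
}
```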

Results

The above classifier was trained and tested using an 8-bit BYTE core over 10 training epochs. During testing, a BEST_GUESS_ABOVE_THRESHOLD evaluation technique was used to limit classification to a single label.

For our encoder, we convolved with an 8×8 window and used three decision trees of depth 12 for translation. Biases were pooled with a pool size of 8.

The chart below displays the performance results as we vary our evaluation threshold. We attain a best classification error rate of 1.5% over the full test set. Since each test digit receives a single predicted label, we evaluate error as $Error = 1 - \frac{TP}{N}$, where TP is the number of correctly classified digits (true positives) and N is the test set size. In other words, error equals the number of digits falsely classified divided by the size of the test set.
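As a quick numeric check of this definition (the counts below are illustrative, not the actual run’s tallies):

```java
// Error = 1 - TP / N: the fraction of the test set that is misclassified.
public class ErrorRate {

    public static double error(int correctlyClassified, int testSetSize) {
        return 1.0 - (double) correctlyClassified / testSetSize;
    }
}
```

For example, 9,850 correct out of 10,000 test digits gives an error rate of 1.5%.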

Knowm API – MNIST Performance

A comparative list of method performance on this task is displayed below. For a more complete list visit Yann LeCun’s page here.

Method                               Error
Committee of 35 CNNs                 0.23%
Large/Deep CNN                       0.27%
Committee of 25 NN 784-800-10        0.39%
K-nearest-neighbors                  0.52%
SVM                                  0.56%
kT-RAM Classifier, Byte Core         1.5%
NN Linear Classifier (1-layer NN)    7.6%

The following figure shows the error rate for the individual digits. We see that the toughest digit to recognize was “9”.

MNIST Performance Digit Error

The following figure shows all the individual digits that were falsely classified. For each digit, the truth label is annotated in the upper-right corner; the bottom-right corner annotations represent the first and second guesses of our classifier. Often the second guess is correct, but not always.

MNIST Digits Incorrectly Labeled

Secondary Metrics

As stated in Primary and Secondary Performance Metrics, most machine learning benchmark studies only report primary performance metrics. Here, we also report the secondary metrics from a run on a 2015 MacBook Pro Retina. Wattage is a rough estimate acquired from the iStat Menus app.

Measurement           Value
Power Consumption     50.5 Watts
Run Time              11,683 seconds
Volume                600 cubic centimeters

Discussion

In this article, we introduced a spike encoding method and linear classifier model built on top of kT-RAM. We used these modules for doing image classification on the canonical MNIST dataset and achieved a performance error rate of 1.5%.

In comparison, we showed that our method performs on par with other models of similar size on the same task. Power and speed statistics were also generated while the classifier was emulated on a generic laptop. We expect drastic gains in both categories when physical kT-RAM is available.

If you have any comments or questions please leave them in the comment section below!

References

[LeCun et al., 1998a]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11):2278-2324, November 1998. [on-line version]

