## Clustering KNN Encodings

In this article, we are going to use the partitioner to cluster data points which have been encoded using a K-Nearest Neighbors (KNN) encoder. First we outline how a KNN encoding is achieved before jumping into a number of examples.

## KNN Encoding

The method of k-nearest-neighbors encoding creates spike patterns for multi-dimensional points by assigning one active spike to a point's spike pattern for each of its k nearest points. Spike patterns for a point are thus relative to the distribution from which it is drawn.

In practice, we don’t use all the points from the distribution; instead, we define a fixed library size and choose the `k` nearest neighbors from this library.
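As a sketch of the idea (the class and method names here are ours, not the Knowm API's internals), the spike pattern for a point can be formed from the library indices of its `k` nearest points:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of a KNN spike encoder: the spike pattern for a point is
// the set of library indices of its k nearest library points.
public class KnnSpikeSketch {

  // Squared Euclidean (L2) distance between two vectors
  static float distanceSquared(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Return the indices of the k library points nearest to the query point
  public static int[] encode(float[] point, float[][] library, int k) {
    Integer[] indices = new Integer[library.length];
    for (int i = 0; i < indices.length; i++) indices[i] = i;
    Arrays.sort(indices, Comparator.comparingDouble(i -> distanceSquared(point, library[i])));
    int[] spikes = new int[k];
    for (int i = 0; i < k; i++) spikes[i] = indices[i];
    Arrays.sort(spikes); // report the spike pattern as a sorted index set
    return spikes;
  }

  public static void main(String[] args) {
    float[][] library = {{0f, 0f}, {1f, 0f}, {0f, 1f}, {5f, 5f}};
    int[] spikes = encode(new float[]{0.2f, 0.1f}, library, 2);
    System.out.println(Arrays.toString(spikes)); // the two nearest library indices
  }
}
```

Two points drawn near each other will share most of their nearest library points, so their spike patterns overlap heavily; this is what makes the encoding distribution-relative.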

A KNN encoding depends heavily on the distance measure we decide to use, so we should choose a measure that suits the data. A spatial measure, such as Euclidean distance, creates similar spike patterns for points which are spatially close together. For discrete variables, such as bags of words in text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, correlation coefficients such as Pearson and Spearman are common choices.
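To make these choices concrete, here is a minimal sketch of the common measures as plain Java functions (our own helper class, not the API's `Measure` implementation):

```java
// Sketch of common distance measures; the names here are illustrative only.
public class Measures {

  // L1 (Manhattan) distance: sum of absolute coordinate differences
  public static double l1(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
    return sum;
  }

  // L2 (Euclidean) distance: straight-line distance
  public static double l2(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
  }

  // Cosine distance: 1 - cos(angle between a and b); ignores magnitude
  public static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Hamming distance for discrete variables: number of positions that differ
  public static int hamming(int[] a, int[] b) {
    int count = 0;
    for (int i = 0; i < a.length; i++) if (a[i] != b[i]) count++;
    return count;
  }

  public static void main(String[] args) {
    double[] a = {0, 0};
    double[] b = {3, 4};
    System.out.println(l1(a, b)); // prints 7.0
    System.out.println(l2(a, b)); // prints 5.0
  }
}
```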

The Knowm API comes with a built-in KNN spike encoder which encodes floating-point vectors using `L1` (Manhattan), `L2` (Euclidean), or `Cosine` distance.

```java
public class KNNEncoder implements Encoder
```

We can instantiate this encoder by defining `k`, the `library_size`, and our choice of measure.

```java
// Instantiation
int k = 5;
int library_size = 100;
KNNEncoder knnEncoder = new KNNEncoder(k, library_size, Measure.L2);

...
// Encode first 100 points to fill library
...

// Encode
float[] point = new float[]{0.1f, 0.2f, 0.3f, 0.4f, ....};
int[] spikes = knnEncoder.encode(point);
```

## KNN Adaption

The KNN library stores the first `library_size` points that it encodes directly into the library. Afterward, each new value replaces its closest neighbor in the library.

```java
// Adapt library to distribution
library.get(0).center = value; // library is sorted by distance to value
```

When the distribution is stationary and there are few outliers, this causes the library points to fit the shape of the distribution. Otherwise, the library may fill up with uninformative points. Consider, for instance, a non-stationary distribution which moves linearly in one direction: the KNN library will continually replace the points closest to the distribution, while points farther away will remain and trail behind it.

As an initial implementation, these problems are remedied if the distribution is limited to a finite space and the library is sufficiently large to cover the positions the distribution moves through.

## KNN Clustering

The spike patterns returned by the KNN encoder do not yet represent clusters. To form clusters, we train our partitioner class on the KNN spike stream. Over time, the partitioner converges on regularities in the spike stream and outputs unique spike patterns for objects within the same cluster. These spike patterns can be converted directly into a cluster label by reading each spike pattern as a binary number.

```java
float[] dataPoint = new float[]{0.1f, 0.2f, 0.3f, ....., 0.n};
int[] knnSpikes = knnEncoder.encode(dataPoint);
int[] partitionerSpikes = partitioner.encode(knnSpikes);
int label = binaryEncoder.encode(partitionerSpikes)[0];
```

The above process is encapsulated by a `Float_Vector_KNN_Partitioner`. We can use its `encode()` method to create cluster labels as we encode data points from our distribution.

```java
KTRAM ktram = new KTRAM21D(DigitalCoreType.BYTE); // Create our kT-RAM object
int library_size = 100; // number of delegate neighbors
int k = 10; // number of neighbors
int numAHaHNodes = 10; // size of partitioner

Float_Vector_KNN_Partitioner clusterer =
    new Float_Vector_KNN_Partitioner(library_size, k, numAHaHNodes, ktram);
```

## Clustering App

We’ve built a generic clustering suite, `ClusteringPanel2d.java`, which we can use to build and test our clustering algorithms on two-dimensional data distributions. This wrapper is responsible for plotting and coloring our points. It extends a `JPanel` object which we can pass to our animation frame when we start the app.

The following class uses this wrapper to build and cluster a simple distribution:

```java
package org.knowm.knowmj.clusterer.visual.twod;

import java.util.ArrayList;
import java.util.List;

import org.knowm.knowmj.animation.AnimationFrame;
import org.knowm.knowmj.ktram.KTRAM;
import org.knowm.knowmj.ktram.digital.KTRAM21D;
import org.knowm.knowmj.ktram.digital.KTRAM21D.DigitalCoreType;
import org.knowm.knowmj.module.encoder.Encoder;
import org.knowm.knowmj.module.encoder._float.vector.Float_Vector_KNN_Partitioner;

public class MickyMouse2d extends ClusteringPanel2d {

  private final List<Blob2d> blobCenters = new ArrayList<Blob2d>();

  public static void main(String[] args) {
    new AnimationFrame(new MickyMouse2d());
  }

  /**
   * Constructor.
   * Here we artificially create our 2D data distribution with Blob2d objects.
   */
  public MickyMouse2d() {
    for (int i = 0; i < 5; i++) {
      blobCenters.add(new Blob2d(0, 0, .1f));
      blobCenters.add(new Blob2d(.5f, -.5f, .05f));
      blobCenters.add(new Blob2d(-.5f, -.5f, .05f));
    }
  }

  /*
   * Here we pass our distribution to our clustering panel
   */
  @Override
  public List<Blob2d> getBlobs() {
    return blobCenters;
  }

  /*
   * Here we create our clustering encoder and pass it off to
   * the clustering panel
   */
  @Override
  public Encoder getEncoder() {
    KTRAM ktram = new KTRAM21D(DigitalCoreType.BYTE); // kT-RAM object
    float sparsity = 0.2f; // how sparse our spike streams are
    int library_size = 256; // library size
    int k = (int) (library_size * sparsity); // number of chosen nearest neighbors, k
    int numberOfAHaHNodes = 10; // size of partitioner
    return new Float_Vector_KNN_Partitioner(library_size, k, numberOfAHaHNodes, ktram);
  }
}
```

This distribution is composed of three Gaussian clusters. Each is defined using a `Blob2d.java` class. We pass these objects to `ClusteringPanel2d` when `getBlobs()` is called.

```java
private final List<Blob2d> blobCenters = new ArrayList<Blob2d>();

....

/**
 * Constructor
 */
public MickyMouse2d() {
  for (int i = 0; i < 5; i++) {
    blobCenters.add(new Blob2d(0, 0, .1f));
    blobCenters.add(new Blob2d(.5f, -.5f, .05f));
    blobCenters.add(new Blob2d(-.5f, -.5f, .05f));
  }
}

// And pass them to the 2-D panel by overriding getBlobs
@Override
public List<Blob2d> getBlobs() {
  return blobCenters;
}
```

Each blob is instantiated by setting its frame-relative coordinates, with `x` in (-1, 1) and `y` in (-1, 1), plus an additional radius parameter. The radius acts as a bound on each Gaussian cluster’s variance.
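A minimal sketch of how such a blob might sample points, assuming the radius simply scales Gaussian noise around the center (`Blob2d`'s real internals may differ):

```java
import java.util.Random;

// Hypothetical sketch of a 2D Gaussian blob: a center in (-1, 1) x (-1, 1)
// and a radius used to scale Gaussian noise around that center.
public class BlobSketch {

  final float x, y, radius;
  final Random random = new Random();

  BlobSketch(float x, float y, float radius) {
    this.x = x;
    this.y = y;
    this.radius = radius;
  }

  // Sample a point near the center; radius bounds the spread of the cluster
  float[] sample() {
    return new float[]{
        x + radius * (float) random.nextGaussian(),
        y + radius * (float) random.nextGaussian()
    };
  }

  public static void main(String[] args) {
    BlobSketch blob = new BlobSketch(.5f, -.5f, .05f);
    float[] p = blob.sample();
    System.out.println(p[0] + ", " + p[1]); // a random point near (.5, -.5)
  }
}
```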

## Clustering

We pass our clustering encoder to the clustering panel by overriding the `getEncoder()` method.

```java
@Override
public Encoder getEncoder() {

  KTRAM ktram = new KTRAM21D(DigitalCoreType.BYTE); // kT-RAM object
  float sparsity = 0.2f; // how sparse our spike streams are
  int library_size = 256; // library size
  int k = (int) (library_size * sparsity); // number of chosen nearest neighbors, k
  int numberOfAHaHNodes = 10; // the number of AHaH nodes in our partitioner object

  return new Float_Vector_KNN_Partitioner(library_size, k, numberOfAHaHNodes, ktram);
}
```

## Results

Running the app creates our animation frame. Data points are pulled from our `blob` objects, given a cluster label by our `Float_Vector_KNN_Partitioner` object, and finally added to the animation frame. These points quickly form visual clusters as the partitioner adapts to the KNN regularities.

Notice that some of the values between the clusters are assigned unique colors. This is because these points have unique nearest-neighbor encodings, drawing some neighbors from each of the two clusters.

## Sparsity

We can reduce the number of unique clusters we see by increasing the number of neighbors we use to encode each example.

```java
float sparsity = 0.7f; // k = sparsity * library_size
```

Increasing the sparsity constant increases the number of common neighbors (`k`) we choose per spike pattern. If the sparsity were equal to 1.0, every point would have the same encoding: all library points. As we add more points to each encoding we create more commonality between spike encodings. This creates a clustering convergence: fewer unique clusters.
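This effect can be illustrated with a toy one-dimensional sketch (our own code, not the API's encoder): two nearby points share none of their neighbors at a small `k` but most of them at a large `k`.

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy illustration: the overlap between two points' KNN spike patterns grows with k.
public class SparsityOverlapSketch {

  // Spike pattern = indices of the k library values nearest to the point
  static int[] encode(float point, float[] library, int k) {
    Integer[] idx = new Integer[library.length];
    for (int i = 0; i < idx.length; i++) idx[i] = i;
    Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(library[i] - point)));
    int[] spikes = new int[k];
    for (int i = 0; i < k; i++) spikes[i] = idx[i];
    Arrays.sort(spikes);
    return spikes;
  }

  // Count the indices two spike patterns have in common
  static int overlap(int[] a, int[] b) {
    int count = 0;
    for (int i : a) for (int j : b) if (i == j) count++;
    return count;
  }

  public static void main(String[] args) {
    float[] library = {0f, 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
    // At small k the two points share no neighbors...
    int small = overlap(encode(1.9f, library, 2), encode(6.1f, library, 2));
    // ...at large k they share most of them (k = library_size gives total overlap)
    int large = overlap(encode(1.9f, library, 8), encode(6.1f, library, 8));
    System.out.println(small + " " + large); // prints "0 6"
  }
}
```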

If we decrease the sparsity to 0.1, our spike patterns will contain fewer common independent components. We see the opposite effect: a divergence.

Over time, the KNN encoder adapts its library to contain points which span the distribution. As this happens, a re-convergence to our usual groupings may occur.

However, if we reduce the sparsity too far, points within clusters rarely share common neighbors and we find complete divergence.

## Number of AHaH Nodes

Each AHaH node in our Partitioner object attempts to maximize the margin between opposing components in its spike stream. Each decision boundary has a binary output, so a one-node Partitioner will see two clusters.

This single AHaH node has found a margin which separates the KNN encodings of objects in the head cluster from everything else. Only by increasing the number of AHaH nodes is it possible to cluster all three regions. Specifically, a maximum of 2^n unique clusters is possible with n AHaH nodes. There is no guarantee we find 2^n unique clusters, since the attractor states are not always unique. In practice, the more AHaH nodes we use, the more likely we are to find all regularities in the data.
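A minimal sketch of this bound, assuming each AHaH node contributes one bit to the cluster label:

```java
// n AHaH nodes, each with a binary output, give at most 2^n distinct labels.
public class LabelSpaceSketch {

  public static int maxClusters(int numAHaHNodes) {
    return 1 << numAHaHNodes; // 2^n
  }

  public static void main(String[] args) {
    System.out.println(maxClusters(1));  // prints 2: one boundary, two clusters
    System.out.println(maxClusters(2));  // prints 4: enough for three regions
    System.out.println(maxClusters(10)); // prints 1024
  }
}
```

This is why one node cannot separate three regions: its label space has only two values, so at least two of the regions must share a label.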

## Using different KNN encodings

We could also change the way our KNN encodings measure the distance between objects. If we use cosine distance, for instance, then the similarity between points is measured by the angle between them.

```java
knnEncoder = new Float_Vector_KNN(library_size, k, Measure.Cosine);
```

This creates a clustering effect similar to spherical k-means.

## Many Clusters

We can increase the number of clusters and distribute them randomly across the panel by changing the `blob` objects when we instantiate the app. `StaticRandom2d.java` uses the following clusters:

```java
this.blobCenters = new ArrayList();
int numCenters = 25;
float blobSize = .05f;

for (int i = 0; i < numCenters; i++) {
  blobCenters.add(new Blob2d((float) (2 * Math.random() - 1),
      (float) (2 * Math.random() - 1),
      blobSize * (float) (2 * Math.random() - 1)));
}
```

Here we also increase the number of library points and the number of AHaH nodes to reflect the increased number of points on the screen.

```java
float sparsity = 0.025f; // how sparse our spike streams are
int library_size = 1000; // library size
int k = (int) (library_size * sparsity); // number of chosen nearest neighbors, k
int numberOfAHaHNodes = 300; // size of partitioner
```

Results:

## Non-Gaussian Distributions

Each `blob` object reflects a Gaussian distribution; however, by placing blobs at specific locations we can create a non-Gaussian distribution. In the app `CurveDot2d.java` we create a non-Gaussian distribution as follows.

```java
for (float x = -.85f; x <= .85f; x += 0.01f) {
  blobCenters.add(new Blob2d(-.25f, 0, .075f));
  blobCenters.add(new Blob2d(-x * x + .75f, x, .05f));
}
```

```java
float sparsity = 0.4f; // how sparse our spike streams are
int library_size = 256; // library size
int k = (int) (library_size * sparsity); // number of chosen nearest neighbors, k
int numberOfAHaHNodes = 24; // size of partitioner
```

Running this distribution:

## Dynamic Clustering

We’ve seen that clustering with the Knowm API is an adaptive procedure. We can harness this to build a dynamic clustering system. Our app `Sliding2d.java` implements this idea: here we implement the `Slideable` interface and an additional method, `getSlideRate()`.

```java
public class Sliding2d extends ClusteringPanel2d implements Slideable
```

Our app now implements a `getSlideRate()` method; this sets a lateral pixels-per-cycle movement for our distribution.

```java
@Override
public float getSlideRate() {
  return .00005f;
}
```

These additions add movement to our `blob` clusters.

In the following short video, the sliding distribution moves from one side of the screen to the other, briefly being eclipsed by the middle distribution. The library size has been increased to 1000 points here.

The next video shows a full cycle across the screen: as the distribution moves laterally, our groupings adapt, converging when close together and diverging when farther apart.

In the KNN Adaption section, we discussed how the method of library point adaptation can cause problems if the distribution is non-stationary and the library is not large enough to distribute points evenly. In the following video we have decreased the library size to 250 points.

In the first few cycles across the screen, the moving distribution has trouble remaining uniform at the far left side of the screen. This is caused by a bunching of library points at that location, created when we first instantiated our distribution. Here KNN encodings are highly diverse and we observe divergence. Over time, however, the continual return of the moving distribution causes the library to spread, and we eventually reach the desired result.

## Discussion of Results

A k-nearest-neighbors encoding scheme was paired with a Partitioner object and a binary encoder. Together these form a simple adaptive clustering object we called the `Float_Vector_KNN_Partitioner`.

This object was capable of clustering a number of distributions, highlighted in the demos `MickyMouse2d.java`, `CurveDot2d.java`, `StaticRandom2d.java`, and `Sliding2d.java`.

In this article, we explored the effect of the KNN encoding hyperparameters `k` and library size through a sparsity constant. Increasing or decreasing the number of active spikes was shown to cause clustering convergence or divergence, respectively.

A single AHaH node was used to demonstrate the effect of the size of our Partitioner object, letting us see a single decision boundary formed in the input space.

The demo `StaticRandom2d.java` showed that clustering multiple distribution centers of variable size and density is possible, although convergence and noisy colors appear in areas of uncertainty between clusters.

The demo `CurveDot2d.java` demonstrated non-uniform cluster groupings, showing both that (1) non-linear decision boundaries are possible in the input space and (2) non-uniform clusters can be formed by a KNN encoding.

The demo `Sliding2d.java` moved its cluster centers via a simple horizontal slide across the frame. The adaptation of the AHaH nodes to the changing distribution is visually apparent.

## Further Reading

TOC: Table of Contents

Previous: Clustering with the Knowm API

Next: Signal Prediction
