apache storm machine learning

Introduction

The Knowm Development Team needs access to affordable research and development platforms for distributed computing. To address this need, we have paired Adapteva’s powerful and affordable Parallella boards with Apache Storm to create an ARM-based Ubuntu cluster. We hope you find the below information useful in building your own clusters.

Apache Storm

Apache Storm is an open-source high-performance, distributed real-time computation framework for processing large streams of data for applications such as realtime analytics, online machine learning, continuous computation, distributed RPC and other big data jobs. In this How-To article, I show the steps I took to install Apache Storm on a four-node computer cluster consisting of one Zookeeper node, a Nimbus node and two supervisor nodes. In my particular case each host was a Adapteva Parallella running Linaro, a Linux distribution based on Ubuntu for ARM SoC boards. These instructions should work fine for any hardware running Linux Ubuntu or Debian-based Linux distros, with slight modifications to match particular needs and configurations. For restarting failed Zookeeper and Storm components, I wrapped them in Upstart services that respawn automatically when a failure occurs. The official Storm docs instruct to “Launch daemons under supervision using “storm” script and a supervisor of your choice”, and the most common way of doing that seems to be using supervisord. I found that using Upstart was much simpler to set up and use though mostly because I used it before and am comfortable with it. If this guide was useful to you or you have any questions or suggestions, I’d love to heard your comments down below!

Cluster Topology 1

In general, to run a clustered Apache Storm topology, you need a zookeeper node, a nimbus node and N worker nodes. It is possible to combine multiple components on a single node as well, and for that see further below: Cluster Topology 2.

Apache Storm Cluster

Apache Storm Cluster 1

Decide what setup you want and create a hosts file on each node such as:

Add it to all the nodes and make sure they can all talk to each other using ping or similar.

Cluster Topology 2

In the following cluster topology, Zookeeper, Nimbus and Storm UI are installed and running on a single machine.

Apache Storm Cluster

Apache Storm Cluster 2

A sample hosts looks like this:

Java

Java needs to be installed on all hosts. Repeat the instructions on all hosts in the cluster.

Install Java the Easy Way

If you are installing on a normal Ubuntu distribution on x86 architecture, the following will get you running Java 8 quickly.

Install Java the Hard Way

If you are installing on Linaro or an an ARM-based architecture, the above easy method may not work. In my case, I wanted the Oracle HotSpot JDK for ARM. Other downloads for different systems can be found here. Make sure to adjust the following code snippets to correspond to the latest Java release!

Followed by:

, and choose the desired installation.

Finally, check that Java was installed correctly:

Zookeeper Node

Setting up a ZooKeeper server in standalone mode is straightforward. The server is contained in a single JAR file.

By the way, if you run into problems with your Zookeeper installation, check out How To Debug an Apache Zookeeper Installation.

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load Storm places on Zookeeper is quite low. Single node Zookeeper clusters should be sufficient for most cases, but if you want failover or are deploying large Storm clusters you may want larger Zookeeper clusters.

A Few Notes About Zookeeper Deployment

It is critical that you run Zookeeper under supervision, since Zookeeper is fail-fast and will exit the process if it encounters any error case. The way I set up supervision is to run Zookeeperd, which automatically respawns if the service fails. It is also critical that you set up a cron to compact Zookeeper’s data and transaction logs. The Zookeeper daemon does not do this on its own, and if you don’t set up a cron, Zookeeper will quickly run out of disk space. See here for more details.

Install and Configure

Explanation of Parameters

tickTime
the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats and the minimum session timeout will be twice the tickTime.
dataDir
the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.
clientPort
the port to listen for client connections

The last two autopurge.* settings are very important for production systems. They instruct ZooKeeper to regularly remove (old) data and transaction logs. The default ZooKeeper configuration does not do this on its own, and if you do not set up regular purging with those parameters, ZooKeeper will quickly run out of disk space. Alternatively, you can set a cron job to run ‘PurgeTxnLog’, which does the same purging. I explain how to use that below…

Control

Test Restart of Service

Test Start of Service on Reboot

Apache Storm Nimbus, UI, Worker Nodes

On any node running either Nimbus, UI or as a worker, we install Storm and run the appropriate option: nimbus, ui or supervisor

Create storm user

Download and Unpack

Go to https://storm.apache.org/downloads.html and click on download link to find a valid mirror url

Configure for Cluster

Paste the following text, save and exit.

Nimbus

Test run:

Create Upstart Script for nimbus

Paste the following text, save and exit.

Create Upstart Script for ui

Paste the following text, save and exit.

Create Upstart Script for supervisor (Worker)

Paste the following text, save and exit.

Control for Nimbus, UI, Supervisor

Cloning

If you are cloning HD or SD cards, be sure to delete the storm-local directory before cloning. This will clear local data and prevent problems of Storm only registering one of the supervisor instances.

Run a Test Storm Topology

Running the storm jar command on the nimbus1 host we can run an example Storm topology that comes packed with the Storm installation. The storm-starter example topology is part of the main Storm project and the source code is available on GitHub. The following command will start the sample topology called ExclamationTopology with the name exclamation-topology provided as the first argument.

After a short delay you will see something like:

List Running Topologies

You can ask Storm to list the running topologies:

And you should see something like this:

View Running Topologies in Storm UI

Of course, since Storm-UI was installed on the nimbus1 host, we can view the deployment in the web browser at http://nimbus1:8080. For an explanation of the Storm-UI web interface, I recommend an excellent overview called “Reading and Understanding the Storm UI [Storm UI explained]” at malinga.me – http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/.

Apache Storm UI

Apache Storm UI

Stop a Running Topology

There are a few options for stopping a running Storm topology. You can either do it from the Storm-UI web interface or you can run:

That’s a Wrap

After struggling to understand how to deploy an Apache Storm cluster and struggling to find good instructions on how to do it, the above instructions document what I ended up doing. I will update this article as I come across any gotchas or important changes or additions. Did this article help you? What’s your favorite way to install and deploy a Storm cluster? Thanks for reading!

References

Related Posts

Subscribe To Our Newsletter

Join our low volume mailing list to receive the latest news and updates from our team.

4 Comments

    • anon
      reply

      Hi Tim,

      I tried this on ubuntu 14.04, and thanks first for writing this up. I’ve been looking for clear instructions for some time.

      First, I couldn’t cd into /var/lib/zookeeper/log and for some reason isn’t created.

      Second, when I try running the exclamation topo, I get:
      Exception in thread “main” org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused

      I tried looking up how to fix this, but am not sure where the problem lies. Did everything else and runs fine on each VM. Maybe a problem with logging on log4j? I saw some error about not initialized properly.

      Thank you.

      • Tim Molter
        reply

        I’ve seen that exception before when Nimbus wasn’t running or the IP address was not what I had set in the hosts file. The IP address can change if you’re using DHCP and they’re getting re-assigned after a reboot or similar situation. I’m not sure why the log directory isn’t being created.

Leave a Comment

Knowm 32X32 Crossbar

Knowm Newsletter

Are you ready for memristor AI processors? With our newsletter, you will be.