Spinning Up a Hadoop Cluster with Apache Ambari and Brooklyn
Deploying and managing a Hadoop stack can be a daunting task. A huge number of services exist within the Hadoop ecosystem, and deploying them so that they are correctly wired up to each other can be difficult and time-consuming. Apache Ambari aims to make this much easier, and at Cloudsoft we have started adding functionality to Apache Brooklyn that builds on what Ambari provides.
Apache Ambari aims to make it easy to provision, manage and monitor Apache Hadoop clusters.
Apache Brooklyn is an application blueprinting and management system which supports a wide range of software and services in the cloud.
By combining the two we can immediately simplify the deployment process further, especially when we want to deploy to a cloud. This post will walk through the process of spinning up a Hadoop cluster quickly and easily.
- To start with, you need an Apache Brooklyn installation. If you haven’t got one, there are instructions here.
- If Brooklyn is currently running, stop it (ctrl-c).
- Download brooklyn-ambari-0.1-SNAPSHOT.jar and copy it to <brooklyn_home>/lib/dropins/, which will make it available on the Brooklyn classpath.
- Launch Brooklyn (<brooklyn_home>/bin/brooklyn launch).
- Deploy Hadoop via Ambari.
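The copy step above can be wrapped in a small helper so that a mistyped path fails loudly instead of silently leaving Brooklyn without the plugin. This is only a sketch: `install_plugin` is a hypothetical name, and the dropins directory is passed in as an argument rather than assumed.

```shell
# Copy a plugin jar into a Brooklyn dropins directory.
# Usage: install_plugin /path/to/brooklyn-ambari-0.1-SNAPSHOT.jar /path/to/dropins
install_plugin() {
    jar="$1"
    dropins="$2"
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        return 1
    fi
    if [ ! -d "$dropins" ]; then
        echo "dropins directory not found: $dropins" >&2
        return 1
    fi
    # Brooklyn only picks up dropins at startup, so run this while it is stopped.
    cp "$jar" "$dropins"/ && echo "installed $(basename "$jar")"
}
```

After the helper reports success, start Brooklyn again with `<brooklyn_home>/bin/brooklyn launch`.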
Brooklyn uses YAML-based blueprints to describe and deploy applications. The following blueprint will deploy Ambari to a cluster of four machines (one Ambari server plus three agents) on AWS EC2. It will then deploy the listed services to those machines.
<tt>name: Ambari driven cloud installation
identity: <your aws identity>
credential: <your aws credential>
- type: io.brooklyn.ambari.AmbariCluster
securityGroup: <the name of a security group exposing 8080></tt>
Save this to a file called simple-ambari-hdp.yaml, editing the identity, credentials and security group. Putting all the VMs in the same security group will let all the services communicate internally, but you will at least need to set up an inbound rule for port 8080. If you’d rather deploy to a cloud provider other than AWS, see the locations documentation here.
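Since the blueprint ships with placeholder values, it is easy to deploy it with one of them still in place. A quick check before deploying might look like the sketch below. `check_blueprint` is a hypothetical helper, and the pattern only looks for placeholders worded like the ones in the listing above ("<your ...>" and "<the name ...>").

```shell
# Fail if the blueprint still contains template placeholders such as
# "<your aws identity>".
check_blueprint() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "blueprint not found: $file" >&2
        return 2
    fi
    # Print any offending lines (with line numbers) to stderr.
    if grep -n -E '<(your|the name)[^>]*>' "$file" >&2; then
        echo "unedited placeholders remain in $file" >&2
        return 1
    fi
    echo "blueprint looks edited: $file"
}
```

For example, `check_blueprint simple-ambari-hdp.yaml` returns a non-zero status and lists the offending lines if any placeholder is left.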
Aside: Although you can’t test Ambari using a localhost deployment, you could use a network of virtual machines. The simplest way to do this would be to use the Ambari-Vagrant project here. You could then treat these VMs as a BYON ("bring your own nodes") location within Brooklyn.
Assuming Brooklyn is deployed locally on port 8081, you can deploy the application with the following command:
<tt>curl http://127.0.0.1:8081/v1/applications --data-binary @/path/to/simple-ambari-hdp.yaml</tt>
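If you expect to run that command more than once, a thin wrapper can catch a wrong file path and surface HTTP errors. The sketch below makes the same request as the command above. `applications_url` and `deploy_blueprint` are hypothetical names, and the Brooklyn base URL is passed in, so use whatever scheme, host and port your own console listens on.

```shell
# Build the applications endpoint from a base URL, tolerating a trailing slash.
applications_url() {
    echo "${1%/}/v1/applications"
}

# POST a blueprint file to Brooklyn and print the response body.
# Usage: deploy_blueprint <brooklyn_base_url> /path/to/simple-ambari-hdp.yaml
deploy_blueprint() {
    base_url="$1"
    blueprint="$2"
    if [ ! -f "$blueprint" ]; then
        echo "blueprint not found: $blueprint" >&2
        return 1
    fi
    # --fail makes curl return non-zero on an HTTP error status.
    curl --fail --silent --show-error \
        --data-binary "@$blueprint" \
        "$(applications_url "$base_url")"
}
```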
Once Ambari has been installed and started, you should see something like the following screenshot:
The green traffic lights indicate the Ambari server and agents were installed without any problems.
Click on “Ambari Server” and then open the sensors tab near the top of the page.
Here you will see information about the VM instance and the Ambari server process. Find “main.uri” and hover over its row. A small arrow will appear. Hovering over the arrow will display the option “open,” which will open the Ambari server web interface in a new tab.
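The same sensor value can also be read without the web console. The sketch below assumes Brooklyn's REST API exposes sensors under `/v1/applications/<app>/entities/<entity>/sensors/<sensor>`, which is my understanding of the API and worth checking against your Brooklyn version. `sensor_url` and `read_sensor` are hypothetical helper names, and the application and entity IDs are ones you would look up yourself, for example in the console.

```shell
# Build the REST URL for a sensor on a given entity.
# Arguments: base URL, application ID, entity ID, sensor name.
sensor_url() {
    echo "${1%/}/v1/applications/$2/entities/$3/sensors/$4"
}

# Fetch the current value of a sensor (same arguments as sensor_url).
read_sensor() {
    curl --fail --silent --show-error "$(sensor_url "$@")"
}
```

With the Ambari server's entity ID to hand, `read_sensor <brooklyn_base_url> <app_id> <entity_id> main.uri` should print the address of the Ambari web interface.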
Log into the Ambari server web interface with the username/password admin/admin. You should be able to see that a cluster called Cluster1 has been created, and there will be one operation in progress. If you investigate the current operation, you’ll see that “Install and start all services” is in progress. This will take a few minutes, but eventually the dashboard should show that your Hadoop cluster is up and running.
You can stop your application, and terminate all its VMs, by navigating to the “effectors” tab of the main application and clicking the “stop” effector. This will ensure all resources are removed from AWS.
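The stop effector can be invoked from a script as well, which is handy for tearing down test clusters. This sketch assumes effectors live under `/v1/applications/<app>/entities/<entity>/effectors/<name>` in Brooklyn's REST API and accept a POST with an empty JSON body. Both are my assumptions, so verify them against your version. `effector_url` and `stop_application` are hypothetical names.

```shell
# Build the REST URL for an effector on a given entity.
# Arguments: base URL, application ID, entity ID, effector name.
effector_url() {
    echo "${1%/}/v1/applications/$2/entities/$3/effectors/$4"
}

# Invoke "stop" on the application itself.
# Arguments: base URL, application ID.
stop_application() {
    # The application is treated as an entity here, so its ID is used for
    # both the application and the entity path segments (an assumption).
    curl --fail --silent --show-error -X POST \
        -H "Content-Type: application/json" --data '{}' \
        "$(effector_url "$1" "$2" "$2" stop)"
}
```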
So far we have implemented only the simplest functionality. We hope it is already useful to people, but we still have a long list of features we would like to add.
Firstly, we need to extend the parameters available in our YAML description, for example by adding a parameter that allows an Ambari blueprint to be specified. We also need further parameters to allow us to wire up non-Ambari services, such as MySQL.
Similarly, we need to add in a range of sensors that make useful data available for us to implement policies against. For example, an HDFS capacity sensor would allow us to have a policy that automatically created extra VMs or added extra storage when headroom dropped below a specific threshold.
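To make the idea concrete, the decision such a policy would take is just a threshold test on headroom. The function below is purely illustrative: no such sensor or policy exists in the code described here, `needs_more_capacity` is a hypothetical name, and the inputs stand in for whatever an HDFS capacity sensor would eventually report.

```shell
# Succeed (exit 0) when free HDFS capacity has dropped below a minimum
# percentage, i.e. when a policy should add nodes or storage.
# Arguments: used capacity, total capacity (same integer units),
#            minimum headroom in percent.
needs_more_capacity() {
    used="$1"
    total="$2"
    min_headroom_pct="$3"
    [ "$total" -gt 0 ] || { echo "total must be positive" >&2; return 2; }
    # Integer arithmetic: percentage of capacity still free, rounded down.
    headroom_pct=$(( (total - used) * 100 / total ))
    [ "$headroom_pct" -lt "$min_headroom_pct" ]
}
```

For instance, with 90 units used out of 100 and a 20 percent minimum, the headroom is 10 percent and the check succeeds, so the policy would scale out.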
Effectors that allow us to change the size of the cluster would also be useful (and vital to support the above policies). This would make it easy to extend a cluster by adding extra nodes and instructing Ambari to rebalance services.
Hopefully this article was a useful demonstration of how to quickly set up a full Hadoop stack on a cluster of machines. To give it a try yourself, the latest code is available on GitHub. If you have any thoughts or questions, please visit us on IRC at #brooklyncentral or contact us via our mailing lists.
Cloudsoft is a sponsor of The New Stack.