Setting up a Hadoop cluster with Ansible

Hello There

After reading this blog, you will be able to configure a Hadoop cluster using Ansible.

First, let us get a quick overview of Hadoop and Ansible.

Hadoop

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Ansible

Ansible is a simple, agentless IT automation tool that almost anyone can use.

All we need to know for this tutorial is that there is a controller node (CN) on which Ansible is installed, and there are managed nodes (MNs) that we configure with Ansible.

For this tutorial, we will have a CN that sets up a NameNode and DataNodes (together called a cluster).

I assume you already have a CN, and some systems for the NameNode and DataNodes, up and running.
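Before writing any playbook, it is worth confirming that the CN can actually reach every managed node. A minimal check, assuming your inventory file is named hosts (as shown at the end of this post):

ansible all -i hosts -m ping

If every host replies with "pong", Ansible can log in over SSH and you are ready to proceed.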

Approach

Whenever we write an Ansible playbook, we should first write down the steps:

To configure the NameNode

  1. Install JDK and Hadoop software

  2. Create a directory ("/nn") to store the metadata

  3. Configure hdfs-site.xml and core-site.xml files

  4. Format the NameNode directory

  5. Start the NameNode

To configure the DataNode

  1. Install JDK and Hadoop software

  2. Create a directory ("/dn") to store the data

  3. Configure hdfs-site.xml and core-site.xml files

  4. Start the DataNode

Materializing the approach

Common Step

As we can see, the first step is common to configuring both the NameNode and the DataNode:

  • Installing JDK and Hadoop Software

This again breaks down into further steps:

Create a directory on the Managed Nodes to store the software (JDK and Hadoop):

- name: "Create a directory to store software"
  file:
    path: "/software"
    state: "directory"
    mode: "0755"

Transfer the software from the Controller Node to the Managed Nodes:

- name: "Copy software to the software directory"
  copy:
    src: "/software/{{ item }}"
    dest: "/software"
  loop: "{{ software }}"

Here, the src path refers to the Controller Node's filesystem, and software is a variable holding the list of JDK and Hadoop RPM file names.
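For example, the variable can be defined in the play's vars (these exact file names appear in the complete playbook at the end of this post):

vars:
  software:
    - "jdk-8u171-linux-x64.rpm"
    - "hadoop-1.2.1-1.x86_64.rpm"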

Install the software:

- name: "Install java and hadoop"
  command: "rpm -i /software/{{ item }} --force"
  loop: "{{ software }}"

This code works, but it is not idempotent: the rpm command reinstalls the packages on every run. We can make it idempotent by first checking whether Java and Hadoop are already installed:

- name: "Create a directory to store software"
  file:
    path: "/software"
    state: "directory"
    mode: "0755"

- name: "Copy software to the software directory"
  copy:
    src: "/software/{{ item }}"
    dest: "/software"
  loop: "{{ software }}"

- name: "Register jps command's output to 'jps' variable"
  command: "jps"
  register: jps
  ignore_errors: true

- name: "Register hadoop version command's output to 'hadoop' variable"
  command: "hadoop version"
  register: hadoop
  ignore_errors: true

- name: "Install java and hadoop"
  command: "rpm -i /software/{{ item }} --force"
  loop: "{{ software }}"
  when: jps.rc != 0 or hadoop.rc != 0
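Here, the command module records each command's exit status in rc. If jps or hadoop is not yet installed, the check fails, ignore_errors: true lets the play continue, and the non-zero rc triggers the install task. On a node where both are already installed, both checks return 0 and the install task is skipped.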

NameNode Steps

  • Create a directory ("/nn") to store the metadata:

- name: "Create a directory /nn"
  file:
    path: "/nn"
    state: "directory"
    mode: "0755"

  • Configure hdfs-site.xml and core-site.xml files

    First, we make a Jinja2 template for the core-site.xml file. Note that ens160 is the network interface on my NameNode; adjust it to match yours:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://{{ ansible_facts['ens160']['ipv4']['address'] }}:9001</value>
</property>

</configuration>
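If you would rather not hard-code the interface name, Ansible also gathers the default IPv4 address as a fact; a hedged alternative for the value line, assuming fact gathering is enabled:

<value>hdfs://{{ ansible_facts['default_ipv4']['address'] }}:9001</value>

This resolves to the address of the interface behind the node's default route, which is usually what you want.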

Then comes the hdfs-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>dfs.name.dir</name>
<value>/nn</value>
</property>

</configuration>
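Here, dfs.name.dir tells the NameNode to keep its metadata (the filesystem image and edit log) in the /nn directory we created above.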

Now we can copy these files:

- name: "Copy the configuration files"
  template:
    src: "/namenode/{{ item }}"
    dest: "/etc/hadoop/{{ item }}"
  loop: "{{ config_files }}"

Here, config_files is a list of the file names core-site.xml and hdfs-site.xml.

  • Format the NameNode directory:

- name: "Format the namenode"
  shell: "echo Y | hadoop namenode -format"

  • And finally, start the NameNode:

- name: "Start the namenode"
  command: "hadoop-daemon.sh start namenode"

DataNode Steps

  • Create a directory ("/dn") to store the data:

- name: "Create a directory /dn"
  file:
    path: "/dn"
    state: "directory"
    mode: "0755"

  • Configure hdfs-site.xml and core-site.xml files

hdfs-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>dfs.data.dir</name>
<value>/dn</value>
</property>

</configuration>

core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://{{ groups['namenode'][0] }}:9001</value>
</property>

</configuration>
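Note that groups['namenode'][0] resolves to the first host in the [namenode] group of the inventory file, so every DataNode points at the NameNode's IP without it being hard-coded.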

Transferring these templates:

- name: "Copy the configuration files"
  template:
    src: "/datanode/{{ item }}"
    dest: "/etc/hadoop/{{ item }}"
  loop: "{{ config_files }}"

Here, config_files is again the list of the file names core-site.xml and hdfs-site.xml.

  • Start the DataNode:

- name: "Start the datanode"
  command: "hadoop-daemon.sh start datanode"

Finally, putting all of these pieces together, we get the complete playbook:

- hosts: all

  vars:

    software:
      - "jdk-8u171-linux-x64.rpm"
      - "hadoop-1.2.1-1.x86_64.rpm"

  tasks:

    - name: "Create a directory to store software"
      file:
        path: "/software"
        state: "directory"
        mode: "0755"

    - name: "Copy software to the software directory"
      copy:
        src: "/software/{{ item }}"
        dest: "/software"
      loop: "{{ software }}"

    - name: "Register jps command's output to 'jps' variable"
      command: "jps"
      register: jps
      ignore_errors: true

    - name: "Register hadoop version command's output to 'hadoop' variable"
      command: "hadoop version"
      register: hadoop
      ignore_errors: true

    - name: "Install java and hadoop"
      command: "rpm -i /software/{{ item }} --force"
      loop: "{{ software }}"
      when: jps.rc != 0 or hadoop.rc != 0


- hosts: namenode

  vars:

    config_files:
      - "core-site.xml"
      - "hdfs-site.xml"

  tasks:

    - name: "Create a directory /nn"
      file:
        path: "/nn"
        state: "directory"
        mode: "0755"

    - name: "Copy the configuration files"
      template:
        src: "/namenode/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop: "{{ config_files }}"

    - name: "Format the namenode"
      shell: "echo Y | hadoop namenode -format"

    - name: "Start the namenode"
      command: "hadoop-daemon.sh start namenode"

- hosts: datanodes

  vars:

    config_files:
      - "core-site.xml"
      - "hdfs-site.xml"

  tasks:

    - name: "Create a directory /dn"
      file:
        path: "/dn"
        state: "directory"
        mode: "0755"

    - name: "Copy the configuration files"
      template:
        src: "/datanode/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop: "{{ config_files }}"

    - name: "Start the datanode"
      command: "hadoop-daemon.sh start datanode"

And the hosts (inventory) file looks like:

[namenode]
192.168.0.9 ansible_user=USERNAME ansible_ssh_pass=PASSWORD ansible_connection=ssh

[datanodes]
192.168.0.5 ansible_user=USERNAME ansible_ssh_pass=PASSWORD ansible_connection=ssh

You can also add more hosts to the datanodes group.
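With the playbook saved (say, as hadoop.yml; the file name is up to you) and the inventory in place, the whole cluster comes up with:

ansible-playbook -i hosts hadoop.yml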

Thank You!