Hadoop Cluster Setup/Configuration with HDFS Architecture

Hello and welcome! In this blog we will learn how to set up a Hadoop cluster on our local machine, using KVM on Ubuntu 18.04, and we will also walk through a hands-on HDFS practical. Before moving on to the hands-on part, let us first get a basic understanding of the technology.

What is Hadoop?



Hadoop is an open-source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel, more quickly.

Hadoop consists of four main modules:

  • Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
  • Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.
  • MapReduce – A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.
  • Hadoop Common – Provides common Java libraries that can be used across all modules.
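To build some intuition for the map and reduce phases described above, here is a tiny shell analogy (an illustration only, not actual Hadoop MapReduce): a classic word count where `tr` plays the mapper (emitting one key per line), `sort` plays the shuffle (grouping equal keys), and `uniq -c` plays the reducer (aggregating each key into a count).

```shell
# Word count, MapReduce-style, in plain shell (illustration only):
# map:     split the input into one word (key) per line
# shuffle: sort brings identical keys next to each other
# reduce:  uniq -c aggregates each key into a count
printf 'hadoop hdfs hadoop yarn' | tr ' ' '\n' | sort | uniq -c
```

In real Hadoop MapReduce, the same roles are played by mapper and reducer tasks running in parallel across the cluster's nodes.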
Prerequisites:

                  -   KVM installed on Ubuntu 18.04

                  -   RHEL 8.0 (ISO file/DVD)

Here, I will create a master-slave cluster, i.e. a NameNode with DataNodes, keeping the HDFS practical in mind. It will consist of a single NameNode (master node) and two DataNodes (slave nodes), although you can tweak this depending on your use case.

What is the NameNode (Master Node)?

             The NameNode is the centerpiece of the Hadoop Distributed File System. It maintains and manages the file system namespace and regulates clients' access permissions to files.

What is a DataNode (Slave Node)?

             DataNodes are the slave nodes in Hadoop HDFS, typically running on inexpensive commodity hardware. They store the actual blocks of each file.

Lab Setup:

    NameNode:

                         -   IP:       192.168.122.202

                         -   Hostname: NameNode.localdomain.com

    DataNode1:

                         -   IP:       192.168.122.100

                         -   Hostname: DataNode1.localdomain.com

    DataNode2:

                         -   IP:       192.168.122.16

                         -   Hostname: DataNode2.localdomain.com


Step-1: Downloading the Hadoop & JDK RPM packages:

               -   jdk-8u171-linux-x64.rpm

               -   hadoop-1.2.1-1.x86_64.rpm

Click the URL below to download the Hadoop & JDK packages:

 URL: https://drive.google.com/drive/folders/1zXxx3dlVxU9OilS_GMHjEDNaC7uMsrWS

Step-2: Install JDK & Hadoop on the NameNode (Master Node):

  First, we will install the JDK, as it is a prerequisite for installing Hadoop.

 | # rpm -ivh jdk-8u171-linux-x64.rpm
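Once the rpm finishes, a quick sanity check confirms the JDK is available (the exact version string will vary by build):

```shell
# confirm the java binary is on the PATH and print its version
java -version
```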

Once the JDK is successfully installed, we will install Hadoop using the rpm command.

| # rpm -ivh hadoop-1.2.1-1.x86_64.rpm
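Similarly, you can confirm that Hadoop installed correctly and check which version is on the system:

```shell
# print the installed Hadoop version (1.2.1 in this setup)
hadoop version
```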




Step-3: Setting Up/Configuring the NameNode (Master Node):
             To configure the NameNode for HDFS, we will create a directory under the root (/) drive whose storage is to be distributed. Once the directory is created, we will configure hdfs-site.xml to tell the node to act as the NameNode.
Open /etc/hadoop/hdfs-site.xml (e.g. with vim) and add the lines below.
<property>
<name>dfs.name.dir</name>
<value>/namenode</value>
</property>
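The directory named in the value tag must exist before the NameNode is formatted. Assuming the /namenode path from the property above:

```shell
# create the NameNode metadata directory referenced by dfs.name.dir
mkdir /namenode
```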






In hdfs-site.xml, we have added a property that tells the node it is the NameNode and is being configured for DFS (Distributed File System). The name tag holds the keyword for the DFS NameNode directory (dfs.name.dir), and the value tag holds the directory we created earlier to be used by DFS.
 

Once this configuration is done, we will move on to the core-site.xml file. This file is responsible for the networking side of the NameNode: it specifies the NameNode's IP address and port number.
Open /etc/hadoop/core-site.xml and add the lines below.
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.122.202:9001</value>
</property>
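Note that in a complete core-site.xml, property tags like this one sit inside the file's <configuration> element, roughly as follows (IP and port as configured above):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.122.202:9001</value>
  </property>
</configuration>
```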







In core-site.xml, we have added a property that tells the NameNode it is being configured for DFS (Distributed File System). The name tag holds the keyword for the FS NameNode (fs.default.name), and the value tag holds the NameNode's IP and a valid, unused port (here, port 9001), along with the protocol (hdfs) it will use.





Step-4: Formatting the NameNode. Basically, formatting the NameNode does the following:

  • Fsimage files contain the filesystem image, which is essentially the metadata held in your NameNode. Edit logs are the files that record recent changes to the file system, and they are later merged into the fsimage.
  • Formatting the NameNode deletes the information in the NameNode directory. The NameNode directory is specified by the dfs.name.dir property in the hdfs-site.xml file.
  • Formatting the file system means initializing the directory specified by the dfs.name.dir variable.
  • Never format an up-and-running Hadoop filesystem; you will lose all the data stored in HDFS.
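With the caveats above in mind, the NameNode is formatted once, before its first start:

```shell
# initialize the dfs.name.dir directory (/namenode) -- run only once,
# on the NameNode, and never on a cluster that already holds data
hadoop namenode -format
```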




Step-5: Install JDK & Hadoop on the DataNodes (Slave Nodes):
                    In this step we repeat on both DataNodes what we did on the NameNode: first install the JDK and then Hadoop. Keep in mind that this has to be done on each DataNode.
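On each DataNode, that means running the same two rpm commands as before:

```shell
# on DataNode1 and DataNode2: install the JDK first, then Hadoop
rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm
```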



Step-6: Configuring the Slave Nodes (DataNodes):
         To configure a DataNode for HDFS, we will create a directory under the root (/) drive whose storage will be contributed to the distributed file system. Once the directory is created, we will configure hdfs-site.xml to tell the node to act as a DataNode.


In hdfs-site.xml, we have added a property that tells the node it is a DataNode and is being configured for DFS (Distributed File System). The name tag holds the keyword for the DFS DataNode directory (dfs.data.dir), and the value tag holds the directory we created earlier whose storage is contributed to DFS.
Open /etc/hadoop/hdfs-site.xml and add the lines below.
<property>
<name>dfs.data.dir</name>
<value>/datanode</value>
</property>
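As on the NameNode, the directory named in the value tag must exist before the daemon starts; here assuming /datanode as the storage directory:

```shell
# create the local storage directory this DataNode contributes to HDFS
mkdir /datanode
```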



Once this configuration is done, we will move on to the core-site.xml file. This file is responsible for the networking side of the DataNode: it tells the DataNode the NameNode's IP address and port number so that it can contact the NameNode.



In core-site.xml, we have added a property that tells the DataNode it is being configured for DFS (Distributed File System). The name tag holds the keyword for the FS NameNode (fs.default.name), and the value tag holds the NameNode's IP and port, so the DataNode connects to the right address, along with the protocol (hdfs) it will use to connect.
Open /etc/hadoop/core-site.xml and add the lines below.
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.122.202:9001</value>
</property>


[Note: Do the same on DataNode2.]

Step-7: Starting the Daemon/Service on the NameNode (Master Node):
        Once all the configuration is done, we will start the Hadoop service. Before starting, we use the "jps" command to check the running Java processes; initially there are none. We also use the netstat command to check whether any process is listening on port 9001.
   Once we start the Hadoop service, we will see a Java process named NameNode running, and port 9001 will be listening. Keep in mind that for the port to accept connections we need to open it in the firewall. To avoid this we could also disable the firewall (not recommended).

| # hadoop-daemon.sh start namenode
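After starting the daemon, you can repeat the checks described above. The firewall rule shown assumes RHEL 8's firewalld; adjust if your setup differs:

```shell
# confirm a NameNode java process is now running
jps
# confirm something is listening on port 9001
netstat -tnlp | grep 9001
# open the HDFS port in the firewall instead of disabling the firewall
firewall-cmd --permanent --add-port=9001/tcp
firewall-cmd --reload
```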



Step-8: Starting the Daemon/Service on the DataNodes (Slave Nodes):
           After completing the configuration on the DataNodes and starting the service on the NameNode, we will start the Hadoop service on each DataNode. Before starting, we use the "jps" command to check the running Java processes; initially there are none.
Once we start the Hadoop service, we will see a Java process named DataNode running. If we see this, it means our DataNode is configured correctly and is connected to the NameNode.
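The DataNode daemon is started the same way as the NameNode's, on each slave node:

```shell
# on DataNode1 and DataNode2
hadoop-daemon.sh start datanode
# verify: jps should now show a DataNode process
jps
```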



Step-9: Go to the NameNode and run the command below.
| # hadoop dfsadmin -report

After all this setup, we check the distributed storage with the "hadoop dfsadmin -report" command. This gives us a complete report of all the connected DataNodes and the storage each one contributes to the cluster.
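With the cluster reporting healthy DataNodes, a few basic HDFS commands illustrate the distributed storage in action (the file names here are just examples):

```shell
# copy a local file into HDFS; its blocks are stored on the DataNodes
hadoop fs -put /etc/hosts /hosts-copy
# list the HDFS root to confirm the upload
hadoop fs -ls /
# read the file back from the distributed store
hadoop fs -cat /hosts-copy
```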



So this was a small example of a Distributed File System: the storage of the slave nodes (DataNodes) is contributed collectively to the cluster managed by the NameNode.

Connect with me on LinkedIn: https://www.linkedin.com/in/abdhesh-kumar-656b73179/

              Written By: Abdhesh Pratap Singh     🙏🙏Thank You 🙏🙏🙏🙏
