How to install and use Apache Hadoop on an Ubuntu 20.04 cloud server


Apache Hadoop is an open-source software framework for storing, managing, and processing large data sets, typically deployed across clusters of machines. It is written in Java, uses the HDFS file system for data storage, and provides MapReduce as its data-processing model. This article outlines the steps to install and configure Apache Hadoop on an Ubuntu 20.04 cloud server.

Installing Java

Update the package index, then install Java. On Ubuntu 20.04 the default-jdk package provides OpenJDK 11, which Hadoop 3.3 supports.

$ sudo apt update
$ sudo apt install default-jdk default-jre -y

Verify the installed Java version.

$ java -version

Creating the Hadoop User

Create a dedicated hadoop user and add it to the sudo group.

$ sudo adduser hadoop
$ sudo usermod -aG sudo hadoop
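
Optionally, confirm that the new user is a member of the sudo group.

$ groups hadoop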

Install SSH and configure passwordless login.

$ sudo apt install openssh-server openssh-client -y
$ sudo su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys

Verify that passwordless SSH login works.

$ ssh localhost
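
If the login succeeds without asking for a password, exit the nested SSH session before continuing.

$ exit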

Installing Hadoop

Log in as the newly created hadoop user and download Hadoop. Check the Apache Hadoop official website for the current version number; this article uses version 3.3.1 as an example.

$ sudo su - hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
$ tar -xvzf hadoop-3.3.1.tar.gz
$ sudo mv hadoop-3.3.1 /usr/local/hadoop
$ sudo mkdir /usr/local/hadoop/logs
$ sudo chown -R hadoop:hadoop /usr/local/hadoop

Set the required environment variables for Hadoop.

$ nano ~/.bashrc

Add the following code snippet at the end of the file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Activate the environment variables.

$ source ~/.bashrc
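
As an optional sanity check, confirm that the variables are loaded and the Hadoop binaries are on the PATH.

$ echo $HADOOP_HOME
$ which hadoop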

Configuring Java Environment Variables

Hadoop's core components (HDFS, YARN, and MapReduce) all run on the Java Virtual Machine, so the Java installation path must be defined in the hadoop-env.sh configuration file.

Check the Java installation path.

$ which javac

Find the OpenJDK installation directory; JAVA_HOME is this path with the trailing /bin/javac removed.

$ readlink -f /usr/bin/javac

Edit the hadoop-env.sh file.

$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add the following code snippet at the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Navigate to Hadoop's lib directory and download the javax activation library. (Note that the JCenter repository has been retired; if the URL below is unreachable, the same artifact is also published on Maven Central at https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar.)

$ cd /usr/local/hadoop/lib
$ wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar

Verify the Hadoop version.

$ hadoop version

Edit the core-site.xml configuration file.

$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following property to specify the default file system URI of the NameNode (fs.defaultFS replaces the deprecated fs.default.name key).

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://0.0.0.0:9000</value>
      <description>The default file system URI</description>
   </property>
</configuration>

Create directories for the NameNode and DataNode data, and change their ownership to the hadoop user.

$ sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
$ sudo chown -R hadoop:hadoop /home/hadoop/hdfs

Edit the hdfs-site.xml configuration file.

$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following properties to set the replication factor and the NameNode and DataNode storage directories (dfs.namenode.name.dir and dfs.datanode.data.dir replace the deprecated dfs.name.dir and dfs.data.dir keys). A replication factor of 1 is appropriate for a single-node setup.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hdfs/datanode</value>
   </property>
</configuration>
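
Optionally, confirm that Hadoop picks up the new value (this assumes the environment variables set earlier are loaded).

$ hdfs getconf -confKey dfs.replication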

Edit the mapred-site.xml configuration file.

$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following property to set YARN as the MapReduce execution framework.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Edit the yarn-site.xml configuration file.

$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following property to enable the MapReduce shuffle auxiliary service in the NodeManager.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

Log in as the hadoop user and format the HDFS NameNode to initialize the file system metadata.

$ sudo su - hadoop
$ hdfs namenode -format

Starting the Hadoop Cluster

Start the HDFS daemons (NameNode and DataNodes), then start YARN.

$ start-dfs.sh
$ start-yarn.sh

Verify all running components.

$ jps
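
If everything started correctly, the listing should include the following processes (process IDs will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps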

Additionally, you can open the Hadoop NameNode web interface by visiting http://IP_address:9870 in a web browser; the YARN ResourceManager interface is available at http://IP_address:8088.

At this point, you have successfully installed Apache Hadoop and can carry out further configuration through the web interfaces.
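
As a final smoke test, you can run one of the MapReduce examples bundled with the distribution (a minimal check that assumes the 3.3.1 tarball installed above; adjust the jar version if you installed a different release).

$ hdfs dfs -mkdir -p /user/hadoop
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar pi 2 4

The job estimates the value of Pi and prints the result to the terminal when it completes.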