Apache Hadoop is an open-source software framework for storing, managing, and processing big data workloads, typically on clusters of machines. Written in Java, it uses the Hadoop Distributed File System (HDFS) for storage and MapReduce as its data processing engine. This article outlines the steps to install and configure Apache Hadoop on an Ubuntu 20.04 cloud server.
Execute the following command to install Java; on Ubuntu 20.04, the default-jdk and default-jre packages install OpenJDK 11.
$ sudo apt install default-jdk default-jre -y
Verify the installed Java version.
$ java -version
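If the installation succeeded, the output should resemble the following; Ubuntu 20.04 ships OpenJDK 11, and the exact build numbers will differ:
openjdk version "11.0.x" 20xx-xx-xx
OpenJDK Runtime Environment (build 11.0.x+...)
OpenJDK 64-Bit Server VM (build 11.0.x+..., mixed mode, sharing)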
Create the hadoop user and add it to the sudo group.
$ sudo adduser hadoop
$ sudo usermod -aG sudo hadoop
Install SSH, then switch to the hadoop user and configure passwordless login.
$ sudo apt install openssh-server openssh-client -y
$ sudo su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
Verify that passwordless login works.
$ ssh localhost
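To rule out a silent fallback to password authentication, you can also test in batch mode, which fails rather than prompting:
$ ssh -o BatchMode=yes localhost exit && echo OK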
Log in as the newly created hadoop user and download Hadoop. Current version numbers are listed on the official Apache Hadoop website; this article uses version 3.3.1 as an example.
$ sudo su - hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
$ tar -xvzf hadoop-3.3.1.tar.gz
$ sudo mv hadoop-3.3.1 /usr/local/hadoop
$ sudo mkdir /usr/local/hadoop/logs
$ sudo chown -R hadoop:hadoop /usr/local/hadoop
Set the required environment variables for Hadoop.
$ nano ~/.bashrc
Add the following code snippet at the end of the file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Activate the environment variables.
$ source ~/.bashrc
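As a quick sanity check, confirm that the variables resolved correctly before continuing:
$ echo $HADOOP_HOME
$ which hadoop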
Hadoop relies on several components to deliver its core functionality. To configure the YARN, HDFS, and MapReduce components, the Java installation path must be defined in the hadoop-env.sh configuration file.
Check the Java installation path.
$ which javac
Find the OpenJDK directory.
$ readlink -f /usr/bin/javac
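JAVA_HOME should point to the directory two levels above the javac binary; assuming the path reported above, it can be derived in one step:
$ dirname $(dirname $(readlink -f /usr/bin/javac))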
Edit the hadoop-env.sh file.
$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the following code snippet at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
Navigate to Hadoop's lib directory and download the javax.activation API jar, which Hadoop needs on Java 11 because the JDK no longer bundles these classes. The JCenter (Bintray) repository has been shut down, so download it from Maven Central instead.
$ cd /usr/local/hadoop/lib
$ wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
Verify the Hadoop version.
$ hadoop version
Edit the core-site.xml configuration file.
$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following code to set the default file system URI for the NameNode. (The fs.defaultFS key replaces fs.default.name, which is deprecated in Hadoop 3.)
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://0.0.0.0:9000</value>
    <description>The default file system URI</description>
  </property>
</configuration>
Create directories to store the NameNode and DataNode data and change their ownership to hadoop.
$ sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
$ sudo chown -R hadoop:hadoop /home/hadoop/hdfs
Edit the hdfs-site.xml configuration file.
$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following code to set the replication factor and the NameNode and DataNode storage locations. A replication factor of 1 is appropriate for a single-node setup. (The dfs.namenode.name.dir and dfs.datanode.data.dir keys replace the deprecated dfs.name.dir and dfs.data.dir.)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
Edit the mapred-site.xml configuration file.
$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following code so that MapReduce jobs run on YARN.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
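Note that under Hadoop 3, the bundled example jobs can fail on YARN with class-not-found errors unless the MapReduce classpath is also declared. The official single-node setup guide adds a property along these lines inside the same <configuration> block; treat it as a starting point rather than a definitive value:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>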
Edit the yarn-site.xml configuration file.
$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following code to define YARN-related parameters.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
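If you added the MapReduce classpath property above, the official guide pairs it with an environment whitelist here so that YARN containers inherit the Hadoop variables; it goes inside the same <configuration> block:
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>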
Log in as the hadoop user and format the HDFS NameNode; this initializes the metadata directories defined earlier and only needs to be done once.
$ sudo su - hadoop
$ hdfs namenode -format
Start HDFS (the NameNode and DataNode) and then YARN.
$ start-dfs.sh
$ start-yarn.sh
Verify that all components are running.
$ jps
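On a healthy single-node installation, the list should include processes similar to the following (the numeric IDs will differ):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps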
Additionally, you can reach the NameNode web management interface at http://IP_address:9870 and the YARN ResourceManager interface at http://IP_address:8088 through a web browser.
At this point, you have successfully installed Apache Hadoop and can proceed with finer configuration through the management interface. To confirm that the cluster processes jobs end to end, run the smoke test below.
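The following commands mirror the official single-node guide: they copy the Hadoop configuration files into HDFS, run the grep example job bundled with the distribution, and print the result. The jar path assumes version 3.3.1; adjust it if you installed a different release.
$ hdfs dfs -mkdir -p /user/hadoop/input
$ hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep /user/hadoop/input output 'dfs[a-z.]+'
$ hdfs dfs -cat output/*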