Apache Hadoop is an open-source framework for storing and processing big data. It has become the industry-standard framework for big data. Hadoop is designed to run on distributed systems with hundreds or even thousands of clustered computers or dedicated servers, which allows it to handle large, complex datasets of both structured and unstructured data.
Every Hadoop deployment contains the following components:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
In this video, we will install the latest version of Apache Hadoop on an Ubuntu 22.04 server. We install Hadoop on a single-node server and set it up as a pseudo-distributed deployment.
Useful Links:
VPS/VDS - https://www.mivocloud.com/
Hadoop - https://hadoop.apache.org/
Commands Used:
sudo apt install default-jdk
java -version
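Optional check, not part of the original steps: to confirm which JDK was installed (on Ubuntu 22.04, default-jdk installs OpenJDK 11) and to locate the path you will later need for JAVA_HOME, run:
readlink -f $(which java)
The printed path, without the trailing /bin/java, is the JAVA_HOME value used below.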
sudo apt install openssh-server openssh-client pdsh
sudo useradd -m -s /bin/bash hadoop
sudo passwd hadoop
sudo usermod -aG sudo hadoop
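Optional check, not part of the original steps: confirm the hadoop user exists and is in the sudo group:
id hadoop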
su - hadoop
ssh-keygen -t rsa
ls ~/.ssh/
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
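If the key setup worked, ssh localhost should log you in without asking for a password; accept the host key fingerprint on the first connection, then return to your original shell:
exit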
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xvzf hadoop-3.3.4.tar.gz
sudo mv hadoop-3.3.4 /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
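Optional check: confirm the hadoop user now owns the installation directory:
ls -ld /usr/local/hadoop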
nano ~/.bashrc
# Hadoop environment variables
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
source ~/.bashrc
echo $JAVA_HOME
echo $HADOOP_HOME
echo $HADOOP_OPTS
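If the variables were picked up, these echo commands print the values set above: /usr/lib/jvm/java-11-openjdk-amd64, /usr/local/hadoop, and -Djava.library.path=/usr/local/hadoop/lib/native. If any of them prints an empty line, re-check ~/.bashrc and run source ~/.bashrc again.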
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
hadoop version
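The hadoop version command should report the release unpacked above; its first line of output should read Hadoop 3.3.4.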
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://IP:9000</value>
  </property>
</configuration>
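Replace IP with your server's public or private IP address; for a purely local single-node setup you can also use localhost, i.e. hdfs://localhost:9000.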
sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
sudo chown -R hadoop:hadoop /home/hadoop/hdfs
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
hdfs namenode -format
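On success, the format log should end with a message stating that the storage directory /home/hadoop/hdfs/namenode has been successfully formatted; if you instead see permission errors, re-check the chown step above.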
start-dfs.sh
IF YOU HAVE AN ERROR (typically start-dfs.sh failing with a pdsh connection error):
sudo apt-get remove pdsh
start-dfs.sh
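An alternative to removing pdsh is telling it to use ssh instead of its default rsh backend; add this line to ~/.bashrc alongside the other exports:
export PDSH_RCMD_TYPE=ssh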
IF IT DOESN'T HELP
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
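After regenerating the key, also make sure the permissions on the file stay restrictive, otherwise sshd will ignore it:
chmod 600 ~/.ssh/authorized_keys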
MAPREDUCE AND YARN CONFIGURATION
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
start-yarn.sh
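To verify the cluster, list the running Java daemons; in this pseudo-distributed setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:
jps
You can also open the web interfaces: the HDFS NameNode UI at http://IP:9870 and the YARN ResourceManager UI at http://IP:8088 (replace IP with your server's address).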