APACHE HADOOP INSTALLATION AND CONFIGURATION STEP-BY-STEP
Note: I am using CentOS 7 for this entire tutorial (I will mention the OS flavor wherever it differs).
- - -
Prerequisites-1 (Environment):
1) Laptop or desktop (with min. 4 GB of RAM and 150 GB of HDD)
2) Linux OS ISO (CentOS is preferred because it is relatively lightweight.)
(I suggest installing CentOS with a GUI as the host OS instead of Windows 7/8/10.)
3) KVM (Kernel-based Virtual Machine) to install guest OSes.
Explanation:
If your laptop runs Linux, you can install Hadoop on it in pseudo-distributed mode. For a multi-node setup, you can use the laptop as the NameNode and the guests installed under KVM as DataNodes.
Modes of Installation:
1. Fully distributed mode 2. Pseudo-distributed mode 3. Standalone mode
1. Fully distributed mode:
Each service runs as its own daemon, and the daemons are spread across multiple servers.
2. Pseudo-distributed mode:
Can be implemented on a single server. All daemons (such as the DataNode, NameNode and ResourceManager processes) run as separate JVMs on the same machine.
3. Standalone mode:
Hadoop runs as a single Java process with no daemons and uses the local filesystem instead of HDFS. This is the default mode and is mainly useful for debugging.
Prerequisites-2: Modifying kernel parameters of the OS (useful for all modes of installation)
Leaving the default kernel parameters unchanged is not recommended when you are implementing a Hadoop cluster. The following steps show how to change Linux kernel parameters for better Hadoop cluster performance.
1. Changing disk mount parameters in the '/etc/fstab' file:
Linux keeps three timestamps for each file on its filesystem: modified time (mtime), change time (ctime), and access time (atime).
The 'stat' command shows all three timestamps. Ex: $ stat /bin/ls
- The 'noatime' option stops the kernel from writing the access time to disk every time a file is read.
- The 'nodiratime' option does the same for directories only; regular files still get their access times written.
Note: 'noatime' implies 'nodiratime', so there is no need to specify both.
Example entries in /etc/fstab:
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0
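As a rough sketch of how to confirm that the option is really set, the helper below scans an fstab-style file and prints the mount point of every entry whose options include 'noatime'. The function name list_noatime_mounts is made up for illustration; it only reads the file and changes nothing.

```shell
# list_noatime_mounts [FILE]
# Prints the mount point (2nd column) of every entry in an fstab-style
# FILE whose option field (4th column) contains "noatime".
# Hypothetical helper name; FILE defaults to /etc/fstab.
list_noatime_mounts() {
    awk '$1 ~ /^#/ || NF == 0 { next }   # skip comments and blank lines
         {
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] == "noatime") { print $2; break }
         }' "${1:-/etc/fstab}"
}

# Usage (read-only):
#   list_noatime_mounts /etc/fstab
# To apply the option to a mounted filesystem without rebooting (as root):
#   mount -o remount,noatime /data1
```

The remount line assumes the /data1 mount point from the example entries above.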
2. Increasing the file limits:
The command below shows the system-wide maximum number of file handles the kernel will allocate.
# cat /proc/sys/fs/file-max
You can check the per-process soft limit and hard limit on open files with the commands below.
# ulimit -Sn
# ulimit -Hn
You can raise these limits for individual users by editing the file below.
# vim /etc/security/limits.conf
The system-wide maximum is a kernel setting (fs.file-max) and can be changed in /etc/sysctl.conf. Once you change kernel settings there, apply them with the following command.
# sysctl -p
# sysctl -a (to see all kernel settings)
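For the per-user limits, each entry in /etc/security/limits.conf has four fields: domain, type, item and value. The sketch below generates the two 'nofile' lines for one user. The helper name nofile_limit_lines, the user 'hadoop' and the numbers are placeholders chosen for illustration, not values recommended by this tutorial.

```shell
# nofile_limit_lines USER SOFT HARD
# Prints limits.conf entries (domain, type, item, value) that set the
# soft and hard open-file limits for USER. Illustrative helper name.
nofile_limit_lines() {
    printf '%s soft nofile %s\n' "$1" "$2"
    printf '%s hard nofile %s\n' "$1" "$3"
}

# Usage (as root); user name and values are placeholders:
#   nofile_limit_lines hadoop 32768 65536 >> /etc/security/limits.conf
# Log in again as that user, then verify:
#   ulimit -Sn
#   ulimit -Hn
```

The new limits apply only to sessions started after the change, which is why the check is done from a fresh login.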
3. BIOS settings: make sure IDE emulation is not enabled (the system admin / data center technician takes care of this).
4. Network time: make sure all servers in the cluster synchronize with the same NTP server. This is critical for services like Kerberos and ZooKeeper, and for consistent timestamps in log files.
5. NIC bonding: make sure every server has 2 NICs bonded into a single logical interface (for redundancy).
6. VLAN: it is better to keep the cluster in a separate VLAN.
7. DNS: make sure all servers in the cluster are correctly configured to use a local DNS server for host name resolution.
8. Network: use a dedicated switch (plus a backup switch) with a fiber connection to a core switch.
9. Disabling SELinux and iptables: you should disable SELinux and iptables (the firewall) on every cluster node to avoid any blocking. Use the commands below to disable SELinux:
# vim /etc/selinux/config (change SELINUX to 'disabled'; takes effect after a reboot)
# setenforce 0 (switches to permissive mode immediately, until the next reboot)
Use the commands below to flush and disable the firewall:
# iptables -F
# iptables -X
# systemctl stop firewalld
# systemctl disable firewalld
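When the same change has to be made on many nodes, editing the SELinux config by hand gets tedious. A minimal sketch of a non-interactive alternative is shown here; the function name set_selinux_mode is hypothetical, and it simply rewrites the SELINUX= line with sed while keeping a .bak copy of the file.

```shell
# set_selinux_mode FILE MODE
# Rewrites the SELINUX= line of an SELinux config FILE to MODE
# (enforcing, permissive or disabled) and keeps a backup as FILE.bak.
# The SELINUXTYPE= line is left untouched. Hypothetical helper name.
set_selinux_mode() {
    sed -i.bak "s/^SELINUX=.*/SELINUX=$2/" "$1"
}

# Usage (as root):
#   set_selinux_mode /etc/selinux/config disabled
#   setenforce 0     # permissive right away; 'disabled' needs a reboot
#   getenforce       # shows the current mode
```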
10. Disabling swap: by default Linux uses HDD space to swap out memory pages, and this behavior will kill Hadoop performance. Use the command below to turn swap off until the next reboot.
# swapoff -a (temporary)
# vim /etc/sysctl.conf (add 'vm.swappiness=0' so the kernel avoids swapping as far as possible; this setting persists across reboots)
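The sysctl.conf edit can also be scripted so that running it twice does not leave duplicate lines. The following is only a sketch under that assumption: set_sysctl_key is an illustrative name, and it drops any existing assignment of the key before appending the new one.

```shell
# set_sysctl_key FILE KEY VALUE
# Ensures FILE contains exactly one "KEY=VALUE" line: any existing
# assignment of KEY is dropped and the new setting is appended.
# Commented-out lines are kept. Illustrative helper name.
set_sysctl_key() {
    touch "$1"
    awk -F= -v k="$2" '{
        key = $1
        gsub(/[ \t]/, "", key)      # ignore spaces around the key
        if (key != k) print
    }' "$1" > "$1.tmp"
    printf '%s=%s\n' "$2" "$3" >> "$1.tmp"
    # Write back in place so the original file keeps its permissions.
    cat "$1.tmp" > "$1" && rm -f "$1.tmp"
}

# Usage (as root):
#   swapoff -a
#   set_sysctl_key /etc/sysctl.conf vm.swappiness 0
#   sysctl -p
#   cat /proc/sys/vm/swappiness
```

Note that swapoff -a lasts only until the next reboot; swap entries in /etc/fstab are activated again at boot unless they are commented out.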
11. SSH (passwordless): you should allow the NameNode (and the failover NameNode) to SSH into every node in the cluster without entering a password. Use the steps below to configure passwordless SSH login to the cluster nodes from the NameNode.
- Generate an SSH key on the NameNode.
# ssh-keygen (just press Enter a couple of times to generate the key)
This command creates a key pair (private and public) under the .ssh directory of the user's home folder.
- Copy the public key to every host in the cluster.
# ssh-copy-id root@clusterhost
After these steps complete successfully, you should be able to log in to the remote machine without a password.
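With more than a handful of nodes, the copy step is easier to run in a loop. The sketch below assumes a plain text file with one host name per line; the function name copy_key_to_hosts, the hosts file and the DRY_RUN switch are all made up for illustration.

```shell
# copy_key_to_hosts HOSTFILE [USER]
# Runs ssh-copy-id for every host listed (one per line) in HOSTFILE.
# Blank lines and lines starting with '#' are skipped. USER defaults
# to root. Set DRY_RUN=1 to print the commands instead of running them.
copy_key_to_hosts() {
    user="${2:-root}"
    # Read the host list on fd 3 so ssh-copy-id keeps the terminal's stdin.
    while read -r host <&3; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh-copy-id $user@$host"
        else
            ssh-copy-id "$user@$host"
        fi
    done 3< "$1"
}

# Usage on the NameNode (hosts.txt is a placeholder file name):
#   ssh-keygen
#   DRY_RUN=1 copy_key_to_hosts hosts.txt    # preview
#   copy_key_to_hosts hosts.txt              # asks for each password once
#   ssh root@clusterhost hostname            # should not prompt any more
```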
Please visit again for updates.