Thursday, 12 January 2017

Apache Hadoop Configuration (from Scratch) ::

APACHE HADOOP INSTALLATION AND CONFIGURATION STEP-BY-STEP
Note: I am using 'CentOS 7' for this entire tutorial (I will mention the OS flavor when a step is distribution-specific).
- - -
Prerequisites-1 :: (Environment)

1) Laptop or Desktop (with min. 4 GB of RAM & 150 GB of HDD)
2) Linux OS ISO (CentOS is preferred because it is lighter than other distributions.)
(I suggest installing CentOS with a GUI as the 'host' OS instead of Windows 7/8/10.)
3) KVM (Kernel-based Virtual Machine) to install the guest OSes.
Explanation:
If your laptop runs Linux, you can use it to install Hadoop in 'pseudo-distributed mode', or use the laptop as the 'Name Node' and the guests installed under KVM as 'Data Nodes'.

Modes of Installation:
  1. Fully Distributed Mode 2. Pseudo-Distributed Mode 3. Standalone Mode
  1. Fully Distributed Mode:
You need multiple servers for this mode. Each service runs on a separate machine in a separate JVM.
Each process has its own daemon, and the daemons run across multiple servers.
  2. Pseudo-Distributed Mode:
Can be implemented on a single server. All services run as separate JVMs on the same machine;
all daemons (such as the DataNode, NameNode and ResourceManager processes) run on a single server.
  3. Standalone Mode:
Can be implemented on a single server. All Hadoop services run in a single JVM, and there are no daemons. Hadoop uses the local filesystem, not HDFS. Best suited for developers to test their code on their own machines. (See the config sketch below for how the two single-server modes differ.)
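To make the difference concrete, below is a minimal sketch of how fs.defaultFS in core-site.xml typically differs between standalone and pseudo-distributed mode (the localhost:9000 address is the conventional example value, not a requirement; pseudo-distributed mode also needs the daemons started):
<!-- standalone mode: Hadoop reads and writes the local filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>
<!-- pseudo-distributed mode: a single-node HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>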
Prerequisites-2 :: Modifying the OS 'Kernel Parameters' (useful for all modes of installation)
It is not recommended to leave the default kernel parameters unchanged when you are implementing a Hadoop cluster. The following steps show how to change the Linux kernel parameters for better Hadoop cluster performance.
  1. Changing disk mount parameters in the '/etc/fstab' file:
The OS maintains filesystem metadata that records when each file was last accessed, as per the POSIX standard. This timestamp is called 'atime', and it comes with a performance penalty: every read operation on the filesystem triggers a write operation.
Linux keeps three timestamps for each file on its filesystem: modified time (mtime), change time (ctime), and access time (atime).
The 'stat' command shows all three timestamps. Ex: $ stat /bin/ls
- The 'noatime' option disables writing file access times to the HDD every time you read a file.
- The 'nodiratime' option disables writing access times for directories only, while other files still get access times written.
Note: 'noatime' implies 'nodiratime'. There is no need to specify both.
/etc/fstab :
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0
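Assuming the filesystems are already mounted, you can apply the new option immediately with a remount instead of rebooting (the mount points here match the fstab example above):
# mount -o remount,noatime /data1
# mount -o remount,noatime /data2
# mount | grep /data (verify that 'noatime' now appears in the mount options)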
  2. Increasing the File Limits:
To avoid file descriptor errors in the cluster, increase the limits on the number of files a single user or process can have open at a time. The default per-process soft limit is low (typically 1024 on CentOS 7), which is far too small for Hadoop daemons.
The command below shows the maximum total (global) number of file descriptors the kernel will allocate before choking.
# cat /proc/sys/fs/file-max
You can check the soft limit and hard limit using the commands below.
# ulimit -Sn
# ulimit -Hn
You can increase these limits for individual users by editing the file mentioned below.
# vim /etc/security/limits.conf
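As a sketch, entries like the following are commonly added for the users that run the Hadoop daemons (the user names and the 32768 value are illustrative assumptions; tune them for your cluster). The '-' sets both the soft and hard limits:
hdfs    -    nofile    32768
mapred  -    nofile    32768
hbase   -    nofile    32768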
Once you change kernel settings in /etc/sysctl.conf, you can apply them by executing the following command (limits.conf changes take effect at the user's next login).
# sysctl -p
# sysctl -a (to see all kernel settings)

3. BIOS settings: make sure IDE emulation is not enabled (the system admin / data center technician usually takes care of this).

4. Network Time: make sure all servers in the cluster synchronize with the same NTP server. This is critical for services like Kerberos and ZooKeeper, and for correlating log files.
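A minimal sketch of enabling NTP on CentOS 7 (this uses the stock ntpd package; CentOS 7's default chronyd works equally well):
# yum install -y ntp
# systemctl start ntpd
# systemctl enable ntpd
# ntpq -p (verify the daemon is syncing with its upstream peers)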

5. NIC Bonding: make sure every server is configured with two NICs bonded as a single interface (for redundancy).
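A minimal sketch of an active-backup bond on CentOS 7 (the device names, bonding mode, and IP address are assumptions for illustration; create a similar slave file for the second NIC):
/etc/sysconfig/network-scripts/ifcfg-bond0 :
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=none
IPADDR=192.168.1.10
PREFIX=24
ONBOOT=yes
/etc/sysconfig/network-scripts/ifcfg-em1 (one of the two slave NICs) :
DEVICE=em1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none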

6. VLAN: it is better to keep the cluster in a separate VLAN.

7. DNS: make sure all servers in the cluster are correctly configured with a local DNS server for hostname resolution.
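A quick sanity check on any node (the hostname and IP below are placeholders): the forward and reverse lookups should agree with each other:
# hostname -f (should print the node's fully qualified domain name)
# nslookup namenode01.cluster.local (forward lookup should return the node's IP)
# nslookup 192.168.1.10 (reverse lookup should return the same hostname)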

8. Network: use a dedicated switch (plus a backup switch) with a fiber connection to the core switch.

9. Disabling SELinux and iptables: you should disable SELinux and iptables (the firewall) on every cluster node so they do not block cluster traffic. Use the commands below to disable SELinux:
# vim /etc/selinux/config (change SELINUX to 'disabled')
# setenforce 0 (disables enforcement immediately; the config change applies after a reboot)
Use the commands below to flush and disable the firewall:
# iptables -F
# iptables -X
# systemctl stop firewalld
# systemctl disable firewalld
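To verify that both changes took effect (a reboot is required before the SELINUX=disabled setting in the config file fully applies):
# getenforce (should print 'Permissive' after setenforce 0, or 'Disabled' after a reboot)
# systemctl is-active firewalld (should print 'inactive')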

10. Disabling swap: by default Linux swaps application memory out to HDD space, and this behavior will kill Hadoop performance. You can use the command below to disable swap temporarily. # swapoff -a
# vim /etc/sysctl.conf (add 'vm.swappiness=0') to disable swapping permanently.
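To apply and verify without a reboot (assuming the sysctl.conf edit above has been saved):
# sysctl -p (reloads /etc/sysctl.conf so vm.swappiness=0 takes effect)
# free -m (the Swap line should show 0 used once 'swapoff -a' has been run)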

11. SSH (Passwordless): you should allow the Name Node (and the failover Name Node) to SSH into every node in the cluster without entering a password. Use the steps below to configure passwordless SSH login to the cluster nodes from the Name Node.
- Generate an SSH key on the Name Node:
# ssh-keygen (just press Enter a couple of times to accept the defaults). This command creates a key pair (private and public) under the .ssh directory of the user's home folder.
- Copy the public key to every host in the cluster:
# ssh-copy-id root@clusterhost
(After the above steps complete successfully, you should be able to log in to the remote machine without any password.)
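As a sketch, you can push the key to several nodes in one loop (the host names and the root user are assumptions for illustration):
# for host in datanode01 datanode02 datanode03; do ssh-copy-id root@$host; done
# ssh root@datanode01 hostname (should print the hostname without prompting for a password)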

@@@@@@ please visit again for updates @@@@@@@@@@ 
