Friday 10 February 2017

HAProxy 1.7.x Installation and Configuration for Impala


Steps:

  1. Download HAProxy to the Impala server.


 
  2. Install the build dependencies

           # yum install gcc pcre-static pcre-devel -y

  3. Untar the source and change into the directory

           # tar xzvf ~/haproxy.tar.gz -C ~/

           # cd ~/haproxy-1.7.2

  4. Then compile the program for your system and install HAProxy itself.

             # make TARGET=linux2628

             # make install

  5. To complete the install, use the following commands to copy the binary and init script into place.

           # cp /usr/local/sbin/haproxy /usr/sbin/

           # cp ~/haproxy-1.7.2/examples/haproxy.init /etc/init.d/haproxy

           # chmod 755 /etc/init.d/haproxy

  6. Create these directories and the statistics file for HAProxy to record in.

          # mkdir -p /etc/haproxy

          # mkdir -p /run/haproxy

          # mkdir -p /var/lib/haproxy

          # touch /var/lib/haproxy/stats

  7. Then add a new user for HAProxy.

           # useradd -r haproxy
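
The steps above can be collected into a single script. This is a sketch in dry-run form: each command is echoed for review rather than executed, and the paths (tarball at ~/haproxy.tar.gz, version 1.7.2) are the ones used in this tutorial. Set RUN="" to actually run the commands as root.

```shell
#!/bin/sh
# Dry-run sketch of steps 1-7 above. Commands are echoed for review;
# set RUN="" to really execute them (as root).
RUN="echo +"

HAPROXY_VER=1.7.2
SRC_DIR="$HOME/haproxy-$HAPROXY_VER"

$RUN yum install -y gcc pcre-static pcre-devel        # build dependencies
$RUN tar xzvf "$HOME/haproxy.tar.gz" -C "$HOME"       # unpack the source
$RUN make -C "$SRC_DIR" TARGET=linux2628              # target for kernels >= 2.6.28
$RUN make -C "$SRC_DIR" install                       # installs under /usr/local

$RUN cp /usr/local/sbin/haproxy /usr/sbin/
$RUN cp "$SRC_DIR/examples/haproxy.init" /etc/init.d/haproxy
$RUN chmod 755 /etc/init.d/haproxy

$RUN mkdir -p /etc/haproxy /run/haproxy /var/lib/haproxy
$RUN touch /var/lib/haproxy/stats
$RUN useradd -r haproxy                               # unprivileged service user
```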

 
Configuring the load balancer

Setting up HAProxy for load balancing is quite a straightforward process. Basically, all you need to do is tell HAProxy what kind of connections it should listen for and which servers it should relay the connections to. This is done by creating a configuration file, /etc/haproxy/haproxy.cfg, with the required settings. For help, please see the HAProxy Documentation.

Make sure you have the following settings for HAProxy 1.7.2:

 # vim /etc/haproxy/haproxy.cfg

______________________________________________
______________________________________________

# HAProxy Server Version 1.7.2
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    192s
    timeout queue           1m
    timeout connect         192s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 192s
    timeout check           192s
    maxconn                 3000

#kz : Timeout connect 3600000
    timeout client 3600000
    timeout server 3600000
#####################################################
##Default Timeout Settings###########################
#    timeout connect 5000
#    timeout client 50000
#    timeout server 50000
#####################################################
#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats
   bind *:25002
    balance
    mode http
#    stats enable
#    stats hide-version
#    stats scope .
#    stats realm Haproxy\ Statistics
#    stats uri /
#    stats auth prni01:haproxy
    log global
    stats enable
    stats hide-version
    stats refresh 30s
    stats show-node
    stats auth haproxy:h@pr0xy
    stats uri  /haproxy?stats
# This is the setup for Impala. Impala clients connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# Each impalad listens on port 21000 for beeswax (impala-shell) or the original ODBC driver.
# For the JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
# Config settings for Impala Shell
listen impalashell
   bind *:25003
    mode tcp
    option tcplog
    balance leastconn
#List of Impala Daemons
server  192.168.16.17  192.168.16.17:21000
server  192.168.16.18  192.168.16.18:21000
server  192.168.16.19  192.168.16.19:21000
server  192.168.16.21  192.168.16.21:21000
server  192.168.16.22  192.168.16.22:21000
server  192.168.16.23  192.168.16.23:21000
server  192.168.16.25  192.168.16.25:21000
server  192.168.16.30  192.168.16.30:21000
server  192.168.16.31  192.168.16.31:21000
server  192.168.16.32  192.168.16.32:21000
server  192.168.16.33  192.168.16.33:21000
server  192.168.16.34  192.168.16.34:21000
server  192.168.16.35  192.168.16.35:21000
server  192.168.16.36  192.168.16.36:21000
server  192.168.16.37  192.168.16.37:21000
server  192.168.16.38  192.168.16.38:21000
server  192.168.16.39  192.168.16.39:21000
server  192.168.16.40  192.168.16.40:21000
server  192.168.16.41  192.168.16.41:21000
server  192.168.16.42  192.168.16.42:21000
server  192.168.16.43  192.168.16.43:21000
server  192.168.16.44  192.168.16.44:21000
server  192.168.16.45  192.168.16.45:21000
server  192.168.16.46  192.168.16.46:21000
server  192.168.16.47  192.168.16.47:21000
#Config settings for Impala JDBC
listen impalajdbc
   bind *:25004
    mode tcp
    option tcplog
    balance leastconn
#List of Impala Daemons
server  192.168.16.17  192.168.16.17:21050
server  192.168.16.18  192.168.16.18:21050
server  192.168.16.19  192.168.16.19:21050
server  192.168.16.21  192.168.16.21:21050
server  192.168.16.22  192.168.16.22:21050
server  192.168.16.23  192.168.16.23:21050
server  192.168.16.25  192.168.16.25:21050
server  192.168.16.30  192.168.16.30:21050
server  192.168.16.31  192.168.16.31:21050
server  192.168.16.32  192.168.16.32:21050
server  192.168.16.33  192.168.16.33:21050
server  192.168.16.34  192.168.16.34:21050
server  192.168.16.35  192.168.16.35:21050
server  192.168.16.36  192.168.16.36:21050
server  192.168.16.37  192.168.16.37:21050
server  192.168.16.38  192.168.16.38:21050
server  192.168.16.39  192.168.16.39:21050
server  192.168.16.40  192.168.16.40:21050
server  192.168.16.41  192.168.16.41:21050
server  192.168.16.42  192.168.16.42:21050
server  192.168.16.43  192.168.16.43:21050
server  192.168.16.44  192.168.16.44:21050
server  192.168.16.45  192.168.16.45:21050
server  192.168.16.46  192.168.16.46:21050
server  192.168.16.47  192.168.16.47:21050

_________________________________________________
_________________________________________________


Change settings as per the attached haproxy.cfg file (please note that configuration properties may change depending on the version you are using).

 
Then start the haproxy service.

# service haproxy start    (or, on systemd systems: # systemctl start haproxy)


Check HA with impala-shell from the 'haproxy server' host.

# impala-shell -i <haproxy_host>:25003    (25003 is the port configured above for impala-shell)

(you should be able to log in and check databases)
NOTE: Please check your firewall settings if you face any issues.
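
The post-start checks can also be scripted. A minimal sketch, assuming the config path from this tutorial and the three frontend ports bound in the haproxy.cfg above:

```shell
#!/bin/sh
# Sanity checks after starting the service. Path and ports match the
# haproxy.cfg shown above; adjust if yours differ.
CFG=/etc/haproxy/haproxy.cfg
PORTS="25002 25003 25004"          # stats, impala-shell, JDBC/ODBC-2.x

# validate the configuration file before (re)starting
command -v haproxy >/dev/null && haproxy -c -f "$CFG"

# confirm the proxy is listening on each frontend port
for p in $PORTS; do
    if ss -tln 2>/dev/null | grep -q ":$p "; then
        echo "port $p: listening"
    else
        echo "port $p: NOT listening (check service and firewall)"
    fi
done
```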


................DONE................

Thursday 12 January 2017

HDFS Encryption Using Cloudera Navigator Key Trustee Server

[note: this document was prepared after reading the Cloudera website.]

In this tutorial I will explain Cloudera Navigator Key Trustee Server (KTS).

Cloudera Navigator is a fully integrated data-management and security system for the Hadoop platform. Cloudera Navigator provides the following functionality
(here I will be explaining 'KTS', under 'Data Encryption'):
 - Data Management
 - Data Encryption

Enabling HDFS encryption using Key Trustee Server as the key store involves multiple components:
 ⦁    Cloudera Navigator Key Trustee Server
 ⦁    Cloudera Navigator Key HSM
 ⦁    Cloudera Navigator Encrypt
 ⦁    Key Trustee KMS

Reference : http://www.cloudera.com/documentation/enterprise/5-8-x/topics/navigator_encryption.html#concept_w4l_yjv_jt


Resource Planning for Data at Rest Encryption: 

For high availability, you must provision two dedicated Key Trustee Server hosts and at least two dedicated Key Trustee KMS hosts, for a minimum of four separate hosts. Do not run multiple Key Trustee Server or Key Trustee KMS services on the same physical host, and do not run these services on hosts with other cluster services. Doing so causes resource contention with other important cluster services and defeats the purpose of high availability.
The Key Trustee KMS workload is CPU intensive. Cloudera recommends using machines with capabilities equivalent to your NameNode hosts, with Intel CPUs that support AES-NI for optimum performance. 

Make sure that each host is secured and audited. Only authorized key administrators should have access to them; refer to OS hardening guides to secure the Red Hat OS.
For Cloudera Manager deployments, deploy Key Trustee Server in its own dedicated cluster. Deploy Key Trustee KMS in each cluster that uses Key Trustee Server. 
Reference: http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_resource_planning.html#concept_cg3_rfp_y5

Virtual Machine Considerations
If you are using virtual machines, make sure that the resources (such as virtual disks, CPU, and memory) for each Key Trustee Server and Key Trustee KMS host are allocated to separate physical hosts. Hosting multiple services on the same physical host defeats the purpose of high availability, because a single machine failure can take down multiple services.


Data at Rest Encryption Reference Architecture

To isolate Key Trustee Server from other Enterprise Data Hub (EDH) services, you must deploy Key Trustee Server on dedicated hosts in a separate cluster in Cloudera Manager. Deploy Key Trustee KMS on dedicated hosts in the same cluster as the EDH services that require access to Key Trustee Server. This provides the following benefits:

⦁    You can restart your EDH cluster without restarting Key Trustee Server, avoiding interruption to other clusters or clients that use the same Key Trustee Server instance.
⦁    You can manage the Key Trustee Server upgrade cycle independently of other cluster components.
⦁    You can limit access to the Key Trustee Server hosts to authorized key administrators only, reducing the attack surface of the system.
⦁    Resource contention is reduced. Running Key Trustee Server and Key Trustee KMS services on dedicated hosts prevents other cluster services from reducing available resources (such as CPU and memory) and creating bottlenecks.
reference : http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_ref_arch.html#concept_npk_rxh_1v 

Installing Cloudera Navigator Key Trustee Server

Important: Before installing Cloudera Navigator Key Trustee Server, see Deployment Planning for Data at Rest Encryption for important considerations.
++++
Deployment Planning for Data at Rest Encryption
⦁    Data at Rest Encryption Reference Architecture (explained above)
⦁    Data at Rest Encryption Requirements
⦁    Resource Planning for Data at Rest Encryption (explained above)

Data at Rest Encryption Requirements ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_prereqs.html#concept_sbn_zt4_y5 )
Encryption comprises several components, each with its own requirements. See Cloudera Navigator Data Encryption Overview for more information on the components, concepts, and architecture for encrypting data at rest. Continue reading:

⦁    Product Compatibility Matrix ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/rn_consolidated_pcm.html#pcm_navigator_encryption )
⦁    Entropy Requirements
⦁    Key Trustee Server Requirements
⦁    Key Trustee KMS Requirements
⦁    Key HSM Requirements
⦁    Navigator Encrypt Requirements
++++
 You can install Navigator Key Trustee Server using Cloudera Manager with parcels or using the command line with packages.


Prerequisites: See Data at Rest Encryption Requirements for more information about encryption and Key Trustee Server requirements.

Setting Up an Internal Repository:
You must create an internal repository to install or upgrade the Cloudera Navigator data encryption components. For instructions on creating internal repositories (including Cloudera Manager, CDH, and Cloudera Navigator encryption components), see the following topics:

⦁    Creating and Using a Remote Parcel Repository for Cloudera Manager
⦁    Creating and Using a Package Repository for Cloudera Manager
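
As a rough sketch of what a package-based internal repository can look like (the repo path, hostname, and serving the RPMs over httpd are illustrative assumptions; follow the Cloudera topics above for the real procedure):

```shell
#!/bin/sh
# Dry-run sketch of a minimal internal yum repository. Paths and the
# hostname are placeholders; set RUN="" to execute (as root).
RUN="echo +"
REPO_DIR=/var/www/html/keytrustee

$RUN yum install -y createrepo httpd
$RUN mkdir -p "$REPO_DIR"
# ...copy the downloaded Key Trustee RPMs into $REPO_DIR, then:
$RUN createrepo "$REPO_DIR"        # generates the repodata/ metadata
$RUN systemctl start httpd         # serve the directory over http

# each cluster host then gets a .repo file like this (placeholder URL):
cat <<'EOF'
[keytrustee-internal]
name=Key Trustee internal repo
baseurl=http://repo-host.example.com/keytrustee
enabled=1
gpgcheck=0
EOF
```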



Apache Hadoop Configuration (from scratch) ::

APACHE HADOOP INSTALLATION AND CONFIGURATION STEP-BY-STEP
Note: I am using 'CentOS-7' for this entire tutorial ( or I will mention OS flavor when required )
- - -
Pre-requisites 1 :: (Environment)

1) Laptop or desktop (with min. 4 GB of RAM & 150 GB of HDD)
2) Linux OS ISO (CentOS is preferred because it's lighter than other OSes.)
( I suggest installing CentOS with a GUI as the 'host' instead of Windows 7/8/10 )
3) KVM (Kernel-based Virtual Machine) to install guest OSes.
Explanation:
If your laptop is installed with a Linux OS, you can use it to install Hadoop in 'pseudo-distributed mode', using the laptop as the 'Name Node' and guests installed under KVM as 'Data Nodes'.

Modes of Installation:
  1. Fully Distributed Mode  2. Pseudo-Distributed Mode  3. Standalone Mode
  1. Fully Distributed Mode:
You need multiple servers for this mode. Each service runs on a separate machine and in a separate JVM.
You will have daemons for each process, and the daemons run on multiple servers.
  2. Pseudo-Distributed Mode:
Can be implemented on a single server. All services run as separate JVMs on the same machine;
all daemons (such as the DataNode, NameNode and ResourceManager processes) run on a single server.
  3. Standalone Mode:
Can be implemented on a single server. All Hadoop services run in a single JVM, and there are no daemons. Hadoop uses the local filesystem, not HDFS. Best suited for developers to test their code on their own machines.
Pre-requisites 2 ::: Modifying 'Kernel Parameters' of the OS (useful for all modes of installation)
It is not recommended to leave the default kernel parameters unchanged when you are implementing a Hadoop cluster. The following steps show how to change Linux kernel parameters for better Hadoop cluster performance.
  1. Changing disk mount parameters in the '/etc/fstab' file:
The OS maintains filesystem metadata that records when each file was last accessed, as per the POSIX standard. This timestamp is called 'atime', and it comes with a performance penalty: every read operation on the filesystem triggers a write operation.
Linux keeps 3 timestamps for each file on its filesystem: modified time (mtime), change time (ctime), and access time (atime).
The 'stat' command shows all three timestamps. Ex: $ stat /bin/ls
- The 'noatime' option disables writing file access times to the HDD every time you read a file.
- The 'nodiratime' option disables the writing of file access times only for directories, while other files still get access times written.
Note: 'noatime' implies 'nodiratime'. No need to specify both.
/etc/fstab :
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0
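After editing /etc/fstab, the new option can also be applied to an already-mounted filesystem without a reboot. A small dry-run sketch (/data1 is the mount point from the example above; set RUN="" to execute as root):

```shell
#!/bin/sh
# Apply noatime to a mounted filesystem in place, then verify.
RUN="echo +"            # set RUN="" to execute as root
MNT=/data1              # mount point from the fstab example above

$RUN mount -o remount,noatime "$MNT"
# afterwards, verify the active mount options with:  mount | grep data1
```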
  2. Increasing the file limits:
To avoid file descriptor errors in the cluster, increase the limits on the number of files a single user or process can have open at a time. The default soft limit is often only 1024.
The command below shows the maximum, total, global number of file descriptors the kernel will allocate before choking.
# cat /proc/sys/fs/file-max
You can check the soft limit and hard limit using the commands below.
# ulimit -Sn
# ulimit -Hn
You can increase these limits for individual users by editing the file mentioned below (changes take effect at the user's next login).
# vim /etc/security/limits.conf
Once you change kernel settings in /etc/sysctl.conf, you can apply them by executing the following command.
# sysctl -p
# sysctl -a (to see all kernel settings)
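For example, entries like the following in /etc/security/limits.conf raise the open-file limits for the Hadoop service accounts (the user names and values here are illustrative; tune them to your cluster):

```
hdfs    soft    nofile    32768
hdfs    hard    nofile    65536
mapred  soft    nofile    32768
mapred  hard    nofile    65536
```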

3. BIOS settings: make sure IDE emulation is not enabled (the system admin / data center technician takes care of this).

4. Network time: make sure all servers in the cluster synchronize with the same 'NTP server'. This is critical for services like Kerberos, ZooKeeper and log files.

5. NIC bonding: make sure every server is configured with 2 NIC cards bonded as a single interface (for redundancy).

6. VLAN: it's better to keep the cluster in a separate VLAN.

7. DNS: make sure all servers in the cluster are correctly configured with a local DNS server for hostname resolution.

8. Network: a dedicated switch (plus a backup switch) with a fiber connection to a core switch.

9. Disabling SELinux and iptables: you should disable SELinux and iptables (the firewall) on every cluster node to avoid any blocking. Use the commands below to disable SELinux:
# vim /etc/selinux/config (change SELINUX to 'disabled')
# setenforce 0
Use the commands below to flush and disable the firewall:
# iptables -F
# iptables -X
# systemctl stop firewalld
# systemctl disable firewalld

10. Disabling swap: by default Linux uses HDD space to swap applications/files; this behavior will kill Hadoop performance. You can use the command below to disable swap.
# swapoff -a (temporary)
# vim /etc/sysctl.conf (add 'vm.swappiness=0') to disable swappiness permanently.

11. SSH (passwordless): you should allow the 'name node' (and the failover Name Node) to SSH into every node in the cluster without entering a password. Use the steps below to configure passwordless SSH login to the cluster nodes from the 'name node'.
- Generate an ssh key on the 'name node':
# ssh-keygen (just press Enter a couple of times to generate the key). This command creates a key pair (private and public) under the .ssh directory of your home folder.
- Copy the public key to all hosts in the cluster:
# ssh-copy-id root@clusterhost (after successful execution of the above steps you should be able to log in to the remote machine without any password)
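
The two commands above can be wrapped in a small loop to push the key to every node at once. A dry-run sketch (the host list and the root user are assumptions; set RUN="" to really execute):

```shell
#!/bin/sh
# Push the name node's public key to every cluster host.
RUN="echo +"                       # set RUN="" to really execute
HOSTS="dn1 dn2 dn3"                # placeholder host list; adjust to your cluster

# generate a key pair once, if it does not exist yet
[ -f "$HOME/.ssh/id_rsa.pub" ] || $RUN ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"

for h in $HOSTS; do
    $RUN ssh-copy-id "root@$h"     # prompts for the password once per host
done
```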

@@@@@@ please visit again for updates @@@@@@@@@@