Hadoop Administration

Saturday, 31 March 2018

History Server Permissions and Ownership Guide

This is a guide to resolving JobHistory Server (JHS) and SparkHistory Server (SHS) problems related to file ownership and permissions.

Description:::
Users are not able to see the jobs from JobHistoryServer (JHS). It looks like the jobs are not moved from ResourceManager(RM) to JHS.

Issue :::
When clicking on the 'History' link of a job within RM WebUI, an error message is reported:

"Error getting logs at <hostname-node1>:8041"

Notice that the MapReduce (MR) job history metadata, e.g. conf.xml and *.jhist files, are not being moved from HDFS location /user/history/done_intermediate to /user/history/done. An HDFS listing of /user/history/done_intermediate/<username> shows many old files still present in this location.

Also, from JHS log, the following error may be present:

""
ERROR org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Error while trying to scan the directory
hdfs://host1.example.com:8020/user/history/done_intermediate/userA
org.apache.hadoop.security.AccessControlException: Permission denied: user=userB,
access=READ_EXECUTE, inode="/user/history/done_intermediate/userA":userA:hadoop:drwxrwx---
at
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
.......
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
""

Reason:::
The cause of this issue is because the "mapred" user is not in the "hadoop" group.

Therefore, the metadata files in /user/history/done_intermediate/<username> cannot be read by the mapred user as these directories and files are owned by <username>:hadoop with permission mode 770.
The mapred user needs to be in the hadoop group in order to read those metadata files and move them to /user/history/done/.

Resolution::
to resolve the issue, add the mapred user to the hadoop group. Depending on how you define your user and group, you may either need to do that from Linux side or from LDAP/AD side.
On a basic Linux system you could add the group to the user's list of groups using the "usermod" command:
" sudo usermod -a -G hadoop mapred "

Additional information:::

YARN applications use different history mechanisms depending on the application type. The two most common application types are MapReduce (MR) and Spark. The JobHistory Server (JHS) is responsible for managing MR application history and the SparkHistory Server (SHS) is responsible for Spark history.

These application history mechanisms have a couple things common:
Job summary files contain metadata such as performance counters and configuration
Container logs contain the direct output of the application as it ran in the YARN container
When log aggregation is enabled for YARN, which is the default, the NodeManager (NM) is responsible for moving the container logs into HDFS when the job is complete (failed or not).
Each user has their own directory in HDFS for container logs. While the job is running these container logs are kept in the NM local directories, local to the node. The NM moves the container logs to /tmp/logs/<username/logs/<application_id>.
The log files will be owned by the user who ran the job, and the group ownership will be "hadoop".
The group is used to allow services such as the history server processes to read the files.

Additionally there is a location for metadata about each job. Then the JHS needs to be able to read and sometimes modify these files from their HDFS locations.

How to enable debug level logging for YARN and its dependent services

Summery: How to enable debug level logging for YARN and its dependent services without restart.

Explanation:

you may not be able to identify issue when INFO level logging is enabled while troubleshooting issue. in-order to see more granular logging we have to enable DEBUG mode logging for different YARN services.

Below steps helps you to configure this in Production environments without YARN service restart.

Steps:

1) Open 'resource manager' web interface.

Ex: http://<your rm ip>:8080 (you should see screen similar to below)

2) Now open logging configuration web page from RM web page.

ex: http://<your rm ip>:8080/logLevel ( you should see below web page)

(take a note uppercase L in URL).

3) Now decide, for which log class you want to change logging level.

Ex: i am choosing 'RMAuditLogger'.

4) Check what is the current logging level for chosen class.(check screen shot)

5) Now, change the logging level for chosen classes to desired logging level (in this case INFO to DEBUG)

6) Review the Resource Manager service log. After the change, you should see the DEBUG level log messages in the logs.

Enabling Debugging mode of logging - Hue

Up to CDH-5.6 follow below steps in Cloudera Manager:

1) Hue --> Configuration.
2) search for " Hue Service Environment Advanced Configuration Snippet (Safety Valve) " and add below properties.
###

DEBUG=true
DESKTOP_DEBUG=true

###

3) Save and Restart Hue service.
4) you should be able to collect logs physically navigating into Hue service installed linux machine under below path.
" /var/run/cloudera-scm-agent/process/<id>-hue-HUE_SERVER/logs "
( where <id> is the most recently created, the debug logs do not show up in the default Hue log location, only in the process directory mentioned above.)

From CDH-5.7 on-wards :

Below are two ways to enable DEBUG messages for all the logs in /var/log/hue :

1 ) Cloudera Manager:
- Go to Hue --> Configuration --> check Enable Django Debug Mode --> and Save Changes --> Restart Hue.

2) Hue Web UI:
- Go to the Home page --> select Server Logs --> check Force Debug Level. Debug is enabled on-the-fly.

Friday, 10 February 2017

HAProxy 1.7.X Installation and configuration for IMPALA

Steps:

Downloaded HAProxy to Impala server.

# wget http://www.haproxy.org/download/1.7/src/haproxy-1.7.2.tar.gz

Installed dependencies

# yum install gcc pcre-static pcre-devel -y

Untar source and change directory

#tar xzvf ~/haproxy.tar.gz -C ~/

Change into the directory.

#cd ~/haproxy-1.7.2

Then compile the program for your system.

#make TARGET=linux2628

and finally install HAProxy itself.

# make install

To complete the install, use the following commands to copy the settings over.

# cp /usr/local/sbin/haproxy /usr/sbin/

# cp /haproxy-1.7.2/examples/haproxy.init /etc/init.d/haproxy

# chmod 755 /etc/init.d/haproxy

Create these directories and the statistics file for HAProxy to record in.

# mkdir -p /etc/haproxy

# mkdir -p /run/haproxy

# mkdir -p /var/lib/haproxy

# touch /var/lib/haproxy/stats

Then add a new user for HAProxy.

# useradd -r haproxy

Configuring the load balancer

Setting up HAProxy for load balancing is a quite straight forward process. Basically all you need to do is tell HAProxy what kind of connections it should be listening for and which servers it should relay the connections to. This can be done by creating a configuration file /etc/haproxy/haproxy.cfg with the required settings. For documentation help please go to HAProxy Documentation

make sure you have following setting for HAProxy-1.7.2

# Vim /etc/haproxy/haproxy.cfg

______________________________________________
______________________________________________

# HAProxy Server Version 1.7.2
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events. This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    192s
    timeout queue           1m
    timeout connect         192s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 192s
    timeout check           192s
    maxconn                 3000

#kz : Timeout connect 3600000
    timeout client 3600000
    timeout server 3600000
#####################################################
##Default Timeout Settings###########################
#    timeout connect 5000
#    timeout client 50000
#    timeout server 50000
#####################################################
#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats
   bind *:25002
    balance
    mode http
#    stats enable
#    stats hide-version
#    stats scope .
#    stats realm Haproxy\ Statistics
#    stats uri /
#    stats auth prni01:haproxy
    log global
    stats enable
    stats hide-version
    stats refresh 30s
    stats show-node
    stats auth haproxy:h@pr0xy
    stats uri /haproxy?stats
# This is the setup for Impala. Impala client connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
#Config settings for Impala Shell
listen impalashell
   bind *:25003
    mode tcp
    option tcplog
    balance leastconn
#List of Impala Daemons
server  192.168.16.17  192.168.16.17:21000
server  192.168.16.18  192.168.16.18:21000
server  192.168.16.19  192.168.16.19:21000
server  192.168.16.21  192.168.16.21:21000
server  192.168.16.22  192.168.16.22:21000
server  192.168.16.23  192.168.16.23:21000
server  192.168.16.25  192.168.16.25:21000
server  192.168.16.30  192.168.16.30:21000
server  192.168.16.31  192.168.16.31:21000
server  192.168.16.32  192.168.16.32:21000
server  192.168.16.33  192.168.16.33:21000
server  192.168.16.34  192.168.16.34:21000
server  192.168.16.35  192.168.16.35:21000
server  192.168.16.36  192.168.16.36:21000
server  192.168.16.37  192.168.16.37:21000
server  192.168.16.38  192.168.16.38:21000
server  192.168.16.39  192.168.16.39:21000
server  192.168.16.40  192.168.16.40:21000
server  192.168.16.41  192.168.16.41:21000
server  192.168.16.42  192.168.16.42:21000
server  192.168.16.43  192.168.16.43:21000
server  192.168.16.44  192.168.16.44:21000
server  192.168.16.45  192.168.16.45:21000
server  192.168.16.46  192.168.16.46:21000
server  192.168.16.47  192.168.16.47:21000
#Config settings for Impala JDBC
listen impalajdbc
   bind *:25004
    mode tcp
    option tcplog
    balance leastconn
#List of Impala Daemons
server  192.168.16.17  192.168.16.17:21050
server  192.168.16.18  192.168.16.18:21050
server  192.168.16.19  192.168.16.19:21050
server  192.168.16.21  192.168.16.21:21050
server  192.168.16.22  192.168.16.22:21050
server  192.168.16.23  192.168.16.23:21050
server  192.168.16.25  192.168.16.25:21050
server  192.168.16.30  192.168.16.30:21050
server  192.168.16.31  192.168.16.31:21050
server  192.168.16.32  192.168.16.32:21050
server  192.168.16.33  192.168.16.33:21050
server  192.168.16.34  192.168.16.34:21050
server  192.168.16.35  192.168.16.35:21050
server  192.168.16.36  192.168.16.36:21050
server  192.168.16.37  192.168.16.37:21050
server  192.168.16.38  192.168.16.38:21050
server  192.168.16.39  192.168.16.39:21050
server  192.168.16.40  192.168.16.40:21050
server  192.168.16.41  192.168.16.41:21050
server  192.168.16.42  192.168.16.42:21050
server  192.168.16.43  192.168.16.43:21050
server  192.168.16.44  192.168.16.44:21050
server  192.168.16.45  192.168.16.45:21050
server  192.168.16.46  192.168.16.46:21050
server  192.168.16.47  192.168.16.47:21050

_________________________________________________
_________________________________________________

Change settings as per attached haproxy.cfg file (please note configuration properties may change depends upon the version you are using.)

Then start the haproxy service.

# service haproxy start/restart /// # systemctl start haproxy

Check HA on impala-shell from 'haproxy server' host.

#impala-shell -I <impala_daemon_host>:25003 (this port is given for impala-shell)

(you should be able to login and check databases)

NOTE: Please check your firewall setting if you are facing any issue

................DONE................

Thursday, 12 January 2017

HDFS Encryption Using Cloudera Navigator Key Trustee Server

[note: this document was prepared after reading to cloudera website.]

Using this tutorial i will explain about " Cloudera Navigator Key Trustee Server (KTS).

Cloudera Navigator is a fully integrated data-management and security system for the Hadoop platform. Cloudera Navigator provides the following functionality:
(here i will be explaining about 'KTS' under 'Data Encryption')
- Data Management
- Data Encryption : Enabling HDFS encryption using Key Trustee Server as the key store involves multiple components.
⦁   Cloudera Navigator Key Trustee Server
⦁    Cloudera Navigator Key HSM
⦁    Cloudera Navigator Encrypt
⦁    Key Trustee KMS

Reference : http://www.cloudera.com/documentation/enterprise/5-8-x/topics/navigator_encryption.html#concept_w4l_yjv_jt

Resource Planning for Data at Rest Encryption:

For high availability, you must provision two dedicated Key Trustee Server hosts and at least two dedicated Key Trustee KMS hosts, for a minimum of four separate hosts. Do not run multiple Key Trustee Server or Key Trustee KMS services on the same physical host, and do not run these services on hosts with other cluster services. Doing so causes resource contention with other important cluster services and defeats the purpose of high availability.

The Key Trustee KMS workload is CPU intensive. Cloudera recommends using machines with capabilities equivalent to your NameNode hosts, with Intel CPUs that support AES-NI for optimum performance.

Make sure that each host is secured and audited. Only authorized key administrators should have access to them. refer following links to secure redhat OS...

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security_Guide/

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Security_Guide/

For Cloudera Manager deployments, deploy Key Trustee Server in its own dedicated cluster. Deploy Key Trustee KMS in each cluster that uses Key Trustee Server.

Reference: http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_resource_planning.html#concept_cg3_rfp_y5

Virtual Machine Considerations
If you are using virtual machines, make sure that the resources (such as virtual disks, CPU, and memory) for each Key Trustee Server and Key Trustee KMS host are allocated to separate physical hosts. Hosting multiple services on the same physical host defeats the purpose of high availability, because a single machine failure can take down multiple services.

Data at Rest Encryption Reference Architecture

To isolate Key Trustee Server from other Enterprise Data Hub (EDH) services, you must deploy Key Trustee Server on dedicated hosts in a separate cluster in Cloudera Manager. Deploy Key Trustee KMS on dedicated hosts in the same cluster as the EDH services that require access to Key Trustee Server. This provides the following benefits:

⦁    You can restart your EDH cluster without restarting Key Trustee Server, avoiding interruption to other clusters or clients that use the same Key Trustee Server instance.
⦁    You can manage the Key Trustee Server upgrade cycle independently of other cluster components.
⦁    You can limit access to the Key Trustee Server hosts to authorized key administrators only, reducing the attack surface of the system.
⦁    Resource contention is reduced. Running Key Trustee Server and Key Trustee KMS services on dedicated hosts prevents other cluster services from reducing available resources (such as CPU and memory) and creating bottlenecks.
reference : http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_ref_arch.html#concept_npk_rxh_1v

Installing Cloudera Navigator Key Trustee Server

Important: Before installing Cloudera Navigator Key Trustee Server, see Deployment Planning for Data at Rest Encryption for important considerations.
++++
Deployment Planning for Data at Rest Encryption
⦁    Data at Rest Encryption Reference Architecture (explained above)
⦁    Data at Rest Encryption Requirements
⦁    Resource Planning for Data at Rest Encryption (explained above)

Data at Rest Encryption Requirements ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_prereqs.html#concept_sbn_zt4_y5 )
Encryption comprises several components, each with its own requirements. See Cloudera Navigator Data Encryption Overview for more information on the components, concepts, and architecture for encrypting data at rest. Continue reading:

⦁    Product Compatibility Matrix ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/rn_consolidated_pcm.html#pcm_navigator_encryption )
⦁    Entropy Requirements
⦁    Key Trustee Server Requirements
⦁    Key Trustee KMS Requirements
⦁    Key HSM Requirements
⦁    Navigator Encrypt Requirements
++++
You can install Navigator Key Trustee Server using Cloudera Manager with parcels or using the command line with packages.

Prerequisites: See Data at Rest Encryption Requirements for more information about encryption and Key Trustee Server requirements.

Setting Up an Internal Repository:
You must create an internal repository to install or upgrade the Cloudera Navigator data encryption components. For instructions on creating internal repositories (including Cloudera Manager, CDH, and Cloudera Navigator encryption components), see the following topics:

⦁    Creating and Using a Remote Parcel Repository for Cloudera Manager
⦁    Creating and Using a Package Repository for Cloudera Manager

Apache Hadoop Configuration (from the SCRATCH) ::

APACHE HADOOP INSTALLATION AND CONFIGURATION STEP-BY-STEP
Note: I am using 'CentOS-7' for this entire tutorial ( or I will mention OS flavor when required )
- - -
Pre requisites-1:: (Environmental)

1) Laptop or Desktop (with Min. 4 GB of RAM & 150 GB of HDD)
2) Linux OS iso ( Cent OS is preferred because, its lighter than other OS.)
( I suggest to install Cent OS with GUI as 'host' instead of windows 7/8/10)
3) KVM (Kernel-based Virtual Machine) to install guest OS.
Explanation:
if your laptop is installed with Linux OS, you can use it to install hadoop in 'pseudo-distribution mode' and you can use laptop as 'Name Node' and guests installed under KVM as 'Data Nodes'.

Modes of Installation:

Fully distribution Mode 2.Pseudo distributed Mode 3. Standalone Mode

Fully distribution:

you need multiple servers for this mode. Each service will run on separate machine and separate JVM.
You will have daemons for each process and the daemons run on multiple servers.
2.Pseudo distributed Mode:
Can be implemented in a single server. All services will run as separate JVM in the same machine.
all daemons (such as the DataNode, NameNode and ResourceManager processes) run on a single server.

Standalone Mode:

Can be implemented in a single server. All Hadoop services run in a single JVM, and there are no daemon. Hadoop will use local filesystem not HDFS. Best suited for developers to test their code in their machines.
Pre-Requisites-2 ::: Modifying 'Kernel Parameters' of OS (useful for all modes of installation )
it is not recommended to leave default kernel parameters unchanged when your are implementing hadoop cluster. Following steps will show you how to change linux kernel parameters for better hadoop cluster performance..

Changing disk mount parameters under '/etc/fstab' file:

OS maintains file system metadata that records when each file was last accessed as per the POSIX standard. This time-stamp is said as 'atime' and 'atime' comes with a performance penalty: every read operation on filesystem do a write operation.
Linux keeps 3 time stamps for each file on its filesystem : modified time (mtime), change time(ctime), and access time(atime).
'stat' command show above three time stamps. Ex: $ stat /bin/ls
- The 'noatime' option disables writing file access times to the HDD every time you read a file.
- The 'nodiratime' option disables the writing of file access times only for directories while other files still get access times written.
Note: 'noatime' implies 'nodiratime'. No need to specify both
/etc/fstab :
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0

Increasing the File Limits:

to avoid any file descriptor errors in the cluster, increase the limits on the number of files a single user or process can have open at a time. The default is only 128.
below command will show the maximum, total, global number of file descriptors the kernel will allocate before choking.
# cat /proc/sys/fs/file-max
you can check soft-limit and hard-limit using below commands
# ulimit -Sn
# ulimiit -Hn
you can increase these limits for individual users by editing below mentioned file.
# vim /etc/security/limits.conf
Once you change the kernel settings, you can apply the new settings by executing the following command.
# sysctl -p
# sysctl -a (to see all kernel settings)

3. BIOS settings: make sure IDE emulation is not enabled (system admin / data center technician takes care of this )

4. Network Time: make sure all servers in the cluster synchronize with same 'ntp server'. Its is critical for services like Kerberos, zookeeper and log files.

5. NIC Bonding : make sure every server configured with 2 NIC cards as single. (for redundancy)

6. VLAN: its better to keep cluster in seperate vlan.

7. DNS: make sure all servers in cluster have correctly configured with a local dns server for host name resolution.

8. Network : a dedicated switch(also a backup switch) with a fiber connection to a core switch.

9. Disabling SELinux and Iptables:you should disable 'selinux' and 'iptables(firewall) on every cluster node to avoid any blocking.Use below command to disable SELinux:
# vim /etc/seliux/config (change SELINUX to 'disabled')
# setenforce 0
Use below command to flush and disable firewall:
# iptables -F
# iptables -X
# systemctl stop firewalld
# systemctl disable firewalld

10. Disabling swap: by default Linux uses HDD space to swap applications/files, this behavior will kill Hadoop performance. You can use below command to disable swap. #swapoff -a (temporary)
# vim /etc/sysctl.conf (add 'vm.swappiness=0) to disable swappiness permanently.

11. SSH (Password less): you should allow 'name node' (& fail over Name Node”) to SSH into every node in the cluster with entering password. Use blow steps to configure password less ssh login to cluster nodes from 'name node'. - generate ssh-key on 'name node'. # ssh-keygen (just enter couple of time to generate key). This command will create a keys (private and public) under .ssh directory of his home folder. - copy public key to all hosts in the cluster. #ssh-copy-id root@clusterhost(after successful execution of above steps you should be able to ling to remote machine without any password)

@@@@@@ please visit again for updates @@@@@@@@@@