Thursday 12 January 2017

HDFS Encryption Using Cloudera Navigator Key Trustee Server

[note: this document was prepared after reading the Cloudera website.]

In this tutorial, I will explain Cloudera Navigator Key Trustee Server (KTS).

Cloudera Navigator is a fully integrated data-management and security system for the Hadoop platform. Cloudera Navigator provides the following functionality:
(here I will be explaining 'KTS', under 'Data Encryption')
 - Data Management
 - Data Encryption: enabling HDFS encryption using Key Trustee Server as the key store involves multiple components.
 ⦁    Cloudera Navigator Key Trustee Server
 ⦁    Cloudera Navigator Key HSM 
 ⦁    Cloudera Navigator Encrypt
 ⦁    Key Trustee KMS

Reference : http://www.cloudera.com/documentation/enterprise/5-8-x/topics/navigator_encryption.html#concept_w4l_yjv_jt


Resource Planning for Data at Rest Encryption: 

For high availability, you must provision two dedicated Key Trustee Server hosts and at least two dedicated Key Trustee KMS hosts, for a minimum of four separate hosts. Do not run multiple Key Trustee Server or Key Trustee KMS services on the same physical host, and do not run these services on hosts with other cluster services. Doing so causes resource contention with other important cluster services and defeats the purpose of high availability.
The Key Trustee KMS workload is CPU intensive. Cloudera recommends using machines with capabilities equivalent to your NameNode hosts, with Intel CPUs that support AES-NI for optimum performance. 

Make sure that each host is secured and audited. Only authorized key administrators should have access to them. Refer to the Red Hat OS hardening documentation to secure the operating system on each host.
For Cloudera Manager deployments, deploy Key Trustee Server in its own dedicated cluster. Deploy Key Trustee KMS in each cluster that uses Key Trustee Server. 
Reference: http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_resource_planning.html#concept_cg3_rfp_y5

Virtual Machine Considerations
If you are using virtual machines, make sure that the resources (such as virtual disks, CPU, and memory) for each Key Trustee Server and Key Trustee KMS host are allocated to separate physical hosts. Hosting multiple services on the same physical host defeats the purpose of high availability, because a single machine failure can take down multiple services.


Data at Rest Encryption Reference Architecture

To isolate Key Trustee Server from other Enterprise Data Hub (EDH) services, you must deploy Key Trustee Server on dedicated hosts in a separate cluster in Cloudera Manager. Deploy Key Trustee KMS on dedicated hosts in the same cluster as the EDH services that require access to Key Trustee Server. This provides the following benefits:

⦁    You can restart your EDH cluster without restarting Key Trustee Server, avoiding interruption to other clusters or clients that use the same Key Trustee Server instance.
⦁    You can manage the Key Trustee Server upgrade cycle independently of other cluster components.
⦁    You can limit access to the Key Trustee Server hosts to authorized key administrators only, reducing the attack surface of the system.
⦁    Resource contention is reduced. Running Key Trustee Server and Key Trustee KMS services on dedicated hosts prevents other cluster services from reducing available resources (such as CPU and memory) and creating bottlenecks.
Reference: http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_ref_arch.html#concept_npk_rxh_1v

Installing Cloudera Navigator Key Trustee Server

Important: Before installing Cloudera Navigator Key Trustee Server, see Deployment Planning for Data at Rest Encryption for important considerations.
++++
Deployment Planning for Data at Rest Encryption
⦁    Data at Rest Encryption Reference Architecture (explained above)
⦁    Data at Rest Encryption Requirements
⦁    Resource Planning for Data at Rest Encryption (explained above)

Data at Rest Encryption Requirements ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/encryption_prereqs.html#concept_sbn_zt4_y5 )
Encryption comprises several components, each with its own requirements. See Cloudera Navigator Data Encryption Overview for more information on the components, concepts, and architecture for encrypting data at rest. Continue reading:

⦁    Product Compatibility Matrix ( http://www.cloudera.com/documentation/enterprise/5-8-x/topics/rn_consolidated_pcm.html#pcm_navigator_encryption )
⦁    Entropy Requirements
⦁    Key Trustee Server Requirements
⦁    Key Trustee KMS Requirements
⦁    Key HSM Requirements
⦁    Navigator Encrypt Requirements
++++
 You can install Navigator Key Trustee Server using Cloudera Manager with parcels or using the command line with packages.


Prerequisites: See Data at Rest Encryption Requirements for more information about encryption and Key Trustee Server requirements.

Setting Up an Internal Repository:
You must create an internal repository to install or upgrade the Cloudera Navigator data encryption components. For instructions on creating internal repositories (including Cloudera Manager, CDH, and Cloudera Navigator encryption components), see the following topics:

⦁    Creating and Using a Remote Parcel Repository for Cloudera Manager
⦁    Creating and Using a Package Repository for Cloudera Manager
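As a hedged sketch of what an internal package repository setup typically involves on RHEL/CentOS (the repository name, path, and URL below are hypothetical placeholders, not values from the Cloudera docs):

```shell
# Sketch: generate the yum .repo file that cluster hosts would use to
# reach an internal repository. 'repo.example.internal' is a placeholder.
cat > /tmp/keytrustee.repo <<'EOF'
[keytrustee-internal]
name=Internal Key Trustee Server repository
baseurl=http://repo.example.internal/keytrustee/
enabled=1
gpgcheck=1
EOF
# On the repository host itself, you would copy the RPMs into the
# directory served at that baseurl and run 'createrepo' on it.
grep '^baseurl' /tmp/keytrustee.repo
```

On each cluster host, the generated file would be placed in /etc/yum.repos.d/ so that yum can resolve the encryption packages from the internal mirror instead of the internet.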

Apache Hadoop Configuration (from Scratch)::

APACHE HADOOP INSTALLATION AND CONFIGURATION STEP-BY-STEP
Note: I am using 'CentOS 7' for this entire tutorial (I will mention the OS flavor when a step is distribution-specific).
- - -
Pre-Requisites-1:: (Environment)

1) Laptop or desktop (with min. 4 GB of RAM & 150 GB of HDD)
2) Linux OS ISO (CentOS is preferred because it is lighter than other distributions.)
( I suggest installing CentOS with a GUI as the 'host' instead of Windows 7/8/10 )
3) KVM (Kernel-based Virtual Machine) to install the guest OSes.
Explanation:
If your laptop runs Linux, you can install Hadoop on it in 'pseudo-distributed mode', use the laptop as the 'Name Node', and use the guests installed under KVM as 'Data Nodes'.

Modes of Installation:
  1. Fully Distributed Mode  2. Pseudo-Distributed Mode  3. Standalone Mode
1. Fully Distributed Mode:
You need multiple servers for this mode. Each service runs on a separate machine, in a separate JVM.
You will have daemons for each process, and the daemons run across multiple servers.
2. Pseudo-Distributed Mode:
Can be implemented on a single server. All services run as separate JVMs on the same machine;
all daemons (such as the DataNode, NameNode and ResourceManager processes) run on a single server.
3. Standalone Mode:
Can be implemented on a single server. All Hadoop services run in a single JVM, and there are no daemons. Hadoop uses the local filesystem, not HDFS. Best suited for developers to test their code on their own machines.
Pre-Requisites-2 ::: Modifying 'Kernel Parameters' of the OS (useful for all modes of installation)
It is not recommended to leave the default kernel parameters unchanged when you are implementing a Hadoop cluster. The following steps show how to change Linux kernel parameters for better Hadoop cluster performance.
  1. Changing disk mount parameters in the '/etc/fstab' file:
The OS maintains filesystem metadata that records when each file was last accessed, as per the POSIX standard. This timestamp is known as 'atime', and it comes with a performance penalty: every read operation on the filesystem triggers a write operation (to update the access time).
Linux keeps three timestamps for each file on its filesystem: modified time (mtime), change time (ctime), and access time (atime).
The 'stat' command shows all three timestamps. Ex: $ stat /bin/ls
- The 'noatime' option disables writing file access times to the disk every time you read a file.
- The 'nodiratime' option disables the writing of access times for directories only, while other files still get access times written.
Note: 'noatime' implies 'nodiratime'; no need to specify both.
/etc/fstab :
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0
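As a quick sanity check, fstab entries like the ones above can be validated with a short shell snippet (a minimal sketch; the device names and /data mount points are the hypothetical ones from the example, and on a real host you would read /etc/fstab instead of the here-string):

```shell
# Check that every data mount in an fstab-style listing carries the
# 'noatime' option. The sample below mirrors the example entries;
# replace the here-string with /etc/fstab on a live system.
fstab='
/dev/sda2 /data1 ext4 defaults,noatime 0 0
/dev/sda3 /data2 ext4 defaults,noatime 0 0
'
echo "$fstab" | awk '$2 ~ /^\/data/ {
    status = ($4 ~ /noatime/) ? "ok" : "MISSING noatime"
    print $2, status
}'
```

After editing /etc/fstab, the option takes effect at the next mount; you can apply it immediately with `mount -o remount /data1`.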
  2. Increasing the file limits:
To avoid file-descriptor errors in the cluster, increase the limits on the number of files a single user or process can have open at a time. The default per-process soft limit (commonly 1024) is too low for Hadoop workloads.
The command below shows the system-wide maximum number of file descriptors the kernel will allocate:
# cat /proc/sys/fs/file-max
You can check the soft limit and hard limit using the commands below:
# ulimit -Sn
# ulimit -Hn
You can increase these limits for individual users by editing the file mentioned below (changes take effect at the user's next login session):
# vim /etc/security/limits.conf
For kernel settings changed in /etc/sysctl.conf, you can apply the new settings by executing the following command:
# sysctl -p
# sysctl -a (to see all kernel settings)
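The limits.conf entries for Hadoop service accounts typically look like the fragment below (a hedged sketch: the 'hdfs'/'mapred' user names and the 32768/65536 values are common recommendations, not values taken from this document):

```shell
# Generate an example limits.conf fragment for Hadoop service users
# and verify that each soft limit does not exceed its hard limit.
cat > /tmp/hadoop-limits.conf <<'EOF'
hdfs   soft nofile 32768
hdfs   hard nofile 65536
mapred soft nofile 32768
mapred hard nofile 65536
EOF
awk '
$2 == "soft" { soft[$1] = $4 }
$2 == "hard" { hard[$1] = $4 }
END {
    for (u in soft)
        print u, (soft[u] <= hard[u] ? "ok" : "soft > hard")
}' /tmp/hadoop-limits.conf
```

On a real cluster these lines would go into /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/), and the services would need to be restarted from a fresh login for the new limits to apply.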

3. BIOS settings: make sure IDE emulation is not enabled (the system administrator / data center technician takes care of this).

4. Network time: make sure all servers in the cluster synchronize with the same NTP server. This is critical for services such as Kerberos and ZooKeeper, and for correlating log files.
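On CentOS 7 the default time daemon is chronyd; a minimal configuration fragment might look like the sketch below ('ntp.example.com' is a placeholder for your site's NTP server):

```shell
# Write a minimal chrony.conf fragment pointing every cluster node at
# the same internal NTP server, then show the configured source.
cat > /tmp/chrony.conf.sample <<'EOF'
# All cluster nodes should use the same internal time source.
server ntp.example.com iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
EOF
grep '^server' /tmp/chrony.conf.sample
# On a live host you would then run:
#   systemctl enable --now chronyd
#   chronyc tracking
```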

5. NIC bonding: make sure every server is configured with two NIC cards bonded as a single interface (for redundancy).

6. VLAN: it is better to keep the cluster in a separate VLAN.

7. DNS: make sure all servers in the cluster are correctly configured with a local DNS server for hostname resolution (both forward and reverse lookups).

8. Network: a dedicated switch (plus a backup switch) with a fiber connection to a core switch.

9. Disabling SELinux and iptables: you should disable SELinux and iptables (the firewall) on every cluster node to avoid ports being blocked. Use the commands below to disable SELinux:
# vim /etc/selinux/config (change SELINUX to 'disabled')
# setenforce 0
Use the commands below to flush and disable the firewall:
# iptables -F
# iptables -X
# systemctl stop firewalld
# systemctl disable firewalld

10. Disabling swap: by default, Linux swaps application memory out to the HDD; this behavior will kill Hadoop performance. You can use the command below to disable swap temporarily:
# swapoff -a
# vim /etc/sysctl.conf (add 'vm.swappiness=0') to minimize swapping permanently.
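The sysctl.conf change can be sanity-checked the same way 'sysctl -p' would read it (a minimal sketch over a sample fragment; on a live host you would also confirm the running value with `cat /proc/sys/vm/swappiness`):

```shell
# Write the swappiness setting into a sysctl.conf-style fragment,
# then parse it back to confirm the key/value pair is well formed.
cat > /tmp/sysctl.conf.sample <<'EOF'
vm.swappiness=0
EOF
awk -F= '$1 == "vm.swappiness" { print "vm.swappiness is", $2 }' /tmp/sysctl.conf.sample
```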

11. SSH (passwordless): you should allow the 'name node' (and the failover 'name node') to SSH into every node in the cluster without entering a password. Use the steps below to configure passwordless SSH login to the cluster nodes from the 'name node':
- Generate an SSH key on the 'name node':
# ssh-keygen (just press Enter a couple of times to accept the defaults). This command creates a key pair (private and public) under the .ssh directory of the user's home folder.
- Copy the public key to all hosts in the cluster:
# ssh-copy-id root@clusterhost
(After successful execution of the above steps, you should be able to log in to the remote machine without any password.)
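For more than a handful of nodes, the key-copying step can be looped over a host list (a dry-run sketch: 'dn1 dn2 dn3' are hypothetical hostnames — replace them with your cluster nodes and drop the echo to actually distribute the key):

```shell
# Distribute the name node's public key to every cluster host.
# Echoed as a dry run so the commands can be reviewed first.
HOSTS="dn1 dn2 dn3"   # hypothetical data node hostnames
for host in $HOSTS; do
    echo ssh-copy-id "root@$host"
done
```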

@@@@@@ please visit again for updates @@@@@@@@@@