Hadoop 3 : how to configure / enable erasure coding? - hadoop

I'm trying to setup an Hadoop 3 cluster.
Two questions about the Erasure Coding feature :
How I can ensure that erasure coding is enabled ?
Do I still need to set the replication factor to 3 ?
Please indicate the relevant configuration properties related to erasure coding/replication, in order to get the same data security as Hadoop 2 (replication factor 3) but with the disk space benefits of Hadoop 3 erasure coding (only 50% overhead instead of 200%).

In Hadoop3 we can enable Erasure coding policy to any folder in HDFS. By default erasure coding is not enabled in Hadoop3, you can enable it by using setPolicy command with specifying desired path of folder.
1: To ensure erasure coding is enabled, you can run getPolicy command.
2: In Hadoop3 Replication factor setting will affect only to other folders which is not configured by erasure code setPolicy. You can use both Erasure coding and replication factor settings in single cluster.
Command to List the supported erasure policies:
./bin/hdfs ec -listPolicies
Command to Enable XOR-2-1-1024k Erasure policy:
./bin/hdfs ec -enablePolicy -policy XOR-2-1-1024k
Command to Set Erasure policy to HDFS directory:
./bin/hdfs ec -setPolicy -path /tmp -policy XOR-2-1-1024k
Command to Get the policy set to the given directory:
./bin/hdfs ec -getPolicy -path /tmp
Command to Remove the policy from the directory.i.e unset policy:
./bin/hdfs ec -unsetPolicy -path /tmp
Command to Disable policy:
./bin/hdfs ec -disablePolicy -policy XOR-2-1-1024k

Related

Insufficient number of DataNodes reporting when creating dataproc cluster

I am getting "Insufficient number of DataNodes reporting" error when creating dataproc cluster with gs:// as default FS. Below is the command i am using dataproc cluster.
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that the bucket i am using is able to create default folder in the bucker.
As Igor suggests, Dataproc does not support GCS as a default FS. I also suggest unsetting this property. Note, that fs.default.name property can be passed to individual jobs and will work just fine.
The error arises when the file system is tried to be accessed (HdfsClientModule). So, I think it is probable that Google Cloud Storage doesn't have a specific feature that is required for Hadoop and the creation fails after some folders were created (first image).
As somebody else mentioned previously, it is better to give up the idea of using GCS as the default fs and leave HDFS work in Dataproc. Nonetheless, you can still take advantage of Cloud Storage to have data persistence, reliability, and performance because remember that data in HDFS is removed when a cluster is shut down.
1.- From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2.- For accessing data from Spark or any Hadoop application just use the gs:// prefix to access your bucket.
Furthermore, if the Dataproc connector is installed on premises it can help to move HDFS data to Cloud Storage and then access it from a Dataproc cluster.

Streamsets Mapr FS origin/dest. KerberosPrincipal exception (using hadoop impersonation (in mapr 6.0))

I am trying to do a simple data move from a mapr fs origin to a mapr fs destination (this is not my use case, just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:
HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal
Tyring different variations of the hadoop fs URI field (eg. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see
2018-01-04 10:28:56,686 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource#16978460 DClusterSourceOffsetCommitter *admin preview-pool-1-thread-3
2018-01-04 10:28:56,697 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal ClusterHdfsSource *admin preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Authentication Config: ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 ERROR Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal''] ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,169 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal HdfsTargetConfigBean *admin 0 preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
However, to my knowledge, the system is not running Keberos, so this error message is a bit confusing for me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file for native mapr authentication did not seem to help the problem (even when reinstalling and commenting this line before running the streamsets mapr setup script).
Does anyone have any idea what is happening and how to fix it? Thanks.
This answer was provided on the mapr community forums and worked for me (using mapr v6.0). Note that the instruction here differ from those currently provided by the streamsets documentation. Throughout these instructions, I was logged in as user root.
After installing streamsets (and the mapr prerequisites) as per the documentation...
Change the owner of the the streamsets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for the hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and var/log/sdc.
In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to mapr user "mapr" and enable mapr password login. The file should end up looking like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
....
# Indicate that MapR Username/Password security is enabled
export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}
Edit the file /usr/lib/systemd/system/sdc.service to look like:
[Service]
User=mapr
Group=mapr
$cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a file (with any name) and add the contents (without spaces):
Environment=SDC_JAVA_OPTS=-Dmaprlogin.passowrd.enabled=true
If you are using mapr's sasl ticket auth. system (or something similar), generate a ticket for the this user on the node that is running streamsets. In this case, with the $maprlogin password command.
Then finally, restart the sdc service: $systemctl deamon-reload then $systemctl retart sdc.
Run something like $ps -aux | grep sdc | grep maprlogin to check that the sdc process is ownned by mapr and that the -Dmaprlogin.passowrd.enabled=true parameter has been successfully set. Once this is done, should be able to validate/run maprFS to maprFS operations in streamsets pipeline builder in batch processing mode.
** NOTE: If using Hadoop Configuration Directory param. instead of Hadoop FS URI, remember to have the files in your $HADOOP_HOME/conf directory (eg.hadoop-site.xml, yarn-site.xml, etc.) (in the case of mapr, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied to a directory $SDC_DIST/resource/<some hadoop config dir. you made need to create> (I just copy eberything in the directory) and add this path to the Hadoop Configuration Directory param. for your MaprFS (or HadoopFS). In the sdc web UI Hadoop Configuration Directory box, it would look like Hadoop Configuration Directory: <the directory within $SDC_DIST/resources/ that holds the hadoop files>.
** NOTE: If you are still logging errors of the form
2018-01-16 14:26:10,883
ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41
ERROR Error in Slave Runner: ClusterRunner *admin
runner-pool-2-thread-29
com.streamsets.datacollector.runner.PipelineRuntimeException:
CONTAINER_0800 - Pipeline
'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41'
validation error : HADOOPFS_11 - Cannot connect to the filesystem.
Check if the Hadoop FS location: 'maprfs:///' is valid or not:
'java.io.IOException: Provided Subject must contain a
KerberosPrincipal'
you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's /cluster/Worker Java Options tab for the origin and destination hadoop FS stages.
** The video linked to in the mapr community link also says to generate a mapr ticket for the sdc user (the default user that sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).

Could not find uri with key dfs.encryption.key.provider.uri to create a keyProvider in HDFS encryption for CDH 5.4

CDH Version: CDH5.4.5
Issue: When HDFS Encryption is enabled using KMS available in Hadoop CDH 5.4 , getting error while putting file into encryption zone.
Steps:
Steps for Encryption of Hadoop as follows:
Creating a key [SUCCESS]
[tester#master ~]$ hadoop key create 'TDEHDP'
-provider kms://https#10.1.118.1/key_generator/kms -size 128
tde group has been successfully created with options
Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null', attributes=null}.
KMSClientProvider[https://10.1.118.1/key_generator/kms/v1/] has been updated.
2.Creating a directory [SUCCESS]
[tester#master ~]$ hdfs dfs -mkdir /user/tester/vs_key_testdir
Adding Encryption Zone [SUCCESS]
[tester#master ~]$ hdfs crypto -createZone -keyName 'TDEHDP'
-path /user/tester/vs_key_testdir
Added encryption zone /user/tester/vs_key_testdir
Copying File to encryption Zone [ERROR]
[tdetester#master ~]$ hdfs dfs -copyFromLocal test.txt /user/tester/vs_key_testdir
15/09/04 06:06:33 ERROR hdfs.KeyProviderCache: Could not find uri with
key [dfs.encryption.key.provider.uri] to create a keyProvider !!
copyFromLocal: No KeyProvider is configured, cannot access an
encrypted file 15/09/04 06:06:33 ERROR hdfs.DFSClient: Failed to close
inode 20823
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /user/tester/vs_key_testdir/test.txt.COPYING (inode
20823): File does not exist. Holder
DFSClient_NONMAPREDUCE_1061684229_1 does not have any open files.
Any idea/suggestion will be helpful.
This issue was crossposted here: https://community.cloudera.com/t5/Storage-Random-Access-HDFS/Could-not-find-uri-with-key-dfs-encryption-key-provider-uri-to/td-p/31637
Main conclusion: It is a non-issue
Here is the answer that was provided by the support staff:
CDH's base release versions are just that: base. The fix for the
harmless log print due to HDFS-7931 is present in all CDH5 releases
since CDH 5.4.1.
If you see that error in context of having configured a KMS, then its
a worthy one to consider. If you do not use KMS or EZs, then the error
may be ignored. Alternatively upgrade to the latest CDH5 (5.4.x or
5.5.x) releases to receive a bug fix that makes the error only appear when in the context of a KMS being configured over an encrypted path.
Per your log snippet, I don't see a problem (the canary does not
appear to be failing?). If you're trying to report a failure, please
send us more characteristics of the failure, as HDFS-7931 is a minor
issue with an unnecessary log print.

HDFS Safe mode issue

I am facing an issue with HDFS. Error is given below:
Problem accessing /nn_browsedfscontent.jsp. Reason:
Cannot issue delegation token. Name node is in safe mode.The reported
blocks 428 needs additional 2 blocks to reach the threshold 0.9990 of
total blocks 430. Safe mode will be turned off automatically.
I even tried to leave safe mode using the command. But I am getting superuser privilege issue, even if I tried as the root user. I am using CDH 4.
Can any one let me know why this is happening and how to over come from this?
hadoop dfsadmin -safemode leave
After doing the above command, Run hadoop fsck so that any inconsistencies crept in the hdfs might be sorted out.
Use hdfs command instead of hadoop command for newer distributions:
hdfs dfsadmin -safemode leave
Modify dfs.safemode.threshold.pct proprty in hdfs-site.xml and restart the service.
For more details refer: SafeModeException : Name node is in safe mode

java.io.IOException: Failed to add a datanode. HDFS (Hadoop)

I am faced with the error while appending file on HDFS (cloudera 2.0.0-cdh4.2.0).
The use case that cause an error is:
Create file on file system (DistributedFileSystem). OK
Append earlier created file. ERROR
OutputStream stream = FileSystem.append(filePath);
stream.write(fileContents);
Then error is thrown:
Exception in thread "main" java.io.IOException: Failed to add a datanode.
User may turn off this feature by setting dfs.client.block.write.replace-datanode-on- failure.policy in configuration, where the current policy is DEFAULT. (Nodes: current=[host1:50010, host2:50010], original=[host1:50010, host2:50010])
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:469)
Some related hdfs configs:
dfs.replication set to 2
dfs.client.block.write.replace-datanode-on-failure.policy set to true
dfs.client.block.write.replace-datanode-on-failure set to DEFAULT
Any ideas?
Thanks!
Problem was solved by running on file system
hadoop dfs -setrep -R -w 2 /
Old files on file system had replication factor set to 3,
setting dfs.replication to 2 in hdfs-site.xml will not solve the problem
as this config will not apply to already existing files.
So, if u remove machines from cluster you better check files and system replication factor

Resources