Where is the core-default.xml file? - hadoop

I'm interested in the value of the fs.s3a.connection.ssl.enabled parameter in my MapR cluster.
I know the value is set in core-default.xml (if not overwritten by core-site.xml), but I cannot find the core-default.xml file. Any suggestions as to where it might be?
Is there any way to check the current value of the parameter?

Where is the core-default.xml file?
It is in the resources of hadoop-common: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
So in this case you will find it inside the hadoop-common jar file; the jar can be found at /opt/mapr/hadoop/hadoop-<version>/share/hadoop/common/hadoop-common-<version>.jar
I have extracted the jar and listed the files:
[... ~]$ jar xf ./hadoop-common-<version>.jar
[... ~]$ ll
-rw-rw-r-- 1 mapr mapr 1041 Mar 15 18:36 common-version-info.properties
-rw-rw-r-- 1 mapr mapr 64287 Mar 15 18:06 core-default.xml
...
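If you just want to read the property without unpacking the whole jar, something like this should also work (the jar path follows the MapR layout mentioned above; adjust the version to match your installation):
# print core-default.xml straight from the jar and show the property entry
unzip -p /opt/mapr/hadoop/hadoop-<version>/share/hadoop/common/hadoop-common-<version>.jar \
    core-default.xml | grep -A 2 "fs.s3a.connection.ssl.enabled"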
Is there any way to check the current value of the parameter?
Yes, there is; run the following command to see the property:
hadoop org.apache.hadoop.conf.Configuration | grep "fs.s3a.connection.ssl.enabled"
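Alternatively, if the hdfs client script is available on your node, hdfs getconf can print a single key directly; note that this reads the effective client-side configuration, so it also reflects any core-site.xml override:
hdfs getconf -confKey fs.s3a.connection.ssl.enabled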

Related

HDFS NFS locations using weird numerical username values for directory permissions

Seeing nonsense values for user names in folder permissions for NFS-mounted HDFS locations, while the HDFS locations themselves (using Hortonworks HDP 3.1) appear fine. E.g.
➜ ~ ls -lh /nfs_mount_root/user
total 6.5K
drwx------. 3 accumulo hdfs 96 Jul 19 13:53 accumulo
drwxr-xr-x. 3 92668751 hadoop 96 Jul 25 15:17 admin
drwxrwx---. 3 ambari-qa hdfs 96 Jul 19 13:54 ambari-qa
drwxr-xr-x. 3 druid hadoop 96 Jul 19 13:53 druid
drwxr-xr-x. 2 hbase hdfs 64 Jul 19 13:50 hbase
drwx------. 5 hdfs hdfs 160 Aug 26 10:41 hdfs
drwxr-xr-x. 4 hive hdfs 128 Aug 26 10:24 hive
drwxr-xr-x. 5 h_etl hdfs 160 Aug 9 14:54 h_etl
drwxr-xr-x. 3 108146 hdfs 96 Aug 1 15:43 ml1
drwxrwxr-x. 3 oozie hdfs 96 Jul 19 13:56 oozie
drwxr-xr-x. 3 882121447 hdfs 96 Aug 5 10:56 q_etl
drwxrwxr-x. 2 spark hdfs 64 Jul 19 13:57 spark
drwxr-xr-x. 6 zeppelin hdfs 192 Aug 23 15:45 zeppelin
➜ ~ hadoop fs -ls /user
Found 13 items
drwx------ - accumulo hdfs 0 2019-07-19 13:53 /user/accumulo
drwxr-xr-x - admin hadoop 0 2019-07-25 15:17 /user/admin
drwxrwx--- - ambari-qa hdfs 0 2019-07-19 13:54 /user/ambari-qa
drwxr-xr-x - druid hadoop 0 2019-07-19 13:53 /user/druid
drwxr-xr-x - hbase hdfs 0 2019-07-19 13:50 /user/hbase
drwx------ - hdfs hdfs 0 2019-08-26 10:41 /user/hdfs
drwxr-xr-x - hive hdfs 0 2019-08-26 10:24 /user/hive
drwxr-xr-x - h_etl hdfs 0 2019-08-09 14:54 /user/h_etl
drwxr-xr-x - ml1 hdfs 0 2019-08-01 15:43 /user/ml1
drwxrwxr-x - oozie hdfs 0 2019-07-19 13:56 /user/oozie
drwxr-xr-x - q_etl hdfs 0 2019-08-05 10:56 /user/q_etl
drwxrwxr-x - spark hdfs 0 2019-07-19 13:57 /user/spark
drwxr-xr-x - zeppelin hdfs 0 2019-08-23 15:45 /user/zeppelin
Notice the difference for users ml1 and q_etl: they have numerical user values when running ls on the NFS locations, rather than their user names.
Even doing something like...
[hdfs@HW04 ml1]$ hadoop fs -chown ml1 /user/ml1
does not change the NFS permissions. Even more annoying, when trying to change the NFS mount permissions as root, we see
[root@HW04 ml1]# chown ml1 /nfs_mount_root/user/ml1
chown: changing ownership of ‘/nfs_mount_root/user/ml1’: Permission denied
This causes real problems, since the differing uid means that I can't access these dirs even as the "correct" user to write to them. Not sure what to make of this. Anyone with more Hadoop experience have any debugging suggestions or fixes?
UPDATE:
Doing a bit more testing / debugging, found that the rules appear to be...
If the NFS server node has no uid (or gid?) that matches the uid of the user on the node accessing the NFS mount, we get the weird numerical uid values seen here.
If there is a uid associated with the username of the user on the requesting node, then that is the uid whose user we see assigned to the location when accessing via NFS (even if that uid on the NFS server node does not actually belong to the requesting user), e.g.
[root@HW01 ~]# clush -ab id ml1
---------------
HW[01,04] (2)
---------------
uid=1025(ml1) gid=1025(ml1) groups=1025(ml1)
---------------
HW[02-03] (2)
---------------
uid=1027(ml1) gid=1027(ml1) groups=1027(ml1)
---------------
HW05
---------------
uid=1026(ml1) gid=1026(ml1) groups=1026(ml1)
[root@HW01 ~]# exit
logout
Connection to hw01 closed.
➜ ~ ls -lh /hdpnfs/user
total 6.5K
...
drwxr-xr-x. 6 atlas hdfs 192 Aug 27 12:04 ml1
...
➜ ~ hadoop fs -ls /user
Found 13 items
...
drwxr-xr-x - ml1 hdfs 0 2019-08-27 12:04 /user/ml1
...
[root@HW01 ~]# clush -ab id atlas
---------------
HW[01,04] (2)
---------------
uid=1027(atlas) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW[02-03] (2)
---------------
uid=1024(atlas) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1005(atlas) gid=1006(hadoop) groups=1006(hadoop)
If you're wondering why I have users on the cluster with varying uids across the cluster nodes, see the problem posted here: How to properly change uid for HDP / ambari-created user? (note that these odd uid settings for hadoop service users were set up by Ambari by default).
After talking with someone more knowledgeable in HDP hadoop, I found that the problem is that when Ambari was set up and run to initially install the hadoop cluster, there may have been other preexisting users on the designated cluster nodes.
Ambari creates its various service users by giving them the next available UID of a node's available block of user UIDs. However, prior to installing Ambari and HDP on the nodes, I created some users on the to-be namenode (and others) in order to do some initial maintenance checks and tests. I should have just done this as root. Adding these extra users offset the UID counter on those nodes, so as Ambari created users on the nodes and incremented the UIDs, it started from different counter values on different nodes. Thus, the UIDs did not sync and caused problems with HDFS NFS.
To fix this, I...
Used Ambari to stop all running HDP services
Go to Service Accounts in Ambari and copy all of the expected service user names
For each user, run something like id <service username> to get the group(s) for each user. For service groups (which may have multiple members), you can do something like grep 'group-name-here' /etc/group. I recommend doing it this way, as the Ambari docs of default users and groups do not have some of the info that you can get here.
Use userdel and groupdel to remove all the Ambari service users and groups
Then recreate all the groups across the cluster
Then recreate all the users across the cluster (you may need to specify UIDs explicitly if some nodes have users that others do not)
Restart the HDP services (hopefully everything should still run as if nothing happened, since HDP should be looking for the literal username strings, not the UIDs)
For the last parts, you can use something like clustershell, e.g.
# remove user
$ clush -ab userdel <service username>
# check that the UID you want to use is actually available on all nodes
$ clush -ab id <some specific UID you want to use>
# assign that UID to a new service user
$ clush -ab useradd --uid <the specific UID> --gid <groupname> <service username>
To get the lowest common available UID from each node, I used...
# for UID
getent passwd | awk -F: '($3>1000) && ($3<10000) && ($3>maxuid) { maxuid=$3; } END { print maxuid+1; }'
# for GID
getent group | awk -F: '($3>1000) && ($3<10000) && ($3>maxgid) { maxgid=$3; } END { print maxgid+1; }'
Ambari also creates some /home dirs for users. Once you are done recreating the users, you will need to fix the ownership and permissions of those dirs (you can use something like clush there as well).
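A minimal sketch of that last step, assuming the same clush setup as above (the user, group, and home dir names are placeholders):
# fix ownership of a recreated service user's home dir on every node
$ clush -ab chown -R <service username>:<groupname> /home/<service username>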
* Note that this was a huge pain and you would need to manually correct the UIDs of users whenever you added another cluster node. I did this for a test cluster, but for production (or even a larger test cluster) you should just use Kerberos or SSSD + Active Directory.

Hadoop Log File Analysis from 2 separate machines

I am new to Hadoop. I have to find the trend of symbols traded among users.
I have 2 machines, b040n10 and b040n11. The files on the machines are as listed below:
b040n10:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 482342353 Feb 8 2014 A.log
-rw-r--r-- 1 root root 481231231 Feb 8 2014 B.log
b040n11:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 412312312 Feb 8 2014 C.log
-rw-r--r-- 1 root root 412356315 Feb 8 2014 D.log
There is a field called "symbol_name" in all these logs (example below).
IP=145.45.34.2;symbol_name=ABC;timestamp=12:13:05
IP=145.45.34.2;symbol_name=XYZ;timestamp=12:13:56
IP=145.45.34.2;symbol_name=ABC;timestamp=12:14:56
I have Hadoop running on my Laptop and I have 2 machines connected to my Laptop (can be used as Datanodes).
My task now is to get the list of symbol_names and the count for each symbol,
as shown below:
ABC - 2
XYZ - 1
Should I now:
1. copy all the files (A.log, B.log, C.log, D.log) from b040n10 and b040n11 to my laptop,
2. issue a copyFromLocal command to put them into HDFS, and analyze the data?
Or is there a better way to find out the symbol_name counts without copying these files to my laptop?
The question is a basic one, but I am new to Hadoop; please help me understand and use Hadoop better. Please let me know if more information on the question is needed.
Thanks
Copying the files to your local laptop defeats the entire purpose of Hadoop, which is to move the processing to the data, not the other way around. When you really have "Big Data", you won't be able to move the data around to process it locally.
Your problem is a typical Map/Reduce case: all you need is a job that counts the occurrences of each symbol. Just search for the Map/Reduce WordCount example and adapt it to your case.
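If you would rather not write Java, a rough sketch of the same idea with Hadoop Streaming and plain shell scripts could look like this (the streaming jar path, the /logs input directory, and the /symbol_counts output directory are assumptions; adjust them to your setup):
#!/bin/bash
# mapper.sh: print "symbol<TAB>1" for every log line read from stdin
grep -o 'symbol_name=[^;]*' | cut -d= -f2 | awk '{print $1 "\t1"}'

#!/bin/bash
# reducer.sh: sum the 1s per symbol (keys arrive sorted from the shuffle)
awk -F'\t' '{count[$1] += $2} END {for (s in count) print s " - " count[s]}'

# load the logs into HDFS (this can be done directly from b040n10/b040n11 if they have a Hadoop client), then run:
chmod +x mapper.sh reducer.sh
hadoop fs -mkdir -p /logs
hadoop fs -put A.log B.log C.log D.log /logs
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /logs -output /symbol_counts \
    -mapper mapper.sh -reducer reducer.sh \
    -file mapper.sh -file reducer.sh
The output files under /symbol_counts will then contain lines like "ABC - 2".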

Two copies of each file being copied from local to HDFS

I am using fs.copyFromLocalFile(local path, Hdfs dest path) in my program.
I am deleting the destination path on HDFS every time before copying the files from the local machine. But after copying the files from the local path, running map reduce on them generates two copies of each file, hence the word count doubles.
To be clear, I have "Home/user/desktop/input/" as my local path and "/input" as the HDFS dest path.
When I check the HDFS Destination path, i.e folder on which map reduce was applied this is the result
hduser@rallapalli-Lenovo-G580:~$ hdfs dfs -ls /input
14/03/30 08:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt~
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt~
When I provide the input as a single file, Home/user/desktop/input/1.txt, there is no problem and only a single file is copied. But specifying the directory creates a problem.
Manually placing each file in the HDFS dest path through the command line, however, creates no problem.
I am not sure if I am missing some simple file system logic, but it would be great if anyone could suggest where I am going wrong.
I am using hadoop 2.2.0.
I have tried deleting the local temporary files and made sure the text files are not open. I am looking for a way to avoid copying the temporary files.
Thanks in advance.
The files /input/1.txt~ and /input/2.txt~ are temporary backup files created by the file editor you are using on your machine. You can use Ctrl + H to see all hidden temporary files in your local directory and delete them.
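If you prefer to handle it from the command line, you could also just remove the backup copies before (or after) the copy; the paths below mirror the ones in the question, so adjust them to your actual directories:
# remove editor backup files from the local input directory before copying
rm /home/user/desktop/input/*~
# or clean them out of HDFS afterwards
hdfs dfs -rm '/input/*~'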

How to put a file to hdfs with secondary group?

I have a local file
-rw-r--r-- 1 me developers 102445154 Oct 22 10:02 file1.csv
which I'm attempting to put to hdfs:
/usr/bin/hdfs dfs -put ./file1.csv hdfs://000.00.00.00/user/me/
which works fine, but the group is wrong
-rw-r--r-- 3 me me 102445154 2013-10-22 10:23 hdfs://000.00.00.00/user/file1.csv
How do I get the group developers to come along with it?
Use the chgrp option on the file.
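For example (the path is taken from the put command in the question):
# change the group of the uploaded file after the put
/usr/bin/hdfs dfs -chgrp developers hdfs://000.00.00.00/user/me/file1.csv
# or recursively for the whole directory
/usr/bin/hdfs dfs -chgrp -R developers hdfs://000.00.00.00/user/me
New files in HDFS normally inherit their group from the parent directory, so changing the group of /user/me once should also cover future uploads.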

Target already exists error in hadoop put command

I am trying my hand at Hadoop 1.0. I am getting a "Target already exists" error while copying a file from the local system into HDFS.
My hadoop command and its output are as follows:
shekhar@ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt .
Warning: $HADOOP_HOME is deprecated.
put: Target already exists
After observing the output, we can see that there are two blank spaces between the words 'Target' and 'already'. I think there has to be something like /user/${user} between those two words. If I give the destination path explicitly as /user/shekhar then I get the following error:
shekhar@ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
Warning: $HADOOP_HOME is deprecated.
put: java.io.FileNotFoundException: Parent path is not a directory: /user/shekhar
The output of the ls command is as follows:
shekhar@ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -lsr /
Warning: $HADOOP_HOME is deprecated.
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred
drwx------ - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system
-rw------- 1 shekhar supergroup 4 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system/jobtracker.info
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /user
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
Please help me copy the file into HDFS. If you need any other information then please let me know.
I am trying this in Ubuntu, which was installed using WUBI (the Windows installer for Ubuntu).
Thanks in advance!
The problem in the put command is the trailing dot (.). You need to specify the full path on HDFS where you want the file to go, for example:
hadoop fs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /whatever/20111201.txt
If the directory that you are putting the file in doesn't exist yet, you need to create it first:
hadoop fs -mkdir /whatever
The problem that you are having when you specify the path explicitly is that, on your system, /user/shekhar is a file, not a directory. You can see that because it has a non-zero size.
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
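Since /user/shekhar already exists as a file, one way forward (assuming that 6.5 MB file can be moved aside; the _old_file name is just an example) is:
# move the stray file out of the way, create the home directory, then retry the put
hadoop fs -mv /user/shekhar /user/shekhar_old_file
hadoop fs -mkdir /user/shekhar
hadoop fs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt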
shekhar@ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
You must create the directory first!
hdfs dfs -mkdir /user/hadoop
hdfs dfs -put /home/bigdata/.password /user/hadoop/
