Spring XD stream creates only empty .tmp files - hadoop

I'm trying to get Spring XD working with the Hortonworks Sandbox VM.
Everything went smoothly until my first test stream:
xd:>stream create --name ticktockhdfs --definition "time | hdfs"
xd:>stream destroy --name ticktockhdfs
xd:>hadoop fs ls /xd/ticktockhdfs
-rw-r--r-- 3 user hdfs 0 2014-04-03 22:05 /xd/ticktockhdfs/ticktockhdfs-0.txt.tmp
-rw-r--r-- 3 user hdfs 0 2014-04-03 22:07 /xd/ticktockhdfs/ticktockhdfs-1.txt.tmp
-rw-r--r-- 3 user hdfs 0 2014-04-03 22:38 /xd/ticktockhdfs/ticktockhdfs-2.txt.tmp
-rw-r--r-- 3 user hdfs 0 2014-04-03 22:49 /xd/ticktockhdfs/ticktockhdfs-3.txt.tmp
The files keep their .tmp extension and are empty.
On the XD admin console I can see this error:
could only be replicated to 0 nodes instead of 1
What could be wrong?

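The problem was in the VirtualBox network configuration. I switched the VM's adapter from NAT to host-only and it started working. Presumably, with NAT the XD host could reach the NameNode but not the DataNode, so no block could be written.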
This video can be helpful: https://www.youtube.com/watch?v=xG3nQAfkEyM&feature=youtu.be

Related

HDFS NFS locations using weird numerical username values for directory permissions

Seeing nonsense values for user names in folder permissions for NFS mounted HDFS locations, while the HDFS locations themselves (using Hortonworks HDP 3.1) appear fine. Eg.
➜ ~ ls -lh /nfs_mount_root/user
total 6.5K
drwx------. 3 accumulo hdfs 96 Jul 19 13:53 accumulo
drwxr-xr-x. 3 92668751 hadoop 96 Jul 25 15:17 admin
drwxrwx---. 3 ambari-qa hdfs 96 Jul 19 13:54 ambari-qa
drwxr-xr-x. 3 druid hadoop 96 Jul 19 13:53 druid
drwxr-xr-x. 2 hbase hdfs 64 Jul 19 13:50 hbase
drwx------. 5 hdfs hdfs 160 Aug 26 10:41 hdfs
drwxr-xr-x. 4 hive hdfs 128 Aug 26 10:24 hive
drwxr-xr-x. 5 h_etl hdfs 160 Aug 9 14:54 h_etl
drwxr-xr-x. 3 108146 hdfs 96 Aug 1 15:43 ml1
drwxrwxr-x. 3 oozie hdfs 96 Jul 19 13:56 oozie
drwxr-xr-x. 3 882121447 hdfs 96 Aug 5 10:56 q_etl
drwxrwxr-x. 2 spark hdfs 64 Jul 19 13:57 spark
drwxr-xr-x. 6 zeppelin hdfs 192 Aug 23 15:45 zeppelin
➜ ~ hadoop fs -ls /user
Found 13 items
drwx------ - accumulo hdfs 0 2019-07-19 13:53 /user/accumulo
drwxr-xr-x - admin hadoop 0 2019-07-25 15:17 /user/admin
drwxrwx--- - ambari-qa hdfs 0 2019-07-19 13:54 /user/ambari-qa
drwxr-xr-x - druid hadoop 0 2019-07-19 13:53 /user/druid
drwxr-xr-x - hbase hdfs 0 2019-07-19 13:50 /user/hbase
drwx------ - hdfs hdfs 0 2019-08-26 10:41 /user/hdfs
drwxr-xr-x - hive hdfs 0 2019-08-26 10:24 /user/hive
drwxr-xr-x - h_etl hdfs 0 2019-08-09 14:54 /user/h_etl
drwxr-xr-x - ml1 hdfs 0 2019-08-01 15:43 /user/ml1
drwxrwxr-x - oozie hdfs 0 2019-07-19 13:56 /user/oozie
drwxr-xr-x - q_etl hdfs 0 2019-08-05 10:56 /user/q_etl
drwxrwxr-x - spark hdfs 0 2019-07-19 13:57 /user/spark
drwxr-xr-x - zeppelin hdfs 0 2019-08-23 15:45 /user/zeppelin
Notice that for users ml1 and q_etl, ls on the NFS mount shows numerical user values rather than their user names.
Even doing something like...
[hdfs@HW04 ml1]$ hadoop fs -chown ml1 /user/ml1
does not change the NFS permissions. Even more annoying, when trying to change the NFS mount permissions as root, we see
[root@HW04 ml1]# chown ml1 /nfs_mount_root/user/ml1
chown: changing ownership of ‘/nfs_mount_root/user/ml1’: Permission denied
This causes real problems, since the differing uid means that I can't access these dirs even as the "correct" user to write to them. Not sure what to make of this. Anyone with more Hadoop experience have any debugging suggestions or fixes?
UPDATE:
Doing a bit more testing / debugging, found that the rules appear to be...
If the NFS server node has no uid (or gid?) that matches the uid of the user on the node accessing the NFS mount, we get the weird uid values as seen here.
If there is a uid associated to the username of the user on the requesting node, then that is the uid user that we see assigned to the location when accessing via NFS (even if that uid on the NFS server node is not actually for the requesting user), eg.
[root@HW01 ~]# clush -ab id ml1
---------------
HW[01,04] (2)
---------------
uid=1025(ml1) gid=1025(ml1) groups=1025(ml1)
---------------
HW[02-03] (2)
---------------
uid=1027(ml1) gid=1027(ml1) groups=1027(ml1)
---------------
HW05
---------------
uid=1026(ml1) gid=1026(ml1) groups=1026(ml1)
[root@HW01 ~]# exit
logout
Connection to hw01 closed.
➜ ~ ls -lh /hdpnfs/user
total 6.5K
...
drwxr-xr-x. 6 atlas hdfs 192 Aug 27 12:04 ml1
...
➜ ~ hadoop fs -ls /user
Found 13 items
...
drwxr-xr-x - ml1 hdfs 0 2019-08-27 12:04 /user/ml1
...
[root@HW01 ~]# clush -ab id atlas
---------------
HW[01,04] (2)
---------------
uid=1027(atlas) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW[02-03] (2)
---------------
uid=1024(atlas) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1005(atlas) gid=1006(hadoop) groups=1006(hadoop)
If you are wondering why users on the cluster have varying uids across the cluster nodes, see the problem posted here: How to properly change uid for HDP / ambari-created user? (note that these odd uid settings for the hadoop service users were set up by Ambari by default).
After talking with someone more knowledgeable in HDP hadoop, I found that the problem is that when Ambari was set up and run to initially install the hadoop cluster, there may have been other preexisting users on the designated cluster nodes.
Ambari creates its various service users by giving them the next available UID from a node's available block of user UIDs. However, prior to installing Ambari and HDP on the nodes, I created some users on the to-be namenode (and others) in order to do some initial maintenance checks and tests. I should have just done this as root. Adding these extra users offset the UID counter on those nodes, so as Ambari created users on the nodes and incremented the UIDs, it was starting from different counter values. Thus, the UIDs did not sync, which caused problems with HDFS NFS.
To fix this, I...
Used Ambari to stop all running HDP services
Went to Service Accounts in Ambari and copied all of the expected service user name strings
For each user, ran something like id <service username> to get the group(s) for that user. For service groups (which may have multiple members), you can do something like grep 'group-name-here' /etc/group. I recommend doing it this way, as the Ambari docs on default users and groups lack some of the info that you can get here.
Used userdel and groupdel to remove all the Ambari service users and groups
Recreated all the groups across the cluster
Recreated all the users across the cluster (you may need to specify the UID if some nodes have users that others do not)
Restarted the HDP services (hopefully everything still runs as if nothing happened, since HDP should be looking up the literal user name strings, not the UIDs)
For the last parts, you can use something like clustershell, e.g.
# remove user
$ clush -ab userdel <service username>
# check that the UID you want to use is actually available on all nodes
$ clush -ab id <some specific UID you want to use>
# assign that UID to a new service user
$ clush -ab useradd --uid <the specific UID> --gid <groupname> <service username>
To get the next available UID on each node (take the largest result across nodes to get one that is free everywhere), I used...
# for UID
getent passwd | awk -F: '($3>1000) && ($3<10000) && ($3>maxuid) { maxuid=$3; } END { print maxuid+1; }'
# for GID (read the group database, where the GID is the third field)
getent group | awk -F: '($3>1000) && ($3<10000) && ($3>maxgid) { maxgid=$3; } END { print maxgid+1; }'
Ambari also creates some /home dirs for users. Once you are done recreating the users, you will need to fix the ownership of those dirs, since they still belong to the old UIDs (clush can help there as well).
* Note that this was a huge pain, and you would need to manually correct the UIDs of users whenever you added another cluster node. I did this for a test cluster, but for production (or even a larger test) you should just use Kerberos or SSSD + Active Directory.
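The per-node awk one-liners can be wrapped into one small helper, which makes it easier to compare nodes. This is a minimal sketch and not part of the original fix: the function name next_free_id is hypothetical, and the 1000 to 10000 window is simply the one used in the commands above, so adjust it to your own UID policy. It reads passwd- or group-format lines on stdin, since the numeric ID is the third field in both.

```shell
# Hypothetical helper: print the next unused numeric ID above the
# highest one found strictly inside the window (lo, hi).
# Reads passwd- or group-format lines on stdin; field 3 holds the ID.
next_free_id() {
  awk -F: -v lo="${1:-1000}" -v hi="${2:-10000}" '
    { id = $3 + 0 }                                  # force numeric comparison
    (id > lo + 0) && (id < hi + 0) && (id > max) { max = id }
    END { print (max ? max + 1 : lo + 1) }           # empty window: start at lo+1
  '
}
```

Run it on every node, for example as getent passwd | next_free_id and getent group | next_free_id, then use the largest value reported so the ID is free on all nodes.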

Hadoop archive file cannot be looked up using hadoop fs -ls har://hdfs-master/tank/zoo.har/

Here are my files on HDFS:
hadoop fs -ls /
Found 5 items
-rw-r--r-- 3 hadoop supergroup 25 2016-04-18 11:29 /abc.txt
drwxr-xr-x - hadoop supergroup 0 2016-04-17 11:39 /hbase
drwxr-xr-x - hadoop supergroup 0 2016-04-18 11:49 /tank
drwx------ - hadoop supergroup 0 2016-04-18 11:30 /tmp
-rw-r--r-- 3 hadoop supergroup 66 2016-04-18 11:29 /user.txt
hadoop fs -ls /tank/
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2016-04-18 11:49 /tank/zoo.har
When I run
hadoop fs -ls har://hdfs-master/zoo.har/
I get this response:
ls: Invalid path for the Har Filesystem. No index file in
har://hdfs-master/zoo.har
Please help me out! Thanks!
I guess there are two formats for accessing these files or directories.
The first is as follows:
hadoop fs -lsr har:///tank/zoo.har/
The other:
hadoop fs -lsr har://hdfs-master/tank/zoo.har/
By the way, are you sure your host is named master and that the HDFS daemon is listening on the default port? The second format means har://hdfs-host:port/path/to/somewhere.
I forgot to add the parent path to the har URL; it should be har:///parent-path/har-path!
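The two URL forms discussed above can be built from the archive's full HDFS path, which makes it harder to drop the parent directory by accident. A minimal sketch: har_uri is a hypothetical helper name, the host and port in the usage line are placeholders, and the authority form assumes the archive sits on an hdfs:// filesystem.

```shell
# Hypothetical helper: build a har:// URI from the absolute HDFS path
# of the archive. With no second argument the default filesystem is
# used (har:///path). Otherwise the authority is the underlying scheme,
# a dash, then host[:port]; hdfs is assumed as the underlying scheme.
har_uri() {
  case "$1" in
    /*) ;;
    *) echo "har_uri: archive path must be absolute: $1" >&2; return 1 ;;
  esac
  if [ -n "${2:-}" ]; then
    printf 'har://hdfs-%s%s\n' "$2" "$1"
  else
    printf 'har://%s\n' "$1"
  fi
}
```

For example, hadoop fs -ls "$(har_uri /tank/zoo.har)" lists the archive through the default filesystem, while har_uri /tank/zoo.har master:9000 produces the explicit form for a namenode assumed to be at master:9000.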

Too many small files HDFS Sink Flume

agent.sinks=hpd
agent.sinks.hpd.type=hdfs
agent.sinks.hpd.channel=memoryChannel
agent.sinks.hpd.hdfs.path=hdfs://master:9000/user/hduser/gde
agent.sinks.hpd.hdfs.fileType=DataStream
agent.sinks.hpd.hdfs.writeFormat=Text
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.batchSize=1000
agent.sinks.hpd.hdfs.fileSuffix=.i
agent.sinks.hpd.hdfs.rollCount=1000
agent.sinks.hpd.hdfs.rollInterval=0
I'm trying to use the HDFS sink to write events to HDFS. I have tried size-, count- and time-based rolling, but none of them works as expected. It generates too many small files in HDFS, like:
-rw-r--r-- 2 hduser supergroup 11617 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832879.i
-rw-r--r-- 2 hduser supergroup 1381 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832880.i
-rw-r--r-- 2 hduser supergroup 553 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832881.i
-rw-r--r-- 2 hduser supergroup 2212 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832882.i
-rw-r--r-- 2 hduser supergroup 1379 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832883.i
-rw-r--r-- 2 hduser supergroup 2762 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832884.i.tmp
Please help me resolve this problem. I'm using Flume 1.6.0.
~Thanks
The configuration I provided was correct. The cause of this behavior was HDFS itself: I had two data nodes, one of which was down, so files were not reaching the minimum required replication. The Flume logs also show this warning:
"Block Under-replication detected. Rotating file."
To fix this, use either of the following solutions:
Bring the data node back up to achieve the required block replication, or
Set the hdfs.minBlockReplicas property accordingly.
~Thanks
You are currently rolling the files every 1000 events. You can try either of the two methods below.
Increase hdfs.rollCount to a much higher value; this value determines the number of events contained in each rolled file.
Set hdfs.rollCount to 0 (removing the property would fall back to its default of 10 events) and set hdfs.rollInterval to the interval at which you want to roll the file, e.g. hdfs.rollInterval = 600 to roll the file every 10 minutes.
For more information, refer to the Flume documentation.
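Before changing the roll settings, it can help to confirm whether under-replication is what triggers the extra rolls. A minimal sketch that counts the warning quoted above in an agent log read from stdin; the function name is hypothetical, and where the log lives depends on your installation.

```shell
# Hypothetical helper: count how often the HDFS sink rolled a file
# because of under-replication. Reads a Flume agent log on stdin.
# grep -c exits non-zero when the count is 0, so the status is masked.
count_underreplication_rolls() {
  grep -c 'Block Under-replication detected' || true
}
```

Feed it the agent log, e.g. count_underreplication_rolls < flume.log (the file name is a placeholder). A non-zero count points at replication rather than at the roll settings; in that case either restore the missing data node or add a line such as agent.sinks.hpd.hdfs.minBlockReplicas=1 to the sink, where the value 1 is an assumption to adapt to your cluster.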

How to put a file to hdfs with secondary group?

I have a local file
-rw-r--r-- 1 me developers 102445154 Oct 22 10:02 file1.csv
which I'm attempting to put to hdfs:
/usr/bin/hdfs dfs -put ./file1.csv hdfs://000.00.00.00/user/me/
which works fine, but the group is wrong
-rw-r--r-- 3 me me 102445154 2013-10-22 10:23 hdfs://000.00.00.00/user/file1.csv
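How do I get the group developers to carry over?
Use chgrp on the file after the put, e.g. hdfs dfs -chgrp developers /user/me/file1.csv. HDFS does not copy the local file's group: a new file gets the group of its parent directory.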
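The put and the group change can be combined in one step. A minimal sketch: put_with_group is a hypothetical function name, and the HDFS_CMD variable is an assumption added here only so the commands can be previewed (HDFS_CMD=echo) without a cluster.

```shell
# Hypothetical helper: upload a local file into an HDFS directory and
# then set its group. HDFS_CMD defaults to the hdfs binary; set it to
# echo to print the commands instead of running them.
put_with_group() {
  src=$1
  dest_dir=$2
  group=$3
  hdfs_cmd=${HDFS_CMD:-hdfs}
  "$hdfs_cmd" dfs -put "$src" "$dest_dir" &&
    "$hdfs_cmd" dfs -chgrp "$group" "${dest_dir%/}/$(basename "$src")"
}
```

For the file in the question this would be put_with_group ./file1.csv /user/me/ developers. Changing the group requires being the file's owner and a member of the target group, or the HDFS superuser.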

How do I configure Hadoop such that each datanode uses a different directory?

How do I configure Hadoop such that each datanode uses a different directory for storage?
All the datanodes share one storage space. I'd like datanode1 to use dir1 and datanode2 to use dir2. At first I configured all the datanodes to use the same directory in the shared storage, and it turned out that only one datanode was running.
You'll need a custom hdfs-site.xml file for each node in your cluster, with the data directory property (dfs.data.dir) configured appropriately. If you're currently using a shared directory for the hadoop configuration as well, then you'll need to change how you do that too.
This is somewhat painful; I guess you could use some shell scripting to generate the files, or a tool such as Puppet or Chef.
A question back at you: why are you using NFS? You're somewhat defeating the point of data locality. Hadoop is designed to move your code to where the data is, not (as in your case) both the code and the data.
If you're using NFS because it's backed by some SAN array with data redundancy, then again you're making things difficult for yourself. HDFS will manage data replication for you, assuming you have a big enough cluster and it's properly configured. In theory it should also cost less to use commodity hardware than to back the cluster with an expensive SAN (depending on your setup and situation, I guess).
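The shell scripting route for generating a per-node hdfs-site.xml could look roughly like this. A minimal sketch: render_hdfs_site is a hypothetical function name, the directory layout in the usage note is made up, and the file contains only the one property discussed here, so a real configuration would need your other settings merged in. Newer Hadoop releases call the property dfs.datanode.data.dir instead of dfs.data.dir.

```shell
# Hypothetical helper: print a minimal hdfs-site.xml that points the
# datanode at the directory given as the first argument.
render_hdfs_site() {
  data_dir=$1
  cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>$data_dir</value>
  </property>
</configuration>
EOF
}
```

A loop such as for n in 1 2; do render_hdfs_site "/shared/dir$n" > "conf-datanode$n/hdfs-site.xml"; done would then write one file per node, with each node started against its own configuration directory.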
I don't know if this is a crude way of doing it, but this is how I customized the slaves.sh file on the namenode to give each datanode a different directory structure:
Edit the ssh remote command executed on each datanode in $HADOOP_HOME/bin/slaves.sh:
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
  # If the slave node is ap1001 (first datanode),
  # then use a different directory path for the SSH command.
  if [ "$slave" = "ap1001" ]
  then
    input=`/bin/echo $"${@// /\\ }"` >/dev/null 2>&1
    # If the command type is start-dfs (start the datanodes),
    # then construct the start command for remote execution on the datanode through ssh.
    /bin/echo $input | grep -i start
    if [ $? -eq 0 ]
    then
      inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf start datanode"
    else
      # If the command type is stop-dfs (stop the datanodes),
      # then construct the stop command for remote execution on the datanode through ssh.
      inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf stop datanode"
    fi
    ssh $HADOOP_SSH_OPTS $slave $inputArg 2>&1 &
  else
    # Use the default command for the remaining slaves.
    ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
      2>&1 | sed "s/^/$slave: /" &
  fi
  if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
    sleep $HADOOP_SLAVE_SLEEP
  fi
done
You can have datanodes and namenodes share common storage by creating soft links like below:
host1:
lrwxrwxrwx 1 user user 39 Dec 2 17:30 /hadoop/hdfs/datanode -> /shared_storage/datanode1/
lrwxrwxrwx 1 user user 39 Dec 2 17:31 /hadoop/hdfs/namenode -> /shared_storage/namenode1/
host2:
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/datanode -> /shared_storage/datanode2/
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode2/
host3:
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode3/
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode3/
host4:
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode4/
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/namenode -> /shared_storage/namenode4/
In hdfs-site.xml on each host:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///hadoop/hdfs/datanode</value>
</property>
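The soft links in the listing above could be created per host along these lines. A minimal sketch: link_storage is a hypothetical function name, the defaults mirror the paths in the listing, and it assumes /hadoop/hdfs/datanode and /hadoop/hdfs/namenode do not already exist as real directories on the host.

```shell
# Hypothetical helper: point the local HDFS directories of host number
# N at that host's own subdirectories on the shared storage.
# Usage: link_storage N [shared_root] [local_root]
link_storage() {
  n=$1
  shared=${2:-/shared_storage}
  local_root=${3:-/hadoop/hdfs}
  mkdir -p "$local_root" "$shared/datanode$n" "$shared/namenode$n" || return 1
  # -n replaces an existing symlink instead of descending into its target
  ln -sfn "$shared/datanode$n" "$local_root/datanode" &&
    ln -sfn "$shared/namenode$n" "$local_root/namenode"
}
```

Running link_storage 1 on host1, link_storage 2 on host2, and so on lets every host keep the same hdfs-site.xml while writing to its own directory on the shared storage.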

Resources