MrJob spends a lot of time copying local files into HDFS

The problem I'm encountering is this:
Having already put my input.txt file (50 MB) into HDFS, I'm running:
python ./test.py hdfs:///user/myself/input.txt -r hadoop --hadoop-bin /usr/bin/hadoop
It seems that MrJob spends a lot of time copying files to HDFS (again?):
Copying local files into hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/
Is this logical? Shouldn't it use input.txt directly from HDFS?
(Using Hadoop version 2.6.0)

Look at the contents of hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/ and you will see that input.txt isn't the file that's being copied into HDFS.
What's being copied is mrjob's entire python directory, so that it can be unpacked on each of your nodes. (mrjob assumes that mrjob is not installed on each of the nodes in your cluster.)
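If mrjob really is installed on every node, recent mrjob versions should let you skip that upload via the bootstrap_mrjob option (or the --no-bootstrap-mrjob switch); check the docs for the release you are running. For context, a minimal test.py in the spirit of the question might look like the sketch below; the class and its logic are an assumption, not the asker's actual script.

    # Hypothetical stand-in for the asker's test.py: a minimal mrjob line counter.
    from mrjob.job import MRJob

    class MRLineCount(MRJob):

        def mapper(self, _, line):
            # One count per input line; input.txt itself stays on HDFS, only
            # mrjob's own support files land in tmp/mrjob/.../files/.
            yield "lines", 1

        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == "__main__":
        MRLineCount.run()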

Related

hdfs or hadoop command to sync files or folders between local and HDFS

I have local files that get added daily, so I want to sync the newly added files to HDFS.
I tried the command below, but it always does a complete copy; I want a command that copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package and perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own script that compares source/target file times and copies only the files that are newer; a sketch of that approach follows.
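The "write your own script" route can be as simple as comparing modification times on both sides. Below is a minimal sketch of that idea, assuming Python 3.7+, the hdfs CLI on the PATH, and the flat directory names from the question; it is an illustration, not a production sync tool.

    # Copy a local file to HDFS only if it is missing there or newer locally.
    import os
    import subprocess

    LOCAL_DIR = "/home/user/files"   # source directory from the question
    HDFS_DIR = "/data/files"         # target directory from the question

    def hdfs_mtime_ms(path):
        """Modification time in ms since the epoch, or None if the path is absent."""
        result = subprocess.run(["hdfs", "dfs", "-stat", "%Y", path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            return None
        return int(result.stdout.strip())

    for name in os.listdir(LOCAL_DIR):
        local_path = os.path.join(LOCAL_DIR, name)
        if not os.path.isfile(local_path):
            continue
        hdfs_path = HDFS_DIR + "/" + name
        local_ms = int(os.path.getmtime(local_path) * 1000)
        remote_ms = hdfs_mtime_ms(hdfs_path)
        if remote_ms is None or local_ms > remote_ms:
            # -f overwrites the HDFS copy when the local file is newer.
            subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path],
                           check=True)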

Copying local files to HDFS is very, very slow

I am trying to copy local files to HDFS using the following command:
$ hdfs dfs -put [local_path] [dfs_uri]
The command works and copies the files, but only at 2 KB/sec or even slower.
Any help appreciated :)
NOTE: Cloudera Express 5.5.1 (#8 built by jenkins on 20151201-1818 git: 2a7dfe22d921bef89c7ee3c2981cb4c1dc43de7b)

How to deploy & run Samza job on HDFS?

I want to get a Samza job running on a remote system, with the Samza job being stored on HDFS. The example for running a Samza job on a local machine (https://samza.apache.org/startup/hello-samza/0.7.0/) involves building a tar file, then unpacking the tar file, then running a shell script that's located within the tar file.
The example here for HDFS is not really well documented at all (https://samza.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html). It says to copy the tar file to HDFS and then to follow the other steps in the non-HDFS example.
That would imply that the tar file now residing on HDFS needs to be untarred within HDFS, and then a shell script run from that unpacked tar file. But you can't untar an HDFS tar file with the hadoop fs shell...
Without untarring the tar file, you don't have access to run-job.sh to initiate the Samza job.
Has anyone managed to get this to work please?
We deploy our Samza jobs this way: we have the Hadoop libraries in /opt/hadoop, the Samza shell scripts in /opt/samza/bin, and the Samza config file in /opt/samza/config. That config file contains this line:
yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz
When we want to deploy a new version of our Samza job, we just build the tgz archive, copy it (without untarring) to /deploy/samza/ on HDFS, and run /opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/$CONFIG_NAME.properties
The only downside is that the config files inside the archive are ignored: if you change the config in the archive, it has no effect; you have to change the config files in /opt/samza/config. On the other hand, we can change our Samza job's config without deploying a new tgz archive. The shell scripts under /opt/samza/bin stay the same from build to build, so you never need to untar the archive just to get at the scripts.
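For reference, a rough sketch of that deploy flow as a script; every path, the archive name, and the config name are assumptions based on the layout described in this answer, so adjust them to your own setup.

    # Sketch of the deploy flow described above, not an official Samza tool.
    import subprocess

    DIST_TGZ = "samzajobs-dist.tgz"                    # archive built by your project
    HDFS_TARGET = "hdfs://hadoop1:8020/deploy/samza/"  # must match yarn.package.path
    CONFIG = "/opt/samza/config/my-job.properties"     # hypothetical config name

    # 1. Upload the archive as-is (no untarring); -f overwrites the previous version.
    subprocess.run(["hdfs", "dfs", "-put", "-f", DIST_TGZ, HDFS_TARGET], check=True)

    # 2. Submit the job; YARN fetches and unpacks the tgz on the containers itself.
    subprocess.run([
        "/opt/samza/bin/run-job.sh",
        "--config-factory=org.apache.samza.config.factories.PropertiesConfigFactory",
        "--config-path=file://" + CONFIG,
    ], check=True)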
Good luck with Samzing! :-)

hadoop file system change directory command

I was going through the hadoop fs command list and am a little perplexed not to find any "cd" command.
Why is that? It might sound like a silly question to Hadoop users, but as a beginner I can't understand why there is no cd command at the hadoop fs level.
Think about it like this:
Hadoop has a special file system called HDFS which runs on top of an existing file system (say, Linux). There is no concept of a current or present working directory, a.k.a. pwd.
Let's say we have the following structure in HDFS:
d1/
d2/
f1
d3/
f2
d4/
f3
You can cd in your Linux file system to move from one directory to another, but do you think changing directories in Hadoop would make sense? HDFS is like a virtual file system: you don't interact with it directly except via the hadoop command or the job tracker.
That said, HDFS (the Hadoop file system) provides various features that make it easy to access from local machines or edge nodes. You have the option to mount HDFS using either of the following methods. Once the Hadoop file system is mounted on your machine, you can use the cd command to browse it (it's like mounting a remote network filesystem such as a NAS):
Fuse DFS (available from Hadoop 0.20 onwards)
NFSv3 gateway access to HDFS data (available from Hadoop 2.2.0)
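If you don't want to mount anything, nothing stops a client from tracking a "current directory" itself and always sending absolute paths to HDFS, which is all a cd could ever amount to here. A toy sketch of that idea, shelling out to the hdfs CLI (the class and starting path are made up for illustration):

    # Toy illustration: HDFS has no working-directory concept, so a "cd" has to
    # be emulated client-side by resolving everything to absolute paths.
    import posixpath
    import subprocess

    class HdfsShell:
        def __init__(self, start="/user"):
            self.cwd = start          # tracked purely on the client

        def cd(self, path):
            self.cwd = posixpath.normpath(posixpath.join(self.cwd, path))

        def ls(self):
            # Every call still sends an absolute path to the cluster.
            subprocess.run(["hdfs", "dfs", "-ls", self.cwd], check=True)

    sh = HdfsShell()
    sh.cd("myself")   # now /user/myself
    sh.ls()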

Issues when using hadoop to copy files from grid to local

I am trying to copy some files from HDFS to my local machine. I used the following command:
hadoop fs -copyToLocal <hdfs path> <local path>
The file is just 80 MB. I had run a job before where I had no issue copying 70 MB files to local. This time, however, I'm getting an input/output error:
copyToLocal: Input/output error
Can anyone tell me what could have gone wrong?
It might be a space constraint on your machine. I had the same issue because the file was too big to be moved to my local machine. Once I made space, I was able to perform the copyToLocal operation.
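A quick way to rule that out before copying is to check the free space on the local target; a small sketch using only Python's standard library (the path is an example):

    # Check free space on the local destination before running copyToLocal.
    import shutil

    total, used, free = shutil.disk_usage("/home/user")   # example local path
    print("free: %.1f MB" % (free / 1e6))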

Resources