I learned that if you want to copy multiple files from one Hadoop folder to another, it is better to build one big 'hdfs dfs -cp' statement with many source arguments than to run several separate hdfs dfs -cp statements.
By 'better' I mean that it improves the overall time it takes to copy the files: one command is quicker than several separate -cp commands run one after the other.
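In other words, instead of running several separate statements like these (using the same paths as the command below)
hdfs dfs -cp /path1/file1 /pathx/target
hdfs dfs -cp /path2/file2 /pathx/target
hdfs dfs -cp /path3/file3 /pathx/target
I now issue a single -cp with all the sources listed up front.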
When I do this and the target directory is the same for all the files I want to copy, I get a warning.
I'm executing the following command:
hdfs dfs -cp -f /path1/file1 /pathx/target /path2/file2 /pathx/target /path3/file3 /pathx/target
After executing it I get the following warning returned:
cp: `/pathx/target' to `/pathx/target/target': is a subdirectory of itself
Although I get this weird warning, the copy itself succeeds as it should.
Is this a bug or am I missing something?
With -cp, everything except the last argument is treated as a source and the last argument is the destination, so repeating /pathx/target makes it a source as well as the destination - that is what triggers the warning. Try the following syntax instead, listing each source once and the target only at the end:
hadoop fs -cp /path1/file1 /path2/file2 /path3/file3 /pathx/target
Or you could do it like this (the brace expansion is done by your local shell before Hadoop sees the arguments, so don't put spaces inside the braces):
hadoop fs -cp /path1/{file1,file2,file3} /pathx/target
If you want to copy all the files then:
hadoop fs -cp /path1/* /pathx/target
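Whichever form you use, a quick listing of the target confirms what was copied:
hadoop fs -ls /pathx/target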
Related
I'm sorry if this is a rather simple question, but I haven't found anything exactly online and just needed a quick answer.
I am trying to copy files from one HDFS directory to a new directory to make a backup. I was given something like this:
hadoop fs -mkdir one/two/three/dir1_bkp
hadoop fs -cp one/two/three/dir1/* one/two/three/dir1_bkp
This should only copy all of the files in dir1 to dir1_bkp and not affect anything in dir1, correct?
Copying doesn't affect the source location, no.
Depending on the size of the data, distcp might be a better option
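For example, a distcp version of the same backup would look something like this (a sketch reusing the paths from the question; distcp runs the copy as a MapReduce job, so it scales better for large directories; note that, like cp -r, it nests the source directory under the destination if the destination already exists):
hadoop distcp one/two/three/dir1 one/two/three/dir1_bkp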
I was running hadoop distcp to copy a whole directory (500GB+) from /path/to/source to /path/to/destination. However, instead of running
$ hadoop distcp /path/to/source /path/to/destination
I did the following by mistake:
$ hadoop distcp /path/to/source path/to/destination
The operation completed like a normal distcp copy, with mapreduce taking some time to run, and of course I did not get my data in /path/to/destination. It was also not in /path/to/source/path/to/destination, or other relative paths I could think of.
Where did the data go? Thanks.
It doesn't go anywhere; if the destination path is not correct, the data stays in the source location.
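If you want to reassure yourself, a listing of the source (the path from the question) should still show everything that was there before the copy:
hdfs dfs -ls /path/to/source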
I need to move files written by a Hive job that look like this
/foo/0000_0
/foo/0000_1
/bar/0000_0
into a file structure that looks like this
/foo/prefix1/prefix2-0000_0
/foo/prefix1/prefix2-0000_1
/bar/prefix1/prefix2-0000_0
before migrating this out of the cluster (using s3distcp). I've been looking around hadoop fs but I can't find something that would let me do this. I don't want to rename file by file.
First, you need to create the subdirectory inside /foo. For this, use the following command:
$ hdfs dfs -mkdir /foo/prefix1
This will create a subdirectory in /foo. If you want more subdirectories inside prefix1, repeat the command with the updated path (or use -mkdir -p to create the whole path in one go). In case you are using an older version of Hadoop (1.x), replace hdfs dfs with hadoop fs.
Now you can move files from /foo to /foo/prefix1 using the following command. Here newfilename can be any name you want to give to your file:
$ hdfs dfs -mv /foo/filename /foo/prefix1/newfilename
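If there are many files and you don't want to repeat the -mv for each one, a small shell loop can do the renaming in one pass. This is only a sketch built around the layout in the question (prefix1, prefix2 and the 0000_* names stand in for your real values):
for dir in /foo /bar; do
  # make sure the target subdirectory exists (-p also creates missing parents)
  hdfs dfs -mkdir -p "$dir/prefix1"
  # list the Hive output files and move each one under prefix1 with the prefix2- prefix
  for f in $(hdfs dfs -ls "$dir" | awk '{print $NF}' | grep "^$dir/0000_"); do
    hdfs dfs -mv "$f" "$dir/prefix1/prefix2-$(basename "$f")"
  done
done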
Hope this answers your query.
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder (the one that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed mode), you have to make sure hadoop's bin directory and the related environment variables are on your path. On linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is running, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - relative paths are implicitly resolved against your HDFS user dir. Also, if you're using a client machine to run commands against a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file/s that contain text. I used some random files from the gutenberg site.
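If you don't have an input file yet, any small text file is enough to test with (the path here just matches the copyFromLocal example above):
mkdir -p ~/data
echo "the quick brown fox jumps over the lazy dog" > ~/data/testfile.txt
After that, upload it with the copyFromLocal command shown earlier.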
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in data/ folder (can have one or many files) and write everything to output/wc folder - all on HDFS. If you run this in pseudo-dist, no need to copy anything - just point it to proper input and output dirs. Make sure the wc dir doesn't exist or your job will crash (cannot write over existing dir). See this for a better wordcount breakdown.
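When the job is done, you can print the word counts straight from HDFS (part file names differ between Hadoop versions, hence the wildcard):
hadoop fs -cat /user/hadoopuser/output/wc/part*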
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
I'm trying to execute the code from: http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/
javac shows no errors, but I don't know how to get the output. These are the execution steps I'm following:
$ javac -Xlint -classpath /home/james/Downloads/hadoop-0.20.203.0/hadoop-core-0.20.203.0.jar -d doc WikiPageRanking.java
$ jar -cvf WikiPageRanking.jar -C doc/ .
$ bin/hadoop dfs -mkdir /user/james/wiki/in
$ bin/hadoop dfs -copyFromLocal wiki-micro.txt /user/james/wiki/in
$ bin/hadoop jar WikiPageRanking.jar org.myorg.WikiPageRanking /user/james/wiki/in /user/james/wiki/result
Is this right? I seriously doubt the last step - the input and output paths!! In the code, they have used wiki/in, that's why I gave the same path here, and I have copied my sample dataset to this path. The map reduce process starts, but I get no output!!
What do the following commands give you:
hadoop fs -ls /user/james/wiki/result
hadoop fs -text /user/james/wiki/result/part*
Running a job does not automatically dump the results of the job to the console - they are most typically stored in HDFS (in your case in the path /user/james/wiki/result). You can view the contents of this directory using the first command, and assuming there are some part* files, the second command will dump their contents to the console.
Final point to note - if the output format is SequenceFileOutputFormat and you're using custom key / value objects, you'll need to amend the second command to include your jar:
hadoop fs -libjars WikiPageRanking.jar -text /user/james/wiki/result/part*