Multiple source files for s3distcp - hadoop

Is there a way to copy a list of files from S3 to HDFS, instead of a complete folder, using s3distcp? This is for cases where srcPattern cannot work.
I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory, but I did not find any way to specify multiple source file paths to s3distcp.
The workaround I am currently using is to list all the file names in srcPattern:
hadoop jar s3distcp.jar \
--src s3n://bucket/src_folder/ \
--dest hdfs:///test/output/ \
--srcPattern '.*somefile.*|.*anotherone.*'
Can this approach still work when the number of files is very large, say around 10,000?

hadoop distcp should solve your problem.
We can use distcp to copy data from S3 to HDFS, and it supports wildcards as well as multiple source paths in a single command.
http://hadoop.apache.org/docs/r1.2.1/distcp.html
Go through the Usage section at that URL.
Example:
Suppose you have the following files in the S3 bucket test-bucket, inside the test1 folder:
abc.txt
abd.txt
defg.txt
And inside the test2 folder you have:
hijk.txt
hjikl.txt
xyz.txt
And your HDFS path is hdfs://localhost.localdomain:9000/user/test/.
Then the distcp command for that particular pattern is:
hadoop distcp s3n://test-bucket/test1/ab*.txt \
    s3n://test-bucket/test2/hi*.txt \
    hdfs://localhost.localdomain:9000/user/test/

Yes, you can. Create a manifest file containing all the files you need and use the --copyFromManifest option, as mentioned here.
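To scale this to thousands of files, the manifest can be generated programmatically. The sketch below builds one on the local filesystem; the bucket, folder, and file names are placeholders, and the JSON field layout (path/baseName/srcDir/size) is my assumption about what s3distcp expects, so compare it against a manifest produced by the --outputManifest option on your cluster before relying on it.

```shell
# Build one manifest line per file; bucket/folder/file names are examples only.
# The exact JSON schema is an assumption -- verify against a manifest emitted
# by s3distcp's --outputManifest option.
: > /tmp/manifest.json
for f in somefile.txt anotherone.txt; do
  printf '{"path":"s3n://bucket/src_folder/%s","baseName":"%s","srcDir":"s3n://bucket/src_folder","size":0}\n' \
    "$f" "$f" >> /tmp/manifest.json
done
cat /tmp/manifest.json

# On the cluster, point s3distcp at it (not run here):
# hadoop jar s3distcp.jar --src s3n://bucket/src_folder/ \
#   --dest hdfs:///test/output/ \
#   --copyFromManifest --previousManifest=file:///tmp/manifest.json
```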

Related

How to gzip compress a directory in hdfs without changing the name of the files

I need to gzip-compress a directory that will contain many files. As I can't modify the names of the files inside the directory, I can't use MapReduce. Is there any way, using the Java interface, to compress a directory without changing the names of the files inside it?

filename of the sqoop import

When we import from an RDBMS to HDFS using Sqoop, we give a target directory to store the data. Once the job completes, we see file names like part-m-00000 for the mapper output. Is there any way to choose the file name in which the data will be stored? Does Sqoop have an option like that?
According to this answer, you can pass arguments through to MapReduce with the -D option, which accepts file-name settings:
-Dmapreduce.output.basename=myoutputprefix
Although this changes the basename of your files, it will not change the part numbers.
Same answers on other sites:
cloudera
hadoopinrealworld
No, you can't rename it.
You can specify --target-dir <dir> to set the directory where all the data is imported.
In this directory you will see many part files (e.g. part-m-00000). These part files are created by the various mappers (recall the -m <number> option in your sqoop import command).
Since the data is imported in multiple files, how would you name each part file?
I do not see any additional benefit in this renaming.
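If different names are really needed, a common workaround is to rename the part files after the import finishes. The sketch below simulates that on the local filesystem with plain mv so it can run anywhere; on a real cluster you would use hadoop fs -mv instead, and the customers_ prefix is just an example name.

```shell
# Simulated sqoop output directory; on HDFS you'd use `hadoop fs -ls` and
# `hadoop fs -mv` instead of ls/mv.
rm -rf /tmp/sqoop_rename_demo
mkdir -p /tmp/sqoop_rename_demo
cd /tmp/sqoop_rename_demo
touch part-m-00000 part-m-00001        # stand-ins for the imported part files
for f in part-m-*; do
  mv "$f" "customers_${f#part-m-}"     # e.g. part-m-00000 -> customers_00000
done
ls
```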

Running custom jar file with input parameters on Amazon EMR

So, I am trying to run the WordCount Hadoop application on Amazon EMR. I have my own data file, which I uploaded to the abc bucket. I also added the wordcount.jar file under the abc bucket. Can anyone tell me, when we create the cluster, how we can give the path to the data file? Also, do we need to give the output directory path as well, and if yes, how can I give it?
The data file is passed in as a parameter to the JAR; the data file lives in the S3 bucket. The output also goes to an S3 bucket; in this case you can use the same bucket: just create an /output directory in the bucket and send all output there.
https://blog.safaribooksonline.com/2013/05/07/running-hadoop-mapreduce-jobs-on-amazon-emr/
"""Our WordCount JAR file will take the JAR’s main file, followed by the bucket name where you uploaded the input data and the output path. Note, that you only have to provide the paths, and not the precise file names. Also, make sure that no output file exists in the output path. The format for specifying input and output paths is: s3n:///path."""
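If you add the step from the AWS CLI rather than the console, the JAR and its input/output arguments can be supplied roughly like this. The cluster id, bucket name, and paths are placeholders, and the exact flag syntax should be checked against `aws emr add-steps help`; treat this as a sketch, not a tested invocation.

```shell
# Placeholder cluster id and bucket; verify flags with `aws emr add-steps help`.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=CONTINUE,\
Jar=s3://abc/wordcount.jar,Args=[s3n://abc/input/,s3n://abc/output/]
```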

Merging MapReduce output

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140102-r-00000.txt
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory, /output-complete/, where 20140102-r-00000.txt replaces the original file in the /output directory and the "-r-0000x" suffix is removed from every file name. The two original directories will then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140102.txt
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands? Do I need to create a java program to traverse both directories and do the logic?
You can use Pig:
get_data = LOAD '/output*/20140102*.txt' USING PigStorage();
STORE get_data INTO '/output-complete/20140102';
or an HDFS command:
hadoop fs -cat '/output*/20140102*.txt' | hadoop fs -put - /output-complete/20140102.txt
(If single quotes don't work in your shell, try double quotes.)
You can also use hadoop fs -getmerge to merge HDFS files, but note that it writes the merged result to the local filesystem.
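The replace-and-rename logic itself is simple enough to script. The sketch below runs on the local filesystem so the logic is visible and testable; on a real cluster you would swap the cp/basename calls for hadoop fs -cp and hadoop fs -ls. File names follow the question; treat this as a starting point, not a tested HDFS workflow.

```shell
# Local stand-ins for the two job output directories.
rm -rf /tmp/output /tmp/output-update /tmp/output-complete
mkdir -p /tmp/output /tmp/output-update /tmp/output-complete
echo old  > /tmp/output/20140102-r-00000.txt
echo keep > /tmp/output/20140103-r-00000.txt
echo new  > /tmp/output-update/20140102-r-00000.txt

for src in /tmp/output/*-r-*.txt; do
  base=$(basename "$src")
  day=${base%%-r-*}                        # 20140102-r-00000.txt -> 20140102
  if [ -e "/tmp/output-update/$base" ]; then
    # a file from /output-update replaces the original
    cp "/tmp/output-update/$base" "/tmp/output-complete/$day.txt"
  else
    cp "$src" "/tmp/output-complete/$day.txt"
  fi
done
```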

Hadoop read files with following name patterns

This may sound very basic, but I have a folder in HDFS with 3 kinds of files, e.g.:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files that begin with access-. How do I do that in the program, or by specifying the input directory path?
Please help.
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access-*"));
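Hadoop's input-path globs follow roughly the same rules as shell globs, so you can sanity-check a pattern locally before wiring it into FileInputFormat. The demo below uses the file names from the question:

```shell
# Create the three example files and check which ones the pattern matches.
rm -rf /tmp/glob_demo && mkdir -p /tmp/glob_demo
cd /tmp/glob_demo
touch access-02171990 s3.Log catalina.out
ls access-*            # only access-02171990 matches
```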
