Is it possible to run hadoop fs -getmerge in S3? - hadoop

I have an Elastic MapReduce job which is writing some files to S3, and I want to concatenate all of them to produce a single text file.
Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then running hadoop fs -getmerge followed by hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?

Actually, this response about getmerge is incorrect. getmerge expects a local destination and will not work with S3: it throws an IOException if you try, reporting -getmerge: Wrong FS:.
Usage:
hadoop fs [generic options] -getmerge [-nl] <src> <localdst>

An easy way (if you are generating a small file that fits on the master machine) is to do the following:
1. Merge the file parts into a single file on the local machine (Documentation):
hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
2. Move the resulting file to S3, which also removes the local copy (Documentation):
hadoop dfs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
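If you want to avoid staging the merged file on the master's local disk at all, the same stdin/stdout trick shown further down for Google Cloud Storage should also work against S3; the bucket and key names below are placeholders, not taken from the question:
hadoop fs -cat s3n://bucket/path/to/part-* | \
hadoop fs -put - s3n://bucket/path/to/merged-file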

I haven't tried the getmerge command myself, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:
hadoop fs -ls s3://<my_bucket>/<my_dir>/
The above command will list all the S3 objects under the specified directory path.
I would expect hadoop fs -getmerge to work the same way, so just use full S3 paths (starting with s3://) instead of HDFS paths.

Related

Hadoop distcp with file list

I would like to use distcp to copy a list of files (> 1K files) into HDFS. I have already stored the list of files in a local directory; can I use -f to copy all of them? If yes, what format do I have to maintain in my file list? Or is there a better way?
You don't have to use distcp if your use case is copying data from the local filesystem (say Linux) to HDFS. You can simply use the hdfs dfs -put command for that. Here is the syntax:
hdfs dfs -put /path/to/local/dir/* /path/on/hdfs/
e.g.
hdfs dfs -mkdir /user/hduser/destination-dir/
hdfs dfs -put /home/abc/mydir/* /user/hduser/destination-dir/
You have created a file containing a list of file paths, but that is not needed at all. The -f option is mainly used (for distcp) when you are copying data from one cluster to another, as in the sketch below.
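For completeness, a rough sketch of what distcp -f expects: the list file is referenced by URI and contains one fully qualified source URI per line. The hosts, ports and paths here are placeholders, not taken from the question.
# filelist.txt (stored on the source cluster), one source URI per line, e.g.:
#   hdfs://source-nn:8020/data/file1
#   hdfs://source-nn:8020/data/file2
hadoop distcp -f hdfs://source-nn:8020/user/me/filelist.txt hdfs://dest-nn:8020/target/dir/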

how do you perform hadoop fs -getmerge on dataproc from google storage

How do you use getmerge on Dataproc for part files which are dumped to a Google Storage bucket?
If I try this hadoop fs -getmerge gs://my-bucket/temp/part-* gs://my-bucket/temp_merged
I get an error
getmerge: /temp_merged (Permission denied)
It works fine for hadoop fs -getmerge gs://my-bucket/temp/part-* temp_merged but that of course writes the merged file on the cluster machine and not in GS.
According to the FsShell documentation, the getmerge command fundamentally treats the destination path as a "local" path. So in gs://my-bucket/temp_merged it ignores the "scheme" and "authority" components and tries to write directly to your local filesystem path /temp_merged. This is not specific to the GCS connector; you'll see the same thing if you try hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///temp_merged. Even worse, if you try something like hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///tmp/temp_merged, you may think it succeeded when in fact the file did not appear inside hdfs:///tmp/temp_merged, but instead appeared under your local filesystem, file:///tmp/temp_merged.
You can instead make use of piping through stdout/stdin to make it happen. Unfortunately -getmerge doesn't play well with /dev/stdout due to permissions and its use of .crc files, but you can achieve the same effect with hadoop fs -put, which supports reading from stdin:
hadoop fs -cat gs://my-bucket/temp/part-* | \
hadoop fs -put - gs://my-bucket/temp_merged
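As a quick sanity check once the pipeline finishes, you can list and peek at the merged object (same bucket and paths as in the question):
hadoop fs -ls gs://my-bucket/temp_merged
hadoop fs -cat gs://my-bucket/temp_merged | head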

Loading files into hadoop

I have a directory structure with data on a local filesystem. I need to replicate it to a Hadoop cluster.
So far I have found three ways to do it:
using "hdfs dfs -put" command
using hdfs nfs gateway
mounting my local dir via nfs on each datanode and using distcp
Am I missing any other tools? Which one of these would be the fastest way to make a copy?
I think hdfs dfs -put or hdfs dfs -copyFromLocal would be the simplest way of doing it.
If you have a lot of data (many files), you can copy them programmatically:
Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);
// recursively copies the local directory into the target HDFS directory
fs.copyFromLocalFile(new Path("/home/me/localdirectory/"), new Path("/me/hadoop/hdfsdir"));

Using multiple local folders as source in hadoop mapreduce job

I have data in multiple local folders, i.e. /usr/bigboss/data1, /usr/bigboss/data2 and many more. I want to use all of these folders as the input source for my MapReduce job and store the result in HDFS. I cannot find a working command to use the Hadoop Grep example to do it.
The data will need to reside in HDFS for you to process it with the grep example. You can upload the folders to HDFS using the -put FsShell command:
hadoop fs -mkdir bigboss
hadoop fs -put /usr/bigboss/data* bigboss
This will create a folder in the current user's HDFS home directory and upload each of the data directories into it.
Now you should be able to run the grep example over the data, as sketched below.
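A rough sketch of that invocation; the examples jar name and location vary by Hadoop version and distribution, and the output directory and regex here are placeholders:
# run the bundled Grep example over the uploaded data (jar path is an assumption)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep bigboss grep-output 'foo[a-z.]+'
# inspect the result in HDFS
hadoop fs -cat grep-output/part-*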

hadoop dfs -ls complains

Can anyone let me know what seems to be wrong here? The hadoop dfs command itself seems to be OK, but any following options are not recognized.
[hadoop-0.20]$bin/hadoop dfs -ls ~/wordcount/input/
ls: Cannot access /home/cloudera/wordcount/input/ : No such file or directory
hadoop fs -ls /some/path/here - will list an HDFS location, not your local Linux location
First try this command:
hadoop fs -ls /
then investigate other folders step by step.
If you want to copy some files from a local directory to the users directory on HDFS, then just use this:
hadoop fs -mkdir /users
hadoop fs -put /some/local/file /users
for more hdfs commands see this: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
fs refers to a generic filesystem client which can point to any filesystem such as the local filesystem, HDFS, S3, etc., whereas dfs is specific to HDFS. So an fs operation can work from/to the local filesystem, HDFS or another supported filesystem, but a dfs operation always relates to HDFS.
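To see the difference in practice, hadoop fs accepts explicit URI schemes, so the same command can target either filesystem; the paths below are just illustrative examples:
hadoop fs -ls file:///home/cloudera/wordcount/input/   # local filesystem
hadoop fs -ls hdfs:///user/cloudera/                   # HDFS (uses the configured namenode)
hdfs dfs -ls /user/cloudera/                           # dfs is meant for HDFS; hadoop dfs is deprecated in favor of hdfs dfs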
