Copy a file using WebHDFS

Copy a file using WebHDFS - hadoop

Is there a way to copy a file from (let's say) hdfs://old to hdfs://new without first downloading the file and then uploading it again?

Don't know about WebHDFS, but this is achievable using hadoop distcp.
The command looks something like this:
hadoop distcp hdfs://old_nn:8020/old/location/path.file hdfs://new_nn:8020/new/location/path.file

Related

How can I transfer a zip file directly from a URL to HDFS by using Java?

How can I transfer a zip file from a URL to HDFS by using Java? I am not supposed to download it. I can only transfer the file directly from URL to HDFS. Anyone has some idea would work out?

You can use a simple ssh code like:
wget http://domain/file.zip
and then
hadoop fs -put /path/file.zip
In java, you should download the file and then put it in hdfs

Running oozie job using a modified hadoop config file to support S3 to HDFS

Hello I am trying to copy a file in my S3 bucket into HDFS using the cp command.
I do something like
Hadoop --config config fs -cp s3a://path hadooppath
This works well when my config is in my local.
However now I am trying to set it up as an oozie job. So when I am now unable to pass the configuration files present in config directory in my local system. Even if its in HDFS, then still it doesn't seem to work. Any suggestions ?
I tried -D command in Hadoop and passed name and value pairs, still it throws some error. It works only from my local system.

Did you Try DISTCP in oozie? Hadoop 2.7.2 will supports S3 data source. You can able to schedule it by coordinators. Just parse the credentials to coordinators either RESTAPI or in Properties files. Its easy way to copy a data periodically(Scheduled manner).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/

How to access .pem file from amazon emr

Im new to amazon EMR and i want to use .pem file in EMR.
.pem file is in my local folder. When i create same file with pem file contents in EMR instance its not working.
it would be really help if anyone can provide steps to copy file to EMR from local machine or access file from S3.
thhanks in advance.

create a bootstrap script to copy .pem file to EMR boxes
use below command in bootstrap script to download file to any loction of EMR ( I am downloading file to /mnt/
#!/bin/bash
hadoop fs -copyToLocal s3:n://mybucket/myfolder/my.pem /mnt/my.pem

If you are trying to copy the keypair.pem into the master node then ssh core nodes. Then, you can copy the file from your computer using winscp or by running the command
sftp -i key.pem hadoop#master
put keypair.pem

If I understand your question, you want to transfer the .pem file to one of the EMR instances. So, you can use the winscp application to do such operation. If you are not familiar with such application, have a look on this tutorial here. Let me know if you need any help..

How to copy from subdirectories using s3DistCp

Trying to use s3DistCp to copy from s3://my-bucket/dir1/ , s3://my-bucket/dir2, s3://my-bucket/dir3.
And all three DIRs has some files in them. Wanted to do something like:
hadoop jar s3distcp.jar --src s3://my-bucket/*/ --dest s3://my-bucket/some-other-dir/
But it generate an error saying that:
's3://my-bucket/*/' directory not found...
So does it mean s3DistCp doesn't take wildcards in paths? Is there any work around or any ideas?

Have you tried with s3n? Difference between Amazon S3 and S3n in Hadoop

Run a Local file system directory as input of a Mapper in cluster

I gave an input to the mapper from a local filesystem.It is running successfully from eclipse,But not running from the cluster as it is unable to find the local input path saying:input path does not exist.Please can anybody help me how to give a local file path to a mapper so that it can run in the cluster and i can get the output in hdfs

This is a very old question. Recently faced the same issue.
I am not aware of how correct this solution is it worked for me though. Please bring to notice if there are any drawbacks of this.Here's what I did.
Reading a solution from the mail-archives, I realised if i modify fs.default.name from hdfs://localhost:8020/ to file:/// it can access the local file system. However, I didnt want this for all my mapreduce jobs. So I made a copy of core-site.xml in a local system folder (same as the one from where I would submit my MR jar to hadoop jar).
and in my Driver class for MR I added,
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR takes input from local system and writes the output to hdfs:

Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to trun your job again, giving it the path of the data in HDFS

The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What occurs in this is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't hadoop similarly, using this syntax instead of s3n
file:///input file:///output
?
But empirically, this seems to fail in an interesting way -- I see that Hadoop gives a file not found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the put directory on my local disk but when it comes time to open them to read the records, the file is not found (or accessible).

The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as local File System or a network path or a web based store (such as Azure Blob Storage or Amazon Block stoage), you would need to copy the data at HDFS first and then run the Job.
The bottom line is that you would need to push the data first to to HDFS and there are several ways depend on data source, you would perform the data transfer from your source to HDFS such as from local file system you would use the following command:
$hadoop -f CopyFromLocal SourceFileOrStoragePath _HDFS__Or_directPathatHDFS_

Try setting the input path like this
FileInputFormat.addInputPath(conf, new Path(file:///the directory on your local filesystem));
if you give the file extension, it can access files from the localsystem

I have tried the following code and got the solution...
Please try it and let me know..
You need to get FileSystem object for local file system and then use makequalified method to return path.. As we need to pass path of local filesystem(no other way to pass this to inputformat), i ve used make qualified, which in deed returns only local file system path..
The code is shown below..
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late.. It worked fine for me.. It does not need any configuration changes i believe..

U might wanna try this by setting the configuration as
Configuration conf=new Configuration();
conf.set("job.mapreduce.tracker","local");
conf.set("fs.default.name","file:///");
After this u can set the fileinputformat with the local path and u r good to go

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Copy a file using WebHDFS - hadoop

Is there a way to copy a file from (let's say) hdfs://old to hdfs://new without first downloading the file and then uploading it again?

Don't know about WebHDFS, but this is achievable using hadoop distcp. The command looks something like this: hadoop distcp hdfs://old_nn:8020/old/location/path.file hdfs://new_nn:8020/new/location/path.file

Related

How can I transfer a zip file directly from a URL to HDFS by using Java?

Running oozie job using a modified hadoop config file to support S3 to HDFS

How to access .pem file from amazon emr

How to copy from subdirectories using s3DistCp

Run a Local file system directory as input of a Mapper in cluster

Categories

Resources