<ask> How to Backup and Restore HDFS - hadoop

I have developed an application that uses HDFS to store images. I now want to migrate to a new server and set up Hadoop again there. How can I back up the image files in HDFS on the old server and restore them to HDFS on the new server?
I tried the copyToLocal command to back up and copyFromLocal to restore, but it did not work: when the application runs, the images I restored to HDFS do not show up in it.
How can I solve this?
Thanks

DistCp is the tool to use for large inter- or intra-cluster copies. Here is the documentation for it.
copyToLocal and copyFromLocal should also work well for small amounts of data. Use the HDFS CLI to check that the files are actually there. If they are, the problem is probably in the application.
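As a rough sketch of the DistCp route, the helper below only assembles the `hadoop distcp` argument list for a cluster-to-cluster copy. The NameNode hostnames, port 8020, and the /images path are placeholders, not values from the question, and the sketch assumes the `hadoop` binary is on the PATH of the machine that launches the copy.

```python
import subprocess


def distcp_command(src_uri, dst_uri, update=False):
    """Build the argument list for a `hadoop distcp` run."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target.
        cmd.append("-update")
    cmd += [src_uri, dst_uri]
    return cmd


def run(cmd):
    """Launch the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical cluster addresses and path, for illustration only.
    cmd = distcp_command(
        "hdfs://old-namenode:8020/images",
        "hdfs://new-namenode:8020/images",
    )
    print(" ".join(cmd))  # pass cmd to run() on a host that has Hadoop installed
```

After the copy, listing the target directory with the HDFS CLI on the new cluster is a quick way to confirm the files arrived before pointing the application at it.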

Related

How can I copy files from an external Hadoop cluster to Amazon S3 without running any commands on the cluster?

I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is one way to copy the data into S3, but I have a restriction: I cannot run any commands on the cluster. I need to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive. Is that possible?
2) S3DistCp (running it on AWS). If so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is reachable from EC2, you could run a distcp command there or, if you only need a specific subset of the data, a Hive query that reads from hdfs:// and writes out to S3. You will need to deal with Kerberos authentication, though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, although you can go the other way.
You can also run distcp locally on one or more machines, but you are then limited by the bandwidth of those individual systems. distcp works best when it schedules the uploads on the hosts that actually hold the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files. This is what incremental backup tools tend to use.
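To illustrate the audit-log idea, here is a minimal sketch that collects the paths touched by write operations. It assumes the usual NameNode audit layout of tab-separated key=value fields (allowed, ugi, ip, cmd, src, dst, perm); the set of commands treated as writes is an illustrative choice, and the sample line is made up for the demo.

```python
# Commands treated as changes; an illustrative, non-exhaustive set.
WRITE_COMMANDS = {"create", "append", "rename", "delete", "mkdirs"}


def parse_audit_line(line):
    """Split one audit line into a dict of its key=value fields."""
    fields = {}
    for part in line.rstrip("\n").split("\t"):
        key, sep, value = part.partition("=")
        if sep and key.strip():
            # The first field carries the timestamp and logger name before
            # "allowed", so keep only the last whitespace-separated token.
            fields[key.split()[-1]] = value
    return fields


def changed_paths(lines):
    """Return the set of paths touched by successful write commands."""
    changed = set()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("allowed") != "true":
            continue
        if fields.get("cmd") not in WRITE_COMMANDS:
            continue
        if fields.get("src"):
            changed.add(fields["src"])
        # A rename also changes the destination path.
        if fields["cmd"] == "rename" and fields.get("dst") not in (None, "", "null"):
            changed.add(fields["dst"])
    return changed


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = (
        "2024-01-01 00:00:00,000 INFO FSNamesystem.audit: allowed=true\t"
        "ugi=etl (auth:SIMPLE)\tip=/10.0.0.5\tcmd=create\t"
        "src=/data/a.csv\tdst=null\tperm=etl:hadoop:rw-r--r--"
    )
    print(sorted(changed_paths([sample])))
```

The resulting set could then be fed to whatever copy step is in use, so that only changed files are transferred.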

How to download Hadoop files (on HDFS) via FTP?

I would like to implement an SSIS job that can download large CSV files located on a remote Hadoop cluster. Of course, a regular FTP server on the Hadoop system does not expose HDFS files, since it serves the local filesystem.
I would like to know whether there is an FTP server implementation that sits on top of HDFS. I would prefer this approach to copying files from HDFS to the local FS and having the FTP server serve them from there, because that would require allocating more storage space.
I forked an open-source project that works as expected: https://github.com/jamesattard/maroodi

How to load text files into HDFS through an Oozie workflow in a cluster

I am trying to load text/CSV files with Hive scripts run from Oozie, scheduled on a daily basis. The text files are on the local Unix file system.
I need to put those text files into HDFS before the Hive scripts execute in the Oozie workflow.
In a real cluster we do not know which node the job will run on. It can run on any node in the cluster.
Can anyone suggest a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
The Oozie server has access to HDFS files only (same as Hive).
Your data is on a local filesystem somewhere.
So why don't you load your files into HDFS beforehand? The transfer may be triggered either when the files become available (a post-processing action in the upstream job) or at a fixed time (using Linux cron).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode: just use curl and an HTTP upload.
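The WebHDFS upload is a two-step exchange: a PUT with op=CREATE to the NameNode, which answers with a 307 redirect, followed by a PUT of the file body to the DataNode URL in the Location header. The sketch below does this with the Python standard library only. It assumes simple (non-Kerberos) authentication through the user.name parameter, and every host, port, user, and path passed in is a placeholder supplied by the caller.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def create_path(hdfs_path, user, overwrite=False):
    """Build the request path and query for a WebHDFS CREATE call."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def upload(namenode_host, namenode_port, local_file, hdfs_path, user):
    """Upload local_file to hdfs_path through WebHDFS."""
    # Step 1: ask the NameNode where to write; no file data is sent yet.
    conn = http.client.HTTPConnection(namenode_host, namenode_port)
    conn.request("PUT", create_path(hdfs_path, user))
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError("expected a 307 redirect, got %d" % resp.status)

    # Step 2: send the file body to the DataNode named in the redirect.
    target = urlsplit(location)
    headers = {"Content-Length": str(os.path.getsize(local_file))}
    with open(local_file, "rb") as fh:
        data_conn = http.client.HTTPConnection(target.hostname, target.port)
        data_conn.request("PUT", target.path + "?" + target.query,
                          body=fh, headers=headers)
        result = data_conn.getresponse()
        result.read()
        data_conn.close()
    if result.status != 201:
        raise RuntimeError("upload failed with status %d" % result.status)
```

A cron entry could call `upload(...)` for each new file. The NameNode HTTP port depends on the Hadoop version (commonly 50070 on Hadoop 2 and 9870 on Hadoop 3), so it is left as a parameter here.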

How can I use Oozie to copy remote files into HDFS?

I have to copy remote files into HDFS. I want to use Oozie because I need to run this job every day at a specific time.
Oozie can help you create a workflow. With Oozie you can invoke an external action capable of copying files from your source to HDFS, but Oozie will not do the copy automatically.
Here are a few suggestions:
Use a custom program to write files to HDFS, for example using a SequenceFile.Writer.
Flume might help.
Use an integration component like camel-hdfs to move files to HDFS.
FTP the files to an HDFS node and then copy them from local disk to HDFS.
Investigate more options that might be a good fit for your case.

Access hdfs from outside hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node, or is there a way to access HDFS from outside Hadoop?
Any other suggestions on how to do this are welcome. Unfortunately, my executables cannot be run within Hadoop.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open (FileSystem.open). This gives you a stream that behaves like a generic open file.
You can stream your data with hadoop fs -cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could also build a bridge around this command with something like popen.
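The popen-style bridge mentioned above could look roughly like this in Python. It is a sketch that assumes the `hadoop` binary is on the PATH of the node running the script; the path pattern is the one from the example command.

```python
import subprocess
import sys


def hdfs_cat_command(pattern):
    """Argument list for `hadoop fs -cat`; the hadoop shell expands the glob."""
    return ["hadoop", "fs", "-cat", pattern]


def stream_lines(cmd):
    """Yield the command's stdout line by line without buffering it all."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    # With an argument, stream that HDFS path; otherwise run a local
    # stand-in command so the sketch works without Hadoop installed.
    if len(sys.argv) > 1:
        cmd = hdfs_cat_command(sys.argv[1])
    else:
        cmd = [sys.executable, "-c", "print('stand-in line')"]
    for line in stream_lines(cmd):
        print(line)
```

Because the lines are consumed as they arrive, the external program can start processing before the whole file has been read from HDFS.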
Also check WebHDFS, which made it into the 1.0.0 release and will also be in the 0.23.1 release. Since it is based on a REST API, any language can access it, and Hadoop does not need to be installed on the node where the HDFS files are required. It is also as fast as the other options mentioned by orangeoctopus.
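A minimal sketch of reading a file over WebHDFS from a node without Hadoop installed, using only the Python standard library. It assumes simple authentication via the user.name parameter; the host, port, user, and path are whatever the caller supplies and are not taken from the answer.

```python
import shutil
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def open_url(host, port, hdfs_path, user):
    """Build the WebHDFS URL for an OPEN (read) request."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), query
    )


def copy_to(stream_out, url):
    """Copy the file content to a binary stream.

    The NameNode answers OPEN with a redirect to a DataNode, which
    urlopen follows automatically for GET requests.
    """
    with urlopen(url) as resp:
        shutil.copyfileobj(resp, stream_out)
```

For example, `copy_to(sys.stdout.buffer, open_url(...))` would pipe the file content to a program reading from stdin, much like the hadoop fs -cat approach.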
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem. It can act as an HDFS proxy.
I had a similar issue and asked a question about it. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution, I posted an answer here for HDFS. The most painful issue turned out to be user authentication, which in my case was solved in the simplest way (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, here is a nice Cloudera article on how to configure Maven to build a JAR for this. It worked 100% in my case.
The main difference between posting a remote MapReduce job and accessing HDFS is a single configuration setting (check the mapred.job.tracker variable).

Resources