How to get data from temp files of hadoop? - hadoop

I have an application that transfers data from remote systems to HDFS using MapReduce. However, I am lost when it comes to issues like network failure, that is, when the connection to the remote data source is lost and the data is no longer accessible to my MapReduce application. I can always restart the job, but when the data is huge, restarting is an expensive option. I know MapReduce creates a temp folder, but will it put data there? Can I read that data out, and can I then somehow resume reading the rest of the data?

A mapreduce job can write arbitrary files, not only the ones managed by Hadoop.
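// needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, FSDataOutputStream, Path}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// create() returns a stream you can write to like any OutputStream
FSDataOutputStream out = fs.create(new Path(fileName));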
With this code you create arbitrary files that behave like normal files in the local filesystem. You then handle connection exceptions so that, when a source becomes inaccessible, you close the file cleanly and record somewhere (e.g. in HDFS itself) that an interruption happened and at which point.
In the case of FTP, you could simply write the list of file paths and folders. When a job finishes downloading a file, write its path to the downloaded list, and when an entire folder has been downloaded, write the folder path, so that on resume you do not have to traverse the directory contents to check that all files were downloaded.
At startup, the program checks this list to decide whether the previous attempt failed and, if so, where to resume the download.
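The downloaded-list idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: DownloadLog, markDone, loadDone and pending are hypothetical names, and the log is kept in a local file via java.nio so the example is self-contained. In a real job the list would more likely live in HDFS and be written through the FileSystem API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: tracks which remote paths have been fully downloaded. */
public class DownloadLog {
    private final Path logFile;

    public DownloadLog(Path logFile) {
        this.logFile = logFile;
    }

    /** Appends one finished path (a file or a whole folder) to the log. */
    public void markDone(String remotePath) throws IOException {
        Files.write(logFile,
                (remotePath + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads the log; a missing log means there was no previous attempt. */
    public Set<String> loadDone() throws IOException {
        if (!Files.exists(logFile)) {
            return new HashSet<>();
        }
        return new HashSet<>(Files.readAllLines(logFile, StandardCharsets.UTF_8));
    }

    /** Returns the paths that still need downloading, in their original order. */
    public List<String> pending(List<String> allPaths) throws IOException {
        Set<String> done = loadDone();
        List<String> todo = new ArrayList<>();
        for (String p : allPaths) {
            if (!isCovered(p, done)) {
                todo.add(p);
            }
        }
        return todo;
    }

    // A path is covered if it, or any folder above it, appears in the log.
    private static boolean isCovered(String path, Set<String> done) {
        if (done.contains(path)) {
            return true;
        }
        for (String d : done) {
            String prefix = d.endsWith("/") ? d : d + "/";
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

On restart, the job would call pending(...) with the full remote listing and download only what comes back. Because a finished folder is logged as a single entry, its contents are skipped without being traversed.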
In general, Hadoop will kill a task that neither reads, writes, nor reports progress within the task timeout (10 minutes by default). Your application can tell it to wait by reporting progress, but it is generally not good to keep an idle job around, so it is better to end the job cleanly than to wait for the network to come back.
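You can also plug in your own file writer. With the old mapred API, where conf is a JobConf rather than a plain Configuration, it looks like this (with the new API the equivalent call is job.setOutputFormatClass):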
conf.setOutputFormat(MyOwnOutputFormat.class);
Your file writer could save its own temporary files in whatever format you prefer, so if the application crashes you know how the files were saved.
HDFS stores files in blocks (64 MB by default in Hadoop 1.x, 128 MB in later versions), and when a job fails you may not even have a temporary file unless you use your own writer.
This is a generic solution; the details depend on the source of the data (FTP, Samba, HTTP...) and on whether it supports resuming downloads.
EDIT: in the case of FTP, you could just use csync to synchronize an FTP server with your local filesystem, and hdfs-fuse to mount an HDFS filesystem. This works when you have many small files.

You haven't specified what tool you are using to ingest data into HDFS/Hadoop.
Some ingestion tools support recoverability: Flume, Scribe and Chukwa (for log files) all offer configurable levels of transfer reliability, and Sqoop transfers relational database data into HDFS or Hive.

Related

How to archive data stored in HDFS files on another (non-distributed) server?

I have a project folder containing approx. 50 GB of parquet files on a hadoop cluster (CDH 5.14), which I need to archive and move to another host (non-distributed with Windows or Linux). This is only a one time job - I do not plan to bring the data back to HDFS any time soon, however there should be a way to deploy it back to a distributed file system. What would be the optimal way to do it? Unfortunately, I don't have another hadoop cluster or a cloud environment where I could place this data.
I would appreciate any hints.
The optimal solution depends on the actual data (e.g. tables, many or few flat files). If you know how the data got in there, doing the inverse is a logical first step.
For example, if you just used put (hdfs dfs -put) to place the files, consider using get (hdfs dfs -get).
If you used NiFi to get the data in, try NiFi to get it out.
After the data is on your Linux box, you can use SCP or something like FTP or a mounted drive to move it to the desired computer.

Is it possible join a lot of files in Apache Flume?

Our server receives a lot of files all the time. The files are fairly small, around 10 MB each. Our management wants to set up a Hadoop cluster to analyze and store these files, but storing small files in Hadoop is not efficient. Is there any option in Hadoop or in Flume to join these files into one big file?
Thanks a lot for help.
Here's what comes to my mind:
1) Use Flume's "Spooling Directory Source". This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk.
Write your files to that directory.
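2) Use whichever channel you want for Flume: "memory" or "file". Both have advantages and disadvantages (the memory channel is faster, the file channel survives an agent restart).
3) Use the HDFS Sink to write to HDFS. Its roll settings (hdfs.rollSize, hdfs.rollInterval, hdfs.rollCount) control how much data goes into each output file, which is how many small inputs can end up in fewer, larger HDFS files.
The "spooling directory source" renames the file once it has been ingested (or optionally deletes it). The data also survives a crash or restart.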
Here's the documentation:
https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files on a machine (say A) that is not part of the Hadoop (or HDFS) datacenter, so machine A is remote from the HDFS datacenter.
Is there a script, command, program or tool that can run on machines connected to Hadoop (inside the datacenter) and pull the file from machine A into HDFS directly? If yes, what is the best and fastest way to do this?
I know there are many options such as WebHDFS or Talend, but they need to run from machine A; the requirement is to avoid that and run on machines in the datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to hdfs, and delete the temporarily stored data.
If you do not want to keep it as a 2-step process, you can write a program which will read the files from the remote machine, and write it to HDFS directly.
This question, along with its comments and answers, should come in handy for reading the file, while you can use the snippet below to write to HDFS.
String outFile = "<path to the file, including the name of the new file>"; // e.g. hdfs://localhost:<port>/foo/bar/baz.txt
FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath = new Path(outFile);
FSDataOutputStream out = hdfs.create(newFilePath); // create() takes a Path, not a String
// put a while loop here that reads until EOF and writes each chunk with the statement below
out.write(buffer);
out.close(); // close the stream once everything has been written
Use a buffer of 50 * 1024 bytes if you have enough I/O capacity (depending on your processor), or a much lower value such as 10 * 1024.
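The read-until-EOF loop mentioned in the snippet's comment could look roughly like the sketch below. StreamCopy and copy are hypothetical names, and the code uses only java.io so it runs on its own. Since FSDataOutputStream is an OutputStream, the out stream from the snippet above could be passed in directly, with the input stream coming from whatever client reads machine A.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    /**
     * Copies everything from in to out using a buffer of bufferSize bytes.
     * Returns the number of bytes copied. The caller closes both streams.
     */
    public static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        // read() returns -1 at EOF and may fill only part of the buffer
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes actually read
            total += n;
        }
        out.flush();
        return total;
    }
}
```

Note that writing the whole buffer on every iteration would append stale bytes whenever a read comes back short, which is why the loop passes the count n to write. Hadoop's own org.apache.hadoop.io.IOUtils.copyBytes does a similar job if you prefer not to write the loop yourself.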
Please tell me if I am understanding your question correctly:
1. You want to copy a file that sits in a remote location.
2. The client machine is not part of the Hadoop cluster.
3. It may not contain the libraries Hadoop requires.
In that case the best way is WebHDFS, i.e. the REST API.

How to Use third party API in hadoop to read files from hdfs if those API uses only local file system path?

I have large mbox files and I am using a third-party API such as mstor to parse messages from them using Hadoop. I have uploaded those files to HDFS. The problem is that this API accepts only a local file system path, similar to what is shown below:
MessageStoreApi store = new MessageStoreApi("file location in local file system");
I could not find a constructor in this API that initializes from a stream, so I cannot read the HDFS stream and initialize the store from it.
Now my question is: should I copy my files from HDFS to the local file system and initialize the store from a local temporary folder? That is what I have been doing so far.
Currently my map function receives the path of the mbox file:
Map(key=path_of_mbox_file_in_hdfs, value=null){
String local_temp_file = CopyToLocalFile(path_of_mbox_file_in_hdfs);
MessageStoreApi store = new MessageStoreApi(local_temp_file); // pass the variable, not a string literal
//process file
}
Or is there some other solution? I was hoping for something like this: if I increase the block size so that a single file fits in one block, and if I can somehow get the location of that block in my map function, then, since map functions mostly execute on the node where the block is stored, I may not always have to download the file to the local file system. But I am not sure that will always work :)
Suggestions and comments are welcome!
For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
The NFS feature is currently better maintained upstream than the FUSE option.
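If mounting HDFS is not an option, the copy-to-local approach from the question can still be made tidy. The sketch below is an assumption-laden illustration, not the mstor API: LocalStaging, PathConsumer and withLocalCopy are hypothetical names, and the input is a plain InputStream so the example runs without Hadoop. In a mapper the stream would come from fs.open(hdfsPath).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalStaging {
    /** Something that needs a real local file path, e.g. a path-only parser. */
    public interface PathConsumer<T> {
        T apply(Path localFile) throws IOException;
    }

    /**
     * Spools the stream to a temp file, hands the path to the consumer,
     * and deletes the temp file afterwards, even if the consumer throws.
     */
    public static <T> T withLocalCopy(InputStream in, String suffix, PathConsumer<T> consumer)
            throws IOException {
        Path tmp = Files.createTempFile("staged-", suffix);
        try {
            // the temp file already exists, so REPLACE_EXISTING is required
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return consumer.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The consumer could then be something like p -> new MessageStoreApi(p.toString()), using the constructor shown in the question. The try/finally keeps task-local disk from filling up with leftover mbox copies when parsing fails.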

getting data in and out of hadoop

I need a system to analyze large log files. A friend pointed me to Hadoop the other day and it seems perfect for my needs. My question is about getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit that file once it reaches a certain size? And is it possible to append to a file in HDFS while running queries/jobs on that same file at the same time?
The Fluentd log collector has just released its WebHDFS plugin, which lets users stream data into HDFS as it arrives. It is easy to install and to manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example to post logs against Fluentd.
Fluentd: Data Import from Java Applications
A Hadoop job can run over multiple input files, so there is really no need to keep all your data in one file. You won't be able to process a file until its file handle has been properly closed, however.
HDFS did not support appends when this was written; later releases do (see FileSystem.append).
What I do is run the MapReduce job periodically and output the results to a "processed_logs_#{timestamp}" folder.
Another job can later take these processed logs and push them to a database etc. so it can be queried on-line
I'd recommend using Flume to collect the log files from your servers into HDFS.
