How do s3n/s3a manage files? - hadoop

I've been using services like Kafka Connect and Secor to persist Parquet files to S3. I'm not very familiar with HDFS or Hadoop, but it seems like these services typically write temporary files either into local memory or to disk before writing in bulk to S3. Do the s3n/s3a file systems virtualize an HDFS-style file system locally and then push at configured intervals, or is there a one-to-one correspondence between a write to s3n/s3a and a write to S3?
I'm not entirely sure if I'm asking the right question here. Any guidance would be appreciated.

S3A/S3N just implement the Hadoop FileSystem APIs against the remote object store, including pretending it has directories you can rename and delete.
They have historically saved all the data you write to the local disk until you close() the output stream, at which point the upload takes place (which can be slow). This means that you must have as much temporary space as the biggest object you plan to create.
Hadoop 2.8 has a fast upload stream which uploads the file in 5+MB blocks as it gets written, then in the final close() makes it visible in the object store. This is measurably faster when generating lots of data in a single stream. This also avoids needing so much disk space.
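As a rough illustration of that write path, here is a minimal sketch (assuming Hadoop 2.8+ with the S3A connector on the classpath and AWS credentials already configured; the bucket name, object key, and property values are placeholders) that enables fast upload and writes a stream straight to S3A:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.8+: buffer and upload in parts while the stream is still open
        conf.set("fs.s3a.fast.upload", "true");
        conf.set("fs.s3a.multipart.size", "67108864");   // example: 64 MB parts

        Path dest = new Path("s3a://my-bucket/data/part-00000.parquet");  // placeholder bucket/key
        FileSystem fs = dest.getFileSystem(conf);

        try (FSDataOutputStream out = fs.create(dest)) {
            out.write(new byte[]{1, 2, 3});   // parts are uploaded as data is written
        }   // close() completes the upload and makes the object visible in S3
    }
}

Without the fast-upload setting, the same code still works, but the stream buffers to local disk and the entire upload happens inside close().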

Related

How to archive data stored in HDFS files on another (non-distributed) server?

I have a project folder containing approx. 50 GB of Parquet files on a Hadoop cluster (CDH 5.14), which I need to archive and move to another host (non-distributed, running Windows or Linux). This is only a one-time job - I do not plan to bring the data back to HDFS any time soon; however, there should be a way to deploy it back to a distributed file system. What would be the optimal way to do it? Unfortunately, I don't have another Hadoop cluster or a cloud environment where I could place this data.
I would appreciate any hints.
The optimal solution can depend on the actual data (e.g. tables vs. many/few flat files). If you know how the data got in there, looking at the inverse could be a logical first step.
For example, if you just used put to place the files, consider using get.
If you used NiFi to get it in, try NiFi to get it out.
After the data is on your Linux box, you can use scp or something like FTP or a mounted drive to move it to the desired computer.
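For the put/get route, a minimal sketch of the programmatic equivalent of hdfs dfs -get, using the Hadoop FileSystem API (the HDFS and local paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path source = new Path("/data/project");               // placeholder HDFS directory
        Path target = new Path("file:///archive/project");     // placeholder local destination

        // Equivalent of: hdfs dfs -get /data/project /archive/project
        fs.copyToLocalFile(false, source, target);             // false = keep the HDFS copy
    }
}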

Is it possible join a lot of files in Apache Flume?

Our server receives a lot of files all the time, and the files are pretty small - around 10 MB each. Our management wants to build a Hadoop cluster for analysis and storage of these files. But it is not efficient to store small files in Hadoop. Is there any option in Hadoop or in Flume to join these files (make one big file)?
Thanks a lot for help.
Here's what comes to my mind:
1) Use Flume's "Spooling Directory Source". This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk.
Write your files to that directory.
2) Use whichever Flume channel you want: "memory" or "file". Both have advantages and disadvantages.
3) Use HDFS Sink to write to HDFS.
The "spooling directory source" will rename the file once ingested (or optionally delete). The data also survives crash or restart.
Here's the documentation:
https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
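Putting those three pieces together, a sketch of the agent configuration might look like the following (the agent name, directories, NameNode address, and roll thresholds are placeholders - check the user guide above for your Flume version). Setting a large hdfs.rollSize and disabling the other roll triggers is what coalesces many small input files into fewer, larger HDFS files:

# spooling-directory source -> file channel -> HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/flume/incoming        # drop the small files here
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = file                           # survives agent restarts

agent1.sinks.sink1.type              = hdfs
agent1.sinks.sink1.channel           = ch1
agent1.sinks.sink1.hdfs.path         = hdfs://namenode:8020/data/ingest
agent1.sinks.sink1.hdfs.fileType     = DataStream
agent1.sinks.sink1.hdfs.rollSize     = 268435456          # roll at ~256 MB, not per input file
agent1.sinks.sink1.hdfs.rollCount    = 0                  # disable rolling by event count
agent1.sinks.sink1.hdfs.rollInterval = 0                  # disable rolling by time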

HDFS configuration & what is the user directory for?

I am currently "playing around" with Hadoop in a VM (the CDH4.1.3 image from Cloudera). What I am wondering about is the following (and the documentation did not really help me in that regard).
Following the tutorial, I would format a NameNode first - OK, that is already done if one uses the Cloudera image. Likewise, the HDFS file structure is already present. In hdfs-site.xml the datanode data dir is set to:
/var/lib/hadoop-hdfs/cache/${user.name}/dfs/data
which is obviously where the blocks are supposed to be copied to in a real distributed setting. In the Cloudera tutorial, one is told to create HDFS "home directories" for each user (/users/<username>), and I do not understand what they are for. Are they just for local test runs in a single-node setup?
Say I really had petabytes of data on tape that does not fit into my local storage. This data would have to be distributed straight away, rendering a local "home directory" entirely useless.
Could someone tell me, just to give me an intuition, how a real Hadoop workflow with massive data would look like? What kind of distinct nodes would I have running for a start?
There's the master (JobTracker) with its slaves file (where would I put that?) allowing the master to resolve all the DataNodes. Then there is my NameNode that keeps track of where the block IDs are stored. The DataNodes also carry TaskTracker responsibility. In the config files, the NameNode's URI is included -- am I correct so far? Then there is still the ${user.name} variable in the configuration which apparently, if I understood it right, has something to do with WebHDFS; it would also be great if someone could explain that to me. In the running examples, the directories tend to be hardcoded to
/var/lib/hadoop-hdfs/cache/1/dfs/data, /var/lib/hadoop-hdfs/cache/2/dfs/data and so on.
So, back to the example: say I have my tape and want to import data into my HDFS (and I am required to stream data into the filesystem because I lack the local storage to save it on a single machine). Where would I start with the migration process? On an arbitrary DataNode? On the NameNode that distributes the chunks? After all, I cannot assume the data would just "be there", because the name node has to be aware of the block IDs.
It would be great if someone could shortly elaborate on these topics:
What is the home directory really for?
Do I migrate data to the home directory first and to the real distributed system afterwards?
How does WebHDFS work and what role does it play with regard to the user.name variable?
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning that the chunks get randomly distributed across the cluster)?
What is the home directory really for?
You have a small confusion here. Just like /home exists for local filesystems on Linux, where users are given their own storage space, /users is a home mount ON HDFS (the distributed FS). The tutorial needs you to administratively create a home directory for the user you wish to later run data loads and queries as, so that they get adequate permissions and storage access on HDFS. The tutorial is not asking you to create these directories locally.
Do I migrate data to the home directory first and to the real distributed system afterwards?
I believe my above answer should clarify this for you. You should create your home directory on the HDFS, and then load all your data inside of that directory.
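As a rough sketch of what that looks like programmatically (the command-line equivalent is hdfs dfs -mkdir plus -chown run as the HDFS superuser; the user name and paths below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HomeDirSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // the HDFS defined in core-site.xml

        Path home = new Path("/user/alice");             // placeholder home directory on HDFS
        fs.mkdirs(home);                                 // created on HDFS, not on the local disk
        fs.setOwner(home, "alice", "alice");             // needs HDFS superuser privileges

        // Load data into the home directory; the local source path is a placeholder
        fs.copyFromLocalFile(new Path("file:///tmp/input.csv"), new Path(home, "input.csv"));
    }
}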
How does WebHDFS work and what role does it play with regard to the user.name variable?
WebHDFS is one of the various ways to access HDFS. Regular clients talking to HDFS need to use the Java APIs. WebHDFS (and also HttpFS) were added to HDFS to let other languages have their own set of APIs by providing a REST front-end to HDFS. WebHDFS also allows user authentication, to help preserve the permission and security models.
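For example, here is a sketch of calling WebHDFS directly over HTTP (the NameNode host, port, path, and user name are placeholders; on an unsecured cluster the user.name query parameter is what identifies the caller):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsSketch {
    public static void main(String[] args) throws Exception {
        // LISTSTATUS on /user/alice, acting as user "alice" (placeholders)
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                // JSON directory listing
            }
        }
    }
}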
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning that the chunks get randomly distributed across the cluster)?
A large part of the problem HDFS solves for you is managing the distribution of data. When loading files or data streams into HDFS (via CLI tools, sinks from Apache Flume, etc.), the blocks are spread in an ideal distribution by HDFS itself, and the chunking is managed by it as well. All you need to do is use the regular user-side FileSystem-style APIs and forget about what goes where underneath - it's all managed for you.
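To make that concrete, a minimal sketch of streaming data into HDFS without staging it on local disk first (the source URL and destination path are placeholders; any InputStream - a tape reader, socket, or HTTP download - would do):

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamIntoHdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        InputStream in = new URL("http://data.example.com/export.bin").openStream();  // placeholder source
        try (FSDataOutputStream out = fs.create(new Path("/user/alice/export.bin"))) {
            IOUtils.copyBytes(in, out, 4096, false);     // HDFS chunks the stream into blocks and places them
        } finally {
            in.close();
        }
    }
}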

How to get data from temp files of hadoop?

I have an application to transfer data from remote systems to HDFS using MapReduce. However, I am lost when I have to deal with issues like network failure - that is, when the connection to a remote data source is lost and the data is no longer accessible to my MapReduce application. I can always restart the job, but when the data is huge, restarting is an expensive option. I know MapReduce creates a temp folder, but will it put data there? Can I read that data out, and can I then somehow start reading the rest of the data?
A MapReduce job can write arbitrary files, not only the ones managed by Hadoop.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);                     // the default filesystem (HDFS) from core-site.xml
FSDataOutputStream out = fs.create(new Path(fileName));   // fileName is your own output/checkpoint path
Using this code you create arbitrary files which work like normal files in the local filesystem. Then you handle connection exceptions so that when a source becomes inaccessible you close the file cleanly and record somewhere (e.g. in HDFS itself) that an interruption happened and at which point.
In the case of FTP, you could write just the list of file paths and folders. When the job finishes downloading a file, write its path to the downloaded list, and when an entire folder is downloaded write the folder path, so in case of a resume you will not have to traverse the directory contents to check that all files were downloaded.
At startup, on the other hand, the program will check this file to decide whether the previous attempt failed and, if so, where to resume the download.
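One way to record that progress - a sketch only, and a slight variation on the list described above: it drops one empty marker file per completed download instead of appending to a single list, since append support varies between clusters (the checkpoint directory and user are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResumeSketch {
    // Placeholder checkpoint location on HDFS
    private static final Path DONE_DIR = new Path("/user/alice/ftp-mirror/.done");

    // Skip files that finished in a previous run
    public static boolean alreadyDownloaded(FileSystem fs, String remotePath) throws Exception {
        return fs.exists(markerFor(remotePath));
    }

    // Call after a file has been fully downloaded and closed
    public static void markDownloaded(FileSystem fs, String remotePath) throws Exception {
        fs.mkdirs(DONE_DIR);
        fs.createNewFile(markerFor(remotePath));          // empty marker: "this file is complete"
    }

    private static Path markerFor(String remotePath) {
        return new Path(DONE_DIR, remotePath.replace('/', '_'));   // flatten the remote path into a name
    }
}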
In general, Hadoop will kill your program if it is not reading or writing anything for a timeout. Your application can tell it to wait, but in general it is not good to keep an idle job around, so it is better to end the job cleanly instead of waiting for the network to work again.
You can also create your own filewriter, this way:
conf.setOutputFormat(MyOwnOutputFormat.class);
your filewriter could save its own temporary files in the format you prefer, so if the application crashes you know how files are saved.
HDFS saves files in blocks of 64 MB by default, and when a job fails you may not even have a temporary file unless you use your own writer.
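As a sketch of what such a custom writer could look like (using the old mapred API to match conf.setOutputFormat above; the class is hypothetical and only illustrates flushing each record so partial output survives a crash):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class MyOwnOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        final FSDataOutputStream out = file.getFileSystem(job).create(file, progress);

        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "\t" + value + "\n");
                out.hflush();                             // push each record to the datanodes immediately
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}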
This is a generic solution; the details depend on the source of the data (FTP, Samba, HTTP...) and whether it supports resuming downloads.
EDIT: in the case of FTP, you could just use csync to synchronize an FTP server with your local filesystem, and hdfs-fuse to mount an HDFS filesystem. It works when you have many small files.
You haven't specified what tool you are using to ingest data into HDFS/Hadoop.
Some of the tools that support recoverability when ingesting data into HDFS/Hadoop are Flume, Scribe and Chukwa (for log files), which all offer various configurable levels of file-transfer reliability guarantees, and Sqoop for transferring relational database data into HDFS or Hive, etc.

Amazon S3 multipart upload often fails

I'm trying to upload a 32 GB file to an S3 bucket using the s3cmd CLI. It's doing a multipart upload and often fails. I'm doing this from a server which has 1000 Mbps of bandwidth to play with. But the upload is still VERY slow. Is there something I can do to speed this up?
On the other hand, the file is on HDFS on the server I mentioned. Is there a way to have the Amazon Elastic MapReduce job pick it up from this HDFS? It's still an upload, but the job is getting executed as well, so the overall process is much quicker.
First I'll admit that I've never used the multipart feature of s3cmd, so I can't speak to that. However, I have used boto in the past to upload large (10-15 GB) files to S3 with a good deal of success. In fact, it became such a common task for me that I wrote a little utility to make it easier.
As for your HDFS question, you can always reference an HDFS path with a fully qualified URI, e.g., hdfs://{namenode}:{port}/path/to/files. This assumes your EMR cluster can access this external HDFS cluster (you might have to play with security group settings).
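For instance, a sketch of reaching an external HDFS cluster by its fully qualified URI from code running on EMR (the NameNode host, port, and path are placeholders; the same hdfs:// URI can also be passed directly as a job input path):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder NameNode address; the EMR cluster must be able to reach it over the network
        FileSystem remote = FileSystem.get(new URI("hdfs://namenode.example.com:8020"), conf);

        for (FileStatus status : remote.listStatus(new Path("/path/to/files"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}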
