multiple file streaming hdfs - hadoop

I have two matrices in separate files. I have to read the files into cache so that I can multiply them. I have been wondering whether HDFS would help me. I suspect it would not, because it does not have enough cache memory to read the files and process them. In short, can I open two files at the same time?

To answer your shorter version of the question, yes the HDFS API does allow concurrent reads of two files at a time. You may simply create two input streams over the two files and read them in parallel (as you would with regular files) and manage your logic around that.
However, HDFS is a simple file system and has no cache of its own to offer (other than the OS buffer cache); any cache your computation needs has to be managed by your own application.
As another general recommendation, since you look to be multiplying matrices, perhaps look at the Apache Mahout and Apache Hama projects that support HDFS.
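As a minimal sketch of the "two streams open at once" idea: the code below uses in-memory Python streams as stand-ins for the two HDFS input streams (in a real application they would come from the HDFS client API, which is not shown here). The text layout (one whitespace-separated row per line) and the function names `read_matrix` and `multiply_streams` are assumptions for illustration only. Note that the caching is done by the application: B is held in memory while A is streamed row by row.

```python
import io

def read_matrix(stream):
    """Parse a matrix from a text stream: one whitespace-separated row per line."""
    return [[float(x) for x in line.split()] for line in stream if line.strip()]

def multiply_streams(stream_a, stream_b):
    """Multiply A (streamed row by row) by B (cached in memory by the application)."""
    b = read_matrix(stream_b)      # the application, not the file system, caches B
    cols_b = list(zip(*b))         # columns of B, used for the dot products
    result = []
    for line in stream_a:          # A is consumed one row at a time
        if not line.strip():
            continue
        row = [float(x) for x in line.split()]
        if len(row) != len(b):
            raise ValueError("inner dimensions do not match")
        result.append([sum(x * y for x, y in zip(row, col)) for col in cols_b])
    return result

# Both streams are open at the same time; io.StringIO stands in for two open files.
a = io.StringIO("1 2\n3 4\n")
b = io.StringIO("5 6\n7 8\n")
product = multiply_streams(a, b)
```

This only works when B fits in the application's memory; for matrices larger than that, a blocked algorithm or one of the projects mentioned above is the more realistic route.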

Related

Will a sequence file help improve read performance in HDFS compared to the local file system?

I want to compare the performance of HDFS and the local file system for 1000 small files (1-2 MB each). Without using sequence files, HDFS takes almost twice as long to read the 1000 files as the local file system.
I heard of sequence files here - Small Files Problem in HDFS
I want to show better response time for HDFS for retrieving these records than Local FS. Will sequence files help or should I look for something else? (HBase maybe)
Edit: I'm using a Java program to read the files, as in HDFS Read though Java
Yes, for simple file retrieval, grabbing a single sequence file will be much quicker than grabbing 1000 files. When reading from HDFS you incur much more overhead, including spinning up the JVM (assuming you're using hadoop fs -get ...), getting the location of each of the files from the NameNode, and network time (assuming you have more than one datanode).
A sequence file can be thought of as a form of container. If you put all 1000 files into a sequence file, you only need to grab about 32 blocks (assuming files of up to 2 MB and a block size of 64 MB) rather than 1000. This reduces location lookups and the total number of network connections made. You do run into another issue at this point when reading the sequence file: it is a binary format, so it cannot be read like a plain text file.
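The block count above can be checked with a little arithmetic. This is a rough sketch only: it assumes every file is exactly 2 MB and ignores the header and sync-marker overhead a real sequence file adds; the helper name `blocks_needed` is made up for the example.

```python
MB = 1024 * 1024

def blocks_needed(total_bytes, block_size_bytes):
    """Number of fixed-size blocks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // block_size_bytes)

file_count = 1000
file_size = 2 * MB          # assumed upper end of the 1-2 MB range
block_size = 64 * MB        # block size quoted in the answer

# Stored separately, each small file sits in its own block: one lookup per file.
separate_blocks = file_count * blocks_needed(file_size, block_size)

# Packed into one container file, the data is split into full 64 MB blocks.
packed_blocks = blocks_needed(file_count * file_size, block_size)
```

With 1 MB files instead of 2 MB files, the packed count drops to 16 blocks, so "about 32" is the upper end of the range.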
HBase is better suited for low-latency and random reads, so it may be a better option for you. Keep in mind that disk seeks still occur (unless you're working from memory), so reading a bunch of small files locally may be a better solution than using HDFS as a file store.

How to handle unsplittable 500 MB+ input files in hadoop?

I am writing a hadoop MapReduce job that is running over all source code files of a complete Debian mirror (≈ 40 GB). Since the Debian mirror data is on a separate machine and not in the hadoop cluster, the first step is to download the data.
My first implementation downloads a file and outputs key=$debian_package, value=$file_contents. The various values (typically 4) per key should then be reduced to a single entry. The next MapReduce job will then operate on debian packages as keys and all their files as values.
However, I noticed that hadoop works very poorly with output values that can sometimes be really big (700 MB is the biggest I’ve seen). In various places in the MapReduce framework, entire files are stored in memory, sometimes twice or even three times. I frequently encounter out of memory errors, even with a java heap size of 6 GB.
Now I wonder how I could split the data so that it better matches hadoop’s 64 MB block size.
I cannot simply split the big files into multiple pieces, because they are compressed (tar/bz2, tar/xz, tar/gz, perhaps others in the future). Until I shell out to dpkg-source on them to extract the package as a whole (necessary!), the files need to keep their full size.
One idea that came to my mind was to store the files on hdfs in the first MapReduce and only pass the paths to them to the second MapReduce. However, then I am circumventing hadoop’s support for data locality, or is there a way to fix that?
Are there any other techniques that I have been missing? What do you recommend?
You are correct. This is NOT a good case for Hadoop internals. Lots of copying... There are two obvious solutions, assuming you can't just untar it somewhere:
Break up the tarballs using any of several libraries that allow you to recursively read compressed and archive files (Apache VFS has limited capability for this, but the Apache compression library has more).
NFS-mount a bunch of the data nodes' local space on your master node, then fetch and untar into that directory structure, then use forqlift or a similar utility to load the small files into HDFS.
Another option is to write a utility to do this. I have done this for a client. Apache VFS and compression, truezip, then hadoop libraries to write (since I did a general purpose utility I used a LOT of other libraries, but this is the basic flow).

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is in the tens of terabytes. It's mostly for HTTP access (thus no processing of the data, or only simple processing such as resizing; that is not important, however, because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting up HTTP access to the data should be easy.
(Should have) good throughput.
I am considering:
Native network file system: this does not seem feasible because the data cannot fit on one machine.
Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestions, please?
I am the author of Weed-FS. For your requirements, Weed-FS is ideal. Hadoop cannot handle many small files: in addition to the reasons you give, each file needs to have an entry in the master. If the number of files is large, the HDFS master node cannot scale.
Weed-FS is getting faster when compiled with latest Golang releases.
Many improvements have been made to Weed-FS recently. You can now test and compare very easily with the built-in upload tool, which uploads all files under a directory recursively:
weed upload -dir=/some/directory
Now you can compare by "du -k /some/directory" to see the disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.
I suppose you would also need replication that is data-center aware, rack aware, etc. These features are in now!
Hadoop is optimized for large files; for example, its default block size is 64 MB. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems, e.g. GlusterFS.
Hadoop has a REST API for accessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing a large number of small files.
HDFS is not geared toward efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
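Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies about 150 bytes. The default block size is 64 MB. A 10 KB file does not consume a full 64 MB on disk (a block only takes up as much space as the data it holds), but it still costs the namenode a file object and a block object, so a large number of small files wastes namenode memory.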
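To make the namenode cost concrete, here is a back-of-the-envelope sketch using the 150-bytes-per-object rule of thumb quoted above. The function name `namenode_bytes` and the choice of 10 million files are assumptions for illustration; real memory use varies with Hadoop version and path lengths.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above, not an exact value

def namenode_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate namenode heap used: one object per file, per block, per directory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT

# 10 million small files, each occupying a single block.
small_files = namenode_bytes(10_000_000)

# The same data (10 million x 10 KB) packed into one file of 1526 blocks of 64 MB.
packed = namenode_bytes(1, blocks_per_file=1526)
```

Under these assumptions the small-file layout needs about 3 GB of namenode memory, while the packed layout needs a few hundred kilobytes.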
If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent job with a single input file.
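The map-task comparison can be sketched as follows. This assumes the simplest splitting behaviour (one split per block, at least one split per file, no combining of small files into shared splits); the helper name `map_tasks` is hypothetical, and real input formats may split slightly differently.

```python
KB = 1024
MB = 1024 * KB

def map_tasks(file_sizes, block_size):
    """Count map tasks assuming one split per block and at least one per file."""
    return sum(max(1, -(-size // block_size)) for size in file_sizes)

block_size = 64 * MB

one_big_file = map_tasks([1024 * MB], block_size)           # a single 1 GB file
many_small_files = map_tasks([100 * KB] * 10_000, block_size)

# How many times more tasks the framework has to schedule and track.
task_ratio = many_small_files // one_big_file
```

Both layouts hold roughly 1 GB of data, but the small-file layout schedules several hundred times as many tasks, each with its own startup and bookkeeping cost.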
At "Hadoop Summit 2011" there was a talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages, etc.) on HDFS, but they don't use the same infrastructure for images and videos. They have their own system, named Haystack, for images. It's not open source, but they shared the abstract design-level details about it.
This brings me to weed-fs: an open source project inspired by Haystack's design. It's tailor-made for storing files. I have not used it yet, but it seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
Hope this helps!
G

How big is too big for a DistributedCache file hadoop?

Are there any guidelines on whether or not to distribute a file using the distributed cache?
I have a file of size 86746785 (I use hadoop dfs -dus - don't know if this is in bytes or what). Is it a good idea to distribute this file?
The only viable answer is "it depends".
What you have to consider about using distributed cache is the file gets copied to every node that is involved in your task, which obviously takes bandwidth. Also, usually if you want the file in distributed cache, you'll keep the file in memory, so you'd have to take that into consideration.
As for your case -- yes, those are bytes. The size is roughly 86 MB, which is perfectly fine for distributed cache. Anything within a couple of hundred MB should probably still be fine.
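For reference, the size conversion and the bandwidth cost of copying the file to every node can be worked out as below. The 20-node cluster is a made-up figure for illustration, and the function names are hypothetical.

```python
size_bytes = 86746785  # value reported by hadoop dfs -dus in the question

def to_mb(n):
    """Decimal megabytes (10^6 bytes)."""
    return n / 1_000_000

def to_mib(n):
    """Binary mebibytes (2^20 bytes)."""
    return n / (1024 * 1024)

def cache_copy_traffic(file_bytes, num_nodes):
    """Total bytes transferred if the file is copied once to every node."""
    return file_bytes * num_nodes

size_mb = round(to_mb(size_bytes), 1)
size_mib = round(to_mib(size_bytes), 1)
traffic = cache_copy_traffic(size_bytes, 20)   # assumed 20-node cluster
```

Under these assumptions the file is about 86.7 MB (82.7 MiB), and distributing it to 20 nodes moves roughly 1.7 GB over the network in total.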
In addition to TC1's answer, also consider:
When/where are you going to use the file(s) and how big is your cluster?
In a scenario with many mappers and a single reducer (or a small number of reducers), where you only need the file in the reducer, I would advise against it: you might as well pull the file down yourself in the reducer (in the setup method) rather than copying it unnecessarily to every task node your mappers run on, especially if the file is large (this depends on how many nodes you have in your cluster).
How many files are you putting into the cache?
If for some reason you have hundreds of files to distribute, you're better off tar'ing them up and putting the tar file in the distributed cache's archives set (the dist cache will take care of untarring the file for you). What you're trying to avoid is this: if you didn't put them in the dist cache but loaded them directly from HDFS, you could end up with thousands of mappers and/or reducers trying to open the same file, which could cause too-many-open-files problems for the name node and data nodes.
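The bundling step itself can be sketched with Python's standard `tarfile` module. This only shows building and inspecting the archive in memory; registering it with the distributed cache is done through the Hadoop job configuration and is not shown. The file names and the helpers `bundle` and `list_bundle` are invented for the example.

```python
import io
import tarfile

def bundle(files):
    """Pack a dict of {name: bytes} into a single gzip-compressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)              # size must be set before adding
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_bundle(archive_bytes):
    """Return the sorted member names of a tar.gz archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return sorted(tar.getnames())

# 300 hypothetical small lookup files become one archive to distribute.
files = {f"lookup_{i}.txt": f"record {i}\n".encode() for i in range(300)}
archive = bundle(files)
members = list_bundle(archive)
```

One archive means one object to fetch per node instead of hundreds of separate opens against the name node and data nodes.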
The size of the Distributed Cache is 10 GB by default, but it's better to keep only a few MB of data in it; otherwise it will affect the performance of your application.

Hadoop for processing very large binary files

I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the order of a couple of hundred GB each. For a variety of fixed, implementation-specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++, so I would be considering Hadoop Pipes to stream the data in and out. Each instance will need to process of the order of 100 GB to 200 GB of its own data sequentially (currently stored in one file), and the application is currently (probably) IO limited, so it's important that each job is run entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on map reduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that to represent my data it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, these would be of the order of 10 TB for a single file, should I pack all my data into one), and that it's therefore possible to process my data using Hadoop. However, it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
This question is perhaps a duplicate of existing questions on Hadoop, with the exception that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few GB in size). So forgive me if this has been answered before, even for this size of data.
Thanks,
Alex
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you process in one batch (like hundreds), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh, or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared by all your nodes also limits the amount of copying you would have to do.
You can write a custom InputSplit for your file, but as bajafresh4life said it won't really be ideal: unless your HDFS block size is the same as your file size, your files are going to be spread all around the cluster and there will be network overhead. And if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. Bottom line is that Hadoop may not be the best tool for you.
