The approach for processing a large file on Spark - Hadoop

When I process a large file on a Spark cluster, an out-of-memory error occurs. I know I can increase the heap size, but in the general case I don't think that is a good method. I am curious whether splitting the large file into small files and processing them in batches is a good choice, so that we process small files in batches instead of one large file.

I have encountered the OOM problem too. Spark computes in memory, so the data, the intermediate results and so on are all stored in memory. I think cache or persist will be helpful. You can set the storage level to MEMORY_AND_DISK_SER.
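The batching idea from the question does not require physically splitting the file. As a minimal sketch in plain Python rather than Spark (the helper name iter_batches, the sample data and the batch size are illustrative assumptions, not anything from the thread), a large file can be streamed in fixed-size groups of lines so that only one batch is held in memory at a time:

```python
import itertools
import os
import tempfile


def iter_batches(path, batch_size):
    """Yield lists of at most batch_size lines, reading the file lazily."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            batch = list(itertools.islice(handle, batch_size))
            if not batch:
                return
            yield batch


# Build a small stand-in for the "large" file so the sketch is runnable.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.writelines(f"record {i}\n" for i in range(10))
    sample_path = tmp.name

# Only one batch of lines is materialized at any moment.
batch_sizes = [len(batch) for batch in iter_batches(sample_path, batch_size=4)]
os.remove(sample_path)
print(batch_sizes)
```

Spark already reads its input partition by partition rather than all at once, so this only illustrates the general idea of bounded-memory processing.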

Related

Will sequence files help improve performance for reading from HDFS compared to the local file system?

I want to compare the performance of HDFS and the local file system for 1000 small files (1-2 MB each). Without using sequence files, HDFS takes almost double the time to read the 1000 files compared to the local file system.
I heard of sequence files here - Small Files Problem in HDFS
I want to show better response time for HDFS for retrieving these records than Local FS. Will sequence files help or should I look for something else? (HBase maybe)
Edit: I'm using a Java program to read the files, as described in HDFS Read though Java
Yes, for simple file retrieval, grabbing a single sequence file will be much quicker than grabbing 1000 files. When reading from HDFS you incur much more overhead, including spinning up the JVM (assuming you're using hadoop fs -get ...), getting the location of each of the files from the NameNode, as well as network time (assuming you have more than one datanode).
A sequence file can be thought of as a form of container. If you put all 1000 files into a sequence file, you only need to grab about 32 blocks (roughly 2000 MB of data with a 64 MB block size) rather than 1000 files. This will reduce location lookups and the total number of network connections made. You do run into another issue at this point when reading the sequence file: it is a binary format.
HBase is better suited for low-latency and random reads, so it may be a better option for you. Keep in mind that disk seeks still occur (unless you're working from memory), so reading a bunch of small files locally may be a better solution than using HDFS as a file store.
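The 32-block figure in this answer can be checked with a little arithmetic. This is a rough sketch that assumes every file is at the upper end of the 1-2 MB range from the question and ignores any sequence file header or key overhead:

```python
import math

BLOCK_SIZE_MB = 64      # block size assumed in the answer
FILE_COUNT = 1000       # number of small files from the question
MAX_FILE_SIZE_MB = 2    # upper end of the 1-2 MB range (an assumption)

total_mb = FILE_COUNT * MAX_FILE_SIZE_MB
blocks_needed = math.ceil(total_mb / BLOCK_SIZE_MB)

print(f"{total_mb} MB of data -> {blocks_needed} blocks instead of {FILE_COUNT} files")
```

With smaller files the block count drops further, while the number of separate files to look up stays at 1000 if they are not packed together.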

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is about tens of terabytes. It's mostly for HTTP access (so no processing of the data, or only simple processing such as resizing; however, that is not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting HTTP access on the data could be done easily.
(Should have) good throughput.
I am considering:
Native network file system: this does not seem feasible because the data cannot fit on one machine.
Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestions, please?
I am the author of Weed-FS. For your requirement, Weed-FS is ideal. Hadoop cannot handle many small files: in addition to your reasons, each file needs to have an entry in the master. If the number of files is large, the HDFS master node cannot scale.
Weed-FS is getting faster when compiled with the latest Golang releases.
Many new improvements have been made to Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. This one uploads all files recursively under a directory:
weed upload -dir=/some/directory
Now you can compare "du -k /some/directory", which shows the original disk usage, with "ls -l /your/weed/volume/directory", which shows the Weed-FS disk usage.
I suppose you would also need replication with data center and rack awareness, etc. Those are in now!
Hadoop is optimized for large files; e.g. its default block size is 64 MB. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems, e.g. GlusterFS.
Hadoop has a REST API for accessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing a large number of small files.
HDFS is not geared towards efficiently accessing small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
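Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies about 150 bytes. The block size is 64 MB. A 10 KB file does not take up a full 64 MB on disk (a block only uses as much disk space as the data it holds), but it still costs a file object and a block object in the namenode's memory, so a large number of small files wastes namenode memory.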
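As a rough worked example of the 150-byte figure above, the namenode cost of a set of small files can be estimated by counting one file object plus one block object per file. The file count of ten million is an illustrative assumption, and directory objects are ignored:

```python
BYTES_PER_OBJECT = 150  # approximate per-object cost quoted above


def namenode_bytes(file_count, blocks_per_file=1):
    """Rough namenode heap estimate: one file object plus its block objects."""
    objects = file_count * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Ten million small files, each fitting in a single block (hypothetical count).
small_files_bytes = namenode_bytes(10_000_000)
print(f"{small_files_bytes / 1e9:.1f} GB of namenode memory")
```

The estimate grows linearly with the number of files regardless of how small each file is, which is why the file count, not the data volume, is the limiting factor.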
If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent one with a single input file.
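The comparison above can be made concrete by counting map tasks. This sketch assumes the simple rule of one map task per block and at least one per file, with no input format that combines small files into shared splits:

```python
import math

BLOCK_MB = 64


def map_tasks(file_sizes_mb, block_mb=BLOCK_MB):
    """One map task per block, and at least one per file (simple splitting assumed)."""
    return sum(max(1, math.ceil(size / block_mb)) for size in file_sizes_mb)


one_large_file = map_tasks([1024])            # a single 1 GB file
many_small_files = map_tasks([0.1] * 10_000)  # 10,000 files of roughly 100 KB

print(one_large_file, many_small_files)
```

Both inputs hold about 1 GB of data, but the small-file layout schedules several hundred times as many tasks, each paying its own startup and bookkeeping cost.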
In "Hadoop Summit 2011", there was a talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc.) on HDFS, but they don't use the same infrastructure for images and videos. They have their own system named Haystack for images. It's not open source, but they have shared the abstract design-level details about it.
This brings me to weed-fs: an open source project inspired by Haystack's design. It's tailor-made for storing files. I have not used it so far, but it seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
Hope this helps!
G

How big is too big for a DistributedCache file hadoop?

Are there any guidelines for whether or not to distribute a file using the distributed cache?
I have a file of size 86746785 (I use hadoop dfs -dus - don't know if this is in bytes or what). Is it a good idea to distribute this file?
The only viable answer is "it depends".
What you have to consider about using the distributed cache is that the file gets copied to every node that is involved in your task, which obviously takes bandwidth. Also, if you want the file in the distributed cache, you'll usually keep the file in memory, so you'd have to take that into consideration.
As for your case -- yes, those are bytes. The size is roughly 87 MB, which is perfectly fine for the distributed cache. Anything up to a couple of hundred MB should probably still be fine.
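For reference, the byte count from the question converts as follows. The sketch shows both decimal megabytes and binary mebibytes, since tools differ in which one they report:

```python
SIZE_BYTES = 86746785  # value reported in the question

size_mb = SIZE_BYTES / 10**6    # decimal megabytes
size_mib = SIZE_BYTES / 2**20   # binary mebibytes

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```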
In addition to TC1's answer, also consider:
When/where are you going to use the file(s) and how big is your cluster?
In a scenario with many mappers and a single reducer (or a small number of reducers), where you only need the file in the reducer, I would advise against it: you might as well pull down the file yourself in the reducer (in the setup method) rather than copying it unnecessarily to every task node your mappers run on, especially if the file is large (this depends on how many nodes you have in your cluster).
How many files are you putting into the cache?
If for some reason you have hundreds of files to distribute, you're better off tar'ing them up and putting the tar file in the distributed cache's archives set (the dist cache will take care of untarring the file for you). The thing you're trying to avoid here is that, if you didn't put them in the dist cache but loaded them directly from HDFS, you may run into a scenario where you have thousands of mappers and/or reducers trying to open the same file, which could cause "too many open files" problems for the name node and data nodes.
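The bundling step described above can be sketched with Python's standard tarfile module. The directory layout and file names are hypothetical, and uploading the archive to HDFS and registering it with the distributed cache are not shown:

```python
import os
import tarfile
import tempfile


def bundle_directory(source_dir, archive_path):
    """Pack every regular file in source_dir into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            if os.path.isfile(full_path):
                # Store files at the archive root so they unpack flat.
                archive.add(full_path, arcname=name)
    return archive_path


# Create a few hypothetical lookup files so the sketch runs on its own.
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, "lookups")
os.mkdir(source_dir)
for i in range(5):
    with open(os.path.join(source_dir, f"lookup_{i}.txt"), "w") as handle:
        handle.write(f"value {i}\n")

archive_path = bundle_directory(source_dir, os.path.join(work_dir, "lookups.tar.gz"))
with tarfile.open(archive_path) as archive:
    member_names = archive.getnames()
print(member_names)
```

Shipping one archive means each node fetches a single file instead of hundreds of separate ones.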
The size of the distributed cache is 10 GB by default, but it's better to keep only a few MB of data in it; otherwise it will affect the performance of your application.

What does "local caching of data" mean in the context of this article?

The following paragraphs (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially read files are not suitable for local caching, but I don't understand what "local" means here.
In my opinion there are two possibilities: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory so that clients can access it quickly. Can anyone explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.
Any real MapReduce job is probably going to process gigabytes (10s/100s/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file / block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, more so given that multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is read and ultimately replaces it in the cache.
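The argument above can be illustrated with a toy simulation. This is a sketch under stated assumptions: a hypothetical LRU block cache that is much smaller than the data set, and readers that scan every block from start to end. It does not model any real HDFS component:

```python
from collections import OrderedDict


def sequential_scan_hit_rate(num_blocks, cache_capacity, passes):
    """Simulate an LRU block cache under repeated start-to-end scans."""
    cache = OrderedDict()
    hits = 0
    reads = 0
    for _ in range(passes):
        for block_id in range(num_blocks):
            reads += 1
            if block_id in cache:
                hits += 1
                cache.move_to_end(block_id)
            else:
                cache[block_id] = True
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict the least recently used block
    return hits / reads


# 100 blocks scanned three times through a cache that holds only 10 of them.
hit_rate = sequential_scan_hit_rate(num_blocks=100, cache_capacity=10, passes=3)
print(hit_rate)
```

Because each block is evicted long before the scan comes back around to it, the cache never produces a hit unless it is large enough to hold the whole data set.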

Hadoop for processing very large binary files

I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the order of a couple of hundred GB. For a variety of fixed, implementation-specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++, so I would be considering Hadoop pipes to stream the data in and out. Each instance will need to process of the order of 100 GB to 200 GB of its own data sequentially (currently stored in one file), and the application is currently (probably) I/O limited, so it's important that each job is run entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on map reduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that for representing my data it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, these would be of the order of 10 TB for a single file, should I pack all my data into one), and that it's therefore possible to process my data using Hadoop. However, it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
This question is perhaps a duplicate of existing questions on Hadoop, with the exception that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few GB in size). So forgive me if this has been answered before - even for this size of data.
Thanks,
Alex
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you process in one batch (like hundreds), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh, or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared across all your nodes can also limit the amount of copying you would have to do.
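A minimal sketch of the ssh-based approach might look like the following. The host names, file paths and executable are hypothetical placeholders, and the commands are only built and printed here, not executed:

```python
import shlex


def plan_jobs(files, hosts, command):
    """Assign files to hosts round-robin and build one ssh command per file."""
    jobs = []
    for index, path in enumerate(files):
        host = hosts[index % len(hosts)]
        remote = f"{command} {shlex.quote(path)}"
        jobs.append(["ssh", host, remote])
    return jobs


# Hypothetical host names, file paths and executable, for illustration only.
hosts = ["node1", "node2"]
files = ["/data/a.bin", "/data/b.bin", "/data/c.bin"]
jobs = plan_jobs(files, hosts, "/opt/app/process")

for job in jobs:
    print(" ".join(job))
```

Each argument list could be handed to subprocess.Popen to start the remote processes; round-robin assignment is the simplest choice and ignores where the data actually resides.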
You can write a custom InputSplit for your file, but as bajafresh4life said, it won't really be ideal: unless your HDFS chunk size is the same as your file size, your files are going to be spread all around and there will be network overhead. And if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. Bottom line is that Hadoop may not be the best tool for you.

Resources