Best way to split computation too large to fit into memory? - memory-management

I have an operation that's running out of memory when using a batch size greater than 4 (I normally run with 32). I thought I could be clever by splitting this one operation along the batch dimension, using tf.split, running it on a subset of the batch, and then recombining using tf.concat. For some reason this doesn't work and results in an OOM error. Just to be clear, if I run on a batch size of 4 it works without splitting. If instead I run on a batch size of 32, even if I perform a 32-way split so that each individual element is run independently, I still run out of memory. Doesn't TF schedule separate operations so that they do not overwhelm the memory? If not, do I need to explicitly set up some sort of conditional dependence?

I discovered that the functional ops, specifically map_fn in this case, address my needs. By setting the parallel_iterations option to 1 (or some small number that would make the computation fit in memory), I'm able to control the degree of parallelism and avoid running out of memory.
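For reference, a minimal sketch of that pattern (assuming the TF2 Python API; heavy_op and the random batch below are hypothetical stand-ins for the real computation):
import tensorflow as tf
# hypothetical stand-in for the memory-hungry per-example computation
def heavy_op(x):
    return tf.reduce_sum(tf.square(x))
batch = tf.random.normal([32, 1024])  # hypothetical batch of 32 examples
# parallel_iterations=1 asks map_fn to process one element at a time,
# trading parallelism for a much smaller peak memory footprint
outputs = tf.map_fn(heavy_op, batch, parallel_iterations=1)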

Related

Achieving interactive large-dataset map-reduce on AWS/GCE in the least lines of code / script

I have 1 billion rows of data (about 400GB uncompressed; about 40GB compressed) that I would like to process in map-reduce style, and I have two executables (binaries, not scripts) that can handle the "map" and "reduce" steps. The "map" step can process about 10,000 rows per second, per core, and its output is approximately 1MB in size, regardless of the size of its input. The "reduce" step can process about 50MB / second (excluding IO latency).
Assume that I can pre-process the data once, to do whatever I'd like such as compress it, break it into pieces, etc. For simplicity, assume input is plain text and each row terminates with a newline and each newline is a row terminator.
Once that one-time pre-processing is complete, the goal is to be able to execute a request within 30 seconds. So, if my only bottleneck is the map job (which I don't know will really be true; it could very well be the IO), and assuming I can do all the reduce jobs in under 5 seconds, then I would need about 425 8-core computers, all processing different parts of the input data, to complete the run in time.
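As a rough sanity check on that machine count, here is the arithmetic as a tiny Python script (the figures are just the ones quoted above, so treat it as an estimate rather than a sizing recommendation):
rows = 1_000_000_000
rows_per_sec_per_core = 10_000        # throughput of the "map" binary
cores_per_machine = 8
time_budget_seconds = 30
machine_seconds_of_map = rows / (rows_per_sec_per_core * cores_per_machine)  # 12,500 s of map work
machines_needed = machine_seconds_of_map / time_budget_seconds
print(machines_needed)                # about 417, i.e. on the order of the ~425 machines quoted above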
Assuming you have the data, and the two map/reduce executables, and you have unlimited access to AWS or GCE, what is a solution to this problem that I can implement with the fewest lines of code and/or script (and not ignoring potential IO or other non-CPU bottlenecks)?
(As an aside, it would also be interesting to know what would execute with the fewest nodes, if different from the solution with the fewest SLOC)

Smaller block size for more intense jobs Hadoop

Does it make sense to use smaller blocks for jobs that perform more intensive tasks?
For example, in my Mapper I'm calculating the distance between two signals, which may take some time depending on the signal length, but on the other hand my dataset is currently not that big. That tempts me to specify a smaller block size (like 16 MB) and to increase the number of nodes in the cluster. Does that make sense?
How should I proceed? And if it is OK to use smaller blocks, how do I do it? I haven't done it before...
Whether that makes sense to do for your job can only really be known by testing the performance. There is an overhead cost associated with launching additional JVM instances, and it's a question of whether the additional parallelization is given enough load to offset that cost and still make it a win.
You can change this setting for a particular job rather than across the entire cluster. You'll have to decide what the normal use case is when deciding whether to make this a global change or not. If you want to make the change globally, put the property in the XML config file or in Cloudera Manager. If you only want it for particular jobs, put it in the job's configuration.
Either way, the smaller the value of mapreduce.input.fileinputformat.split.maxsize, the more mappers you'll get (it defaults to Integer.MAX_VALUE). That will work for any InputFormat that uses block size to determine its splits (most do, since most extend FileInputFormat).
So to max out your utilization, you might do something like this:
// never set the split size below the cluster block size
long bytesPerReducer = inputSizeInBytes / numberOfReduceTasksYouWant;
long splitSize = Math.max(CLUSTER_BLOCK_SIZE, bytesPerReducer);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);
Note that you can also increase the value of mapreduce.input.fileinputformat.split.minsize to reduce the number of mappers (it defaults to 1).

Is the number of map tasks determined by the number of nodes in Hadoop when the input is small in size, or is part of the hardware left idle?

I have a basic mapreduce question.
My input consists of many small files and I have designed a custom CombinedFileInputFormat (which is working properly).
The size of all files together is only about 100 MB for 20,000 files, but processing an individual file takes a couple of minutes (it's a heavy indexing problem), therefore I want as many map tasks as possible. Will Hadoop take care of this, or do I have to enforce it, and how? In the latter case my first guess would be to manipulate the maximum split size, but I am not sure if I am on the right track. Any help greatly appreciated! (Suggestions on how best to set the split size in the latter case are also helpful.)
Some extra information to be more clear:
There is, however, another reason I want to process multiple files per task, and that is that I want to be able to use combiners. The output of a single task only produces unique keys, but between several files there might be substantial overlap. By processing multiple files with the same map task I can implement a combiner or make use of in-mapper combining. This would definitely limit the amount of IO. The fact is that although a single file is only a couple of kilobytes in size, its output is roughly 30 * 10^6 key-value pairs, which easily adds up to a couple of gigabytes.
I don't think there is another way to allow combining (or in-mapper combining) if you have only one file per map task, is there?
Regards, Dieter
To get the best utilization for your long-running map tasks, you'll probably want each file to run in its own task rather than using your implementation of CombineInputFormat.
Using a combine input format is usually advisable when you have small files that are quickly processed, since it takes longer to instantiate the map task (JVM, config, etc.) than it does to process the file itself. You can alleviate this by configuring 'JVM reuse', but still, for CPU-bound tasks (as opposed to IO-bound tasks) you'll just want to run a map task for each input file.
You will however need your Job Tracker to have a good chunk of memory allocated to it so it can manage and track the 20k map tasks created.
Edit: In response to your updated question, if you want to use a combine input format then you'll need to set the configuration properties for min / max size per node / rack. Hadoop won't be able to do anything more intelligent than try to keep files that are data-local or rack-local together in the same map task.

General approach to count word occurrence in large number of files

This is sort of an algorithm question. To make it clear, I'm not interested in working code but in how to approach the task generally.
We have a server with 4 CPUs, and no databases. There are 100,000 HTML documents stored on disk. Each document is 2 MB in size. We need an efficient way to determine the count of the word "CAMERA" (case-insensitive) appearing in that collection.
My approach would be to:
parse the HTML documents to extract only the words,
then sort the words,
then use binary search on that collection.
In other words, I would create threads to let them use all 4 CPUs to parse the HTML documents into a single, large word-collection text file, then sort it, and then use binary search.
What do you think of this?
Have you tried grep? That's what I would do.
It will probably take some experimentation to figure out the right way to feed it that much data, and to check ahead of time that the results come out right, because the run is going to take a little while.
I would not recommend sorting that much data.
Well, this is not a complete pseudo-code answer, but I don't think there is one. To get optimal performance you need to know a LOT about your HW architecture. Here are the notes:
There is no need to sort the data at all, nor to use binary search. Just read the files (read each file sequentially from disk) and, while doing so, search for whether the word "camera" appears in each one.
The bottleneck in the program will most likely be IO (disk reads), since disk access is MUCH slower than CPU calculations. So, to optimize the program, one should focus on optimizing the disk reads.
To optimize the disk reads, one should know their architecture. For example, if you have only one disk (and no RAID), there is really no point in multi-threading, assuming the disk can only serve a single request at a time. If that is the case, use a single thread.
However, if you have multiple disks, it does not matter how many cores you have; you should spawn as many threads as there are disks (assuming the files are evenly distributed among them). Since the disks are the bottleneck, having multiple threads concurrently requesting data from them keeps all of them working and reduces the time consumption significantly.
Something like this?
htmlDocuments = getPathsOfHtmlDocuments()
threadsafe counter = new Counter(0)
scheduler = scheduler with max 4 threads
for (htmlDocument : htmlDocuments) {
    scheduler.schedule(new SearchForCameraJob("Camera", htmlDocument, counter))
}
wait while scheduler.hasUnfinishedJobs
print "Found camera " + counter + " times"
class SearchForCameraJob(searchString, pathToFile, counter) {
    document = readFile(pathToFile);
    while (document.findNext(searchString)) {
        counter.increment();
    }
}
If your documents are located on single local hard drive, you will be constrained by I/O, not CPU.
I would use a very simple approach: serially load every file into memory, scan it for the target word, and increment a counter.
If you try to use 4 threads in an attempt to speed it up (say, 25,000 files per thread), it will likely make things slower, because I/O does not like overlapping access patterns from competing processes/threads.
If, however, the files are spread across multiple hard drives, you should start as many threads as you have drives, and each thread should read data from its drive only.
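A minimal sketch of that serial scan in plain Python (the directory path is a placeholder, and HTML tags are not stripped, since a raw case-insensitive count of the target word is all that's needed):
import pathlib
import re
pattern = re.compile(r"camera", re.IGNORECASE)
total = 0
for path in pathlib.Path("/data/html").glob("*.html"):  # placeholder location of the 100,000 documents
    text = path.read_text(errors="ignore")              # read one file at a time, sequentially
    total += len(pattern.findall(text))                 # non-overlapping, case-insensitive matches
print("Found CAMERA", total, "times")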
You can use the Boyer-Moore algorithm. It is difficult to say which programming language is best suited for such an application, but you could write it in C++ so as to directly optimize your native code. Obviously you need to use multithreading.
Among HTML document parsing libraries, you could choose Xerces-C++.

How can I determine the appropriate number of tasks with GCD or similar?

I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the time each operation takes that simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split the operations into nicely sized chunks, where each task operates on one chunk. But how can I determine the appropriate number of tasks/chunks?
Testing and profiling. What makes sense, and what works well, is application-specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
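A generic sketch of that benchmarking loop in plain Python (a process pool stands in for GCD here, and do_work plus the candidate worker counts and chunk sizes are hypothetical placeholders for the real workload):
import time
from concurrent.futures import ProcessPoolExecutor
def do_work(chunk):
    # hypothetical stand-in for the real per-chunk operation
    return sum(x * x for x in chunk)
if __name__ == "__main__":
    items = list(range(1_000_000))
    for workers in (1, 2, 4, 8):
        for chunk_size in (1_000, 10_000, 100_000):
            chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
            start = time.perf_counter()
            with ProcessPoolExecutor(max_workers=workers) as pool:
                list(pool.map(do_work, chunks))
            elapsed = time.perf_counter() - start
            # items processed per second for this worker count / chunk size combination
            print(workers, chunk_size, int(len(items) / elapsed))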
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time, while some take X*3 time), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether that task will take X time, or X*3 (or something in between), you can use that information before processing the tasks themselves to dynamically adjust the worker/task size to achieve the best throughput depending on the current workload. This approach is taken with Amazon EC2, for example, where customers will spin up extra VMs when needed to handle higher load, and spin them back down when load drops.
Whatever you choose, answering any unknown speed question should almost always involve some kind of benchmarking, if the speed at which it runs is critical to the success of your application (sometimes the processing time is so small that it's negligible).
Good luck!

Resources