I am new to Hadoop programming.
I have a situation in which I want to stop writing <k3,v3> pairs to my output file after n lines.
In my program, I am sure that the output file will be sorted according to k3, but I don't want the entire list. I only want the first n.
Is there a mechanism in Hadoop to do this?
I couldn't find a class/API for this.
But you could increment a counter each time OutputCollector.collect() is called in the reduce function, and stop calling OutputCollector.collect() once the counter reaches a certain value.
It does waste CPU cycles, because the reduce task keeps running even after n lines have been written to the output. There might be a better approach to the problem.
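A minimal sketch of that idea using the old mapred API (the class name and the limit N are made up for illustration); note the count is per reduce task, so with a single reducer this caps the job's total output at N records. A plain int field is used for the check; you could additionally call reporter.incrCounter(...) if you want the count visible in the job UI.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TopNReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        private static final int N = 100;   // hypothetical limit
        private int emitted = 0;            // records written so far by this reduce task

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            if (emitted >= N) {
                return;                      // past the limit: skip collect() entirely
            }
            while (values.hasNext()) {
                IntWritable value = values.next();
                output.collect(key, value);
                emitted++;
                if (emitted >= N) {
                    break;
                }
            }
        }
    }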
A related problem:
4 KB of memory
1 GB of data
8 bytes per datum
Sequential data output, without writing to disk.
First, you can only accumulate and output a maximum of 4 KB per pass through the data, so you will need at least something like 250,000 passes (1 GB / 4 KB = 262,144).
How can you accumulate stuff in a pass?
In pseudocode the idea is like this.
    while not done:
        for each data (8 bytes) in dataset:
            if this data has never been output:
                if it might belong in the current batch:
                    add to current batch (evicting something else if needed)
        if current_batch not empty:
            sort current batch
            emit current batch
            update "never been output" filter
        else:
            done
What does that filter look like? It needs to know three things:
What is the maximum value so far emitted?
How many times has it been emitted?
How many times has it been seen on this pass?
Any value below the maximum value gets ignored. After you've seen the value enough times, you can add it to the current batch.
Now how about the current batch you're accumulating? That can be a max-heap that tells you the largest value in the batch. If the heap is not full, you add the current value; if it is full and the current value is below the maximum in the batch, you add it and evict the current max.
If the heap is laid out in an array, then when the batch is done you can repeatedly remove the max, which frees up the last slot of the array (that's how array-backed heaps work), and put the removed max there. Keep doing that and you'll heapsort the batch in place, smallest first. Now you can easily update the filter, and then emit the batch.
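As a small illustrative sketch of just the batch part (ignoring the duplicate-counting in the filter, and using boxed Longs rather than the raw 8-byte array a real 4 KB implementation would need), a bounded max-heap that keeps the K smallest values above the floor emitted so far might look like this; the class and method names are made up for this example:

    import java.util.Collections;
    import java.util.PriorityQueue;

    public class BatchAccumulator {
        private final int capacity;                // how many values fit in one batch
        private final long floor;                  // largest value emitted in earlier passes
        private final PriorityQueue<Long> batch =  // max-heap: largest value on top
                new PriorityQueue<>(Collections.reverseOrder());

        public BatchAccumulator(int capacity, long floor) {
            this.capacity = capacity;
            this.floor = floor;
        }

        // Offer one value from the current pass over the data.
        public void offer(long value) {
            if (value <= floor) {
                return;                            // already emitted in an earlier pass
            }
            if (batch.size() < capacity) {
                batch.add(value);
            } else if (value < batch.peek()) {
                batch.poll();                      // evict the current maximum
                batch.add(value);
            }
        }

        // Drain the batch in ascending order; the last value returned becomes the new floor.
        public long[] drainSorted() {
            long[] out = new long[batch.size()];
            for (int i = out.length - 1; i >= 0; i--) {
                out[i] = batch.poll();             // poll() returns the largest remaining value
            }
            return out;
        }
    }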
I don't think you can get significantly more efficient than this.
If I were asked this in an interview, I'd know the answer, but I'd also see being asked the question as a sign that the company's hiring process is suboptimal. That would make me less inclined to work there unless I could see some purpose behind why they hire this way. (I know why FAANGs do it. But at most companies I'd call it a red flag.)
I have an operation that's running out of memory when using a batch size greater than 4 (I normally run with 32). I thought I could be clever by splitting this one operation along the batch dimension, using tf.split, running it on a subset of the batch, and then recombining using tf.concat. For some reason this doesn't work and results in an OOM error. Just to be clear, if I run on a batch size of 4 it works without splitting. If instead I run on a batch size of 32, and even if I were to perform a 32-way split so that each individual element is run independently, I still run out of memory. Doesn't TF schedule separate operations so that they do not overwhelm the memory? If not do I need to explicitly set up some sort of conditional dependence?
I discovered that the functional ops, specifically map_fn in this case, address my needs. By setting the parallel_iterations option to 1 (or some small number that would make the computation fit in memory), I'm able to control the degree of parallelism and avoid running out of memory.
I have a requirement as follows:
a. Let's say I have 100 GB of data in a file.
b. I have written a MapReduce job to process this data with certain logic.
c. I launched the MapReduce job, but it failed after reading 50 GB.
So my question is:
Can I resume the MapReduce job from the 51st GB?
Please let me know if anybody has an idea of how to do this; I don't want to reprocess the data that was already processed before the point of failure.
Thanks in advance
Brief answer: no.
And that's why working with large batch-processing systems such as Hadoop or MPI is hard. Restarting large jobs is not only inefficient from a resource-consumption point of view, it is also psychologically depressing. That's why your primary goal should be to reduce the running time of a single job to no more than a couple of hours. Maybe some day it will be possible to "pause" jobs and "hot fix" code, but currently that is not supported to my knowledge.
Solution #1. Split your job into an error-prone parallelizable job and a final error-free non-parallelizable job. Consider the following example: you have hundreds of gigabytes of textual access logs from a web server, and you want to write a job that reports how popular different browsers are. If you combine parsing and aggregating (summing) into a single huge job, its running time will be on the order of days, and the chance that it fails is very high, because textual logs are usually hard to parse due to ambiguity. A much better idea is to split it into two separate jobs:
The first job is solely responsible for parsing the log files. It emits only the browser string as its output and doesn't even need any reducers. This job is where 99% of all errors live, because this is where the "wild" data gets parsed. It is parallelizable in the sense that you can split your input into chunks and process each chunk separately, so that each chunk is handled in 10-30 minutes. If the job fails for some chunk, you fix it and restart; 30 minutes is not a big loss.
The second job is the grand job that takes the outputs of the first job's instances and performs the aggregation. Because the aggregation code is very simple, this job is not likely to fail.
chunk(20G)->parse-job(20G)->browsers(0.5G)
chunk(20G)->parse-job(20G)->browsers(0.5G)
input(1T)->chunk(20G)->parse-job(20G)->browsers(0.5G)->aggregate-job->output
... .... ...
chunk(20G)->parse-job(20G)->browsers(0.5G)
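A minimal sketch of what the first (parse-only) job above could look like, using the old mapred API; the class name, log format, and user-agent extraction are placeholders, not the answer's actual code:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ParseLogMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            String browser = extractBrowser(line.toString());   // placeholder parser
            if (browser != null) {
                output.collect(new Text(browser), NullWritable.get());
            }
        }

        // Hypothetical helper: real user-agent parsing is where most of the errors live.
        private String extractBrowser(String logLine) {
            String[] fields = logLine.split("\"");
            return fields.length > 5 ? fields[5] : null;         // naive guess at the UA field
        }
    }

The driver would call setNumReduceTasks(0) on the JobConf, so the parsed browser strings are written straight to the output, one file per chunk, ready for the aggregation job.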
Solution #2. Sometimes you may be satisfied with the result even if parts of the input data are dropped. In this case you can set the options mapred.max.map.failures.percent and/or mapred.max.reduce.failures.percent to non-zero values.
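For example, a sketch of Solution #2 with the old mapred API (the class name and the 5% values are purely illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class FailureTolerantJobConfig {
        public static JobConf configure(JobConf conf) {
            // Allow up to 5% of map tasks and 5% of reduce tasks to fail
            // without failing the whole job.
            conf.setInt("mapred.max.map.failures.percent", 5);
            conf.setInt("mapred.max.reduce.failures.percent", 5);
            return conf;
        }
    }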
If your entire job fails, the output gets cleared, so you lose whatever you processed. However, Hadoop retries failed tasks of a job. So as long as your failure is recoverable within the preconfigured number of attempts, the job will not fail and you will not lose the output from the already completed tasks.
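For reference, a sketch of how that per-task retry limit is configured, using the old mapred property names (4 is the usual default; the class name is made up):

    import org.apache.hadoop.mapred.JobConf;

    public class RetryConfig {
        public static void setAttempts(JobConf conf) {
            conf.setInt("mapred.map.max.attempts", 4);    // attempts per map task before the job fails
            conf.setInt("mapred.reduce.max.attempts", 4); // attempts per reduce task before the job fails
        }
    }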
If your failure is not recoverable, then in most cases it is your fault, and you might need to do one or more of the following:
Fix your code; even a simple bug may cause all your tasks to fail consistently.
Use fewer resources (e.g., take care with available memory).
Partition the problem better (check whether some tasks are fed more data than others, or make sure task input is being split into smaller chunks).
Increase cluster capacity.
I called the take() method of an RDD[LabeledPoint] from spark-shell, which seemed to be a laborious job for Spark.
The spark-shell shows a progress bar:
The progress bar fills again and again, and I don't know how to produce a reasonable estimate of the time needed (or the total progress) from those numbers.
Does anyone know what those numbers mean?
Thanks in advance.
The numbers show the Spark stage that is running, and the number of completed, in-progress, and total tasks in the stage. (See "What do the numbers on the progress bar mean in spark-shell?" for more on the progress bar.)
Spark stages run tasks in parallel. In your case 5 tasks are running in parallel at the moment. If each task takes roughly the same time, this should give you an idea of how much longer you have to wait for this stage to finish.
But RDD.take can take more than one stage. take(1) will first get the first element of the first partition. If the first partition is empty, it will take the first elements from the second, third, fourth, and fifth partitions. The number of partitions it looks at in each stage is 4× the number of partitions already checked. So if you have a whole lot of empty partitions, take(1) can take many iterations. This can be the case for example if you have a large amount of data, then do filter(_.name == "John").take(1).
If you know your result will be small, you can save time by using collect instead of take(1). This will always gather all the data in a single stage. The main advantage is that in this case all the partitions will be processed in parallel, instead of the somewhat sequential manner of take.
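As a hedged illustration of that suggestion, written against the Java RDD API for consistency with the other examples here (spark-shell itself uses the Scala API, where the same calls exist; the RDD name and the filter predicate are made up):

    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;

    public class TakeVsCollect {
        public static void compare(JavaRDD<String> names) {
            JavaRDD<String> johns = names.filter(n -> n.equals("John"));

            // take(1) may need several stages if the early partitions are empty,
            // scanning roughly 4x more partitions on each retry.
            List<String> first = johns.take(1);

            // collect() runs a single stage with all partitions processed in
            // parallel -- fine when you know the filtered result is small.
            List<String> all = johns.collect();

            System.out.println(first + " vs " + all.size() + " matches");
        }
    }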
I have a basic mapreduce question.
My input consists of many small files, and I have designed a custom CombineFileInputFormat (which is working properly).
The size of all the files together is only about 100 MB for 20,000 files, but processing an individual file takes a couple of minutes (it's a heavy indexing problem), therefore I want as many map tasks as possible. Will Hadoop take care of this, or do I have to enforce it, and how? In the latter case my first guess would be to manipulate the maximum split size, but I am not sure if I am on the right track. Any help is greatly appreciated! (Suggestions on how best to set the split size in the latter case are also helpful.)
Some extra information to be more clear:
There is, however, another reason I wanted to process multiple files per task, and that is that I want to be able to use combiners. The output of a single task only produces unique keys, but between several files there might be substantial overlap. By processing multiple files with the same map task I can implement a combiner or make use of in-mapper combining. This would definitely limit the amount of IO. The fact is that although a single file has a size of a couple of kilobytes, the output of this file is roughly 30 * 10^6 key-value pairs, which easily leads to a couple of gigabytes.
I don't think there is another way to allow combining (or in-mapper combining) if you have only one file per map task, is there?
Regards, Dieter
To get the best utilization for your long-running map tasks, you'll probably want each file to run in its own task rather than using your implementation of CombineFileInputFormat.
Using the combine input format is usually advisable when you have small files that are quickly processed, as it takes longer to instantiate the map task (JVM, config, etc.) than it does to process the file itself. You can alleviate this by configuring 'JVM reuse', but still, for CPU-bound tasks (as opposed to IO-bound tasks) you'll just want to run a map task for each input file.
You will, however, need your JobTracker to have a good chunk of memory allocated to it so it can manage and track the 20k map tasks created.
Edit: In response to your updated question, if you want to use the combine input format then you'll need to set the configuration properties for min/max split size per node/rack. Hadoop won't be able to do anything more intelligent than try to keep files that are data-local or rack-local together in the same map task.
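A sketch of those properties using the old-style names (the sizes are illustrative, and the class name is made up; a smaller max split size yields more map tasks, a larger one fewer):

    import org.apache.hadoop.mapred.JobConf;

    public class CombineSplitConfig {
        public static void setSplitSizes(JobConf conf) {
            conf.setLong("mapred.max.split.size", 16 * 1024 * 1024);          // at most 16 MB of input per split
            conf.setLong("mapred.min.split.size.per.node", 8 * 1024 * 1024);  // combine node-local leftovers above 8 MB
            conf.setLong("mapred.min.split.size.per.rack", 8 * 1024 * 1024);  // combine rack-local leftovers above 8 MB
        }
    }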