How to avoid Parquet MemoryManager exception - hadoop

I'm generating some Parquet (v1.6.0) output from a Pig (v0.15.0) script. My script takes several input sources and joins them with some nesting. The script runs without error, but then during the STORE operation I get:
2016-04-19 17:24:36,299 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=FAILED, progress=TotalTasks: 249 Succeeded: 220 Running: 0 Failed: 1 Killed: 28 FailedTaskAttempts: 43, diagnostics=Vertex failed, vertexName=scope-1446, vertexId=vertex_1460657535752_15030_1_18, diagnostics=[Task failed, taskId=task_1460657535752_15030_1_18_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:parquet.hadoop.MemoryManager$1: New Memory allocation 134217728 exceeds minimum allocation size 1048576 with largest schema having 132 columns
at parquet.hadoop.MemoryManager.updateAllocation(MemoryManager.java:125)
at parquet.hadoop.MemoryManager.addWriter(MemoryManager.java:82)
at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:104)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:309)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:81)
at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:398)
...
The above exception was thrown when I executed the script using -x tez, but I get the same exception when using mapreduce. I have tried to increase parallelism using SET default_parallel, as well as adding an ORDER BY operation (unnecessary with respect to my real objectives) just prior to my STORE operations, to ensure Pig has an opportunity to ship data off to different reducers and minimize the memory required on any given reducer. Finally, I've tried pushing up the available memory using SET mapred.child.java.opts. None of this has helped, however.
Is there something I'm just missing? Are there known strategies for avoiding the issue of one reducer carrying too much of the load and causing things to fail during write? I've experienced similar issues writing to avro output that appear to be caused by insufficient memory to execute the compression step.
EDIT: per this source file, the issue seems to boil down to the fact that memAllocation / nCols < minMemAllocation. With the numbers from the error above, that is 134217728 / 132 ≈ 1016800 bytes per column, which falls below the 1048576-byte minimum. However, the memory allocation seems unaffected by the mapred.child.java.opts setting I tried out.

I finally solved this using the parameter parquet.block.size. The default value (see source) is big enough to write a file up to 128 columns wide, but no bigger. The solution in Pig was to use SET parquet.block.size x; where x >= y * 1024^2 and y is the number of columns in your output.
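For example, for the 132-column schema from the error above, the statements in Pig would look something like this (the relation name, output path, and storer invocation are illustrative, not taken from the original script):
-- 132 columns * 1048576 bytes = 138412032
SET parquet.block.size 138412032;
STORE wide_result INTO 'output_dir' USING parquet.pig.ParquetStorer;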

Related

Memory problems when running Stanford NLP (Stanford Segmenter)

I downloaded the Stanford Segmenter and I am following the instructions, but I am getting a memory error; the full message is here:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.shapeOf(Sighan2005DocumentReaderAndWriter.java:230)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.access$300(Sighan2005DocumentReaderAndWriter.java:49)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter$CTBDocumentParser.apply(Sighan2005DocumentReaderAndWriter.java:169)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter$CTBDocumentParser.apply(Sighan2005DocumentReaderAndWriter.java:114)
at edu.stanford.nlp.objectbank.LineIterator.setNext(LineIterator.java:42)
at edu.stanford.nlp.objectbank.LineIterator.<init>(LineIterator.java:31)
at edu.stanford.nlp.objectbank.LineIterator$LineIteratorFactory.getIterator(LineIterator.java:108)
at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.getIterator(Sighan2005DocumentReaderAndWriter.java:86)
at edu.stanford.nlp.objectbank.ObjectBank$OBIterator.setNextObjectHelper(ObjectBank.java:435)
at edu.stanford.nlp.objectbank.ObjectBank$OBIterator.setNextObject(ObjectBank.java:419)
at edu.stanford.nlp.objectbank.ObjectBank$OBIterator.<init>(ObjectBank.java:412)
at edu.stanford.nlp.objectbank.ObjectBank.iterator(ObjectBank.java:250)
at edu.stanford.nlp.sequences.ObjectBankWrapper.iterator(ObjectBankWrapper.java:45)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1193)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1137)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1091)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3023)
Before executing the file I tried increasing the heap space with export JAVA_OPTS=-Xmx4000m. I also tried splitting the file, but still had the same error - I split the file into 8 chunks of around 15 MB each. What should I do to fix the memory problem?
The segment.sh script that ships with the segmenter limits the memory to 2G, which is probably the cause of the error. Editing that file will hopefully fix the issue for you.
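As a hedged illustration (the exact invocation inside segment.sh varies between releases, and the classpath below is a placeholder), the edit amounts to raising the heap flag on the java command in that script:
# before (approximate): java -mx2g -cp <segmenter classpath> edu.stanford.nlp.ie.crf.CRFClassifier ...
# after, e.g. 4 GB:     java -mx4g -cp <segmenter classpath> edu.stanford.nlp.ie.crf.CRFClassifier ...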

runtime: failed to create new OS thread

On a 54-core machine, I use os/exec to spawn hundreds of client processes and manage them with an abundance of goroutines.
Sometimes, but not always, I get this:
runtime: failed to create new OS thread (have 1306 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
My ulimit is pretty high already:
$ ulimit -u
1828079
There's never a problem if I limit myself to, say, 54 clients.
Is there a way I can handle this situation more gracefully? E.g. not bomb out with a fatal error, but do less (or delayed) work instead? Or query the system ahead of time and anticipate the maximum amount of work I can take on (I don't just want to limit myself to the number of cores, though)?
Given my large ulimit, should this error even be happening? grep -c goroutine on the stack output following the fatal error only gives 6087. Each client process (of which there are certainly fewer than 2000) might have a few goroutines of its own, but nothing crazy.
Edit: the problem only occurs on high-core machines (~60). Keeping everything else constant and just changing the number of cores down to 30 (this being an OpenStack environment, so the same underlying hardware still being used), these runtime errors don't occur.
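For what it's worth, one graceful option, building on the observation above that capping the client count avoids the failure, is a buffered-channel semaphore around the process launches. The sketch below is generic: the limit, the loop count, and the ./client command are placeholders, not from the original program.
package main

import (
    "log"
    "os/exec"
    "sync"
)

func main() {
    const maxConcurrent = 54 // placeholder: a limit known to be safe on this machine
    sem := make(chan struct{}, maxConcurrent)
    var wg sync.WaitGroup

    for i := 0; i < 500; i++ { // placeholder: total number of clients to run
        wg.Add(1)
        sem <- struct{}{} // block here until a slot frees up
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot when the client exits
            // placeholder command; substitute the real client invocation
            if err := exec.Command("./client").Run(); err != nil {
                log.Printf("client %d: %v", id, err)
            }
        }(i)
    }
    wg.Wait()
}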

Trying to sort a very large dataset using PROC SORT and it is throwing unusual error

Running this piece of code on Linux; the dataset is on a mainframe and has 60,000,000+ obs...
proc sort data=test_history force;
by acct score;
run;
I'm getting this following error...
NOTE: There were 67397829 observations read from the data set test_HISTORY.
ERROR: Failure while merging sorted runs from utility file 1 to final output.
ERROR: Failure encountered during external sort.
ERROR: Attempt to communicate with server AMDAHL refused by server. The current request failed.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check statements. This might cause NOTE: No observations in data set.
WARNING: The data set test_HISTORY may be incomplete. When this step was stopped there were 20002488 observations and 148 variables.
ERROR: The connection to server AMDAHL has been lost. The current request failed. This error may reoccur on subsequent requests.
Refer to this SUGI paper.
Several options are indicated there to reduce the probability of getting an error when PROC SORTing a large dataset in the mainframe environment. I have pasted one option below.
This code limits the number of sort work areas: use the SORTWKNO option either as a global option or as a PROC SORT option. It determines the maximum number of sort work areas that PROC SORT is allowed to use.
options SORTWKNO=3;
proc sort data=test_history SORTWKNO=5;
  by acct score;
run;

Cuda kernel launch failure

I am trying to call two kernels as shown below
for (t = 0; t <= time_total; t++)
{
    // kernel calls
    kernel1<<<noOfBlocks, noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
    checkCudaError(cudaThreadSynchronize());
    kernel2<<<noOfBlocks, noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
    checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now, when I execute this code, checkCudaError does not report anything and the code executes, giving some output, but Visual Studio gives the following exception:
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check on Nsight it says kernel 2 is having the following error
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that the var array in kernel 2 has some rows correct, some that are copies of other rows' values, and some that are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that Nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
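In case it helps, here is a minimal sketch of launch checking (the CHECK_CUDA macro and its layout are mine, not the checkCudaError from the question): check the launch itself with cudaGetLastError() and the kernel's execution with a synchronize, and report both.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After each launch:
//   kernel2<<<noOfBlocks, noOfThreadsPerBlock>>>(/* params */);
//   CHECK_CUDA(cudaGetLastError());       // catches launch errors, e.g. out of resources
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches errors raised while the kernel runs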
Now, regarding your issue: out of resources is frequently due to a kernel requesting too many registers (registers per thread times the number of threads per threadblock requested exceeding the per-block limit). Try recompiling your code with -Xptxas -v to get verbose register-usage output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
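For example (the source and output file names here are placeholders):
nvcc -Xptxas -v -o app kernels.cu                    # report per-kernel register and shared-memory usage
nvcc -Xptxas -v -maxrregcount 20 -o app kernels.cu   # cap registers per thread, for testing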
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.

Hadoop Benchmark: TestDFSIO

I am testing my Hadoop configuration with the Apache-provided benchmark TestDFSIO. I'm running it according to this tutorial (resource 1):
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/#testdfsio
The usage of the test is as follows:
TestDFSIO.0.0.4
Usage: hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO
           -read | -write | -clean
           [-nrFiles N] [-fileSize MB]
           [-resFile resultFileName] [-bufferSize Bytes]
I'm a little confused about some of the flags. Specifically, what is the bufferSize flag for? Also, while navigating HDFS after the job completed successfully (I first performed a TestDFSIO write), I couldn't find the file I supposedly created by choosing a resultFileName. Why can't I find the file by the resultFileName I used?
I had also looked at this page (resource 2) (specifically page 25):
http://wr.informatik.uni-hamburg.de/_media/research/labs/2009/2009-12-tien_duc_dinh-evaluierung_von_hadoop-report.pdf
As one of the parameters of their test, they were using block sizes of 64MB and 128MB. I tried putting '64MB' (converted to bytes) after the bufferSize flag, but this led to a failed job, which leads me to believe I do not understand what the bufferSize flag is for, or how to use different block sizes for testing. How do you change the block size of the test (as per resource 2)?
What is the buffer size flag for?
The buffer size flag describes the length of the write buffer in bytes. See the WriteMapper constructor in TestDFSIO.java:
public WriteMapper() {
  for(int i = 0; i < bufferSize; i++)
    buffer[i] = (byte)('0' + i % 50);
}
Here, data is generated and written to the buffer in memory before being written to disk. When it's written to disk later, it's all written in one step rather than one step per byte. Fewer writes often means better performance, so a larger buffer might improve performance.
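To see the effect, you could compare write runs at different buffer sizes; the file count, file size, and result-file paths below are arbitrary:
hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 -bufferSize 1048576 -resFile /tmp/dfsio_1mb.txt
hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 -bufferSize 8388608 -resFile /tmp/dfsio_8mb.txt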
Why can't I find the file by the resultFileName I used?
Results are usually automatically written to /benchmarks/TestDFSIO. If you don't find it there, search for mapred.output.dir in your job log.
How do you change the block size of the test (as per resource 2)?
Block size can be passed as a parameter as a generic option. Try something like:
hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -D dfs.block.size=134217728 -write
Why can't I find the file by the resultFileName I used?
You have probably seen a line like this at the end of the job execution log:
java.io.FileNotFoundException: File does not exist: /benchmarks/TestDFSIO/io_write/part-00000
When dealing with TestDFSIO this usually means that LZO or some other compression is in use (so there is something extra appended to the filename).
So instead of looking for
/benchmarks/TestDFSIO/io_write/part-00000
try this (note the * wildcard at the end):
hadoop fs -ls /benchmarks/TestDFSIO/io_write/part-00000*
For the question "How do you change the block size of the test (as per resource 2)?", try this:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-*test*.jar TestDFSIO -write -nrFiles 4 -fileSize 250GB -resFile /tmp/TestDFSIOwrite.txt
