java.lang.OutOfMemoryError: GC overhead limit exceeded using R - rstudio

Running a script in RStudio where I attempt to write an output list in '.xlsx'.
The data is 79k rows x 12 columns.
I see the following message:
Error in .jnew("org/apache/poi/xssf/usermodel/XSSFWorkbook") :
java.lang.OutOfMemoryError: GC overhead limit exceeded
This is not a massive file, can someone suggest a solution please?

Related

Copy between Elastices | REQUEST_ENTITY_TOO_LARGE

I'm trying to copy indexes between AWS Opensearch(Elastic engine v6.8) to AWS Opensearch (Opensearch engine v2.3).
I'm using elasticdump to copy the indexes.
I got the foloowing error while in one of the indexes:
Error Emitted => {"Message":"Request size exceeded 104857600 bytes"}
dump ended with error (get phase) => REQUEST_ENTITY_TOO_LARGE: {"Message":"Request size exceeded 104857600 bytes"}
I read that in AWS Opensearch it's limit to 100MB per wrrting to the Opensearch.
I tried to reduce the limit to 10 (using --limit) and still failed at the same point.
Only when I set the limit to 1 it's works.
the question if someone can help me to understand if there is any parameter that I missing or I need a combination of parameters to get it works.
Thanks!!

Kafka Connect JDBC OOM - Large Amount of Data

I am trying to implement something similar to this tutorial. However, it worked because the data set is very small. How would I do this for a larger table? Because I keep gettting an out of memory error. My logs are
ka.connect.runtime.rest.RestServer:60)
[2018-04-04 17:16:17,937] INFO [Worker clientId=connect-1, groupId=connect-cluster] Marking the coordinator ip-172-31-14-140.ec2.internal:9092 (id: 2147483647 rack: null) dead (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:341)
[2018-04-04 17:16:17,938] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:218)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,939] ERROR Uncaught exception in thread 'kafka-coordinator-heartbeat-thread | connect-sink-redshift': (org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread:51)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:65)
[2018-04-04 17:16:17,940] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,940] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask:96)
[2018-04-04 17:16:17,941] INFO WorkerSourceTask{id=production-db-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:306)
[2018-04-04 17:16:17,940] ERROR Unexpected exception in Thread[KafkaBasedLog Work Thread - connect-statuses,5,main] (org.apache.kafka.connect.util.KafkaBasedLog:334)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,946] INFO WorkerSourceTask{id=production-db-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:323)
[2018-04-04 17:16:17,954] ERROR WorkerSourceTask{id=production-db-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,960] ERROR WorkerSourceTask{id=production-db-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,960] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:341)
[2018-04-04 17:16:17,960] INFO Stopped ServerConnector#64f4bfe4{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2018-04-04 17:16:17,967] INFO Stopped o.e.j.s.ServletContextHandler#2f06a90b{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
I have also tried increasing the memory with the suggestion here but I am unable to load the entire table into memory. Is there a way to limit the number of data produced?
For the JDBC Connector, the most important property you can probably apply would be this, which seems to be what you are asking for.
batch.max.rows
Maximum number of rows to include in a single batch when polling for new data. This setting can be used to limit the amount of data
buffered internally in the connector.
There is no need to "buffer the entire table into memory", With smaller batches, and more frequent polls and commits, you can ensure that almost all rows will be scanned, and you won't be at risk for a large batch failing, then the connector stopping for a period of time, then restarting and missing a few rows on the next poll.
Otherwise, make sure you aren't doing bulk table mode, as it'll try to scan the entire table again and again.
Also query option can do a column projection on the table.
You can find more configuration options in the documentation, but any OOM errors will need to be carefully examined on a case-by-case basis by enabling JMX monitoring and exporting these values into some aggregate system you can monitor more closely like Prometheus rather than just seeing the OOM error and not knowing if changing any particular parameter is really helping.
Another option would be to use CDC based connectors like another blog post shows

Map Reduce Completed but pig Job Failed

I recently came across this scenario where a MapReduce job seems to be successful in RM where as the PIG script returned with an exit code 8 which refers to "Throwable thrown (an unexpected exception)"
Added the script as requested:
REGISTER '$LIB_LOCATION/*.jar';
-- set number of reducers to 200
SET default_parallel $REDUCERS;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
-- load data from EAP data catalog using given ($ENV = PROD)
data = LOAD 'eap-$ENV://event'
-- using a custom function
USING com.XXXXXX.pig.DataDumpLoadFunc
('{"startDate": "$START_DATE", "endDate" : "$END_DATE", "timeType" : "$TIME_TYPE", "fileStreamType":"$FILESTREAM_TYPE", "attributes": { "all": "true" } }', '$MAPPING_XML_FILE_PATH');
-- filter out null context entity records
filtered = FILTER data BY (attributes#'context_id' IS NOT NULL);
-- group data by session id
session_groups = GROUP filtered BY attributes#'context_id';
-- flatten events
flattened_events = FOREACH session_groups GENERATE FLATTEN(filtered);
-- remove the output directory if exists
RMF $OUTPUT_PATH;
-- store results in specified output location
STORE flattened_events INTO '$OUTPUT_PATH' USING com.XXXX.data.catalog.pig.EventStoreFunc();
And I can see "ERROR 2998: Unhandled internal error. GC overhead limit exceeded" in the pig logs.(log below)
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.mapreduce.FileSystemCounter.values(FileSystemCounter.java:23)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:219)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:199)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:210)
at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:241)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:370)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:391)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:451)
at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:594)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:545)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:543)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:282)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
at org.apache.pig.PigServer.execute(PigServer.java:1405)
at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:624)
Configuration in the pig script looks like below:
SET default_parallel 200;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
Status of the Job in the RM of the Cluster says the job succeeded [can't post the image as my reputation is too low ;) ]
This issue occurs frequently and we have to restart the job the job successful.
Please let me know a fix for this.
PS: The cluster the job is running is one of the biggest in the world, so no worry with resources or the storage space I say.
Thanks
Can you add your pig script here?
I think, you get this error because the pig itself (not mappers and reducers) can't handle the output.
If you use DUMP operation it your script, then try to limit the displayed dataset first. Let's assume, you have a X alias for your data. Try:
temp = LIMIT X 1;
DUMP temp;
Thus, you will see only one record and save some resources. You can do a STORE operation as well (see in pig manual how to do it).
Obviously, you can configure pig's heap size to be bigger, but pig's heap size is not mapreduce.map or mapreduce.reduce. Use PIG_HEAPSIZE environment variable to do that.
From oracle docs:
After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so far the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit
As said in docs, you can turn this exception off or increase heap size.

Error in pig script while processing large file

I am trying to split a large file (15GB) into multiple small files based on a key column inside the file.The same code works fine if i run it on few 1000s of rows.
My code is as below.
REGISTER /home/auto/ssachi/piggybank-0.16.0.jar;
input_dt = LOAD '/user/ssachi/sywr_sls_ln_ofr_dtl/sywr_sls_ln_ofr_dtl.txt-10' USING PigStorage(',');
STORE input_dt into '/user/rahire/sywr_sls_ln_ofr_dtl_split' USING org.apache.pig.piggybank.storage.MultiStorage('/user/rahire/sywr_sls_ln_ofr_dtl_split','4','gz',',');
Error is as below
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 6015: During execution, encountered a Hadoop error.
HadoopVersion 2.6.0-cdh5.8.2
PigVersion 0.12.0-cdh5.8.2
I tried setting the below parameters assuming it is a memory issue, but it did not help.
SET mapreduce.map.memory.mb 16000;
SET mapreduce.map.java.opts 14400;
With the above parameters set, i got the below error.
Container exited with a non-zero exit code 1
org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1486048646102_2613_m_000066_3 Info:Exception from container-launch.
Whats the Cardinality of your " key column " is it in 1000?
If its in 1000 then you will get the error as your Mappers are dying because of OOME.
Do understand each Mapper now maintain 1000 file pointers and a associated buffer for each filePointer enough to occupy whole of your heap.
Can you please provide logs of your mappers for further investigation
Multioutput in MapReduce which is being called internally.
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/

Cypher query removing properties results in out-of-memory error in neo4j-shell

I have a large network of over 15 million nodes. I want to remove the property "CONTROL" from all of them using a Cypher query in the neo4-shell.
If I try and execute any of the following:
MATCH (n) WHERE has(n.`CONTROL`) REMOVE n.`CONTROL` RETURN COUNT(n);
MATCH (n) WHERE has(n.`CONTROL`) REMOVE n.`CONTROL`;
MATCH (n) REMOVE n.`CONTROL`;
the system returns:
Error occurred in server thread; nested exception is:
java.lang.OutOfMemoryError: Java heap space
Even the following query gives the OutOfMemoryError:
MATCH (n) REMOVE n.`CONTROL` RETURN n.`ID` LIMIT 10;
As a test, the following does execute properly:
MATCH (n) WHERE has(n.`CONTROL`) RETURN COUNT(n);
returning 16636351.
Some details:
The memory limit depends on the following settings:
wrapper.java.maxmemory (conf/neo4j-wrapper.conf)
neostore..._memory (conf/neo4j.properties)
By setting these values to total 28 GB in both files, results in a java_pidXXX.hprof file of about 45 GB (wrapper.java.additional=-XX:+HeapDumpOnOutOfMemoryError).
The only clue I could google was:
...you use the Neo4j-Shell which is just an ops tool and just collects the data in memory before sending back, it was never meant to handle huge result sets.
Is it really not possible to remove properties in large networks using the neo4j-shell and cypher? Or what am I doing wrong?
PS
Additional information:
Neo4j version: 2.1.3
Java versions: Java(TM) SE Runtime Environment (build 1.7.0_76-b13) and OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
The database is 7.4 GB (16636351 nodes, 14724489 relations)
The property "CONTROL" is empty, i.e., it has just been defined for all the nodes without actually assigning a property value.
An example of the exception from data/console.log:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid20541.hprof ...
Dump file is incomplete: file size limit
Exception in thread "GC-Monitor" Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.lang.StringCoding.safeTrim(StringCoding.java:79)
at java.lang.StringCoding.access$300(StringCoding.java:50)
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:305)
at java.lang.StringCoding.encode(StringCoding.java:344)
at java.lang.StringCoding.encode(StringCoding.java:387)
at java.lang.String.getBytes(String.java:956)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.convertToBytes(LayoutWrappingEncoder.java:122)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.doEncode(LayoutWrappingEncoder.java:135)
at ch.qos.logback.core.OutputStreamAppender.writeOut(OutputStreamAppender.java:194)
at ch.qos.logback.core.FileAppender.writeOut(FileAppender.java:209)
at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:219)
at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:103)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:88)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:273)
at ch.qos.logback.classic.Logger.callAppenders(Logger.java:260)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:442)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:396)
at ch.qos.logback.classic.Logger.warn(Logger.java:709)
at org.neo4j.kernel.logging.LogbackService$Slf4jToStringLoggerAdapter.warn(LogbackService.java:243)
at org.neo4j.kernel.impl.cache.MeasureDoNothing.run(MeasureDoNothing.java:84)
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Statistics Gatherer[primitives]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[10.65.4.212:42299]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[10.65.4.212:43614]" java.lang.OutOfMemoryError: Java heap space
see here: http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/
To update data with Cypher it is also necessary to take transaction size into account. For the embedded case, batching transactions is discussed in the next installment of this series. For the remote execution via the Neo4j REST API there are a few important things to remember. Especially with large index lookups and match results, it might happen that the query updates hundreds of thousands of elements. Then a paging mechanism using WITH and SKIP/LIMIT can be put in front of the updating operation.
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WITH a, count(*) AS cnt
SKIP {offset} LIMIT {pagesize}
SET a.movie_count = cnt
RETURN count(*)
Run with pagesize=20000 and increasing offset=0,20000,40000,… until the query returns a count < pagesize
So in your case, repeat this until it returns 0 rows. You can also increase the limit to 1M.
MATCH (n) WHERE has(n.`CONTROL`)
WITH n
LIMIT 100000
REMOVE n.`CONTROL`
RETURN COUNT(n);

Resources