Hadoop Error: Java heap space - hadoop

So, after the job has run for a percent or so, I get an error that says "Error: Java heap space", followed by something along the lines of "Application container killed".
I am literally running an empty map and reduce job. However, the job does take an input of roughly 100 GB. For whatever reason, I run out of heap space, although the job does nothing.
I am using the default configuration on a single machine. It is running Hadoop 2.2 on Ubuntu. The machine has 4 GB of RAM.
Thanks!
//Note
Got it figured out.
Turns out I was setting the configuration to use a different terminating token/string. The format of the data had changed, so that token/string no longer existed in the input. As a result, the job was trying to load all 100 GB into RAM as the value for a single key.
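To illustrate the failure mode, here is a minimal plain-Python sketch of a delimiter-based record reader. It is not Hadoop's reader; the function name read_records and the tiny sample data are made up for illustration. When the delimiter never occurs in the input, the buffer keeps growing until it holds the entire input as one record.

```python
import io


def read_records(stream, delimiter: bytes, chunk_size: int = 4):
    """Yield records separated by `delimiter`, reading `stream` in small chunks."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while delimiter in buffer:
            record, buffer = buffer.split(delimiter, 1)
            yield record
    if buffer:
        # Whatever is left when the input ends becomes the final record.
        yield buffer


data = b"a=1\nb=2\nc=3\n"

# Delimiter present: three small records, the buffer never grows large.
good = list(read_records(io.BytesIO(data), b"\n"))

# Delimiter absent from the data: the buffer accumulates the whole input
# and the reader produces a single record as large as the input itself.
bad = list(read_records(io.BytesIO(data), b"|"))

print(len(good), len(bad), len(bad[0]) == len(data))
```

With a 100 GB input and 4 GB of RAM, that single buffered record is what exhausts the heap, even if the map and reduce functions do nothing.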

Related

How to rebuild an elasticsearch index for lots of data without it getting "Killed" after 15 hours or so

I have about 130 million articles in my Postgres database on AWS. I am trying to index them with elasticsearch. In a screen, I entered:
python manage.py search_index --rebuild -f --parallel --model [APP NAME].[MODEL NAME]
Everything began correctly. The output was
Deleting index '[MODEL NAME]'
Creating index '[MODEL NAME]'
Indexing 129413202 'MODEL NAME' objects (parallel)
But after about 15 hours, the output was "Killed". I was running this on a t2.xlarge EC2 instance, which has 16 GB of memory. Interestingly, the "Killed" message appeared after I saw that the connection to the AWS server had broken, but that shouldn't matter if the process was running in a screen. Any idea what the issue is? Do I just need to get an even larger EC2 instance?
A process unexpectedly exiting with the message Killed often means it received a SIGKILL; if so, the exit code would be 137. It's hard to be certain here, since a process can obviously print Killed and exit with code 137 on its own, but assuming your code doesn't do that, this is what I'd check next.
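The 137 comes from the shell convention of reporting death by signal as 128 plus the signal number, and SIGKILL is signal 9. A small sketch of this, assuming a POSIX system: the child below kills itself with SIGKILL as a stand-in for a process killed from outside.

```python
import signal
import subprocess
import sys

# Child process that sends SIGKILL to itself, standing in for a process
# killed from outside (for example by the kernel's OOM killer).
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
result = subprocess.run([sys.executable, "-c", child_code])

# Python's subprocess module reports death by signal as a negative signal number.
killed_by = -result.returncode if result.returncode < 0 else None

# A shell reports the same event as 128 + signal number.
shell_exit_code = 128 + int(signal.SIGKILL)

print("killed by signal:", killed_by)
print("exit code a shell would show:", shell_exit_code)
```

Note that the exit code only tells you the process received SIGKILL, not who sent it; the kernel log is still needed to confirm it was the OOM killer.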
An unexpected SIGKILL often comes from the kernel's OOM killer which takes action when the system runs out of memory and typically kills the process with the largest memory footprint. If so it will have logged details in the kernel logs that you can read with dmesg.
If it was the OOM killer then this sounds like a bug in this indexing code. Indexing a large body of documents into Elasticsearch should require pretty limited working memory, nowhere near 16GB, but it's easy to accidentally keep too much data in memory for too long which would lead to excessive memory usage.
python manage.py search_index suggests you're using the Django Elasticsearch DSL which fixed a performance issue relatively recently. Make sure you're using a version that contains this fix.
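To show what "limited working memory" means for the indexing point above, here is a minimal sketch of batch-wise processing over a lazy source. It is not the Django Elasticsearch DSL implementation; fake_documents, index_all and the send callback are hypothetical stand-ins for a database cursor and a bulk request.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` items without materializing `items`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_documents(count: int) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy database cursor."""
    for i in range(count):
        yield {"id": i, "title": f"article {i}"}


def index_all(documents: Iterable[dict], batch_size: int,
              send: Callable[[List[dict]], None]) -> int:
    """Send documents batch by batch; only one batch is held at a time."""
    total = 0
    for batch in batched(documents, batch_size):
        send(batch)          # hypothetical callback, e.g. one bulk request
        total += len(batch)  # the batch becomes garbage after this iteration
    return total


sizes = []
indexed = index_all(fake_documents(10), batch_size=4,
                    send=lambda batch: sizes.append(len(batch)))
print(indexed, sizes)
```

Memory use here is bounded by the batch size rather than by the total number of documents. The accidental alternative is building a list of all documents (or all primary keys) before sending anything, which scales with the 130 million rows.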

In SPL TEDA 4.2, is there a limit on the number of input file types that can be included?

I am working on TEDA v2.0.1 and SPL v4.2. When I try to add more than 18 different input file types, the job compiles successfully, but at runtime it goes to status 'no' without any error in the logs.
I have faced this issue while developing multiple applications.
Do you see PE trace files with a size of 0 bytes?
Is the status "no" the job status?
How much memory does ZooKeeper have? If jobs are getting bigger, then ZooKeeper might become the bottleneck; increasing the heap size for ZooKeeper helps in this case.

Hadoop EMR job runs out of memory before RecordReader initialized

I'm trying to figure out what could be causing my EMR job to run out of memory before it has even started processing my file inputs. I'm getting a "java.lang.OutOfMemoryError cannot be cast to java.lang.Exception" error before my RecordReader is even initialized (that is, before it has even tried to unzip the files and process them). I am running my job on a directory with a large number of inputs. I am able to run my job just fine on a smaller input set. Does anyone have any ideas?
I realized that the answer is that there was too much metadata overhead on the master node. The master node must store roughly 150 bytes of metadata for each file that will be processed. With millions of files, this can add up to gigabytes of data, which was too much and caused the master node to crash.
Here's a good source for more information: http://www.inquidia.com/news-and-info/working-small-files-hadoop-part-1#sthash.YOtxmQvh.dpuf
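As a back-of-the-envelope check on the small-files overhead, here is a sketch of the arithmetic. The figure of about 150 bytes per namespace object is the commonly cited rule of thumb, and counting two objects per small file (one file entry plus one block) is an additional assumption; treat both as placeholders to replace with your own numbers.

```python
def estimate_metadata_bytes(num_files: int, bytes_per_object: int,
                            objects_per_file: int) -> int:
    """Rough metadata footprint: files x objects per file x bytes per object."""
    return num_files * objects_per_file * bytes_per_object


# Assumed inputs (not measured): 10 million small files, 150 bytes per
# object, and 2 objects per file (one file entry plus one block).
estimate = estimate_metadata_bytes(10_000_000, bytes_per_object=150,
                                   objects_per_file=2)
print(f"{estimate / 1e9:.1f} GB")
```

The estimate grows linearly with the number of files, regardless of how small each file is, which is why combining many small inputs into fewer large ones reduces the overhead.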

Solr ate all Memory and throws -bash: cannot create temp file for here-document: No space left on device on Server

Solr had been running for a long time, approximately 2 weeks, when I saw that it was using around 22 GB of my server's 28 GB of RAM.
While checking the status of Solr using bin/solr -i, it throws -bash: cannot create temp file for here-document: No space left on device
I stopped Solr and restarted it. It is working fine now.
What is the actual problem? I don't understand it.
And what is the solution?
I never want Solr to stop or halt while it is running.
First, you should check the free space on your file system, for example using df -h. Post the output here.
Is there any mount point without free space?
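If you prefer to check free space from a script rather than with df -h, a minimal sketch using Python's standard library looks like this. The function name free_space_report and the default path "/" are illustrative; point it at the mount point that holds the Solr data and at the one that holds /tmp.

```python
import shutil


def free_space_report(path: str = "/") -> dict:
    """Return total and free space (GiB) and percent used for the file system at `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    gib = 1024 ** 3
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "free_gib": usage.free / gib,
        "percent_used": percent_used,
    }


report = free_space_report("/")
print(f"{report['path']}: {report['free_gib']:.1f} GiB free, "
      f"{report['percent_used']:.0f}% used")
```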
Second, find out why there is no space left. Your question covers two different things: no space left on the file system and high RAM usage.
Solr stores two different kinds of data: the search index and the data itself.
Storing the data is only needed if you want to output the documents after finding them in the index, for example if you want to use highlighting. So take a look at your schema.xml and decide for every single field whether it must be stored or whether "indexing" the field is enough for your needs. Use the stored=true parameter for that.
Next, if you rebuild the index, keep in mind that you need double the space on disk during the rebuild.
You could also think about moving your index/data files to another disk.
Once you have solved your "free space" problem on disk, you probably won't have a RAM issue any more.
If there is still a RAM problem, please post your Java start parameters. There you can define how much RAM is available to Solr. Solr needs a lot of virtual memory, but only a moderate amount of physical RAM.
And you could post the output of your log file.

I/O performance of multiple JVM (Windows 7 affected, Linux works)

I have a program that creates a file of about 50MB size. During the process the program frequently rewrites sections of the file and forces the changes to disk (in the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others (launched from the first) write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (on the order of 100 times the normal cost). That is, all write operations are equally slow; it is not just one that hangs for a long time.
Interestingly, one way to help this is to insert delays (2 seconds) between the launching of JVMs. Without the delay, writing is always slow; with the delay, writing is slow about every second time or so.
I also found this Stack Overflow question: How to unmap a file from memory mapped using FileChannel in java? It describes a problem with mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installs some expensive concurrency handler for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program as a JUnit test harness from Eclipse or from the console works fine; it takes around 3 seconds.
Starting the program through an ant task (or through JUnit by kicking off a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (a factor of 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof from 34.1% (76+20 tics) to 97.3% (3587+2913+751 tics):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem: all test machines have at least 8 GB, and the process uses <200 MB.
Rebooting the machine does not help.
Switching off virus scanners and similar software has no effect.
When the process is slow, there is virtually no CPU usage.
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean runtimeMxBean = ManagementFactory.getRuntimeMXBean(); List<String> arguments = runtimeMxBean.getInputArguments();
I tested it on two machines with Windows 7 64-bit, Java 7u2, Java 6u26 and JRockit. The hardware of the machines differs, but the results are very similar.
I tested it also from outside Eclipse (command-line ant) but no difference there.
The whole program is written by myself; all it does is read from and write to this file. No other libraries are used, especially no native libraries.
And some scary facts that I just refuse to believe make any sense:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!) such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK-jre and the JRE-jre work fine at least once. Overinstalling a JVM does not help. Neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)?
Are you using local disks for all testing (as opposed to any network share)?
Can you set up Windows with a RAM drive to store the data? When a JVM terminates, its file handles will have been closed by default, but what you might be seeing is the flushing of the data to disk. When you overwrite lots of data, the previous version of the data is discarded and may not cause disk I/O. The act of closing the file might make the Windows kernel implicitly flush data to disk. Using a RAM drive would let you confirm that, since disk I/O time is removed from your stats.
Find a tool for Windows that lets you force the kernel to flush all buffers to disk, use it between JVM runs, and see how long that takes each time.
But I would guess you are hitting some interaction between the demands of the process and the demands of the kernel as it tries to manage the disk block buffer cache. On Linux, a tool like "/sbin/blockdev --flushbufs" can do this.
FWIW
"pwrite" is a Linux/Unix API that allows concurrent writing to a file descriptor (which would be the best kernel syscall API for the JVM to use; I think the Win32 API already has provision for the same kind of usage to share a file handle between threads in a process, but since Sun has a Unix heritage, things get named the Unix way). Google "pwrite(2)" for more info on this API.
"force" is, I would guess, a file system sync, meaning the process is requesting that the kernel flush unwritten data (currently in the disk block buffer cache) into the file on disk (such as would be needed before you turn your computer off). This will happen automatically over time, but transactional systems need to know when the data previously written (with pwrite) has actually hit the physical disk and is stored, because other disk I/O depends on knowing that, such as transactional checkpointing.
One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.
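For the pwrite0 and force0 calls asked about above, here is a small sketch of the underlying POSIX operations as exposed by Python's os module. It assumes a Unix-like system (os.pwrite is not available on Windows) and uses fsync to stand in for the sync that the answer guesses force performs; it does not reproduce the JVM's internals or the Windows behavior in the question.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"0123456789")

    # pwrite writes at an explicit offset and leaves the file position alone,
    # which is what makes it safe for several writers sharing a descriptor.
    position_before = os.lseek(fd, 0, os.SEEK_CUR)
    os.pwrite(fd, b"AB", 4)
    position_after = os.lseek(fd, 0, os.SEEK_CUR)

    # fsync asks the kernel to push buffered data for this file to the device
    # and returns only once the kernel reports that this is done.
    os.fsync(fd)

    contents = os.pread(fd, 10, 0)
finally:
    os.close(fd)
    os.remove(path)

print(position_before, position_after, contents)
```

The write itself normally only touches the kernel's buffer cache and is cheap; the sync is the call that has to wait for the disk, so frequent rewrite-and-sync cycles are sensitive to how the operating system manages that cache.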
