Stanford NER - Heap space error - stanford-nlp

I am using Stanford NER in my web application, with english.muc.7class.distsim.crf.ser.gz (16 MB) as the classifier. When I deploy and run my application, I get a heap space (Out of Memory) error while loading the classifier.
I have tried keeping only the necessary code and checked that the code is not creating too many objects and occupying space, but with no success.
Is it because of the size of the classifier? I want to keep using this classifier, so what should I do?
I have increased the heap size locally using the VM options in Tomcat, but increasing the JVM heap size on the actual server where I will host my application doesn't seem like the right way either.
Can anyone guide me about this?

Yes, you basically shouldn't worry much about the size of the code, since memory use is dominated by the size of the data loaded.
Model data: the classifier models just take a lot of space. It seems you need a heap of about 140 MB to load the current (2012) version of english.muc.7class.distsim.crf.ser.gz. The models are just a lot of Strings and doubles, but there's a big increase over the size on disk because the on-disk data is compressed, String objects in Java each take, as is well known, a large amount of space, and they're linked via a HashMap, which takes more space again. The String data alone seems to end up taking about 72 MB in memory (36 MB of char[] data, 36 MB of String objects).
Data to be analyzed: this depends on how you're calling it and may not be a problem in your case with Tomcat, but if NER is run on a file, it will read the whole file into memory before classifying. So you can reduce memory use by giving it multiple smaller units (files, Strings, or whatever) to classify.
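As an illustration of that last point, here is a minimal sketch; the model path and input file name are placeholders, and it assumes the standard CRFClassifier API from the Stanford NER distribution. It loads the model once and then classifies the input line by line rather than as one big file:

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.io.BufferedReader;
import java.io.FileReader;

public class ChunkedNer {
    public static void main(String[] args) throws Exception {
        // Load the model once; this is where the bulk of the heap goes.
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("english.muc.7class.distsim.crf.ser.gz");

        // Feed the classifier one line at a time instead of one huge file,
        // so only a small chunk of the input text is in memory at any moment.
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(classifier.classifyToString(line));
            }
        }
    }
}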
Also, you're much more likely to get prompt help on questions like this with the tag stanford-nlp.

I agree with Christopher's suggestions: you shouldn't worry about the size.
But for robust performance, load the classifier only once at server startup, via a static method or some listener, and keep that instance alive for the lifetime of the application; then use the same instance for all further annotation.
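For example, a minimal sketch of the listener approach (assuming a Servlet 3.0 container; the attribute name and model path are placeholders) could look like this:

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

// Loads the classifier once when the webapp starts and shares the single instance
// through a ServletContext attribute, instead of each request loading its own copy.
@WebListener
public class NerClassifierListener implements ServletContextListener {
    @Override
    public void contextInitialized(ServletContextEvent sce) {
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifierNoExceptions("english.muc.7class.distsim.crf.ser.gz");
        sce.getServletContext().setAttribute("nerClassifier", classifier);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        sce.getServletContext().removeAttribute("nerClassifier");
    }
}

Servlets can then fetch the shared instance with getServletContext().getAttribute("nerClassifier") rather than loading the model again.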

Related

Spring Data JPA Meta JpaMetamodelMappingContext Memory Consumption

My Spring Data JPA/Hibernate application consumes over 2 GB of memory at startup without a single user hitting it. I am using Hazelcast as the second-level cache, but I had the same issue when I used ehCache as well, so that is probably not the cause of the issue.
I ran a profile with a heap dump in VisualVM, and I see that the bulk of the memory is being consumed by JpaMetamodelMappingContext and, secondarily, a ton of Map objects. I just need help deciphering what I am seeing and whether this is actually a problem. I do have a hundred classes in the model, so this may be normal, but I have no point of reference; it just seems a bit excessive.
Once I get a load of 100 concurrent users, my memory consumption increases to 6-7 GB. That is quite normal for the amount of data I push around and cache, but I feel like if I could reduce the initial memory, I'd have a lot more room for growth.
I don't think you have a problem here.
Instead, I think you are misinterpreting the data you are looking at.
Note that the heap space diagram displays two numbers: Heap size and Used heap.
Heap size (orange) is the amount of memory available to the JVM for the heap.
This means it is the amount that the JVM requested at some point from the OS.
Used heap is the part of the Heap size that is actually used.
Ignoring the startup phase, it repeatedly grows linearly and then drops over time.
This is typical behavior of an idling application.
Some part of the application generates a moderate amount of garbage (rising part of the curve) which from time to time gets collected.
The low points of that curve are the amount of memory you are actually really using.
It seems to be about 250 MB, which doesn't sound like very much to me, especially since you say that a total consumption of 6-7 GB when actually working sounds reasonable to you.
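If you want to see the same two numbers from inside the application rather than in VisualVM, a small sketch using the standard Runtime API shows the distinction:

public class HeapSnapshot {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapSize = rt.totalMemory();            // what the JVM has currently claimed from the OS
        long usedHeap = heapSize - rt.freeMemory();  // the part actually occupied by objects
        long maxHeap  = rt.maxMemory();              // the -Xmx ceiling
        System.out.printf("heap size: %d MB, used: %d MB, max: %d MB%n",
                heapSize >> 20, usedHeap >> 20, maxHeap >> 20);
    }
}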
Some other observations:
Both CPU load and heap usage grow quickly and fluctuate a lot at startup.
This is to be expected, because the analysis of repositories and entities happens at that time.
JpaMetamodelMappingContext's retained size is about 23 MB.
Again, a good chunk of memory, but not that huge.
This includes the stuff it references, which is almost exclusively metadata from the JPA implementation as you can easily see when you take a look at its source.

Ignite uses more memory than expected

I am using Ignite to build a framework for data calculation. One big problem is that the memory usage is a little more than expected: data that uses 1 GB of memory outside Ignite uses more than 1.5 GB in the Ignite cache.
I have already turned off backups and copyOnRead. I don't use the query feature, so there is no extra index space. I also accounted for the extra space used for each cache and cache entry, but the total memory usage still doesn't add up.
The data value for each cache entry is a big map containing lists of primitive arrays. Each entry is about 120 MB.
What can be the problem? The data structure or the configuration?
Ignite does introduce some overhead to your data, and half a gigabyte doesn't sound too bad to me. I would recommend referring to this guide for more details: https://apacheignite.readme.io/docs/capacity-planning
The difference between expected and real memory usage arises from two main points:
Each entry carries a constant overhead consisting of objects that provide support for processing entries in a distributed computing environment.
E.g. you can declare an integer local variable; it takes 4 bytes on the stack, but it's hard to make that variable long-lived and accessible from other places in the program. So you have to create a new Integer object, which consumes at least 16 bytes (300% overhead, isn't it?). Going further, if you want to make this object mutable and safely accessible by multiple threads, you have to create a new AtomicReference and store your object inside it. Total memory consumption will be at least 32 bytes... and so on (see the sketch after these two points). Every time we extend an object's functionality, we get additional overhead; there is no other way.
Each entry is stored inside a cache in a special serialized format, so the actual memory footprint of an entry depends on the format used. By default, Ignite uses BinaryMarshaller to convert an object to a byte array, and this array is stored inside a BinaryObject.
The reason is simple: distributed computing systems continuously exchange entries between nodes, and every entry in the cache should be ready to be transferred as a byte array.
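As a rough illustration of the first point, here is a small sketch using the JOL tool; it assumes the org.openjdk.jol:jol-core dependency is on the classpath (JOL is a separate OpenJDK tool, not part of Ignite). It prints the actual layout of a boxed Integer and of an AtomicReference wrapping one:

import java.util.concurrent.atomic.AtomicReference;
import org.openjdk.jol.info.ClassLayout;

public class OverheadDemo {
    public static void main(String[] args) {
        // Object header plus the 4-byte int field, padded to the JVM's alignment.
        System.out.println(ClassLayout.parseInstance(Integer.valueOf(42)).toPrintable());
        // Another header plus a reference field, on top of the Integer it points to.
        System.out.println(ClassLayout.parseInstance(new AtomicReference<>(42)).toPrintable());
    }
}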
Please read the article; it was recently updated. You could estimate the per-entry overhead for small entries by hand, but for big entries you should inspect the actual entry stored in the cache as a byte array. Look at the withKeepBinary method.
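A minimal sketch of the withKeepBinary approach (the cache name and value class here are made up, and a node is started with the default configuration) that reads an entry back in its serialized BinaryObject form instead of deserializing it:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;

public class BinaryInspection {
    static class Payload {                      // hypothetical value type
        int[] numbers = new int[1024];
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, Payload> cache = ignite.getOrCreateCache("data");
            cache.put(1, new Payload());

            // Same cache, but values come back as BinaryObject (the serialized form)
            // rather than being deserialized into Payload instances.
            IgniteCache<Integer, BinaryObject> binaryCache = cache.withKeepBinary();
            BinaryObject entry = binaryCache.get(1);
            System.out.println(entry.type().typeName());
            System.out.println(entry.type().fieldNames());
        }
    }
}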

JVM memory tuning for eXist

Suppose you had a server with 24 GB of RAM at your disposal: how much memory would you allocate to (the Tomcat instance running) eXist?
I'm setting up our new webserver, with an Intel Xeon E5649 (2.53GHz) processor, running Ubuntu 12.04 64-bit. eXist is running as a webapp inside Tomcat, and the db is only used for querying 'stable' collections --that is, no updates are being executed to the resources inside eXist.
I've been experimenting with different heap sizes (via -Xms and -Xmx settings when starting the Tomcat process), and so far haven't noticed much difference in response time for queries against eXist. In other words, it doesn't seem to matter much whether the JVM is allocated 4G or 16G. I have also upped the #cacheSize and #collectionCache in eXist's WEB-INF/conf.xml file to e.g. 8192M, but this doesn't seem to have much effect. I suppose these settings /do/ have an influence when eXist is running inside Tomcat?
I know each situation is different (and I know there's a Tomcat server involved), but are there some rules of thumb for eXist performance w.r.t. the memory it is allocated? I'd like to get at a sensible memory configuration for a setup with a larger amount of RAM available.
This question was asked and answered on the exist-open mailing list. The answer from wolfgang#exist-db.org was:
Giving more memory to eXist will not necessarily improve response times. "Bad" queries may consume lots of RAM, but the better your queries are optimized, the less RAM they need: most of the heavy processing will be done using index lookups, and the optimizer will try to reduce the size of the node sets to be passed around. Caching memory thus has to be large enough to hold the most relevant index pages. If this is already the case, increasing the caching space will not improve performance any more. On the other hand, a cacheSize or collectionCache that is too small will result in a recognizable bottleneck. For example, a batch upload of resources or creating a backup can take several hours (instead of, e.g., minutes) if #collectionCache is too small.
If most of your queries are optimized to use indexes, 8 GB of RAM for eXist usually gives you enough room to handle the occasional high load. Ideally you could run some load tests to see what the maximum memory use actually is. For #cacheSize, I rarely have to go beyond 512m. The setting for #collectionCache depends on the number of collections and documents in the database. If you have tens or hundreds of thousands of collections, you may have to increase it up to 768m or more. As I said above, you will recognize a sudden breakdown in performance during uploads or backups if the collectionCache becomes too small.
So to summarize, a reasonable setting for me would be: -Xmx8192m, #cacheSize="512m", #collectionCache="768m". If you can afford giving 16 GB of main memory it certainly won't hurt. Also, if you are using the Lucene index or the new range index, you should consider increasing the #buffer setting in the corresponding index module configurations in conf.xml as well:
<module id="lucene-index" buffer="256" class="org.exist.indexing.lucene.LuceneIndex" />
<module id="range-index" buffer="256" class="org.exist.indexing.range.RangeIndex"/>

Gradle heap error on uploadArchives

I am trying to upload an archive that's 600MB in size.
I get this error:
Execution failed for task ':uploadArchives'.
> Java heap space
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
I have tried setting the GRADLE_OPTS, JVM_OPTS, and MAVEN_OPTS variables to raise the maximum heap size, for example:
export GRADLE_OPTS=-Xmx1024m
gradle uploadArchives
But I am still getting the same error.
What am I missing here?
Ultimately you always have a finite maximum amount of heap to use, no matter what platform you are running on. On 32-bit Windows this is around 2 GB (not specifically heap, but the total amount of memory per process). Java just happens to make the default smaller (presumably so that the programmer can't create programs that have runaway memory allocation without running into this problem and having to examine exactly what they are doing).
Given this, there are several approaches you could take: either determine how much memory you actually need, or reduce the amount of memory you are using. One common mistake with garbage-collected languages such as Java or C# is to keep around references to objects that you are no longer using, or to allocate many objects when you could reuse them instead. As long as objects have a reference to them they will continue to use heap space, because the garbage collector will not delete them.
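As a minimal illustration of that retention point (a generic Java example, not taken from the question's build), the difference between accidentally keeping references alive and letting the collector do its work looks like this:

import java.util.ArrayList;
import java.util.List;

public class RetentionExample {
    // Everything added here stays reachable forever, so the GC can never reclaim it.
    private static final List<byte[]> KEPT = new ArrayList<>();

    static void leaky(byte[] chunk) {
        KEPT.add(chunk);   // unintentional retention: heap use grows without bound
    }

    static void fine(byte[] chunk) {
        int sum = 0;
        for (byte b : chunk) {
            sum += b;      // use the data locally...
        }
        System.out.println(sum);
        // ...and once this method returns, nothing references chunk,
        // so the garbage collector is free to reclaim it.
    }
}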
In this case you can use a Java memory profiler to determine which methods in your program are allocating large numbers of objects, and then determine whether there is a way to make sure they are no longer referenced, or to not allocate them in the first place. One option which I have used in the past is "JMP": http://www.khelekore.org/jmp/.
If you determine that you are allocating these objects for a reason and you need to keep around references (depending on what you are doing this might be the case), you will just need to increase the max heap size when you start the program. However, once you do the memory profiling and understand how your objects are getting allocated you should have a better idea about how much memory you need.
In general if you can't guarantee that your program will run in some finite amount of memory (perhaps depending on input size) you will always run into this problem. Only after exhausting all of this will you need to look into caching objects out to disk etc. At this point you should have a very good reason to say "I need Xgb of memory" for something and you can't work around it by improving your algorithms or memory allocation patterns. Generally this will only usually be the case for algorithms operating on large datasets (like a database or some scientific analysis program) and then techniques like caching and memory mapped IO become useful.
Run Java with the command-line option -Xmx, which sets the maximum size of the heap.
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html#nonstandard
The problem was that the actual size of the package was a lot higher, because Gradle was not able to handle the symlinks.
When I handled the symlinks manually, the problem went away.
The JVM heap size can be set in the gradle.properties file, in the root directory of your gradle project. Like this:
org.gradle.jvmargs=-Xms256m -Xmx1024m

Use 3GB Free Space to Access 30 GB info without Virtual Memory Paging?

I have a quick question:
How can we use 3 GB of free space to access roughly 30 GB of data without virtual memory or compression? It's more of a data structure question.
Thanks
You should somehow mimic the paging mechanism.
One way to do it is hashing (1).
Hash all your data into bins, and store these bins on disk. In main memory (RAM) you hold only an array of pointers to disk locations. Once you need an address, you find where it lives on disk by going to the RAM array and taking the pointer stored at index hash(address).
You can of course optimize this by keeping a portion of the data in memory (using the principle of locality) and hoping for a hit, so you avoid reloading a chunk from disk.
(1) The hashing does not have to be complex or uniformly distributed. I believe using the most significant bits of the address will be just fine, and it will actually mimic the paging mechanism better.
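A minimal sketch of that idea in Java (the page size, number of cached bins, and read-only data file are arbitrary choices for illustration): the most significant bits of the address select a bin stored in a file, and a small LRU map keeps the most recently used bins in RAM.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

public class DiskBackedStore {
    private static final int PAGE_SIZE = 1 << 20;       // 1 MB per bin
    private static final int PAGES_IN_RAM = 1024;       // roughly 1 GB of cached bins
    private final RandomAccessFile file;
    private final Map<Long, byte[]> cache =
            new LinkedHashMap<Long, byte[]>(PAGES_IN_RAM, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                    return size() > PAGES_IN_RAM;        // evict the least recently used bin
                }
            };

    public DiskBackedStore(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
    }

    public byte read(long address) throws IOException {
        long page = address / PAGE_SIZE;                 // the "hash": the address's top bits
        byte[] bin = cache.get(page);
        if (bin == null) {                               // miss: load the bin from disk
            bin = new byte[PAGE_SIZE];
            file.seek(page * PAGE_SIZE);
            file.readFully(bin);
            cache.put(page, bin);
        }
        return bin[(int) (address % PAGE_SIZE)];
    }
}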
The most obvious way would be through a typical filesystem API with read, write, and seek functions.
