How to configure Lucene (SOLR) internal caching - memory issue/leak?

I am using SOLR 4.4.0 and found a (possible) issue related to the internal caching mechanism.
JVM: -Xmx15g, but 12 GB was never freed.
I created a heap dump and analyzed it using Memory Analyzer - I found 2 x 6 GB used as cache data.
The second time I did the same with -Xmx12g - I found 1 x 3.5 GB.
It was always the same cache.
I checked the source code and found:
/** Expert: The cache used internally by sorting and range query classes. */
public static FieldCache DEFAULT = new FieldCacheImpl();
see http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.4.0/org/apache/lucene/search/FieldCache.java#FieldCache.0DEFAULT
This is very bad news because it is a public static field and it is used in about 160 places in the source code.
Memory Analyzer says:
One instance of "org.apache.lucene.search.FieldCacheImpl" loaded by
"org.apache.catalina.loader.WebappClassLoader # 0x58c3a9848" occupies
4,103,248,240 (80.37%) bytes. The memory is accumulated in one
instance of "java.util.HashMap$Entry[]" loaded by "".
Keywords java.util.HashMap$Entry[]
org.apache.catalina.loader.WebappClassLoader # 0x58c3a9848
org.apache.lucene.search.FieldCacheImpl
I do not know how to manage this kind of cache - any advice?
Eventually I get an OutOfMemoryError, and 12 GB of memory stays blocked.

I implemented a kind of workaround:
I created this class:
public class InternalApplicationCacheManager implements InternalApplicationCacheManagerMBean {
    public synchronized int getInternalCacheSize() {
        return FieldCache.DEFAULT.getCacheEntries().length;
    }

    public synchronized void purgeInternalCaches() {
        FieldCache.DEFAULT.purgeAllCaches();
    }
}
and registered it with JMX from org.apache.lucene.search.FieldCacheImpl:
...
private synchronized void init() {
    ...
    initBeans();
}

private void initBeans() {
    try {
        InternalApplicationCacheManager cacheManagerMBean = new InternalApplicationCacheManager();
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("org.apache.lucene.search.jmx:type=InternalApplicationCacheManager");
        mbs.registerMBean(cacheManagerMBean, name);
    } catch (InstanceAlreadyExistsException e) {
        ...
    }
}
...
This solution lets you invalidate the internal caches, which partially solves the issue.
Unfortunately, there are other places (mostly caches) where data is stored and not removed as quickly as I expected.

If you use FieldCacheRangeFilter you may want to try range filters that work without the field cache. If sorting is an issue, you may try using fewer sort fields, or fields with a data type that uses less memory.
The field cache for each reader/atomic reader is thrown away when the reader is garbage collected, so a re-initialization of the reader should clear the cache - which also means that the first operation using the cache will be a lot slower.
The fact is: FieldCache-based range filters and sorting rely on the cache. There is no getting around it when you really need those. You can only adapt your usage to minimize memory consumption.
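For illustration, a minimal sketch of that reader re-initialization (assuming a Lucene 4.x DirectoryReader held in a field named reader; the names are illustrative, not from the question):

import org.apache.lucene.index.DirectoryReader;

// FieldCache entries are keyed per reader, so once the old reader is closed
// and garbage collected, its cache entries become reclaimable.
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {
    reader.close();     // old FieldCache entries can now be collected
    reader = newReader; // the first sort/range query repopulates the cache (slowly)
}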

Related

JAVA 8 Extract predicates as fields or methods?

What is the cleaner way of extracting predicates that will have multiple uses: methods or class fields?
The two examples:
1. Class field:

void someMethod() {
    IntStream.range(1, 100)
        .filter(isOverFifty)
        .forEach(System.out::println);
}

private IntPredicate isOverFifty = number -> number > 50;

2. Method:

void someMethod() {
    IntStream.range(1, 100)
        .filter(isOverFifty())
        .forEach(System.out::println);
}

private IntPredicate isOverFifty() {
    return number -> number > 50;
}
For me, the field way looks a little bit nicer, but is this the right way? I have my doubts.
Generally you cache things that are expensive to create, and these stateless lambdas are not. A stateless lambda will have a single instance created for the entire pipeline (under the current implementation). The first invocation is the most expensive one - the underlying Predicate implementation class will be created and linked; but this happens only once for both stateless and stateful lambdas.
A stateful lambda will use a different instance for each evaluation, and it might make sense to cache those; but your example is stateless, so I would not.
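For contrast, a sketch of a stateful (capturing) lambda - threshold here is a made-up local variable:

int threshold = 50;
// Capturing lambda: it closes over 'threshold', so each evaluation of this
// expression produces a fresh IntPredicate instance under the current implementation.
IntPredicate isOverThreshold = number -> number > threshold;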
If you still want that (for readability purposes, I assume), I would put it in a class called, let's say, Predicates. It would be re-usable across different classes as well, something like this:
public final class Predicates {

    private Predicates() {
    }

    public static IntPredicate isOverFifty() {
        return number -> number > 50;
    }
}
You should also notice that using Predicates.isOverFifty inside a Stream and x -> x > 50, while semantically the same, will have different memory usage.
In the first case, only a single instance (and class) will be created and served to all callers; the second (x -> x > 50) will create not only a different instance, but also a different class for each of its call sites (think of the same expression used in different places in your application). This happens because linkage happens per CallSite - and in the second case each CallSite is different.
But that is something you should not rely on (and probably not even consider) - these objects and classes are fast to build and fast to remove by the GC; whatever fits your needs, use that.
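A quick way to observe this (a sketch that depends on exactly the kind of current-implementation behavior the previous paragraph warns against relying on):

IntPredicate p1 = x -> x > 50;
IntPredicate p2 = x -> x > 50;
// Two textually identical lambdas come from different CallSites:
System.out.println(p1 == p2);                       // false
System.out.println(p1.getClass() == p2.getClass()); // false under the current implementation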
To answer: it helps if you expand those lambda expressions into old-fashioned Java. You can then see that these are the two forms we have always used in our code. So the answer is: it all depends on how you write a particular code segment.
private IntPredicate isOverFifty = new IntPredicate() {
    @Override
    public boolean test(int number) {
        return number > 50;
    }
};

private IntPredicate isOverFifty() {
    return new IntPredicate() {
        @Override
        public boolean test(int number) {
            return number > 50;
        }
    };
}
1) In the field case, you will always allocate the predicate for each new object of yours. Not a big deal if you have a few instances, like a service; but if this is a value object of which there can be N instances, it is not a good solution. Also keep in mind that someMethod() may never be called at all. One possible solution is to make the predicate a static field, as sketched below.
2) In the method case, the predicate is created anew on every someMethod() call; afterwards the GC will discard it.
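A sketch of that static-field variant, so the predicate is allocated once per class rather than once per instance:

private static final IntPredicate IS_OVER_FIFTY = number -> number > 50;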

lock-free synchronization, fences and memory order (store operation with acquire semantics)

I am migrating a project that ran on bare metal to Linux, and need to eliminate some {disable,enable}_scheduler calls. :)
So I need a lock-free sync solution for a single-writer, multiple-readers scenario, where the writer thread cannot be blocked. I came up with the following solution, which does not fit the usual acquire-release ordering:
class RWSync {
    std::atomic<int> version; // incremented after every modification
    std::atomic_bool invalid; // true during write
public:
    RWSync() : version(0), invalid(false) {}

    template<typename F> void sync(F lambda) {
        int currentVersion;
        do {
            do { // wait until the object is valid
                currentVersion = version.load(std::memory_order_acquire);
            } while (invalid.load(std::memory_order_acquire));
            lambda();
            std::atomic_thread_fence(std::memory_order_seq_cst);
            // check if something changed
        } while (version.load(std::memory_order_acquire) != currentVersion
                 || invalid.load(std::memory_order_acquire));
    }

    void beginWrite() {
        invalid.store(true, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }

    void endWrite() {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        version.fetch_add(1, std::memory_order_release);
        invalid.store(false, std::memory_order_release);
    }
};
I hope the intent is clear: I wrap the modification of a (non-atomic) payload between beginWrite/endWrite, and read the payload only inside the lambda function passed to sync().
As you can see, I have an atomic store in beginWrite() where no writes after the store operation may be reordered before it. I did not find suitable examples, and I am not experienced in this field at all, so I'd like some confirmation that it is OK (verification through testing is not easy either).
Is this code race-free, and does it work as I expect?
If I use std::memory_order_seq_cst in every atomic operation, can I omit the fences? (Even if yes, I guess the performance would be worse.)
Can I drop the fence in endWrite()?
Can I use memory_order_acq_rel in the fences? I don't really get the difference - the single total order concept is not clear to me.
Is there any simplification / optimization opportunity?
+1. I'll happily accept any better idea for the name of this class :)
The code is basically correct.
Instead of having two atomic variables (version and invalid) you may use a single version variable with the semantic "odd values are invalid". This is known as the "sequential lock" (seqlock) mechanism.
Reducing the number of atomic variables simplifies things a lot:
class RWSync {
    // Incremented before and after every modification.
    // Odd values mean that the object is in an invalid state.
    std::atomic<int> version;
public:
    RWSync() : version(0) {}

    template<typename F> void sync(F lambda) {
        int currentVersion;
        do {
            currentVersion = version.load(std::memory_order_seq_cst);
            // This may reduce calls to lambda(), nothing more
            if (currentVersion | 1) continue;
            lambda();
            // Repeat until something changed or object is in an invalid state.
        } while ((currentVersion | 1)
                 || version.load(std::memory_order_seq_cst) != currentVersion);
    }

    void beginWrite() {
        // Writer may read version with relaxed memory order
        int currentVersion = version.load(std::memory_order_relaxed);
        // Invalidation requires sequential order
        version.store(currentVersion + 1, std::memory_order_seq_cst);
    }

    void endWrite() {
        // Writer may read version with relaxed memory order
        int currentVersion = version.load(std::memory_order_relaxed);
        // Release order is sufficient for marking an object as valid
        version.store(currentVersion + 1, std::memory_order_release);
    }
};
Note the difference in memory orders in beginWrite() and endWrite():
endWrite() makes sure that all previous modifications of the object have been completed. Release memory order is sufficient for that.
beginWrite() makes sure that a reader will detect that the object is in an invalid state before any further modification of the object is started. Such a guarantee requires seq_cst memory order. Because of that, the reader uses seq_cst memory order too.
As for the fences, it is better to incorporate them into the previous/following atomic operation: the compiler knows how to make the result fast.
Explanations of some modifications to the original code:
1) An atomic read-modify-write like fetch_add() is intended for cases when concurrent modifications (like another fetch_add()) are possible. For correctness, such modifications use memory locking or other very time-costly architecture-specific mechanisms.
An atomic assignment (store()) does not use memory locking, so it is cheaper than fetch_add(). You may use such an assignment because concurrent modifications are not possible in your case (the reader does not modify version).
2) Unlike release-acquire semantics, which differentiate load and store operations, sequential consistency (memory_order_seq_cst) is applicable to every atomic access and provides a total order between these accesses.
The accepted answer is not correct. I guess the code should be something like currentVersion & 1 instead of currentVersion | 1. A subtler mistake is that a reader thread can enter lambda(), and after that the writer thread can run beginWrite() and write a value to the non-atomic variable. In this situation the write to the payload and the read from the payload have no happens-before relationship; concurrent access (without a happens-before relationship) to a non-atomic variable is a data race. Note that the single total order of memory_order_seq_cst does not imply a happens-before relationship; the two are consistent, but they are different things.
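For readers who prefer Java, a minimal sketch of the same seqlock idea with the & 1 check applied (an illustration on top of this thread, not code from either answer; the names are made up, and the data-race caveat from the comment above applies to the payload reads here too):

import java.util.concurrent.atomic.AtomicInteger;

class SeqLock {
    private final AtomicInteger version = new AtomicInteger(0); // odd = writer active

    // Single writer: bump to odd before mutating the payload, back to even after.
    void beginWrite() { version.incrementAndGet(); }
    void endWrite()   { version.incrementAndGet(); }

    // Readers retry until they observe the same even version before and after reading.
    void read(Runnable readPayload) {
        int v;
        do {
            do {
                v = version.get();      // volatile read
            } while ((v & 1) != 0);     // writer in progress: spin
            readPayload.run();
        } while (version.get() != v);   // version moved mid-read: retry
    }
}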

Fetch 1M records in orientdb: why is it 6x slower than bare SQL+MySQL

For some graph algorithm I need to fetch a lot of records from a database into memory (~1M records). I want this to be done fast and I want the records to be objects (that is: I want ORM). To crudely benchmark different solutions, I created a simple problem of one table with 1M Foo objects, like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?
One can see that fetching them using bare SQL is extremely fast; converting the records to objects using a simple for-loop is also fast. Both execute in around 2-3 seconds. However, using ORMs like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
SQLAlchemy credits itself with a "Mature, High Performing Architecture" (http://www.sqlalchemy.org/features.html), and similarly Hibernate with "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow very generic object-oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than plain SQL and native code. Personally I think they could publish better benchmarks to show this, that is, benchmarks comparing against native SQL + Java or Python. But that is not the problem at hand.
Of course I don't want SQL + native code, as it is hard to maintain. So I was wondering why there does not exist something like an object-oriented database which handles the database-to-object mapping natively. Someone suggested OrientDB, hence I tried it. The API is quite nice: when you have your getters and setters right, the object is insertable and selectable.
But I want more than just API-sweetness, so I tried the 1M example:
import java.io.Serializable;

public class Foo implements Serializable {
    public int a, b, c;

    public Foo() {}
    public Foo(int a, int b, int c) { this.a = a; this.b = b; this.c = c; }

    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;

public class Main {
    public static void insert() throws Exception {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        int N = 1000000;
        long time = System.currentTimeMillis();
        for (int i = 0; i < N; i++) {
            Foo foo = new Foo(i, i * i, i + i * i);
            db.save(foo);
        }
        db.close();
        System.out.println(System.currentTimeMillis() - time);
    }

    public static void fetch() {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        long time = System.currentTimeMillis();
        for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
            if (f.getA() == 345234) System.out.println(f.getB());
        }
        System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
        db.close();
    }

    public static void main(String[] args) throws Exception {
        //insert();
        fetch();
    }
}
Fetching 1M Foos using OrientDB takes approximately 18 seconds. The for-loop with getA() is there to force the object fields to actually be loaded into memory, as I noticed that by default they are fetched lazily. I guess this may also be why fetching the Foos is slow: there is a database access on each iteration instead of a single access that fetches everything (including the fields) up front.
I tried to fix that using setFetchPlan("*:-1"), figuring it might also apply to fields, but that did not seem to work.
Question: Is there a way to do this fast, preferably in the 2-3 second range? Why does this take 18 seconds, whilst the bare SQL version takes 3 seconds?
Addition: Using an ODatabaseDocumentTx as @frens-jan-rumph suggested gave me a speedup not of approximately 5x, but only of approximately 2x. Adjusting the code as follows gave me a running time of approximately 9 seconds. That is still 3 times slower than raw SQL, even though no conversion to Foo objects was performed. Almost all the time goes into the for-loop.
public static void fetch() {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
    long time = System.currentTimeMillis();
    ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
    it.setFetchPlan("*:0");
    System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    time = System.currentTimeMillis();
    for (ODocument f : it) {
        //if ((int) f.field("a") == 345234) System.out.println(f.field("b"));
    }
    System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    db.close();
}
The answer lies in convenience.
During an interview, when I asked a candidate what they thought of LINQ (C#, I know, but pertinent to your question), they quite rightly answered that it sacrifices performance for convenience.
A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query into nice, easy-to-use POCOs.
That said, the difference should not be as great as you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that much. I do have experience here; within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.
With large swaths of data, I would expect an initial slow-down from an ORM, but then it would be negligible. 3 seconds to 18 seconds is huge.
If you profile your test, you will discover that around 60-80% of the CPU time is taken by the execution of the following four methods:
com.orientechnologies...OObjectEntitySerializer.getField(...)
com.orientechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
com.orientechnologies...OObjectMethodFilter.isScalaClass(...)
javassist...SecurityActions.getDeclaredMethods(...)
So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x, which might just get you where you want to be.
Still, a lot of time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That's expensive for just reading an integer from a memory location. I don't understand why plain java.nio ByteBuffers are not used here; that would save a lot of crossing of the Java/native border, etc.
Apart from these micro-benchmarks and the remarkable behaviour in OrientDB, I think there are at least two other things to consider:
Does this test reflect your expected workload?
I.e. you read a straightforward list of records. If so, why use a database at all? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
Does this test reflect your expected setup?
E.g. you are reading from a plocal database, while reading from any database over TCP/IP might just as well have its bottleneck somewhere else. Also, you are reading from one thread/process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more bookkeeping overhead, etc.).
P.S. I would recommend warming up the code before benchmarking.
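As a sketch of what such a warm-up could look like (a rough manual approach around the fetch() method above, not a substitute for a proper harness like JMH):

// Run the workload a few times untimed so the JIT compiles the hot paths,
// then time the real run.
for (int i = 0; i < 5; i++) {
    fetch(); // warm-up iterations, results discarded
}
long start = System.currentTimeMillis();
fetch(); // measured run
System.out.println("Measured: " + (System.currentTimeMillis() - start) + " ms");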
What you do here is a worst-case scenario. As you wrote (or should have written), for your database this test is just reading a table and writing it directly to a stream of whatever.
So what you see is the complete overhead of a lot of magic. Usually, if you do something more complex like joining, selecting, filtering and ordering, the overhead of your ORM comes down to a more reasonable share of 5 to 10%.
Another thing you should think about - I guess Orient does the same - is that the ORM solution creates new objects, multiplying memory consumption, and Java is really bad at memory consumption. That is why I use custom in-memory tables whenever I handle a lot of data/objects - tables where you know which object corresponds to which row.
Also, your objects get inserted into a list/map as well (at least Hibernate does this), which tracks the dirtiness of the objects once you change them. This insertion also takes a lot of time as it scales up, and is a reason why we use paginated lists or maps; copying 1M references is dead slow once the backing array grows.

Why is Document.html() so slow?

I was under the impression that the most costly method in Jsoup's API is parse().
But I just discovered that Document.html() could be even slower.
Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.
Why is Document.html() so slow?
Answering myself. The Element.html() method is implemented as:
public String html() {
    StringBuilder accum = new StringBuilder();
    html(accum);
    return accum.toString().trim();
}
Using a StringBuilder instead of String concatenation is already a good thing, and the StringBuilder.toString() and String.trim() calls may not explain the slowness of Document.html(), even for a relatively large document.
But in the middle, our method calls an overloaded version, Element.html(StringBuilder), which loops through all child nodes in the document:
private void html(StringBuilder accum) {
    for (Node node : childNodes)
        node.outerHtml(accum);
}
Thus if the document contains lots of child nodes, it will be slow.
It would be interesting to see whether there could be a faster implementation of this.
For example, Jsoup could store a cached version of the raw HTML that was provided to it via Jsoup.parse() - as an option, of course, to maintain backward compatibility and a small memory footprint.
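As an illustration of that caching idea outside of Jsoup itself, a minimal wrapper sketch (CachedDocument is a made-up name, and the cache is only valid as long as the document is not mutated after parsing):

import org.jsoup.nodes.Document;

class CachedDocument {
    private final Document doc;
    private String cachedHtml; // must be discarded if the document is mutated

    CachedDocument(Document doc) { this.doc = doc; }

    String html() {
        if (cachedHtml == null) {
            cachedHtml = doc.html(); // pay the full serialization cost only once
        }
        return cachedHtml;
    }
}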

hello world example for ehcache?

Ehcache is a hugely configurable beast, and the examples are fairly complex, often involving many layers of interfaces.
Has anyone come across the simplest possible example, one which just caches something like a single number in memory (not distributed, no XML, just as few lines of Java as possible)? The number would be cached for, say, 60 seconds, after which the next read request causes a new value to be fetched (e.g. by calling Random.nextInt() or similar).
Is it quicker/easier to write our own cache for something like this with a singleton and a bit of synchronization?
No Spring, please.
EhCache comes with a failsafe configuration that has a reasonable expiration time (120 seconds). This is sufficient to get it up and running.
Imports:
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
Then, creating a cache is pretty simple:
CacheManager.getInstance().addCache("test");
This creates a cache called test. You can have many different, separate caches all managed by the same CacheManager. Adding (key, value) pairs to this cache is as simple as:
CacheManager.getInstance().getCache("test").put(new Element(key, value));
Retrieving a value for a given key is as simple as:
Element elt = CacheManager.getInstance().getCache("test").get(key);
return (elt == null ? null : elt.getObjectValue());
If you attempt to access an element after the default 120 second expiration period, the cache will return null (hence the check to see if elt is null). You can adjust the expiration period by creating your own ehcache.xml file - the documentation for that is decent on the ehcache site.
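If you want the 60-second expiration from the question without any XML, a programmatic configuration is also possible. A sketch, assuming the ehcache 2.x Cache constructor (the element count is an illustrative value):

import net.sf.ehcache.Cache;

// name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
Cache sixtySecondCache = new Cache("test", 1000, false, false, 60, 60);
CacheManager.getInstance().addCache(sixtySecondCache);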
A working implementation of jbrookover's answer:
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EHCacheDemo {
    public static final void main(String[] igno_red) {
        CacheManager cchm = CacheManager.getInstance();

        // Create a cache
        cchm.addCache("test");

        // Add key-value pairs
        Cache cch = cchm.getCache("test");
        cch.put(new Element("tarzan", "Jane"));
        cch.put(new Element("kermit", "Piggy"));

        // Retrieve a value for a given key
        Element elt = cch.get("tarzan");
        String sPartner = (elt == null ? null : elt.getObjectValue().toString());
        System.out.println(sPartner); // Outputs "Jane"

        // Required or the application will hang
        cchm.removeAllCaches(); // alternatively: cchm.shutdown();
    }
}
