LevelDB performance differences between Python and Java

I've been testing LevelDB data loading with Python and Java, and got a somewhat big surprise when a simple test showed much better and more consistent performance in Python than in Java.
With Java I tried:
org.iq80.leveldb version 0.9 (native Java implementation) => the preferred option
org.fusesource.leveldbjni version 1.8 (JNI bridge to the native C++ implementation)
With Python:
happynear/py-leveldb-windows
Inserting (just that) an RDF dataset of 1M triples, I got these results (the absolute values are unimportant, I guess; what matters is the difference):
Java native: 26 secs average (big variations in timing between the 1k insert batches)
Java JNI: blocked on most of the runs
Python: 6 seconds average (consistently, with almost no variation between 1k batches)
I thought it could be a memory problem (GC?) with Java, so I increased the memory, in steps, by a lot, to no avail.
Has anyone tested the LevelDB Java interfaces/implementations above and can give me some opinions/input?
I would rather use Java (and the native Java implementation if at all possible), since loading and preparing the dataset is much faster in Java, and it would be easier to touch the Java implementation if needed. But the insert performance and its unpredictability are a no-go.
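(For reference, a 1k batch insert with the iq80 API looks roughly like the sketch below; it is illustrative, with dummy keys/values standing in for the encoded triples, not my actual loading code.)

import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;
import static org.iq80.leveldb.impl.Iq80DBFactory.factory;

import java.io.File;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteBatch;

public class LevelDbBatchLoad {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.createIfMissing(true);
        DB db = factory.open(new File("triples-db"), options);
        try {
            for (int i = 0; i < 1_000_000; i += 1000) {
                WriteBatch batch = db.createWriteBatch();
                try {
                    for (int j = i; j < i + 1000; j++) {
                        // dummy key/value pairs standing in for the serialized triples
                        batch.put(bytes("spo:" + j), bytes("<s> <p> <o> ."));
                    }
                    long t0 = System.nanoTime();
                    db.write(batch); // non-synced write, as in the LevelDB default
                    System.out.printf("batch %d: %.1f ms%n", i / 1000, (System.nanoTime() - t0) / 1e6);
                } finally {
                    batch.close();
                }
            }
        } finally {
            db.close();
        }
    }
}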
Thanks in advance.

Related

Maximize Tensorflow Performance

I'm using TensorFlow 1.2 for image segmentation on an AWS p2 instance (Tesla K80). Is there an easy way for me to find out if I can improve the performance of my code?
Here is what I know:
I measured the execution time of the various parts of my program, and 99% of the time is spent calling session.run:
sess.run([train_op, loss, labels_modified, output_modified],
         feed_dict=feed_dict)
where feed_dict is a mapping from placeholders to tensors.
The session.run method only takes 0.43 seconds to execute for the following parameters: batch_size=1, image_height=512, image_width=512, channels=3.
The network has 14 convolutional layers (no dense layers) with a total of 11 million trainable parameters.
Because I'm doing segmentation I use a batch size of 1 and then compute the pixel-wise loss (512*512 cross entropy losses).
I tried compiling TensorFlow from source and got zero performance improvement.
I read through the performance guide (https://www.tensorflow.org/performance/performance_guide), but I don't want to spend a lot of time trying all of those suggestions. It already took me 8 hours to compile TensorFlow, and it gave me zero benefit!
How can I find out which parts of the session.run call take most of the time? I have a feeling that it might be the loss calculation.
And is there any clear study that shows how much speedup I can expect from the things mentioned in the performance guide?
You're performing a computationally intensive task that requires a lot of calculations and a lot of memory. Your model has a lot of parameters, and each one has to be computed in the forward pass, in the backward pass, and in the update step.
The suggestions on the page you linked are sound, and if you have followed them all there is not much else you can do, except spinning up one or more additional instances and running the training in parallel. That gives you roughly an Nx speed-up (where N is the number of instances computing the gradients for your input batch), but it is extremely expensive and not always applicable; moreover, it requires changing your code to follow a client-server architecture for the gradient computation and weight updates.
Based on your small piece of code, I see you're using a feed dictionary. Generally it's best to avoid feed dictionaries if queues can be used (see https://github.com/tensorflow/tensorflow/issues/2919). The TensorFlow documentation covers the use of queues here. Switching to queues will definitely improve your performance.
Maybe you can run your code with tfprof to do some profiling to find out where the bottleneck is.
Just a guess, but the performance problem may be caused by feeding data. I don't know how you prepare your feed_dict, but if you have to read your data from disk to build the feed_dict for every sess.run, it will slow the program down, because reading data and training happen synchronously. You can try converting your data to TFRecords and making data loading and training asynchronous by using tf.FIFOQueue.

Performance drop and strange GC behaviour after one hour in JBoss 5.0.1GA

We upgraded our software from JBoss 4.0.5GA to 5.0.1GA and noticed that after one hour or so (or 90 minutes in some cases) performance drops dramatically.
At the same moment, the garbage collector logs show minor garbage collection times jumping from 0.01s to ~1.5s, with the amount of heap being cleared each time dropping from ~400MB to ~300MB (see GC Viewer graph 1).
We think these are both symptoms of the same underlying root cause.
The JVM settings are:
-server -Xms2048m -Xmx2048m -XX:NewSize=384m -XX:MaxNewSize=384m
-XX:SurvivorRatio=4 -XX:MinHeapFreeRatio=11 -XX:PermSize=80m -verbose:gc
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+DisableExplicitGC
-Djava.awt.headless=TRUE -DUseSunHttpHandler=TRUE
-Dsun.net.client.defaultConnectTimeout=25000
-Dsun.net.client.defaultReadTimeout=50000 -Dfile.encoding=UTF-8
-Dvzzv.log.dir=${ercorebatch.log.dir} -Xloggc:${ercorebatch.log.dir}/gc.log
-Duser.language=it -Duser.region=IT -Duser.country=IT -DVFjavaWL=er.core.it
The production environment is T5220 or T2000 hardware, with 32-bit SPARC, running a Solaris 10 virtual machine, with JBoss 5.0.1.GA and Java 1.6.0_17.
We set up a test environment consisting of two identical boxes, running the same software but one using JBoss 4.0.5GA and one using JBoss 5.0.1GA. They are VMware VMs running on an HP ProLiant DL560 Gen8 with 4 x 2.2GHz Intel Xeon E5-4620 CPUs and 64GB RAM. Guest VMs are 4 vCPU, 4096MB RAM, CentOS 6.4.
We found that we could easily reproduce the problem in this environment. The box running 4.0.5 ran fine, but on JBoss 5.0.1GA we saw the same strange GC behaviour. Performance can't easily be tested in our environment, since we don't have the same amount of load as production.
We don't think it's a memory leak, since after each major GC the used heap size returns to the same level.
Analysing heap dumps taken pre- and post-apocalypse, we discovered that the number of instances of the following class was different:
org.jboss.virtual.plugins.context.file.FileSystemContext
During the first hour there are about 8 of them; after the apocalypse hits, we see between 100 and 800.
Other than that, the heap dumps look quite similar, and the top objects are either Java or JBoss objects (i.e. no application classes).
Setting -Djboss.vfs.forceVfsJar=true on our test environment fixed the problem (i.e. the strange GC behaviour disappeared) but when applied in production, both the strange GC pattern and the performance problem remained - although the GC times did not increase so much (to 0.3 seconds rather than 1.5 seconds).
In our test environment, we then deployed the same software in jboss 5.1.0 and found the same behaviour as with 5.0.1.
So the conclusions at this point are that there is something happening in jboss 5.x around the 60 / 90 minute mark which has an impact on both garbage collection and performance.
UPDATE:
We tried upgrading the web services stack to jbossws-native-3.3.1, which fixed the problem in our test environment. However, when deploying to the next test environment (closer to the production environment), the problem was still there (albeit reduced).
UPDATE:
We have resolved this by setting jboss.vfs.cache.TimedPolicyCaching.lifetime to a very large number equivalent to many years.
This feels like a workaround for a bug in JBoss. The default cache lifetime is 30 minutes (see org.jboss.util.TimedCachePolicy), and we saw problems after either 60 or 90 minutes.
The VFS cache implementation is CombinedVFSCache and I think it's using a TimedVFSCache underneath.
It seems like a better fix would be to change the cache implementation to a permanent cache, but we've wasted enough time on this problem and our workaround will have to do.
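For reference, the workaround is just one more -D flag appended to the JVM settings listed above (the value is illustrative; assuming the lifetime is expressed in seconds, this is roughly ten years):
-Djboss.vfs.cache.TimedPolicyCaching.lifetime=315360000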
It is hard to determine the root cause of this problem just by looking at the GC graphs. What do the thread stacks look like when this happens? Are there any hyperactive threads? Are there any nasty threads creating a huge pile of objects, forcing the garbage collector to work like crazy to get rid of them? I think more analysis must be done to determine the root cause of the problem.

What makes OpenCV so large on Windows? Anything I can do about it?

The OpenCV x64 distribution (through emgucv) for Windows has almost half a gigabyte of DLLs, including a single 224MB opencv_gpu.dll. It seems unlikely that any human could have produced that amount of code, so what gives? Large embedded resources? Code-generation bloat (this doesn't seem likely, given that it's a native C/C++ project)?
I want to use it for face recognition, but it's a problem to have such a large binary dependency in git, and it's a hassle to manage outside of source control.
[Update]
There are no embedded resources (at least not the kind Windows DLLs usually have, but since this is a cross-platform product, I'm not sure that's significant). Maybe lots of initialized C table structures to perform matrix operations?
The size of opencv_gpu is the result of numerous template instantiations compiled for several CUDA architecture versions.
For example for convolution:
7 data types (from CV_8U to CV_64F)
~30 hardcoded sizes of convolution kernel
8 CUDA architectures (bin: 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 + ptx: 2.0 3.0)
This produces about 1700 variants of the convolution (7 × 30 × 8 = 1680).
This way opencv_gpu can grow up to 1 GB for the latest OpenCV release.
If you are not going to use any CUDA acceleration, then you can safely drop opencv_gpu.dll.

Clojure vs Node.js RAM consumption

Recently I've been stumbling upon a lot of benchmarks between Node.js and Clojure, such as this, and this, and this. It seems to me that, compared to languages like Ruby, both Node.js and Clojure are about equally fast (which means a lot faster).
The question is, how does Clojure compare to Node.js in terms of RAM consumption? Say that I was about to write a simple live chat app.
If I were to compare Rails vs Node.js, I could basically expect Node.js to be 100 times faster and to take 10 times less memory than Rails ... but how does Clojure fit in here?
How would Clojure compare here in terms of memory consumption? Can I expect it to take a lot more memory than Node.js, because it is running on the JVM? Or is this just a stereotype that isn't true anymore?
For a simple application on modern hardware, you should have no memory usage issues with either Node.js or Clojure.
Of course, as Niklas points out it will ultimately depend on what frameworks you use and how well written your app is.
Clojure has quite a significant base memory requirement (because the Java runtime environment / JVM is pretty large), but I've found it to be pretty memory-efficient beyond that point - Clojure objects are just Java objects under the hood, so that probably shouldn't be too surprising.
It's also worth noting that directly measuring the memory usage of a JVM app is usually misleading, since the JVM typically pre-allocates more memory than it needs and only garbage collects in a lazy (as needed) fashion. So while the apparent total memory usage looks high, the actual working set can be quite small (which is what you really care about for performance purposes).
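As a rough illustration of that last point (a plain Java sketch, since a Clojure process is just a JVM process; the actual numbers will vary with heap settings and workload):

public class HeapSnapshot {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long reserved = rt.totalMemory();               // heap the JVM has claimed from the OS
        long used = rt.totalMemory() - rt.freeMemory(); // heap occupied by objects (live + garbage)
        System.gc();                                    // a hint only, but usually shrinks 'used' toward the live working set
        long afterGc = rt.totalMemory() - rt.freeMemory();
        System.out.printf("reserved: %d MB, used: %d MB, used after GC: %d MB%n",
                reserved >> 20, used >> 20, afterGc >> 20);
    }
}

The footprint reported by the OS tracks the reserved figure, while the post-GC figure is much closer to the working set that actually matters for performance.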

What do I need to offset the performance setback induced by use of the Spring framework?

I am using Spring with Hibernate to create an enterprise application.
Now, due to the abstractions the framework adds on top of the underlying J2EE architecture, there is obviously going to be a runtime performance hit on my app.
What I need to know is the set of factors I should consider in deciding the minimum specs (processor speed, RAM, etc.) for a single host server running Red Hat Linux 3+ and devoted to this application only, such that it would produce an efficiency score of, say, 8 out of 10, given a simultaneous-access user base growing by 100 per month.
No clustering is to be used.
No offense, but I'd bet that performance issues are more likely to be due to your application code than Spring.
If you look at the way they've written their source code, you'll see that they pay a great deal of attention to quality.
The only way to know is to profile your app, see where the time is being spent, analyze to determine root cause, correct it, rinse, repeat. That's science. Anything else is guessing.
I've used Spring in a production app that's run without a hitch for three years and counting. No memory leaks, no lost connections, no server bounces, no performance issues. It just runs like butter.
I seriously doubt that using Spring will significantly affect your performance.
What particular aspects of Spring are you expecting to cause performance issues?
There are so many variables here that the only answer is to "suck it and see", but in a scientific manner.
You need to build a server and then benchmark this. Start off with some "commodity" setup, say a 4-core CPU and 2 GB of RAM, then run a benchmark script to see if it meets your needs (which it most likely will!).
If it doesn't, you should be able to calculate the required server size from the numbers you get out of the benchmark -- or -- fix the performance problem so it runs on the hardware you've got.
The important thing is to identify what is limiting your performance. Is your server using all the cores, or are your processes stuck on a single core? Is your JVM getting enough memory? Are you I/O bound or database bound?
Once you know the limiting factors, it's pretty easy to work out the solution -- either improve the efficiency of your programs or buy more of the right hardware.
Two things to watch out for with J2EE: most JVMs have default heap sizes from the last decade, so make sure your JVM has enough heap and stack (at least 1G each!); and it takes time for all the JIT compilation, object caching, module loading, etc. to settle down, so exercise your system for at least an hour before you start benchmarking.
As a toolkit, I don't see Spring itself affecting performance after initialization, but I think Hibernate will. How big that effect is depends on a lot of details, like the DB schema, how much the relational layout differs from the OO layer, and of course how DB access is organized and how often it happens. So I doubt there is a rule of thumb for this. Just try it out by developing significant prototypes using alternative application servers, or try your own small no-ORM, plain-JDBC version.
I've never heard that Spring creates any kind of runtime performance hit. Since it uses mainly POJOs, I'd be surprised if there were something wrong with it. Other than parsing a lot of XML on startup, maybe, but that's solved by using annotations.
Just write your app first and then tune accordingly.
Spring is typically used to create long-lived objects shortly after the application starts. There is virtually no performance cost over the life of the process.
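A minimal sketch of what that means in practice (the class names here are made up, and annotation-based configuration is assumed): the reflection and wiring cost is paid once when the context starts, and afterwards retrieving a bean is essentially a lookup of an object that already exists.

import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

public class StartupCostDemo {

    @Configuration
    static class AppConfig {
        @Bean
        GreetingService greetingService() {
            // invoked once, at context startup (singleton is the default scope)
            return new GreetingService();
        }
    }

    static class GreetingService {
        String greet(String name) { return "Hello, " + name; }
    }

    public static void main(String[] args) {
        // all the configuration parsing, reflection and wiring cost is paid here, once
        AnnotationConfigApplicationContext ctx =
                new AnnotationConfigApplicationContext(AppConfig.class);

        // after startup this just returns the same long-lived instance; no per-call framework overhead
        GreetingService svc = ctx.getBean(GreetingService.class);
        System.out.println(svc.greet("world"));

        ctx.close();
    }
}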
Which performance setback? In relation to what?
Did you measure the performance before using the framework?
If the Spring framework causes unacceptable performance issues, the obvious solution is not to use it.
