POS Tagging too slow - using OpenNLP

I am just playing around with Part-of-speech Tagging, and started using OpenNLP.
I am using the following code to load the model (Java):
import java.io.FileInputStream;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.*;

// Loading the 35 MB parser model is the expensive step
m_modelFile = new FileInputStream("c:\\DATA\\en-parser-chunking.bin");
m_model = new ParserModel(m_modelFile);
m_parser = ParserFactory.create(m_model);
...
Parse topParses[] = ParserTool.parseLine(sentence, m_parser, 1);
I am noticing that the call to create the ParserModel object is insanely slow. That could be because en-parser-chunking.bin is 35 MB in size. Is there a better way to use this so that it's not this slow? Alternatively, is there a POS tagger you recommend, or a way of calling the API that's faster?
I've been playing around with the accuracy, and it's pretty good. But, I am not happy with the performance when loading the model...
Thanks guys.

If you are looking for a fast Java (or Python) POS tagger, you might consider using RDRPOSTagger. RDRPOSTagger is a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. It is fast in both training and tagging; for example, in Java the tagging speed is about 90K English words/second on a Core2Duo 2.4 GHz machine. It also achieves very competitive accuracy compared to state-of-the-art results. See the experimental results, including tagging speed and accuracy on 13 languages, in this paper.
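For reference, if you stay with OpenNLP and only need part-of-speech tags rather than full parse trees, the dedicated POS model (distributed as en-pos-maxent.bin and only a few MB) loads and runs far faster than the 35 MB chunking parser model. A minimal sketch, assuming that model file has been downloaded next to the parser model:

import java.io.FileInputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTagDemo {
    public static void main(String[] args) throws Exception {
        // Load the small POS model once and reuse the tagger for every sentence
        try (FileInputStream in = new FileInputStream("c:\\DATA\\en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("This is a test sentence");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}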

Related

How does a computer reproduce the SIFT paper method on its own in deep learning

Let me begin by saying that I am struggling to understand what is going on in deep learning. From what I gather, it is an approach that tries to have a computer engineer different layers of representations and features to enable it to learn things on its own. SIFT seems to be a common way to detect things by tagging and hunting for scale-invariant elements in some representation. Again, I am completely stupid and in awe and wonder about how this magic is achieved. How does one have a computer do this by itself? I have looked at this paper https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf and I must say at this point I think it is magic. Can somebody help me distill the main points of how this works and why a computer can do it on its own?
SIFT and CNNs are both methods for extracting features from images, but they work in different ways and produce different kinds of output.
SIFT/SURF/ORB and similar feature extraction algorithms are "hand-made": independent of the particular real-world case, they aim to extract some predefined notion of meaningful features. This approach has advantages and disadvantages.
Advantages:
You don't have to worry about input image conditions, and you probably don't need any pre-processing step to extract the features.
You can take an existing SIFT implementation and integrate it directly into your application (a short sketch follows after these lists).
With GPU-based implementations (e.g. GPU-SIFT), you can achieve high extraction speed.
Disadvantages:
They are limited in the features they can find; you will have trouble getting features from fairly plain, low-texture surfaces.
SIFT/SURF/ORB cannot solve every problem that requires feature classification or matching. Think of face recognition: do you think extracting and classifying SIFT features over a face is enough to recognize people?
Being hand-made techniques, they do not improve over time (unless, of course, a better technique is introduced).
Developing such a feature extraction technique requires a lot of research work.
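To make the "take an existing implementation and integrate it" point concrete, here is a minimal sketch using OpenCV's Java bindings (the choice of OpenCV, the use of ORB as a freely available stand-in for SIFT, and the file name are all just assumptions for illustration):

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.ORB;
import org.opencv.imgcodecs.Imgcodecs;

public class HandCraftedFeatures {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);   // load the native OpenCV library

        Mat image = Imgcodecs.imread("scene.jpg", Imgcodecs.IMREAD_GRAYSCALE); // placeholder file name
        ORB orb = ORB.create();                         // hand-crafted detector + descriptor

        MatOfKeyPoint keypoints = new MatOfKeyPoint();
        Mat descriptors = new Mat();
        orb.detectAndCompute(image, new Mat(), keypoints, descriptors);

        // Each row of 'descriptors' is a fixed, hand-designed feature vector for one keypoint;
        // nothing here was learned from data.
        System.out.println("keypoints: " + keypoints.toList().size());
    }
}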
On the other hand, with deep learning you can start analyzing much more complex features, features that would be impossible for a human to design by hand. CNNs are today the natural approach for analyzing hierarchical filter responses and the more complex features created by combining those responses (going deeper).
The main power of CNNs comes from not extracting features by hand: we only define how the machine should look for features. Of course this method has its pros and cons too.
Advantages:
More data, better results! Everything depends on data: if you have enough data to cover your case, deep learning outperforms hand-made feature extraction techniques.
Once you have extracted features from an image, you can use them for many purposes: segmenting the image, generating descriptive words, detecting objects inside the image, recognizing them. Better still, all of these can be obtained in one shot rather than through a complex sequential pipeline.
Disadvantages:
You need data. Probably a lot.
These days it is better to use supervised or reinforcement learning methods, as unsupervised learning is still not good enough.
It takes time and resources to train a good neural network. A complex architecture like Google's Inception took two weeks to train on an 8-GPU server rack. Of course, not all networks are that hard to train.
There is a learning curve. You don't have to know how SIFT works to use it in your application, but you do have to know how CNNs work to use them for your own purposes.
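For intuition about the "filter responses" mentioned above: at the lowest level a CNN layer is just a convolution, an operation you can write by hand in a few lines; the difference is that a CNN learns the filter weights from data instead of having them fixed in advance. A toy, dependency-free sketch of one hand-coded 3x3 filter pass over a grayscale image array (purely illustrative, not a CNN):

public class ToyFilter {
    // Convolve a grayscale image with a 3x3 kernel (borders skipped for brevity).
    static double[][] convolve(double[][] img, double[][] k) {
        int h = img.length, w = img[0].length;
        double[][] out = new double[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                double sum = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        sum += img[y + dy][x + dx] * k[dy + 1][dx + 1];
                out[y][x] = sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A hand-chosen vertical edge detector; a CNN would learn these nine numbers instead.
        double[][] kernel = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };
        double[][] image = new double[8][8];
        image[3][4] = 1.0;                       // a single bright pixel
        double[][] response = convolve(image, kernel);
        System.out.println(response[3][3] + " " + response[3][5]);
    }
}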

Computationally intensive algorithms

I am planning to write a bunch of programs on computationally intensive algorithms. The programs would serve as an indicator of different compiler/hardware performance.
I would want to pick a common set of algorithms used in different fields, like bioinformatics, gaming, image processing, et al. The reason I want to do this is to learn the algorithms and have a personal mini benchmark suite that is small, useful and easy to maintain.
Any advice on algorithm selection would be tremendously helpful.
Benchmarks are not worthy of your attention!
The very best guide to processor performance is: http://www.agner.org/optimize/
And then someone will chuck it in a box with 3 GB of RAM, defeating any dual-channel hopes, and your beautifully tuned benchmark will give wildly different results again.
If you have a piece of performance critical code and you are sure you've picked the winning algorithm you can't then go use a generic benchmark to determine the best compiler. You have to actually compile your specific piece of code with each compiler and benchmark them with that. And the results you gain, whilst useful for you, will not extrapolate to others.
Case in point: people who make compression software - like zip and 7zip and the high-end stuff like PPMs and context-mixing and things - are very careful about performance and benchmark their programs. They hang out on www.encode.ru
And the situation is this: for engineers developing the same basic algorithm - say LZ, or entropy coding like arithmetic coding and Huffman - each finds that a different compiler works best.
That is to say, two engineers solving the same problem with the same high-level algorithm will each benchmark their implementation and have results recommending different compilers...
(I have seen the same thing repeat itself repeatedly in competition programming e.g. Al Zimmermann's Programming Contests which is an equally performance-attentive community.)
(The newer GCC 4.x series is very good all round, but that's just my data-point, others still favour ICC)
(Platform benchmarks for IO-related tasks are another thing altogether; people don't appreciate how differently Linux, Windows and FreeBSD (and the rest) perform when under stress. Benchmarks there - same workload on the same machine, different machines, or different core counts - would be generally informative. Sadly there aren't enough benchmarks like that around.)
There was some work done at Berkeley about this a few years ago. They identified 13 common application patterns for parallel programming, the "13 Dwarves". They include things like linear algebra, n-body models, FFTs, etc.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
See page 10 onwards.
There are some sample .NET implementations here:
http://paralleldwarfs.codeplex.com/
The typical one is the Fast Fourier Transform; perhaps you can also do something like the Lucas–Lehmer primality test.
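For the Lucas–Lehmer suggestion, the core loop is only a few lines but scales as heavily as you like, since for large exponents it repeatedly squares numbers with thousands or millions of bits. A minimal sketch using Java's BigInteger (the exponent 3217, a known Mersenne prime exponent, is chosen here just to keep the run short):

import java.math.BigInteger;

public class LucasLehmer {
    // Lucas-Lehmer test: 2^p - 1 is prime iff s_(p-2) == 0, starting from s_0 = 4 (p an odd prime).
    static boolean isMersennePrime(int p) {
        BigInteger m = BigInteger.ONE.shiftLeft(p).subtract(BigInteger.ONE); // 2^p - 1
        BigInteger s = BigInteger.valueOf(4);
        for (int i = 0; i < p - 2; i++) {
            s = s.multiply(s).subtract(BigInteger.valueOf(2)).mod(m);        // s = s^2 - 2 (mod M_p)
        }
        return s.signum() == 0;
    }

    public static void main(String[] args) {
        int p = 3217;
        long t0 = System.nanoTime();
        boolean prime = isMersennePrime(p);
        System.out.printf("2^%d - 1 prime: %b (%.1f ms)%n", p, prime, (System.nanoTime() - t0) / 1e6);
    }
}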
I remember a guy who tested the computational performance of machines and compiler versions by inverting Hilbert matrices.
For image processing, median filtering (used for removing noise and bad pixels) is always too slow. It might make a good test, given a large enough image, say 1000x1000.
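Since the median filter is a natural micro-benchmark, here is a deliberately naive sketch of the 3x3 case over a synthetic grayscale image (sorting the full neighbourhood per pixel is exactly what makes it slow, and exactly what you would then try to optimize per compiler and platform):

import java.util.Arrays;
import java.util.Random;

public class MedianFilterBench {
    // Naive 3x3 median filter: sort the 9-pixel neighbourhood for every output pixel (borders skipped).
    static int[][] median3x3(int[][] img) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        int[] window = new int[9];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int k = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        window[k++] = img[y + dy][x + dx];
                Arrays.sort(window);
                out[y][x] = window[4];           // median of 9 values
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 1000;                            // the 1000x1000 size suggested above
        int[][] img = new int[n][n];
        Random rnd = new Random(42);
        for (int[] row : img) for (int x = 0; x < n; x++) row[x] = rnd.nextInt(256);

        long t0 = System.nanoTime();
        median3x3(img);
        System.out.printf("3x3 median over %dx%d: %.1f ms%n", n, n, (System.nanoTime() - t0) / 1e6);
    }
}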

Clojure number crunching performance

I'm not sure whether this belongs on StackOverflow or in the Clojure Google group. But the group seems to be busy discussing numeric improvements for Clojure 1.2, so I'll try here:
http://shootout.alioth.debian.org/ has a number of performance benchmarks for various languages.
I noticed that Clojure was missing, so I made a Clojure version of the n-body problem.
The fastest code I was able to produce can be found here, and benchmarking it seems to indicate that for number crunching Clojure is:
a factor of ~10 faster than Python/Ruby/Perl
a factor of ~4 slower than C/Java/Scala/Ada
approximately on par with OCaml, Erlang and Go
I'm quite happy with that level of performance.
My question to the Clojure gurus is
Are there obvious improvements I have missed, either in terms of speed or in terms of code brevity or readability (without sacrificing speed)?
Do you consider this to be representative of Clojure performance vs Python/Ruby/Perl on one hand and Java/C on the other?
Update
More Clojure 1.1 benchmark programs for the shootout here, including the n-body problem.
Not a flood of responses here :) but apparently some interest, so I'll try to answer my own question with what I've learned over the past few days:
With the 1.1 optimization approach (Java primitives and mutable arrays), ~4x slower than optimized Java is about as fast as it gets.
The 1.2 constructs definterface and deftype are more than twice as fast, coming within ~1.7x (+70%) of Java with shorter, simpler and cleaner code than for 1.1.
Here are the implementations:
Clojure 1.1 approach
Clojure 1.2 approach
More details including "lessons learned", JVM version and profiling screenshots.
Subjectively speaking, optimizing the 1.2 code was a breeze compared to optimizing 1.1, so this is very good news for Clojure number crunching. (Actually close to amazing :)
The 1.2 testing used the current master branch; I did not try any of the new numeric branches. From what I can gather, the new ideas currently being discussed
may make non-optimized numerics faster
may speed up the 1.1 version of this benchmark
will probably not speed up the 1.2 version; it is already as "close to the metal" as it is likely to get.
Disclaimers:
Clojure 1.2 is not released yet, so 1.2 benchmark results are preliminary.
This is one particular benchmark on physics calculations. It is relevant to floating point number crunching, but irrelevant to performance in areas like string parsing, concurrency or web request handling.
I wonder if Cantor might be of use to you -- it's a high performance math library for Clojure. Also see this thread on the Google group, which is about a similar project in the context of the new primitive arithmetic stuff.
This is a slightly old question and the existing answers are somewhat out of date, so I'd like to add an update as of mid-2013 for those interested in "number crunching" in Clojure.
There has been a lot happening in the Clojure numerical computing space:
Clojure 1.5 is now out, with much improved support for numerical operations. In most cases it's now possible to get very close to pure Java speed.
A dedicated newsgroup - Numerical Clojure
core.matrix now provides an idiomatic API for matrix maths / numerical computing that supports multiple backend implementations (including native BLAS libraries)
Disclaimer: I'm a maintainer / contributor to several of the above.

Site on OpenGL call performance

I'm searching for reliable data on the performance of OpenGL's functions. A site that could, for example:
...answer me how much more efficient using glInterleavedArrays is compared to a gl*Pointer based implementation, with or without strides. If applicable, show the comparisons on nVidia vs. ATI cards vs. embedded systems.
...answer me how much of a boost is gained by using VBOs vs. non-buffered data for static, dynamic and stream data.
I'd like to find a site that has "no-bullshit" performance data, not just vague statements like "glInterleavedArrays are usually faster than direct gl*Pointer usage".
Is there such a dream site? Or at least somewhere I can get answers to the aforementioned questions?
(yes, I know that nothing will beat hand-profiling, but the fact that something works faster on my machine, doesn't mean it's faster generally on all cards...)
It's more about application-level benchmarking than measuring the performance of individual features, but it might be possible to learn something from SPECviewperf, especially if you can discover more about what OpenGL mode each benchmark uses to perform its rendering. The benchmark seems to include some options to tweak the usage of display lists, vertex arrays, etc., but I don't think SPEC's published results go into any analysis of the effects of changing these from the defaults. They don't seem to have any VBO coverage yet.

Image Recognition

I'd like to do some work with the nitty-gritty of computer imaging. I'm looking for a way to read single pixels of data, analyze them programmatically, and change them. What is the best language to use for this (Python, C++, Java...)? What is the best file format?
I don't want any super fancy software/APIs... I'm looking for the bare basics.
If you need speed (you'll probably always want speed with image processing) you definitely have to work with raw pixel data.
Java has a real disadvantage here: you cannot access memory directly, which makes pixel access quite slow compared to languages that can.
C++ is definitely the language of choice for production image processing. But you can, for example, also use C#, as it allows unsafe code in specific areas (take a look at the Scan0 pointer property of the BitmapData class).
I've used C# successfully for image processing applications, and they are definitely much faster than their Java counterparts.
I would not use any scripting language or java for such a purpose.
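For reference, the closest Java gets to "raw pixel data" is pulling the backing array out of a BufferedImage instead of calling getRGB pixel by pixel; it still does not match pointer access in C/C++/C#, but it is the fast path if you end up in Java anyway. A minimal sketch (the image size and the invert operation are arbitrary):

import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

public class RawPixels {
    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB);

        // The int[] backing the image: one packed 0xRRGGBB value per pixel, row-major.
        int[] pixels = ((DataBufferInt) img.getRaster().getDataBuffer()).getData();

        // Invert every pixel by writing straight into the backing array.
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] = ~pixels[i] & 0x00FFFFFF;
        }
    }
}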
It's very easy to manipulate the large multi-dimensional or complex arrays of pixel information that make up pictures using a high-level language such as Python. There's a library called PIL (the Python Imaging Library) that is quite useful and will let you do general filters and transformations (change the brightness, soften, desaturate, crop, etc.) as well as manipulate the raw pixel data.
It is the easiest and simplest image library I've used to date and can be extended to do whatever it is you're interested in (edge detection in very little code, for example).
I studied Artificial Intelligence and Computer Vision, thus I know pretty well the kind of tools that are used in this field.
Basically: you can use whatever you want as long as you know how it works behind the scene.
Now depending on what you want to achieve, you can either use:
The C language: you will lose a lot of time on bug hunting and memory management when implementing your algorithms. In theory it is the fastest language for this kind of job, but if your algorithms are not computationally efficient (in terms of complexity), or if you lose too much time chasing bugs, it is clearly not worth it. So I would advise implementing your application in another language first; you can always optimize small parts of your code with C bindings later.
Octave/MatLab: a very efficient language, almost as efficient as C, and you can write very elegant and succinct algorithms. If you are into vectorization, matrices and linear operations, you should go with it. However, you won't be able to develop a whole application in this language; it's more focused on algorithms, but you can always develop an interface in another language later.
Python: an all-in-one, elegant and accessible language, used in gigantic, large-scale applications such as Google and Facebook. You can do pretty much anything you want with Python, any kind of application. It is perfectly suited if you want to build a full application (with client interaction and all, not only algorithms), or if you want to quickly draft a prototype using existing libraries, since Python has a very large set of high-quality libraries, like OpenCV. However, if you only want to write algorithms, you may be better off with Octave/MatLab.
The answer that was selected as the solution is very biased, and you should be careful about this kind of archaic comment.
Nowadays, hardware is cheaper than wetware (humans), and thus, you should use languages where you will be able to produce results faster, even if it's at the cost of a few CPU cycles or memory space.
Also, a lot of people tend to think that as long as you implement your software in C/C++, you have the Holy Grail of speed: this is just not true. First, algorithmic complexity matters a lot more than the language you are using (a bad algorithm will never beat a better algorithm, even one implemented in the slowest language in the universe), and secondly, high-level languages nowadays do a lot of caching and speed optimization for you, which can make your program run even faster than in C/C++.
Of course, you can always do everything of the above in C/C++, but how much of your time are you willing to waste to reinvent the wheel?
Not only will C/C++ be faster, but most of the image processing sample code you find out there will be in C as well, so it will be easier to incorporate things you find.
If you are looking to do numerical work on your images (think matrices) and you are into Python, check out http://www.scipy.org/PyLab - this basically gives you MATLAB-like abilities in Python; a buddy of mine swears by it.
(This might not apply for the OP who only wanted the bare basics -- but now that the speed issue was brought up, I do need to write this, just for the record.)
If you really need speed, it's better to forget about working on the pixel-by-pixel level, and rather see whether the operations that you need to perform could be vectorized. For example, for your C/C++ code you could use the excellent Intel IPP library (no, I don't work for Intel).
It depends a little on what you're trying to do.
If runtime speed is your issue then C++ is the best way to go.
If development speed is an issue, though, I would suggest looking at Java. You said that you wanted low-level manipulation of pixels, which Java will do for you. The other thing that might be an issue is handling the various file formats. Java has some very nice APIs for reading and writing various image formats to file (in particular the Java2D library; you can choose to ignore the higher levels of the API).
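To illustrate the Java2D route just mentioned: ImageIO handles the file-format side and BufferedImage gives per-pixel read/write access, which already covers "read single pixels, analyze them, change them" with nothing beyond the standard library. A minimal sketch (the file names are placeholders):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class PixelDemo {
    public static void main(String[] args) throws Exception {
        BufferedImage img = ImageIO.read(new File("input.png"));   // PNG/JPEG/GIF/BMP out of the box

        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int argb = img.getRGB(x, y);
                int r = (argb >> 16) & 0xFF, g = (argb >> 8) & 0xFF, b = argb & 0xFF;
                int gray = (r + g + b) / 3;                        // trivial per-pixel analysis
                img.setRGB(x, y, (argb & 0xFF000000) | (gray << 16) | (gray << 8) | gray);
            }
        }
        ImageIO.write(img, "png", new File("output.png"));
    }
}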
If you do go for the C++ option (or Python, come to think of it), I would again suggest using a library to get you over the start-up issues of reading and writing files. I've previously had success with libgd.
What language do you know the best? To me, this is the real question.
If you're going to spend months and months learning one particular language, then there's no real advantage in using Python or Java just for their (to be proven) development speed.
I'm particularly proficient in C++ and I think that for this particular task I can be as speedy as a Java programmer, for example. With the aid of some good library (OpenCV comes to mind) you can create anything you need in a matter of a couple of lines of C++ code, really.
Short answer: C++ and OpenCV
Short answer? I'd say C++, you have far more flexibility in manipulating raw chunks of memory than Python or Java.

Resources