Custom Language Stemmer for Elasticsearch

Is there any way to create a new stemmer? There is, for example, an analyzer for the Czech language already built in, with a Czech stemmer. That algorithm was made by some guys in the Netherlands. It's not that bad, but to a native speaker it is clear that those honorable guys do not speak the language. If I wanted to create my own stemming algorithm, how could I do it in Elasticsearch?
Thanks.

Elasticsearch is based on Lucene, so this answer is about how to add a custom stemmer to Lucene.
This is how I implemented Lucene's Analyzer interface based on a custom stemmer (or lemmatizer, to be more precise):
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/StemmerAnalyzer.java
See also these two classes:
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/CompoundStemmerTokenFilter.java
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/LemmatizerWrapper.java
Note that this is for an older version of Lucene (3.2/3.3). The same implementation would probably be simpler for newer versions.
https://code.google.com/p/hunglish-webapp/source/browse/trunk/pom.xml
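
For newer Lucene versions, the usual pattern is to wrap the algorithm in a TokenFilter that rewrites each term in place. Below is a minimal sketch of that pattern; MyCzechStemFilter and its stem() method are hypothetical placeholders for your own algorithm, and to use it from Elasticsearch you would expose the filter through an analysis plugin.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Hypothetical custom stemming filter: rewrites each token to its stem.
    public final class MyCzechStemFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public MyCzechStemFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Replace the surface form with the stem produced by your algorithm.
            String stem = stem(termAtt.toString());
            termAtt.setEmpty().append(stem);
            return true;
        }

        // Placeholder for the actual stemming logic.
        private String stem(String term) {
            return term.endsWith("ovat") ? term.substring(0, term.length() - 4) : term;
        }
    }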

Related

Partial word tokenizers vs Word oriented tokenizers Elasticsearch

Reading the link below, I am looking for some use case/example showing when it is better to use ngram tokenizing versus the standard tokenizer, with some comparison.
I hope the Elastic documentation will include more examples and comparisons in the future.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Can someone help me?
Thank you.
The Elastic documentation does include more examples. You can find them on the dedicated page of each tokenizer (here is the standard, here is the ngram).
In general, you might want to use an ngram tokenizer to implement a search-as-you-type functionality, such as the auto-suggest in a search input.
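
For illustration, here is a minimal index-settings sketch of such a search-as-you-type setup (the index, tokenizer, and analyzer names are made up). The ngram tokenizer indexes fragments like "gr", "gre", "ree", so a partially typed or slightly misspelled input can still match; the standard tokenizer, by contrast, keeps whole words, which suits ordinary full-text matching.

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "autocomplete_ngram": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 3,
              "token_chars": ["letter", "digit"]
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "autocomplete_ngram",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }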

Elasticsearch scripting engine

I am using Elasticsearch 5.5 and I am trying to implement my own scoring algorithm. I use a match query with fuzziness 2, and in my custom Java scoring algorithm I need the terms that matched my query, so I can calculate the edit distance and come up with a custom score.
I found a lot of good examples of native scripts doing this, but since native scripts are deprecated in 5.5, I need to do this with a scripting engine. Does anyone have good examples of such a custom score implementation via a scripting engine?
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-engine.html
Native script example:
https://github.com/ixxi-mobility/elasticsearch-editdistance-scoring
https://github.com/thomasheckmann/elasticsearch-hamming/blob/master/elasticsearch-hamming/src/main/java/dk/kolbeck/elastic/plugin/hamming/HammingDistanceScriptFactory.java
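
The script-engine plugin wiring is version-specific, so the linked docs are the authority there; the edit-distance computation at the heart of such a score, however, is plain Java. A minimal sketch (the class and method names are made up):

    // Plain Levenshtein edit distance, the core of such a custom score.
    public final class EditDistance {

        public static int levenshtein(CharSequence a, CharSequence b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                prev[j] = j; // distance from the empty prefix of a
            }
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(
                            curr[j - 1] + 1,      // insertion
                            prev[j] + 1),         // deletion
                            prev[j - 1] + cost);  // substitution
                }
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[b.length()];
        }

        // Example scoring: map the distance into (0, 1], higher means closer.
        public static double score(String indexedTerm, String queryTerm) {
            return 1.0 / (1.0 + levenshtein(indexedTerm, queryTerm));
        }
    }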

When are Stemmers used in ElasticSearch?

I am confused about when stemmers are used in ElasticSearch.
In the Dealing with Human Language/Reducing Words to Their Root Form section, I see that stemmers are used to strip words down to their root forms. This led me to believe that stemmers were used as a token filter on an analyzer.
But a token filter only filters tokens; it does not actually reduce words to their root forms.
So, where are stemmers used?
In fact, you can do stemming with a token filter in an analyzer. That is exactly how stemming works in ES. Have a look at the documentation for Stemmer Token Filter.
ES also provides the Snowball Analyzer, which is a convenient analyzer to use for stemming.
Otherwise, if there is a different type of stemming you would like to use, you can always build your own Custom Analyzer. This gives you complete control over the stemming solution that works best for you, as discussed here in the guide.
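As a concrete illustration, here is a minimal custom-analyzer sketch using the built-in stemmer token filter (the index, filter, and analyzer names are made up). With this analyzer, "running" and "runs" are both indexed as the token "run", which is how a token "filter" ends up rewriting tokens rather than merely dropping them.

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_stemmer": {
              "type": "stemmer",
              "language": "english"
            }
          },
          "analyzer": {
            "my_stemming_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "english_stemmer"]
            }
          }
        }
      }
    }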
Hope this helps!

Stemming - code examples or open source projects?

Stemming is something that's needed in tagging systems. I use delicious, and I don't have time to manage and prune my tags. I'm a bit more careful with my blog, but it isn't perfect. I write software for embedded systems that would be much more functional (helpful to the user) if they included stemming.
For instance:
Parse
Parser
Parsing
Should all mean the same thing to whatever system I'm putting them into.
Ideally there's a BSD licensed stemmer somewhere, but if not, where do I look to learn the common algorithms and techniques for this?
Aside from BSD stemmers, what other open source licensed stemmers are out there?
-Adam
Snowball stemmer (C & Java)
I've used its Python binding, PyStemmer.
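
Snowball also ships as Java (the org.tartarus.snowball classes, bundled with Lucene among others). A minimal usage sketch; the exact output depends on the Snowball version:

    import org.tartarus.snowball.ext.EnglishStemmer;

    public class StemDemo {
        public static void main(String[] args) {
            EnglishStemmer stemmer = new EnglishStemmer();
            for (String word : new String[] {"parse", "parser", "parsing"}) {
                stemmer.setCurrent(word);
                stemmer.stem();
                // "parse" and "parsing" typically both reduce to "pars"
                System.out.println(word + " -> " + stemmer.getCurrent());
            }
        }
    }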
Check out the NLTK toolkit, written in Python. It has a very functional stemmer.
Another option for stemming would be WordNet, along with one of its APIs. Some basic information on stemming and lemmatization, including a description of the Porter stemming algorithm, can be found online in Introduction to Information Retrieval.
Lucene has a stemmer in it, I believe (and IIRC it lets you use your own if you want).
EDIT: Just checked, and Lucene refers to the Snowball site, which is an open source stemming library as far as I can tell.

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use a full-text search engine like Lucene, but I think that's overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS' IntelliSense), but it should also be possible to filter this list by a string which is not present in it but is close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. For information retrieval, I have used an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
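
To show what the in-memory option looks like, here is a minimal sketch against a recent Lucene API (class names have moved around between versions). FuzzyQuery matches terms within a small edit distance, so "gren" finds "Green":

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class FuzzyDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // purely in-memory index
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                for (String color : new String[] {"Red", "Green", "Blue"}) {
                    Document doc = new Document();
                    doc.add(new TextField("name", color, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            // FuzzyQuery tolerates up to two edits by default: "gren" matches "green"
            FuzzyQuery query = new FuzzyQuery(new Term("name", "gren"));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("name")); // prints Green
            }
        }
    }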
I'm not sure how well Lucene is suited for fuzzy searching; a custom library could be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding - Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it is LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some methods from Walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also find some code examples written in Java and C# for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast, battle-tested, has libraries for every environment, and it's used by large companies like craigslist.org.
