Elasticsearch scripting engine - elasticsearch

I am using elasticsearch 5.5 and I am trying to implement my own scoring algorithm. I use match query with fuzziness 2 and I need terms that match to my query in my java custom acore algorithm to calculate the edit distance and come up with a custom score.
I found lot of good examples of native scripts to do this,but since native since are deperecated in 5.5 I need to do this in scrpting engine.Does anyone have good examples of such custom score implementation via scripting engine?
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-engine.html
Native script example:
https://github.com/ixxi-mobility/elasticsearch-editdistance-scoring
https://github.com/thomasheckmann/elasticsearch-hamming/blob/master/elasticsearch-hamming/src/main/java/dk/kolbeck/elastic/plugin/hamming/HammingDistanceScriptFactory.java

Related

Implement Fuzzy Prefix Full-text matching query in Elasticsearch

I understand that Elasticsearch tries to avoid using Fuzzy search with prefix matching, and this is also why it doesn't natively support such feature due to its complexity. However, we have a directory search system that solely relies on elasticsearch as a blackbox search engine, and we need the following logic:
E.g. Say the terms are "Michael Pierce Chem". We want to support full text search on the first two terms (with match query) and we also want to do fuzzy on the last term first and then do a prefix match, as if "Chem" matches "chemistry", "chen", and even "YouTube Chen" due to full-text support.
Please give me some advice on the implementation and design suggestions. Current stack is a NodeJS web app with Elasticsearch.

Search algorithm options for ontology querying?

I have developed a tool that enables searching of an ontology I authored. It submits the searches as SPARQL queries.
I have received some feedback that my search implementation is all-or-none, or "binary". In other words, if a user's input doesn't exactly match a term in the ontology, they won't get any hit at all.
I have been asked to add some more flexible, or "advanced" search algorithms. Indexing and bag-of-words searching were suggested.
Can anyone give some examples of implementing search methods on an ontology that don't require a literal match?
FIrst of all, what kind of entities are you trying to match (literals, or string casts of URIs?), and what kind of SPARQL queries are you running now? Something like this?
?term ?predicate "user input" .
If you are searching across literals, you can make the search more flexible right off the bat by using case-insensitive regular expression filtering, although this will probably make your searches slower, and it won't catch cases where some of the word tokens are present but in a different order. In the following example, your should probably constrain the types of ?term and ?predicate first, or even filter on a string datatype on ?userInput
?term ?predicate ?someLiteral .
FILTER(regex(?someLiteral), "user input", "i"))
Several triplestores offer support for full-text searching and result scoring. These are often extensions to the SPARQL language.
For example, Virtuoso and some others offer a bif:contains predicate. Virtuoso also offers the faceted search web interface (plus a service, I think.) I have been pleased with the web-based full text search in Blazegraph and Stardog, but I can't say anything at this point about using them with a SPARQL query to get a score on a search pattern. Some (GraphDB) even support explicit integration with Lucene or Solr*, so you may be able to take advantage of their search languages.
Finally... are you using a library like the OWL API or RDF4J to access your ontology? If so, you could certainly save the relationships between your terms and any literals in a Java native data structure, and then directly use a fuzzy search component like Lucene to index each literal as a "document" and then search the user input across the index.
Why don't you post your ontology and give an example of a search you would like to peform in a non-binary way. I (or someone else) can try to show you a minimal implementation.
*Solr integration only appears to be offered in the commercially-licensed version of GraphDB

Custom Language Stemmer for Elasticsearch

Is there any way how to create new stemmer? There is for example analyzer for czech language already built in with czech language stemmer. This algorithm was made by some guys in Netherlands. It's not that bad, but for the native speaker it is clear that those honorable guys does not speak the language. If I would like to create my own stemming algorithm, how can I do it in the Elasticsearch?
Thanks.
Elasticsearch is based on Lucene, so this answer is about how to add a custom stemmer to Lucene.
This is how I implemented Lucene's Analyzer interface based on a custom stemmer (or lemmatizer, to be more precise):
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/StemmerAnalyzer.java
See also these two classes:
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/CompoundStemmerTokenFilter.java
https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/LemmatizerWrapper.java
Note, that this is for an older version of Lucene, 3.2/3.3. The same implementation would probably be more simple for new versions.
https://code.google.com/p/hunglish-webapp/source/browse/trunk/pom.xml

a simple filtering language that can be embedded in ruby?

I have a ruby project where part of the operation is to select entities given user-specified constraints. So far, I've been hacking my own filter language, using regular expressions and specifying inclusion/exclusion based on the fields in the entities.
If you are interested in my current approach, here's an example: For instance, given this list of entities:
[{"type":"dog", "name":"joe"}, {"type":"dog", "name":"fuzz"}, {"type":"cat", "name":"meow"}]
A user could specify a filter like so:
{"filter":{
"type":{"included":["dog"] },
"name":{"excluded":["^f.*"] }
}}
Would match all dogs but exclude fuzz.
This is sort of working now. However, I am starting to require more sophisticated selection parameters. I am thinking that rather than continuing to hack on my own filter language, there might be a more general-purpose filter language I can just embed in my application? For instance, is there a parser that can in-app filter using a SQL where clause? Or are there some other general, simple filter languages that I'm not aware of? I would especially like to move away from regexps since I want to do range querying on numbers (like is entity["size"] < 50 ?)
It is a little bit of an extrapolation, but I think you may be looking for a search engine, or at least enough of one that you may as well use one just for the query language.
If so you might want to look at elasticsearch which does have Ruby client bindings, and could be a good fit for what you are trying to do. Especially if you want or need to express the data you want to search as JSON for use by client code, as that format is natively supported by the search engine.
The query language is quite expressive, and there are a variety of built-in and plugin tools available to explore and use it.
in the end, i ended up implementing a ruby dsl. it's easy, fun, and powerful.

Lightweight fuzzy search library

Can you suggest some light weight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use full-text search engines like Lucene, but I think it's an overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS' intellisense) but it should be possible to filter this list by string which is not present in it but close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is to heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable—which means its good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
I'm not sure how well Lucene is suited for fuzzy searching, the custom library would be better choice. For example, this search is done in Java and works pretty fast, but it is custom made for such task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in it's encoding - Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if its PHP then suggest you look at the ZEND Lucene lubrary :
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it LAMP its far lighter than Lucene on Java, and can easily be extended for other filetypes, provided you can find a conversion library or cmd line converter - there are lots of OSS solutions around to do this.
Try Walnutil - based on Lucene API - integrated to SQL Server and Oracle DBs . You can create any type of index and then use it. For simple search you can use some methods from walnutilsoft, for more complicated search cases you can use Lucene API. See web based example where was used indexes created from Walnutil Tools. Also you can see some code example written on Java and C# which you can use it for creating different type of search.
This tools is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
A powerful, lightweight solution is sphinx.
It's smaller then Lucene and it supports disambiguation.
It's written in c++, it's fast, battle-tested, has libraries for every env and it's used by large companies, like craigslists.org

Resources