I am studying the various analyzers that Elasticsearch offers.
I am running tests to build something like Netflix search.
When I type in the Netflix search box (after logging in), results are returned after each keyup.
I noticed that Netflix matches the typed letters anywhere in the titles of films that contain them.
For instance:
If I type "a", the results are: A man, Captain America, The Alien...
Assuming Netflix searches only the title (to keep my example simple), it returns matches even when the typed letter is in the middle of the text, as in "Captain America".
Are they probably using the "NGram Tokenizer", or another analyzer, to get this behavior?
I know that "shingle" is good for autocomplete, but it does not match letters in the middle of a word.
What is the best analyzer configuration to make search behave like Netflix?
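To make the n-gram idea concrete, here is a minimal sketch of what such a configuration could look like, written as a Python dict that could be sent as the index creation body. The analyzer and tokenizer names (title_index, title_search, title_ngram) and the gram sizes are assumptions chosen for illustration, not what Netflix uses. The small ngrams() helper is only a pure-Python imitation of the tokens an n-gram tokenizer would emit (it splits on whitespace, which is cruder than the real tokenizer), so the matching behavior can be checked without a cluster.

```python
# Hypothetical index body: n-grams at index time, whole words at search time.
INDEX_BODY = {
    "settings": {
        # Needed because max_gram - min_gram exceeds the default allowed difference.
        "index": {"max_ngram_diff": 9},
        "analysis": {
            "tokenizer": {
                "title_ngram": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "title_index": {
                    "type": "custom",
                    "tokenizer": "title_ngram",
                    "filter": ["lowercase"],
                },
                "title_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "title_index",
                "search_analyzer": "title_search",
            }
        }
    },
}


def ngrams(text, min_gram=1, max_gram=10):
    """Imitate the n-gram tokens produced for each word of a title."""
    grams = set()
    for word in text.lower().split():
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams


titles = ["A Man", "Captain America", "The Alien", "Up"]
# A title matches when the typed text is one of its indexed n-grams.
matches = [title for title in titles if "a" in ngrams(title)]
```

Using a separate search analyzer keeps the typed text from being split into n-grams itself, which would otherwise make almost every title match. The trade-off of a min_gram of 1 is a much larger index.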
I understand that Elasticsearch tries to avoid combining fuzzy search with prefix matching, and that it doesn't natively support such a feature because of its complexity. However, we have a directory search system that relies solely on Elasticsearch as a black-box search engine, and we need the following logic:
E.g. say the terms are "Michael Pierce Chem". We want to support full-text search on the first two terms (with a match query), and on the last term we want to do a fuzzy match first and then a prefix match, so that "Chem" matches "chemistry", "chen", and even "YouTube Chen" thanks to full-text support.
Please give me some advice on the implementation and design suggestions. Current stack is a NodeJS web app with Elasticsearch.
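One possible shape for this, as a rough sketch: build a bool query where the leading terms go into a match clause and the last term goes into a nested bool that accepts either a fuzzy or a prefix hit. The function name build_directory_query and the field name "name" are hypothetical. The query body is plain JSON, so the same structure can be built in NodeJS; Python is used here only to keep the example short and checkable.

```python
def build_directory_query(text, field="name"):
    """Build an Elasticsearch query body: match on leading terms,
    fuzzy OR prefix on the last term."""
    terms = text.split()
    if not terms:
        return {"match_all": {}}

    *head, last = terms
    # fuzzy and prefix are term-level queries and are not analyzed,
    # so lowercase the input to line up with a lowercased index.
    last = last.lower()

    last_term_clause = {
        "bool": {
            "should": [
                {"fuzzy": {field: {"value": last, "fuzziness": "AUTO"}}},
                {"prefix": {field: {"value": last}}},
            ],
            "minimum_should_match": 1,
        }
    }

    must = []
    if head:
        must.append({"match": {field: " ".join(head)}})
    must.append(last_term_clause)
    return {"bool": {"must": must}}


query = build_directory_query("Michael Pierce Chem")
```

Here "chem" would reach "chen" through the fuzzy clause and "chemistry" through the prefix clause. Whether the last term should be required (must) or only boost the score (should) is a design choice this sketch does not settle.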
I have developed a tool that enables searching of an ontology I authored. It submits the searches as SPARQL queries.
I have received some feedback that my search implementation is all-or-none, or "binary". In other words, if a user's input doesn't exactly match a term in the ontology, they won't get any hit at all.
I have been asked to add some more flexible, or "advanced" search algorithms. Indexing and bag-of-words searching were suggested.
Can anyone give some examples of implementing search methods on an ontology that don't require a literal match?
First of all, what kind of entities are you trying to match (literals, or string casts of URIs?), and what kind of SPARQL queries are you running now? Something like this?
?term ?predicate "user input" .
If you are searching across literals, you can make the search more flexible right off the bat by using case-insensitive regular expression filtering, although this will probably make your searches slower, and it won't catch cases where some of the word tokens are present but in a different order. In the following example, you should probably constrain the types of ?term and ?predicate first, or even filter on a string datatype for ?someLiteral.
?term ?predicate ?someLiteral .
FILTER(regex(?someLiteral, "user input", "i"))
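To also catch the different-word-order case, one option is to generate one condition per word of the user input and require all of them. Below is a minimal sketch that builds such a query string; the helper names are hypothetical, and it uses the standard SPARQL 1.1 functions CONTAINS, LCASE and STR instead of regex so that the user input does not need regex escaping.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def bag_of_words_filter(user_input, var="?someLiteral"):
    """Require every word of the input to occur in the literal, in any order."""
    tokens = user_input.lower().split()
    if not tokens:
        raise ValueError("user input contains no searchable words")
    clauses = [
        'CONTAINS(LCASE(STR({})), "{}")'.format(var, escape_literal(token))
        for token in tokens
    ]
    return "FILTER(" + " && ".join(clauses) + ")"


def build_query(user_input):
    """Wrap the filter in a complete SELECT query over all literals."""
    return (
        "SELECT DISTINCT ?term ?someLiteral WHERE {\n"
        "  ?term ?predicate ?someLiteral .\n"
        "  " + bag_of_words_filter(user_input) + "\n"
        "}"
    )


sparql = build_query("Disease heart")
```

Replacing && with || would turn this into an any-word match, at the cost of more noise in the results.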
Several triplestores offer support for full-text searching and result scoring. These are often extensions to the SPARQL language.
For example, Virtuoso and some others offer a bif:contains predicate. Virtuoso also offers a faceted search web interface (plus a service, I think). I have been pleased with the web-based full-text search in Blazegraph and Stardog, but I can't say anything at this point about using them with a SPARQL query to get a score on a search pattern. Some (GraphDB) even support explicit integration with Lucene or Solr*, so you may be able to take advantage of their search languages.
Finally... are you using a library like the OWL API or RDF4J to access your ontology? If so, you could certainly save the relationships between your terms and any literals in a Java native data structure, and then directly use a fuzzy search component like Lucene to index each literal as a "document" and then search the user input across the index.
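To illustrate the "index each literal as a document" idea without pulling in Lucene, here is a small stand-in sketch that uses only the Python standard library: an inverted index from word to term, with difflib providing the fuzzy matching. This is not Lucene and will not scale or score like it; the term identifiers and labels are invented for the example.

```python
import difflib
from collections import defaultdict


def build_index(literals):
    """Map each lowercased word to the set of terms whose label contains it.

    literals: dict of term identifier -> label text.
    """
    index = defaultdict(set)
    for term, label in literals.items():
        for token in label.lower().split():
            index[token].add(term)
    return index


def fuzzy_search(index, user_input, cutoff=0.8):
    """Rank terms by how many input words approximately match their label."""
    scores = defaultdict(int)
    vocabulary = list(index)
    for token in user_input.lower().split():
        close = difflib.get_close_matches(token, vocabulary, n=5, cutoff=cutoff)
        for candidate in close:
            for term in index[candidate]:
                scores[term] += 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))


# Hypothetical labels standing in for the literals of an ontology.
labels = {
    "ex:HeartDisease": "heart disease",
    "ex:LiverDisease": "liver disease",
    "ex:Aspirin": "aspirin",
}
index = build_index(labels)
hits = fuzzy_search(index, "disease hart")
```

The misspelled, reordered input still ranks the heart disease term first, and a partial match is returned rather than nothing at all.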
Why don't you post your ontology and give an example of a search you would like to perform in a non-binary way? I (or someone else) can try to show you a minimal implementation.
*Solr integration only appears to be offered in the commercially-licensed version of GraphDB.
Hi, I was wondering whether there is any analyzer in Elasticsearch that identifies the grammar of the text (nouns, verbs, etc.).
For example, when the user searches for "fast smartphone", Elasticsearch should be able to put more emphasis on "smartphone" than on "fast". So I would like Elasticsearch to return results in the following order:
1) docs where both words match "fast smartphone"
2) docs where smartphone matches
3) docs where "fast" matches. Or maybe docs that match only "fast" should never be returned, since the user is mainly looking for smartphones.
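A rough sketch of how that ordering could be expressed in the query itself, assuming the part-of-speech tagging happens outside Elasticsearch (for example in an NLP library, before the query is built) and the noun and its modifiers are already known. The function name, the field name "description" and the boost value are all hypothetical.

```python
def build_weighted_query(head_noun, modifiers, field="description",
                         require_noun=True, noun_boost=3.0):
    """Build a bool query that favours the noun over its modifiers."""
    noun_clause = {"match": {field: {"query": head_noun, "boost": noun_boost}}}
    modifier_clauses = [{"match": {field: {"query": word}}} for word in modifiers]

    if require_noun:
        # Noun is mandatory; modifiers only raise the score.
        # Docs matching only a modifier are never returned.
        return {"bool": {"must": [noun_clause], "should": modifier_clauses}}

    # Any word may match; the boost pushes noun matches towards the top.
    return {
        "bool": {
            "should": [noun_clause] + modifier_clauses,
            "minimum_should_match": 1,
        }
    }


strict = build_weighted_query("smartphone", ["fast"])
loose = build_weighted_query("smartphone", ["fast"], require_noun=False)
```

The strict variant gives orders 1 and 2 and drops case 3 entirely. The loose variant keeps all three cases, but since relevance scoring also depends on term statistics, a boost makes the desired order likely rather than guaranteed.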
Let's say I have a big corpus (for example in English, or in an arbitrary language), and I want to perform some semantic search on it.
For example I have the query:
"Be careful: [art] armada of [sg] is coming to [do sg]!"
And the corpus contains the following sentence:
"Be careful: an armada of alien ships is coming to destroy our planet!"
It can be seen that my query string could contain "semantic placeholders", such as:
[art] - some placeholder for articles (for example a / an in English)
[sg], [do sg] - some placeholders for NPs and VPs (subjects and predicates)
I would like to develop a library capable of handling these queries efficiently.
I suspect that some kind of POS tagging would be necessary for parsing the text, but I don't want to reimplement an existing full-text search engine to make it work, so I'm wondering how I could integrate this behaviour into a search engine like Lucene.
I know there are SpanQueries which could behave similarly in some cases, but as far as I can see, Lucene doesn't do any semantic processing of stored texts.
Is it possible to implement behaviour like this? Or do I have to write my own search engine?
With Lucene, you could add additional tokens to a single item in a TokenStream, but I wouldn't know how to deal with tags that span more than one word.
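The stacked-token idea can be sketched independently of Lucene: each position in the text holds the word itself plus a tag token, and a phrase query may name either one at each position. The sketch below is plain Python with hand-written tags (a real setup would get them from a POS tagger), and it only covers single-word placeholders such as [art]; multi-word spans like [sg] remain the open problem mentioned above.

```python
def annotate(tagged_tokens):
    """Turn (word, tag) pairs into positions holding both the word and its tag.

    This imitates emitting a second token at the same position in a token stream.
    """
    return [{word.lower(), "[{}]".format(tag)} for word, tag in tagged_tokens]


def phrase_match(positions, query_tokens):
    """Return the start position of the first match, or -1 if there is none.

    Each query token must equal either the word or the tag at its position.
    """
    length = len(query_tokens)
    for start in range(len(positions) - length + 1):
        if all(query.lower() in positions[start + offset]
               for offset, query in enumerate(query_tokens)):
            return start
    return -1


# Hand-tagged fragment of the example sentence; the tag names are made up.
sentence = [
    ("Be", "verb"), ("careful", "adj"), ("an", "art"),
    ("armada", "noun"), ("of", "prep"),
]
positions = annotate(sentence)
found = phrase_match(positions, ["careful", "[art]", "armada"])
missing = phrase_match(positions, ["careful", "the", "armada"])
```

Because the tag shares a position with its word, an ordinary positional phrase match handles the placeholder without any change to the matching logic.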
I'm currently developing a website that allows searching a PostgreSQL database. The search works with to_tsquery(), and I'm trying to find a way to validate the input before it is sent as a query.
Other than that, I'm also trying to add phrase support, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find articles that have all 3 words, regardless of where they appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the to_tsquery parsing algorithm in the client.
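If the concern is that you don't want to run the whole query (which presumably involves table access) each time you validate, you could pass the input to a much smaller query instead. A sketch in Python, assuming the psycopg2 driver and an open connection conn (any driver with parameter binding would work the same way):

import psycopg2

def is_valid_query(conn, user_input):
    """Return True if PostgreSQL can parse user_input as a tsquery."""
    try:
        with conn.cursor() as cur:
            # to_tsquery() on its own touches no tables, so this is cheap
            cur.execute("SELECT to_tsquery(%s)", (user_input,))
        return True
    except psycopg2.DatabaseError:
        conn.rollback()  # clear the aborted transaction
        return False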
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
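That translation code could look roughly like the sketch below. It assumes a deliberately tiny input grammar: alternatives separated by |, each either a single word or a double-quoted phrase. Anything richer (nested parentheses, & or ! operators, a | inside quotes) would need a real parser. The function name translate is hypothetical.

```python
import re


def translate(query):
    """Turn e.g. 'HELLO | "I LIKE CATS"' into a loose tsquery string
    and a compiled regex for the client-side phrase filter."""
    ts_parts = []
    regex_parts = []
    for alternative in query.split("|"):
        alternative = alternative.strip()
        if alternative.startswith('"') and alternative.endswith('"'):
            words = alternative.strip('"').split()
            # Loose form: all words present, in any order.
            ts_parts.append("(" + "&".join(words) + ")")
            # Strict form: words adjacent, separated by any whitespace.
            regex_parts.append(r"\s+".join(re.escape(word) for word in words))
        else:
            ts_parts.append(alternative)
            regex_parts.append(re.escape(alternative))

    tsquery = "|".join(ts_parts)
    pattern = re.compile(
        "|".join(r"\b(?:{})\b".format(part) for part in regex_parts),
        re.IGNORECASE,
    )
    return tsquery, pattern


tsquery, pattern = translate('HELLO | "I LIKE CATS"')
documents = ["Hello there", "i  like cats a lot", "cats like what I like"]
kept = [doc for doc in documents if pattern.search(doc)]
```

One caveat with this approach: to_tsquery matches stemmed forms (so "cat" can match "cats"), while the regex matches the literal words, so the client-side filter is stricter than the database search.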
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in tsvector values. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least. Note that PostgreSQL 9.6 and later support this directly, through the <-> (followed by) tsquery operator and the phraseto_tsquery() function.