Elasticsearch: exclude queries with only one term

I was wondering if it is possible in Elasticsearch to exclude queries where the query is a single term? I am trying to use "minimum_should_match" as 2, which works well when the query has 2 or more terms. However, if the number of terms in the query is 1, ES will still return results. It seems that ES is using the logic of "well you asked for a minimum of matching two terms, yet there is only one term to match; we'll lower the minimum to 1". Is there a way to turn this functionality off, or otherwise do what I am looking for?
For those wondering why this can't be done at the API level, I am using a query analyzer that excludes stop words. So a query like "a ipad" would end up being 1 term, while the API would see 2. The API could do stopword filtering but that seems to be a waste of resources.

Before doing a query you can first analyze the input by your custom analyzer.
You can use the Analyze API for this (be sure to set the analyzer property to be equal to your custom analyzer name).
The result would be a list of analyzed tokens. If your analyzer removes stopwords, it would return only ipad for a ipad.
So if the Analyze API returns only one token, you actually don't need to query Elasticsearch at all, because you don't want any results when the number of tokens is less than 2 (if I understood you correctly).
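For illustration, here is a rough sketch of that approach in Python with the requests library. The index name (my_index), analyzer name (my_custom_analyzer) and field name (title) are placeholders for whatever your setup actually uses:

import requests

ES = "http://localhost:9200"
INDEX = "my_index"               # placeholder index name
ANALYZER = "my_custom_analyzer"  # your custom analyzer that strips stopwords

def count_tokens(query_text):
    # Run the raw query text through the same analyzer used at index time.
    resp = requests.post(
        f"{ES}/{INDEX}/_analyze",
        json={"analyzer": ANALYZER, "text": query_text},
    )
    resp.raise_for_status()
    return len(resp.json()["tokens"])

def search(query_text):
    # Skip the search entirely when fewer than two real terms survive analysis.
    if count_tokens(query_text) < 2:
        return []
    body = {
        "query": {
            "match": {
                "title": {  # placeholder field name
                    "query": query_text,
                    "minimum_should_match": 2,
                }
            }
        }
    }
    return requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]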

Related

How to return frequencies of matched terms in Elasticsearch query-string searches?

I am trying to adapt an existing boolean query that runs query-string searches against multiple fields so that it returns the number of hits for each matched term, for each search result.
This seems like it should be a straightforward request since the default relevancy scoring takes the matched term frequencies into account on a doc-by-doc basis, and highlighting must parse the fields to identify the positions of matched terms, but from scouring the docs there doesn't seem to be an easy way to do this that doesn't require some additional parsing of the results returned by Elasticsearch.
I would really like to avoid having to do more than one call to Elasticsearch per query for performance reasons, so would like to adapt the existing search queries if possible.
I know that the search API has an "explain" option that, when set to "true", makes a nested "_explanation" object be returned for each search result, but most of the information in these objects is irrelevant to what I want to know (the matched term frequencies), and I haven't found a way to exclude any of that information from being returned in the search results. I am reluctant to use this option because a) I've seen advice that it should only be used for debugging purposes and not in production, and b) the queries I'm running are not for individual terms but for query strings that can contain an arbitrary number of matched terms for each query, thus making the explanation objects much larger in some cases (therefore increasing the response payload) and more complex to parse. It's also not clear if the "_explanation" object has a well-defined structure anyway.
I've also considered parsing highlighted text fragments to determine matched term frequencies, since I'm already returning highlighted fields as part of the query. However, again this would require some additional parsing of the API response which would uncomfortably couple the method for obtaining matched term frequencies with the custom pre- and post-tags of the highlighted fields.
Edit: Just to clarify, I would be open to a separate Elasticsearch call per search if necessary, but not any that would require submitting a set of document IDs matched from the first query, because this would mean the API calls couldn't be done in parallel and because the number of returned results in the first call could be quite high, which I presume would impact performance of the second call.
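If you do end up experimenting with the "explain" option despite those reservations, a rough sketch of pulling the term-frequency components out of the explanation tree could look like the following. It assumes the default similarity's explanation output, where some detail nodes carry a description mentioning "freq"; the index and field names are placeholders, and since _explanation has no guaranteed structure this is exploratory only:

import requests

ES = "http://localhost:9200"

def collect_freq_nodes(node, out):
    # Walk the nested _explanation tree and keep nodes whose description
    # looks like a term-frequency component of the score.
    if "freq" in node.get("description", ""):
        out.append((node["description"], node["value"]))
    for child in node.get("details", []):
        collect_freq_nodes(child, out)

body = {
    "query": {"query_string": {"query": "ipad case", "fields": ["title", "body"]}},  # placeholder fields
    "explain": True,
}
resp = requests.post(f"{ES}/my_index/_search", json=body).json()  # my_index is a placeholder

for hit in resp["hits"]["hits"]:
    freqs = []
    collect_freq_nodes(hit.get("_explanation", {}), freqs)
    print(hit["_id"], freqs)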

Performance of match vs term query in elasticsearch?

I've been using a lot of match queries in my project. Now I have just come across the term query in Elasticsearch. It seems the term query is faster when the keyword you are querying is exact.
Now I have a question:
Should I refactor my code (there's a lot of it) and use term instead of match?
How much better is the performance of term compared to match?
using term in my query:
main_query["query"]["bool"]["must"].append({"term":{object[..]:object[...]}})
using match query in my query:
main_query["query"]["bool"]["must"].append({"match":{object[..]:object[...]}})
Elastic discourages using term queries on text fields for obvious reasons (analysis!), but if you know you need to query a keyword field (not analyzed!), definitely go for term/terms queries instead of match: the match query does a lot more work on top of analyzing the input, and will eventually end up executing a term query anyway once it notices that the queried field is a keyword field.
As far as I know, using the match query means your field is mapped as "text" and goes through an analyzer. The indexed value is broken into tokens, and at query time your input is analyzed as well, so the matching is done token by token.
Term does an exact match: it does not go through any analyzer and looks up the exact term in the inverted index.
Because of this, I believe that by skipping analysis, term is faster.
I use term queries to search keyword-like fields such as categories and tags, where it doesn't make sense to use an analyzer.
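To make the distinction concrete, here is a sketch of the two variants in the same style as the question's code, assuming a hypothetical mapping where status is a keyword field and description is a text field:

# term query: exact lookup against a keyword (not analyzed) field
term_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "published"}}  # hypothetical keyword field
            ]
        }
    }
}

# match query: the input is analyzed first, then matched token by token
# against an analyzed text field
match_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"description": "red ipad case"}}  # hypothetical text field
            ]
        }
    }
}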

Elasticsearch questions: search, performance and caching

I'm new to elasticsearch, have been reading their API and some things are not clear to me
1) It is said that filters are cached. What does that mean? If I send a query with a filter on it, what gets cached? The results of that query? If I send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is Elasticsearch's documentation on this.
2) Is there a real performance difference between a query matching a term X against the "_all" field or against a specific field? As far as I understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is how many fields are matched against X in those documents. Is that correct?
1) For your first question take a look at this link.
To quote from the post
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
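As an illustration of the kind of query whose filter part can benefit from that caching, here is a sketch using filter context in a bool query (older Elasticsearch versions expressed the same idea with the filtered query); the field names are placeholders:

# Only the "filter" clauses are candidates for caching: they are yes/no checks
# with no scoring, so Elasticsearch can keep their BitSet and reuse it the next
# time any search uses the same filter.
search_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "ipad"}}        # scored, full-text part
            ],
            "filter": [
                {"term": {"status": "active"}},     # non-scoring, cacheable
                {"range": {"price": {"lte": 500}}}  # non-scoring, cacheable
            ]
        }
    }
}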
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size."link
So if you know which fields you are going to query, search on those specific fields.
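A sketch of the difference, with placeholder field names (and keeping in mind that _all was removed in later Elasticsearch versions, so this reflects the older setup the question describes):

# Searching the catch-all field: every field's text is copied into _all at
# index time, which costs index size and CPU but needs no field names.
query_all = {"query": {"match": {"_all": "ipad"}}}

# Searching specific fields: less work per query if you already know
# where the term should appear.
query_specific = {
    "query": {
        "multi_match": {
            "query": "ipad",
            "fields": ["title", "description"]  # placeholder field names
        }
    }
}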

How to write fast Elastic Search queries

Is there a guide to writing ES queries - what to do, what to avoid, that sort of thing? The official site describes all the various ways to search, but provides little guidance as to when to select what.
In my particular instance I have a list of providers; each one has a name, an address and a number of IDs. I want to give the user a box into which he can type anything he knows about the provider, and run a search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wildcard matches, phonetic matches, and synonyms (for names). Some fuzziness should be included too.
The official site describes various ways to do each of these, but how do I combine them? For instance, to support wildcard search, do I use a wildcard query, or do I index with nGrams and run a plain text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan. If the SQL optimizer tells you it will use a table scan against a table of considerable size, you know you should change your query or perhaps add an index. AFAIK there is no equivalent to this powerful feature in ES, and I am not even sure it is possible to build one.
But at least some generic considerations...? Pretty please...
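For concreteness, here is a rough sketch of one way such a combined search could look as a single bool query; the field names (name, name.phonetic, address, provider_ids) are hypothetical and assume a mapping that defines a phonetic sub-field and keeps the IDs unanalyzed:

user_input = "john smith 12345"

search_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"name": {"query": user_input, "fuzziness": "AUTO"}}},  # fuzzy name match
                {"match": {"name.phonetic": user_input}},                         # phonetic match
                {"match": {"address": user_input}},                               # address match
                {"terms": {"provider_ids": user_input.split()}},                  # exact ID match
            ],
            "minimum_should_match": 1
        }
    }
}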
There is no single best way to go about it, because much of the time it depends on what you are indexing and how you map your data into fields within Elasticsearch.
Some rules of thumb to look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
  "query": {
    // data will be searched from this block first //
  },
  "facets": {
    // after the data is received, it will be processed into facets //
  }
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on the result of your query: they take the entire result set the query produces and then keep or discard documents according to the filter.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you apply a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries
Again, all of these are rules of thumb and cannot be followed strictly, because what you index into specific fields will affect processing times. A string is different from a long, and analyzed strings are different from non-analyzed ones. What you should probably do is experiment with your queries to form a better judgement.
One correction to the above: filters are cacheable by ES, not queries. Queries do the extra steps of relevance scoring and full-text search. So wherever full-text search is not needed, using a filter is advised.
Also, design your mappings with the correct index values (not_analyzed, no, analyzed).
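For reference, a sketch of such a mapping in the pre-5.x string syntax those index values belong to (newer versions use "text"/"keyword" instead); the type and field names are placeholders:

provider_mapping = {
    "mappings": {
        "provider": {
            "properties": {
                "name":        {"type": "string", "index": "analyzed"},      # full-text search
                "provider_id": {"type": "string", "index": "not_analyzed"},  # exact term lookups
                "notes":       {"type": "string", "index": "no"},            # stored but not searchable
            }
        }
    }
}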

retaining case in elasticsearch faceted search

Is there a way to do faceted searches using the Elasticsearch Search API while maintaining case (as opposed to having the results converted to lowercase)?
Thanks in advance, Chuck
Assuming you are using the "terms" facet, the facet entries are exactly the terms in the index. Briefly, analysis is the process of converting a field value into a sequence of terms, and lowercasing is a step in the default analyzer; that's why you're seeing lowercased terms. So you will want to change your analysis configuration (and perhaps introduce a multi_field if you want to run several different analyzers.)
There's a great explanation in Lucene in Action (2nd Ed.); it's applicable to ElasticSearch, too.
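For illustration, here is a sketch of that multi_field approach using the older mapping and facet syntax this question implies (newer versions would use a keyword sub-field and a terms aggregation instead); the type and field names are placeholders:

# Map "category" twice: an analyzed version for full-text search, and a
# not_analyzed "raw" sub-field that preserves the original casing for faceting.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "category": {
                    "type": "multi_field",
                    "fields": {
                        "category": {"type": "string", "analyzer": "standard"},
                        "raw":      {"type": "string", "index": "not_analyzed"},
                    }
                }
            }
        }
    }
}

# Facet on the raw sub-field so the facet entries come back exactly as indexed,
# original case intact.
facet_query = {
    "query": {"match_all": {}},
    "facets": {
        "categories": {
            "terms": {"field": "category.raw"}
        }
    }
}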
