How can I find the true score from Elasticsearch query string with a wildcard? - elasticsearch

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq.Query("Micro*")))
.From(pageNumber)
.Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the ElasticSearch page on using Co-variants. The results are co-variant; they are of mixed type coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1.
This is regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?

Multi term queries like wildcard query are given a constant score equal to the boosting by default. You can change this behaviour using .Rewrite().
var results = client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq
.Query("Micro*")
.Rewrite(RewriteMultiTerm.ScoringBoolean)
)
)
.From(pageNumber)
.Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive and there is a default limit of 1024 bool query clauses that can be easily hit for a large document corpus; running your query on the complete StackOverflow data set (questions, answers and users) for example, hits the clause limit for questions. You may want to analyze some text with an analyzer that uses an edgengram token filter.

Wildcard searches will always return a score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?

Related

Elastic Search - Tokenization and Multi Match query

I need to perform tokenization and multi match in a single query in Elastic Search.
Currently,
1)I am using the analyzer to get the tokens like below
String text = // 4 line log data;
List<AnalyzeToken> analyzeTokenList = new ArrayList<AnalyzeToken>();
AnalyzeRequestBuilder analyzeRequestBuilder = this.client.admin().indices().prepareAnalyze();
for (String newIndex : newIndexes) {
analyzeRequestBuilder.setIndex(newIndex);
analyzeRequestBuilder.setText(text);
analyzeRequestBuilder.setAnalyzer(analyzer);
Response analyzeResponse = analyzeRequestBuilder.get();
analyzeTokenList.addAll(analyzeResponse.getTokens());
}
then, I will iterate through the AnalyzeToken and get the list of tokens,
List<String> tokens = new ArrayList<String>();
for (AnalyzeToken token : tokens)
{
tokens.addAll(token.getTerm().replaceAll("\\s+"," "));
}
then use the tokens and frame the multi-match query like below,
String query = "";
for(string data : tokens) {
query = query + data;
}
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(query, "abstract", "title");
Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);
Based on the result, I am checking whether similar data exists in the database.
Is it possible to combine as single query - the analyze and multi match query as single query?
Any help is appreciated!
EDIT :
Problem Statement : Say I have 90 entries in one index, In which each 10 entries in that index are identical (not exactly but will have 70% match) so I will have 9 pairs.
I need to process only one entry in each pair, so I went in the following approach (which is not the good way - but as of now I end up with this approach)
Approach :
Get each entry from the 90 entries in the index
Tokenize using the analyzer (this removes the unwanted keywords)
Search in the same index (It checks whether the same kind of data is there in the index) and also filters the flag as processed. --> this flag will be updated after the first log gets processed.
If there is no flag available as processed for the similar kind of data (70% match) then I will process these logs and update the current log flag as processed.
If any data already exist with the flag as processed then I will consider this data is already processed and I will continue with the next one.
So Ideal goal is to, process only one data in the 10 unique entries.
Thanks,
Harry
Multi-match queries internally uses the match queries which are analyzed means they apply the same analyzer which is defined in the fields mapping(standard) if there is no analyzer defined.
From the multi-match query doc
The multi_match query builds on the match query to allow multi-field
queries:
Also, accepts analyzer, boost, operator, minimum_should_match,
fuzziness, lenient, as explained in match query.
So what you are trying to do is overkill, even if you want to change the analyzer(need different tokens during search time) then you can use the search analyzer instead of creating tokens and then using them in multi-match query.

Simple query without a specified field searching in whole ElasticSearch index

Say we have an ElasticSearch instance and one index. I now want to search the whole index for documents that contain a specific value. It's relevant to the search for this query over multiple fields, so I don't want to specify every field to search in.
My attempt so far (using NEST) is the following:
var res2 = client.Search<ElasticCompanyModelDTO>(s => s.Index("cvr-permanent").AllTypes().
Query(q => q
.Bool(bo => bo
.Must( sh => sh
.Term(c=>c.Value(query))
)
)
));
However, the query above results in an empty query:
I get the following output, ### ES REQEUST ### {} , after applying the following debug on my connectionstring:
.DisableDirectStreaming()
.OnRequestCompleted(details =>
{
Debug.WriteLine("### ES REQEUST ###");
if (details.RequestBodyInBytes != null) Debug.WriteLine(Encoding.UTF8.GetString(details.RequestBodyInBytes));
})
.PrettyJson();
How do I do this? Why is my query wrong?
Your problem is that you must specify a single field to search as part of a TermQuery. In fact, all ElasticSearch queries require a field or fields to be specified as part of the query. If you want to search every field in your document, you can use the built-in "_all" field (unless you've disabled it in your mapping.)
You should be sure you really want a TermQuery, too, since that will only match exact strings in the text. This type of query is typically used when querying short, unanalyzed string fields (for example, a field containing an enumeration of known values like US state abbreviations.)
If you'd like to query longer full-text fields, consider the MultiMatchQuery (it lets you specify multiple fields, too.)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
Try this
var res2 = client.Search<ElasticCompanyModelDTO>(s =>
s.Index("cvr-permanent").AllTypes()
.Query(qry => qry
.Bool(b => b
.Must(m => m
.QueryString(qs => qs
.DefaultField("_all")
.Query(query))))));
The existing answers rely on the presence of _all. In case anyone comes across this question at a later date, it is worth knowing that _all was removed in ElasticSearch 6.0
There's a really good video explaining the reasons behind this and the way the replacements work from ElasticOn starting at around 07:30 in.
In short, the _all query can be replaced by a simple_query_string and it will work with same way. The form for the _search API would be;
GET <index>/_search
{
"query": {
"simple_query_string" : {
"query": "<queryTerm>"
}
}
}
The NEST pages on Elastic's documentation for this query are here;

How to use the elasticseach java api for dynamic searches?

So I'm trying to use elasticsearch for dynamic query building. Imagine that I can have a query like:
a = "something" AND b >= "other something" AND (c LIKE "stuff" OR c LIKE "stuff2" OR d BETWEEN "x" AND "y");
or like this:
(c>= 23 OR d<=43) AND (a LIKE "text" OR a LIKE "text2") AND f="text"
Should I use the QueryBuilder or the FilterBuilder, how do you match both? The official documentation says that for exact values we should use the filter approach? I assume I should use filters for equal comparisons? what about dates and numbers? Should I use the Filter or Query?
For the Like/Equals for the number/number problem I tried this:
#Field(type = String, index = FieldIndex.analyzed, pattern = "(\\d+\\/\\d+)|(\\d+\\/)|(\\d+)|(\\/\\d+)")
public String processNumber;
The pattern would deal with the structure number + slash + number, but also number and number + slash.
But when using either the term filter or the match_query I can't get only hits with the exact structure like 20/2014, if I type 20 I would still get hits on the term filter.
Query is the main component when you search for something, it takes into consideration ranking and other features such as stemming, synonyms and other things. Filter, on the other hand, just filters the result set you get from your query.
I suggest that if you don't care about the ranking use filters because they are faster. Otherwise, use query.

How to use wildcards with ngrams in ElasticSearch

Is it possible to combine wildcard matches and ngrams in ElasticSearch? I'm already using ngrams of length 3-11.
As a very small example, I have records C1239123 and C1230123. The user wants to return both of these. This is the only info they know: C123?12
The above case won't work on my full match analyzer because the query is missing the 3 on the end. I was under the impression wildcard matches would work out of the box, but if I perform a search similar to the above I get gibberish.
Query:
.Search<ElasticSearchProject>(a => a
.Size(100)
.Query(q => q
.SimpleQueryString(query => query
.OnFieldsWithBoost(b => b
.Add(f => f.Summary, 2.1)
.Add(f => f.Summary.Suffix("ngram"), 2.0)
.Query(searchQuery))));
Analyzer:
var projectPartialMatch = new CustomAnalyzer
{
Filter = new List<string> { "lowercase", "asciifolding" },
Tokenizer = "ngramtokenizer"
};
Tokenizer:
.Tokenizers(t=>t
.Add("ngramtokenizer", new NGramTokenizer
{
TokenChars = new[] {"letter","digit","punctuation"},
MaxGram = 11,
MinGram = 3
}))
EDIT:
The main purpose is to allow the user to tell the search engine exactly where the unknown characters are. This preserves the match order. I do not ngram the query, only the indexed fields.
EDIT 2 with more test results:
I had simplified my prior example a bit too much. The gibberish was being caused by punctuation filters. With a proper example there's no gibberish, but results aren't returned in a relevant order. Seeing below, I'm unsure why the first 2 results match at all. Ngram is not applied to the query.
Searching for c.a123?.7?0 gives results in this order:
C.A1234.560
C.A1234.800
C.A1234.700 <--Shouldn't this be first?
C.A1234.950
To anyone looking for a resolution to this, wildcards are used on ngrammed tokens by default. My problem was due to my queries having punctuation in them and using a standard analyzer on my query (which breaks on punctuation).
Duc.Duong's suggestion to use the Inquisitor plugin helped show exactly how data would be analyzed.

How to do facet search with mpdreamz Nest

does anybody know how to do facet search with Nest?
My index is https://gist.github.com/3606852
would like to search for some keyword in 'NumberEvent' and dispaly the result if the keyword exist.Please help me !!!
This is using the assumption that the MyPoco class exists and maps to your elasticsearch document. If it doesn't you can use dynamic but you'l have to swap the lambda based field selectors with strings.
var result = client.Search<MyPoco>(s=>s
.From(0)
.Size(10)
.Filter(ff=>ff.
.Term(f=>f.Categories.Types.Events.First().NumberEvent.event, "keyword")
)
.FacetTerm(q=>q.OnField(f=>f.Categories.Types.Facets.First().Person.First().entity))
);
result.Documents now holds your documents
result.Facet<TermFacet>(f => f.Categories.Types.Facets.First().Person.First().entity); now holds your facets
Your document seems a bit strange though in the sense that it already has Facets with counts in them.

Resources