How to use wildcards with ngrams in ElasticSearch - elasticsearch

Is it possible to combine wildcard matches and ngrams in ElasticSearch? I'm already using ngrams of length 3-11.
As a very small example, I have records C1239123 and C1230123. The user wants to return both of these. This is the only info they know: C123?12
The above case won't work on my full match analyzer because the query is missing the 3 on the end. I was under the impression wildcard matches would work out of the box, but if I perform a search similar to the above I get gibberish.
Query:
.Search<ElasticSearchProject>(a => a
    .Size(100)
    .Query(q => q
        .SimpleQueryString(query => query
            .OnFieldsWithBoost(b => b
                .Add(f => f.Summary, 2.1)
                .Add(f => f.Summary.Suffix("ngram"), 2.0))
            .Query(searchQuery))));
Analyzer:
var projectPartialMatch = new CustomAnalyzer
{
Filter = new List<string> { "lowercase", "asciifolding" },
Tokenizer = "ngramtokenizer"
};
Tokenizer:
.Tokenizers(t=>t
.Add("ngramtokenizer", new NGramTokenizer
{
TokenChars = new[] {"letter","digit","punctuation"},
MaxGram = 11,
MinGram = 3
}))
EDIT:
The main purpose is to allow the user to tell the search engine exactly where the unknown characters are. This preserves the match order. I do not ngram the query, only the indexed fields.
EDIT 2 with more test results:
I had simplified my prior example a bit too much. The gibberish was being caused by punctuation filters. With a proper example there's no gibberish, but the results aren't returned in a relevant order. As shown below, I'm unsure why the first two results match at all. Ngram is not applied to the query.
Searching for c.a123?.7?0 gives results in this order:
C.A1234.560
C.A1234.800
C.A1234.700 <--Shouldn't this be first?
C.A1234.950

To anyone looking for a resolution to this, wildcards are used on ngrammed tokens by default. My problem was due to my queries having punctuation in them and using a standard analyzer on my query (which breaks on punctuation).
Duc.Duong's suggestion to use the Inquisitor plugin helped show exactly how data would be analyzed.
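To make the resolution above concrete, here is a minimal sketch in plain Java. It is a toy model, not Elasticsearch code: the class and method names are made up for illustration, and it assumes the wildcard is lowercased and tested against each whole 3-11 character gram, with ? standing for exactly one character and * for any run of characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustration only: a toy model of wildcard matching against n-gram tokens.
final class WildcardOnNGrams {

    // Every lowercased substring of `text` whose length is in [minGram, maxGram].
    static List<String> ngrams(String text, int minGram, int maxGram) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= s.length(); len++) {
                grams.add(s.substring(start, start + len));
            }
        }
        return grams;
    }

    // Translate a wildcard pattern to a regex: '?' is one character, '*' is any run.
    // Every other character (including '.') is quoted so it matches literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if at least one whole 3-11 character gram matches the pattern.
    static boolean matchesAnyGram(String indexedValue, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        for (String gram : ngrams(indexedValue, 3, 11)) {
            if (pattern.matcher(gram).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesAnyGram("C1239123", "C123?12"));        // true
        System.out.println(matchesAnyGram("C1230123", "C123?12"));        // true
        System.out.println(matchesAnyGram("C.A1234.700", "c.a123?.7?0")); // true
        System.out.println(matchesAnyGram("C.A1234.560", "c.a123?.7?0")); // false
    }
}
```

In this model only C.A1234.700 matches the punctuated pattern, which is the behaviour I expected once the query stopped being split on punctuation.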

Related

Hibernate Search with nGram | How to instruct that nGram do no make grams during search time

I have defined my Analyzer as below
@AnalyzerDefs({
    @AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            // @TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                @Parameter(name = "minGramSize", value = "3"),
                @Parameter(name = "maxGramSize", value = "255") }) }),
    //-----------------------------------------------------------------------
    @AnalyzerDef(name = "ngram_query",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            // @TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class)
        })
})
@Analyzer(definition = "ngram")
public class EPCAsset extends Asset {
    @Field
    private String obturatorMaterial;
}
It correctly produces n-gram terms at index time, but it also breaks the search query into n-grams at search time.
What I want is for the search query to be matched against the n-gram index without being broken into grams itself.
Note: I have to use n-grams here because the requirement is to match anywhere in the text, at the start or in the middle, so edge n-grams are not an option for me.
Example:
Input data to be indexed: ICQ 234
At index time, the generated terms are:
"234"
" 23"
" 234"
"cq "
"cq 2"
"cq 23"
"cq 234"
"icq"
"icq "
"icq 2"
"icq 23"
"icq 234"
"q 2"
"q 23"
"q 234"
Now when I search for icq it works perfectly. But it also matches icqabc, because at search time the query itself is broken into n-grams. Is there a way to stop the search term from being broken up at search time while still searching the n-gram index?
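A rough way to picture the problem is the following plain Java sketch. It is not Hibernate Search or Lucene code: the names are invented for illustration, and it deliberately ignores positions and scoring, treating a "match" as the query and the index sharing at least one term.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: why n-gramming the query makes "icqabc" match "ICQ 234".
final class QueryTimeNGramSketch {

    // Lowercase the whole input (keyword tokenizer + lowercase filter),
    // then emit every substring whose length is in [minGram, maxGram].
    static Set<String> ngrams(String text, int minGram, int maxGram) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < s.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= s.length(); len++) {
                grams.add(s.substring(start, start + len));
            }
        }
        return grams;
    }

    // Simplified notion of a match: any query term exists in the index.
    static boolean sharesATerm(Set<String> indexedTerms, Set<String> queryTerms) {
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = ngrams("ICQ 234", 3, 255);
        System.out.println(indexed.size()); // 15, the terms listed above

        // Query analyzed with the "ngram" analyzer: "icqabc" yields "icq", which is indexed.
        System.out.println(sharesATerm(indexed, ngrams("icqabc", 3, 255))); // true

        // Query kept whole (lowercase only, as with "ngram_query"): no such term.
        System.out.println(sharesATerm(indexed, Collections.singleton("icqabc"))); // false
        System.out.println(sharesATerm(indexed, Collections.singleton("icq")));    // true
    }
}
```

The point of the sketch is only that the unwanted match comes from the grams of the query, not from the index, so the index-time analyzer can stay as it is.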
Here is how I build my search query:
FullTextEntityManager fullTextEntityManager = Search
    .getFullTextEntityManager(entityManager);
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
    .forEntity(entityClass).get();
Query query = qb.phrase().onField("obturatorMaterial").sentence("icqabc").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query,
    entityClass);
fullTextQuery.getResultList();
I am using Elasticsearch as the backend for Hibernate Search.
EDIT:
I also applied a query-time analyzer as per @yrodiere's answer, but it gives me an error:
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
.forEntity(entityClass).overridesForField("obturatorMaterial","ngram_query").get();
org.hibernate.search.exception.SearchException: HSEARCH000353: Unknown analyzer: 'ngram_query'. Make sure you defined this analyzer.
EDIT
As per this link overriderForField when using elasticsearch backed hibernate search
I am now able to define a second, query-time analyzer, and it solved the problem.
First, you should double check that an ngram filter really is what you want. I'm mentioning this because the ngram analyzer is generally used both at indexing and querying, so that it provides fuzzy matches. It's kind of the whole point of this analyzer.
Do you really need matches when the user types "cq 2"? Does it make sense? When implementing autocomplete, people generally prefer to only match documents containing words that start with the user input, so "i" would match, and "ic" and "icq" would too, but not "cq 2". If this seems to be what you want, you should have a look at the "edge_ngram" filter. It tends to improve the relevance of matches and also doesn't require as much disk space.
Now, even with the "edge_ngram" filter you will need to disable ngrams at query time. In Hibernate Search, this is done by "overriding" the analyzer.
First, define a second analyzer, identical to the one you use during indexing, but without the "ngram" or "edge_ngram" filter. Name it "ngram_query".
Then, use this to create your query builder:
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(EPCAsset.class)
.overridesForField( "obturatorMaterial", "ngram_query" )
.get();
Use the query builder to create your query as usual.
Note that, if you rely on Hibernate Search to push the index schema and analyzers to Elasticsearch, you will have to use a hack in order for the query-only analyzer to be pushed: by default only the analyzers that are actually used during indexing are pushed. See https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4
Either use a search-time analyzer, very likely the keyword analyzer, or use a term query instead of a match query. A match query is analyzed, which means that by default it applies the same analyzer that was used at index time, whereas a term query is not analyzed.
Read more about the term query and the match query for more information.
Edit: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html describes the use of search_analyzer with an edgeNGram tokenizer for autocomplete search, which is exactly your use case.

Creating an Index with English Analyzer using Nest

I am using nest to create my Elasticsearch Index. I have two questions:
Question 1: How can I add the settings to use the english analyzer, with a fallback to the standard analyzer?
This is how I am creating my Index:
Uri _node = new Uri("elasticUri");
ConnectionSettings _connectionSettings = new ConnectionSettings(_node)
.DefaultIndex("MyIndexName")
.DefaultMappingFor<POCO>(m => m
.IndexName("MyIndexName")
);
IElasticClient _elasticClient = new ElasticClient(_connectionSettings);
var createIndexResponse = _elasticClient.CreateIndex("MyIndexName", c => c
.Mappings(m => m
.Map<POCO>(d => d.AutoMap())
)
);
Looking at the examples here, I am also not sure what I should pass for "english_keywords", "english_stemmer", etc.
Question 2: If I use English Analyzer, will Elasticsearch automatically realize that the terms: "Barbecue" and "BBQ" are synonyms? Or do I need to explicitly pass a list of Synonyms to ES?
Take a look at the NEST documentation for configuring a built-in analyzer for an index.
The documentation for the english analyzer simply demonstrates how you could reimplement the english analyzer yourself, as a custom analyzer, with the built-in analysis components, if you need to customize any part of the analysis. If you don't need to do this, simply use english as the name of the analyzer for a field:
client.CreateIndex("my_index", c => c
.Mappings(m => m
.Map<POCO>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.MyProperty)
.Analyzer("english")
)
)
)
)
);
This will use the built-in english analyzer for the MyProperty field on POCO.
The english analyzer will not perform automatic synonym expansion for you; you'll need to configure the synonyms that are relevant to your search problem. You have two choices with regard to synonyms:
Perform synonym expansion at index time on the index input. This will result in faster search at the expense of being a relatively fixed approach.
Perform synonym expansion at query time on the query input. This will result in slower search, but affords the flexibility to more easily add new synonym mappings as and when you need to.
You can always take the approach of using both, that is, indexing the synonyms that you expect to be relevant to your search use case, and adding new synonyms at query time, as you discover them to be relevant to your use case.
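The two choices can be pictured with a small plain Java sketch. This is a toy model rather than Elasticsearch or NEST code: the synonym table, the whitespace tokenizer and all names are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: index-time versus query-time synonym expansion.
final class SynonymExpansionSketch {

    // Hypothetical synonym table; a real one comes from your own search domain.
    static Map<String, List<String>> synonyms() {
        Map<String, List<String>> table = new HashMap<>();
        table.put("bbq", Arrays.asList("barbecue"));
        table.put("barbecue", Arrays.asList("bbq"));
        return table;
    }

    // Very rough stand-in for an analyzer: lowercase and split on whitespace.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Add the synonyms of every token to the token set.
    static Set<String> expand(Set<String> tokens, Map<String, List<String>> table) {
        Set<String> expanded = new LinkedHashSet<>(tokens);
        for (String token : tokens) {
            expanded.addAll(table.getOrDefault(token, Collections.<String>emptyList()));
        }
        return expanded;
    }

    // Simplified match: the document and the query share at least one term.
    static boolean matches(Set<String> indexedTerms, Set<String> queryTerms) {
        return !Collections.disjoint(indexedTerms, queryTerms);
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = synonyms();
        Set<String> doc = tokenize("Best BBQ ribs");
        Set<String> query = tokenize("barbecue");

        System.out.println(matches(doc, query));                // false: no expansion anywhere
        System.out.println(matches(expand(doc, table), query)); // true: expanded at index time
        System.out.println(matches(doc, expand(query, table))); // true: expanded at query time
    }
}
```

With index-time expansion the extra terms are stored once and the query stays small; with query-time expansion the stored terms are untouched, so changing the table needs no reindexing.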

Simple query without a specified field searching in whole ElasticSearch index

Say we have an ElasticSearch instance and one index. I now want to search the whole index for documents that contain a specific value. The value may occur in any of several fields, so I don't want to specify every field to search in.
My attempt so far (using NEST) is the following:
var res2 = client.Search<ElasticCompanyModelDTO>(s => s.Index("cvr-permanent").AllTypes().
Query(q => q
.Bool(bo => bo
.Must( sh => sh
.Term(c=>c.Value(query))
)
)
));
However, the query above results in an empty query:
I get the following output, ### ES REQEUST ### {} , after adding the following debug settings to my connection settings:
.DisableDirectStreaming()
.OnRequestCompleted(details =>
{
Debug.WriteLine("### ES REQEUST ###");
if (details.RequestBodyInBytes != null) Debug.WriteLine(Encoding.UTF8.GetString(details.RequestBodyInBytes));
})
.PrettyJson();
How do I do this? Why is my query wrong?
Your problem is that you must specify a single field to search as part of a TermQuery. NEST treats a term query without a field as conditionless and leaves it out of the request, which is why the body is empty. Most ElasticSearch term-level and full-text queries take a field or fields as part of the query. If you want to search every field in your document, you can use the built-in "_all" field (unless you've disabled it in your mapping).
You should be sure you really want a TermQuery, too, since that will only match exact strings in the text. This type of query is typically used when querying short, unanalyzed string fields (for example, a field containing an enumeration of known values like US state abbreviations.)
If you'd like to query longer full-text fields, consider the MultiMatchQuery (it lets you specify multiple fields, too.)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
Try this
var res2 = client.Search<ElasticCompanyModelDTO>(s =>
s.Index("cvr-permanent").AllTypes()
.Query(qry => qry
.Bool(b => b
.Must(m => m
.QueryString(qs => qs
.DefaultField("_all")
.Query(query))))));
The existing answers rely on the presence of _all. In case anyone comes across this question at a later date, it is worth knowing that _all was deprecated in ElasticSearch 6.0 (it can no longer be enabled for indices created in 6.0 or later) and removed in 7.0.
There's a really good video explaining the reasons behind this and the way the replacements work from ElasticOn starting at around 07:30 in.
In short, a query against _all can be replaced by a simple_query_string query, which works in the same way. The request to the _search API would be:
GET <index>/_search
{
"query": {
"simple_query_string" : {
"query": "<queryTerm>"
}
}
}
The NEST pages on Elastic's documentation for this query are here;

ElasticSearch - Filter stop words from Top words

I have a list of documents I am indexing like this:
ElasticIndex.CreateIndex(IndexName, _ => _
.Mappings(__ => __
.Map<AlbumMetadata>(
M => M.AutoMap()
.Properties(P => P.Text(T => T.Name(N => N.Keywords)
.Analyzer("stop")
.Fields(F => F.Keyword(K => K.Name("keywords"))))))));
In my class AlbumMetadata, the field Keywords is a list:
[Keyword]
public List<string> Keywords { get; set; }
When I want to retrieve the top terms, I do the following query (you can ignore Category and Type, they're not relevant to the problem):
var Match = Driver.Search<AlbumMetadata>(_ => _
.Query(Q => Q
.Term(P => P.Category, (int)Category) && Q
.Term(P => P.Type, (int)Type))
.Source(F => F.Includes(S => S.Fields(L => L.Keywords)))
.Aggregations(A => A
.Terms("Tags", T => T
.Field(E => E.Keywords)
.Size(Limit)
)
));
var Tags = Match.Aggs.Terms("Tags").Buckets.ToDictionary(K => K.Key, V => V.DocCount);
The problem is that in the output, I get some stop words as well as some symbols, like / - & |
What am I doing wrong?
Edit:
In order to clarify the question, here is what I am trying to achieve:
I have documents that have titles (full English sentences) and tags (list of single words, sometimes a tag is a two word tag).
I need to be able to perform a search that will find documents based on the title and tags (and ideally using word stems, ignoring plurals, etc).
I also need to extract the list of top words. The Keywords list is a concatenation of all words from the title and all the entries from the tags list.
Is the way I create the index appropriate in this context? Also, is the way I do the aggregation the right way?
There are a few things:
When you create the index, .AutoMap() on the mapping will infer Elasticsearch field datatypes from the POCO property types and the attributes applied to them. Then, .Properties() overrides any of these inferred mappings. So, the end result of your mapping for Keywords is a text datatype field with the stop analyzer applied, and a multi-field sub field of "keywords" (queryable via "keywords.keywords"), set as a keyword datatype.
The aggregation is running on the "keywords" text field with the stop analyzer applied. The stop analyzer uses English stop words by default, but you can configure the stop analyzer with other stop words by defining a custom stop analyzer in the index. The stop analyzer will not remove symbols like /, -, & and |.
With a terms aggregation, you generally want to get back aggregations on the verbatim terms for a field, which you can get with your mapping by using the "keywords.keywords" field in the aggregation. This works well because a keyword field uses doc_values, an on-disk columnar data structure that is suited to well-performing, large-scale aggregations. You can also apply a normalizer to a keyword field; a normalizer is similar to an analyzer, except that it produces only one token.
You can run the aggregation on a text field too as you're doing, but you also need to enable fielddata and be aware of how it works. text fields can't use doc_values.

How can I find the true score from Elasticsearch query string with a wildcard?

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq.Query("Micro*")))
.From(pageNumber)
.Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the ElasticSearch documentation page on covariant search results. The results are covariant: they are of mixed types coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1.
This is regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?
Multi-term queries such as a wildcard query are given a constant score equal to the query boost by default. You can change this behaviour using .Rewrite():
var results = client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq
.Query("Micro*")
.Rewrite(RewriteMultiTerm.ScoringBoolean)
)
)
.From(pageNumber)
.Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive, and there is a default limit of 1024 bool query clauses that can easily be hit for a large document corpus. Running your query on the complete StackOverflow data set (questions, answers and users), for example, hits the clause limit for questions. You may instead want to analyze some text with an analyzer that uses an edge_ngram token filter.
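As a rough mental model of the rewrite and of the clause limit, here is a plain Java sketch. It is not Lucene or NEST code: the class, the method and the exception type are invented for illustration, and only the 1024 figure comes from the default limit mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: expanding "micro*" into one should clause per matching term.
final class ScoringBooleanRewriteSketch {

    static final int MAX_CLAUSES = 1024; // default bool clause limit mentioned above

    // Walk the sorted term dictionary from the prefix onwards and collect every
    // term that starts with it; fail once the clause limit is exceeded.
    static List<String> expandPrefix(SortedSet<String> termDictionary, String prefix) {
        List<String> shouldClauses = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            shouldClauses.add(term);
            if (shouldClauses.size() > MAX_CLAUSES) {
                throw new IllegalStateException("more than " + MAX_CLAUSES + " clauses");
            }
        }
        return shouldClauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
            Arrays.asList("macro", "micro", "microphone", "microsoft", "mid"));
        System.out.println(expandPrefix(terms, "micro")); // [micro, microphone, microsoft]
    }
}
```

Each expanded term is then scored like an ordinary term clause, which is where both the per-hit scores and the extra CPU cost come from.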
By default, wildcard searches return a constant score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?
