How to use the elasticsearch Java API for dynamic searches? - elasticsearch

So I'm trying to use elasticsearch for dynamic query building. Imagine that I can have a query like:
a = "something" AND b >= "other something" AND (c LIKE "stuff" OR c LIKE "stuff2" OR d BETWEEN "x" AND "y");
or like this:
(c>= 23 OR d<=43) AND (a LIKE "text" OR a LIKE "text2") AND f="text"
Should I use the QueryBuilder or the FilterBuilder, and how do you combine the two? The official documentation says that for exact values we should use the filter approach, so I assume I should use filters for equality comparisons. What about dates and numbers: should those use a filter or a query?
For the LIKE/equals handling of the number/slash/number values I tried this:
@Field(type = String, index = FieldIndex.analyzed, pattern = "(\\d+\\/\\d+)|(\\d+\\/)|(\\d+)|(\\/\\d+)")
public String processNumber;
The pattern is meant to handle the structure number + slash + number, but also a bare number, number + slash, and slash + number.
But with either the term filter or the match query I can't restrict hits to the exact structure like 20/2014: if I type just 20 I still get hits on the term filter.

A query is the main component when you search for something: it takes relevance ranking into account, along with analysis features such as stemming and synonyms. A filter, on the other hand, just filters the result set your query returns and has no effect on scoring.
My suggestion: if you don't care about ranking, use filters, because they are faster. Otherwise, use a query.
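In the pre-5.x Java API you can combine both in a single bool query: scored clauses go into must/should, exact-value and range comparisons go into filter. A minimal sketch of the second example query above, assuming the ES 2.x QueryBuilders API (field names are the question's placeholders):
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// (c >= 23 OR d <= 43) AND (a LIKE "text" OR a LIKE "text2") AND f = "text"
BoolQueryBuilder query = QueryBuilders.boolQuery()
    // analyzed, scored clauses -> query context
    .must(QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("a", "text"))
        .should(QueryBuilders.matchQuery("a", "text2")))
    // exact values and ranges -> filter context (no scoring)
    .filter(QueryBuilders.boolQuery()
        .should(QueryBuilders.rangeQuery("c").gte(23))
        .should(QueryBuilders.rangeQuery("d").lte(43)))
    .filter(QueryBuilders.termQuery("f", "text"));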

Related

Elastic Search - Tokenization and Multi Match query

I need to perform tokenization and a multi-match in a single query in Elasticsearch.
Currently,
1) I am using the analyze API to get the tokens, like below:
String text = // 4 line log data;
List<AnalyzeToken> analyzeTokenList = new ArrayList<AnalyzeToken>();
AnalyzeRequestBuilder analyzeRequestBuilder = this.client.admin().indices().prepareAnalyze();
for (String newIndex : newIndexes) {
    analyzeRequestBuilder.setIndex(newIndex);
    analyzeRequestBuilder.setText(text);
    analyzeRequestBuilder.setAnalyzer(analyzer);
    AnalyzeResponse analyzeResponse = analyzeRequestBuilder.get();
    analyzeTokenList.addAll(analyzeResponse.getTokens());
}
then I iterate through the AnalyzeToken list and collect the terms:
List<String> tokens = new ArrayList<String>();
for (AnalyzeToken token : analyzeTokenList) {
    tokens.add(token.getTerm().replaceAll("\\s+", " "));
}
then I use the tokens to frame the multi-match query like below:
String query = "";
for (String data : tokens) {
    query = query + data + " ";
}
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(query, "abstract", "title");
Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);
Based on the result, I check whether similar data already exists in the database.
Is it possible to combine these into a single query - the analyze step and the multi-match query?
Any help is appreciated!
EDIT :
Problem statement: say I have 90 entries in one index, where each group of 10 entries is near-identical (not exactly, but with about a 70% match), so I have 9 such groups.
I need to process only one entry from each group, so I went with the following approach (which is not a good way, but it is what I have ended up with for now):
Approach :
Get each entry from the 90 entries in the index.
Tokenize it using the analyzer (this removes the unwanted keywords).
Search in the same index (this checks whether the same kind of data already exists there) and also filter on a processed flag. --> this flag is updated after the first log gets processed.
If no similar data (70% match) already carries the processed flag, I process these logs and set the current log's flag to processed.
If data already exists with the processed flag, I consider this data already processed and continue with the next one.
So the ideal goal is to process only one entry from each group of 10.
Thanks,
Harry
Multi-match queries internally use match queries, which are analyzed - meaning they apply the analyzer defined in the field's mapping (or the standard analyzer if none is defined).
From the multi-match query doc
The multi_match query builds on the match query to allow multi-field
queries:
Also, accepts analyzer, boost, operator, minimum_should_match,
fuzziness, lenient, as explained in match query.
So what you are trying to do is overkill: even if you want different tokens at search time, you can set a search analyzer on the query instead of creating the tokens yourself and then feeding them into the multi-match query.
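For example, a minimal sketch of setting the analyzer directly on the multi-match query (field names are taken from the question; the analyzer name is a placeholder):
// Let the multi_match query analyze the raw text itself, instead of
// pre-tokenizing it with the analyze API and re-joining the tokens.
MultiMatchQueryBuilder multiMatchQueryBuilder =
    new MultiMatchQueryBuilder(text, "abstract", "title")
        .analyzer("your_analyzer"); // placeholder: the analyzer used in prepareAnalyze()
Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);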

Hibernate Search with nGram | How to instruct that nGram does not make grams at search time

I have defined my Analyzer as below
@AnalyzerDefs({
    @AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            //@TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                @Parameter(name = "minGramSize", value = "3"),
                @Parameter(name = "maxGramSize", value = "255") }) }),
    //-----------------------------------------------------------------------
    @AnalyzerDef(name = "ngram_query",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            //@TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class)
        })
})
@Analyzer(definition = "ngram")
public class EPCAsset extends Asset {

    @Field
    private String obturatorMaterial;
}
This correctly produces n-gram term vectors at index time, but it also makes n-grams of the search query at search time.
What I want is a way for the search to use the n-gram index without breaking the search term itself into grams.
Note: I have to use n-gram here because the requirement is to match anywhere in the text, at the start or in the middle, so edge-n-gram is not an option for me.
Example:
Input data to be indexed: ICQ 234
Then at index time its term vectors are:
"234"
" 23"
" 234"
"cq "
"cq 2"
"cq 23"
"cq 234"
"icq"
"icq "
"icq 2"
"icq 23"
"icq 234"
"q 2"
"q 23"
"q 234"
Now when I search for icq it works perfectly. But it also matches icqabc, because at search time the query is n-grammed as well. So is there a way to make search time not break the search term into grams, while still searching against the n-gram index?
Here is how my search query is built:
FullTextEntityManager fullTextEntityManager = Search
        .getFullTextEntityManager(entityManager);
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
        .forEntity(entityClass).get();
Query query = qb.phrase().onField("obturatorMaterial").sentence("icqabc").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query,
        entityClass);
fullTextQuery.getResultList();
I am using Elasticsearch as the backend for Hibernate Search.
EDIT:
I also applied a query-time analyzer as per @yrodiere's answer, but it gives me an error:
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
.forEntity(entityClass).overridesForField("obturatorMaterial","ngram_query").get();
org.hibernate.search.exception.SearchException: HSEARCH000353: Unknown analyzer: 'ngram_query'. Make sure you defined this analyzer.
EDIT:
As per this link - overriderForField when using elasticsearch backed hibernate search -
I am now able to define a second, query-time analyzer, and it solved the problem.
First, you should double-check that an ngram filter really is what you want. I mention this because the ngram analyzer is generally used both at indexing and querying, precisely so that it provides fuzzy matches; that's kind of the whole point of this analyzer.
Do you really need matches when the user types cq 2? Does it make sense? When implementing autocomplete, people generally prefer to only match documents containing words that start with the user input: so i would match, and ic and icq would too, but not cq 2. If that is what you want, you should have a look at the "edge_ngram" filter: it tends to improve the relevance of matches and also doesn't require as much disk space.
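For reference, a hedged sketch of what such an edge-ngram definition could look like, modeled on the question's own @AnalyzerDef (the analyzer name and gram sizes are placeholders, not something prescribed by Hibernate Search):
@AnalyzerDef(name = "edge_ngram",
    tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        // EdgeNGramFilterFactory emits only grams anchored at the start of the token
        @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "3"),
            @Parameter(name = "maxGramSize", value = "255") }) })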
Now, even with the "edge_ngram" filter you will need to disable ngrams at query time. In Hibernate Search, this is done by "overriding" the analyzer.
First, define a second analyzer, identical to the one you use during indexing, but without the "ngram" or "edge_ngram" filter. Name it "ngram_query".
Then, use this to create your query builder:
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(EPCAsset.class)
.overridesForField( "obturatorMaterial", "ngram_query" )
.get();
Use the query builder to create your query as usual.
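For instance, a keyword query built from that overriding builder might look like this (a sketch reusing the question's example term):
// "icqabc" is now analyzed with "ngram_query" (no ngram filter), so it must
// match an indexed gram as a whole in order to produce a hit.
Query query = queryBuilder.keyword()
        .onField("obturatorMaterial")
        .matching("icqabc")
        .createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, EPCAsset.class);
List<?> results = fullTextQuery.getResultList();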
Note that, if you rely on Hibernate Search to push the index schema and analyzers to Elasticsearch, you will have to use a hack in order for the query-only analyzer to be pushed: by default only the analyzers that are actually used during indexing are pushed. See https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4
Either you need to use a search-time analyzer - very likely the keyword analyzer - or you need to use a term query instead of a match query; the match query is analyzed, meaning it applies the same analyzer that was used at index time.
Read more about the term query and the match query for more information.
Edit: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html clearly describes the use of search_analyzer with an edge-ngram tokenizer for autocomplete search, which is exactly your use case.
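For completeness, a sketch of the term-query alternative through the plain Elasticsearch Java API (field and value are the question's; the variable name is a placeholder):
import org.elasticsearch.index.query.QueryBuilders;

// A term query is not analyzed: "icq" is compared verbatim against the
// indexed grams, so a query for "icqabc" would no longer match them.
org.elasticsearch.index.query.QueryBuilder esTermQuery =
        QueryBuilders.termQuery("obturatorMaterial", "icq");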

Creating custom fields based on analysis output in elasticsearch

I have documents where the value is a raw string:
{ "content" : "field1=1 , field2=foo"}
My intention is to query by the field1 and field2 values.
The closest thing I can think of is a custom analyzer that creates tokens on the comma separator; then I can search with exact matches like "field1=1" or "field2=foo". Ideally, though, I would like to search by range on field1, by pattern matching on field2, and so on.
Is there any way to achieve this? I could not find any way to store the result of the analysis in a form I can query like that.
How are you ingesting the documents? If you are doing it via Logstash, then you can apply the transformation there using a filter plugin.
I'm having a little difficulty understanding your question, but I think you are asking whether there's a way to make the type of field1 numeric and the type of field2 searchable.
Hopefully you are running Kibana, so you can use the Dev console to test this out. If you just let Elastic import the data, it will create aggregatable and searchable fields for both field1 and field2, because they are both set to string values:
PUT /content_default/type/1 {"field1":"1" , "field2":"foo"}
If you instead omit the quotes around the 1, Elastic will create the field as a long (assuming you haven't already imported a document with a string in the same field) - this allows you to search by range. Here I'm creating a new field3 and setting its value to 1; if you query, you should see it's a long:
PUT /content_default/type/2 {"field1":"1" , "field2":"foo", "field3":1}
You can pre-load a template that defines your field types up front, before loading any data - that way Elastic doesn't have to guess what types your fields should be. For strings you can also define whether you want them to be just keywords, searchable text, or both.
Something like this should do the trick for you:
PUT _template/with_template
{
   "template":"content_with_template",
   "mappings":{
      "content_with_template":{
         "properties":{
            "field2":{
               "analyzer":"simple",
               "type":"text"
            },
            "field1":{
               "type":"keyword"
            },
            "field3":{
               "type":"long"
            }
         }
      }
   }
}
Then put a document in the new 'content_with_template' index like this; at this point it doesn't matter whether field3 is in quotes or not - as long as it parses to a number it will save:
PUT /content_with_template/type/1
{ "field1":"a1d" , "field2":"foo", "field3":1}
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html

How can I find the true score from Elasticsearch query string with a wildcard?

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq.Query("Micro*")))
.From(pageNumber)
.Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the Elasticsearch page on using covariants. The results are covariant; they are of mixed type, coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1, regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?
Multi-term queries like the wildcard query are given a constant score equal to their boost (1 by default). You can change this behaviour using .Rewrite().
var results = client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq
.Query("Micro*")
.Rewrite(RewriteMultiTerm.ScoringBoolean)
)
)
.From(pageNumber)
.Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each matching term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive, and there is a default limit of 1024 bool query clauses that can easily be hit for a large document corpus; running your query on the complete Stack Overflow data set (questions, answers and users), for example, hits the clause limit for questions. You may want instead to index the text with an analyzer that uses an edge_ngram token filter, so that prefixes are matched at index time.
Wildcard searches will always return a score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?

How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query function to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how I would accomplish the same feat. I can use _score in a script_function of a function_score query, but I can only use the _score of the main query. How can I incorporate the score of another query, and how can I multiply the scores together?
You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add them to your original query score.
Here's the full query gist.
Alternately, if whatever you've got in that bf is heavyweight enough that you'd rather not run it across the entire set of matches, you can use rescore requests to modify the score of the top N ranked original-query results in subsequent scoring passes with your (cat, dog, etc.) scoring queries.
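As a rough illustration only, here is a sketch of how the two script functions might be combined with the 1.x-era Java API; it assumes ScoreFunctionBuilders.scriptFunction plus the scoreMode/boostMode knobs, and the field name body is a placeholder:
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

String tfIdf = "tf = _index[field][term].tf();"
    + " idf = (1 + log(_index.numDocs() / (_index[field][term].df() + 1)));"
    + " return sqrt(tf) * pow(idf, 2)";

FunctionScoreQueryBuilder query = QueryBuilders
    .functionScoreQuery(QueryBuilders.matchQuery("body", "cat dog"))
    .add(ScoreFunctionBuilders.scriptFunction(tfIdf)
        .param("field", "body").param("term", "cat"))
    .add(ScoreFunctionBuilders.scriptFunction(tfIdf)
        .param("field", "body").param("term", "dog"))
    .scoreMode("multiply") // product of the two function scores
    .boostMode("sum");     // add that product to the original query score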
