Spring Data Elasticsearch repository "StartingWith" throws InvalidDataAccessApiUsageException when searching for a string with spaces

I have a repository method that does a "starts with" (prefix) query on the field userAccount.userName. When I search for a string without spaces, it returns the proper results. But when I search for a string that contains a space, it throws an exception.
My repository method:
public List<EsUser> findByUserAccountUserNameStartingWith(String term);
Search String: Tom Cruise
Exception:
org.springframework.dao.InvalidDataAccessApiUsageException: Cannot
constructQuery '*"Tom Cruise"'. Use expression or multiple clauses
instead.

Queries against Elasticsearch that use wildcards (e.g. *) must consist of a single token. By default, tokens are split on whitespace, so "Tom Cruise" is two tokens.
If you need to match multiple tokens, consider implementing a custom Spring Data ES repository method and building the query with the Elasticsearch QueryBuilders API. Something like this:
NativeSearchQueryBuilder nativeSearchQueryBuilder = new NativeSearchQueryBuilder();
// Inside a nested query, the field is addressed by its full path.
QueryBuilder matchPhraseQuery = QueryBuilders.matchPhrasePrefixQuery("userAccount.userName", "Tom Cruise");
// The nested wrapper only applies if userAccount is mapped as a nested type;
// otherwise use matchPhraseQuery directly.
QueryBuilder nestedQuery = QueryBuilders.nestedQuery("userAccount", matchPhraseQuery);
nativeSearchQueryBuilder.withQuery(nestedQuery);
NativeSearchQuery nativeSearchQuery = nativeSearchQueryBuilder.build();
// template is an autowired ElasticsearchTemplate
FacetedPage<EsUser> results = template.queryForPage(nativeSearchQuery, EsUser.class);
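As a rough illustration of the "single token" rule, here is a minimal plain-Java sketch (no Elasticsearch calls) that splits a search term on whitespace and reports whether it is a single token. The class and method names (PrefixTermCheck, whitespaceTokens, isSingleToken) are hypothetical and only model default whitespace splitting, not any particular analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PrefixTermCheck {

    // Splits a search term on runs of whitespace, ignoring leading/trailing blanks.
    static List<String> whitespaceTokens(String term) {
        String trimmed = term.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True when the term would be a single token under whitespace splitting.
    static boolean isSingleToken(String term) {
        return whitespaceTokens(term).size() == 1;
    }
}
```

A caller could use a check like this to send single-token terms to the derived StartingWith method and route terms such as "Tom Cruise" to the custom phrase-prefix query.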

Related

Elastic Search - Tokenization and Multi Match query

I need to perform tokenization and multi match in a single query in Elastic Search.
Currently, I first use the analyzer to get the tokens, like below:
String text = // 4 line log data;
List<AnalyzeToken> analyzeTokenList = new ArrayList<AnalyzeToken>();
AnalyzeRequestBuilder analyzeRequestBuilder = this.client.admin().indices().prepareAnalyze();
for (String newIndex : newIndexes) {
analyzeRequestBuilder.setIndex(newIndex);
analyzeRequestBuilder.setText(text);
analyzeRequestBuilder.setAnalyzer(analyzer);
AnalyzeResponse analyzeResponse = analyzeRequestBuilder.get();
analyzeTokenList.addAll(analyzeResponse.getTokens());
}
Then I iterate through the AnalyzeToken list and collect the terms:
List<String> tokens = new ArrayList<String>();
for (AnalyzeToken token : analyzeTokenList)
{
tokens.add(token.getTerm().replaceAll("\\s+", " "));
}
Then I use the tokens to build the multi-match query, like below:
// Join the terms with spaces so they remain separate tokens in the query text.
String query = String.join(" ", tokens);
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(query, "abstract", "title");
Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);
Based on the result, I am checking whether similar data exists in the database.
Is it possible to combine the analyze step and the multi-match query into a single query?
Any help is appreciated!
EDIT :
Problem statement: Say I have 90 entries in one index, and every 10 of those entries are near-identical (not exactly the same, but about a 70% match), so I have 9 groups.
I need to process only one entry in each group, so I took the following approach (not a good one, but it is what I have ended up with for now).
Approach:
1) Get each of the 90 entries from the index.
2) Tokenize it using the analyzer (this removes the unwanted keywords).
3) Search the same index to check whether similar data exists, filtering on the processed flag. This flag is updated after the first log has been processed.
4) If no similar entry (70% match) is flagged as processed, process this log and set its flag to processed.
5) If a similar entry is already flagged as processed, treat this one as already processed and continue with the next.
So the goal is to process only one entry in each group of 10 similar entries.
Thanks,
Harry
Multi-match queries internally use match queries, which are analyzed: they apply the analyzer defined in the field's mapping, or the standard analyzer if none is defined.
From the multi-match query doc:
The multi_match query builds on the match query to allow multi-field
queries:
Also, accepts analyzer, boost, operator, minimum_should_match,
fuzziness, lenient, as explained in match query.
So what you are trying to do is overkill. Even if you want different tokens at search time, you can set a search analyzer instead of creating the tokens yourself and then feeding them into a multi-match query.
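To see why pre-tokenizing is redundant, here is a small plain-Java sketch with a toy analyzer (lowercase, split on non-alphanumerics, drop a few stop words). It is not Elasticsearch's analyzer, and the class name and stop-word list are made up for the illustration. The point is only that analyzing text, joining the tokens, and analyzing again yields the same tokens that a single analysis pass would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {

    // Hypothetical stop-word list, only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Lowercases the text, splits on non-alphanumeric runs, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }
}
```

Because a match-style query runs the analyzer on the query text anyway, passing the raw text gives the same terms as passing the pre-analyzed, re-joined tokens, at least for an analyzer that behaves like this toy one.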

Hibernate Search with nGram | How to instruct nGram not to make grams during search time

I have defined my Analyzer as below
@AnalyzerDefs({
@AnalyzerDef(name = "ngram",
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
filters = {
//@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class),
@TokenFilterDef(factory = NGramFilterFactory.class, params = {
@Parameter(name = "minGramSize", value = "3"),
@Parameter(name = "maxGramSize", value = "255") }) }),
//-----------------------------------------------------------------------
@AnalyzerDef(name = "ngram_query",
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
filters = {
//@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class)
})
})
@Analyzer(definition = "ngram")
public class EPCAsset extends Asset {
@Field
private String obturatorMaterial;
}
It produces the n-gram term vectors perfectly at index time. But it also breaks the search query into n-grams at search time.
What I want is a way for the search query to use the n-gram index without the search term itself being broken into grams.
Note: I have to use n-grams here because the requirement is to match anywhere in the text, either at the start or in the middle, so edge n-grams are not an option for me.
Example:
Input data to be indexed: ICQ 234
At index time, its term vectors are:
"234"
" 23"
" 234"
"cq "
"cq 2"
"cq 23"
"cq 234"
"icq"
"icq "
"icq 2"
"icq 23"
"icq 234"
"q 2"
"q 23"
"q 234"
Now when I search for icq it works perfectly. But it also matches icqabc, because at search time the query is broken into n-grams as well. So is there a way to keep the search term intact at search time while still searching against the n-gram index?
Here is how I build my search query:
FullTextEntityManager fullTextEntityManager = Search
.getFullTextEntityManager(entityManager);
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
.forEntity(entityClass).get();
Query query = qb.phrase().onField("obturatorMaterial").sentence("icqabc").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query,
entityClass);
fullTextQuery.getResultList()
I am using Elasticsearch as the backend for Hibernate Search.
EDIT:
I also applied a query-time analyzer as per @yrodiere's answer, but it gives me an error.
QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
.forEntity(entityClass).overridesForField("obturatorMaterial","ngram_query").get();
org.hibernate.search.exception.SearchException: HSEARCH000353: Unknown analyzer: 'ngram_query'. Make sure you defined this analyzer.
EDIT
Following the linked question "overriderForField when using elasticsearch backed hibernate search", I am now able to define a second, query-time analyzer, and it solved the problem.
First, you should double check that an ngram filter really is what you want. I'm mentioning this because the ngram analyzer is generally used both at indexing and querying, so that it provides fuzzy matches. It's kind of the whole point of this analyzer.
Do you really need matches when the user types cq 2? Does it make sense? When implementing autocomplete, people generally prefer to match only documents containing words that start with the user input, so i would match, and ic and icq would too, but not cq 2. If this is what you want, you should have a look at the "edge_ngram" filter. It tends to improve the relevance of matches and also requires less disk space.
Now, even with the "edge_ngram" filter you will need to disable ngrams at query time. In Hibernate Search, this is done by "overriding" the analyzer.
First, define a second analyzer, identical to the one you use during indexing, but without the "ngram" or "edge_ngram" filter. Name it "ngram_query".
Then, use this to create your query builder:
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(EPCAsset.class)
.overridesForField( "obturatorMaterial", "ngram_query" )
.get();
Use the query builder to create your query as usual.
Note that, if you rely on Hibernate Search to push the index schema and analyzers to Elasticsearch, you will have to use a hack in order for the query-only analyzer to be pushed: by default only the analyzers that are actually used during indexing are pushed. See https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4
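To illustrate why the query-time analyzer matters, here is a plain-Java sketch (no Hibernate Search or Elasticsearch calls) that builds the lowercase n-grams of "ICQ 234" with sizes 3 to 255 and then looks up a query in two ways: analyzed with the same n-gram settings, or kept as one lowercased term. The class and method names are hypothetical, and "any gram matches" is a simplification of how a real query combines terms.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NGramDemo {

    // All lowercase substrings of text with length between min and max (inclusive).
    static Set<String> ngrams(String text, int min, int max) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int size = min; size <= Math.min(max, s.length()); size++) {
            for (int start = 0; start + size <= s.length(); start++) {
                grams.add(s.substring(start, start + size));
            }
        }
        return grams;
    }

    // Query analyzed with the ngram analyzer: matches if any query gram is indexed.
    static boolean matchesWithNgramQuery(Set<String> indexTerms, String query) {
        for (String gram : ngrams(query, 3, 255)) {
            if (indexTerms.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    // Query analyzed with keyword + lowercase only: the whole term must be indexed.
    static boolean matchesWithKeywordQuery(Set<String> indexTerms, String query) {
        return indexTerms.contains(query.toLowerCase(Locale.ROOT));
    }
}
```

With the index terms of "ICQ 234", the n-gram-analyzed query icqabc still matches through its gram icq, while the keyword-analyzed query icqabc does not, and icq or cq 2 still do.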
Either use a search-time analyzer, which would very likely be the keyword analyzer, or use a term query instead of a match query. A match query is analyzed, meaning it applies the same analyzer that was used at index time, whereas a term query is not analyzed.
Read more about the term query and the match query for more information.
Edit: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html clearly explains the use of search_analyzer with an edgeNGram tokenizer for autocomplete search, which is exactly your use case.

How to create FiltersAggregation query using JEST builders?

How can I create FiltersAggregation query using JEST AggregationBuilders or similar? I looked at FiltersAggregationIntegrationTest but query part is defined directly by JSON and I need something more like AggregationBuilders (as I'm using this for standard term aggregation for example)
Link to FiltersAggregationIntegrationTest:
https://github.com/searchbox-io/Jest/blob/master/jest/src/test/java/io/searchbox/core/search/aggregation/FiltersAggregationIntegrationTest.java
Here is one possibility:
FilterAggregationBuilder testFilter = AggregationBuilders.filter("test");
testFilter.filter(FilterBuilders.typeFilter("typeName"));
new SearchSourceBuilder().aggregation(testFilter);
This filters by type, but FilterBuilders has a termFilter too.
Here is a solution that may work. I've cross-posted this to the Jest GitHub issue as well.
QueryBuilder filterTermsQuery = QueryBuilders.termsQuery("fieldName", "value1", "value2", "value3");
SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.searchSource()
.query(boolQueryBuilder) // boolQueryBuilder is your main query, built elsewhere
.size(0)
.aggregation(
AggregationBuilders
.filter("filterAggName") // returns FilterAggregationBuilder
.filter(filterTermsQuery));
The gist is that you create a search source builder to use in the Jest client and supply it with an aggregation (which can also include sub-aggregations, chained onto the builder returned by AggregationBuilders). Then you define an aggregation of the filter type available in AggregationBuilders. This returns a FilterAggregationBuilder, to which you can pass any QueryBuilder as the filter. According to the documentation, the .filter(termsQuery) call means only the documents that match the filter fall into this aggregation's bucket.
Hopefully this resolves your issue, unless I misunderstood your use case.

NotQueryBuilder elasticsearch 2.4 execution modes

What is the alternative to .execution("and") in Elasticsearch 2.4, and what exactly is it used for?
NotFilterBuilder excVariantsFilter = FilterBuilders.notFilter(FilterBuilders.termsFilter("products", productIds.toArray()).execution("and"));
Filters and queries have been merged in ES 2.0 and the execution mode was only useful in a filter context, so there's no need anymore for that execution parameter in terms queries.
So if you want an equivalent behavior to this
NotFilterBuilder excVariantsFilter = FilterBuilders.notFilter(FilterBuilders.termsFilter("products", productIds.toArray()).execution("and"));
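you can now write it like this:
// Inner bool: matches documents that contain every product id (the old "and" execution).
BoolQueryBuilder allProducts = QueryBuilders.boolQuery();
for (Object productId : productIds) {
allProducts.must(QueryBuilders.termQuery("products", productId));
}
// Outer bool: excludes the documents matched by the inner bool.
BoolQueryBuilder excVariantsFilter = QueryBuilders.boolQuery().mustNot(allProducts);
It produces a bool/must_not query wrapping a bool/must with one term query per productId, which matches the behavior of the previous not filter containing a terms query with the and execution mode. Adding each term query directly to mustNot would instead exclude documents that contain any one of the ids.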
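When porting a not filter, it can help to check the boolean semantics on plain sets first. The sketch below uses no Elasticsearch classes, and its class and method names are hypothetical. It contrasts two readings: excluding documents that contain all of the ids (a not around an "and" of terms) and excluding documents that contain any of the ids (one must_not clause per term).

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

class MustNotSemantics {

    // NOT (id1 AND id2 AND ...): the document is excluded only if it has every id.
    static boolean excludedIfContainsAll(Set<String> docProducts, List<String> ids) {
        return docProducts.containsAll(ids);
    }

    // (NOT id1) AND (NOT id2) ...: the document is excluded if it has at least one id.
    static boolean excludedIfContainsAny(Set<String> docProducts, List<String> ids) {
        return !Collections.disjoint(docProducts, ids);
    }
}
```

For a document whose products are just p1 and a filter on p1 and p2, the first reading keeps the document and the second excludes it, so the two only agree when there is a single id.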

Fuzzify existing ElasticSearch Java API query

I have an existing ElasticSearch query that uses the Java API:
BoolQueryBuilder queryBuilder =
boolQuery().should(queryStringQuery(theUsersQueryString));
SearchResponse response = client.prepareSearch(...).setQuery(queryBuilder).get();
Now I want to add fuzziness to this, to allow minor misspellings to still return something to the user. My guess was that adding fuzziness parameters to the QueryBuilders object would be fruitful:
boolQuery().should(queryStringQuery(theUsersQueryString)
.fuzziness(Fuzziness.ONE)
.fuzzyMaxExpansions(4)
.fuzzyPrefixLength(2));
Unfortunately this doesn't seem to work and I have so far been unable to find good documentation for this. For example, I have the string John Deere in my database. If I use the query string deere I get a match, but not if I use query strings Deeree or Deeer.
My question is: how should I correctly fuzzify my query?
I opted to create a new query rather than modifying my existing one.
MultiMatchQueryBuilder fuzzyMmQueryBuilder = multiMatchQuery(
theUsersQueryString, "field1", "field2", ... , "fieldn").fuzziness("AUTO");
BoolQueryBuilder b = boolQuery().should(fuzzyMmQueryBuilder);
SearchRequestBuilder srb = client.prepareSearch(...).setQuery(b)...
SearchResponse res = srb.execute().actionGet();
This query exhibits fuzzy behaviour.
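For intuition about what fuzziness "AUTO" allows, here is a simplified plain-Java model, not the Lucene implementation. It computes an edit distance in which a swap of two adjacent characters counts as one edit, and applies the length thresholds documented for AUTO (terms of 0 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two). The class and method names are hypothetical, and the inputs are assumed to be already lowercased by the analyzer.

```java
class FuzzinessSketch {

    // Edit distance with insert, delete, substitute, and adjacent transposition
    // (optimal string alignment variant).
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Maximum edits allowed for a query term under the AUTO length thresholds.
    static int autoMaxEdits(String term) {
        int n = term.length();
        if (n <= 2) {
            return 0;
        }
        return n <= 5 ? 1 : 2;
    }

    // True when the indexed term is within the allowed distance of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return editDistance(queryTerm, indexedTerm) <= autoMaxEdits(queryTerm);
    }
}
```

Under this model both misspellings from the question, deeree and deeer, are one edit away from deere, so they fall within what AUTO permits.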

Resources