Spring Data Elasticsearch with special characters

As part of our project we are using Spring Data on top of Elasticsearch.
We found an interesting issue with findBy queries. If we pass a string that contains a space, the query does not find the right element unless we wrap the string in quotes. For example, for getByName(String name) we have to call getByName("\"John Do\"").
Is there any way to avoid this redundant quoting?

I'm taking my first steps with Spring (Boot Starter) Data ES and stumbled upon the same issue, only in my case it was a colon (:) that messed things up. I've learned that it is one of the reserved characters of the query_string syntax (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters). The quoting that you mention is exactly the workaround I use for now. It results in a query like this:
{
  "from": 0,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "\"John Do\"",
          "fields": ["name"]
        }
      }
    }
  }
}
(You can use this in a rest console or in ElasticHQ to check the result.)
A colleague suggested that switching to a 'term' query:
{
  "from": 0,
  "size": 100,
  "query": {
    "term": {
      "name": "John Do"
    }
  }
}
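might help to avoid the quoting. I have tried this out using the @Query annotation on the findByName method in the repository. It would go something like this:
@Query(value = "{\"term\" : {\"name\" : \"?0\"}}")
List<Person> findByName(String name);
Keep in mind that a term query does not analyze its input, so it only matches when the name field is indexed as an exact, non-analyzed value.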
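If quoting every argument by hand gets tedious, the reserved-character handling can be moved into a small helper. Below is a minimal Python sketch of the idea, not something Spring Data provides: the names escape_query_string and as_phrase are made up for illustration, and the character list follows the reserved-characters section linked above, which also says that < and > cannot be escaped and have to be removed.

```python
import re

# Operators of the query_string syntax that can be neutralised with a backslash.
# The two-character operators && and || come first so each is escaped only once.
_RESERVED = re.compile(r'(&&|\|\||[+\-=!(){}\[\]^"~*?:\\/])')


def escape_query_string(text: str) -> str:
    """Backslash-escape reserved characters; drop < and >, which cannot be escaped."""
    without_ranges = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", without_ranges)


def as_phrase(text: str) -> str:
    """Wrap text in double quotes so it is searched as a single phrase."""
    inner = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{inner}"'


print(escape_query_string("host:8080"))  # host\:8080
print(as_phrase("John Do"))              # "John Do"
```

The same two steps could be applied inside a repository wrapper before the value is handed to the finder method, so callers never see the quoting.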

Related

Elastic search wildcard query crashes cluster

I run the query below on a large Elasticsearch cluster. The cluster becomes unresponsive.
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": {
              "value": ".*exception.*"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "beat.hostname": "ip-xxx-xx-xx-xx"
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "lt": 1518459660000,
              "format": "epoch_millis",
              "gte": 1518459600000
            }
          }
        }
      ]
    }
  }
}
When I remove the wildcard pattern .*exception.* and replace it with any non-wildcard string like xyz, the query returns fast. Although the query uses a wildcard expression, it also restricts the search to a small time range and a specific host, so I would expect it to be a very simple query. Is there a reason why the Elasticsearch server can't handle it? The cluster has 10 nodes and 20 TB of data.
See the documentation for the Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow
The ideal fix would be to change the text analysis on the message field to use a WordDelimiterTokenFilter with split_on_case_change set to true. Then something like NullPointerException gets indexed as three separate tokens [Null, Pointer, Exception]. This lets you search for exception without using a regex. The caveat is that you need to reindex all your documents.
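To make the analysis change more concrete, here is a rough Python sketch of what such index settings might look like, together with a local approximation of the case-change split. The names camel_split and message_analyzer are placeholders I made up, the choice of the whitespace tokenizer is an assumption, and approx_case_split only imitates the idea; it is not the real token filter.

```python
import re

# Hypothetical analysis settings: "camel_split" and "message_analyzer" are
# illustrative names, and the whitespace tokenizer is an assumption.
INDEX_SETTINGS = {
    "analysis": {
        "filter": {
            "camel_split": {
                "type": "word_delimiter",
                "split_on_case_change": True,
            }
        },
        "analyzer": {
            "message_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["camel_split", "lowercase"],
            }
        },
    }
}


def approx_case_split(token: str) -> list:
    """Rough local approximation of splitting a token on case changes."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)


print(approx_case_split("NullPointerException"))  # ['Null', 'Pointer', 'Exception']
```

With tokens like these in the index, a plain match on exception would be enough and the regexp could go away.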
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.
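As a sketch of the filter-context suggestion, the query from the question could be restructured roughly like this. The function name build_exception_query is made up, the field names are copied from the question (assuming the timestamp field is the usual @timestamp), and whether this is fast enough on a 20 TB cluster is something you would have to measure.

```python
import json


def build_exception_query(hostname: str, start_ms: int, end_ms: int, pattern: str) -> dict:
    """Put the cheap exact conditions in filter context and keep the regexp in must."""
    return {
        "size": 10000,
        "query": {
            "bool": {
                # Filter clauses do not contribute to scoring and can be cached.
                "filter": [
                    {"term": {"beat.hostname": hostname}},
                    {
                        "range": {
                            "@timestamp": {
                                "gte": start_ms,
                                "lt": end_ms,
                                "format": "epoch_millis",
                            }
                        }
                    },
                ],
                "must": [{"regexp": {"message": {"value": pattern}}}],
            }
        },
    }


query = build_exception_query("ip-xxx-xx-xx-xx", 1518459600000, 1518459660000, ".*exception.*")
print(json.dumps(query, indent=2))
```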

How do I not match a bare hyphen in Elasticsearch?

I am querying apache logs stored in Elasticsearch. I want to return log entries from a given hostname that has a hyphen and with a populated auth field.
These strings should be an exact match: "hostname": "example-dev" and not "auth": "-".
My questions are:
How do I correctly remap a type in Elasticsearch to allow a hyphen to be part of the matched string?
How do I correctly query a type in Elasticsearch with a bare hyphen?
The hyphen is a reserved character in Elasticsearch, so I understand it takes special effort. However, I'm having what seems like a lot of trouble figuring out how to include it in my query.
I have tried to remap the type to be not_analyzed. It looks like the format has recently changed. The old way of defining the index ("analyzed", "not_analyzed", and "no") makes sense to me. The new way (true or false) does not. In either case, I cannot seem to get remapping to work.
Here is my attempt at remapping:
DELETE /search
PUT search
{
  "mappings": {
    "beat": {
      "properties": {
        "hostname": {
          "type": "text",
          "norms": false,
          "index": false
        }
      }
    }
  }
}
I have not included the remapping of the auth field because it only returns a mapper_parsing_exception.
I am using json to query Elasticsearch. Here is my query:
GET _search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "match": {
                "beat.hostname": "example-dev"
              }
            }
          ],
          "must_not": [
            {
              "match": {
                "auth.keyword": "-"
              }
            }
          ]
        }
      }
    }
  }
}
I have tried escaping the hyphen with \\- but that returns results that match "auth": "-". The hostname still does not match exactly. The hostname query also matches something like "example-prod".
I have tried using "term" rather than "match"; that returns no results.
I can match a specific string for "auth", for example "must": { "match": { "auth": "foo" } } returns all entries for auth = "foo". That is opposite of what I need, but it does work. The hostname is still not exactly matched if it includes a hyphen.
The log entries are parsed into Elasticsearch using the ELK stack; however, this will be a report that is generated outside of Kibana for legacy reasons.
I have read the documentation and examples, but there is a lot to dig through. Many of the examples I have found are for older versions of Elasticsearch, which is understandable, but confusing.
I am new to Elasticsearch. It feels like I am just overlooking something, but the problem might stem from a basic misunderstanding of how Elasticsearch does things.
After spending some more time with Elasticsearch queries, I think I have it figured out.
Splitting the hostname string into two separate strings and matching on both filters the hostname as expected. Using an empty string for the negative match also seems to work as expected.
Here is the updated query:
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "match": {
                "beat.hostname": "example"
              }
            },
            {
              "match": {
                "beat.hostname": "dev"
              }
            }
          ],
          "must_not": [
            {
              "match_phrase": {
                "auth.keyword": ""
              }
            }
          ]
        }
      }
    }
  }
}
I will do a bit more testing to make sure this is actually returning what I need.
I was trying too hard to make Elasticsearch fit what I expected. Instead of working with Elasticsearch, I was trying to fight against it.
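For comparison, the remapping route that the question originally asked about could look roughly like the Python sketch below. It is an assumption-laden illustration, not the poster's solution: it assumes Elasticsearch 5.x or later, where exact-value strings use the keyword type, and that auth.keyword exists as a keyword sub-field (as the original query suggests). The name hosts_with_auth is made up. A term query does not analyze its input, so the hyphen needs no escaping there.

```python
import json

# Assumed mapping fragment for the hostname field: a keyword field is stored
# verbatim, so "example-dev" stays one term instead of being split at the hyphen.
HOSTNAME_MAPPING = {
    "properties": {
        "hostname": {"type": "keyword"},
    }
}


def hosts_with_auth(hostname: str) -> dict:
    """Exact hostname match, excluding entries whose auth is a bare hyphen."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"beat.hostname": hostname}}],
                "must_not": [{"term": {"auth.keyword": "-"}}],
            }
        }
    }


print(json.dumps(hosts_with_auth("example-dev"), indent=2))
```

With this shape, "example-prod" should no longer match, because the whole hostname is compared as a single term.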

Scope Elasticsearch Results to Specific Ids

I have a question about the Elasticsearch DSL.
I would like to do a full text search, but scope the searchable records to a specific array of database ids.
In SQL world, it would be the functional equivalent of WHERE id IN(1, 2, 3, 4).
I've been researching, but I find the Elasticsearch query DSL documentation a little cryptic and devoid of useful examples. Can anyone point me in the right direction?
Here is an example query which might work for you. This assumes that the _all field is enabled on your index (which is the default). It will do a full text search across all the fields in your index. Additionally, with the added ids filter, the query will exclude any document whose id is not in the given array.
{
  "bool": {
    "must": {
      "match": {
        "_all": "your search text"
      }
    },
    "filter": {
      "ids": {
        "values": ["1", "2", "3", "4"]
      }
    }
  }
}
Hope this helps!
As discussed by Ali Beyad, the ids clause in the query can do that for you. To complement his answer, here is a working example in case anyone needs it in the future.
GET index_name/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field": "your query"
          }
        },
        {
          "ids": {
            "values": ["0aRM6ngBFlDmSSLpu_J4", "0qRM6ngBFlDmSSLpu_J4"]
          }
        }
      ]
    }
  }
}
You can create a bool query that contains an Ids query in a MUST clause:
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-ids-query.html
By using a MUST clause in a bool query, your search will be further limited by the Ids you specify. I'm assuming here by Ids you mean the _id value for your documents.
According to the Elasticsearch docs, the ids query returns documents based on their IDs:
GET /_search
{
  "query": {
    "ids": {
      "values": ["1", "4", "100"]
    }
  }
}
With ElasticaBundle on Symfony 5.2:
$query = new Query();
$IdsQuery = new Query\Ids();
$IdsQuery->setIds($id);
$query->setQuery($IdsQuery);
$this->finder->find($query, $limit);
You have two options.
The ids query:
GET index/_search
{
  "query": {
    "ids": {
      "values": ["1", "2", "3"]
    }
  }
}
or
The terms query:
GET index/_search
{
  "query": {
    "terms": {
      "yourNonPrimaryIdField": ["1", "2", "3"]
    }
  }
}
The ids query targets the document's internal _id field (the primary ID). But documents often contain secondary IDs in other fields, which you'd target through the terms query.
Note that if your secondary IDs contain uppercase characters and you don't map their field as keyword, they'll be analyzed (and lowercased), and the terms query will appear broken because it only works with exact matches. More on this here: Only getting results when elasticsearch is case sensitive
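Putting the pieces of this thread together, the "full text search scoped to a list of IDs" request can be sketched as a small Python helper that builds the request body. The function name scoped_search and the default field name are placeholders; the ids clause sits in filter context here so it restricts the hits without affecting the score.

```python
import json


def scoped_search(text: str, ids: list, field: str = "field") -> dict:
    """Full-text match on one field, restricted to the given document _ids."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                # _id values are strings, so database ids are converted first.
                "filter": [{"ids": {"values": [str(i) for i in ids]}}],
            }
        }
    }


print(json.dumps(scoped_search("your search text", [1, 2, 3, 4]), indent=2))
```

If the database ids live in a regular field rather than in _id, the ids clause would be swapped for a terms clause on that field, as described above.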

Filter items which array contains any of given values

I have a set of documents like
{
  "tags": ["a", "b", "c"]
  // ... a bunch of other properties
}
As stated in the title: is there a way to filter all documents containing any of the given tags using NEST?
For instance, the record above would match ['c','d'].
Or should I build multiple "OR"s manually?
elasticsearch 2.0.1:
There's also the terms query, which should save you some work. Here is an example from the docs:
{
  "terms": {
    "tags": ["blue", "pill"],
    "minimum_should_match": 1
  }
}
Under the hood it constructs a boolean should. So it's basically the same thing as above, but shorter.
There's also a corresponding terms filter.
So to summarize your query could look like this:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "terms": {
        "tags": ["c", "d"]
      }
    }
  }
}
With a greater number of tags this could make quite a difference in length.
Edit: The bitset stuff below is maybe an interesting read, but the answer itself is a bit dated. Some of this functionality is changing around in 2.x. Also Slawek points out in another answer that the terms query is an easy way to DRY up the search in this case. Refactored at the end for current best practices. —nz
You'll probably want a Bool Query (or more likely a Bool Filter alongside another query) with a should clause.
The bool query has three main properties: must, should, and must_not. Each of these accepts another query, or an array of queries. The clause names are fairly self-explanatory; in your case, the should clause may specify a list of filters, a match against any one of which will return the document you're looking for.
From the docs:
In a boolean query with no must clauses, one or more should clauses must match a document. The minimum number of should clauses to match can be set using the minimum_should_match parameter.
Here's an example of what that Bool query might look like in isolation:
{
  "bool": {
    "should": [
      { "term": { "tag": "c" }},
      { "term": { "tag": "d" }}
    ]
  }
}
And here's another example of that Bool query as a filter within a more general-purpose Filtered Query:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "bool": {
        "should": [
          { "term": { "tag": "c" }},
          { "term": { "tag": "d" }}
        ]
      }
    }
  }
}
Whether you use Bool as a query (e.g., to influence the score of matches), or as a filter (e.g., to reduce the hits that are then being scored or post-filtered) is subjective, depending on your requirements.
It is generally preferable to use Bool in favor of an Or Filter, unless you have a reason to use And/Or/Not (such reasons do exist). The Elasticsearch blog has more information about the different implementations of each, and good examples of when you might prefer Bool over And/Or/Not, and vice-versa.
Elasticsearch blog: All About Elasticsearch Filter Bitsets
Update with a refactored query...
Now, with all of that out of the way, the terms query is a DRYer version of all of the above. It does the right thing with respect to the type of query under the hood, it behaves the same as the bool + should using the minimum_should_match options, and overall is a bit more terse.
Here's that last query refactored a bit:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "terms": {
        "tag": ["c", "d"],
        "minimum_should_match": 1
      }
    }
  }
}
Whilst this is an old question, I ran into this problem myself recently and some of the answers here are now deprecated (as the comments point out). So for the benefit of others who may have stumbled here:
A term query can be used to find the exact term specified in the inverted index:
{
  "query": {
    "term": { "tags": "a" }
  }
}
From the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
Alternatively you can use a terms query, which will match all documents with any of the items specified in the given array:
{
  "query": {
    "terms": { "tags": ["a", "c"] }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
One gotcha to be aware of (which caught me out): how you map the field also makes a difference. If the field you're searching has been indexed as a text type, then Elasticsearch performs a full-text search (i.e. using an analyzed string).
If you've indexed the field as a keyword, then a keyword search using a non-analyzed string is performed. This can have a massive practical impact, as analyzed strings are pre-processed (lowercased, punctuation dropped, etc.). See https://www.elastic.co/guide/en/elasticsearch/guide/master/term-vs-full-text.html
To avoid these issues, the string field has been split into two new types: text, which should be used for full-text search, and keyword, which should be used for keyword search (https://www.elastic.co/blog/strings-are-dead-long-live-strings).
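The text versus keyword gotcha is easier to see with a toy model. The Python sketch below is only a rough stand-in for what happens at index time: analyze_text lowercases and splits on non-alphanumeric characters, which approximates but does not reproduce the standard analyzer, and all function names are made up for illustration.

```python
import re


def analyze_text(value: str) -> list:
    """Very rough stand-in for text analysis: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def index_terms(value: str, field_type: str) -> list:
    """Terms stored for a value: analyzed for 'text', verbatim for 'keyword'."""
    return analyze_text(value) if field_type == "text" else [value]


def term_matches(query_value: str, stored_value: str, field_type: str) -> bool:
    """A term query compares the query value to the stored terms exactly."""
    return query_value in index_terms(stored_value, field_type)


print(term_matches("Quick-Fox", "Quick-Fox", "keyword"))  # True
print(term_matches("Quick-Fox", "Quick-Fox", "text"))     # False
print(term_matches("quick", "Quick-Fox", "text"))         # True
```

In this model the exact value only survives in the keyword field, which is why term and terms queries tend to look broken on text fields.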
For those looking at this in 2020: you may notice that the accepted answer is deprecated, but a similar approach is available using terms_set combined with minimum_should_match_script.
Please see the detailed answer here in the SO thread
You should use Terms Query
{
  "query": {
    "terms": {
      "tags": ["c", "d"]
    }
  }
}
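For anyone who wants to build this from client code, here is a small Python sketch that produces the terms request body and mirrors the "any of the given tags" semantics locally. Both function names are illustrative, and matches_any is only a local model of the behavior, assuming tags is indexed as exact values.

```python
def terms_query(field: str, values: list) -> dict:
    """Request body matching documents whose field contains any of the values."""
    return {"query": {"terms": {field: list(values)}}}


def matches_any(doc_values: list, wanted: list) -> bool:
    """Local model of the terms semantics: true if the two lists share a value."""
    return not set(doc_values).isdisjoint(wanted)


print(terms_query("tags", ["c", "d"]))
print(matches_any(["a", "b", "c"], ["c", "d"]))  # True
print(matches_any(["a", "b"], ["c", "d"]))       # False
```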

elasticsearch - confused on how to searching items that a field contains string

This query correctly returns only one item, "steve_jobs".
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "steve_jobs"
        }
      }
    }
  }
}
So, now I want to get all people with name prefix steve_. So I try this:
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "steve_"
        }
      }
    }
  }
}
This is returning nothing. Why?
I'm confused about when to use term query / term filter / terms filter / querystring query.
What you need is a Prefix Query.
If you are indexing your document like so:
POST /testing_nested_query/class/
{
  "name": "my name is steve_jobs"
}
And you are using the default analyzer, then the problem is that steve_jobs will be indexed as one term. So your Term Query will never be able to find any docs matching the term steve, as there is no such term in the index. A Prefix Query solves your problem by searching for a prefix in all the indexed terms.
You can solve the same problem by making your custom analyzers (read this and this) so that steve_jobs is stored as steve and jobs.
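To illustrate the difference, here is a short Python sketch that builds a prefix version of the query from the question and models, on a plain list of terms, why term finds nothing for steve_ while prefix does. The function names and the sample terms other than steve_jobs are invented for the example.

```python
def prefix_query(field: str, prefix: str) -> dict:
    """Same constant_score wrapper as in the question, with prefix instead of term."""
    return {"query": {"constant_score": {"filter": {"prefix": {field: prefix}}}}}


def term_hits(indexed_terms: list, value: str) -> list:
    """Local model of a term lookup: only exact equality counts."""
    return [t for t in indexed_terms if t == value]


def prefix_hits(indexed_terms: list, prefix: str) -> list:
    """Local model of a prefix lookup: every term starting with the prefix."""
    return [t for t in indexed_terms if t.startswith(prefix)]


terms = ["steve_jobs", "steve_wozniak", "bill_gates"]  # invented sample terms
print(term_hits(terms, "steve_"))    # []
print(prefix_hits(terms, "steve_"))  # ['steve_jobs', 'steve_wozniak']
```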

Resources