Elasticsearch case-insensitive query with prefix query

I am new to Elasticsearch. I have the below query:
GET deals2/_search
{
  "size": 200,
  "_source": ["acquireInfo"],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": ["acquireInfo.company_name.keyword"],
            "query": "az*"
          }
        }
      ]
    }
  }
}
Here I want Elasticsearch to match case-insensitively, i.e. return strings starting with any of:
"Az"
"AZ"
"az"
"aZ"
But I am not getting all of these results. Can anyone please help me with that?
Example: I have 4 documents:
1) Aziia Avto Ust-Kamenogorsk OOO
2) AZ Infotech Inc
3) AZURE Midstream Partners LP
4) State Oil Fund of the Republic of Azerbaijan
Searching for az should return only the first 3 docs, as they start with az (ignoring case), and not the 4th one, which also contains az but not at the beginning.

This is happening because you are querying the keyword field of company_name.
The keyword analyzer is a "noop" analyzer that returns the entire input string as a single token, with its original case preserved. For example, foo, Foo, and fOo are stored as three distinct tokens, and searching for foo will only match foo, because Elasticsearch ultimately works on token matches, which are case-sensitive.
What you need is the standard analyzer (or some other custom analyzer that also covers your other use cases) with a lowercase token filter on the field, together with a match query. A match query is analyzed with the same analyzer that was used to index the field, so your search query generates the same tokens that are stored in the index, and your search becomes case-insensitive.
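As a toy illustration of that token-matching point (plain Python with made-up helper names; real analysis happens inside Lucene, this is only a sketch):

```python
def keyword_analyze(text):
    # keyword analyzer: whole input becomes one token, case preserved
    return [text]

def standard_lowercase_analyze(text):
    # standard analyzer (simplified): split on whitespace, lowercase each token
    return [t.lower() for t in text.split()]

def match(indexed_tokens, query_tokens):
    # a match succeeds only on exact token equality
    return any(q in indexed_tokens for q in query_tokens)

doc = "AZ Infotech Inc"

# keyword field: "az" does not equal the single token "AZ Infotech Inc"
print(match(keyword_analyze(doc), ["az"]))             # False
# analyzed + lowercased field: query token "az" equals indexed token "az"
print(match(standard_lowercase_analyze(doc), ["az"]))  # True
```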
Edit: After a discussion with the user in chat, I am updating the answer to suit the requirements below.
Step 1: Define settings and mappings for the index.
Endpoint :- http://{{hostname}}:{{port}}/{{index}}
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "company_name": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
Step 2: Index all the documents
Endpoint: http://{{hostname}}:{{port}}/{{index}}/_doc/ --> 1,2,3,4 etc
{
  "company_name": "State Oil Fund of the Republic of Azerbaijan"
}
Step 3: Search query
Endpoint: http://{{hostname}}:{{port}}/{{index}}/_search
{
  "query": {
    "prefix": { "company_name": "az" }
  }
}
This brings back the expected results:
{
  "took": 870,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "prerfixsearch",
        "_type": "_doc",
        "_id": "2ec9df0fc-dc04-47bb-914f-91a9f20d09efd15f2506-293f-4fb2-bdc3-925684a930b5",
        "_score": 1,
        "_source": {
          "company_name": "AZ Infotech Inc"
        }
      },
      {
        "_index": "prerfixsearch",
        "_type": "_doc",
        "_id": "160d01183-a308-4408-8ac1-a85da950f285edefaca2-0b68-41c6-ba34-21bbef57f84f",
        "_score": 1,
        "_source": {
          "company_name": "Aziia Avto Ust-Kamenogorsk OOO"
        }
      },
      {
        "_index": "prerfixsearch",
        "_type": "_doc",
        "_id": "1da878175-7db5-4332-baa7-ac47bd39b646f81c1770-7ae1-4536-baed-0a4f6b20fa38",
        "_score": 1,
        "_source": {
          "company_name": "AZURE Midstream Partners LP"
        }
      }
    ]
  }
}
Explanation: As the OP didn't initially mention excluding the 4th doc from the search results, I first suggested creating a text field so that individual tokens are generated. Now that the requirement is only a prefix search, we don't need the individual tokens; we want a single token, but lowercased to support case-insensitive search. That is why I applied the custom normalizer to the company_name field.
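The normalizer-plus-prefix behaviour can be sketched in plain Python (illustrative only; the function name is made up, not the ES API):

```python
def normalize(value):
    # custom normalizer: the whole value stays one token, lowercased
    return value.lower()

companies = [
    "Aziia Avto Ust-Kamenogorsk OOO",
    "AZ Infotech Inc",
    "AZURE Midstream Partners LP",
    "State Oil Fund of the Republic of Azerbaijan",
]

# the prefix query runs against the normalized (lowercased) single token
hits = [c for c in companies if normalize(c).startswith("az")]
print(hits)  # first three companies; the 4th has "az" only mid-string
```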

Related

Query on Elasticsearch with multiple criteria

I have this document in Elasticsearch:
{
  "_index" : "master",
  "_type" : "_doc",
  "_id" : "q9IGdXABeXa7ITflapkV",
  "_score" : 0.0,
  "_source" : {
    "customer_acct" : "64876457056",
    "ssn_number" : "123456789",
    "name" : "Julie",
    "city" : "NY"
  }
}
I want to query the master index with customer_acct and ssn_number to retrieve the entire document, and I want to disable scoring and relevance. I have used the below query:
curl -X GET "localhost:9200/master/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "customer_acct": {
        "value": "64876457056"
      }
    }
  }
}'
How would I include the second criterion, ssn_number, in the term query as well? Also, is it possible to turn off scoring and relevance? I am new to Elasticsearch.
First, you need to define a proper mapping for your index. Your customer_acct and ssn_number are numeric, but you are storing them as strings; looking at your sample, you should use long to store them. Then you can just use a filter context in your query, since you don't need score and relevance in your results. Read more about filter context in the official ES docs, as well as the below snippet from that link.
In a filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated. Filter context is mostly used for
filtering structured data,
which is exactly your use-case.
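Filter-context semantics, i.e. strict yes/no clauses with no score computed, can be sketched like this (toy Python model, not the ES engine):

```python
docs = [
    {"name": "Smithe John", "city": "SF",
     "customer_acct": 64876457065, "ssn_number": 123456790},
    {"name": "Julie", "city": "NY",
     "customer_acct": 64876457056, "ssn_number": 123456789},
]

filters = {"customer_acct": 64876457056, "ssn_number": 123456789}

# bool/filter: keep a doc only if every term clause matches exactly;
# no relevance score is calculated for any hit
hits = [d for d in docs if all(d.get(f) == v for f, v in filters.items())]
print([h["name"] for h in hits])  # ['Julie']
```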
1. Index Mapping
{
  "mappings": {
    "properties": {
      "customer_acct": {
        "type": "long"
      },
      "ssn_number": {
        "type": "long"
      },
      "name": {
        "type": "text"
      },
      "city": {
        "type": "text"
      }
    }
  }
}
2. Index sample docs
{
  "name": "Smithe John",
  "city": "SF",
  "customer_acct": 64876457065,
  "ssn_number": 123456790
}
{
  "name": "Julie",
  "city": "NY",
  "customer_acct": 64876457056,
  "ssn_number": 123456789
}
3. Main search query to filter without the score
{
  "query": {
    "bool": {
      "filter": [            --> only filter clause
        {
          "term": {
            "customer_acct": 64876457056
          }
        },
        {
          "term": {
            "ssn_number": 123456789
          }
        }
      ]
    }
  }
}
The above search query gives the below result:
{
  "took": 186,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": "so-master",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.0,       --> notice score is 0
        "_source": {
          "name": "Julie",
          "city": "NY",
          "customer_acct": 64876457056,
          "ssn_number": 123456789
        }
      }
    ]
  }
}

How to store what is generated by the analyser?

Let's say that I use this mapping:
PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "testtype": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "english",
          "store": true
        }
      }
    }
  }
}
Now I can index a document:
PUT test/testtype/0
{
  "content": "The Quick Brown Fox"
}
And I can retrieve it:
GET test/testtype/0
Which will return me:
{
  "_index": "test",
  "_type": "testtype",
  "_id": "0",
  "_version": 1,
  "found": true,
  "_source": {
    "content": "The Quick Brown Fox"
  }
}
I know that the _source field is supposed to contain only the document as I inserted it, which is why I specified in my mapping that I want to store my content field. So by querying the stored field, I would expect to get what is generated by the analyser, something like this:
"quick brown fox"
But when I query the stored field:
GET test/testtype/_search
{
  "stored_fields": "content"
}
I have exactly what I wrote in my document:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "testtype",
        "_id": "0",
        "_score": 1,
        "fields": {
          "content": [
            "The Quick Brown Fox"
          ]
        }
      }
    ]
  }
}
So my question is: how can I store in Elasticsearch the result of what is generated by my analyser?
Your question is about the difference between the stored text and the generated tokens.
A stored field (the store attribute of a Lucene field) contains exactly the same as the corresponding field in the _source JSON.
The generated tokens are in a Lucene-internal representation, but you can use the _analyze or _termvectors endpoints to see the tokens, or you can use a terms aggregation.
You can set an index-time or query-time analyzer, but note that the stored field always keeps the original text; analysis only affects the indexed tokens.
More details: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html
Another way is using multi-fields. That means you keep the original and the processed text as well.
More details: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
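The multi-field idea can be sketched as follows: keep the original value verbatim and derive the analyzed tokens next to it. The analyzer below is a crude stand-in for "english" (lowercase plus stop-word removal, no stemming), and all names are illustrative:

```python
STOPWORDS = {"the", "a", "an"}

def crude_english_analyze(text):
    # simplified "english" analyzer: lowercase tokens, drop stop words
    return [t for t in (w.lower() for w in text.split()) if t not in STOPWORDS]

source = "The Quick Brown Fox"

indexed = {
    "content": source,                                   # original, as in _source
    "content.analyzed": crude_english_analyze(source),   # what the analyzer emits
}
print(indexed["content.analyzed"])  # ['quick', 'brown', 'fox']
```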

Elasticsearch aggregation turns results to lowercase

I've been playing with ElasticSearch a little and found an issue when doing aggregations.
I have two endpoints, /A and /B. In the first one I have parents for the second one. So, one or many objects in B must belong to one object in A. Therefore, objects in B have a "parentId" attribute holding the parent id generated by ElasticSearch.
I want to filter parents in A by children attributes of B. In order to do it, I first filter children in B by attributes and get its unique parent ids that I'll later use to get parents.
I send this request:
POST http://localhost:9200/test/B/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "derp2*"
    }
  },
  "aggregations": {
    "ids": {
      "terms": {
        "field": "parentId"
      }
    }
  }
}
And get this response:
{
  "took": 91,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "child",
        "_id": "AU_fjH5u40Hx1Kh6rfQG",
        "_score": 1,
        "_source": {
          "parentId": "AU_ffvwM40Hx1Kh6rfQA",
          "name": "derp2child2"
        }
      },
      {
        "_index": "test",
        "_type": "child",
        "_id": "AU_fjD_U40Hx1Kh6rfQF",
        "_score": 1,
        "_source": {
          "parentId": "AU_ffvwM40Hx1Kh6rfQA",
          "name": "derp2child1"
        }
      },
      {
        "_index": "test",
        "_type": "child",
        "_id": "AU_fjKqf40Hx1Kh6rfQH",
        "_score": 1,
        "_source": {
          "parentId": "AU_ffvwM40Hx1Kh6rfQA",
          "name": "derp2child3"
        }
      }
    ]
  },
  "aggregations": {
    "ids": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "au_ffvwm40hx1kh6rfqa",
          "doc_count": 3
        }
      ]
    }
  }
}
For some reason, the returned key is lowercased, so I cannot use it to request the parent from ElasticSearch:
GET http://localhost:9200/test/A/au_ffvwm40hx1kh6rfqa
Response:
{
  "_index": "test",
  "_type": "A",
  "_id": "au_ffvwm40hx1kh6rfqa",
  "found": false
}
Any ideas on why is this happening?
The difference between the hits and the results of the aggregations is that aggregations work on the indexed terms, and they also return those terms. The hits return the original source.
How are these terms created? By the chosen analyser, which in your case is the default one, the standard analyser. One of the things this analyser does is lowercase all the characters of the terms. As mentioned by Andrei, you should configure the parentId field to be not_analyzed.
PUT test
{
  "mappings": {
    "B": {
      "properties": {
        "parentId": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
I am late to the party, but I had the same issue and understood that it is caused by normalization.
You have to change the mapping of the index if you want to prevent normalization from lowercasing the aggregated values.
You can check the current mapping in the Dev Tools console by typing:
GET /A/_mapping
GET /B/_mapping
When you see the structure of the index, check the settings of the parentId field.
If you don't want to change the behaviour of the field but still want to avoid normalization during the aggregation, you can add a sub-field to the parentId field.
For changing the mapping you have to delete the index and recreate it with the new mapping:
creating the index
Adding multi-fields to an existing field
In your case it looks like this (it contains only the parentId field):
PUT /B/_mapping
{
  "properties": {
    "parentId": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}
Then you have to use the sub-field in the query:
POST http://localhost:9200/test/B/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "derp2*"
    }
  },
  "aggregations": {
    "ids": {
      "terms": {
        "field": "parentId.keyword",
        "order": { "_key": "desc" }
      }
    }
  }
}
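Why the bucket key comes back lowercased while the keyword sub-field preserves it can be sketched with a toy model (terms aggregations bucket over indexed tokens, not over _source):

```python
from collections import Counter

def standard_analyze(value):
    # the standard analyzer lowercases every token
    return [t.lower() for t in value.split()]

parent_ids = ["AU_ffvwM40Hx1Kh6rfQA"] * 3  # three children, one parent

# analyzed field: buckets are built from the lowercased tokens
analyzed_buckets = Counter(t for v in parent_ids for t in standard_analyze(v))
print(analyzed_buckets)  # lowercased key, doc_count 3

# keyword sub-field: the raw value is the single token, case preserved
keyword_buckets = Counter(parent_ids)
print(keyword_buckets)   # original-case key, doc_count 3
```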

Should I include spaces in fuzzy query fields?

I have this data:
name:
first: 'John'
last: 'Smith'
When I store it in ES, AFAICT it's better to make it one field. However, should this one field be:
name: 'John Smith'
or
name: 'JohnSmith'
?
I'm thinking that the query should be:
query:
match:
name:
query: searchTerm
fuzziness: 'AUTO'
operator: 'and'
Example search terms are what people might type in a search box, like
John
Jhon Smi
J Smith
Smith
etc.
You will probably want a combination of ngrams and a fuzzy match query. I wrote a blog post about ngrams for Qbox if you need a primer: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch. I'll swipe the starter code at the end of the post to illustrate what I mean here.
Also, I don't think it matters much whether you use two fields for name, or just one. If you have some other reason you want two fields, you may want to use the _all field in your query. For simplicity I'll just use a single field here.
Here is a mapping that will get you the partial-word matching you want, assuming you only care about tokens that start at the beginning of words (otherwise use ngrams instead of edge ngrams). There are lots of nuances to using ngrams, so I'll refer to you the documentation and my primer if you want more info.
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "edge_ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
One thing to note here, in particular: "min_gram": 1. This means that single-character tokens will be generated from indexed values. This will cast a pretty wide net when you query (lots of words begin with "j", for example), so you may get some unexpected results, especially when combined with fuzziness. But this is needed to get your "J Smith" query to work right. So there are some trade-offs to consider.
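What the edge_ngram filter above emits (min_gram 1, max_gram 10), after the standard tokenizer and lowercase filter have run, can be sketched like this (simplified model; the real tokenizer does more than split on whitespace):

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    # leading substrings of the token, from min_gram to max_gram characters
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def analyze(text):
    tokens = [t.lower() for t in text.split()]  # tokenizer + lowercase (simplified)
    return [g for t in tokens for g in edge_ngrams(t)]

print(analyze("John Smith"))
# ['j', 'jo', 'joh', 'john', 's', 'sm', 'smi', 'smit', 'smith']
```

The single-character grams 'j' and 's' are why "J Smith" can match, and also why the net is so wide.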
For illustration, I indexed four documents:
PUT /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"John Hancock"}
{"index":{"_id":2}}
{"name":"John Smith"}
{"index":{"_id":3}}
{"name":"Bob Smith"}
{"index":{"_id":4}}
{"name":"Bob Jones"}
Your query mostly works, with a couple of caveats.
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "John",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
This query returns three documents, because of ngrams plus fuzziness:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.90169895,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.90169895,
        "_source": {
          "name": "John Hancock"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 0.90169895,
        "_source": {
          "name": "John Smith"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4",
        "_score": 0.6235822,
        "_source": {
          "name": "Bob Jones"
        }
      }
    ]
  }
}
That may not be what you want. Also, "AUTO" doesn't work with the "Jhon Smi" query, because "Jhon" is an edit distance of 2 from "John", and "AUTO" uses an edit distance of 1 for strings of 3-5 characters (see the docs for more info). So I have to use this query instead:
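The edit-distance claim is easy to check: plain Levenshtein distance between "Jhon" and "John" is 2, because a transposition costs two single-character edits when no transposition operation is allowed. A quick sketch:

```python
def levenshtein(a, b):
    # classic dynamic-programming Levenshtein (no transposition operation)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Jhon", "John"))  # 2 -> beyond AUTO's limit of 1 for 3-5 char strings
```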
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Jhon Smi",
        "fuzziness": 2,
        "operator": "and"
      }
    }
  }
}
...
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4219328,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1.4219328,
        "_source": {
          "name": "John Smith"
        }
      }
    ]
  }
}
The other queries work as expected. So this solution isn't perfect, but it will get you close.
Here's all the code I used:
http://sense.qbox.io/gist/ba5a6741090fd40c1bb20f5d36f3513b4b55ac77

ElasticSearch - Match (email value) returns wrong registers

I'm using match to search for a specific email, but the results are wrong: the match query brings back similar results. If an exact match exists, it appears in the first lines; when it does not exist, I get results from the same domain.
Here is my query:
{
  "query": {
    "match": {
      "email": "placplac#xxx.net"
    }
  }
}
This email doesn't exist in my index, but the query returns values like banana#xxx.net, ronyvon#xxx.net, etc.
How can I force it to return results only if the value exactly equals the query?
Thanks in advance.
You need to put "index":"not_analyzed" on the "email" field. That way, the only terms that are queried against are the exact values that have been stored to that field (as opposed to the case with the standard analyzer, which is the default used if no analyzer is listed).
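Why the analyzed field matches other addresses in the same domain can be sketched as follows: the analyzer splits the address into tokens, and a match query (OR semantics by default) is satisfied by any shared token. This is a toy model; the splitting rule here is a crude stand-in for the standard tokenizer, and '#' stands in for '@' as in the question:

```python
import re

def standard_analyze(text):
    # crude stand-in: split on '#' and whitespace, lowercase
    return [t.lower() for t in re.split(r"[#\s]+", text) if t]

indexed = ["placplac#xxx.net", "banana#xxx.net", "ronyvon#xxx.net"]
query_tokens = standard_analyze("placplac#xxx.net")  # ['placplac', 'xxx.net']

# analyzed field + match: any token overlap is a hit -> all three docs match,
# because they all share the 'xxx.net' token
analyzed_hits = [e for e in indexed
                 if set(standard_analyze(e)) & set(query_tokens)]
print(analyzed_hits)

# not_analyzed field + term: the whole address is one token -> exact match only
exact_hits = [e for e in indexed if e == "placplac#xxx.net"]
print(exact_hits)  # ['placplac#xxx.net']
```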
To illustrate, I set up a simple mapping with the email field not analyzed, and added two simple docs:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
PUT /test_index/doc/1
{"email": "placplac#xxx.net"}
PUT /test_index/doc/2
{"email": "placplac#nowhere.net"}
Now your match query will return only the document that matches the query exactly:
POST /test_index/_search
{
  "query": {
    "match": {
      "email": "placplac#xxx.net"
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "email": "placplac#xxx.net"
        }
      }
    ]
  }
}
Here is the code I used:
http://sense.qbox.io/gist/12763f63f2a75bf30ff956c25097b5955074508a
PS: What you actually probably want here is a term query or even term filter, since you don't want any analysis on the query text. So maybe something like:
POST /test_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "email": "placplac#xxx.net"
        }
      }
    }
  }
}
