ElasticSearch: Searching fields in nested arrays

I'm fairly new to ES and am using it for a new project of mine. Starting off, I have a simple mapping for a customer, which has a first and last name, and a list of payment information objects. If I were doing this in SQL, it would be something like a customer table, and a payment info table with a 1:many relationship.
Here's a simple example of what I'm trying to do: https://gist.github.com/anonymous/6109593
I'm hoping to find any customer based on any match in the nested array of paymentInfos, i.e. finding any users who've had a paymentInfo with billingZip 10101. This query returns no results, and I'm not sure why. Can anyone point me in the right direction as to why this query doesn't work, and if there are any changes I can make to either my query or mapping to have it return the user properly?
Thanks!

Nested fields should be searched using a nested query:
echo "Deleting old ElasticSearch index..."
curl -XDELETE 'localhost:9200/arrtest'
echo
echo "Creating new ElasticSearch index..."
curl -XPUT 'localhost:9200/arrtest/?pretty=1' -d '{
"mappings" : {
"cust2" : {
"properties" : {
"firstName" : {
"type" : "string",
"analyzer" : "string_lowercase"
},
"lastName" : {
"type" : "string",
"analyzer" : "string_lowercase"
},
"paymentInfos": {
"properties": {
"billingZip": {
"type": "string",
"analyzer": "string_lowercase"
},
"paypalEmail": {
"type": "string",
"analyzer": "string_lowercase"
}
},
"type": "nested"
}
}
}
},
"settings" : {
"analysis" : {
"analyzer" : {
"uax_url_email" : {
"filter" : [ "standard", "lowercase" ],
"tokenizer" : "uax_url_email"
},
"string_lowercase": {
"tokenizer" : "keyword",
"filter" : "lowercase"
}
}
}
}
}
'
echo
echo "Index recreation finished"
echo "Inserting one record..."
curl -XPUT 'localhost:9200/arrtest/cust2/1' -d '{
"firstName": "john",
"lastName": "smith",
"paymentInfos": [{
"billingZip": "10101",
"paypalEmail": "foo#bar.com"
}, {
"billingZip": "20202",
"paypalEmail": "foo2#bar2.com"
}]
}
'
echo
echo "Refreshing index to make new records searchable"
curl -XPOST 'localhost:9200/arrtest/_refresh'
echo
echo "Searching for record..."
curl -XGET 'localhost:9200/arrtest/cust2/_search?pretty=1' -d '{
"sort": [],
"query": {
"bool": {
"should": [],
"must_not": [],
"must": [{
"nested": {
"query": {
"query_string": {
"fields": ["paymentInfos.billingZip"],
"query": "10101"
}
},
"path": "paymentInfos"
}
}]
}
},
"facets": {},
"from": 0,
"size": 25
}'
echo
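The key part is the nested wrapper with "path": "paymentInfos"; field names inside it must be fully qualified. For a single-term lookup like this, the query_string clause could equally be a simple match query (an equivalent sketch, not part of the original answer):
curl -XGET 'localhost:9200/arrtest/cust2/_search?pretty=1' -d '{
  "query": {
    "nested": {
      "path": "paymentInfos",
      "query": {
        "match": { "paymentInfos.billingZip": "10101" }
      }
    }
  }
}'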

Related

How can I use query_string to match both nested and non-nested fields at the same time?

I have an index with a mapping something like this:
"email" : {
"type" : "nested",
"properties" : {
"from" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"subject" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"to" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
}
}
},
"textExact" : {
"type" : "text",
"analyzer" : "lowercase_standard",
"fielddata" : true
}
I want to use query_string to search for matches in both the nested and the non-nested field at the same time, e.g.
email.to:foo@example.com AND textExact:bar
But I can't figure out how to write a query that will search both fields at once. The following doesn't work, because query_string searches do not return nested documents:
"query": {
"query_string": {
"fields": [
"textExact",
"email.to"
],
"query": "email.to:foo#example.com AND textExact:bar"
}
}
I can write a separate nested query, but that will only search against nested fields. Is there any way I can use query_string to match both nested and non-nested fields at the same time?
I am using Elasticsearch 6.8. Cross-posted on the Elasticsearch forums.
Nested documents can only be queried with the nested query.
You can follow either of the two approaches below.
1. You can combine a nested query and a normal query in the must clause, which works like an "and" across the different queries.
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "email",
            "query": {
              "term": {
                "email.to": "foo@example.com"
              }
            }
          }
        },
        {
          "match": {
            "textExact": "bar"
          }
        }
      ]
    }
  }
}
2. copy_to
The copy_to parameter allows you to copy the values of multiple fields into a group field, which can then be queried as a single field.
{
  "mappings": {
    "properties": {
      "textExact": {
        "type": "text"
      },
      "to_email": {
        "type": "keyword"
      },
      "email": {
        "type": "nested",
        "properties": {
          "to": {
            "type": "keyword",
            "copy_to": "to_email"
          },
          "from": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
(The copy_to on email.to copies its value into the non-nested to_email field.)
Query
{
  "query": {
    "query_string": {
      "fields": [
        "textExact",
        "to_email"
      ],
      "query": "to_email:foo@example.com AND textExact:bar"
    }
  }
}
Result
"_source" : {
"textExact" : "bar",
"email" : [
{
"to" : "sdfsd#example.com",
"from" : "a#example.com"
},
{
"to" : "foo#example.com",
"from" : "sdfds#example.com"
}
]
}

"[nested] failed to find nested object under path [steps]"

I have mapped data into this schema:
curl -X PUT "localhost:9200/data?pretty" -H 'Content-Type: application/json' -d
'{
"settings":{
"number_of_shards":"1",
"number_of_replicas":"1"
},
"mappings":{
"properties":{
"routines":{
"type":"nested",
"properties":{
"title":{
"type":"text"
},
"sources":{
"type":"keyword"
},
"flags":{
"type":"keyword",
"null_value":"NULL"
},
"steps":{
"type":"nested",
"properties":{
"time":{
"type":"keyword"
},
"products":{
"type":"nested",
"properties":{
"name":{
"type":"text"
},
"link":{
"type":"keyword",
"null_value":"NULL"
},
"type":{
"type":"keyword",
"null_value":"NULL"
},
"ingredients":{
"type":"keyword",
"null_value":"NULL"
},
"flags":{
"type":"keyword",
"null_value":"NULL"
}
}
}
}
}
}
}
}
}
}'
Now, I am trying to search two fields data.title and data.steps.products.name with this query:
curl -X GET "localhost:9200/data/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "data",
"query": {
"nested": {
"path": "steps",
"query": {
"nested": {
"path": "products",
"query": {
"multi_match": {
"query": "XXX",
"fields": [
"name"
]
}
}
}
}
}
}
}
},
{"multi_match": {
"query": "XXX",
"fields": [
"data.title"
]
}
}
]
}
}
}'
It throws an error saying it failed to find the nested path under steps:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "query_shard_exception",
        "reason" : "failed to create query: [nested] failed to find nested object under path [steps]",
        "index_uuid" : "SjQgt4BHStC_APsMnZk8BQ",
        "index" : "data"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "data",
        "node" : "L1azdi09QxanYGnP-y0xbQ",
        "reason" : {
          "type" : "query_shard_exception",
          "reason" : "failed to create query: [nested] failed to find nested object under path [steps]",
          "index_uuid" : "SjQgt4BHStC_APsMnZk8BQ",
          "index" : "data",
          "caused_by" : {
            "type" : "illegal_state_exception",
            "reason" : "[nested] failed to find nested object under path [steps]"
          }
        }
      }
    ]
  },
  "status" : 400
}
Could you help me find the error in my mapping/query?
UPDATE:
My JSON data: https://jsonkeeper.com/b/PSVS
Have a look at the Elastic documentation for multi-level nested queries.
What you forgot in your query is the full path of the nested object in each sub-level nested query. (Also, you used a data field that does not exist in the mapping; you wanted routines instead.)
So your query would look like this:
curl -X GET "localhost:9200/data/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "routines",
"query": {
"nested": {
"path": "routines.steps",
"query": {
"nested": {
"path": "routines.steps.products",
"query": {
"multi_match": {
"query": "XXX",
"fields": [
"routines.steps.products.name"
]
}
}
}
}
}
}
}
},
{"multi_match": {
"query": "XXX",
"fields": [
"routines.title"
]
}
}
]
}
}
}'
Anyhow, please reconsider whether it is a good idea to have multi-level nested fields in the first place, as they have a significant impact on performance. E.g., would having an index for routines, maybe with a data identifier, make sense?
Edit: added the full path to the multi-matched field of the first should block
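To illustrate the flattening suggestion above: if each routine became its own document, the top-level nested layer disappears and every query needs one less nested wrapper. A sketch of such an index (the routines index name and the data_id field are assumptions, not from the original question):
curl -X PUT "localhost:9200/routines?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "data_id": { "type": "keyword" },
      "title": { "type": "text" },
      "steps": {
        "type": "nested",
        "properties": {
          "time": { "type": "keyword" },
          "products": {
            "type": "nested",
            "properties": {
              "name": { "type": "text" }
            }
          }
        }
      }
    }
  }
}'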

How do I get Elasticsearch to ignore terms emptied by a char_filter?

I have a set of US street addresses that I've indexed. The source data is imperfect and sometimes fields contain junk. Specifically, I have zip5 and zip4 fields and a pattern_replace char_filter that strips any non-numeric characters. When that char_filter ends up replacing everything (yielding an empty string), matching still seems to look at that field. The same happens if the original field is just an empty string (as opposed to null). How could I set this up such that it'll just disregard fields that are empty strings (either by source or by the result of a char_filter)?
Example
First, let's create an index with a digits_only pattern replacer and an analyzer that uses it:
curl -XPUT "http://localhost:9200/address_bug" -d'
{
"settings": {
"index": {
"number_of_shards": "4",
"number_of_replicas": "1"
},
"analysis": {
"char_filter" : {
"digits_only" : {
"type" : "pattern_replace",
"pattern" : "([^0-9])",
"replacement" : ""
}
},
"analyzer" : {
"zip" : {
"type" : "custom",
"tokenizer" : "keyword",
"char_filter" : [
"digits_only"
]
}
}
}
}
}'
Now, let's create a mapping that uses the analyzer (NB: I'm using with_positions_offsets for highlighting):
curl -XPUT "http://localhost:9200/address_bug/_mapping/address" -d'
{
"address": {
"properties": {
"zip5": {
"type" : "string",
"analyzer" : "zip",
"term_vector" : "with_positions_offsets"
},
"zip4": {
"type" : "string",
"analyzer" : "zip",
"term_vector" : "with_positions_offsets"
}
}
}
}'
Now that our index and type are set up, let's index some imperfect data:
curl -XPUT "http://localhost:9200/address_bug/address/1234" -d'
{
"zip5" : "02144",
"zip4" : "ABCD"
}'
Alright, let's search for it and ask it to explain itself. In this case the search term is Street because in my actual application I have a single field for full address searching.
curl -XGET "http://localhost:9200/address_bug/address/_search?explain" -d'
{
"query": {
"match": {
"zip4": "Street"
}
}
}'
And, here is the interesting part of the results:
"_explanation": {
"value": 0.30685282,
"description": "weight(zip4: in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.30685282,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 1,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
(Full response is in this gist.)
Expected Result
I wouldn't have expected any hits. If I instead index a document with "zip4" : null, it yields the expected result: no hits.
Help? Am I even taking the right approach here? In my full application, I'm using the same technique for a phone field and suspect I'd have the same issues with the results.
As @plmaheu mentioned, you can use the stop token filter to completely remove empty strings. For instance, this is a configuration that I tested and that works:
POST /myindex
{
  "settings": {
    "analysis": {
      "char_filter" : {
        "digits_only" : {
          "type" : "pattern_replace",
          "pattern" : "[^0-9]+",
          "replacement" : ""
        }
      },
      "filter": {
        "remove_empty": {
          "type": "stop",
          "stopwords": [""]
        }
      },
      "analyzer" : {
        "zip" : {
          "type" : "custom",
          "tokenizer" : "keyword",
          "char_filter" : [
            "digits_only"
          ],
          "filter": ["remove_empty"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "zip": {
          "type": "string",
          "analyzer": "zip"
        }
      }
    }
  }
}
Here the remove_empty filter removes the empty-string stopword "". If you use the analyze API on the string "abcd", you get back the response {"tokens":[]}, so no tokens will be indexed if the zip code is entirely invalid.
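For reference, this can be verified with the analyze API (a sketch assuming the myindex configuration above and the pre-5.x query-string syntax):
curl -XGET 'http://localhost:9200/myindex/_analyze?analyzer=zip&text=abcd&pretty'
which returns an empty token list:
{
  "tokens" : [ ]
}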
I also tested that searching for "foo" finds no results.
Alternatively, you can use a length token filter like this:
"filter": {
"remove_empty": {
"type": "length",
"min": 1
}
}
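Wired into the zip analyzer from the previous answer, it would be referenced the same way (a sketch reusing the configuration above):
"analyzer" : {
  "zip" : {
    "type" : "custom",
    "tokenizer" : "keyword",
    "char_filter" : [ "digits_only" ],
    "filter" : [ "remove_empty" ]
  }
}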

Elasticsearch not using "default_search" analyzer unless explicitly stated in query

From reading the Elasticsearch documents, I would expect that naming an analyzer 'default_search' would cause that analyzer to get used for all searches unless another analyzer is specified. However, if I define my index like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "lowercase"
          ],
          "type" : "custom"
        },
        "default_search": {
          "tokenizer" : "keyword",
          "filter" : [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "100",
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "TestDocument": {
      "dynamic_templates": [
        {
          "metadata_template": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "multi_field",
              "fields": {
                "ngram": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "my_ngram_analyzer"
                },
                "{name}": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      ]
    }
  }
}'
And then add a 'TestDocument':
curl -XPUT 'http://localhost:9200/test/TestDocument/1' -d '{
  "name" : "TestDocument.pdf"
}'
My queries are still running through the default analyzer. I can tell because this query gives me a hit:
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf"
      }
    }
  }
}'
But it does not if I specify the correct analyzer (the one using the 'keyword' tokenizer):
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf",
        "analyzer" : "default_search"
      }
    }
  }
}'
What am I missing to use "default_search" for searches unless stated otherwise in my query? Am I just misinterpreting expected behavior here?
In your dynamic template, you are setting both the search and index analyzer by using "analyzer". Elasticsearch only falls back to default_search when no more specific search analyzer is set:
"index_analyzer" : "analyzer_name"  // sets the index analyzer
"analyzer" : "analyzer_name"        // sets both the search and index analyzers
"search_analyzer" : "analyzer_name" // sets the search analyzer

Multiple properties in facet (elasticsearch)

I have the following index:
curl -XPUT "http://localhost:9200/test/" -d '
{
"mappings": {
"files": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"owners": {
"type": "nested",
"properties": {
"name": {
"type":"string",
"index":"not_analyzed"
},
"mail": {
"type":"string",
"index":"not_analyzed"
}
}
}
}
}
}
}
'
With sample documents:
curl -XPUT "http://localhost:9200/test/files/1" -d '
{
"name": "first.jpg",
"owners": [
{
"name": "John Smith",
"mail": "js#example.com"
},
{
"name": "Joe Smith",
"mail": "joes#example.com"
}
]
}
'
curl -XPUT "http://localhost:9200/test/files/2" -d '
{
"name": "second.jpg",
"owners": [
{
"name": "John Smith",
"mail": "js#example.com"
},
{
"name": "Ann Smith",
"mail": "as#example.com"
}
]
}
'
curl -XPUT "http://localhost:9200/test/files/3" -d '
{
"name": "third.jpg",
"owners": [
{
"name": "Kate Foo",
"mail": "kf#example.com"
}
]
}
'
And I need to find all owners that match some query, let's say "mit":
curl -XGET "http://localhost:9200/test/files/_search" -d '
{
"facets": {
"owners": {
"terms": {
"field": "owners.name"
},
"facet_filter": {
"query": {
"query_string": {
"query": "*mit*",
"default_field": "owners.name"
}
}
},
"nested": "owners"
}
}
}
'
This gives me the following result:
{
  "facets" : {
    "owners" : {
      "missing" : 0,
      "_type" : "terms",
      "other" : 0,
      "total" : 4,
      "terms" : [
        {
          "count" : 2,
          "term" : "John Smith"
        },
        {
          "count" : 1,
          "term" : "Joe Smith"
        },
        {
          "count" : 1,
          "term" : "Ann Smith"
        }
      ]
    }
  },
  "timed_out" : false,
  "hits" : {...}
}
And that's OK.
But what I actually need is to get the owners together with their email addresses (for each entry in the facet, I need an additional field in the results).
Is that achievable?
Not possible, I think. Depending on your needs, I would either:
1. Create a composite field with both name and email, and run the facet on that field (see the sketch below), or
2. Run a query in addition to the facet and extract the data from the query result, though this is obviously not scalable, or
3. Do a two-step operation: get the facet, build the needed queries, and merge the results.
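For the first option, a minimal sketch of the composite-field idea (the display field name and its "Name <mail>" format are assumptions, not from the original answer):
curl -XPUT "http://localhost:9200/test2/" -d '
{
  "mappings": {
    "files": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed" },
        "owners": {
          "type": "nested",
          "properties": {
            "display": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
'
curl -XPUT "http://localhost:9200/test2/files/1" -d '
{
  "name": "first.jpg",
  "owners": [
    { "display": "John Smith <js@example.com>" },
    { "display": "Joe Smith <joes@example.com>" }
  ]
}
'
A terms facet on owners.display then yields entries like "John Smith <js@example.com>", which the client can split back into name and mail.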
