I have an Elasticsearch index named pollstat with mapping as follows:
{
"pollstat" : {
"mappings" : {
"dynamic" : "false",
"properties" : {
"dt" : {
"properties" : {
"dte" : {
"type" : "date"
},
"is_polled" : {
"type" : "boolean"
}
}
},
"is_profiled" : {
"type" : "boolean"
},
"maid" : {
"type" : "keyword"
}
}
}
}
}
The above index is created using:
curl -XPUT "http://localhost:9200/pollstat" -H 'Content-Type: application/json' -d'
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"maid" : {
"type" : "keyword"
},
"dt" : {
"type" : "object",
"properties": {
"dte" : {"type":"date"},
"is_polled" : {"type":"boolean"}
}
},
"is_profiled" : {
"type" : "boolean"
}
},
"dynamic":false
}
}'
To add data into this index, I am using the following code:
curl -X POST "localhost:9200/pollstat/_doc/?pretty" -H 'Content-Type: application/json' -d'{"maid" : "fans", "dt" : [{"dte": "2022-03-19", "is_polled":true } ], "is_profiled":true } '
This is working.
The requirement is to append the dt field when a particular maid polls data on a specific date. In this case, if the maid fans polls data for another day, I want to append the same to the dt field.
I used the following code, which takes the document id to update the document.
curl -X POST "localhost:9200/pollstat/_doc/hQh4oH8BPfXX63hBUbPN/_update?pretty" -H 'Content-Type: application/json' -d'{"script": {"source": "ctx._source.dt.addAll(params.dt)", "params": {"dt": [{ "dte": "2019-07-16", "is_polled": true }, { "dte": "2019-07-17", "is_polled": false } ] } } } '
This also works.
However, my application does not have visibility to the document id but gets the maid. The maid is also as unique as the document id. Hence to update a specific maid, I was trying to do the same with a query on maid.
I used the following code:
curl -X POST "localhost:9200/pollstat/_update_by_query?pretty" -H 'Content-Type: application/json' -d'"query": {"match": { "maid": "fans" }, "script": {"source": "ctx._source.dt.addAll(params.dt)", "params": {"dt": [{ "dte": "2019-07-18", "is_polled": true }, { "dte": "2019-07-19", "is_polled": false } ] } } }'
This code executes without an error and I am getting the following update status as well:
{
"took" : 8,
"timed_out" : false,
"total" : 1,
"updated" : 1,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
However my index is not getting updated.
Since the maid field has type keyword, I had to use a term query instead of a match query. The final query is as follows:
curl -X POST "localhost:9200/pollstat/_update_by_query?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"term": { "maid": "fans" }},
"script": {
"source": "ctx._source.dt.addAll(params.dt)",
"params": {
"dt": [
{ "dte": "2019-07-18", "is_polled": true },
{ "dte": "2019-07-19", "is_polled": false }
]
}
}
}
'
Posting this answer for others' reference.
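As a rough sketch of how an application that only knows the maid might assemble this request, the Python below builds the same _update_by_query body with a term query and models locally what the script's addAll call does to a document's dt list. The helper names build_append_request and apply_append are hypothetical, and nothing is sent to Elasticsearch here; the body would still have to be POSTed to /pollstat/_update_by_query.

```python
import json


def build_append_request(maid, new_entries):
    """Build an _update_by_query body that appends entries to dt for one maid."""
    return {
        # term query: exact match against the keyword field, no analysis
        "query": {"term": {"maid": maid}},
        "script": {
            "source": "ctx._source.dt.addAll(params.dt)",
            "params": {"dt": list(new_entries)},
        },
    }


def apply_append(source, params):
    """Local model of the script: return a copy of the document with dt extended."""
    updated = dict(source)
    updated["dt"] = list(source["dt"]) + list(params["dt"])
    return updated


body = build_append_request(
    "fans",
    [
        {"dte": "2019-07-18", "is_polled": True},
        {"dte": "2019-07-19", "is_polled": False},
    ],
)

# The document indexed earlier in the question.
doc = {
    "maid": "fans",
    "dt": [{"dte": "2022-03-19", "is_polled": True}],
    "is_profiled": True,
}
updated_doc = apply_append(doc, body["script"]["params"])

print(json.dumps(body, indent=2))
print(len(updated_doc["dt"]))  # 3
```

Keeping the new entries in params rather than concatenating them into the script source mirrors the query above and avoids building script text by hand.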
Let's say I have 3 documents, each of them only contains one field (but let's imagine that there are more, and we need to search through all fields).
Field value is "first second"
Field value is "second first"
Field value is "first second third"
Here is a script that can be used to create these 3 documents:
# drop the index completely, use with care!
curl -iX DELETE 'http://localhost:9200/test'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/one' -d '{"name":"first second"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/two' -d '{"name":"second first"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/three' -d '{"name":"first second third"}'
I need to find the only document (document 1) that has exactly "first second" text in one of its fields.
Here is what I tried.
A. Plain search:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "first second"
}
}
}'
returns all 3 documents
B. Quoting
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "\"first second\""
}
}
}'
gives 2 documents: 1 and 3, because both contain 'first second'.
The answer at https://stackoverflow.com/a/28024714/7637120 suggests using the 'keyword' analyzer when indexing the fields, but I would like to avoid any customizations to the mapping.
Is it possible to avoid them and still only find document 1?
Yes, you can do that by mapping the name field as type keyword. A keyword field is indexed as a single unanalyzed value, so a query matches only when the whole value is identical.
To demonstrate it, I did the following:
1) created a mapping with type `keyword` for the `name` field
2) indexed the three documents
3) searched with a `match` query
mappings
PUT so_test16
{
"mappings": {
"_doc":{
"properties":{
"name": {
"type": "keyword"
}
}
}
}
}
Indexing the documents
POST /so_test16/_doc
{
"id": 1,
"name": "first second"
}
POST /so_test16/_doc
{
"id": 2,
"name": "second first"
}
POST /so_test16/_doc
{
"id": 3,
"name": "first second third"
}
The query
GET /so_test16/_search
{
"query": {
"match": {"name": "first second"}
}
}
and the result
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "so_test16",
"_type" : "_doc",
"_id" : "m1KXx2sB4TH56W1hdTF9",
"_score" : 0.2876821,
"_source" : {
"id" : 1,
"name" : "first second"
}
}
]
}
}
Adding a second solution, for the case where name is a text field rather than a keyword field. The only extra requirement is that fielddata: true must also be set on the name field.
Mappings
PUT so_test18
{
"mappings" : {
"_doc" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"fielddata": true
}
}
}
}
}
and the search query
GET /so_test18/_search
{
"query": {
"bool": {
"must": [
{"match_phrase": {"name": "first second"}}
],
"filter": {
"script": {
"script": {
"lang": "painless",
"source": "doc['name'].values.length == 2"
}
}
}
}
}
}
and the response
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.3971361,
"hits" : [
{
"_index" : "so_test18",
"_type" : "_doc",
"_id" : "o1JryGsB4TH56W1hhzGT",
"_score" : 0.3971361,
"_source" : {
"id" : 1,
"name" : "first second"
}
}
]
}
}
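To see what the script filter adds on top of match_phrase, here is a toy Python model of the second solution. It assumes, and this is my assumption rather than something stated in the answer, that fielddata exposes each distinct term of a document once, so doc['name'].values.length is a count of unique terms. Whitespace splitting stands in for the real analyzer, and the function names are hypothetical.

```python
def tokens(text):
    """Stand-in for the analyzed text field: lowercase, split on whitespace."""
    return text.lower().split()


def phrase_match(value, phrase):
    """True if the phrase tokens appear consecutively in the value."""
    ts, q = tokens(value), tokens(phrase)
    return any(ts[i:i + len(q)] == q for i in range(len(ts) - len(q) + 1))


def second_solution(value, phrase, expected_terms=2):
    """match_phrase plus the filter doc['name'].values.length == 2."""
    unique_terms = set(tokens(value))  # assumed: one entry per distinct term
    return phrase_match(value, phrase) and len(unique_terms) == expected_terms


for value in ["first second", "second first", "first second third", "first second first"]:
    print(repr(value), second_solution(value, "first second"))
```

Under that assumption the three documents from the question behave as in the response above, but a value such as "first second first" would also pass, because it still has only two distinct terms. The hard-coded 2 also has to change with the number of terms in the phrase.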
In Elasticsearch 7.1.0, it seems that you can use keyword analyzer even without creating a special mapping. At least I didn't, and the following query does what I need:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "first second",
"analyzer": "keyword"
}
}
}'
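The three behaviours seen in this thread (all documents, documents 1 and 3, only document 1) can be reproduced with a small Python model. This is a sketch under simplifying assumptions, not how Elasticsearch is implemented: whitespace splitting stands in for the default analyzer, and the keyword case is modelled as comparing the whole query with the whole stored value, since a query kept as one unsplit token can only equal a value that was itself indexed whole. All function names are hypothetical.

```python
DOCS = {
    "one": "first second",
    "two": "second first",
    "three": "first second third",
}


def tokens(text):
    """Stand-in for an analyzed text field: lowercase, split on whitespace."""
    return text.lower().split()


def match_any_term(query):
    """Unquoted query: a document matches if it shares any term with the query."""
    wanted = set(tokens(query))
    return sorted(d for d, v in DOCS.items() if wanted & set(tokens(v)))


def match_phrase(query):
    """Quoted query: the terms must appear consecutively and in order."""
    q = tokens(query)

    def has_run(ts):
        return any(ts[i:i + len(q)] == q for i in range(len(ts) - len(q) + 1))

    return sorted(d for d, v in DOCS.items() if has_run(tokens(v)))


def match_whole_value(query):
    """Keyword-style matching: the query is one token equal to the full value."""
    return sorted(d for d, v in DOCS.items() if v == query)


print(match_any_term("first second"))     # ['one', 'three', 'two']
print(match_phrase("first second"))       # ['one', 'three']
print(match_whole_value("first second"))  # ['one']
```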
In Elasticsearch 6.2 I have a parent-child relationship:
Document -> NamedEntity
I want to aggregate NamedEntity documents on their mention field and, for each named entity, get the number of distinct documents that contain it.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tried a cardinality aggregation on the join field and obtained a value of 1; a cardinality aggregation on join.parent returns a value of 0.
So how do you compute a distinct count of parents in an aggregation without using a reverse nested aggregation?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document (content) and NamedEntity (mention) in an ES 6 mapping (fields are defined on the same level):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}'
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id": {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}'
Then the cardinality aggregate with this new field works as expected :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
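For readers who want to check the expected numbers, the following Python sketch computes the same "count distinct parents per mention" locally on the minimal dataset from the question. The function name aggregate_mentions is hypothetical, and the sketch counts exactly, whereas the cardinality aggregation in Elasticsearch is an approximate count and may deviate on large sets.

```python
from collections import defaultdict

# The three NamedEntity children, with the denormalized document_id field.
children = [
    {"_id": "ner11", "mention": "NER", "document_id": "doc1"},
    {"_id": "ner12", "mention": "NER", "document_id": "doc1"},
    {"_id": "ner2", "mention": "NER", "document_id": "doc2"},
]


def aggregate_mentions(docs):
    """Per mention: number of child documents and number of distinct parents."""
    doc_count = defaultdict(int)
    parents = defaultdict(set)
    for doc in docs:
        doc_count[doc["mention"]] += 1
        parents[doc["mention"]].add(doc["document_id"])
    return {
        mention: {"doc_count": doc_count[mention], "docs": {"value": len(parents[mention])}}
        for mention in doc_count
    }


result = aggregate_mentions(children)
print(result)  # {'NER': {'doc_count': 3, 'docs': {'value': 2}}}
```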
I recently ran into the same issue on Elasticsearch 7.1, and the additional field "my_join_field#my_parent" created by Elasticsearch solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join
I created a new index in Elasticsearch (v6) using this command:
curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/sorttest -d '
{
"settings" : {
"index" : {
"sort.field" : ["username", "date"],
"sort.order" : ["asc", "desc"]
}
},
"mappings": {
"_doc": {
"properties": {
"username": {
"type": "keyword",
"doc_values": true
},
"date": {
"type": "date"
}
}
}
}
}
'
The response was
{"acknowledged":true,"shards_acknowledged":true,"index":"sorttest"}
Next I checked the generated mapping:
curl -XGET localhost:9200/sorttest/_mapping?pretty
And the result was
{
"sorttest" : {
"mappings" : {
"_doc" : {
"properties" : {
"date" : {
"type" : "date"
},
"username" : {
"type" : "keyword"
}
}
}
}
}
}
The question is: how can I find out what kind of sorting is set for my index?
Just
curl -XGET localhost:9200/sorttest?pretty
and you will see:
"settings" : {
"index" : {
...
"sort" : {
"field" : [
"username",
"date"
],
"order" : [
"asc",
"desc"
]
},
...
}
}
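If the sort configuration has to be read programmatically, a minimal Python sketch could pull it out of the parsed response of the request above. The function name index_sort is hypothetical, the sample dictionary only reproduces the part of the response shown here, and the fallback to "asc" when no order is present assumes the documented default for index.sort.order.

```python
def _as_list(value):
    """Settings may hold a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def index_sort(index_info, index_name):
    """Return [(field, order), ...] from a parsed GET /<index> response."""
    sort = index_info[index_name]["settings"]["index"].get("sort")
    if not sort:
        return []  # no index sorting configured
    fields = _as_list(sort["field"])
    orders = _as_list(sort["order"]) if "order" in sort else ["asc"] * len(fields)
    return list(zip(fields, orders))


# Trimmed-down copy of the response shown above.
response = {
    "sorttest": {
        "settings": {
            "index": {
                "sort": {"field": ["username", "date"], "order": ["asc", "desc"]}
            }
        }
    }
}

print(index_sort(response, "sorttest"))  # [('username', 'asc'), ('date', 'desc')]
```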
Let's say I make a simple ElasticSearch index:
curl -XPUT 'http://localhost:9200/test/' -d '{
"settings": {
"analysis": {
"char_filter": {
"de_acronym": {
"type": "mapping",
"mappings": [".=>"]
}
},
"analyzer": {
"analyzer1": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": ["de_acronym"]
}
}
}
}
}'
And I make two doc_types that have the same property name but they are analyzed slightly differently from one another:
curl -XPUT 'http://localhost:9200/test/_mapping/docA' -d '{
"docA": {
"properties": {
"name": {
"type": "string",
"analyzer": "simple"
}
}
}
}'
curl -XPUT 'http://localhost:9200/test/_mapping/docB' -d '{
"docB": {
"properties": {
"name": {
"type": "string",
"analyzer": "analyzer1"
}
}
}
}'
Next, let's say I put a document in each doc_type with the same name:
curl -XPUT 'http://localhost:9200/test/docA/1' -d '{ "name" : "U.S. Army" }'
curl -XPUT 'http://localhost:9200/test/docB/1' -d '{ "name" : "U.S. Army" }'
Let's try to search for "U.S. Army" in both doc types at the same time:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
"query": {
"match_phrase": {
"name": {
"query": "U.S. Army"
}
}
}
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.5,
"hits" : [ {
"_index" : "test",
"_type" : "docA",
"_id" : "1",
"_score" : 1.5,
"_source":{ "name" : "U.S. Army" }
} ]
}
}
I only get one result! I get the other result when I specify docB's analyzer:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '
{
"query": {
"match_phrase": {
"name": {
"query": "U.S. Army",
"analyzer": "analyzer1"
}
}
}
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "test",
"_type" : "docB",
"_id" : "1",
"_score" : 1.0,
"_source":{ "name" : "U.S. Army" }
} ]
}
}
I was under the impression that ES would search each doc_type with the appropriate analyzer. Is there a way to do this?
The ElasticSearch docs say that precedence for search analyzer goes:
1) The analyzer defined in the query itself, else
2) The analyzer defined in the field mapping, else
...
In this case, is ElasticSearch arbitrarily choosing which field mapping to use?
Take a look at this issue in github, which seems to have started from this post in ES google groups. I believe it answers your question:
if its in a filtered query, we can't infer it, so we simply pick one of those and use its analysis settings
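A small Python model makes it visible why a single search analyzer can match only one of the two types. It is a sketch, not Elasticsearch code: simple_analyzer approximates the built-in simple analyzer (split on non-letters, lowercase) and analyzer1 approximates the custom analyzer from the question (remove dots, then keep the whole input as one token). The phrase check is a simplified stand-in for match_phrase.

```python
import re


def simple_analyzer(text):
    """Approximation of the simple analyzer: runs of letters, lowercased."""
    return re.findall(r"[^\W\d_]+", text.lower())


def analyzer1(text):
    """Approximation of analyzer1: char filter '.=>' then keyword tokenizer."""
    return [text.replace(".", "")]


def contains_sequence(indexed, query):
    """True if the query tokens appear consecutively in the indexed tokens."""
    return any(
        indexed[i:i + len(query)] == query
        for i in range(len(indexed) - len(query) + 1)
    )


# Each type indexes the same name with its own analyzer.
INDEX = {
    "docA": simple_analyzer("U.S. Army"),  # ['u', 's', 'army']
    "docB": analyzer1("U.S. Army"),        # ['US Army']
}


def search(query, analyzer):
    """Analyze the query once and compare against both types."""
    q = analyzer(query)
    return [doc_type for doc_type, terms in INDEX.items() if contains_sequence(terms, q)]


print(search("U.S. Army", simple_analyzer))  # ['docA']
print(search("U.S. Army", analyzer1))        # ['docB']
```

Because the query text is analyzed once, whichever analyzer is picked produces tokens that line up with only one type's indexed terms, which matches the two result sets in the question.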
I am using the jdbc river and I can create the following index:
curl -XPUT 'localhost:9201/_river/email/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"strategy":"simple",
"poll":"10",
"driver" : "org.postgresql.Driver",
"url" : "jdbc:postgresql://localhost:5432/api_development",
"username" : "paulcowan",
"password" : "",
"sql" : "SELECT id, subject, body, personal, sent_at, read_by, account_id, sender_user_id, sender_contact_id, html, folder, draft FROM emails"
},
"index" : {
"index" : "email",
"type" : "jdbc"
},
"mappings" : {
"email" : {
"properties" : {
"account_id" : { "type" : "integer" },
"subject" : { "type" : "string" },
"body" : { "type" : "string" },
"html" : { "type" : "string" },
"folder" : { "type" : "string", "index" : "not_analyzed" },
"id" : { "type" : "integer" }
}
}
}
}'
I can run basic queries using curl like this:
curl -XGET 'http://localhost:9201/email/jdbc/_search?pretty&q=fullcontact'
I get back results.
But what I want to do is restrict the results to a particular email account_id and a particular folder, so I run the following query:
curl -XGET 'http://localhost:9201/email/jdbc/_search' -d '{
"query": {
"filtered": {
"filter": {
"and": [
{
"term": {
"folder": "INBOX"
}
},
{
"term": {
"account_id": 1
}
}
]
},
"query": {
"query_string": {
"query": "fullcontact*"
}
}
}
}
}'
I get the following results:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Can anyone tell me what is wrong with my query?
It turns out that in the jdbc river you need to use the type_mapping section to specify that a field is not_analyzed; the normal mappings node is ignored.
Below is how it turned out:
curl -XPUT 'localhost:9200/_river/your_index/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"strategy":"simple",
"poll":"10",
"driver" : "org.postgresql.Driver",
"url" : "jdbc:postgresql://localhost:5432/api_development",
"username" : "user",
"password" : "your_password",
"sql" : "SELECT field_one, field_two, field_three, the_rest FROM blah"
},
"index" : {
"index" : "your_index",
"type" : "jdbc",
"type_mapping": "{\"your_index\" : {\"properties\" : {\"field_two\":{\"type\":\"string\",\"index\":\"not_analyzed\"}}}}"
}
}'
Strangely (or annoyingly), the type_mapping section takes a JSON-encoded string and not a normal JSON node.
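Escaping that string by hand is error-prone, so one option is to let a JSON library do the double encoding. The Python sketch below is only an illustration using the placeholder names from the command above: it serialises the mapping to a string first, embeds that string as the type_mapping value, and then serialises the whole river definition.

```python
import json

# The mapping as a normal nested structure.
type_mapping = {
    "your_index": {
        "properties": {
            "field_two": {"type": "string", "index": "not_analyzed"}
        }
    }
}

river_meta = {
    "type": "jdbc",
    "jdbc": {
        "strategy": "simple",
        "poll": "10",
        "driver": "org.postgresql.Driver",
        "url": "jdbc:postgresql://localhost:5432/api_development",
        "username": "user",
        "password": "your_password",
        "sql": "SELECT field_one, field_two, field_three, the_rest FROM blah",
    },
    "index": {
        "index": "your_index",
        "type": "jdbc",
        # First encoding: the mapping becomes a JSON string, not a nested node.
        "type_mapping": json.dumps(type_mapping),
    },
}

# Second encoding: the whole definition, with the inner quotes escaped.
payload = json.dumps(river_meta)
print(payload)
```

The printed payload can be used as the body of the PUT request; decoding it twice gives back the original mapping.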
I can check the mappings by running:
# check mappings
curl -XGET 'http://localhost:9200/your_index/jdbc/_mapping?pretty=true'
Which should give something like:
{
"jdbc" : {
"properties" : {
"field_one" : {
"type" : "long"
},
"field_two" : {
"type" : "string",
"index" : "not_analyzed",
"omit_norms" : true,
"index_options" : "docs"
},
"field_three" : {
"type" : "string"
}
}
}
}