Join query in ElasticSearch - elasticsearch

Is there any way (query) to join 2 JSONs below in ElasticSearch
{
product_id: "1111",
price: "23.56",
stock: "100"
}
{
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}
Above 2 JSONs processed (input) under 2 different types in Logstash, so their indexes are available in different 'type' filed in Elasticsearch.
What I want is to join 2 JSONs on product_id field.

It depends what you intend when you say JOIN. Elasticsearch is not like regular database that supports JOIN between tables. It is a text search engine that manages documents within indexes.
On the other hand you can search within the same index over multiple types using a fields that are common to every type.
For example taking your data I can create an index with 2 types and their data like follows:
curl -XPOST localhost:9200/product -d '{
"settings" : {
"number_of_shards" : 5
}
}'
curl -XPOST localhost:9200/product/type1/_mapping -d '{
"type1" : {
"properties" : {
"product_id" : { "type" : "string" },
"price" : { "type" : "integer" },
"stock" : { "type" : "integer" }
}
}
}'
curl -XPOST localhost:9200/product/type2/_mapping -d '{
"type2" : {
"properties" : {
"product_id" : { "type" : "string" },
"category" : { "type" : "string" },
"manufacturer" : { "type" : "string" }
}
}
}'
curl -XPOST localhost:9200/product/type1/1 -d '{
product_id: "1111",
price: "23",
stock: "100"
}'
curl -XPOST localhost:9200/product/type2/1 -d '{
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}'
I effectively created one index called product with 2 type type1 and type2.
Now I can do the following query and it will return both documents:
curl -XGET 'http://localhost:9200/product/_search?pretty=1' -d '{
"query": {
"query_string" : {
"query" : "product_id:1111"
}
}
}'
{
"took" : 95,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.5945348,
"hits" : [ {
"_index" : "product",
"_type" : "type1",
"_id" : "1",
"_score" : 0.5945348, "_source" : {
product_id: "1111",
price: "23",
stock: "100"
}
}, {
"_index" : "product",
"_type" : "type2",
"_id" : "1",
"_score" : 0.5945348, "_source" : {
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}
} ]
}
}
The reason is because Elasticsearch will search over all documents within that index regardless of their type. This is still different than a JOIN in the sense Elasticsearch is not going to do a Cartesian product of the documents that belong to each type.
Hope that helps

isaac.hazan's answer works quite well, but I would like to add a few points that helped me with this kind of situation:
I landed on this page when I was trying to solve a similar problem, in that I had to exclude multiple records of one index based on documents of another index. The lack of relationships is one of the main downsides of unstructured databases.
The elasticsearch documentation page on Handling Relationships explains a lot.
Four common techniques are used to manage relational data in Elasticsearch:
Application-side joins
Data denormalization
Nested objects
Parent/child relationships
Often the final solution will require a mixture of a few of these techniques.
I've used nested objects and application-side joins, mostly. While using the same field name could momentarily solve the problem, I think it is better to rethink and create best-suited mapping for your application.
For instance, you might find that you want to list all products with price greater than x, or list all products that are not in stock anymore. To deal with such scenarios it helps if you are using one of the solutions mentioned above.

To perform joins on Elasticsearch take a look at the Siren "Federate" plugin. It adds join capabilities by extending the Elasticsearch native query syntax.
https://siren.io/federate/

Related

How to make Elastic Engine understand a field is not to be analyzed for an exact match?

The question is based on the previous post where the Exact Search did not work either based on Match or MatchPhrasePrefix.
Then I found a similar kind of post here where the search field is set to be not_analyzed in the mapping definition (by #Russ Cam).
But I am using
package id="Elasticsearch.Net" version="7.6.0" targetFramework="net461"
package id="NEST" version="7.6.0" targetFramework="net461"
and might be for that reason the solution did not work.
Because If I pass "SOME", it matches with "SOME" and "SOME OTHER LOAN" which should not be the case (in my earlier post for "product value").
How can I do the same using NEST 7.6.0?
Well I'm not aware of how your current mapping looks. Also I don't know about NEST as well but I will explain
How to make Elastic Engine understand a field is not to be analyzed for an exact match?
by an example using elastic dsl.
For exact match (case sensitive) all you need to do is to define the field type as keyword. For a field of type keyword the data is indexed as it is without applying any analyzer and hence it is perfect for exact matching.
PUT test
{
"mappings": {
"properties": {
"field1": {
"type": "keyword"
}
}
}
}
Now lets index some docs
POST test/_doc/1
{
"field1":"SOME"
}
POST test/_doc/2
{
"field1": "SOME OTHER LOAN"
}
For exact matching we can use term query. Lets search for "SOME" and we should get document 1.
GET test/_search
{
"query": {
"term": {
"field1": "SOME"
}
}
}
O/P that we get:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6931472,
"_source" : {
"field1" : "SOME"
}
}
]
}
}
So the crux is make the field type as keyword and use term query.

ES curl for email is not returning correct results despite knowing that it does exist

I do a query for a term "owner" and a document showed the email for an owner. I figured to look at all Houses which have this email, to query for email instead of owner.
When I do the following curl request, It doesnt return any actual cases.
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"match": {"email": {"query": "test.user#gmail.com"}}}}'
it does not return the correct information. I wanted to find an exact result. I was also thinking to use the term:
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"term": {"email": "test.user#gmail.com"}}}'
in an attempt to find an exact match. This seems to return no document information. I am thinking that it might have something to do with the periods or maybe the # symbol.
I have also tried match when trying to wrap the email with escaped quotes, escaped periods.
Is there something going on I am unaware of with special characters?
Elasticsearch is not schema free, now they are calling it "schema on write" and that´s a very good name for the schema generation process. When elasticsearch recieves a new document with unknown fields, it tries an "educated guess".
When you index the first document with the field "email", elasticsearch will have a look on the value provided and create a mapping for this field.
The value "test.user#gmail.com" will then be mapped to "Text" mapping type.
Now, let´s see how elastic will process a simple document with a email. Create a document:
POST /auto_mapped_index/_doc
{"email": "nobody#example.com"}
Courious how the mapping look like? Here you go:
GET /auto_mapped_index/_mapping
Will be answered with:
{
"my_first_index" : {
"mappings" : {
"properties" : {
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
You see, the "type" : "text" is indicating the mapping type "text" as assumed before. And there is also a subfield "keyword", automatically created by elastic for text type fields by default.
We have 2 options now, the easy one is to query the keyword subfield (please note the dot notation):
GET /my_first_index/_search
{"query": {"term": {"email.keyword": "nobody#example.com"}}}
Done!
The other option is to create a specific mapping for our index. In order to do so, we need a new and empty index and define the mapping. We can do it with one shot:
PUT /my_second_index/
{
"mappings" : {
"properties" : {
"email" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
Now let us populate the index (here i´m putting two documents):
POST /my_second_index/_doc
{"email": "nobody#example.com"}
POST /my_second_index/_doc
{"email": "anybody#example.com"}
And now your unchanged query should work :
GET /my_second_index/_search
{"query": {"term": {"email": "anybody#example.com"}}}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_second_index",
"_type" : "_doc",
"_id" : "OTf3n28BpmGM8iQdGR4j",
"_score" : 0.2876821,
"_source" : {
"email" : "anybody#example.com"
}
}
]
}
}

Get from ElasticSearch why a result is a hit

In the ElasticSearch below I search for the word Balances in two fields name and notes:
GET /_search
{ "query": {
"multi_match": { "query": "Balances",
"fields": ["name","notes"]
}
}
}
And the result in the name field:
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.673515,
"hits" : [
{
"_index" : "idx",
"_type" : "_doc",
"_id" : "25",
"_score" : 1.673515,
"_source" : {
"name" : "Deposits checking accounts balances",
"notes" : "These are the notes",
"#timestamp" : "2019-04-18T21:05:00.387Z",
"id" : 25,
"#version" : "1"
}
}
]
}
Now, I want to know in which field ElasticSearch found the value. I could evaluate the result and see if the searched text is in name or notes, but I cannot do that if it's a fuzzy search.
Can ElasticSearch tell me in which field the text was found, and in addition provide a snippet with 5 words to the left and to the right of the result to tell the user why the result is a hit?
What I want to achieve is similar to Google highlighting in bold the text that was found within a phrase.
I think the 2 solutions in Find out which fields matched in a multi match query are still the valid solutions:
Highlight to find it.
Split the query up into multiple named match queries.

Elasticsearch index last update time

Is there a way to retrieve from ElasticSearch information on when a specific index was last updated?
My goal is to be able to tell when it was the last time that any documents were inserted/updated/deleted in the index. If this is not possible, is there something I can add in my index modification requests that will provide this information later on?
You can get the modification time from the _timestamp
To make it easier to return the timestamp you can set up Elasticsearch to store it:
curl -XPUT "http://localhost:9200/myindex/mytype/_mapping" -d'
{
"mytype": {
"_timestamp": {
"enabled": "true",
"store": "yes"
}
}
}'
If I insert a document and then query on it I get the timestamp:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?pretty' -d '{
> fields : ["_timestamp"],
> "query": {
> "query_string": { "query":"*"}
> }
> }'
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "myindex",
"_type" : "mytype",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"_timestamp" : 1417599223918
}
} ]
}
}
updating the existing document:
curl -XPOST "http://localhost:9200/myindex/mytype/1/_update" -d'
{
"doc" : {
"field1": "data",
"field2": "more data"
},
"doc_as_upsert" : true
}'
Re-running the previous query shows me an updated timestamp:
"fields" : {
"_timestamp" : 1417599620167
}
I don't know if there are people who are looking for an equivalent, but here is a workaround using shards stats for > Elasticsearch 5 users:
curl XGET http://localhost:9200/_stats?level=shards
As you'll see, you have some informations per indices, commits and/or flushs that you might use to see if the indice changed (or not).
I hope it will help someone.
Just looked into a solution for this problem. Recent Elasticsearch versions have a <index>/_recovery API.
This returns a list of shards and a field called stop_time_in_millis which looks like it is a timestamp for the last write to that shard.

ElasticSearch CouchDB Geo location

I am trying to get elasticsearch to index a couchdb river without luck.
I have a database 'pl1' with only one document '1' in it.
This is a printout of the entire document pretty-printed:
curl -XGET localhost:5984/pl1/1 | python -mjson.tool
{
"_id": "1",
"_rev": "1-0442f3962cffedc2238fcdb28dd77557",
"location": {
"geo_json": {
"coordinates": [
59.70141999133738,
14.162789164118708
],
"type": "point"
},
"lat": 14.162789164118708,
"lon": 59.70141999133738
}
}
I create a couchdb river and index with a catch-all type called all_entries the following way:
curl -XPUT 'localhost:9200/_river/pl1/_meta' -d '
{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"filter" : null,
"db" : "pl1"
},
"index" : {
"index" : "pl1",
"type" : "all_entries",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}'
{"ok":true,"_index":"_river","_type":"pl1","_id":"_meta","_version":1}
To test whether the document was indexed I perform the following query:
curl -XGET localhost:9200/pl1/all_entries/_count?pretty=true
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But then nothing. I can't figure out how to index the location using a geo_shape type (I have also tried with the different geo_point format for the data, and indexing that, but also no results)
How do I specify a mapper and query for this?

Resources