ElasticSearch CouchDB Geo location - elasticsearch

I am trying to get Elasticsearch to index a CouchDB database through a river, without luck.
I have a database 'pl1' with only one document '1' in it.
This is a printout of the entire document pretty-printed:
curl -XGET localhost:5984/pl1/1 | python -mjson.tool
{
"_id": "1",
"_rev": "1-0442f3962cffedc2238fcdb28dd77557",
"location": {
"geo_json": {
"coordinates": [
59.70141999133738,
14.162789164118708
],
"type": "point"
},
"lat": 14.162789164118708,
"lon": 59.70141999133738
}
}
I create a couchdb river and index with a catch-all type called all_entries the following way:
curl -XPUT 'localhost:9200/_river/pl1/_meta' -d '
{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"filter" : null,
"db" : "pl1"
},
"index" : {
"index" : "pl1",
"type" : "all_entries",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}'
{"ok":true,"_index":"_river","_type":"pl1","_id":"_meta","_version":1}
To test whether the document was indexed I perform the following query:
curl -XGET localhost:9200/pl1/all_entries/_count?pretty=true
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But I get no further than that. I can't figure out how to index the location as a geo_shape type. (I have also tried storing the data in the geo_point format and indexing that, but got no results either.)
How do I specify a mapper and query for this?
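One possible direction, sketched in Python as request bodies only: map location.geo_json as a geo_shape and search it with an envelope (bounding box). This assumes the mapping is put on the pl1 index (for example via PUT /pl1/all_entries/_mapping) before the river indexes any document, since a field that has already been mapped dynamically cannot be remapped. The function names are hypothetical helpers, not part of any client library.

```python
import json


def geo_shape_mapping(doc_type="all_entries"):
    """Mapping body that treats location.geo_json as a geo_shape.

    Assumption: applied before any document of this type is indexed.
    """
    return {
        doc_type: {
            "properties": {
                "location": {
                    "properties": {
                        "geo_json": {"type": "geo_shape"}
                    }
                }
            }
        }
    }


def envelope_query(min_lon, min_lat, max_lon, max_lat):
    """geo_shape query for shapes that intersect a bounding box.

    An envelope lists the upper-left corner first, then the lower-right,
    each as [lon, lat].
    """
    return {
        "query": {
            "geo_shape": {
                "location.geo_json": {
                    "shape": {
                        "type": "envelope",
                        "coordinates": [[min_lon, max_lat], [max_lon, min_lat]],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # A box around the point stored in document '1' ([59.70..., 14.16...]).
    print(json.dumps(geo_shape_mapping(), indent=2))
    print(json.dumps(envelope_query(59.0, 14.0, 60.0, 15.0), indent=2))
```

The printed bodies can be sent with curl -XPUT and curl -XGET respectively. Note that GeoJSON coordinates are ordered [lon, lat], which is how the document above stores them.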

Related

ES curl for email is not returning correct results despite knowing that it does exist

I ran a query for the term "owner", and one of the returned documents showed the owner's email. I then wanted to find all Houses that have this email, so I queried on email instead of owner.
When I run the following curl request, it doesn't return any of the expected documents.
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"match": {"email": {"query": "test.user#gmail.com"}}}}'
It does not return the correct information. I want an exact match, so I also tried a term query:
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"term": {"email": "test.user#gmail.com"}}}'
This returns no documents at all. I suspect it has something to do with the periods or maybe the # symbol.
I have also tried match with the email wrapped in escaped quotes, and with escaped periods.
Is there something going on I am unaware of with special characters?
Elasticsearch is not schema-free; this is now called "schema on write", which is a very good name for the schema generation process. When Elasticsearch receives a new document with unknown fields, it makes an "educated guess".
When you index the first document with the field "email", Elasticsearch looks at the value provided and creates a mapping for this field.
The value "test.user#gmail.com" is then mapped to the "text" mapping type.
Now, let's see how Elasticsearch processes a simple document with an email. Create a document:
POST /my_first_index/_doc
{"email": "nobody#example.com"}
Curious what the mapping looks like? Here you go:
GET /my_first_index/_mapping
The response is:
{
"my_first_index" : {
"mappings" : {
"properties" : {
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
You see, "type" : "text" indicates the mapping type "text", as assumed above. There is also a subfield "keyword", which Elasticsearch creates automatically for text fields by default.
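To see why a term query for the whole address finds nothing on the "text" field, here is a deliberately crude Python stand-in for the default analyzer. The regular expression is only an approximation chosen for this illustration (lowercase, then runs of letters and digits with single dots allowed inside); it is not Elasticsearch's real tokenizer.

```python
import re


def rough_standard_tokens(text):
    """Crude approximation of how a text field is split into terms.

    Not the real analyzer: just lowercases and keeps runs of letters and
    digits, allowing single dots between them.
    """
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


email = "nobody#example.com"
text_terms = rough_standard_tokens(email)  # roughly what the "text" field indexes
keyword_terms = [email]                    # what the "keyword" subfield indexes

# A term query compares its value against the indexed terms as-is.
print(email in text_terms)     # the whole address is not one of the text terms
print(email in keyword_terms)  # but it is the single keyword term
```

Under this approximation the text field only knows the pieces of the address, which is also why a match query can return other documents that merely share the domain part.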
We have two options now. The easy one is to query the keyword subfield (note the dot notation):
GET /my_first_index/_search
{"query": {"term": {"email.keyword": "nobody#example.com"}}}
Done!
The other option is to create a specific mapping for our index. To do so, we need a new, empty index with the mapping defined. We can do both in one shot:
PUT /my_second_index/
{
"mappings" : {
"properties" : {
"email" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
Now let us populate the index (here I'm adding two documents):
POST /my_second_index/_doc
{"email": "nobody#example.com"}
POST /my_second_index/_doc
{"email": "anybody#example.com"}
And now your unchanged query should work:
GET /my_second_index/_search
{"query": {"term": {"email": "anybody#example.com"}}}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_second_index",
"_type" : "_doc",
"_id" : "OTf3n28BpmGM8iQdGR4j",
"_score" : 0.2876821,
"_source" : {
"email" : "anybody#example.com"
}
}
]
}
}
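The two options can be folded into one small helper. This is a minimal Python sketch that assumes you already have the "properties" part of a mapping response as a dict; exact_match_field and exact_match_query are hypothetical names, not part of any Elasticsearch client.

```python
def exact_match_field(properties, field):
    """Pick the field name to use in a term query for an exact match."""
    mapping = properties[field]
    if mapping.get("type") == "keyword":
        # Explicit keyword mapping: query the field itself.
        return field
    if mapping.get("fields", {}).get("keyword", {}).get("type") == "keyword":
        # Auto-mapped text field: query the keyword subfield.
        return field + ".keyword"
    raise ValueError(f"no keyword mapping available for {field!r}")


def exact_match_query(properties, field, value):
    """Build a term query body against the right field."""
    return {"query": {"term": {exact_match_field(properties, field): value}}}


# The two mappings shown above, reduced to their "properties" part.
auto_mapped = {
    "email": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}
explicit = {"email": {"type": "keyword", "ignore_above": 256}}

print(exact_match_query(auto_mapped, "email", "nobody#example.com"))
print(exact_match_query(explicit, "email", "anybody#example.com"))
```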

Elasticsearch index last update time

Is there a way to retrieve from ElasticSearch information on when a specific index was last updated?
My goal is to be able to tell when it was the last time that any documents were inserted/updated/deleted in the index. If this is not possible, is there something I can add in my index modification requests that will provide this information later on?
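You can get the modification time from the _timestamp field (deprecated in Elasticsearch 2.0 and removed in 5.0, so this applies to older versions only).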
To make it easier to return the timestamp you can set up Elasticsearch to store it:
curl -XPUT "http://localhost:9200/myindex/mytype/_mapping" -d'
{
"mytype": {
"_timestamp": {
"enabled": "true",
"store": "yes"
}
}
}'
If I insert a document and then query on it I get the timestamp:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?pretty' -d '{
"fields" : ["_timestamp"],
"query": {
"query_string": { "query":"*"}
}
}'
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "myindex",
"_type" : "mytype",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"_timestamp" : 1417599223918
}
} ]
}
}
Updating the existing document:
curl -XPOST "http://localhost:9200/myindex/mytype/1/_update" -d'
{
"doc" : {
"field1": "data",
"field2": "more data"
},
"doc_as_upsert" : true
}'
Re-running the previous query shows me an updated timestamp:
"fields" : {
"_timestamp" : 1417599620167
}
I don't know if anyone is still looking for an equivalent, but here is a workaround using shard stats for users of Elasticsearch 5 and later:
curl -XGET 'http://localhost:9200/_stats?level=shards'
As you'll see, it returns some information per index, including commits and/or flushes, that you can use to see whether the index has changed.
I hope it helps someone.
Just looked into a solution for this problem. Recent Elasticsearch versions have a <index>/_recovery API.
This returns a list of shards and a field called stop_time_in_millis which looks like it is a timestamp for the last write to that shard.
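For the second half of the question (adding something to the modification requests), one option is to stamp every document yourself and ask for the maximum of that field. A minimal Python sketch follows; the field name updated_at and both function names are assumptions made for illustration, and deletions are not captured this way because a deleted document carries no stamp.

```python
from datetime import datetime, timezone


def stamp(doc, now=None):
    """Return a copy of doc with an 'updated_at' field in epoch milliseconds.

    'updated_at' is a hypothetical field name chosen for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    stamped = dict(doc)
    stamped["updated_at"] = int(now.timestamp() * 1000)
    return stamped


def last_update_query(field="updated_at"):
    """Search body that returns no hits, only the maximum stamp value."""
    return {"size": 0, "aggs": {"last_update": {"max": {"field": field}}}}


if __name__ == "__main__":
    print(stamp({"field1": "data", "field2": "more data"}))
    print(last_update_query())
```

Index the stamped document instead of the raw one on every insert or update, then send the query body to the index's _search endpoint and read the aggregation value.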

Elasticsearch river - no _meta document found after 5 attempts

I am using Elasticsearch version 1.3.0. When I create a river using the wikipedia plugin version 2.3.0 like this:
PUT _river/my_river/_meta -d
{
"type" : "wikipedia",
"wikipedia" : {
"url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
},
"index" : {
"index" : "wikipedia",
"type" : "wiki",
"bulk_size" : 1000,
"max_concurrent_bulk" : 3
}
}
the server responds with this message:
{
"_index": "_river",
"_type": "my_river",
"_id": "_meta -d",
"_version": 1,
"created": true
}
However, I don't see the Wikipedia documents when I run a search. Also, when I restart my server I get: river-routing no _meta document found after 5 attempts
Remove the -d at the end, as it creates a document named "_meta -d" instead of _meta.
PUT _river/my_river/_meta
{
"type" : "wikipedia",
"wikipedia" : {
"url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
},
"index" : {
"index" : "wikipedia",
"type" : "wiki",
"bulk_size" : 1000,
"max_concurrent_bulk" : 3
}
}
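The underlying point is that -d is a curl flag for the request body and must never end up in the document path. A small Python sketch using only the standard library makes that separation explicit; river_meta_request is a hypothetical helper name, and the request is only built here, not sent.

```python
import json
import urllib.request


def river_meta_request(river_name, meta, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that registers a river.

    The document id in the path is exactly '_meta'; the settings travel
    in the request body.
    """
    url = f"{base_url}/_river/{river_name}/_meta"
    body = json.dumps(meta).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = river_meta_request("my_river", {"type": "wikipedia"})
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it to a running node.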

Join query in ElasticSearch

Is there any way (query) to join the 2 JSON documents below in Elasticsearch?
{
product_id: "1111",
price: "23.56",
stock: "100"
}
{
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}
The 2 JSON documents above are processed (input) under 2 different types in Logstash, so they are indexed under different values of the 'type' field in Elasticsearch.
What I want is to join the 2 documents on the product_id field.
It depends on what you mean by JOIN. Elasticsearch is not like a regular database that supports JOINs between tables. It is a text search engine that manages documents within indexes.
On the other hand, you can search within the same index over multiple types using fields that are common to every type.
For example, taking your data, I can create an index with 2 types and their data as follows:
curl -XPOST localhost:9200/product -d '{
"settings" : {
"number_of_shards" : 5
}
}'
curl -XPOST localhost:9200/product/type1/_mapping -d '{
"type1" : {
"properties" : {
"product_id" : { "type" : "string" },
"price" : { "type" : "integer" },
"stock" : { "type" : "integer" }
}
}
}'
curl -XPOST localhost:9200/product/type2/_mapping -d '{
"type2" : {
"properties" : {
"product_id" : { "type" : "string" },
"category" : { "type" : "string" },
"manufacturer" : { "type" : "string" }
}
}
}'
curl -XPOST localhost:9200/product/type1/1 -d '{
product_id: "1111",
price: "23",
stock: "100"
}'
curl -XPOST localhost:9200/product/type2/1 -d '{
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}'
I effectively created one index called product with 2 types, type1 and type2.
Now I can do the following query and it will return both documents:
curl -XGET 'http://localhost:9200/product/_search?pretty=1' -d '{
"query": {
"query_string" : {
"query" : "product_id:1111"
}
}
}'
{
"took" : 95,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.5945348,
"hits" : [ {
"_index" : "product",
"_type" : "type1",
"_id" : "1",
"_score" : 0.5945348, "_source" : {
product_id: "1111",
price: "23",
stock: "100"
}
}, {
"_index" : "product",
"_type" : "type2",
"_id" : "1",
"_score" : 0.5945348, "_source" : {
product_id: "1111",
category: "iPhone case",
manufacturer: "Belkin"
}
} ]
}
}
The reason is that Elasticsearch searches over all documents within that index regardless of their type. This is still different from a JOIN, in the sense that Elasticsearch does not compute a Cartesian product of the documents that belong to each type.
Hope that helps
isaac.hazan's answer works quite well, but I would like to add a few points that helped me with this kind of situation:
I landed on this page when I was trying to solve a similar problem, in that I had to exclude multiple records of one index based on documents of another index. The lack of relationships is one of the main downsides of unstructured databases.
The elasticsearch documentation page on Handling Relationships explains a lot.
Four common techniques are used to manage relational data in Elasticsearch:
Application-side joins
Data denormalization
Nested objects
Parent/child relationships
Often the final solution will require a mixture of a few of these techniques.
I've used nested objects and application-side joins, mostly. While using the same field name could momentarily solve the problem, I think it is better to rethink and create best-suited mapping for your application.
For instance, you might find that you want to list all products with price greater than x, or list all products that are not in stock anymore. To deal with such scenarios it helps if you are using one of the solutions mentioned above.
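As an illustration of the first technique, an application-side join on product_id can be as small as the Python sketch below. It assumes both result sets have already been fetched from Elasticsearch into lists of dicts; join_on is a hypothetical helper name.

```python
def join_on(left_docs, right_docs, key="product_id"):
    """Inner join of two lists of dicts on a shared key."""
    # Index the right-hand side by key so each lookup is cheap.
    right_by_key = {}
    for doc in right_docs:
        right_by_key.setdefault(doc[key], []).append(doc)

    joined = []
    for left in left_docs:
        for right in right_by_key.get(left[key], []):
            # Fields from the right document win on a name clash.
            joined.append({**left, **right})
    return joined


prices = [{"product_id": "1111", "price": "23.56", "stock": "100"}]
details = [{"product_id": "1111", "category": "iPhone case", "manufacturer": "Belkin"}]

merged = join_on(prices, details)
print(merged)
```

This keeps the index layout untouched, at the cost of two queries and some memory on the client.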
To perform joins on Elasticsearch take a look at the Siren "Federate" plugin. It adds join capabilities by extending the Elasticsearch native query syntax.
https://siren.io/federate/

Sort on the basis of number of documents in ElasticSearch

I am saving user relations in an ES index, i.e.:
{'id' => 1, 'User_id_1' => '2001', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 2, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 3, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1001'}
{'id' => 4, 'User_id_1' => '2003', 'relation' => 'friend', 'User_id_2' => '1003'}
Now suppose I want to get the user_id_2 that has the most friends.
In the above case it is 1002, as 2001 and 2002 are its friends (count = 2).
I just can't figure out the query.
Thanks.
EDIT:
Well, as suggested by @imotov, a terms facet is a very good choice, but
the problem is that I have 2 indexes:
the 1st index stores the main docs and the 2nd index stores the relations.
Now the problem is this:
suppose I have 100 USER docs in my main index and only 50 of them have made relations, so I have only 50 USER docs in my relationship index.
So when I implement the terms facet, it sorts the results and gives the correct output, but I am missing the remaining 50 users who don't have any relations yet. I need them in my final output, after the 50 sorted users.
First of all, we need to ensure that the relationships saved in ES are unique. This can be done by replacing arbitrary ids with ids constructed from user_id_1, relation and user_id_2. We also need to make sure that the analyzer for the user ids doesn't produce multiple tokens: if the ids are strings, they have to be indexed as not_analyzed. With these two conditions satisfied, we can simply use a terms facet query on the field user_id_2 over the result list limited by relation:friend. This query retrieves the top user_id_2 ids sorted by number of occurrences in the index. All together it could look something like this:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
echo
curl -XGET 'http://localhost:9200/relationships/relation/_search?pretty=true&search_type=count' -d '{
"query": {
"term" : {
"relation" : "friend"
}
},
"facets" : {
"popular" : {
"terms" : {
"field" : "user_id_2"
}
}
}
}'
Please note that, due to the distributed nature of facet calculation, counts reported by the facet query might be lower than the actual number of records if multiple shards are used. See Elasticsearch issue 1832.
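Facets were removed in Elasticsearch 2.0 in favor of aggregations, so on newer versions the same idea would be expressed as a terms aggregation. The Python sketch below builds such a request body and, separately, reproduces the counting locally with collections.Counter on the four sample relations, to show what the result should look like. The function name is hypothetical.

```python
from collections import Counter


def popular_friends_body(size=10):
    """Search body: count 'friend' relations per user_id_2, no hits returned."""
    return {
        "size": 0,
        "query": {"term": {"relation": "friend"}},
        "aggs": {"popular": {"terms": {"field": "user_id_2", "size": size}}},
    }


# The sample relations from the question, counted client-side.
relations = [
    {"user_id_1": "2001", "relation": "friend", "user_id_2": "1002"},
    {"user_id_1": "2002", "relation": "friend", "user_id_2": "1002"},
    {"user_id_1": "2002", "relation": "friend", "user_id_2": "1001"},
    {"user_id_1": "2003", "relation": "friend", "user_id_2": "1003"},
]
counts = Counter(r["user_id_2"] for r in relations if r["relation"] == "friend")
top = counts.most_common()

print(popular_friends_body())
print(top)
```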
EDIT:
There are two solutions for the edited question. One solution is to use facet on two fields:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users -d '{
"mappings" : {
"user" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users/user/1001 -d '{"user_id": 1001}'
curl -XPUT http://localhost:9200/users/user/1002 -d '{"user_id": 1002}'
curl -XPUT http://localhost:9200/users/user/1003 -d '{"user_id": 1003}'
curl -XPUT http://localhost:9200/users/user/1004 -d '{"user_id": 1004}'
curl -XPUT http://localhost:9200/users/user/1005 -d '{"user_id": 1005}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
curl -XPOST http://localhost:9200/users/_refresh
echo
curl -XGET 'http://localhost:9200/relationships,users/_search?pretty=true&search_type=count' -d '{
"query": {
"indices" : {
"indices" : ["relationships"],
"query" : {
"filtered" : {
"query" : {
"term" : {
"relation" : "friend"
}
},
"filter" : {
"type" : {
"value" : "relation"
}
}
}
},
"no_match_query" : {
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"type" : {
"value" : "user"
}
}
}
}
}
},
"facets" : {
"popular" : {
"terms" : {
"fields" : ["user_id", "user_id_2"]
}
}
}
}'
Another solution is to add a "self" relation to the relationships index for every user when the user is created. I would prefer the second solution since it seems to be less complicated.
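The effect the second solution aims for can be simulated client-side: every user starts with a baseline entry (playing the role of the "self" relation), friend relations are counted on top of it, and users without relations end up after the sorted ones. This is a minimal Python sketch on the sample data, with rank_users as a hypothetical helper name; it is not a query against Elasticsearch.

```python
def rank_users(all_user_ids, relations):
    """Rank users by friend count, keeping users that have no relations."""
    # Baseline entry per user, like a 'self' relation that counts as zero.
    counts = {uid: 0 for uid in all_user_ids}
    for rel in relations:
        if rel["relation"] == "friend" and rel["user_id_2"] in counts:
            counts[rel["user_id_2"]] += 1
    # Highest count first; ties broken by user id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


users = ["1001", "1002", "1003", "1004", "1005"]
sample_relations = [
    {"user_id_1": "2001", "relation": "friend", "user_id_2": "1002"},
    {"user_id_1": "2002", "relation": "friend", "user_id_2": "1002"},
    {"user_id_1": "2002", "relation": "friend", "user_id_2": "1001"},
    {"user_id_1": "2003", "relation": "friend", "user_id_2": "1003"},
]

ranking = rank_users(users, sample_relations)
print(ranking)
```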

Resources