Sort on the basis of number of documents in ElasticSearch - elasticsearch

I am saving user relations in an ES index, e.g.:
{'id' => 1, 'User_id_1' => '2001', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 2, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 3, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1001'}
{'id' => 4, 'User_id_1' => '2003', 'relation' => 'friend', 'User_id_2' => '1003'}
Now suppose I want to get the user_id_2 that has the most friends.
In the above case it's 1002, as 2001 and 2002 are its friends (count = 2).
I just can't figure out the query.
Thanks.
EDIT:
Well, as suggested by @imotov, the terms facet is a very good choice, but
the problem is that I have 2 indexes:
the 1st index is for saving the main docs and the 2nd index is for saving the relations.
Now the problem is:
Suppose I have 100 user docs in my main index and only 50 of them have made relations, so only those 50 users appear in my relationship index.
So when I implement the terms facet, it sorts the results and gives the correct output, but I am missing the remaining 50 users who don't have any relations yet. I need them in my final output after the 50 sorted users.

First of all, we need to ensure that relationships saved in ES are unique. This can be done by replacing arbitrary ids with ids constructed from user_id_1, relation and user_id_2. We also need to make sure that the analyzer for the user id fields doesn't produce multiple tokens. If the ids are strings, they have to be indexed as not_analyzed. With these two conditions satisfied, we can simply use a terms facet on the field user_id_2 over a result list limited to relation:friend. This query retrieves the top user_id_2 ids sorted by number of occurrences in the index. Altogether it could look something like this:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
echo
curl -XGET 'http://localhost:9200/relationships/relation/_search?pretty=true&search_type=count' -d '{
"query": {
"term" : {
"relation" : "friend"
}
},
"facets" : {
"popular" : {
"terms" : {
"field" : "user_id_2"
}
}
}
}'
Please note that, due to the distributed nature of facet calculation, counts reported by the facet query might be lower than the actual number of records if multiple shards are used. See elasticsearch issue 1832.
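To see why multi-shard counts can come out low, here is a deliberately simplified Python model, not Elasticsearch's actual implementation: each shard reports only its own top `size` terms and the coordinating node sums what it receives. The function names and the sample data are made up for illustration.

```python
from collections import Counter

def shard_top_terms(values, size):
    """Top `size` terms as a single shard would report them."""
    return Counter(values).most_common(size)

def merged_facet(shards, size):
    """Sum the per-shard top lists; terms a shard left out contribute nothing."""
    merged = Counter()
    for values in shards:
        for term, count in shard_top_terms(values, size):
            merged[term] += count
    return merged.most_common(size)

# Hypothetical user_id_2 values as they might be spread over two shards.
shards = [
    ["1002", "1002", "1002", "1001", "1001", "1003"],
    ["1003", "1003", "1003", "1002", "1002", "1001"],
]

reported = dict(merged_facet(shards, size=2))
actual = dict(Counter(v for s in shards for v in s).most_common(2))
print("reported:", reported)
print("actual:  ", actual)
```

In this toy data the first shard never reports its single 1003 document, so the merged count for 1003 is one lower than the true count. Using a single shard, or requesting more terms than you need, avoids the effect in small tests.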
EDIT:
There are two solutions for the edited question. One solution is to use a facet on two fields:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users -d '{
"mappings" : {
"user" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users/user/1001 -d '{"user_id": 1001}'
curl -XPUT http://localhost:9200/users/user/1002 -d '{"user_id": 1002}'
curl -XPUT http://localhost:9200/users/user/1003 -d '{"user_id": 1003}'
curl -XPUT http://localhost:9200/users/user/1004 -d '{"user_id": 1004}'
curl -XPUT http://localhost:9200/users/user/1005 -d '{"user_id": 1005}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
curl -XPOST http://localhost:9200/users/_refresh
echo
curl -XGET 'http://localhost:9200/relationships,users/_search?pretty=true&search_type=count' -d '{
"query": {
"indices" : {
"indices" : ["relationships"],
"query" : {
"filtered" : {
"query" : {
"term" : {
"relation" : "friend"
}
},
"filter" : {
"type" : {
"value" : "relation"
}
}
}
},
"no_match_query" : {
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"type" : {
"value" : "user"
}
}
}
}
}
},
"facets" : {
"popular" : {
"terms" : {
"fields" : ["user_id", "user_id_2"]
}
}
}
}'
Another solution is to add a "self" relation to the relationships index for every user when the user is created. I would prefer the second solution since it seems to be less complicated.
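The "self" relation idea can be sketched in memory with plain Python, without any Elasticsearch calls. The helper names are hypothetical, and the sketch assumes the facet is run without the relation:friend restriction, so every user gets at least one hit and the friend count is the facet count minus one.

```python
from collections import Counter

def relation_id(user_id_1, relation, user_id_2):
    """Deterministic document id, so indexing the same relation twice overwrites it."""
    return f"{user_id_1}-{relation}-{user_id_2}"

def build_docs(users, friendships):
    """Simulate the relationships index as a dict keyed by document id."""
    docs = {}
    for user in users:
        # One "self" relation per user, added when the user is created.
        docs[relation_id(user, "self", user)] = {
            "user_id_1": user, "relation": "self", "user_id_2": user}
    for user_id_1, user_id_2 in friendships:
        docs[relation_id(user_id_1, "friend", user_id_2)] = {
            "user_id_1": user_id_1, "relation": "friend", "user_id_2": user_id_2}
    return docs

def friend_counts(docs):
    """Count documents per user_id_2 and subtract the self relation."""
    counts = Counter(doc["user_id_2"] for doc in docs.values())
    ranked = [(user, n - 1) for user, n in counts.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

users = ["1001", "1002", "1003", "1004", "1005"]
friendships = [("2001", "1002"), ("2002", "1002"), ("2002", "1001"),
               ("2003", "1003"), ("2001", "1002")]  # the last pair is a duplicate
ranking = friend_counts(build_docs(users, friendships))
print(ranking)
```

Users without friends (1004 and 1005 here) end up at the bottom with a count of 0, and the duplicate friendship does not inflate the count because it maps to the same id.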

Related

Date range search in Elassandra

I have created a index like below.
curl -XPUT -H 'Content-Type: application/json' 'http://x.x.x.x:9200/date_index' -d '{
"settings" : { "keyspace" : "keyspace1"},
"mappings" : {
"table1" : {
"discover":"sent_date",
"properties" : {
"sent_date" : { "type": "date", "format": "yyyy-MM-dd HH:mm:ssZZ" }
}
}
}
}'
I need to search for results within a date range, for example "from" : "2039-05-07 11:22:34+0000", "to" : "2039-05-07 11:22:34+0000", both inclusive.
I am trying it like this:
curl -XGET -H 'Content-Type: application/json' 'http://x.x.x.x:9200/date_index/_search?pretty=true' -d '
{
"query" : {
"aggregations" : {
"date_range" : {
"sent_date" : {
"from" : "2039-05-07 11:22:34+0000",
"to" : "2039-05-07 11:22:34+0000"
}
}
}
}
}'
I am getting the error below.
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "no [query] registered for [aggregations]",
"line" : 4,
"col" : 22
}
],
"type" : "parsing_exception",
"reason" : "no [query] registered for [aggregations]",
"line" : 4,
"col" : 22
},
"status" : 400
Please advise.
The query seems to be malformed. Please see the date range aggregation documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html and note the differences:
you're introducing a query without defining any - do you need one?
aggregations belong at the top level of the request body, under aggs (or aggregations), not inside query; that nesting is what triggers "no [query] registered for [aggregations]"
you should name your aggregation, and give date_range a field and a list of ranges
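As a rough sketch of those points, the request bodies could be assembled like this in Python. The aggregation name sent_in_range and the helper function names are invented for illustration; the field name and the dates come from the question. Because a date_range bucket includes its from value but excludes its to value, a range query with gte/lte is shown as one possible way to get the both-inclusive behaviour.

```python
import json

def date_range_agg_body(field, start, end, name="sent_in_range"):
    """Top-level named date_range aggregation; no query section needed."""
    return {
        "size": 0,  # only the buckets are of interest, not the hits
        "aggs": {
            name: {
                "date_range": {
                    "field": field,
                    "ranges": [{"from": start, "to": end}],  # `to` is exclusive
                }
            }
        },
    }

def inclusive_range_query_body(field, start, end):
    """Range query that includes both endpoints."""
    return {"query": {"range": {field: {"gte": start, "lte": end}}}}

start = end = "2039-05-07 11:22:34+0000"
agg_body = date_range_agg_body("sent_date", start, end)
query_body = inclusive_range_query_body("sent_date", start, end)
print(json.dumps(agg_body, indent=2))
print(json.dumps(query_body, indent=2))
```

Either printed body can be passed to curl with -d against date_index/_search. Whether the date strings parse depends on the format declared in the mapping.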

Elasticsearch How do I get a metadata using Image Plugin

I defined metadata in the mapping of the Elasticsearch image plugin.
Mapping:
"photo" : {
"mappings" : {
"scenery" : {
"properties" : {
"my_img" : {
"type" : "image",
"feature" : {"FCTH" : { }, ... },
"metadata" : {
"jpeg.image_height" : {"type" : "string","store" : true},
"jpeg.image_width" : {"type" : "string","store" : true}
}
}
}
}
}
}
After indexing, a search does not return the metadata.
How do I get the metadata?
I tried:
curl -XPOST 'localhost:9200/photo/scenery/_search' -d '{
"query":{
"image":{
"my_img":{
"feature":"CEDD",
"index":"photo",
"type":"scenery",
"id":"0",
"path":"my_img",
"hash":"BIT_SAMPLING"
}
}
}
}'
Result:
{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":5,"max_score":1.0,"hits":[{"_index":"photo","_type":"scenery","_id":"0","_score":1.0, "_source" : {"file_name": "376423.jpg", "my_img": "/9j/4AAQSkZJRgABAQ...
The search probably returns only the original data (the base64-encoded image) in the _source field. Use the fields option instead.
Try this query:
curl -XPOST 'localhost:9200/photo/scenery/_search' -d '{
"query":{
...
},
"fields": ["my_img.metadata.jpeg.image_height","my_img.metadata.jpeg.image_width" ]
}'

elasticsearch: indexing & searching arabic text

When I put the following things into elasticsearch-1.0.1 I expect the search queries to return the posts with id 200 and 201. But I get nothing returned. Hits:0.
What am I doing wrong? I'm searching exactly for what I put in, but get nothing out... (here's the test code for download: http://petoria.de/tmp/arabictest.sh).
But please keep in mind: I want to use the Arabic analyzer, because I want to develop my own analyzer later.
Best,
Koem
curl -XPOST localhost:9200/posts -d '{
"settings" : {
"number_of_shards" : 1
}
}'
curl -XPOST localhost:9200/posts/post/_mapping -d '
{
"post" : {
"properties" : {
"arabic_text" : { "type" : "string", "index" : "analyzed", "store" : true, "analyzer" : "arabic" },
"english_text" : { "type" : "string", "index" : "analyzed", "store" : true, "analyzer" : "english" }
}
}
}'
curl -XPUT 'http://localhost:9200/posts/post/200' -d '{
"english_text" : "palestinian1",
"arabic_text" : "فلسطينيه"
}'
curl -XPUT 'http://localhost:9200/posts/post/201' -d '{
"english_text" : "palestinian2",
"arabic_text" : "الفلسطينية"
}'
search for palestinian1
curl -XGET 'http://localhost:9200/posts/post/_search' -d '{
"query": {
"query_string" : {
"analyzer" : "arabic",
"query" : "فلسطينيه"
}
}
}'
search for palestinian2
curl -XGET 'http://localhost:9200/posts/post/_search' -d '{
"query": {
"query_string" : {
"analyzer" : "arabic",
"query" : "الفلسطينية"
}
}
}'
Just add the encoding to your request. You can do it by specifying the "Content-Type" header as below:
Content-Type:text/html;charset=UTF-8
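One way to check what actually goes over the wire is to build the body in Python and encode it explicitly. This is only a sketch: the application/json media type is an assumption (the answer above only specifies the charset part), the function name is made up, and nothing is sent to a server here.

```python
import json

def encode_search_body(text, analyzer="arabic"):
    """Build the query_string search body and encode it as UTF-8 bytes."""
    body = {"query": {"query_string": {"analyzer": analyzer, "query": text}}}
    # ensure_ascii=False keeps the Arabic characters instead of \uXXXX escapes.
    payload = json.dumps(body, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=UTF-8"}  # assumed media type
    return payload, headers

payload, headers = encode_search_body("فلسطينيه")
decoded = json.loads(payload.decode("utf-8"))
print(headers["Content-Type"], len(payload), "bytes")
```

If the decoded body matches the original text, the request itself is not mangling the characters and the problem lies elsewhere.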

ElasticSearch CouchDB Geo location

I am trying to get elasticsearch to index a couchdb river without luck.
I have a database 'pl1' with only one document '1' in it.
This is a printout of the entire document pretty-printed:
curl -XGET localhost:5984/pl1/1 | python -mjson.tool
{
"_id": "1",
"_rev": "1-0442f3962cffedc2238fcdb28dd77557",
"location": {
"geo_json": {
"coordinates": [
59.70141999133738,
14.162789164118708
],
"type": "point"
},
"lat": 14.162789164118708,
"lon": 59.70141999133738
}
}
I create a couchdb river and index with a catch-all type called all_entries the following way:
curl -XPUT 'localhost:9200/_river/pl1/_meta' -d '
{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"filter" : null,
"db" : "pl1"
},
"index" : {
"index" : "pl1",
"type" : "all_entries",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}'
{"ok":true,"_index":"_river","_type":"pl1","_id":"_meta","_version":1}
To test whether the document was indexed I perform the following query:
curl -XGET localhost:9200/pl1/all_entries/_count?pretty=true
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But then nothing. I can't figure out how to index the location using a geo_shape type (I have also tried the geo_point format for the data and indexing that, but also got no results).
How do I specify a mapper and query for this?

How can I boost certain fields over others in elasticsearch?

My goal is to apply the boost to field "name" (see example below), but I have two problems when I search for "john":
search is also matching {name: "dany", message: "hi bob"} when name is "dany" and
search is not boosting name over message (rows with name="john" should be on the top)
The gist is on https://gist.github.com/tomaspet262/5535774
(since stackoverflow's form submit returned 'Your post appears to contain code that is not properly formatted as code', which was formatted properly).
I would suggest using query time boosting instead of index time boosting.
# DELETE
curl -XDELETE 'http://localhost:9200/test'
echo
# CREATE
curl -XPUT 'http://localhost:9200/test?pretty=1' -d '{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyz_1" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}'
echo
# DEFINE
curl -XPUT 'http://localhost:9200/test/posts/_mapping?pretty=1' -d '{
"posts" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "my_analyz_1"
},
"message" : {
"type" : "string",
"analyzer" : "my_analyz_1"
}
}
}
}'
echo
# INSERT
curl -XPUT localhost:9200/test/posts/1 -d '{"name": "john", "message": "hi john"}'
curl -XPUT localhost:9200/test/posts/2 -d '{"name": "bob", "message": "hi john, how are you?"}'
curl -XPUT localhost:9200/test/posts/3 -d '{"name": "john", "message": "bob?"}'
curl -XPUT localhost:9200/test/posts/4 -d '{"name": "dany", "message": "hi bob"}'
curl -XPUT localhost:9200/test/posts/5 -d '{"name": "dany", "message": "hi john"}'
echo
# REFRESH
curl -XPOST localhost:9200/test/_refresh
echo
# SEARCH
curl "localhost:9200/test/posts/_search?pretty=1" -d '{
"query": {
"multi_match": {
"query": "john",
"fields": ["name^2", "message"]
}
}
}'
I'm not sure if this is relevant in this case, but when testing with such small amounts of data, I always use 1 shard instead of the default settings to rule out issues caused by distributed calculation.
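For reference, the query-time boost and the single-shard setting can be generated from a small Python helper like the one below. It is only a sketch with hypothetical function names; it prints JSON bodies and does not talk to Elasticsearch.

```python
import json

def boosted_fields(boosts):
    """Turn {"name": 2, "message": 1} into ["name^2", "message"]."""
    return [field if boost == 1 else f"{field}^{boost:g}"
            for field, boost in boosts.items()]

def multi_match_body(text, boosts):
    """multi_match query with per-field boosts applied at query time."""
    return {"query": {"multi_match": {"query": text,
                                      "fields": boosted_fields(boosts)}}}

# Index settings with one shard, as suggested for small test data sets.
single_shard_settings = {"settings": {"number_of_shards": 1}}

search_body = multi_match_body("john", {"name": 2, "message": 1})
print(json.dumps(single_shard_settings))
print(json.dumps(search_body))
```

Keeping the boosts in a dictionary makes it easy to try different weights without re-indexing, which is the main advantage of query-time boosting over index-time boosting.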

Resources