Elasticsearch TermsFacet giving wrong count - elasticsearch

I had a problem with the Elasticsearch terms facet.
I put data in as follows:
curl -X DELETE "http://localhost:9200/articles"
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "One", "tags" : "foo","datetime":"2005-12-23 23:10:52"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Two", "tags" : "bar","datetime":"2005-12-23 23:10:53"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Three", "tags" : "baz","datetime":"2005-12-23 23:10:54"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "four", "tags" : "baz","datetime":"2005-12-23 23:10:55"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "five", "tags" : "foo","datetime":"2005-12-23 23:10:56"}'
Whenever I query for the terms facet it gives the correct result. The following is my Elasticsearch query:
curl 'http://localhost:9200/articles/article/_search?pretty=true' -d '{
"query": {
"match_all": {}
},
"facets" : { "myfacet" : { "terms" : {"field" : "tags"}}
}
}'
But when I add a filter to the facet, it doesn't show any facet counts. The following is the query:
curl 'http://localhost:9200/articles/article/_search?pretty=true' -d '{
"query": {
"match_all": {}
},
"facets" : {
"myfacet" : { "terms" : {"field" : "tags"},
"filter" : { "range" :{
"datetime" : {"from" : "2005-12-23 3:10:52","to" : "2005-12-23 23:10:56" }
}
}
}
}
}'
I get the result as follows:
"facets" : {
"myfacet" : {
"_type" : "filter",
"count" : 0
}
}
So, does anyone know why it is giving such a count?

The dates are in an invalid format; have a look at the date/time formats that Elasticsearch supports (tl;dr: any date format supported by Joda-Time is supported by Elasticsearch).
http://www.elasticsearch.org/guide/reference/mapping/date-format.html
That being said, you just have to modify the dates in your insert statements and put them in a valid date format, like 2005-12-23T23:10:55Z. Then change your query to the proper time range in that format, and that should give you the result.
Also be careful when writing these queries: I noticed the date you used in your from clause (2005-12-23 3:10:52) is not valid.
Here are the modified curl scripts:
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "One", "tags" : "foo","datetime":"2005-12-23T23:10:52Z"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Two", "tags" : "bar","datetime":"2005-12-23T23:10:53Z"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "Three", "tags" : "baz","datetime":"2005-12-23T23:10:54Z"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "four", "tags" : "baz","datetime":"2005-12-23T23:10:55Z"}'
curl -X POST "http://localhost:9200/articles/article" -d '{"title" : "five", "tags" : "foo","datetime":"2005-12-23T23:10:56Z"}'
and the modified search:
curl 'http://localhost:9200/articles/article/_search?pretty=true' -d '{
"query": {
"match_all": {}
},
"facets" : {
"myfacet" : {
"terms" : {"field" : "tags"},
"filter" : { "range" :{
"datetime" : {
"from" : "2005-12-23T23:10:52Z",
"to" : "2005-12-23T23:10:54Z"
}
}
}
}
}
}'
Hope this helps,
Matt

Related

Date range search in Elassandra

I have created an index as below.
curl -XPUT -H 'Content-Type: application/json' 'http://x.x.x.x:9200/date_index' -d '{
"settings" : { "keyspace" : "keyspace1"},
"mappings" : {
"table1" : {
"discover":"sent_date",
"properties" : {
"sent_date" : { "type": "date", "format": "yyyy-MM-dd HH:mm:ssZZ" }
}
}
}
}'
I need to search for results in a date range, for example "from" : "2039-05-07 11:22:34+0000", "to" : "2039-05-07 11:22:34+0000", both inclusive.
I am trying this:
curl -XGET -H 'Content-Type: application/json' 'http://x.x.x.x:9200/date_index/_search?pretty=true' -d '
{
"query" : {
"aggregations" : {
"date_range" : {
"sent_date" : {
"from" : "2039-05-07 11:22:34+0000",
"to" : "2039-05-07 11:22:34+0000"
}
}
}
}
}'
I am getting the error below.
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "no [query] registered for [aggregations]",
"line" : 4,
"col" : 22
}
],
"type" : "parsing_exception",
"reason" : "no [query] registered for [aggregations]",
"line" : 4,
"col" : 22
},
"status" : 400
Please advise.
The query seems to be malformed. Please see the date range aggregation documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html and note the differences:
you're introducing a query without defining any - do you need one?
you should use aggs instead of aggregations
you should name your aggregation
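Putting those three fixes together, a corrected request could look like the sketch below. The field name, date format, index name, and range bounds are copied from the question; the aggregation name sent_dates is a made-up placeholder. Note that "to" is exclusive in a date_range aggregation, so it is bumped by one second here to make the upper bound effectively inclusive.

```shell
# Sketch of a corrected date_range aggregation request, assuming the
# sent_date field and format from the question's mapping.
cat > date_range_agg.json <<'EOF'
{
  "size": 0,
  "aggs": {
    "sent_dates": {
      "date_range": {
        "field": "sent_date",
        "format": "yyyy-MM-dd HH:mm:ssZZ",
        "ranges": [
          { "from": "2039-05-07 11:22:34+0000", "to": "2039-05-07 11:22:35+0000" }
        ]
      }
    }
  }
}
EOF

# Check that the body parses as JSON before issuing the search:
python3 -m json.tool date_range_agg.json > /dev/null && echo "body OK"

# curl -XGET -H 'Content-Type: application/json' \
#   'http://x.x.x.x:9200/date_index/_search?pretty=true' -d @date_range_agg.json
```

"size": 0 just suppresses the hits, since only the bucket counts are of interest here.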

elasticsearch: indexing & searching arabic text

When I put the following into elasticsearch-1.0.1, I expect the search queries to return the posts with ids 200 and 201, but I get nothing returned: hits: 0.
What am I doing wrong? I'm searching for exactly what I put in, but get nothing out... (here's the test code for download: http://petoria.de/tmp/arabictest.sh).
But please keep in mind: I want to use the Arabic analyzer, because I want to develop my own analyzer later.
Best,
Koem
curl -XPOST localhost:9200/posts -d '{
"settings" : {
"number_of_shards" : 1
}
}'
curl -XPOST localhost:9200/posts/post/_mapping -d '
{
"post" : {
"properties" : {
"arabic_text" : { "type" : "string", "index" : "analyzed", "store" : true, "analyzer" : "arabic" },
"english_text" : { "type" : "string", "index" : "analyzed", "store" : true, "analyzer" : "english" }
}
}
}'
curl -XPUT 'http://localhost:9200/posts/post/200' -d '{
"english_text" : "palestinian1",
"arabic_text" : "فلسطينيه"
}'
curl -XPUT 'http://localhost:9200/posts/post/201' -d '{
"english_text" : "palestinian2",
"arabic_text" : "الفلسطينية"
}'
search for palestinian1
curl -XGET 'http://localhost:9200/posts/post/_search' -d '{
"query": {
"query_string" : {
"analyzer" : "arabic",
"query" : "فلسطينيه"
}
}
}'
search for palestinian2
curl -XGET 'http://localhost:9200/posts/post/_search' -d '{
"query": {
"query_string" : {
"analyzer" : "arabic",
"query" : "الفلسطينية"
}
}
}'
Just declare the encoding in your request; you can do this by specifying the Content-Type header as below:
Content-Type: application/json; charset=UTF-8
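For example, the first search from the question could be sent like this (a sketch: the header is the fix, the query body is unchanged from the question; the actual curl call is commented out since it needs a running cluster):

```shell
# Declare the charset explicitly so the Arabic query string is read as UTF-8.
BODY='{
  "query": {
    "query_string" : {
      "analyzer" : "arabic",
      "query" : "فلسطينيه"
    }
  }
}'

# Sanity-check locally that the body is valid UTF-8 JSON:
printf '%s' "$BODY" | python3 -m json.tool > /dev/null && echo "body OK"

# Then send it with the charset declared:
# curl -XGET 'http://localhost:9200/posts/post/_search' \
#   -H 'Content-Type: application/json; charset=UTF-8' \
#   -d "$BODY"
```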

Does Elasticsearch work on Excel binary workbooks?

I am trying to perform indexing and searching on all Microsoft Office products. I found that it does not work on Excel binary workbooks (.xlsb).
Indexing succeeds, but the search is not able to find words from the file.
I tried the following steps:
curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
"attachment" : {
"properties" : {
"file" : {
"type" : "attachment",
"fields" : {
"title" : { "store" : "yes" },
"file" : { "term_vector":"with_positions_offsets", "store":"yes" }
}
}
}
}
}'
coded=$(base64 < test.xlsb | tr -d '\n')
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
curl -X POST "localhost:9200/test/attachment/" -d @json.file
curl "localhost:9200/_search?pretty=true" -d '{
"fields" : ["title"],
"query" : {
"query_string" : {
"query" : "sheet"
}
},
"highlight" : {
"fields" : {
"file" : {}
}
}
}'
We just added streaming/read-only .xlsb support in Apache POI (coming in 3.15-beta3). Once that is released, we'll upgrade Apache Tika (1.15?), and then once Elastic upgrades, you should be good to go.
A mere 4 years later!

How can I boost certain fields over others in elasticsearch?

My goal is to apply a boost to the field "name" (see the example below), but I have two problems when I search for "john":
the search also matches {name: "dany", message: "hi bob"}, where name is "dany", and
the search is not boosting name over message (rows with name="john" should be at the top)
The gist is on https://gist.github.com/tomaspet262/5535774
(since stackoverflow's form submit returned 'Your post appears to contain code that is not properly formatted as code', which was formatted properly).
I would suggest using query time boosting instead of index time boosting.
#DELETE
curl -XDELETE 'http://localhost:9200/test'
echo
# CREATE
curl -XPUT 'http://localhost:9200/test?pretty=1' -d '{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyz_1" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}'
echo
# DEFINE
curl -XPUT 'http://localhost:9200/test/posts/_mapping?pretty=1' -d '{
"posts" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "my_analyz_1"
},
"message" : {
"type" : "string",
"analyzer" : "my_analyz_1"
}
}
}
}'
echo
# INSERT
curl localhost:9200/test/posts/1 -d '{"name": "john", "message": "hi john"}'
curl localhost:9200/test/posts/2 -d '{"name": "bob", "message": "hi john, how are you?"}'
curl localhost:9200/test/posts/3 -d '{"name": "john", "message": "bob?"}'
curl localhost:9200/test/posts/4 -d '{"name": "dany", "message": "hi bob"}'
curl localhost:9200/test/posts/5 -d '{"name": "dany", "message": "hi john"}'
echo
# REFRESH
curl -XPOST localhost:9200/test/_refresh
echo
# SEARCH
curl "localhost:9200/test/posts/_search?pretty=1" -d '{
"query": {
"multi_match": {
"query": "john",
"fields": ["name^2", "message"]
}
}
}'
I'm not sure if this is relevant in this case, but when testing with such small amounts of data, I always use 1 shard instead of the default settings, to rule out issues caused by distributed score calculation.
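For illustration, the CREATE call from above could be reissued with a single primary shard, so term statistics (and therefore scores) come from one shard instead of being computed per shard (a sketch; the analyzer settings are copied from the answer, and the curl call is commented out since it needs a running cluster):

```shell
# Same index settings as above, plus number_of_shards: 1 for testing.
cat > create_test.json <<'EOF'
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyz_1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "asciifolding"]
        }
      }
    }
  }
}
EOF

# Validate the settings body locally:
python3 -m json.tool create_test.json > /dev/null && echo "settings OK"

# curl -XPUT 'http://localhost:9200/test?pretty=1' -d @create_test.json
```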

Sort the basic of Number of documents in ElasticSearch

I am saving user relations in an ES index,
i.e.:
{'id' => 1, 'User_id_1' => '2001', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 2, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1002'}
{'id' => 3, 'User_id_1' => '2002', 'relation' => 'friend', 'User_id_2' => '1001'}
{'id' => 4, 'User_id_1' => '2003', 'relation' => 'friend', 'User_id_2' => '1003'}
Now suppose I want to get the user_id_2 who has the most friends;
in the above case it's 1002, as 2001 and 2002 are its friends (count = 2).
I just can't figure out the query.
Thanks.
EDIT:
Well, as suggested by @imotov, a terms facet is a very good choice, but
the problem is that I have 2 indexes:
the 1st index is for saving the main docs and the 2nd index is for saving the relations.
Now the problem is:
suppose I have 100 USER docs in my main index, and only 50 of them have made relations, so I'll have only 50 USER docs in my relationship index.
So when I implement the terms facet, it sorts the results and gives the correct output I want, but I am missing those remaining 50 users who don't have any relations yet; I need them in my final output after the 50 sorted users.
First of all, we need to ensure that relationships saved in ES are unique. This can be done by replacing arbitrary ids with ids constructed from user_id_1, relation and user_id_2. We also need to make sure that the analyzer for user ids doesn't produce multiple tokens: if the ids are strings, they have to be indexed not_analyzed. With these two conditions satisfied, we can simply run a terms facet query on the field user_id_2 over the result list limited by relation:friend. This query will retrieve the top user_id_2 ids sorted by number of occurrences in the index. All together it could look something like this:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
echo
curl -XGET 'http://localhost:9200/relationships/relation/_search?pretty=true&search_type=count' -d '{
"query": {
"term" : {
"relation" : "friend"
}
},
"facets" : {
"popular" : {
"terms" : {
"field" : "user_id_2"
}
}
}
}'
Please note that due to the distributed nature of facet calculation, the counts reported by the facet query might be lower than the actual number of records if multiple shards are used. See elasticsearch issue 1832.
EDIT:
There are two solutions for the edited question. One solution is to use facet on two fields:
curl -XPUT http://localhost:9200/relationships -d '{
"mappings" : {
"relation" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id_1": { "type": "string", "index" : "not_analyzed"},
"relation": { "type": "string", "index" : "not_analyzed"},
"user_id_2": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users -d '{
"mappings" : {
"user" : {
"_source" : {"enabled" : false },
"properties" : {
"user_id": { "type": "string", "index" : "not_analyzed"}
}
}
}
}'
curl -XPUT http://localhost:9200/users/user/1001 -d '{"user_id": 1001}'
curl -XPUT http://localhost:9200/users/user/1002 -d '{"user_id": 1002}'
curl -XPUT http://localhost:9200/users/user/1003 -d '{"user_id": 1003}'
curl -XPUT http://localhost:9200/users/user/1004 -d '{"user_id": 1004}'
curl -XPUT http://localhost:9200/users/user/1005 -d '{"user_id": 1005}'
curl -XPUT http://localhost:9200/relationships/relation/2001-friend-1002 -d '{"user_id_1": "2001", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1002 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1002"}'
curl -XPUT http://localhost:9200/relationships/relation/2002-friend-1001 -d '{"user_id_1": "2002", "relation":"friend", "user_id_2": "1001"}'
curl -XPUT http://localhost:9200/relationships/relation/2003-friend-1003 -d '{"user_id_1": "2003", "relation":"friend", "user_id_2": "1003"}'
curl -XPOST http://localhost:9200/relationships/_refresh
curl -XPOST http://localhost:9200/users/_refresh
echo
curl -XGET 'http://localhost:9200/relationships,users/_search?pretty=true&search_type=count' -d '{
"query": {
"indices" : {
"indices" : ["relationships"],
"query" : {
"filtered" : {
"query" : {
"term" : {
"relation" : "friend"
}
},
"filter" : {
"type" : {
"value" : "relation"
}
}
}
},
"no_match_query" : {
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"type" : {
"value" : "user"
}
}
}
}
}
},
"facets" : {
"popular" : {
"terms" : {
"fields" : ["user_id", "user_id_2"]
}
}
}
}'
Another solution is to add a "self" relation to the relationships index for every user when the user is created. I would prefer the second solution since it seems less complicated.
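The second solution could be sketched like this: whenever a user is created, also index a "self" relation for them, so every user appears in the relationships index at least once. The facet would then run without the relation:friend filter (or match both "friend" and "self"), and every count is uniformly offset by one, so the ordering still holds and friendless users appear at the bottom. User 1004 from the example above is used here; the curl call is commented out since it needs a running cluster.

```shell
# A "self" relation doc, indexed alongside the real friend relations.
cat > self_relation.json <<'EOF'
{"user_id_1": "1004", "relation": "self", "user_id_2": "1004"}
EOF

# Validate the doc locally:
python3 -m json.tool self_relation.json > /dev/null && echo "doc OK"

# curl -XPUT http://localhost:9200/relationships/relation/1004-self-1004 \
#   -d @self_relation.json
```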
