Elasticsearch shuffle index sorting

Thanks in advance. I describe the situation first and give the solution at the end.
I have a collection of 2M documents with the following mapping:
{
"image": {
"properties": {
"timestamp": {
"type": "date",
"format": "dateOptionalTime"
},
"title": {
"type": "string"
},
"url": {
"type": "string"
}
}
}
}
I have a webpage which paginates through all the documents with the following search:
{
"from":STARTING_POSITION_NUMBER,
"size":15,
"sort" : [
{ "_id" : {"order" : "desc"}}
],
"query" : {
"match_all": {}
}
}
A hit looks like this (note that the _id value is a hash of the URL, to prevent duplicate documents):
{
"_index": "images",
"_type": "image",
"_id": "2a750a4817bd1600",
"_score": null,
"_source": {
"url": "http://test.test/test.jpg",
"timestamp": "2014-02-13T17:01:40.442307",
"title": "Test image!"
},
"sort": [
null
]
}
This works pretty well. The only problem is that the documents appear sorted chronologically (the oldest documents appear on the first page, and the most recently indexed ones on the last page), but I want them to appear in a random order. For example, page 10 should always show the same N documents, but they don't have to be sorted by date.
I thought of something like sorting all the documents by their hash, which is kind of random and deterministic. How could I do it?
I've searched the docs, and the sort API only works for sorting the results, not the full index. If I don't find a solution, I will pick documents randomly and index them in a separate collection.
Thank you.

I solved it using the following search:
{
"from":STARTING_POSITION_NUMBER,
"size":15,
"query" : {
"function_score": {
"random_score": {
"seed" : 1
}
}
}
}
Thanks to David from the Elasticsearch mailing list for pointing out the function score with random scoring.
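The search above can be wrapped in a small helper so that every page number maps to a fixed offset. This is only a sketch: the helper name build_random_page_query is hypothetical, and the body simply mirrors the query shown above. Because the seed stays the same, each page should keep returning the same documents as long as the index does not change. Newer Elasticsearch versions may expect a "field" next to "seed" in random_score, so check the documentation for your version.

```python
import json

PAGE_SIZE = 15  # same page size as the search above


def build_random_page_query(page, seed=1, size=PAGE_SIZE):
    """Build the search body for a 0-based page number (hypothetical helper)."""
    if page < 0:
        raise ValueError("page must be >= 0")
    return {
        "from": page * size,  # STARTING_POSITION_NUMBER in the original search
        "size": size,
        "query": {
            "function_score": {
                # A fixed seed keeps the pseudo-random order stable across requests.
                "random_score": {"seed": seed}
            }
        },
    }


if __name__ == "__main__":
    # Print the body that would be sent for page 10.
    print(json.dumps(build_random_page_query(10), indent=2))
```

The resulting dictionary can be serialized to JSON and sent as the request body of the _search call.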

Related

Elasticsearch query nested object

I have this record in elastic:
{
"FirstName": "Winona",
"LastName": "Ryder",
"Notes": "<p>she is an actress</p>",
"Age": "40-50",
"Race": "Caucasian",
"Gender": "Female",
"HeightApproximation": "No",
"Armed": false,
"AgeCategory": "Adult",
"ContactInfo": [
{
"ContactPoint": "stranger#gmail.com",
"ContactType": "Email",
"Details": "Details of tv show"
}
]
}
I want to query inside the ContactInfo object. I used the query below, but I don't get any results back:
{
"query": {
"nested" : {
"path" : "ContactInfo",
"query" : {
"match" : {"ContactInfo.Details" : "Details of tv show"}
}
}
}
}
I also tried:
{
"query": {
"term" : { "ContactInfo.ContactType" : "email" }
}
}
Here is the mapping for ContactInfo:
"ContactInfo":{
"type": "object"
}
I think I know the issue: the field is not set as nested in the mapping. Is there a way to still run a nested query without changing the mapping? I just want to avoid re-indexing the data if possible.
I'm pretty new to Elasticsearch, so I need your help.
Thanks in advance.
Elasticsearch has no concept of inner objects; it flattens object hierarchies into a simple list of field names and values.
Some important points from the official Elasticsearch documentation on the nested field type:
The nested type is a specialized version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.
If you need to index arrays of objects and to maintain the independence of each object in the array, use the nested data type instead of the object data type.
Internally, nested objects index each object in the array as a separate hidden document, such that each nested object can be queried independently of the others with the nested query.
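The difference between the two mappings can be illustrated outside Elasticsearch. The sketch below only imitates the flattening with plain Python dictionaries and exact value comparison (no analysis); the function names and the second contact entry are invented for the illustration. It suggests why a condition on a single flattened field such as ContactInfo.Details can still match under an object mapping, while conditions that must hold within the same array element need the nested type.

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Mimic how an array of objects under an `object` mapping is flattened."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


def matches_flattened(flat, conditions):
    """Each condition is checked against the whole flattened document."""
    return all(value in flat.get(name, []) for name, value in conditions.items())


def matches_nested(objects, conditions):
    """All conditions must hold inside one and the same array element."""
    return any(
        all(obj.get(name.split(".", 1)[1]) == value for name, value in conditions.items())
        for obj in objects
    )


# The second contact is invented for this illustration.
contacts = [
    {"ContactType": "Email", "Details": "Details of tv show"},
    {"ContactType": "Phone", "Details": "Agent office"},
]
flat = flatten_object_array("ContactInfo", contacts)

# "Phone" and "Details of tv show" never occur in the same contact.
crossed = {
    "ContactInfo.ContactType": "Phone",
    "ContactInfo.Details": "Details of tv show",
}
print(matches_flattened(flat, crossed))  # True: the association is lost
print(matches_nested(contacts, crossed))  # False: matched per object
```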
Refer to this SO answer for more details.
Below is a working example with index mapping, search query, and search result.
You have to reindex your data after applying the nested data type.
Index Mapping:
{
"mappings": {
"properties": {
"ContactInfo": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"nested" : {
"path" : "ContactInfo",
"query" : {
"match" : {"ContactInfo.Details" : "Details of tv show"}
}
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64269180",
"_type": "_doc",
"_id": "1",
"_score": 1.1507283,
"_source": {
"FirstName": "Winona",
"LastName": "Ryder",
"Notes": "<p>she is an actress</p>",
"Age": "40-50",
"Race": "Caucasian",
"Gender": "Female",
"HeightApproximation": "No",
"Armed": false,
"AgeCategory": "Adult",
"ContactInfo": [
{
"ContactPoint": "stranger#gmail.com",
"ContactType": "Email",
"Details": "Details of tv show"
}
]
}
}
]

Buckets of documents grouped by term frequency

I want to segment Elasticsearch results into buckets, such that similar documents (those with the most matching terms on an analyzed field) are grouped together in the results. I'm not sure how to go about building aggregated buckets of individual documents this way.
Here's the basic mapping:
PUT movies
{
"mappings": {
"movie": {
"properties": {
"id": { "type": "long" },
"title": { "type" : "text" }
}
}
}
}
Now, for example, if a query is run for hunger, the results should be grouped into buckets of matching documents that share the most terms:
{
"buckets": {
"1": [
{
"title": "The Hunger Games"
},
{
"title": "The Hunger Games: Mockingjay"
},
{
"title": "The Hunger Games: Catching Fire"
}
],
"2": [
{
"title": "Aqua Teen Hunger Force"
},
{
"title": "Force of Hunger"
}
],
"3": [
{
"title": "Hunger Pain"
}
],
:
:
:
}
}
In the above example, similar documents are grouped in separate buckets, based on at least two matching terms. All matching titles without similar terms are still included in the results as separate buckets (e.g. bucket #3).
Any suggestions are appreciated.
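To make the desired grouping rule concrete, here is a minimal client-side sketch. It is not an Elasticsearch feature and not a proposed answer: it only post-processes a list of titles, assuming a naive lowercase tokenizer and a greedy rule where a title joins the first bucket whose first title shares at least two terms with it. The function names are made up for the illustration.

```python
import re


def tokenize(title):
    """Naive stand-in for an analyzer: lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def group_by_shared_terms(titles, min_shared=2):
    """Greedily group titles that share at least `min_shared` terms."""
    buckets = []  # list of (terms of the bucket's first title, member titles)
    for title in titles:
        terms = tokenize(title)
        for first_terms, members in buckets:
            if len(terms & first_terms) >= min_shared:
                members.append(title)
                break
        else:
            # No existing bucket is similar enough, so start a new one.
            buckets.append((terms, [title]))
    return [members for _, members in buckets]


titles = [
    "The Hunger Games",
    "The Hunger Games: Mockingjay",
    "The Hunger Games: Catching Fire",
    "Aqua Teen Hunger Force",
    "Force of Hunger",
    "Hunger Pain",
]
for number, members in enumerate(group_by_shared_terms(titles), start=1):
    print(number, members)
```

With these six titles the sketch yields three groups that correspond to buckets 1, 2 and 3 in the example above. The result depends on the input order, which is one reason this is only an illustration.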

How to get the total token count in documents in Elasticsearch

I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.
I tried the following query, but it gives a very large number in the order of 10^20, which is not the correct answer for my dataset.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
"query": {
"match_all": {}
},
"aggs": {
"tk_count": {
"sum": {
"script": "_index[\"body\"].sumttf()"
}
}
},
"size": 0
}'
Any idea how to get the correct count of all tokens? (I do not need counts for each term, only the total count.)
This worked for me; is it what you need?
Rather than computing the token count at query time (using the tk_count aggregation, as suggested in the other answer), my solution stores the token count at index time using the token_count datatype, so that the "name.stored_length" values are returned in query results.
token_count is defined as a multi-field, so it works on one field at a time (i.e. the "name" field or the "body" field). I modified the example slightly to store "name.stored_length".
Note that my example does not count the cardinality of tokens (i.e. distinct values); it counts total tokens. "John John Doe" has 3 tokens in it, so "name.stored_length" is 3, even though its distinct token count is only 2. Note also that I ask for the specific "stored_fields": ["name.stored_length"].
Finally, you may need to re-index your documents (i.e. send a PUT again) to get the values you want. In this case I had to PUT "John John Doe" again even though it had already been indexed in Elasticsearch; the tokens were not counted until the document was PUT again after the token_count field had been added to the mapping.
PUT test_token_count
{
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"stored_length": {
"type": "token_count",
"analyzer": "standard",
//------------------v
"store": true
}
}
}
}
}
}
}
PUT test_token_count/_doc/1
{
"name": "John John Doe"
}
Now we can query, or search for results, and configure results to include the name.stored_length field (which is both a multi-field and a stored field!):
GET test_token_count/_search
{
//------------------v
"stored_fields" : ["name.stored_length"]
}
The search results should then include the total token count as name.stored_length:
{
...
"hits": {
...
"hits": [
{
"_index": "test_token_count",
"_type": "_doc",
"_id": "1",
"_score": 1,
"fields": {
//------------------v
"name.stored_length": [
3
]
}
}
]
}
}
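The difference between total and distinct tokens for "John John Doe" can be checked with a few lines of Python. This is a rough sketch that assumes a simple lowercase word tokenizer as a stand-in for the standard analyzer; the real analyzer may split some inputs differently, and the function names are made up.

```python
import re


def simple_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def token_stats(text):
    """Return (total token count, distinct token count) for a text."""
    tokens = simple_tokens(text)
    return len(tokens), len(set(tokens))


total, distinct = token_stats("John John Doe")
print(total, distinct)  # 3 2
```

The first number corresponds to what token_count stores per document, while the second is closer to what a distinct-value count would report.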
It seems like you want to retrieve the cardinality of the tokens in the body field.
In that case you can use the cardinality aggregation, as below. Note that it counts distinct tokens, not the total number of tokens.
curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
"query": {
"match_all": {}
},
"aggs": {
"tk_count": {
"cardinality" : {
"field" : "body"
}
}
},
"size": 0
}'
For detailed information, see this official document

Elastic search aggregation sum

I'm using Elasticsearch 1.0.2 and I want to run a search with aggregation functions like sum().
Suppose a single record looks like this:
{
"_index": "outboxpro",
"_type": "message",
"_id": "PAyEom_mRgytIxRUCdN0-w",
"_score": 4.5409594,
"_source": {
"team_id": "1bf5f3f968e36336c9164290171211f3",
"created_user": "1a9d05586a8dc3f29b4c8147997391f9",
"created_ip": "192.168.2.245",
"folder": 1,
"report": [
{
"networks": "ec466c09fd62993ade48c6c4bb8d2da7facebook",
"status": 2,
"info": "OK"
},
{
"networks": "bdc33d8ca941b8f00c2a4e046ba44761twitter",
"status": 2,
"info": "OK"
},
{
"networks": "ad2672a2361d10eacf8a05bd1b10d4d8linkedin",
"status": 5,
"info": "[unauthorized] Invalid or expired token."
}
]
}
}
Let's say I need to fetch the count of all successful messages, i.e. those posted with status = 2 in the report field. There will be many records in the collection, and I want a report of all successfully posted messages.
I have tried the following code:
Edit:
{
"size": 2000,
"query": {
"filtered": {
"query": {
"match": {
"team_id": {
"query": "1bf5f3f968e36336c9164290171211f3"
}
}
}
}
},
"aggs": {
"genders": {
"terms": {
"field": "report.status"
}
}
}
}
Please help me find a solution. I'm a newbie in Elasticsearch. Is there any other aggregation method to do this? Your help is much appreciated.
Your script filter is slow on big data and doesn't benefit from indexing. Have you considered parent/child instead of nested? With parent/child you could use aggregations natively to calculate the sum.
You will have to make use of nested mappings here. Have a look at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-mapping.html.
Then you will have to aggregate on the nested fields, as in https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-nested-aggregation.html.
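As a rough sketch of what the nested approach could look like, the helper below builds a search body that counts report entries with a given status. It assumes the report field has been remapped as nested and the data reindexed; the helper name and the aggregation names "reports" and "successful" are made up, and the body has not been verified against Elasticsearch 1.0.2.

```python
import json


def build_success_count_query(team_id, success_status=2):
    """Build a search body counting report entries with the given status.

    Assumes `report` is mapped as a nested field (hypothetical helper).
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"match": {"team_id": team_id}},
        "aggs": {
            "reports": {
                # Step into the nested report objects.
                "nested": {"path": "report"},
                "aggs": {
                    "successful": {
                        # Keep only report entries with the wanted status.
                        "filter": {"term": {"report.status": success_status}}
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_success_count_query("1bf5f3f968e36336c9164290171211f3")
    print(json.dumps(body, indent=2))
```

Under these assumptions, the doc_count of the "successful" bucket in the response would be the number of report entries with status 2 across the matching messages.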

Trying to search unique results with Elasticsearch and highlighting feature

I am trying to implement an Elasticsearch query that will return highlighted distinct results based on the field queried upon.
I am aware that this isn't natively supported in ES and that most people are satisfied with aggregate features like facets. But since I need more data than a facet can provide, and since result highlighting is key to the project, I have been looking for alternative solutions.
I am using the Tire gem for a Rails project and so far this was my strategy:
Query Elasticsearch with the query string and highlighting on one hand, and a terms facet on the other:
{
"query": {
"match": {
"name": {
"query": "Banana",
"analyzer": "query_analyzer",
"operator": "AND"
}
}
},
"facets": {
"group_by": {
"terms": {
"fields": [
"name"
],
"size": 10,
"all_terms": false
}
}
},
"highlight": {
"fields": {
"name": {}
},
"pre_tags": [
"<span class=\"highlight\">"
],
"post_tags": [
"</span>"
]
},
"size": 100
}
Cross-reference the unique results with the first matching element from the query result to not only retrieve the missing information but also the highlighting for the result.
The problem with this approach is that even if I limit my query results to ten times more than my initial result size, the cross-reference could end up not finding 10 unique results in the query.
Also if I disregard the query and cross-reference the facet results with my database, I will lose the highlight.
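The cross-referencing step described above can be sketched as follows. This is a client-side illustration in Python rather than the Tire/Ruby code of the project; the function name and the sample hits are hypothetical. It keeps the first hit for each distinct name, so the highlight of that hit is preserved, and it shows the limitation mentioned above: if the fetched hits contain fewer distinct names than the limit, fewer results come back.

```python
def first_hit_per_name(hits, limit=10):
    """Keep the first hit (with its highlight) for each distinct name."""
    unique = {}
    for hit in hits:
        name = hit["_source"]["name"]
        if name not in unique:
            unique[name] = hit
        if len(unique) == limit:
            break  # enough distinct names collected
    return list(unique.values())


# Hypothetical hits shaped like an Elasticsearch response.
sample_hits = [
    {"_id": "1", "_source": {"name": "Banana"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>nana"]}},
    {"_id": "2", "_source": {"name": "Banana"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>nana"]}},
    {"_id": "4", "_source": {"name": "Barrel"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>rrel"]}},
]
print([hit["_id"] for hit in first_hit_per_name(sample_hits)])  # ['1', '4']
```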
I was also thinking that maybe I could even index my data differently a second time to enforce uniqueness server-side but this has proven to be another challenge altogether.
I am running out of ideas right now so if anyone sees something I'm missing I would be very grateful for any help.
Edit:
As an example, let's say I have these documents indexed in ES:
[
{
id: 1,
name: 'Banana',
countryOfOrigin: 'Banana land'
},
{
id: 2,
name: 'Banana',
countryOfOrigin: 'Candy mountain'
},
{
id: 3,
name: 'Carrot',
countryOfOrigin: 'United Kingdom'
},
{
id: 4,
name: 'Barrel',
countryOfOrigin: 'Canada'
}
]
And I search for "Ba" in the same fashion as the query above, I would expect to find something like this:
{
"_shards":{
/* ... */
},
"hits":{
"total" : 2,
"hits" : [
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_source" : {
"id": "1",
"name": "Banana",
"countryOfOrigin": "Banana land"
},
"highlight": {
"name": ["<span class='highlight'>Ba</span>nana"]
}
},
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "4",
"_source" : {
"id": "4",
"name": "Barrel",
"countryOfOrigin": "Canada"
},
"highlight": {
"name": ["<span class='highlight'>Ba</span>rrel"]
}
}
]
}
}
This would basically allow me to search for distinct item names in my records.

Resources