Random document in Elasticsearch

Is there a way to get a truly random sample from an elasticsearch index? i.e. a query that retrieves any document from the index with probability 1/N (where N is the number of documents currently indexed)?
And as a follow-up question: if all documents have some numeric field s, is there a way to get a document through weighted random sampling, i.e. where the probability to get document i with value s_i is equal to s_i / sum(s_j for j in index)?

I know it is an old question, but now it is possible to use random_score,
with the following search query:
{
"size": 1,
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": "1477072619038"
}
}
]
}
}
}
For me it is very fast with about 2 million documents.
I use the current timestamp as the seed, but you can use anything you like. The useful property is that the same seed gives the same ordering, so you can use each user's session id as the seed: every user then gets a different order, and that order is repeatable for the same user.
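A minimal sketch of the session-based seeding idea, in Python. The helper names `seed_from_session` and `random_doc_query` and the hashing scheme are assumptions made up for illustration, not part of Elasticsearch. The code only builds the request body, which you would then send to `_search` with whatever client you use.

```python
import hashlib


def seed_from_session(session_id: str) -> int:
    """Derive a stable integer seed from a session id (hypothetical scheme)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Keep 31 bits so the seed fits in a signed 32-bit integer.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


def random_doc_query(session_id: str, size: int = 1) -> dict:
    """Build a function_score body that orders documents by random_score."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "functions": [
                    {"random_score": {"seed": seed_from_session(session_id)}}
                ]
            }
        },
    }
```

Calling `random_doc_query` twice with the same session id produces an identical body, which is what makes the ordering repeatable. Depending on your Elasticsearch version, `random_score` may also expect a `field` next to `seed`, so check the documentation for the version you run.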

The only way I know of to get random documents from an index (at least in versions <= 1.3.1) is to use a script:
sort: {
_script: {
script: "Math.random() * 200000",
type: "number",
params: {},
order: "asc"
}
}
You can use that script to make some weighting based on some field of the record.
It's possible that in the future they might add something more complicated, but you'd likely have to request that from the ES team.
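For the weighted follow-up question, one known technique (the Efraimidis and Spirakis sampling key, which is not something the answer above specifies) is to give each document the key u^(1/s), where u is a uniform random number and s is the document's weight, and then take the document with the largest key. The Python simulation below only illustrates the math with made-up weights. A sort script would have to compute the same expression and sort descending, and that part is untested here.

```python
import random


def weighted_pick(weights: dict, rng: random.Random):
    """Return one key, chosen with probability weight / sum(weights)."""
    best_id, best_key = None, -1.0
    for doc_id, weight in weights.items():
        if weight <= 0:
            continue  # documents with non-positive weight are never selected
        # Larger weights push the key closer to 1, so they win more often.
        key = rng.random() ** (1.0 / weight)
        if key > best_key:
            best_id, best_key = doc_id, key
    return best_id


rng = random.Random(42)
weights = {"a": 1.0, "b": 3.0}  # made-up values of the numeric field s
draws = [weighted_pick(weights, rng) for _ in range(20000)]
share_b = draws.count("b") / len(draws)  # should be close to 3 / (1 + 3)
```

Like the random sort above, this still has to compute a key for every document, so it does not avoid the full scan.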

You can use random_score with a function_score query.
{
"size":1,
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": 11
}
}
],
"score_mode": "sum"
}
}
}
The bad part is that this will apply a random score to every document, sort the documents, and then return the first one. I don't know of anything that is smart enough to just pick a random document.

NEST way:
var result = _elastic.Search<dynamic>(s => s
.Query(q => q
.FunctionScore(fs => fs.Functions(f => f.RandomScore())
.Query(fq => fq.MatchAll()))));
Raw query way:
GET index-name/_search
{
"size": 1,
"query": {
"function_score": {
"query" : { "match_all": {} },
"random_score": {}
}
}
}

You can use random_score to randomly order responses or retrieve a document with roughly 1/N probability.
Additional notes:
https://github.com/elastic/elasticsearch/issues/1170
https://github.com/elastic/elasticsearch/issues/7783

Related

Find same text within time range

I'm storing articles of blogs in ElasticSearch in this format:
{
blog_id: keyword,
blog_article_id: keyword,
timestamp: date,
article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids that have articles with the same word at least n times within a date range?
Is this the right way to model the problem, or should I use nested objects for an easier query?
Can this be made into a report in Kibana?
The simplest query that comes to mind is
{
"_source": "blog_id",
"query": {
"bool": {
"must": [
{
"match": {
"article_text": "xyz"
}
},
{
"range": {
"timestamp": {
"gte": "now-30d"
}
}
}
]
}
}
}
Nested objects are most probably not going to simplify anything; on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kibana Query Language) or using the dropdowns, and choose a metric that you want to track (total blog_id count, time series frequency, etc.).
EDIT re: number of occurrences:
I know of 2 ways:
There's the term vectors API, which gives you word frequency information, but it's a standalone API and cannot be used at query time.
Then there's the scripted approach whereby you look at the whole article text, treat it as a case-sensitive keyword, and count the number of substring matches, thereby eliminating the articles with insufficient word frequency. Note that you don't have to use function_score as I did; a simple script query will do. It may take a non-trivial amount of time to resolve if you have a non-trivial number of docs.
In your case it could look like this:
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
def word = 'xyz';
def docval = doc['article_text.keyword'].value;
String temp = docval.replace(word, "");
def no_of_occurrences = ((docval.length() - temp.length()) / word.length());
return no_of_occurrences >= 2;
"""
}
}
}
]
}
}
}
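The counting trick inside that script (remove the word, then compare lengths) is easy to check outside Elasticsearch. Here is the same arithmetic as a small Python sketch. The function names are made up for illustration.

```python
def count_occurrences(text: str, word: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of word in text."""
    if not word:
        raise ValueError("word must be non-empty")
    stripped = text.replace(word, "")
    # Every removed occurrence shortens the text by len(word) characters.
    return (len(text) - len(stripped)) // len(word)


def mentions_at_least(text: str, word: str, minimum: int = 2) -> bool:
    """Mirror the script's final check: at least `minimum` occurrences."""
    return count_occurrences(text, word) >= minimum
```

Note that this counts substrings, not whole words, so 'xyz' inside 'xyzzy' is counted too, and 'XYZ' is not counted because the comparison is case-sensitive.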

Advanced kibana / elasticsearch devtools queries

I'm querying my index in the following way:
GET INDEX/_count?q=KEY:VALUE
I want to get data on a list of values, so I run it multiple times:
GET INDEX/_count?q=KEY:VALUE0
GET INDEX/_count?q=KEY:VALUE1
GET INDEX/_count?q=KEY:VALUE2
Additionally, I want to check if the key exists. These options are available in the Discover window, but here I don't know how to access them...
What I eventually want: Query a specific index [I] and count (and, if possible, get advanced stats such as the total size of the docs searched) all docs where a specific key [K] exists, or has a value from a list of values (and, if possible, do that with regex). In addition, I want the search / count to be between specific dates. I know how to do so in the Discover window, but Discover has 2 problems:
Gives the actual doc (too heavy, I only want size and count)
Involves GUI
To summarize, I have a few difficulties:
How to add size to the Dev Tools count
How to count / search up to one month past
How to find if a key exists (e.g. GET I/_count?K:exists ?)
How to find if a value is one of a list of allowed values (e.g. GET I/_count?K=x || K=y || K2=z)
How to describe value in regex (e.g. GET I/_count?K=abc*)
After count / search is done, how to delete said docs (Just replace GET with DELETE?)
This should get you started:
GET INDEX/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"dateField": {
"gte": "now-1M"
}
}
},
{
"bool": {
"filter": {
"exists": {
"field": "K"
}
}
}
},
{
"query_string": {
"query": "K:(x OR y) OR K2:z"
}
},
{
"regexp": {
"K": "abc.*"
}
}
]
}
}
}
Alternatively, you can switch must to should, thereby matching any one of those subqueries rather than all of them.
After this, replace _search with _delete_by_query and you're good to go.
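If you need to run this for many keys or value lists, you could generate the body instead of editing it by hand. A rough Python sketch follows. The function name `build_body` and its parameters are hypothetical, and the values are assumed to need no query_string escaping. Since the question asks for counts only, the same body can also be sent to the _count endpoint instead of _search.

```python
def build_body(key: str, values: list, date_field: str = "dateField",
               occur: str = "must") -> dict:
    """Build a bool query like the one above; occur is 'must' or 'should'."""
    if occur not in ("must", "should"):
        raise ValueError("occur must be 'must' or 'should'")
    # Values are joined verbatim; special characters are not escaped here.
    value_expr = " OR ".join(values)
    clauses = [
        {"range": {date_field: {"gte": "now-1M"}}},  # up to one month back
        {"exists": {"field": key}},                  # the key is present
        {"query_string": {"query": f"{key}:({value_expr})"}},
    ]
    return {"query": {"bool": {occur: clauses}}}


body = build_body("K", ["x", "y"])
```

With `occur="should"` a document only has to satisfy one of the clauses, which also means the date range stops being mandatory.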

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good. However, by default the maximum number of hits you can return in a top_hits aggregation is 100; if you try 1000, you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
So, if your parent documents have fewer than 100 children, why not? Otherwise, I'd seriously consider another approach.
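For reference, pulling the parent and the children out of such a response could look like the Python sketch below. The aggregation names are the ones from the question. The helper name and the sample response are made up for illustration and trimmed to the fields the code reads.

```python
def split_parent_and_children(response: dict) -> tuple:
    """Return (parent_source_or_None, list_of_child_sources)."""
    parents = [hit["_source"] for hit in response["hits"]["hits"]]
    # Walk down: global agg -> filter agg -> top_hits agg -> its hits.
    child_hits = (
        response["aggregations"]["my-global-agg"]
        ["my-filtering-all-but-children"]["my-returning-children"]
        ["hits"]["hits"]
    )
    children = [hit["_source"] for hit in child_hits]
    return (parents[0] if parents else None), children


# Hand-written sample, not real server output.
sample_response = {
    "hits": {"hits": [{"_id": "123", "_source": {"id": 123}}]},
    "aggregations": {
        "my-global-agg": {
            "my-filtering-all-but-children": {
                "my-returning-children": {
                    "hits": {"hits": [
                        {"_source": {"my_child_field1_to_return": "a"}},
                        {"_source": {"my_child_field1_to_return": "b"}},
                    ]}
                }
            }
        }
    },
}
parent, children = split_parent_and_children(sample_response)
```

The number of children you get back is capped by the top_hits size, so the limit discussed above applies here as well.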

Elastic Search filter with aggregate like Max or Min

I have simple documents with a ScheduleId. I would like to get the count of documents for the most recent ScheduleId. Assuming the max ScheduleId is the most recent, how would we write that query? I have been searching and reading for a few hours and could not get it to work.
{
"aggs": {
"max_schedule": {
"max": {
"field": "ScheduleId"
}
}
}
}
That gets me the max ScheduleId and the total count of documents outside of that aggregate.
I would appreciate it if someone could help me with how to take this aggregate value and apply it as a filter (like a subquery in SQL).
This should do it:
{
"aggs": {
"max_ScheduleId": {
"terms": {
"field": "ScheduleId",
"order" : { "_term" : "desc" },
"size": 1
}
}
}
}
The terms aggregation will give you document counts for each term, and it works for integers. You just need to order the results by the term instead of by the count (the default). And since you only want the highest ScheduleId, "size": 1 is adequate.
Here is the code I used to test it:
http://sense.qbox.io/gist/93fb979393754b8bd9b19cb903a64027cba40ece
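A small Python sketch of building that request and reading the answer out of the response. The helper names are made up. Two things here are assumptions on my side: newer Elasticsearch versions name the sort key `_key` where the query above uses `_term`, so use whichever your version accepts, and `"size": 0` is added only to skip returning the hits themselves.

```python
def latest_schedule_body(field: str = "ScheduleId") -> dict:
    """Terms aggregation ordered by key, keeping only the highest bucket."""
    return {
        "size": 0,  # only the aggregation is needed, not the documents
        "aggs": {
            "max_ScheduleId": {
                "terms": {"field": field, "order": {"_key": "desc"}, "size": 1}
            }
        },
    }


def latest_schedule_count(response: dict):
    """Return (schedule_id, doc_count) for the top bucket, or None."""
    buckets = response["aggregations"]["max_ScheduleId"]["buckets"]
    if not buckets:
        return None
    return buckets[0]["key"], buckets[0]["doc_count"]


# Hand-written sample, not real server output.
sample = {"aggregations": {"max_ScheduleId": {"buckets": [
    {"key": 17, "doc_count": 42}
]}}}
result = latest_schedule_count(sample)
```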

Elasticsearch date range intersection

I'm storing something like the following information in elastic search:
{ "timeslot_start_at" : "2013-02-01", "timeslot_end_at" : "2013-02-03" }
Given that I have another date range (from user input, for example), I want to search for an intersecting time range. This is similar to the question "Determine Whether Two Date Ranges Overlap", which outlines the logic I'm after:
(StartDate1 <= EndDate2) and (StartDate2 <= EndDate1)
But I'm unsure how to fit this into an Elasticsearch query. Would I use a range filter and only set the 'to' values, leaving the 'from' blank? Or is there a more efficient way of doing this?
Update: It is now possible to use date_range data type that was added in elasticsearch v5.2. For an earlier version of elasticsearch the following solution still applies.
To test for intersection, you should combine two range queries into a single query using bool query:
{
"bool": {
"must": [
{
"range": {
"timeslot_start_at": {
"lte": "2013-02-28"
}
}
},
{
"range": {
"timeslot_end_at": {
"gte": "2013-02-03"
}
}
}
]
}
}
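To see how the two range clauses map onto the overlap condition from the question, here is the same logic as a Python sketch. The function names are made up, and the field names are the ones from the question. The stored range plays the role of range 1 and the user's range the role of range 2.

```python
from datetime import date


def ranges_overlap(start1: date, end1: date, start2: date, end2: date) -> bool:
    """(StartDate1 <= EndDate2) and (StartDate2 <= EndDate1)."""
    return start1 <= end2 and start2 <= end1


def intersection_query(user_start: date, user_end: date) -> dict:
    """Build the bool query matching timeslots that intersect the user range."""
    return {
        "bool": {
            "must": [
                # StartDate1 <= EndDate2
                {"range": {"timeslot_start_at": {"lte": user_end.isoformat()}}},
                # StartDate2 <= EndDate1
                {"range": {"timeslot_end_at": {"gte": user_start.isoformat()}}},
            ]
        }
    }


query = intersection_query(date(2013, 2, 3), date(2013, 2, 28))
```

With the sample document from the question (2013-02-01 to 2013-02-03), a user range starting on 2013-02-03 still overlaps because both comparisons are inclusive.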
