Elasticsearch get matching documents after specific document id - elasticsearch

When I search for documents I take the first 10 and hand them to the view; if the user scrolls to the end of the list, the next 10 elements should be displayed.
I know the document id of the last displayed document, and now I have to get the next 10. Basically I could perform the exact same search with an offset of 10, but it would be much better to run the same query, pass it the document id of the last retrieved document, and retrieve the matching documents that come after the document with that id.
Is that possible with elasticsearch?
=== UPDATE
I want to clarify my issue a bit more, because it seems it is not described clearly enough right now. Sorry for that.
The case:
You have a kind of feed, and the feed grows every second. When a user opens the feed he gets the 10 most recent entries; when he scrolls down he wants the next 10 entries.
Because the feed grows every second, a usual offset/limit (from/size in Elasticsearch) can't solve this problem: depending on the time between the first request (first 10 entries) and the request for the next entries, you would display entries that were already shown, or completely newer entries.
The request to get the 10 elements AFTER the already displayed entries gives the backend the id of the last entry that was displayed. The backend then knows to ignore all entries before this specific one.
At the moment I'm handling this in code: I request the list of all matching entries from Elasticsearch and iterate over it. This way I can do everything I want (no surprise) and extract the needed chunk of entries.
My question is: is there a built-in solution for this issue in Elasticsearch? Solving the problem my way is not the fastest.

It's an old topic, but it feels that the Search After API, available since Elasticsearch 5.0, does exactly what is needed. Provide the id of your last doc and its timestamp, for example:
GET twitter/tweet/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "search_after": [
    1463538857,
    "tweet#654323"
  ],
  "sort": [
    { "date": "asc" },
    { "_uid": "desc" }
  ]
}
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
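As a rough sketch of how search_after pagination might be wired up on the client side (the date/_uid sort fields follow the example above; the helper function and its name are my own, not part of any client library):

```python
# Sketch of search_after pagination; the helper name and the sample
# sort values are illustrative assumptions, not from a real index.

def next_page_query(base_query, size, last_sort_values=None):
    """Build the request body for the next page.

    last_sort_values: the "sort" array of the last hit from the
    previous page (None for the first page).
    """
    body = {
        "size": size,
        "query": base_query,
        # The _uid tiebreaker makes the sort order total, so no hit is
        # skipped or repeated between pages even while the feed grows.
        "sort": [{"date": "asc"}, {"_uid": "desc"}],
    }
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

# First page:
first = next_page_query({"match": {"title": "elasticsearch"}}, 10)
# Suppose the last hit of the first page had "sort": [1463538857, "tweet#654323"];
# the follow-up request resumes right after that document:
second = next_page_query({"match": {"title": "elasticsearch"}}, 10,
                         [1463538857, "tweet#654323"])
```

The key point for the feed use case is that the cursor is the last document's sort values rather than a numeric offset, so newly indexed entries cannot shift the page boundaries.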

You just have to create your query DSL and a pagination system with
{ "size": 10, "from" : YOUR_OFFSET }
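A minimal sketch of such a pagination helper, assuming 0-based page numbers (keep in mind the question's caveat that plain from/size paging can skip or repeat entries while the feed is growing):

```python
# Minimal from/size paging helper; the 0-based page convention is an
# assumption, not from the original answer.

def page_body(page, page_size=10):
    """Request body for 0-based page `page`."""
    return {"size": page_size, "from": page * page_size}

# page 0 -> offset 0, page 3 -> offset 30
```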

If I understood your question correctly, then you can use ES scrolls for such a thing.
This is an example of how one would do that in Java; note that it uses SearchType.SCAN
SearchRequestBuilder requestBuilder = ....getClient().prepareSearch(indices);
/**
 * Set up scroll and go from there.....
 * To do that we need to change the search type to <code>SearchType.SCAN</code>
 * and set up the scroll itself.
 * Once search type and scroll are set and the search is executed, whoever
 * handles the result will need to check and poll the scroll.
 */
requestBuilder.setSearchType(SearchType.SCAN);
requestBuilder.setScroll(new TimeValue(SCROLL_WINDOW_IN_MILLISECONDS)); // this is in MILLISECONDS
requestBuilder.setSize(10); // this is how many hits per shard per scroll will be returned
SearchResponse results = requestBuilder.execute().actionGet();
while (true) {
    results = client.prepareSearchScroll(results.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
    if (results.getHits().getHits().length == 0) {
        break;
    }
    // do what you need to do w/ the scroll result here
}
So each pass through the while loop grabs the next 10 consecutive results, until you have retrieved all of them.

I know this is old, but I encountered the same dilemma and I'd rather think out loud.
In that feed, you seem to care about less and less relevant documents with every request. I'm deliberately not saying a timestamp or a comment count: in ES terms you are talking about a score that can be calculated from many factors, and what you want is to keep searching down that scoring road.
The solution that came to my mind: if you also care about more relevant documents (like Facebook showing "X new stories available" at the top), you can first search from the beginning until you reach the first document you encountered previously (the one that used to be most relevant). By adding the count of documents above it to the count of documents you have already displayed in the feed, you can determine an estimated offset (you might get a few duplicates under race conditions; just drop them).
So what you actually need to do is search the top until you reach that first document, then search the estimated bottom and drop everything more relevant than the last document.
This all assumes the bulk of the feed never changes: if document Y was between X and Z, it will stay there forever.
If the score is constant (unlikely, since it would have to keep rising for the feed to keep changing), you can also filter for everything below the score of the last document.

ElasticSearch random score combined with boost?

I am building an iOS app with Firebase, and I'm using Elasticsearch as a search engine to get more advanced queries.
I am trying to achieve a system where I can get a random record from the index based on a query. I already have this working using the "random_score" function with a seed.
So right now all documents have an equal chance of being selected. Is it possible to add a boost or something (sorry, I am new to ES)?
Let's say a document has the field "boost_enabled" set to true; that document should then be 3 times more likely to be selected, "increasing" its chance of being picked as the random record.
So in theory it should look like this:
Documents that matches the query:
"document1"
"document2"
"document3"
They all have an equal chance of being selected (33%)
What I wish to achieve is if "document1" has the field "boost_enabled" = true
It should look like this:
"document1"
"document1"
"document1"
"document2"
"document3"
So now "document1" is 3 times more likely to be selected as the random record.
Would really appreciate some help.
EDIT:
I've come up with something like this; is it correct or not? I'm pretty sure it's not, though...
"query": {
  "function_score": {
    "query": {
      "bool": {
        "must": {
          "match_all": {}
        },
        "should": [
          {
            "exists": {
              "field": "boost_enabled",
              "boost": 3
            }
          }
        ],
        "filter": filterArray
      }
    },
    "functions": [
      { "random_score": { "seed": seed } }
    ]
  }
}
Yes, Elasticsearch has something like that - refer to Elasticsearch: Query-Time Boosting.
In your case, you would have a portion of your query that notes the presence of the flag you described and this "subquery" would have a boost. bool with its should clause will probably be useful.
NB: this is not EXACTLY the same as saying a matching document is n times as likely to be a result.
EDITS:
--
EDIT 1:
Elasticsearch will tell you how it comes up with the score via the
Explain API which might be helpful in tweaking parameters.
--
EDIT 2:
I apologize for what I had posted above. Upon further thought and exploration, I think the boost parameter is not quite what is required here. function_score already has the notion of weight but even that falls short. I have found other users with requirements similar to yours but it looks like there haven't been any good solutions proposed for this.
References:
Elasticsearch Github Issue on Weighted Random Sampling
Stackoverflow Post with a Request Identical to Github Issue
I do not think the solutions proposed in those posts are quite right. I put together a quick shell script hitting the Elasticsearch REST API and relying on jq (a popular CLI for processing JSON) to demonstrate: Github Gist: Flawed Attempt At Weighed Random Sampling with Elasticsearch
In the script, featured_flag is equivalent to your boost_enabled, and undesired_flag is there to demonstrate how to consider only a subset of the documents in the index. You can copy the script and tweak the global variables at the top (Elasticsearch server, index, etc.) to try it out.
Some notes on the script:
the script creates one document with featured_flag enabled and one document with undesired_flag enabled that should never be chosen
TOTAL_DOCUMENTS can be used to adjust how many total documents are created (including the first two created)
FEATURED_FLAG_WEIGHT is the weight applied at query time via function_score
script reruns the same query 1000 times and outputs stats on how many times each of the created documents was returned as the first result
I would imagine your index has many "featured" or "boosted" samples among many that are not. With the described requirements, the probability of choosing a sample depends on the weight of the document (say 3 for boosted documents, 1 for the rest) and on the sum of weights across all valid documents you want taken into consideration. Therefore, it seems like simple weights, boosts, and randoms are just insufficient.
A lot of people have considered and posted solutions for the task of weighted random sampling without Elasticsearch. This appears to be a good stab at explaining a few approaches: electric monk: Weighted Random Distribution. A lot of algorithmic details may not be quite relevant here but I thought they were interesting.
I think the ideal solution would require work to be done outside of Elasticsearch (without delving into creating Elasticsearch plugins, scorers, etc). Here is the best that I can come up with at the moment:
A numeric weight field stored in the documents (can continue with boolean fields but this seems more flexible)
Hit Elasticsearch with an initial query leveraging aggregations for some stats we need
Possibly a sum aggregation for the sum of weights required for document probabilities
A terms aggregation to get counts of documents by weights (ex: m documents with weight 1, n documents with weight 3)
Outside of Elasticsearch (in the app), choose the sample
generate a random number within the range of 0 to sum_of_weights-1
use the aggregation results and the generated random to select an index (see the algorithmic solutions for weighted random sampling outside of Elasticsearch) that is in the range of 0 to total_valid_documents-1 (call this selected_index)
Hit Elasticsearch a second time with appropriate filters for considering only valid documents, a sort parameter that guarantees the document set is ordered the same way each time we run this process (perhaps sorted by weight and by document id), and a from parameter set to the selected_index
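The client-side selection step can be sketched as follows, assuming the terms aggregation came back as document counts keyed by weight (the {1: 2, 3: 1} sample data is made up):

```python
import random

def sum_of_weights(counts_by_weight):
    """Total weight across all valid documents."""
    return sum(w * n for w, n in counts_by_weight.items())

def pick_selected_index(counts_by_weight, r):
    """Map r in [0, sum_of_weights) to a document index.

    counts_by_weight: terms-aggregation result as {weight: doc_count}.
    Documents are assumed sorted deterministically (by weight, then by
    id), so the returned index can be fed into the `from` parameter of
    the second query.
    """
    index = 0
    for weight, doc_count in sorted(counts_by_weight.items()):
        bucket_total = weight * doc_count
        if r < bucket_total:
            # Each document in this bucket covers `weight` units of r.
            return index + r // weight
        r -= bucket_total
        index += doc_count
    raise ValueError("r out of range")

# Usage: {1: 2, 3: 1} means two docs of weight 1 and one doc of
# weight 3; the weight-3 doc occupies 3 of the 5 units, i.e. is
# chosen with probability 3/5.
counts = {1: 2, 3: 1}
r = random.randrange(sum_of_weights(counts))
selected_index = pick_selected_index(counts, r)
```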
Slightly related to all this, I posted a slightly different write up.
In ES 7.15 I use the script_score query; below you can see my example.
The script "source": "_score + Math.random()" adds a random number between 0.0 and 1.0 to my native boosted score. For more information you can see this.
{
  "size": { YOUR_SIZE_LIMIT },
  "query": {
    "script_score": {
      "query": {
        "query_string": {
          "fields": [
            "{ YOUR_FIELD_1 }^6",
            "{ YOUR_FIELD_2 }^3"
          ],
          "query": "{ YOUR_SEARCH_QUERY }"
        }
      },
      "script": {
        "source": "_score + Math.random()"
      }
    }
  }
}

Get the keyword positions for all matched docs with just ONE search query

I've been wrestling with this problem for days.
For example, if I have
{doc_id1:"thor marvel"},
{doc_id2:"spiderman thor"},
{doc_id3:"the avengers captain america ironman thor"}
three documents in Elasticsearch, and do a search query for "thor", I want it to tell me where the keyword "thor" is found in each document, like
{ doc_id1: 1, doc_id2: 2, doc_id3: 6}
as the desired result.
I have two possible solutions on top of my head now:
1. Figure out a way to put the term vector info (which includes all the positions/offsets for each token/word of the document) into _source, so that I can access it directly in a normal search result and construct the (doc, position) list outside Elasticsearch. Normally you can only access the term vector info for a single document at a time, given its index/type/id, which is why this is tricky. That would be the ideal way to achieve the goal.
2. Figure out a way to trigger an action whenever a new document is added. This action would scan through all the tokens/words in the new document, create a new "token" index (if it doesn't exist) and append the (doc_id, position) pair to it, like
{ keyword: "thor", [ doc_id1: 1, doc_id2: 2, doc_id3: 6 ] }.
Then I would just need to search for "thor" among the keyword indexes and get the (doc, position) lists. This seems even harder and less optimal.
Sadly, I don't know how to do either one. I'd appreciate it if someone could give me some help on this. Many thanks!

Search Console API: Impressions don't add up comparing totals to contains / not contains keywords

We are using the Search Console (webmaster tools) API to download search performance results for our site to compare search performance on people searching using our company name vs non company name searches. We have found a problem where the impressions don't add up when comparing "all search results" to "search results via specific keywords".
For example, if we do a report to show all web results for all devices for our site on a specific date, we get 189,491 impressions. If we then report to show results with the keyword "Our Name" we get 61,046. If we report on "OurName" (same keyword but without spaces) we get 1,086. If we then report not contains "Our Name" and not contains "OurName" we get 65,827, which adds up to 127,959, meaning somewhere we have 61,532 impressions missing.
Interestingly, if we change the filter on not contains to also include device equals DESKTOP, it increases to 65,997, yet I would have expected this to be equal to or less than all device impressions.
From the data we have, this seems to have stopped working on the 27th of November 2015 (before this date the 3 figures always added up to the total; on and after it they don't). The impressions add up fine if we only do one contains and one not contains. Clicks always seem to add up correctly, so I'm wondering if one of these queries is excluding data with zero clicks?
We are using the .NET library to access the Search Console data, but we get the same results when using the API Explorer. It is hard to replicate using the Search Console itself, as it doesn't allow you to include multiple "not contains" keywords. The total figures and the contains "our name" / "ourname" figures match between the API and the Search Console.
I've found a few other posts on here where people are having similar problems, but they are dated over a year ago, and we've only just noticed the problem in the last 3 weeks, so I don't know if this is a new problem.
The query for the not contains is as follows:
POST https://www.googleapis.com/webmasters/v3/sites/{YOUR_SITE_URL}/searchAnalytics/query?fields=rows&key={YOUR_API_KEY}
{
  "startDate": "2015-12-07",
  "endDate": "2015-12-07",
  "searchType": "web",
  "dimensionFilterGroups": [
    {
      "filters": [
        {
          "dimension": "query",
          "expression": "our name",
          "operator": "notContains"
        },
        {
          "dimension": "query",
          "expression": "ourname",
          "operator": "notContains"
        }
      ]
    }
  ]
}
Many thanks in advance for any help
cross posted from Google Search Console Forum
From the API reference, there is no OR operation available for multiple filter expressions:
"Whether all filters in this group must return true ("and"), or one or more must return true (not yet supported)."
BOTH filters must be passed to get into the total.
Does not contain "our name" AND Does not contain "ourname".
https://developers.google.com/webmaster-tools/v3/searchanalytics/query
Having said that, you are probably even more at a loss to explain some of your results... maybe you have a number of queries that contain both "our name" AND "ourname"?
I'm working on the same topic at the moment (excluding brand searches). As Google says, they exclude search queries that can contain private information:
To protect user privacy, Search Analytics doesn't show all data. For example, we might not track some queries that are made a very small number of times or those that contain personal or sensitive information.
https://support.google.com/webmasters/answer/6155685?hl=en#tablegone
With this in mind you have a big block of data with no query information, so if you filter in any way, that whole block isn't included.
For example, we had about 325,000 total impressions on the 1st of July, but if I do two separate queries, one including and one excluding the keywords, and add the values for clicks and impressions together, I get the total for the block of data my queries live in.
In our case that is around 180,000 impressions, so 145k impressions were made with queries I don't know and can't filter on.
In your case, the 127,959 could be your total of query-visible impressions (depending on your keywords). So your non-brand traffic with 65,827 impressions is more like 50 percent than 30.
I hope it's more or less understandable.
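Running the figures quoted earlier in the thread through this reasoning makes the gap concrete (a small worked example; the numbers are the ones from the question):

```python
# Worked example using the impression counts quoted in the question.

total_reported = 189_491          # all impressions for the day
brand = 61_046 + 1_086            # contains "Our Name" + contains "OurName"
non_brand = 65_827                # not contains either keyword
query_visible = brand + non_brand # impressions with a known query

# Impressions hidden behind anonymized queries: any filter drops these.
hidden = total_reported - query_visible

share_of_reported = non_brand / total_reported  # roughly 35%
share_of_visible = non_brand / query_visible    # roughly 51%
```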

Unique count of terms aggregations

I want to count the distinct values of a field in my dataset. For example:
The terms aggregation gives me the number of occurrences per username. I want to count only unique usernames, not all occurrences.
Here's my request:
POST appzz/messages/_search
{
  "aggs": {
    "words": {
      "terms": {
        "field": "username"
      }
    }
  },
  "size": 0,
  "from": 0
}
Is there a unique option or something like that?
You're looking for the cardinality aggregation, which was added in Elasticsearch 1.1. It allows you to request something like this:
{
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "username"
      }
    }
  }
}
We had a long discussion about this with one of the ES guys at a recent Elasticsearch meetup we had here. The short answer is no, there isn't, and according to him it's not something to expect soon.
One option to kind of do it is to get all the terms (give a really big size limit) and count how many terms are returned, but that's expensive and not really viable if you have a lot of unique terms.
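A quick sketch of that workaround, assuming the standard terms-aggregation response shape (the sample payload and usernames are made up):

```python
# Count distinct values by counting returned term buckets. Only valid
# if `size` on the terms aggregation was large enough to return every
# distinct term, which is exactly the caveat mentioned above.

def unique_count(search_response, agg_name="words"):
    """Distinct-value count from a terms-aggregation response."""
    return len(search_response["aggregations"][agg_name]["buckets"])

# Made-up response for illustration:
sample = {
    "aggregations": {
        "words": {
            "buckets": [
                {"key": "alice", "doc_count": 7},
                {"key": "bob", "doc_count": 3},
                {"key": "carol", "doc_count": 1},
            ]
        }
    }
}
# unique_count(sample) -> 3 distinct usernames
```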
#DerMiggel: I tried using cardinality for my project. Surprisingly, on my local system, out of a total dump of some 200,000 documents, I tried cardinality with a precision_threshold of 100, 0 and 40,000 (the max value). The first two times the result was different (counts of 175 and 184 respectively), and with 40,000 I got an out of memory exception. The computation time was also huge compared to other aggs. Hence I feel cardinality is not actually that accurate and might crash your system when high accuracy and precision are required.
I'm still fairly new to ES, but if I understand you correctly, it seems you should be able to get your answer by simply counting the number of buckets returned in the response (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html).
NOTE, though: contrary to what that doc currently says about size 0 ("It is possible to not limit the number of terms that are returned by setting size to 0."), my testing with the latest version (1.0.1 now) shows that this does not work! On the contrary, setting size to 0 will give you 0 buckets! For now you should instead set (sigh) size to some arbitrarily high figure if you want to get all the terms.
EDIT: whoops, my bad! I just reread the doc, only now noticed the version note there, and realized that this is only coming in 1.1.0. That note is in the past tense ("Added in 1.1.0."), which is confusing, but I guess 1.1.0 hasn't been released yet...
Oh, and by the way, there seems to be something wrong with your URL; I hope you know that.

Trouble with facet counts

I'm attempting to use Elasticsearch for analytics, specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter for each piece of content, but I won't get into the depth of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
Ok, great: the piece of content with id "quis" had 15 hits, and since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content with id "qgz9" had more hits, with 26? Why wasn't it the top result in the first query?
Ok, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index; if I repeat these queries, I get the same results.
As I repeat this process with increasing result sizes, the counts keep changing and new content ids emerge at the top. Can someone explain what I'm doing wrong, or where my understanding of how this works is flawed?
It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was being created with 5 shards. After changing this so the index has only a single shard, the counts behave in line with my expectations. Another workaround would be to always set size to a value greater than the number of expected facets and peel off the top N results client-side.
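A sketch of that second workaround, done client-side; the response shape follows the terms facet from the question, and the sample counts are made up:

```python
# Over-request the facet (size >> N) so the per-shard top lists merge
# accurately, then keep only the top N on the client.

def top_n_terms(facet_response, facet_name, n):
    """Top n terms by count from a terms-facet response."""
    terms = facet_response[facet_name]["terms"]
    # Re-sort defensively before slicing off the top n entries.
    return sorted(terms, key=lambda t: t["count"], reverse=True)[:n]

# Made-up facet response for illustration:
sample = {
    "el_ids": {
        "terms": [
            {"term": "quis", "count": 15},
            {"term": "qgz9", "count": 43},
            {"term": "difc", "count": 37},
        ]
    }
}
# top_n_terms(sample, "el_ids", 2) keeps qgz9 (43) and difc (37)
```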
