Trouble with facet counts - elasticsearch

I'm attempting to use ElasticSearch for analytics, specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter for each piece of content, but I won't get into the depth of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
Ok, great, the piece of content with id "quis" had 15 hits and, since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content with id "qgz9" had more hits, with 26? Why wasn't it the top result in the first query?
Ok, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index. If I repeat these queries, I get the same results.
As I repeat this process of increasing the result size, counts continue to change and new content ids emerge at the top. Can someone explain to me what I'm doing wrong or where my understanding of how this works is flawed?

It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was being created with 5 shards. By changing this so the index only has a single shard, the counts behave in line with my expectations. Another workaround would be to always set size to a value greater than the number of distinct terms expected in the facet and peel off the top N results.
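The effect described in the quote can be reproduced with a small simulation. This is only a sketch of the "top N from each shard, then merge" idea, not Elasticsearch's actual implementation; the shard contents and function names are made up for illustration.

```python
from collections import Counter

def shard_top_n(shard_terms, n):
    """Return the n most frequent terms on one shard as (term, count) pairs."""
    return Counter(shard_terms).most_common(n)

def merged_top_n(shards, n):
    """Ask every shard for its own top n, then sum the counts that came back."""
    merged = Counter()
    for shard in shards:
        for term, count in shard_top_n(shard, n):
            merged[term] += count
    return merged.most_common(n)

def exact_top_n(shards, n):
    """Count every term on every shard before ranking (the single-shard view)."""
    merged = Counter()
    for shard in shards:
        merged.update(shard)
    return merged.most_common(n)

# Hypothetical data: "b" is never the top term on any shard, but it is the top term overall.
shards = [
    ["a"] * 6 + ["b"] * 4,
    ["c"] * 5 + ["b"] * 4,
    ["d"] * 5 + ["b"] * 4,
]

print(merged_top_n(shards, 1))  # per-shard merge with size 1
print(exact_top_n(shards, 1))   # true top term
print(merged_top_n(shards, 2))  # a larger size lets "b" surface
```

With size 1 the merge never sees "b" at all, which matches the way new ids and larger counts kept appearing as size was increased in the question.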


Reporting Multiple Values & Sorting

I'm having a bit of an issue and am unsure if what I want is actually possible.
I'm working on a file in which I enter target progression against the actual target and report the outcome as a percentage.
PAGE 1
¦NAME ¦TAR 1 %¦TAR 2 %¦TAR 3 %¦TAR 4 %¦OVERALL¦SUB 1¦SUB 2¦SUB 3¦
¦NAME1¦   114%¦   121%¦   100%¦   250%¦   146%¦    2¦    0¦   0%¦
¦NAME2¦    88%¦   100%¦    90%¦    50%¦    82%¦    0¦    1¦   0%¦
¦NAME3¦    82%¦    54%¦    64%¦   100%¦    75%¦    6¦    6¦  15%¦
¦NAME4¦   103%¦    64%¦    56%¦    43%¦    67%¦    4¦    4¦  24%¦
¦NAME5¦    87%¦    63%¦    89%¦     0%¦    60%¦    3¦    2¦  16%¦
I already have all rows sorted by the Overall % column so I can see the ranking at a glance, but I am creating a second page that needs to reference the top values.
On that second page I would like to sort by and reference different columns, for example:
PAGE 2
TOP TAR 1¦Name of top %¦Top %¦
TOP TAR 2¦Name of top %¦Top %¦
Is something like this possible to do?
Essentially I'm creating an Employee of the Month form that automatically works out who has topped what.
I'm willing to drop a PayPal donation for whoever can figure this out for me, as I've been doing it manually every month and would appreciate the time saved.
I don't think a complicated array formula is necessary for this - I am suggesting a fairly standard Index/Match approach.
First set up the row titles - you can just copy and transpose them from Page 1, or use a formula in A2 of Page 2 like
=transpose('Page 1'!B1:E1)
Then use them in an index/match to get the data in the corresponding column of the main sheet and find its maximum (in C2)
=max(index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)))
Finally look up the maximum in the main sheet to find the corresponding name:
=index('Page 1'!A:A,match(C2,index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)),0))
If you think there could be a tie for first place, with two or more people getting the same score, you could use a filter to get the different names.
So if the max score is in B8 this time (same formula)
=max(index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0)))
the different names could be spread across the corresponding row using transpose (in C8)
=ArrayFormula(TRANSPOSE(filter('Page 1'!A:A,index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0))=B8)))
I have changed the test data slightly to show these different scenarios
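For anyone who wants to check the logic outside the spreadsheet, the same max-then-lookup idea can be sketched in Python. This is an illustration only: the function name top_scorers is hypothetical, and the rows are the Page 1 values from the question with the percent signs dropped.

```python
def top_scorers(rows, column):
    """Return (best_score, names) for one column; names holds every tied leader."""
    # Same role as =max(index(..., match(...))): the highest value in the column.
    best = max(row[column] for row in rows)
    # Same role as filter(...): every name whose value equals that maximum.
    names = [row["NAME"] for row in rows if row[column] == best]
    return best, names

rows = [
    {"NAME": "NAME1", "TAR 1 %": 114, "TAR 2 %": 121, "TAR 3 %": 100, "TAR 4 %": 250},
    {"NAME": "NAME2", "TAR 1 %": 88, "TAR 2 %": 100, "TAR 3 %": 90, "TAR 4 %": 50},
    {"NAME": "NAME3", "TAR 1 %": 82, "TAR 2 %": 54, "TAR 3 %": 64, "TAR 4 %": 100},
    {"NAME": "NAME4", "TAR 1 %": 103, "TAR 2 %": 64, "TAR 3 %": 56, "TAR 4 %": 43},
    {"NAME": "NAME5", "TAR 1 %": 87, "TAR 2 %": 63, "TAR 3 %": 89, "TAR 4 %": 0},
]

for column in ("TAR 1 %", "TAR 2 %", "TAR 3 %", "TAR 4 %"):
    best, names = top_scorers(rows, column)
    print(f"TOP {column}: {', '.join(names)} ({best}%)")
```

Returning a list of names covers the tie case in the same way the filter formula does: a single winner is just a one-element list.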

Google Search Appliance accurate result count parameter not making a difference

We are having a result count issue on pages that show 10 results each. When paginating, we get a result count of 64 on page 1 (i.e. start=0), 25 on page 2, and 21 on page 3.
I understand from the documentation on estimated vs. actual results that the count is not guaranteed, but the counts above are what I get with filter=0 and rc=1 set. Including or omitting rc=1 does not appear to make a difference. We are on version 7.2.0.G.252.
filter=0&rc=1 should work for you and you should see the same count even after paginating.
What you need to notice is, when you click on pagination link, make sure the filter=0&rc=1 are carried over. i.e., after pagination, see if you still have the filter and rc parameters intact.
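One way to make sure the two parameters survive pagination is to build every page link from the same parameter set. The sketch below assumes a hypothetical appliance host (gsa.example.com) and a hypothetical helper name; q, start, num, filter and rc are the search request parameters discussed in this thread.

```python
from urllib.parse import urlencode

def page_url(base_url, query, page, per_page=10):
    """Build the search URL for a zero-based page, always carrying filter=0 and rc=1."""
    params = {
        "q": query,
        "start": page * per_page,  # offset of the first result on this page
        "num": per_page,
        "filter": 0,               # turn off automatic result filtering
        "rc": 1,                   # request the accurate result count
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical host and query, for illustration only.
for page in range(3):
    print(page_url("http://gsa.example.com/search", "annual report", page))
```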
Also check with the default_frontend, as your custom frontend may not be handling these parameters.
The problem was related to the collection, not the query. The content match pattern did not include a "/" at the end; once that was fixed, the count was accurate. Thanks for the assistance.

Solr query conundrum

I've recently swapped from using Lucene for Sitecore to Solr.
For the most part it has been smooth, but some of the queries I wrote using the Sitecore.ContentSearch.Linq abstraction no longer seem to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (ID: 000)
    X
    Y
    Z
    Regions (ID: 111)
        Region 1 (ID: 221)
            A
            B
        Region 2 (ID: 222)
            D
My code worked on Lucene, but now doesn't on Solr. It should find all "global" content and a single region's content, excluding all other regions' content. So as an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field for each item in the index called "_path" which is a multivalued string field of IDs, so as an example, Region 1's _path field value would be [000, 111, 221 ].
When I write this using the Linq abstraction it comes out as below which doesn't return results.
-_path:(111) OR _path:(221)
But _path:(111) does return results. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant), it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is that I can't add these extra brackets, as I'm working in the Linq abstraction and the brackets will be "optimized" out.
Any advice would be awesome! Cheers.
The problem here is that Lucene's negative queries don't work like you think they do. They only remove results from what has been found. -_path:111 doesn't find all documents which aren't in 111; it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which, from your hierarchy, I assume is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: Don't use Lucene's AND/OR/NOT syntax, if you can help it. Use +/-. +/- syntax actually expresses how the query operates, AND/OR/NOT doesn't. It attempts to shoehorn it into a different, SQL-like retrieval model and leads to some unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it gets some results, but only the same results as _path:221: the reason is that -_path:111 gets no results, so your query is, in practice, something like (nothing) OR (_path:221), which is equivalent to _path:221.
If it really does get the results you expect (I'm guessing it probably does): something is translating your query into something like (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure whether it applies in this case. Essentially, it puts a match-all in front of any lonely negative queries it finds, allowing them to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. Lucene is an inverted index, so it is good at finding the matches for a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much have to do a full scan of the index.
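The behaviour described above can be imitated with plain sets. This is a toy model of the retrieval logic as explained in this answer, not Lucene or Solr code; the index contents follow the hierarchy in the question, and the "*:*" key stands in for a match-all query.

```python
def evaluate(index, should=(), must_not=()):
    """Toy boolean query: positive clauses collect documents, negative clauses only remove them."""
    matched = set()
    for term in should:
        matched |= index.get(term, set())
    for term in must_not:
        matched -= index.get(term, set())
    return matched

# _path values from the question: each term maps to the items whose _path contains it.
index = {
    "000": {"X", "Y", "Z", "A", "B", "D"},
    "111": {"A", "B", "D"},
    "221": {"A", "B"},
    "222": {"D"},
}
index["*:*"] = set().union(*index.values())  # stand-in for a match-all query

# -_path:111 on its own: nothing was found, so there is nothing to remove from.
pure_negative = evaluate(index, must_not=["111"])

# -_path:(111) OR _path:(221), read as "-_path:111 _path:221".
flat = evaluate(index, should=["221"], must_not=["111"])

# (*:* -_path:111) (_path:221): each bracket is evaluated separately, then the results are combined.
wrapped = evaluate(index, should=["*:*"], must_not=["111"]) | evaluate(index, should=["221"])

print(pure_negative, flat, sorted(wrapped))
```

In this model only the last form returns X, Y, Z, A and B, which is the result set the question was after.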

Search Console API: How to get full result set?

I'm working with the Search Console API. I'm authenticated and getting data, but not all that I'm hoping for.
The docs say that I can request 5,000 rows at a time. But when I set the row limit like this:
$request->setRowLimit(5000);
I get 127 rows returned, with text at the very bottom of the result set that says 'more elements...' -- almost as if it's a paginated result set.
How do I get to those 'more elements'?
Edit: At the top of my result set, I see this response;
array (size=5000)
So it definitely appears there are 5,000 results in the array, I just don't know how to get them all.
Please post an example of your code. I set the row limit to 5000 but I don't have any problems and never see the label "more elements". Where do you get this text? Inside the result array?

Elasticsearch get matching documents after specific document id

When I search for documents, I take the first 10 and give them to the view; if the user scrolls to the end of the list, the next 10 elements should be displayed.
I know the last document id of the displayed documents, now I have to get the next 10. Basically I would perform the exact same search with an offset of 10 but it would be much better to be able to search with the same query, putting the document id of the last retrieved document to it and retrieve the matching documents after the document with that id.
Is that possible with elasticsearch?
=== UPDATE
I want to point out my issue a bit more, because it seems it is not clear enough as it is described right now. Sorry for that.
The case:
You have a kind of feed, the feed will grow every second. If a user goes to the feed he gets the most recent 10 entries, if he scrolls down he wants to get the next 10 entries.
Because the feed is growing every second, a usual offset / limit (from / size in elasticsearch) can't solve this problem, you would display already displayed entries or completely newer entries, depending on the time between first request (first 10 entries) and the request for the next entries.
The request to get the next 10 elements AFTER the already displayed entries gives the backend the id of the last entry which got displayed. The backend knows to ignore all entries before this specific one.
At the moment I'm handling this in code: I request the list of all matching entries from Elasticsearch and iterate over it. This way I can do everything I want (no surprise) and extract the needed chunk of entries.
My question is: is there a built-in solution for this issue in Elasticsearch? Solving the problem my way is not the fastest.
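The difference between offset paging and "give me what comes after this entry" can be shown with a small in-memory feed. This is a simulation for illustration only; the entry names and helper functions are hypothetical and no Elasticsearch calls are involved.

```python
def page_by_offset(feed, offset, size):
    """from/size style paging over a newest-first feed."""
    return feed[offset:offset + size]

def page_after(feed, last_id, size):
    """Cursor style paging: return the entries that follow the entry with last_id."""
    # Raises StopIteration if last_id is no longer in the feed.
    position = next(i for i, entry in enumerate(feed) if entry == last_id)
    return feed[position + 1:position + 1 + size]

# Newest first: entry-20, entry-19, ..., entry-1
feed = [f"entry-{n}" for n in range(20, 0, -1)]
first_page = page_by_offset(feed, 0, 10)

# Three new entries arrive before the user scrolls down.
feed = ["entry-23", "entry-22", "entry-21"] + feed

second_by_offset = page_by_offset(feed, 10, 10)
second_by_cursor = page_after(feed, first_page[-1], 10)

print(sorted(set(first_page) & set(second_by_offset)))  # entries shown twice with offset paging
print(second_by_cursor[0], second_by_cursor[-1])
```

The offset version repeats the last three entries of the first page, while the cursor version continues right after the last entry that was displayed.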
It's an old topic, but it feels like the Search After API, which has been available since Elasticsearch 5.0, does exactly what is needed. Provide the id of your last document and its timestamp, for example:
GET twitter/tweet/_search
{
    "size": 10,
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    },
    "search_after": [
        1463538857,
        "tweet#654323"
    ],
    "sort": [
        {
            "date": "asc"
        },
        {
            "_uid": "desc"
        }
    ]
}
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
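In practice the search_after values do not have to be assembled by hand: each hit of a sorted search comes back with a "sort" array, and that array can be passed on as-is. The helper below is a minimal sketch of that step using plain dictionaries; the function name and the sample hit are hypothetical, and the query and sort fields are copied from the example above.

```python
import copy

def next_page_request(base_request, last_hit):
    """Build the request body for the next page from the last hit of the previous page."""
    request = copy.deepcopy(base_request)       # leave the caller's request untouched
    request["search_after"] = last_hit["sort"]  # sort values of the last displayed document
    request.pop("from", None)                   # no offset is used; the cursor replaces it
    return request

base_request = {
    "size": 10,
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [{"date": "asc"}, {"_uid": "desc"}],
}

# Hypothetical last hit of the previous page, reduced to the fields used here.
last_hit = {"_id": "654323", "sort": [1463538857, "tweet#654323"]}

print(next_page_request(base_request, last_hit))
```

Because the sort includes a unique tiebreaker field, two documents with the same timestamp still have a well-defined order between pages.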
You just have to create your query DSL and a pagination system with
{ "size": 10, "from" : YOUR_OFFSET }
If I understood your question correctly, you can use ES scrolls for this.
Here is an example of how one would do that in Java; note that it uses SearchType.SCAN.
/*
 * Set up the scroll and go from there.
 * The search type has to be changed to SearchType.SCAN and a scroll
 * window has to be set. Once the search is executed, whoever handles
 * the result needs to keep polling the scroll until it is exhausted.
 */
SearchRequestBuilder requestBuilder = ....getClient().prepareSearch(indices);
requestBuilder.setSearchType(SearchType.SCAN);
requestBuilder.setScroll(new TimeValue(SCROLL_WINDOW_IN_MILLISECONDS)); // value is in milliseconds
requestBuilder.setSize(10); // hits per shard returned by each scroll request
SearchResponse results = requestBuilder.execute().actionGet();
while (true) {
    // fetch the next batch and keep the scroll context alive for another 60 seconds
    results = client.prepareSearchScroll(results.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
    if (results.getHits().getHits().length == 0) {
        break; // scroll exhausted
    }
    // do what you need to do with the scroll results here
}
So on every pass through the while loop you grab the next consecutive batch of results until you have them all.
I know this is old, but I encountered the same dilemma and I'd rather think out loud.
In that feed, you seem to care about less and less relevant documents with every request. I'm deliberately not saying timestamp, comment count, etc.: in ES terms you are talking about a score that can be calculated from many factors, and what you want is to continue searching down that scoring road.
The solution that came to my mind was: if you also care about more relevant documents (like Facebook showing "X new stories available" at the top), you can first search from the beginning until you reach the first document you encountered (the one that was previously most relevant). By adding the count of documents before it to the count of documents you have already displayed in the feed, you can determine an estimated offset (you might get a few duplicates in race conditions; just drop them).
So what you actually need to do is search the top until you reach the first document, then search from the estimated offset and drop everything more relevant than the last document.
This all assumes that the relative order of documents in the feed never changes: if document Y was between X and Z, it will stay there forever.
If each document's score is constant (unlikely, as new documents would then need ever-higher scores for the feed to keep changing), you can also filter for everything below the score of the last document.
