Calculating Relative Recall - elasticsearch

When calculating relative recall using TREC and 'K' pooling, does the total number of relevant documents reflect the relevant documents from all participating systems per query, or across all queries?
And does this approach not invalidate recall calculations? Say I have the top 50 documents from two systems but collectively there are 75 relevant documents; then, irrespective of how good either system is, neither will ever be able to reach 100% recall.

When calculating relative recall using TREC and 'K' pooling, does the total number of relevant documents reflect the relevant documents from all participating systems per query, or across all queries?
The set of relevant documents comprises documents that are judged relevant by human assessors, who are asked to look at the union of the top-100 documents retrieved by each participating system. Note the stress on the word union, which indicates that the assessors are not shown this set in any particular order. So, this pool is indeed a set (and not an ordered set).
The set of relevant documents is different for each query. So you might imagine that if R represents the set of relevant documents, it has an argument q (the query). So, in effect, you have R(q) and not just R.
And does this approach not invalidate recall calculations? Say I have the top 50 documents from two systems but collectively there are 75 relevant documents; then, irrespective of how good either system is, neither will ever be able to reach 100% recall.
They can, in principle, achieve 100% recall if they retrieve at least 75 documents each. Obviously, if you're allowed to retrieve 10 documents and there are a total of 20 relevant documents, the maximum recall you can achieve is only 50%.
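To make the arithmetic concrete, here is a toy Python sketch of per-query relative recall against a pooled judgment set; the integer document IDs are made up purely for illustration:

def recall(retrieved, relevant):
    # fraction of the relevant pool R(q) that the system retrieved for query q
    return len(set(retrieved) & set(relevant)) / len(relevant)

pool = set(range(75))         # hypothetical R(q): 75 documents judged relevant
top_50 = list(range(50))      # a system returning its top 50 documents
print(recall(top_50, pool))   # 0.666...: capped below 100% by the retrieval cutoff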

Related

How does Elasticsearch 7 track_total_hits improve query speed?

I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.
I read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.
But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?
Especially when sorting is involved, I can't get my head around it. Because (at least in my world) when sorting a collection you have to visit every document, and that's exactly what they are trying to avoid according to the Documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."
Hopefully someone can explain how things work under the hood and which important point I am missing.
There are at least two different contexts in which not all documents need to be sorted:
A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort order as the one the index was pre-sorted with, only the top N documents of each segment file need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization, since there's no need to visit all the documents of the index.
B. When querying in the filter context (i.e. bool/filter), because no scores are calculated. The index is simply checked for documents that match a yes/no question, a process that is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.
If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
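As an illustration, here is a minimal sketch using the official Python client with 8.x-style keyword arguments; the index name, field, and threshold are invented for the example:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# skip hit counting entirely: fastest, no exact total is computed
resp = es.search(index="logs", query={"term": {"status": "error"}},
                 track_total_hits=False)

# count accurately only up to 50,000 matches; beyond that the total
# is reported as a lower bound and counting stops early
resp = es.search(index="logs", query={"term": {"status": "error"}},
                 track_total_hits=50000)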
Relevant links:
https://github.com/elastic/elasticsearch/pull/24864
https://github.com/elastic/elasticsearch/issues/33028
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand

Optimizing Redis-Graph query performance (match)

I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this, I first created a test graph to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes; we would need to be able to increase this to hundreds of millions of edges in a single database.
In any case, I was checking space and performance requirements but stopped after loading in only the vertices and seeing that the performance for a:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
is over 300 milliseconds for just this retrieval, which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think that at least one instance of the item must exist for an index to be created, but I've not measured the extra overhead of creating the index early and then adding most of the new nodes, versus creating it after all items are in the graph so they can be indexed en masse.
If all nodes are labeled as "token", then RedisGraph will have to scan 3.5 million entities, comparing each entity's "token" attribute against the value you've provided ("some-string").
To speed things up, I would recommend either adding an index or limiting the number of results you would like to receive using LIMIT.
Also worth mentioning is that the first query to be served might take a while longer than subsequent queries due to internal memory management.
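A minimal sketch of both suggestions, using redis-py's generic command interface; the graph name gid, the :token label, and the query string come from the question, while the host, port, and LIMIT value are assumptions:

import redis

r = redis.Redis(host="localhost", port=6379)

# create the index once; later MATCHes on :token(token) can use it
r.execute_command("GRAPH.QUERY", "gid", "CREATE INDEX ON :token(token)")

# the original lookup, now index-backed; LIMIT caps the result set as suggested
r.execute_command("GRAPH.QUERY", "gid",
                  'MATCH (t:token {token: "some-string"}) RETURN t LIMIT 10')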

Kibana, filter on count greater than or equal to X

I'm using Kibana to visualize some (Elasticsearch) data but I'd like to filter out all the results with "Count" less than 1000 (X).
I am using a Y-axis with a "count" aggregation; this is the count I'd like to filter on. I tried adding a min_document_count as suggested by several online resources, but this didn't change anything. Any help would be greatly appreciated.
My entire Kibana "data" tab: [screenshot]
Using min_doc_count with order: ascending does not work as you would expect.
TL;DR: Increasing shard_size and/or shard_min_doc_count should do the trick.
Why the aggregation is empty
As stated by the documentation:
The min_doc_count criterion is only applied after merging local terms
statistics of all shards.
This means that when you use a terms aggregation with the parameters size and min_doc_count and ascending count order, Elasticsearch retrieves the size least frequent terms in your data set and then filters this list to keep only the terms with doc_count > min_doc_count.
If you want an example, given this dataset:
terms | doc_count
-----------------
lorem | 3315
ipsum | 2487
dolor | 1484
sit   | 1057
amet  |  875
conse |  684
adip  |  124
elit  |   86
If you perform the aggregation with size=3 and min_doc_count=100, Elasticsearch will first compute the 3 least frequent terms:
conse: 684
adip : 124
elit : 86
and then filter for doc_count>100, so the final result would be:
conse: 684
adip : 124
Even though you would expect "amet" (doc_count=875) to appear in the list, Elasticsearch loses this term while computing the result and cannot retrieve it at the end.
In your case, you have so many terms with doc_count<1000 that they fill your list, and then, after the filtering phase, you have no results.
Why is Elasticsearch behaving like this?
Everybody would like to apply a filter and then sort the results. We were able to do that with older datastores, and it was nice. But Elasticsearch is designed to scale, so by default it turns off some of the magic that was used before.
Why? Because with large datasets it would break.
For instance, imagine that you have 800,000 different terms in your index and that the data is distributed over several shards, which can themselves be distributed over different machines (at most one machine per shard).
When requesting terms with doc_count>1000, each machine has to compute several hundred thousand counters (more than 200,000, since some occurrences of a term can be in one shard and others in another). And even if a shard saw a term only once, it may have been seen 999 times by the other shards, so no shard can drop the information before the results are merged. That means more than a million counters have to be sent over the network, which is quite heavy, especially if it is done often.
So, by default, Elasticsearch will:
Compute doc_count for each term in each shard.
Not apply any doc_count filter at the shard level (a loss in speed and resource usage, but better for accuracy): shard_min_doc_count defaults to 0.
Send the size * 1.5 + 10 (shard_size) terms from each shard to the coordinating node: the least frequent terms if the order is ascending, the most frequent otherwise (a worked example follows this list).
Merge the counters in this node.
Apply the min_doc_count filter.
Return the size least/most frequent results.
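For example, with the default size=10, each shard sends its 10 * 1.5 + 10 = 25 top (or bottom) terms to the coordinating node before the merge and the min_doc_count filter are applied.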
Could it be simple for once?
Yes, sure. As I said, this is just the default behavior. If you do not have a huge dataset, you can tune those parameters :)
Solution
If you are not OK with some loss of accuracy:
Increase the shard_size parameter to be greater than [your number of terms with a doc_count below your threshold] + [the number of values you want if you want exact results].
If you want all the results with doc_count>=1000, set it to the cardinality of the field (number of different terms), but then I do not see the point of order: ascending.
It has a massive memory impact if you have many terms, and a network impact if you have multiple ES nodes.
If you are OK with some loss of accuracy (often minor):
Set shard_size between [the number of values you want] and the sum described above. This is useful if you want more speed or if you do not have enough RAM to perform the exact computation. The right value depends on your dataset.
Use the shard_min_doc_count parameter of the terms aggregation to partially pre-filter the less frequent values. It is an efficient way to filter your data, especially if the values are randomly distributed between your shards (the default) and/or you do not have a lot of shards.
You can also put your data in one shard. There is no loss in terms of accuracy, but it is bad for performance and scaling. Yet you may not need the full power of ES if you have a small dataset.
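Putting the solution together, here is a hedged sketch of such a terms aggregation via the Python client; the index name, field name, and all the numbers are placeholders you would adapt to your data and threshold:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-index",
    size=0,  # we only want the aggregation, not the hits themselves
    aggs={
        "by_term": {
            "terms": {
                "field": "my_field",
                "size": 50,                   # buckets you actually want back
                "order": {"_count": "asc"},   # ascending count, as in the question
                "min_doc_count": 1000,        # final filter, applied after the merge
                "shard_min_doc_count": 1000,  # approximate pre-filter on each shard
                "shard_size": 10000,          # keep more candidates per shard
            }
        }
    },
)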
NB: Ascending count order for terms aggregations is discouraged (because it costs a lot in time and hardware to be accurate); it may well be removed in the future.
PS: You should add the Elasticsearch request generated by Kibana; it is often useful when Kibana returns data, but not the data you want. You can find it in the "Request" tab when you click on the arrow below your graph in your screenshot (ex: http://imgur.com/a/dMCWE).

Paging elasticsearch aggregation results

Imagine I have two kinds of records: a bucket and an item, where an item is contained in a bucket, and a bucket may have a relatively small number of items (normally not more than 4, never more than 10). Those records are squashed into one (an item with extra bucket information) and placed inside Elasticsearch.
The task I am trying to solve is to find 500 buckets (at most) with all related items at once, using a filtered query that relies on the items' attributes, and I'm stuck on limiting/offsetting aggregations. How do I perform such a task? I see the top_hits aggregation, which allows me to control the number of related items returned, but I can't find a clue as to how I can control the number of returned buckets.
Update: okay, I'm terribly stupid. The size parameter of the terms aggregation provides the limiting. Is there any way to perform the offsetting? I don't need 100% precision and probably won't ever page those results, but anyway I'd like to see this functionality.
I don't think we'll be seeing this feature any time soon; see the relevant discussion on GitHub.
Paging is tricky to implement because document counts for terms
aggregations are not exact when shard_size is less than the field
cardinality and sorting on count desc. So weird things may happen like
the first term of the 2nd page having a higher count than the last
element of the first page, etc.
An interesting approach is mentioned there: you could request, say, the top 20 results on the 1st page, then on the 2nd page run the same aggregation but exclude the 20 terms you already saw on the previous page, and so forth. But this doesn't allow you "random" access to an arbitrary page; you must go through the pages in order.
...if you only have a limited number of unique values compared to the
number of matched documents, doing the paging on client-side would be
more efficient. On the other hand, on high-cardinality-fields, your
first approach based on an exclude would probably be better.
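Here is a rough sketch of that exclude-based approach with the Python client; the index name, the bucket_id field, and the page size are illustrative, not from the question:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def page_of_buckets(already_seen, page_size=20):
    # same aggregation every time, minus the bucket keys we have already shown
    return es.search(
        index="items",
        size=0,
        aggs={
            "buckets": {
                "terms": {"field": "bucket_id", "size": page_size,
                          "exclude": already_seen},
                "aggs": {"items": {"top_hits": {"size": 10}}},
            }
        },
    )

page1 = page_of_buckets([])
seen = [b["key"] for b in page1["aggregations"]["buckets"]["buckets"]]
page2 = page_of_buckets(seen)  # pages can only be walked in order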

MongoDB performance issue

I am using MongoDB and I need to update my documents, say 1000 in total. My documents have a basic structure like:
{
    "People": [1, 2, 3, 4],
    "Place": "Auckland",
    "Event": "Music Show"
}
I have 10,000 threads running concurrently in another VM. Each thread looks at these 1000 documents, checks which of them match its query, and pushes a number into the People array. Suppose thread 100 finds, say, 500 of these 1000 documents relevant; it then pushes the number 100 into the People array of all 500 documents.
For this,
For each of the 10,000 threads, I am using the command:
update.append("$push", new BasicDBObject("People", serial_number));
updateMulti(query, update);
I observe poor performance for these in-place updates (multi-query).
Is this a problem due to a write lock?
Every one of the 10,000 threads updates the documents relevant to its query, so there seems to be a lot of "waiting".
Is there a more efficient way to do these "push" operations?
Is "UpdateMulti" the right way to approach this?
Thank you for a great response - editing and adding some more information below.
Some design background :
Yes, your reading of our problem is correct. We have 10,000 threads, each representing one "actor", updating up to 1000 entities (based on the appropriate query) at a time with a $push.
Inverting the model leads to a few broken use cases (from our domain perspective), because it forces joins across "states" of the primary entity, which would now be spread across many collections. For example, each of these actions is a state change for that entity: E has states (e1, e2, e3, e4, e5), so e1 to e5 is represented as an aggregate array which gets updated by the 10,000 threads/processes that represent the actions of external apps.
We need close to real-time aggregation, as another set of "actors" looks at these "states" of e1 to e5 and then responds appropriately via another channel to the "elements in the array".
What would be the "ideal" design strategy in such a case to speed up the writes?
Will sharding help? Is there a "magnitude" heuristic for this, e.g. at what lock percentage should we shard?
This is a problem because of your schema design.
It is extremely inefficient to $push multiple values to multiple documents, especially from multiple threads. It's not so much that the write lock is the problem; it's that your design made it the problem. In addition, you are continuously growing documents, which means that the updates are not "in place" and your collection is quickly getting fragmented.
It seems like your schema is "upside down". You have 10,000 threads looking to add numbers representing people (I assume a very large number of people) to a small number of documents (1000) which will grow to be huge. It seems to me that if you want to embed something in something else, you might consider collections representing people and then embedding events that those people are found at - at least then you are limiting the size of the array for each person to 1,000 at most, and the updates will be spread across a much larger number of documents, reducing contention significantly.
Another option is simply to record the event/person in attendance and then do aggregation over the raw data later, but without knowing exactly what your requirements for this application are, it's hard to know which way will produce the best results - the way you have picked is definitely one that's unlikely to give you good performance.
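As an illustration of that last suggestion, here is a hedged PyMongo sketch of recording raw attendance events and aggregating them at read time; the collection and field names are invented for the example:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.demo

# each actor performs one small insert instead of growing 1000 documents in place
db.attendance.insert_one({"person": 100, "place": "Auckland", "event": "Music Show"})

# rebuild the People array at read time with the aggregation pipeline
pipeline = [
    {"$group": {"_id": {"place": "$place", "event": "$event"},
                "People": {"$addToSet": "$person"}}},
]
for doc in db.attendance.aggregate(pipeline):
    print(doc)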
