Get a user's rank from a CouchDB view

I'm trying to get a user's rank from a CouchDB database. The issue is that I have multiple users and multiple games. I want to be able to pass 2 keys:
The app id
The user's score
I would then like to see how many records have the same app id and a lower score than the one I passed. This would return the user's current rank. This is my document structure:
{
"_id": "c68d16e1d8ba65accf97230dfbf7c2cb",
"_rev": "114-2aea3eef75c73e1079ed9c8d945723e1",
"credits": 2125,
"appName": "someApp"
}
I've tried setting up views, but the multiple keys are really confusing me. This is what I've tried, but it hasn't worked:
VIEW
"getrank": {
"map": "function(doc) { emit([doc.appName, doc.credits],{credits:doc.credits}) }"
}
URL CALLS I'VE TRIED
/players/_design/views/_view/getrank?key=["someApp","2000"]&startkey=["credits",2000]
/players/_design/views/_view/getrank?key=someApp"&startkey=["credits",2000]

I would like to then see how many records have the same app id and a lower score than the one I passed.
If I understand your question correctly, your view looks good. Instead of emitting an object you could emit just doc.credits, or simply null, and query the view with &include_docs=true.
Anyway, what you need to do is query over a range. startkey and endkey should work:
_view/getrank?startkey=["someApp",minima]&endkey=["someApp",maxima]
This query gives you the records for someApp between minima and maxima. Now we need to build on this.
lower score than the one I passed.
First we need to query in descending order. The only interesting thing here is that the order of the keys is reversed:
_view/getrank?startkey=["someApp",maxima]&endkey=["someApp",minima]&descending=true
Now suppose you want everything lower than 9000. A query like this should do the trick:
_view/getrank?startkey=["someApp",9000]&endkey=["someApp"]&descending=true
This gives you all the scores for someApp from 9000 downward (a score of exactly 9000 is included, since startkey is inclusive). ["someApp"] on its own sorts before every ["someApp", score] key, so it marks the bottom of the range.
I have not actually run these queries, but this should give you something to work with.
If you need all the records over a range you need range queries.
Range queries are done with startkey and endkey.
They are reversed when descending=true.
Hope this helps.
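As a rough illustration of the range query described above, here is a minimal sketch that assembles the view path on the client. The helper name buildLowerScoresPath is hypothetical, and the database, design document, and view names are taken from the question's URLs. It reads the range in ascending order so that inclusive_end=false can drop the passed score itself.

```javascript
// Hypothetical helper: builds the view path for "scores strictly below `score`".
// Assumes the database, design document and view names from the question.
function buildLowerScoresPath(appName, score) {
  const params = [
    // [appName] alone sorts before every [appName, <number>] key
    "startkey=" + encodeURIComponent(JSON.stringify([appName])),
    "endkey=" + encodeURIComponent(JSON.stringify([appName, score])),
    // exclude rows whose key equals the endkey, i.e. the passed score itself
    "inclusive_end=false",
  ];
  return "/players/_design/views/_view/getrank?" + params.join("&");
}

console.log(buildLowerScoresPath("someApp", 9000));
```

The number of rows returned is the number of lower scores. Adding a reduce such as CouchDB's built-in _count would probably return that number directly, though that goes beyond what the answer describes. Note that the score has to be passed as a number: a string such as "2000" collates differently from the numeric credits stored in the documents.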

Related

Solr boost query sort by whether result is boosted then by another field

I'm using Solr to run a query on one of our cores. Suppose my documents have two fields: ID, and Name. I also have a separate list of IDs I'm grabbing from a database and passing into the query to boost certain results.
If the document gets returned in the query and the ID is in the list it goes to the top of the results, and if it gets returned in the query and the ID is not in the list then it goes below those that are in the list. The former is from the "boost". My query is something like this -
http://mysolrserver:8983/solr/MyCore/MyQueryHandler?q=Smith&start=0&rows=25&bq=Id%3a(36+OR+76+OR+90+OR+224+OR+391)
I am able to get the boost query working, but I need the boosted results to be in alphabetical order by name, then the non-boosted results under that, also in alphabetical order by name. I need to know what to use for the &sort= parameter.
&sort=score%20desc,Name+asc does not work.
I've looked over a lot of documentation, but I still don't know if this is even possible. Any help is appreciated. Thanks!
Solr version is 6.0.1. I am actually using SolrNet to interface with Solr, but I think I can figure out the SolrNet part if I know what the url's &sort= parameter value needs to be.
I figured it out by doing away with the boost query. I added a sort using the "exists" function and passed it a sub-query for the ID. exists returns a boolean value to sort on, and I added the name as a second sort. It works perfectly!
The URL looks like this:
http://mysolrserver:8983/solr/MyCore/MyQueryHandler?q=Smith&start=0&rows=25&sort=exists(query({!v=%27Id:(36+OR+76+OR+90+OR+224+OR+391)%27}))%20DESC,%20Name%20ASC
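The sort value in that URL can be assembled from the ID list along these lines. This is only a sketch: the helper name buildBoostedFirstSort is hypothetical, and the field names Id and Name come from the question.

```javascript
// Hypothetical helper: builds the sort value used in the URL above.
function buildBoostedFirstSort(ids, idField, nameField) {
  const idQuery = idField + ":(" + ids.join(" OR ") + ")";
  // exists(query(...)) is true for documents matching the sub-query,
  // so sorting it DESC puts those documents first
  return "exists(query({!v='" + idQuery + "'})) DESC, " + nameField + " ASC";
}

const sortValue = buildBoostedFirstSort([36, 76, 90, 224, 391], "Id", "Name");
// URL-encode the value before appending it to the request
console.log("&sort=" + encodeURIComponent(sortValue));
```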
The closest match to your requirement is the query elevation component [1].
In your particular case I would first sort my Ids according to my requirements (sorting them by name, for example), then maintain them in the elevate.xml.
At query time you can use the "forceElevation" parameter to force the elevation and then sort the remaining results by name.
[1] https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string, to match parent entries. Parent-child relations are just described through fields containing each entry's ancestors ids.
The question is the following: If both search and parent search string are provided, is there a way to know before executing the queries, which query should be executed first, in order to provide results faster and to be more performant?
For example, it could be that a specific parent search results in only 2 docs/parent entries, and then we can fetch all children matching the search string. In that case we should execute the parent query first and then the entry query.
One option would be to get the count of both queries and then execute first the one with the smallest count, but isn't this solution worse, since the queries are going to be executed twice? Once for the count and once for the actual query.
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and child entries have the exact same structure and reside in the same index. We cannot have different indices since the parent-child relation can be nested 10 levels deep, so an entry can be both a parent and a child. An entry looks more or less like:
{
id: "e32452365321",
name: "name",
type: "type",
ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you to upgrade your Elasticsearch version, if possible. A lot has happened since 1.7 and, to be honest, I can't tell if everything written in the following article is valid for such an old version (probably it isn't).
But to your actual question: hopefully I am understanding you correctly, but are you trying to estimate how costly a query is for Elasticsearch? Well, you don't have to. If you provide all 'queries' in one nested query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating score does take time. So if sorting is not based on the elasticsearch _score, you want to use boolean filter queries. This would also apply, if you want to sort only by _score of parent matches, then you could put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self-referential parent-child relations are unfortunately not supported by Elasticsearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-side joins.
So yes, in general, you want to send the second query with the least possible amount of ids/terms. While getting counts for both queries is not as bad as you might think, because the results are most likely still cached, does it actually help? Because if you're going from child to parent, you would have to count the ancestors (field values), and not the actual document count.
I would argue, that the most expensive operation is very often fetching result source from disk. So whichever way you go, you probably should only fetch what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or, fetch only the ancestors field of child matches, and use an id filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience in comparing speed of those approaches. My guess would be, that an id filter might be faster in general. But that's just a guess...
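To make the first option concrete, here is a minimal sketch of the two request bodies for an application-side join. It assumes a 1.x-era Elasticsearch (the filtered query was removed in later versions), that the search strings are passed through query_string, and that ancestors is indexed so that each id is a separate term. The function names are hypothetical.

```javascript
// Step 1 (hypothetical): fetch only the ids of entries matching the parent search.
function buildParentIdQuery(parentSearchString) {
  return {
    _source: false, // skip loading document bodies; each hit still carries its _id
    query: { query_string: { query: parentSearchString } },
  };
}

// Step 2 (hypothetical): entries matching the search string whose
// ancestors field contains any of the parent ids collected in step 1.
function buildChildQuery(searchString, parentIds) {
  return {
    query: {
      filtered: {
        query: { query_string: { query: searchString } },
        filter: { terms: { ancestors: parentIds } },
      },
    },
  };
}

const parentQuery = buildParentIdQuery("name:parentTest AND NOT type:BLOCK");
const childQuery = buildChildQuery("type:BLOCK AND name:test", ["id1", "id2"]);
console.log(JSON.stringify(parentQuery), JSON.stringify(childQuery));
```

The first request would also need a size (or a scan/scroll) large enough to return every matching parent id, since only the default page of hits comes back otherwise.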

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of webpages with Elasticsearch.
We collect data in the following simplified structure:
{"id":"1", "timestamp":"2017-01-25:15:23", "sessionid":"s1", "page":"index"}
{"id":"2", "timestamp":"2017-01-25:15:24", "sessionid":"s1", "page":"checkout"}
{"id":"3", "timestamp":"2017-01-25:15:25", "sessionid":"s1", "page":"confirm"}
{"id":"4", "timestamp":"2017-01-25:15:26", "sessionid":"s2", "page":"index"}
{"id":"5", "timestamp":"2017-01-25:15:27", "sessionid":"s2", "page":"checkout"}
{"id":"6", "timestamp":"2017-01-25:15:26", "sessionid":"s3", "page":"product_a"}
{"id":"7", "timestamp":"2017-01-25:15:28", "sessionid":"s3", "page":"checkout"}
For this sample the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page
More formally, I'm looking for a generic approach to implementing the following algorithm in an Elasticsearch query:
group documents by a field
sort each group (bucket) by a second field and reduce to the topmost document
group all these remaining documents by a third field
sort groups by number of documents
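For reference, the four steps above can be written out in plain JavaScript over the sample documents. This is not an Elasticsearch query, only a sketch of the intended result computed on the client; the function name exitPageCounts is hypothetical, and comparing timestamps as strings relies on the fixed-width format used in the sample.

```javascript
// Sample documents from the question.
const sampleDocs = [
  { id: "1", timestamp: "2017-01-25:15:23", sessionid: "s1", page: "index" },
  { id: "2", timestamp: "2017-01-25:15:24", sessionid: "s1", page: "checkout" },
  { id: "3", timestamp: "2017-01-25:15:25", sessionid: "s1", page: "confirm" },
  { id: "4", timestamp: "2017-01-25:15:26", sessionid: "s2", page: "index" },
  { id: "5", timestamp: "2017-01-25:15:27", sessionid: "s2", page: "checkout" },
  { id: "6", timestamp: "2017-01-25:15:26", sessionid: "s3", page: "product_a" },
  { id: "7", timestamp: "2017-01-25:15:28", sessionid: "s3", page: "checkout" },
];

function exitPageCounts(docs) {
  // Steps 1 and 2: group by session and keep only the latest document of each group
  const lastBySession = new Map();
  for (const doc of docs) {
    const seen = lastBySession.get(doc.sessionid);
    if (!seen || doc.timestamp > seen.timestamp) {
      lastBySession.set(doc.sessionid, doc);
    }
  }
  // Step 3: group the remaining documents by page
  const counts = new Map();
  for (const doc of lastBySession.values()) {
    counts.set(doc.page, (counts.get(doc.page) || 0) + 1);
  }
  // Step 4: sort the groups by number of documents, largest first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(exitPageCounts(sampleDocs)); // [ [ 'checkout', 2 ], [ 'confirm', 1 ] ]
```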
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation and finally use a terms_pipeline aggregation to group the pages.
(simplified aggregation structure)
aggs
  terms
    field: sessionid
    aggs
      top_hits
        sort: timestamp desc
        size: 1
  terms_pipeline
    bucket_path: terms>top_hits
    field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, since all pages are in a sequence, you could simply have a terms aggregation on the page field (to know which pages were visited) and a cardinality one on the sessionid field (to know how many unique sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
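A minimal sketch of that idea follows: a request body with the two aggregations, plus a small helper that turns the page buckets into drop-off numbers. The aggregation names pages and sessions, the helper dropOffs, and the funnel order passed to it are assumptions for illustration; the bucket counts are what the sample data in the question would produce.

```javascript
// Request body sketch: page hit counts plus the number of distinct sessions.
const bounceRequest = {
  size: 0, // aggregations only, no search hits
  aggs: {
    pages: { terms: { field: "page" } },
    sessions: { cardinality: { field: "sessionid" } },
  },
};

// Hypothetical helper: given the terms buckets and an assumed funnel order,
// a page's drop-off is its hit count minus the next page's hit count.
function dropOffs(buckets, funnel) {
  const hitsByPage = new Map(buckets.map((b) => [b.key, b.doc_count]));
  return funnel.map((page, i) => {
    const here = hitsByPage.get(page) || 0;
    const next = i + 1 < funnel.length ? hitsByPage.get(funnel[i + 1]) || 0 : 0;
    return { page, lost: here - next };
  });
}

// Bucket counts as they would look for the sample data in the question.
const sampleBuckets = [
  { key: "checkout", doc_count: 3 },
  { key: "index", doc_count: 2 },
  { key: "confirm", doc_count: 1 },
  { key: "product_a", doc_count: 1 },
];
console.log(dropOffs(sampleBuckets, ["checkout", "confirm"]));
```

This only works when the order of pages in the funnel is known in advance, which is the assumption the answer makes.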

Sorting by a non-key (arbitrary) field in CouchDB

I have a fairly large CouchDB database (approximately 3 million documents). I have various view functions returning slices of the data that can't be modified (or at least, should only be modified as a last resort).
I need the ability to sort on an arbitrary field for reporting purposes. For smaller DBs, I return the entire object, json_parse it in our PHP backend, then sort there. However, we're often getting Out Of Memory errors when doing this on our largest DBs.
After some research, I'm leaning towards accessing a sort key (via URL parameter) in a list function and doing the sort there. This is an idea I've stolen from here. Excerpt:
function(head, req) {
  var row;
  var rows = [];
  // Collect every row of the view in memory
  while (row = getRow()) {
    rows.push(row);
  }
  // Sort by value, largest first
  rows.sort(function(a, b) {
    return b.value - a.value;
  });
  send(JSON.stringify({"rows": rows}));
}
It seems to be working for smaller DBs, but it still needs a lot of work to be production ready.
Is this:
a) a good solution?
b) going to work with 3, 5, or 10 million rows?
You can't avoid loading everything into memory by using a list function. So with enough data, eventually, you'll get an out of memory error, just as you're getting with PHP.
If you can live within the memory constraints, it's a reasonable solution, with some advantages.
Otherwise, investigate using something like Lucene, Elasticsearch, or Cloudant Search (clouseau & dreyfus).
In our environment we have more than 5 million records. Our CouchDB is designed so that every document has specific fields that distinguish it from other categories of documents.
For example, there are a number of documents with the field DocumentType set to "User" or "XXX".
The DocumentType field allows us to sort documents by category.
So if you have 3 million docs and around 10 categories, each category will have about 300k docs.
Now you can design the system so that you always pass CouchDB the DocId you need. That way it will be faster.
So the view can look like this:
function(doc) {
  if (doc.DocumentType === 'XXX' && doc._id) {
    emit(doc.FieldYouWant, doc._id);
  }
}
This is how our backend is designed in production.
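One possible variation on this view, sketched under assumptions: instead of one view per category, the map function emits the category and the sort field as a composite key, so CouchDB keeps the rows of each category sorted in the index. The emit stand-in, the function name, and the sample documents are made up so the sketch can run outside CouchDB; DocumentType and FieldYouWant are the placeholder names from the answer.

```javascript
// Local stand-in for the emit() that CouchDB provides to map functions.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Map function sketch: category first, then the field to sort on.
function mapByTypeAndField(doc) {
  if (doc.DocumentType && doc.FieldYouWant !== undefined) {
    emit([doc.DocumentType, doc.FieldYouWant], null);
  }
}

// Made-up sample documents, only to show what ends up in the index.
const exampleDocs = [
  { _id: "a", DocumentType: "XXX", FieldYouWant: 30 },
  { _id: "b", DocumentType: "User", FieldYouWant: 10 },
  { _id: "c", DocumentType: "XXX", FieldYouWant: 20 },
  { _id: "d" }, // skipped: no DocumentType
];
exampleDocs.forEach((doc) => mapByTypeAndField(doc));

console.log(emitted.map((row) => JSON.stringify(row.key)));
```

Because view rows come back ordered by key, a query such as startkey=["XXX"]&endkey=["XXX",{}] combined with limit should return one category already sorted by the field, a page at a time, without loading everything into memory.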

Elastic Search limit results

In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
"query":{
"fuzzy_like_this_field":{
"body":{
"like_text":"REALLY LONG (snip) TEXT HERE",
"max_query_terms":1,
"min_similarity":0.95,
"ignore_tf":true
}
}
}
}
When I run this search, it takes a few seconds, whereas mysql can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only returns 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is being returned after the query is done processing. This means the "size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct: the query is being run in its entirety. Queries by default return data sorted by score, so your query is going to score each document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you behavior similar to what you're looking for.
"A limit filter limits the number of documents (per shard) to execute on."
To replicate MySQL's field = 'foo', try using a term filter. You should use filters when you don't care about scoring; they are faster and cacheable.
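As a rough sketch of how the two suggestions could be combined, the request body below wraps a term filter and a limit filter in a filtered query. It assumes a 1.x-era Elasticsearch, where the filtered query and the and and limit filters are available (they were deprecated or removed in later versions). The helper name buildLimitedTermSearch is hypothetical.

```javascript
// Hypothetical helper: exact-match lookup that stops collecting early on each shard.
function buildLimitedTermSearch(field, value, limit) {
  return {
    size: limit, // number of hits returned to the client
    query: {
      filtered: {
        filter: {
          and: [
            { term: { [field]: value } }, // exact match, no scoring
            { limit: { value: limit } }, // documents to execute on, per shard
          ],
        },
      },
    },
  };
}

const limitedSearch = buildLimitedTermSearch("field", "foo", 5);
console.log(JSON.stringify(limitedSearch, null, 2));
```

This mirrors SELECT id FROM table WHERE field = 'foo' LIMIT 5 only loosely: the limit applies per shard, so the number of documents examined depends on the shard count, and a term filter matches the indexed term exactly rather than doing a fuzzy comparison.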
