CouchDB - hierarchical comments with ranking. Hacker News style - algorithm

I'm trying to display comments the way Hacker News does, using CouchDB. The comments should be ordered hierarchically, and each level of the tree should also be ordered by a "points" variable.
The idea is that I want a view to return them in the order I expect, rather than making many Ajax calls, for example, to retrieve them and rearrange them so they look correctly ordered.
This is what I got so far:
Each document is a "comment".
Each comment has a property path which is an ordered list containing all its parents.
So for example, imagine I have 4 comments (with _id 1, 2, 3 and 4). Comment 2 is a child of 1, comment 3 is a child of 2, and comment 4 is also a child of 1. This is what the data would look like:
{ _id: 1, path: ["1"] },
{ _id: 2, path: ["1", "2"] },
{ _id: 3, path: ["1", "2", "3"] },
{ _id: 4, path: ["1", "4"] }
This works quite well for the hierarchy. A simple view will already return things ordered the way I want it.
The issue comes when I want to order each "level" of the tree independently. For example, documents 2 and 4 belong to the same branch, but on that level they are ordered by their ID. Instead I want them ordered by a "points" variable that I want to add to the path, but I can't work out where to add it so that it works the way I want.
Is there a way to do this? Consider that the "points" variable will change in time.

Because each level needs to be sorted recursively by score, Couch needs to know the score of each parent to make this work the way you want it to.
Take your example with the following scores (1: 10, 2: 10, 3: 10, 4: 20).
In this case you'd want the ordering to come out like the following:
.1
.1.4
.1.2
.1.2.3
Your document needs a scores array like this:
{ _id: 1, path: [1], scores: [10] },
{ _id: 2, path: [1, 2], scores: [10,10] },
{ _id: 3, path: [1, 2, 3], scores: [10,10,10] },
{ _id: 4, path: [1, 4], scores: [10,20] }
Then you'll use the following sort key in your view.
emit([doc.scores, doc.path], doc)
The path gets used as a tiebreaker because there will be cases where sibling comments have the exact same score. Without the tiebreaker, their descendants could lose their grouping (by chain of ancestry).
Note: This approach will return scores from low to high, whereas you probably want scores high to low and the path/tiebreaker low to high. A workaround is to populate the scores array with the inverse of each score (1/score), so that a higher score produces a smaller key:
{ _id: 1, path: [1], scores: [0.1] },
{ _id: 2, path: [1, 2], scores: [0.1,0.1] },
{ _id: 3, path: [1, 2, 3], scores: [0.1,0.1,0.1] },
{ _id: 4, path: [1, 4], scores: [0.1,0.05] }
and then request the view in its default ascending order. Using descending=true instead would reverse the whole list, which also puts children before their parents.
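The ordering can be checked offline. The following is a minimal Python sketch, not CouchDB code. It assumes each comment has a hypothetical `parent` and `points` field (points must be positive for 1/points to work), and it relies on Python comparing lists element by element with a shorter prefix first, which is also how CouchDB collates arrays of numbers. `interleaved_key` is a variation that is not part of the answer above; it is included only to show what happens when siblings have equal scores.

```python
def ancestry(comment_id, comments):
    """Return [(inverse_score, id), ...] from the root down to the comment."""
    chain = []
    current = comment_id
    while current is not None:
        # Inverse score: more points gives a smaller number, which sorts first.
        chain.append((1 / comments[current]["points"], current))
        current = comments[current]["parent"]
    chain.reverse()  # collected leaf -> root, wanted root -> leaf
    return chain


def answer_key(comment_id, comments):
    """The [scores, path] key from the answer, with inverse scores."""
    chain = ancestry(comment_id, comments)
    return [[score for score, _ in chain], [cid for _, cid in chain]]


def interleaved_key(comment_id, comments):
    """Variation (not from the answer): [score1, id1, score2, id2, ...]."""
    return [value for pair in ancestry(comment_id, comments) for value in pair]


def view_order(comments, key):
    """Ids in the order an ascending query on `key` would list them."""
    return sorted(comments, key=lambda cid: key(cid, comments))


example = {
    1: {"parent": None, "points": 10},
    2: {"parent": 1, "points": 10},
    3: {"parent": 2, "points": 10},
    4: {"parent": 1, "points": 20},
}
# Siblings 2 and 5 tie, and so do their children 3 and 6.
tied = {
    1: {"parent": None, "points": 10},
    2: {"parent": 1, "points": 10},
    3: {"parent": 2, "points": 10},
    5: {"parent": 1, "points": 10},
    6: {"parent": 5, "points": 10},
}
print(view_order(example, answer_key))
print(view_order(tied, answer_key))
print(view_order(tied, interleaved_key))
```

For the example data this lists 1, 4, 2, 3. With the tied data the [scores, path] key lists 1, 2, 5, 3, 6, so comment 5 lands between comment 2 and its child, while the interleaved key lists 1, 2, 3, 5, 6. Either way, a comment's key contains the scores of all its ancestors, so a vote on one comment means rewriting every descendant document.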

The couchdb-dev mailing list thread on this question, which discusses several variants of the solution, may also be of interest:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/201205.mbox/thread (see the thread "Hierarchical comments Hacker News style" from 16/05/2012)

Related

Fetch a particular number of documents satisfying multiple conditions - Elasticsearch

I have an Elasticsearch index holding information about fruits, as below:
GET fruits/fruits_data/_search
[{ id: 1, name: apple },
{ id: 2, name: mango },
{ id: 3, name: apple },
{ id: 4, name: banana },
{ id: 5, name: apple },
{ id: 6, name: mango },
{ id: 7, name: pineapple },
{ id: 8, name: jackfruit }]
Now I need to fetch 7 fruits as per the priority (below):
{"apple": 3, "banana": 3, "mango": 2, "guava": 2, "pineapple": 1, "jackfruit": 1}
Here the key indicates the fruit to be fetched and the value indicates the maximum number of documents to fetch for it.
This means I need to fetch at most 3 apple, 3 banana and 1 mango, and I can ignore the remaining entries in the priority hash once I have the required number of fruits. But here I have only 1 banana in my ES index, so I need to fetch at most 3 apple, 1 banana, 2 mango and 1 pineapple (since guava is not present in the index, we need to ignore it).
Is there a way to fetch fruits like this in ES in a single query? I don't want to use multiple queries.
Thanks
It is not possible to fetch the results directly. Try using aggregations in Elasticsearch. You can refer to the link below:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
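One way to act on that suggestion, sketched in Python under assumptions: a first request (for example a terms aggregation on name) returns the number of documents per fruit, and the quota per fruit is then worked out on the client before the documents themselves are fetched. The sketch covers only the quota step. `plan_quotas` and its parameters are hypothetical names, and the approach needs more than one request, so it does not meet the single-query wish.

```python
def plan_quotas(priority, available, total):
    """Decide how many documents of each kind to fetch.

    priority  -- dict of name -> maximum wanted, in priority order
    available -- dict of name -> number of documents in the index
    total     -- overall number of documents wanted
    """
    quotas = {}
    remaining = total
    for name, wanted in priority.items():
        if remaining == 0:
            break
        # Never take more than wanted, than exist, or than are still needed.
        take = min(wanted, available.get(name, 0), remaining)
        if take > 0:
            quotas[name] = take
            remaining -= take
    return quotas


priority = {"apple": 3, "banana": 3, "mango": 2, "guava": 2,
            "pineapple": 1, "jackfruit": 1}
# Counts taken from the sample index in the question.
available = {"apple": 3, "mango": 2, "banana": 1,
             "pineapple": 1, "jackfruit": 1}
print(plan_quotas(priority, available, total=7))
```

For the sample index this yields 3 apple, 1 banana, 2 mango and 1 pineapple, matching the outcome described in the question.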

Searching Elastic Search for a specific index position of an array field

The records in my ES index are of the form:
person: {
firstName: "ABC",
lastName: "Def",
specialValues: [3, 6, null, 9]
}
I want to retrieve all persons whose person.specialValues[1] has the value 6.
Is it possible to do so using Elasticsearch?

Elasticsearch array scoring

I'm using elasticsearch to search multiple array fields in my type, which looks something like
t1 = { field1: ["foo", "bar"],
field2: ["foo", "foo", "foo", "foo"],
field3: ["foo", "foo", "foo", "foo", "foo", "foo"]
}
And then I'm using a multi_match query to get matches, something along the lines of
multi_match: { query: "foo",
fields: "field*"
}
When computing the score of t1, elasticsearch adds the score of queries in field1, field2 and field3 which is what I want. However, they are not contributing equally, field3 contributes to the score the most since "foo" occurs multiple times there.
I want now to compute the score within each array field by not adding up the score of all array entries, but by just taking the maximum of them. In my example, all fields contained would have the same score then since they all have one exact match.
This question was already asked on the elasticsearch forum, but has not been answered so far.
I've been stumped on this myself, it really seems like there should be a simple, builtin way to just specify max instead of sum.
Not sure if this is exactly what you're going for, because you lose the match score on any particular item in the array. So you're not getting max of the match score of the best particular item, just a boolean value if anything matches. If it's something more nuanced (say a person's full name, where you want a better match for first and last vs just one or the other) this may not be acceptable because you're throwing out your scores.
If it is acceptable, this workaround seems to work:
{function_score: {
query: {bool: {should: [
{term: {field1: 'foo'}},
{term: {field2: 'foo'}},
{term: {field3: 'foo'}},
]}},
functions: [
{filter: {term: {field1: 'foo'}}, weight: 1},
{filter: {term: {field2: 'foo'}}, weight: 1},
{filter: {term: {field3: 'foo'}}, weight: 1},
],
score_mode: 'sum',
boost_mode: 'replace',
}}
We need the "query" part to give us the results to further filter, even though we discard the score. This seems like it should really be a filter, but just wrapping this same thing in the filtered query doesn't work. There may be a better option here.
Then, the weight functions just basically give a 1 if there's a match on that field and 0 otherwise. The score_mode tells it to sum those weights, so in your case they all match so we get 3. The boost_mode tells how to combine with the original query, "replace" tells it to ignore the original query score (which has the problem you mentioned that multiple matches in an array are being summed). So, the total score of this query is 3 because there are 3 matches.
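Since the query clauses and the weight filters have to stay in step, one option is to generate both from a single list of field names. The following Python sketch only builds the request body as a dict. `max_per_field_query` is a hypothetical helper name, and sending the body to Elasticsearch is left to whichever client is in use.

```python
import json


def max_per_field_query(fields, term):
    """Build a function_score body that scores 1 per matching field."""
    clauses = [{"term": {field: term}} for field in fields]
    return {
        "function_score": {
            # The bool/should query only selects candidate documents.
            "query": {"bool": {"should": clauses}},
            # One weight function per field: contributes 1 if the field matches.
            "functions": [{"filter": clause, "weight": 1} for clause in clauses],
            "score_mode": "sum",      # add up the per-field weights
            "boost_mode": "replace",  # discard the original query score
        }
    }


body = {"query": max_per_field_query(["field1", "field2", "field3"], "foo")}
print(json.dumps(body, indent=2))
```

With three fields the body contains three should clauses and three weight functions, so a document matching "foo" in every field would score 3, as in the explanation above.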
It seems more complicated to me, but in my relatively limited testing I haven't noticed performance issues or anything. I'd love to see a better answer if someone more familiar with elasticsearch has one.

Paginating Distinct Values in Mongo

I would like to retrieve distinct values for a field from a collection in my database. The distinct command is the obvious solution. The problem is that some fields have a large number of possible values and are not simple primitive values (i.e. they are complex sub-documents instead of just a string). This means the results are large, which overloads the client I am delivering the results to.
The obvious solution is to paginate the resulting distinct values. But I cannot find a good way to optimally do this. Since distinct doesn't have pagination options (limit, skip, etc) I am turning to the aggregation framework. My basic pipeline is:
[
{$match: {... the documents I am interested in ...}},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$limit: 10},
]
This gives me the first 10 unique values for myfield. To get the next page a $skip operator is added to the pipeline. So:
[
{$match: {... the documents I am interested in ...}},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
But sometimes the field I am collecting unique values from is an array. This means I have to unwind it before grouping. So:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$myfield'},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
Other times the field I am getting unique values for may not be an array, but its parent node might be. So:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$list'},
{$group: {_id: '$list.myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
Finally sometimes I also need to filter on data inside the field I am getting distinct values. This means I sometimes need another match operator after the unwind:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$list'},
{$match: {... filter within list.myfield ...}},
{$group: {_id: '$list.myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
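The variants above differ only in which optional stages are present, so they can be assembled by one helper. This is a Python sketch that builds the pipeline as a plain list of dicts. `distinct_page_pipeline` and its parameter names are hypothetical, and the resulting list would be handed to whatever driver runs the aggregation.

```python
def distinct_page_pipeline(match, field, unwind=None, inner_match=None,
                           page=0, page_size=10):
    """Build an aggregation pipeline returning one page of distinct values.

    match       -- $match document selecting the documents of interest
    field       -- dotted path of the field to collect distinct values from
    unwind      -- dotted path of the array to $unwind first, if any
    inner_match -- optional $match applied after the $unwind
    """
    pipeline = [{"$match": match}]
    if unwind is not None:
        pipeline.append({"$unwind": "$" + unwind})
        if inner_match is not None:
            pipeline.append({"$match": inner_match})
    pipeline.append({"$group": {"_id": "$" + field}})
    pipeline.append({"$sort": {"_id": 1}})
    if page > 0:
        # Pages are zero-based, so page 1 skips the first page_size values.
        pipeline.append({"$skip": page * page_size})
    pipeline.append({"$limit": page_size})
    return pipeline


# Second page of distinct list.myfield values.
stages = distinct_page_pipeline({"source_name": "fmls"}, "list.myfield",
                                unwind="list", page=1)
print(stages)
```

The helper does not change the cost of the query: every page still groups and sorts all matching values before skipping, which is consistent with later pages being no faster than the first.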
The above are all simplifications of my actual data. Here is a real pipeline from the application:
[
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id": 1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$unwind":"$data.surroundings.school.nearby"},
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id":1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$group":{"_id":"$data.surroundings.school.nearby"}},
{"$sort":{"_id":1}},
{"$skip":10},
{"$limit":10}
]
I hand the same $match document to both the initial filter and the filter after the unwind because the $match document is somewhat opaque from a 3rd party so I don't really know which parts of the query are filtering outside my unwind vs inside the data I am getting distinct values for.
Is there an obviously different way to go about this? In general, my strategy is working, but there are some queries where it takes 10-15 seconds to return the results. There are about 200,000 documents in the collection, although only about 60,000 remain after the first $match in the pipeline is applied (which can use indexes).

Uniformly distributing results in elastic search based on an attribute

I am using tire to perform searches on sets of objects that have a category attribute (there are 6 different categories).
I want the results to come in pages of 6 with one of each category on a page (while it is possible).
Eg1. So if the first, second and third categories had 2 objects each and the fourth, fifth and sixth categories had 1 object each, the pages would look like:
Data: [1,1,2,2,3,3,4,5,6]
1: 1,2,3,4,5,6
2: 1,2,3
Eg2. [1,1,1,1,1,2,2,3,4,5]
1: 1,2,3,4,5,1
2: 2,1,1,1
In something like ruby it wouldn't be too difficult to sort based on the number of times a number has appeared.
Something like
times_seen = {}
results.sort_by do |r|
times_seen[r.category] ||= 0
[times_seen[r.category] += 1, r.category]
end
E.g.
irb(main):032:0> times_seen = {};[1,1,1,1,1,2,2,3,4,5].sort_by{|i| times_seen[i] ||= 1; [times_seen[i] += 1, i];}
=> [1, 2, 3, 4, 5, 1, 2, 1, 1, 1]
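For comparison, here is the same ordering combined with the split into pages of 6, as a Python sketch. It uses the same (occurrence number, category) key as the Ruby snippet. The function names are hypothetical, and like the Ruby version it needs all results in memory first.

```python
from collections import defaultdict


def round_robin(items, category=lambda item: item):
    """Order items so each category appears once before any repeats."""
    times_seen = defaultdict(int)
    keyed = []
    for item in items:
        cat = category(item)
        times_seen[cat] += 1
        # Same key as the Ruby version: (occurrence number, category).
        keyed.append((times_seen[cat], cat, item))
    keyed.sort(key=lambda entry: entry[:2])
    return [item for _, _, item in keyed]


def pages(items, size=6):
    """Split an ordered list into consecutive pages."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(pages(round_robin([1, 1, 2, 2, 3, 3, 4, 5, 6])))
print(pages(round_robin([1, 1, 1, 1, 1, 2, 2, 3, 4, 5])))
```

The two calls reproduce the page layouts of Eg1 and Eg2 above.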
To do this with a large number of results would be really slow because we would need to pull them all into ruby first and then sort.
Ideally we want to do this in elastic search and let it handle the pagination for us.
There is Script based sorting in elastic search:
http://www.elasticsearch.org/guide/reference/api/search/sort/
{
"query" : {
....
},
"sort" : {
"_script" : {
"script" : "doc['field_name'].value * factor",
"type" : "number",
"params" : {
"factor" : 1.1
},
"order" : "asc"
}
}
}
So if we could do something like this but have the times_seen logic from above in it, it would make life really easy, but it would require having a times_seen variable that persists between scripts.
Any ideas on how to achieve a uniform distribution based on an attribute or if it is possible to somehow use a variable in the script sort?
