I've been struggling to find a data structure for nested comments (only one level of nesting, as on Facebook).
To build a non-nested "feed" of comments I've been using sorted sets, with the score as a timestamp and the member as a JSON-encoded set of attributes containing everything needed to render the comment.
So adding a comment might look like this:
zadd 'users:1:comments', 123456789, {body : 'hello'}
And retrieving it is as simple as this:
zrevrange 'users:1:comments', 0, 20
To support nested comments I've tried to expand on this scheme.
I've brainstormed two different ways, but each has a problem:
1)
Add a comment_id to the list of attributes, where comment_id points to the parent comment:
zadd 'users:1:comments', 123456789, {id : 1, body : 'hello'}
zadd 'users:1:comments', 123456789, {id : 2, body : 'nested hello', comment_id : 1}
Would look like this:
- hello
  - nested hello
The problem with this approach is pagination. If, say, a comment has 20 nested comments and I'm only showing the first 10 comments, then the nested tree will be cut off (only the parent comment plus 9 of its nested comments will be retrieved).
2)
Put the nested comments into feeds of their own.
A parent comment:
zadd 'users:1:comments', 123456789, {id: 1, body : 'hello'}
A nested comment, keyed by its parent's id:
zadd 'comments:1:comments', 123456789, {id: 2, body : 'nested hello'}
However, this would result in N+1 redis queries when trying to show a user's feed:
zrevrange 'users:1:comments', 0, 20
zrevrange 'comments:1:comments', 0, 20
zrevrange 'comments:2:comments', 0, 20
etc...
...Not to mention the nested comments probably shouldn't be selected with a range.
Ideally, I would like this to work with a single redis query, but I'm not sure how to structure my data to make that possible.
Ideas?
The only approach I can come up with that results in a single Redis query is to use Lists.
When adding a parent comment you simply LPUSH it onto the top (left) of the list. When adding a child comment you use something like LINSERT 'users:1:comments' AFTER parent-comment-data child-comment-data.
This causes Redis to search for the parent comment data and place the child data immediately after it. This is an O(N) operation, scanning from the top (left) to the bottom (right), so the further down the list the parent sits, the longer the operation takes. For extremely long lists this could prove problematic, but it should be fine if you keep your list/thread sizes in the four- or five-digit range.
Then a simple LRANGE can fetch the latest comments, both parents and children, limited to any count.
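A minimal sketch of that flow in Python with the redis-py client (the key name, JSON encoding, and helper names are my own illustrative assumptions, not part of the answer above):

import json
import redis

r = redis.Redis()

def add_parent(user_id, comment):
    # newest parents go to the head (left) of the list
    r.lpush(f'users:{user_id}:comments', json.dumps(comment, sort_keys=True))

def add_child(user_id, parent, child):
    # O(N) scan from the head; the child lands immediately after its parent.
    # LINSERT matches by exact value, so the parent must re-encode to the
    # same bytes it was stored with (hence sort_keys for stable output).
    r.linsert(f'users:{user_id}:comments', 'AFTER',
              json.dumps(parent, sort_keys=True),
              json.dumps(child, sort_keys=True))

def latest(user_id, count=20):
    # one query returns parents and children already interleaved in order
    return [json.loads(c) for c in r.lrange(f'users:{user_id}:comments', 0, count - 1)]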
You could do something similar with the score values in sorted sets, giving each child a score just below its parent's. This could significantly complicate inserts, though: you might run out of available scores between two parent comments, which would force you to reassign scores for many (or even most) of the comments. If that happens on every insert, your inserts could become needlessly costly.
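If you did go the sorted-set route, a hedged sketch of one possible scoring scheme makes the problem concrete (the gap size and helper names are arbitrary assumptions; parents get sequence-based scores instead of raw timestamps to leave room for children):

PARENT_GAP = 1000  # score gap reserved below each parent for its children

def add_parent_zset(user_id, comment, parent_seq):
    # parents sit at multiples of PARENT_GAP
    r.zadd(f'users:{user_id}:comments', {json.dumps(comment): parent_seq * PARENT_GAP})

def add_child_zset(user_id, parent_score, child, child_index):
    # children descend one step at a time below their parent; after
    # PARENT_GAP - 1 children the gap is exhausted and every comment
    # below this point would need rescoring
    r.zadd(f'users:{user_id}:comments', {json.dumps(child): parent_score - (child_index + 1)})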
Related
I have ~10-15 categories (cat1, cat2, etc.) that are fixed enums; they change maybe once every couple of weeks, so we can treat them as constant.
For example, the cat1 enum could have values like this:
cat1: [c1a,c1b,c1c,c1d,c1e]
I have objects (around 10 000 of them) like these:
id: 1, cat1: [c1a, c1b, c1c, c1d], cat2: [c2a, c2d, c2z], cat3: [c3d] ...
id: 2, cat1: [c1b, c1d], cat2: [c2a, c2b], cat3: [c3a, c3b, c3c] ...
id: 3, cat1: [c1b, c1d, c1e], cat2: [c2a], cat3: [c3a, c3d] ...
...
id: n, cat1: [c1a, c1c, c1d], cat2: [c2e], cat3: [c3a, c3b, c3c, c3d] ...
Now I have incoming requests looking like this, with one value for every category:
cat1: c1b, cat2: c2a, cat3: c3d ...
I need to get all ids of objects that match the request, i.e. all objects whose categories include every cat value from that request. Requests and objects always have the same number of categories.
To get a better understanding of the problem, a naive way of solving it in SQL would be something like:
SELECT id FROM objects WHERE 'c1b' IN cat1 AND 'c2a' IN cat2 AND 'c3d' IN cat3 ...
Result for our example request and example objects would be: id: [1,3]
I've tried using sets for this: I had a set for every category/value pair, for example cat1-c1a, cat1-c1b, cat2-c2a, etc., with the ids of matching objects as members of that set. On each request I would intersect the sets matching the values from the request, but at five digits of requests/s this doesn't scale well.
Maybe I could trade more space for time, or trade almost all the space for time and precompute a hash table of all the possibilities to get O(1), but the amount of space needed would be really high.
I'm looking for any other viable solutions to this problem. Objects do not change often and new ones are not added very often either, so we are read-heavy. Has anyone solved a similar problem? Are there databases/key-value stores that would handle this use case well? Any white papers?
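For reference, the set-per-(category, value) scheme described above might look like this with redis-py (the key naming and the objects structure are assumptions for illustration):

import redis

r = redis.Redis()

# index build: one set per category-value pair, holding matching object ids
for obj in objects:  # e.g. {'id': 1, 'cat1': ['c1a', 'c1b'], 'cat2': ['c2a'], ...}
    for cat, values in obj.items():
        if cat == 'id':
            continue
        for v in values:
            r.sadd(f'{cat}-{v}', obj['id'])

# per request: one SINTER across the sets named by the requested values
ids = r.sinter(['cat1-c1b', 'cat2-c2a', 'cat3-c3d'])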
I store your ids in a Python list ids: ids[id_num] is a list of categories, and ids[id_num][cat_num] is a set of enum values. I use integers rather than your letters within the enums, but all that matters is that they are distinct.
From that list of ids you can generate a reverse mapping: given a (cat_num, enum_num) pair, it yields the set of all id_nums whose cat_num'th category contains that enum_num.
#%% create reverse map from (cat, val) pairs to sets of possible id's
cat_entry_2_ids = dict()
for id_num, this_ids_cats in enumerate(ids):
    for cat_num, cat_vals in enumerate(this_ids_cats):
        for val in cat_vals:
            cat_num_val = (cat_num, val)
            cat_entry_2_ids.setdefault(cat_num_val, set()).add(id_num)
The above mapping could be saved and reloaded until the enums/ids change.
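A minimal sketch of that save/reload step, using Python's pickle module (the file name is an arbitrary choice):

import pickle

# persist the prebuilt mapping once
with open('cat_entry_2_ids.pkl', 'wb') as f:
    pickle.dump(cat_entry_2_ids, f)

# on startup, reload instead of rebuilding (until enums/ids change)
with open('cat_entry_2_ids.pkl', 'rb') as f:
    cat_entry_2_ids = pickle.load(f)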
Given a particular request, here shown as a list of the enum value contained in each numbered category, the mapping is used to return all ids that have the requested enum in each category.
def get_id(request):
    # start from the candidate ids for the first category's value
    # (.get avoids a KeyError if the value was never indexed)
    idset = cat_entry_2_ids.get((0, request[0]), set()).copy()
    for cat_num_req in enumerate(request):  # yields (cat_num, requested_val) keys
        idset.intersection_update(cat_entry_2_ids.get(cat_num_req, set()))
        if not idset:
            break  # no object can match; stop early
    return sorted(idset)
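A usage sketch, assuming the integer enum encoding above (the particular values are made up; request[cat_num] holds the requested value for that category):

request = [1, 0, 3]     # one requested enum value per category
print(get_id(request))  # sorted id_nums of all objects matching every category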
Timings depend on 10 to 15 dict lookups and set intersections. In Python I get a speed of around 2_500 lookups per second. Maybe a change of language and/or parallel lookups in the mapping (one thread for each of your 10-15 categories) might get you over that 10_000 lookups/second barrier?
I've been doing research on this and found a plethora of articles related to Text functions, but they don't seem to work for me.
To be clear, this formula works; I'm just looking to make it more efficient. My formula looks like:
if [organization_id] = 1 or [organization_id] = 2 or [organization_id] = 3 then "North" else if …
where organization_id is of type Whole Number.
I'd like to simplify this by doing something like:
if [organization_id] in {1, 2, 3} then "North" else if …
I've tried wrapping in parentheses, braces, and brackets; nothing seems to work. Most articles use some form of Text.Replace, but mine is just a custom column.
Does M code within Power Query have any shorthand like this, or do I have to write out each individual comparison like the first line?
I've had success with a List.Contains formulation:
List.Contains({1,2,3}, [organization_id])
The above checks if [organization_id] is in the list supplied in the first argument.
In some cases, you may not want to hardcode a list as shown above but reference a table column instead. For example,
List.Contains(TableWithDesiredIds[id_column], [organization_id])
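In a custom column this drops straight into the original conditional, e.g. if List.Contains({1, 2, 3}, [organization_id]) then "North" else if …, and so on for each region.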
I am in the process of implementing a proof-of-concept stream processing system using Apache Flink 1.6.0 and am storing a list of received events, partitioned by key, in a ListState. (Don't worry about why I am doing this, just work with me here.) I have a StateTtlConfig set on the corresponding ListStateDescriptor. Per the documentation:
"All state collection types support per-entry TTLs. This means that list elements and map entries expire independently."
"Currently, expired values are only removed when they are read out explicitly, e.g. by calling ValueState.value()."
Question 1
Which of the following constitutes a read of the ListState:
1) Requesting the iterator but not using it: myListState.get();
2) Actually using the iterator: for (MyItem i : myListState.get()) { ... }
Question 2
What does "per-entry TTL" actually mean? Specifically, what I'm asking about is the following:
Assume I have a specific instance of ListState<Character> whose descriptor has a TTL of 10 seconds. I insert 'a'. Two seconds later, I insert 'b'. Nine seconds after that, I insert 'c'. If I iterate over this ListState, which items will be returned?
In other words:
ListState<Character> ls = getRuntimeContext().getListState(myDescriptor);
ls.add('a');
// ...two seconds later...
ls.add('b');
// ...nine seconds later...
ls.add('c');
// Does this iterate over 'a', 'b', 'c'
// or just 'b' and 'c'?
for (Character myChar : ls.get()) { ... }
Answer 1
The answer is 1: for ListState, the pruning is done when myListState.get() is called; the iterator does not need to be consumed.
Answer 2
"Per-entry TTL" means the timeout applies to each individual entry rather than to the whole collection. In your example, assuming 10 seconds have passed since 'a' was inserted by the time of the read, the iteration will cover 'b' and 'c'; 'a' will have been pruned.
Q: How do I prevent JSONata from "auto-flattening" arrays in an array constructor?
Given JSON data:
{
  "w": true,
  "x": ["a", "b"],
  "y": [1, 2, 3],
  "z": 9
}
the JSONata query seems to select 4 values:
[$.w, $.x, $.y, $.z]
The nested arrays at $.x and $.y are getting flattened/inlined into my outer wrapper, resulting in more than 4 values:
[ true, "a", "b", 1, 2, 3, 9 ]
The results I would like to achieve are
[ true, ["a", "b"], [1, 2, 3], 9 ]
I can achieve this by using
[$.w, [$.x], [$.y], $.z]
But this requires me to know a priori that $.x and $.y are arrays.
I would like to select 4 values and have the resulting array contain exactly 4 values, independent of the types of values that are selected.
There are clearly some things about the interactions between JSONata sequences and arrays that I can't get my head around.
In common with XPath/XQuery sequences, JSONata will flatten the results of a path expression into the output array. It is possible to avoid this in your example by using the $each higher-order function to iterate over the object's key/value pairs. The following expression will get what you want without any flattening of results:
$each($, function($v) {
    $v
})
This just returns the value for each property in the object.
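Applied to the sample document above, this yields [true, ["a", "b"], [1, 2, 3], 9]: one value per key, with the nested arrays intact.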
UPDATE: Extending this answer for your updated question:
I think this is related to a previous GitHub question on how to combine several independent queries into a single query. It uses an object to hold all the queries, in a similar manner to the one you arrived at. Perhaps a slightly clearer expression would be this:
{
  "1": t,
  "2": u.i,
  "3": u.j,
  "4": u.k,
  "5": u.l,
  "6": v
} ~> $each(λ($v){$v})
The λ is just a shorthand for function, if you can find it on your keyboard (F12 in the JSONata Exerciser).
I am struggling to rephrase my question in such a way as to describe the difficulties I am having with JSONata's sequence-like treatment of arrays.
I need to run several queries to extract several values from the same JSON tree. I would like to construct one JSONata query expression which extracts n data items (or runs n subqueries) and returns exactly n values in an ordered array.
This example requests 6 values, but because of array flattening the result array does not contain 6 values.
This example explicitly wraps each query in an array constructor so that the result has 6 values. However, the values which are not arrays get wrapped in an extraneous and undesirable array, and one can no longer determine what the original type was ...
This example shows the result that I am trying to accomplish ... I asked for 6 things and I got 6 values back. However, I must know the datatypes of the values I am fetching and explicitly wrap the arrays in an array constructor to work around the sequence flattening.
This example shows what I want. I queried 6 things and got back 6 answers without knowing the datatypes. But I have to introduce an object as a temporary container to work around the array flattening behavior.
I have not found any predicates that allow me to test the type of a value in a query ... which might have let me use the ?: operator to dynamically decide whether or not to wrap arrays in an array constructor. e.g. $isArray($.foo) ? [$.foo] : $.foo
Q: Is there an easier way for me to (effectively) submit 6 "path" queries and get back 6 values in an ordered array without knowing the data types of the values I am querying?
Building on the example from Acoleman, here is a way to pass in n "query" strings (that represent paths):
(['t', 'u.i', 'u.j', 'u.k', 'u.l', 'v'] {
  $: $eval('$$.' & $)
}).$each(function($o) {$o})
and get back an array of n results in their original data format:
[
  12345,
  [
    "i",
    "ii",
    "iii"
  ],
  [],
  "K",
  {
    "L": "LL"
  },
  null
]
It seems that using $each is the only way to avoid any flattening...
Granted, it's probably not the most efficient of expressions, since each path string has to be evaluated starting at the root of the data structure -- but there ya go.
I have a collection of documents. The relevant structure of the object for this question would be:
{
  "_id": ObjectId("5099803df3f4948bd2f98391"),
  ...
  "type": "a",
  "components": {
    "component2": 20,
    "component3": 10
  },
  "price": 123,
  ...
}
I'm currently using an old set of code that I wrote a while ago to find the cheapest permutation of a combination needed. I'm not sure if this is possible to do with just a query, but thought I would ask before moving any further.
Specifics: There are 10 possible "type" values (a-j) and 4 possible "component" types. Items will have at least 1 component and can have up to 2; they will never have more than 2. While an item's component types are limited to 2, the value ("grade") of each component can vary. So: exactly 1 or exactly 2 components, with any possible combination of component values/grades.
There are 10k records, and what I need to do is find the lowest possible price, having at least one of each type, that yields at least my desired grade for either the one or two components I enter.
The expected result would always have one of each type ( 10 total ).
In layman's terms: I'd be asking the data set for the cheapest combination of component2's that exceeds 200. Or the cheapest combination of component1/component3 that exceeds a 150 component1 grade and a 150 component3 grade.
Again, though, that combination is restricted because it must have exactly/only one of each type. So a better price could certainly be achieved if there were 10 type "a"'s, but it needs to be 1 "a", 1 "b", etc.
I don't think it is, but is it possible this could somehow be achieved with a query alone?