Duplicates when link-walking Riak using Ripple - ruby

I'm working on a project where I use Riak with Ripple, and I've stumbled on a problem.
For some reason I get duplicates when link-walking a structure of links. When I link walk using curl I don't get the duplicates as far as I can see.
The difference between my curl-based link walk
curl -v http://127.0.0.1:8098/riak/users/2306403e5177b4716da9df93b67300824aa2fd0e/_,projects,0/_,tasks,1
and my Ruby Ripple/riak-client-based link walk
result = Riak::MapReduce.new(self.robject.bucket.client).
  add(self.robject.bucket, self.key).
  link(Riak::WalkSpec.new({:key => 'projects'})).
  link(Riak::WalkSpec.new({:key => 'tasks', :bucket => 'tasks'})).
  map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return []; } }", {:keep => true}).run
is, as far as I can tell, the map at the end.
However, the result of the map/reduce contains several duplicates, and I can't wrap my head around why. For now I've settled for removing the duplicates based on the key, but I wish the Riak result didn't contain duplicates in the first place, since it seems like a waste to remove them at the end.
I've tried the following:
Making sure there are no duplicates in the link sets of my Ripple objects
Loading the data without the map/reduce, but the link walk contains duplicate keys.
Any help is appreciated.

What you're running into here is an interesting side-effect/challenge of Map/Reduce queries.
M/R queries don't have any notion of read quorum values, and they necessarily have to hit every object (within the limitations of input filtering, of course) on every node.
Which means, when N > 1, the queries have to hit every copy of every object.
For example, let's say N=3, the default. That means that for each written object there are 3 copies, one on each of 3 different nodes.
When you issue a read for an object (let's say with the default quorum value of R=2), the coordinating node (which received the read request from your client) contacts all 3 nodes (and potentially receives 3 different values, 3 different copies of the object).
It then checks to make sure that at least 2 of those copies have the same values (to satisfy the R=2 requirement), returns that agreed-upon value to the requesting client, and discards the other copies.
So, in regular operations (reads/writes, but also link walking), the coordinating node filters out the duplicates for you.
Map/Reduce queries don't have that luxury. They don't really have quorum values associated with them -- they are made to iterate over every (relevant) key and object on all the nodes. And because the M/R code runs on each individual node (close to the data) instead of just on the coordinating node, it can't really filter out any duplicates intrinsically. One of the things M/R is designed for, for example, is to update (or delete) all of the copies of the objects on all the nodes. So, your Map phase above runs on every node, returns the matched 'completed' values for each copy, and ships the results back to the coordinating node to return to the client. And since your N is very likely greater than 1, there are going to be duplicates in the result set.
Now, you can probably filter out duplicates explicitly by writing code in the Reduce phase that checks whether a key is already present and rejects the duplicate if it is, and so on.
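For concreteness, here is a rough sketch of what that could look like, chained onto the query from the question. The JavaScript body, the :keep settings, and the assumption that each mapped object still carries its key field (as the raw objects returned by the map phase above do) are illustrative, not something Riak prescribes.
# Hypothetical reduce phase that deduplicates the mapped objects by their Riak key.
dedup_js = <<-JS
  function(values) {
    var seen = {}, out = [];
    for (var i = 0; i < values.length; i++) {
      var k = values[i].key;
      if (!seen[k]) { seen[k] = true; out.push(values[i]); }
    }
    return out;
  }
JS
result = Riak::MapReduce.new(self.robject.bucket.client).
  add(self.robject.bucket, self.key).
  link(Riak::WalkSpec.new({:key => 'projects'})).
  link(Riak::WalkSpec.new({:key => 'tasks', :bucket => 'tasks'})).
  map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return []; } }", {:keep => false}).
  reduce(dedup_js, {:keep => true}).run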
But honestly, if I were in your situation, I would just filter out the duplicates in Ruby on the client side rather than mess with the reduce code.
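If you go that route, here is a minimal sketch, assuming result is the array returned by the MapReduce call in the question and that each entry still exposes its key field (again, as the raw objects returned by the map phase do):
# Client-side dedup of the MapReduce result by Riak key.
unique_results = result.uniq { |entry| entry["key"] }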
Anyways, I hope that sheds some light on this mystery.

Related

How to use Gremlin to get both node properties and edge names in one query?

I have been thrown into a pool of golang / gremlin / Neptune, and am able to get some things to work. Life is good enough, but I am hoping there is a simple answer (which I have not been able to find) to what seems like a simple question.
I have 'obs' nodes with some properties, two of which are ('type','domain') and ('value','whitehouse.com').
Another set of nodes is 'attack' ('type','group') and ('value','Emotet'), along with other properties.
An observation node can have an edge pointing to one or more attack nodes (and, actually, other types of nodes as well). These edges have a time-based property: when the observation was seen manifesting a certain type of attack.
I'm working in Go, using gremson to communicate with a Neptune db. In this environment you construct your query as a string and send it down the wire to Neptune, and get something called graphson back.
Thus, I construct this, and send it...
fmt.Sprintf("g.V().hasLabel('obs').has('value','%s').limit(1)", domain)
And I get back the properties for a vertex, in graphson. Were I using the console, all I would get back would be the id. Go figure.
Then I construct this, and send it...
fmt.Sprintf("g.V().hasLabel('obs').has('value','%s').limit(1).out()", domain)
and I get back the properties of the connected nodes, in graphson. Again, using the console I would only get back ids. No sweat.
What I would LIKE to do is to combine these two queries somehow so that I am not doing what seem to be two almost identical lookups.
Console-wise, assume both queries also have valueMap() or entityMap() tacked on the end. Is there any way to do them as one query?
There are many ways you could write this query. Here are a couple of options:
g.V().hasLabel('obs').
has('value','%s').
limit(1).as('a').
out().as('b').
select('a','b')
or using project
g.V().hasLabel('obs').
has('value','%s').
limit(1).
project('a','b').
by().
by(out().fold())
My preference is for the project() example, as you will get the connected vertices back in a list.

Redis: Get all keys from queue which have a value equal to some criteria

I have a bunch of jobs I store on redis, each identified by a special key, with a hash as their value that contains some information. These jobs get picked up, computed, and once they are done, they have a success field that is set to true if the computation was finished successfully. I want to populate a list of all the keys that have the success key in their value hash set to true.
e.g. redis storage:
foo_key_1 => {bar_hash_key: bar_value, baz_hash_key: baz_value, success: true}
foo_key_2 => {bar_hash_key: bar_value, baz_hash_key: baz_value, success: false}
With the example above, I'd like a simple, efficient way of scanning through all keys in redis and their success field, and end up with [foo_key_1] as a result.
Currently my approach is as follows (in ruby pseudocode):
# redis is a connection to my redis server
completed_keys = []
all_keys = redis.keys                          # list of all keys in redis
all_keys.each do |key|                         # iterate through all keys in redis
  if redis.hgetall(key)["success"] == "true"   # if that key has success = true attrib (hash values come back as strings)
    completed_keys << key                      # append that key to a new list
  end
end
The problem, as you might guess, is that I have a lot of keys in Redis, and this approach, although O(N), gets fairly sluggish in practice due to its iterative nature. I have pored through the Redis docs/commands, but nothing leapt out at me as a candidate for solving this problem more efficiently. I am no Redis guru, but it seems fairly straightforward: you put stuff in and take stuff out. Are there any possible vectorized operations I might have overlooked?
Many thanks.
You cannot do it without iterating through all the keys, because Redis only supports fetching by key. You can reduce that cost if you happen to know a pattern for the keys you have to look at; even then Redis will fetch all the keys in the background, with some optimizations, because it fetches them in batches (http://www.rubydoc.info/github/redis/redis-rb/Redis:scan). Something like:
$redis.scan_each(match: "foo*").to_a
You can find out how to write your own pattern here: https://redis.io/commands/KEYS
However, if you can control how the keys are named, you can embed a little hint in the name to help with this search, like:
make each key of the form "foo#{your_number}:#{success_bool}"
when searching, use the pattern "foo*:1"
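For example, a minimal sketch of that naming convention in Ruby (the key names and the :1/:0 suffix are just the illustrative convention from above, not anything Redis requires):
require "redis"
redis = Redis.new
# The worker stores each finished job under a key whose name encodes the outcome.
redis.hset("foo42:1", "success", "true")   # a job that completed successfully
redis.hset("foo43:0", "success", "false")  # a job that did not
# Only the successful jobs ever need to be scanned.
completed_keys = redis.scan_each(match: "foo*:1").to_a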
To do this efficiently using only Redis, you would have to keep an updated, "Redis-queryable" SET or LIST of successful/completed keys.
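A minimal sketch of that approach; the "completed_jobs" set name is illustrative, not something Redis defines for you:
require "redis"
redis = Redis.new
# When a job finishes successfully, record its key in a set alongside the hash update.
redis.hset("foo_key_1", "success", "true")
redis.sadd("completed_jobs", "foo_key_1")
# Fetching the successful keys is now proportional to the size of the set,
# not a scan over every key in the database.
completed_keys = redis.smembers("completed_jobs")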
At a fundamental level, it sounds like what you really have here is a job queue (and a second job queue for the successful first jobs). Unless you specifically want to implement this on your own, have you looked into job queue implementations?

Storm fields grouping

I have the following situation:
There are a number of bolts that calculate different values
These values are sent to a visualization bolt
The visualization bolt opens a web socket and sends the values to be visualized somehow
The thing is, the visualization bolt is always the same, but it sends a message with a different header for each type of bolt that can be its input. For example:
BoltSum calculates sum
BoltDif calculates difference
BoltMul calculates the product
All these bolts use VisualizationBolt for visualization
There are 3 instances of VisualizationBolt in this case
My question is, should I create 3 independent instances, where each instance will have one thread, e.g.
builder.setBolt("forSum", new VisualizationBolt(),1).globalGrouping("bolt-sum");
builder.setBolt("forDif", new VisualizationBolt(),1).globalGrouping("bolt-dif");
builder.setBolt("forMul", new VisualizationBolt(),1).globalGrouping("bolt-mul");
Or should I do the following
builder.setBolt("forAll", new VisualizationBolt(),3)
.fieldsGrouping("forSum", new Fields("type"))
.fieldsGrouping("forDif", new Fields("type"))
.fieldsGrouping("forMul", new Fields("type"));
And emit a type field from each of the previous bolts, so they can be grouped based on it?
What are the advantages?
Also, should I expect that every time bolt-sum will go to the first visualization bolt, bolt-dif to the second, and bolt-mul to the third? They won't be mixed?
I think that should be the case, but it currently isn't in my implementation, so I'm not sure if it's a bug or if I'm missing something.
The first approach using three instances is the correct one. Using fieldsGrouping does not ensure that "sum" values go to the "Sum-Visualization-Bolt", nor that sum/diff/mul values end up in distinct (i.e., different) bolt instances.
The semantics of fieldsGrouping are more relaxed: it only guarantees that all tuples of the same type will be processed by a single bolt instance, i.e., it will never be the case that two different bolt instances get the same type.
I guess you can use Partial Key grouping (partialKeyGrouping). The Storm documentation on stream groupings says:
Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
I implemented a simple topology using this grouping, and the chart on the Graphite server shows a better load balance compared to fieldsGrouping. The full source code is here.
topologyBuilder.setBolt(MqttSensors.BOLT_SENSOR_TYPE.getValue(), new SensorAggregateValuesWindowBolt().withTumblingWindow(Duration.seconds(5)), 2)
// .fieldsGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
// .fieldsGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.setNumTasks(4) // This will create 4 Bolt instances
.addConfiguration(TagSite.SITE.getValue(), TagSite.EDGE.getValue())
;

Handling large transactions: any time/memory tradeoffs?

In our system there is a (quite common) case where a user's action can trigger an operation that involves setting/removing labels onto/from nodes and relationships, on the order of hundreds of thousands of entities in total (remove label A from 100K nodes, set label B on 80K nodes, set property [x,y,z] on 20K nodes, and so on). Of course, I can't squeeze them all into one transaction, and, thanks to the fact that these nodes can easily be separated into a large number of subsets, I perform the actions inside some number of separate transactions, which, of course, breaks all the ACIDity but satisfies us in terms of performance. If I, however, try to nest those transactions into a single large one to rule them all, that top-level transaction tries to track all the internal transactions' updates to the DB, which, of course, results in extremely poor performance.
What can you guys recommend to solve the problem?
My config (well, its relevant parts):
"org.neo4j.server.database.mode" : "HA",
"use_memory_mapped_buffers" : "true",
"neostore.nodestore.db.mapped_memory" : "450M",
"neostore.relationshipstore.db.mapped_memory" : "450M",
"neostore.propertystore.db.mapped_memory" : "450M",
"neostore.propertystore.db.strings.mapped_memory" : "300M",
"neostore.propertystore.db.arrays.mapped_memory" : "50M",
"cache_type" : "hpc",
"dense_node_threshold" : "15",
"query_cache_size" : "150"
Any hints and clues are much appreciated :)
You are right that modifying hundreds of thousands of entities as a result of a user action in the same transaction isn't going to be performant. Nested transactions in Neo4j are just "placebo" transactions, as you correctly point out.
I would start by thinking about alternative strategies to achieve your goal (which I know nothing about) without needing to update so many entities.
If an alternative isn't possible, I would ask whether it is ok for the updates to happen a short time after the user action. If the answer is yes, then I would store a message about the user action in a persistent queue, which I would process asynchronously. That way, the user call returns quickly and the update happens eventually.
Finally, if it is acceptable for the time between the user action and the large update to take even longer, I would consider an "agent" that continuously crawls the graph and updates the labels of the entities it encounters, as opposed to transaction-driven updates. Have a look at GraphAware NodeRank for inspiration.

How are Neo4j caches speeding up queries?

I am currently working on a project that uses Neo4j as the database, with queries that involve some hard relationship discovery, and after running performance testing we are having some issues.
We have found that the cache influences request times enormously (from 3000 ms to 100 ms or so): doing the same request twice results in the first one being really slow and the second one much faster. After some searching we came across the warm-up method, which preloads all the nodes and relationships in the database by querying something like this:
match (n)-[r]->() return count(1);
With the cache activated plus this warm-up query we saw a big decrease in the time of our queries, but still not as fast as when the same query has already been run two, three or four times.
So we went on testing and searching until we saw that Neo4j also somehow caches queries so that they are not compiled every time (using the Scala compiler, if I am right). I say somehow, because after intense testing I could only conclude that Neo4j is compiling the query "on the fly".
Let me show a simplified example of what I mean:
(numbers are id attributes)
If I make a request like the following:
match (n:green {id: 1})-[r]->(:red)-[s]->(:green)<-[t]-(m:yellow {id: 7})
return count(m);
What I want to do is to find whether there is a connection between node 1 and node 7. As you can see, I have to discover a bunch of nodes and, more importantly, relationships, and the compile process looks more or less complicated, since the request took 1227 ms to complete. If I make exactly the same request again, I get a response time of about 5 ms, good enough to pass the performance testing. So definitely Neo4j or the Scala compiler is caching the Cypher queries too.
After understanding that there is a compile step in the Cypher request, I went deeper and started modifying only parts of an already cached query. Changing the label or id parameter of the last node matched also produced a delay, but only ~19 ms, which is still acceptable:
match (n:green {id: 1})-[r]->(:red)-[s]->(:green)<-[t]-(m:purple {id: 7})
return count(m);
However, when I restart the server, do the warm-up, and adjust the query so that the first node (labelled n before) doesn't match, the query responds very fast with 0 results, so I deduce that not all of the query was parsed, since the first node didn't match and there is no need to go deeper into the tree.
I also tried with OPTIONAL MATCH, given that it returns null if no match is found, but that isn't working either.
First of all, I wanted to ask whether everything I said so far based on my tests is correct, and if it is not, how it actually works. Secondly, what should I do (if there is a way) to cache everything at the beginning, when the server starts? Unfortunately, the requirements of the project say that queries should perform well, even the first one (not to mention that the real scenario has thousands more relationships and nodes, making everything slower); or tell me if there is no way to avoid this delay.
First of all you need to consider JVM warm-up: beware that classes are loaded lazily when needed (your first query) and that the JIT may only kick in after several thousand calls.
This
match (n)-[r]->() return count(1);
should properly warm up the node and relationship caches; however, I am not sure whether it also loads all their properties and indexes. Also make sure that your data set fits in memory.
Providing values directly in the Cypher query like this: {id: 1}, instead of using parameters {id: {paramId}}, means that when you change the value of the id the query needs to be compiled again.
You can pass parameters in this way in shell:
neo4j-sh (?)$ export paramId=5
neo4j-sh (?)$ return {paramId};
==> +-----------+
==> | {paramId} |
==> +-----------+
==> | 5 |
==> +-----------+
==> 1 row
==> 4 ms
So if you need your queries to perform well from the beginning:
change your queries to use parameters
execute your other queries at startup, together with your warm-up query
EDIT: added information on how to pass parameters in the shell
