Fringe-detection algorithm for hierarchical structure

My back-end (Java) relies heavily on tree structures with strong inheritance. Conflict resolution is complex, so I am looking to test a way to simply block users when the propagation of changes in higher nodes has not yet reached the current element.
Hierarchies are represented through both Materialized Paths and Adjacency Lists for performance reasons. The goal would be to:
Prevent updates (bad request) when the API requests a change to a node with pending propagation
Inform the user through the DTO (e.g. an isLocked attribute) when they retrieve a node with pending propagation
Propagation is a simple matter of going through all nodes top-down. It was previously done level-by-level (which would have been easier), but it is no longer orchestrated: each node sends the message to its children.
At the moment I have two ideas I do not like:
Add a locked flag on each node (persisted in DB), toggle it to true for all descendants of a modified node, then each node can be unlocked after being processed.
Leverage the materialized path and record the currently unprocessed nodes in a new table. If node D with path A.B.C.D is queried, the presence of any of the four path nodes in that table means D has not been processed yet and should be locked.
I do not like approach 1 because it needs to update all entities twice, although retrieving the list would be quick with the Materialized Path.
I do not like approach 2 because:
The materialized path is stored as a VARCHAR2, so the comparison cannot be done in the DB; I would first have to unwrap the path to get all nodes in the path and then query the DB to check whether any element of the hierarchy is still pending (roughly the check sketched after this list).
Trees can be quite large, with hundreds of children per node and tens of thousands of nodes per tree. Modifying the root would create a huge number of those temporary records holding the current 'fringe' of the propagation. That many independent DB calls are not ideal, especially since nodes can often be processed in less than 10 ms. I'd probably quickly hit a bottleneck and poor performance.
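For clarity, the check in approach 2 would look roughly like the sketch below; the fringe table, the PendingNodeRepository and the '.' path separator are assumptions made for illustration, not existing code.

// Rough sketch of the approach-2 lookup: unwrap the materialized path and ask
// the DB whether any node on the path is still in the pending/fringe table.
import java.util.Arrays;
import java.util.List;

public class PropagationGuard {

    private final PendingNodeRepository pendingNodeRepository;

    public PropagationGuard(PendingNodeRepository pendingNodeRepository) {
        this.pendingNodeRepository = pendingNodeRepository;
    }

    // True if any ancestor on the path (or the node itself) is still pending,
    // i.e. the propagation fringe has not passed this node yet.
    public boolean isLocked(String materializedPath) {
        List<String> pathNodes = Arrays.asList(materializedPath.split("\\."));
        return pendingNodeRepository.existsByNodeIdIn(pathNodes);
    }
}

// Hypothetical repository over the fringe table.
interface PendingNodeRepository {
    boolean existsByNodeIdIn(List<String> nodeIds);
}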
Is there another approach that could be taken to identify whether a propagation has reached a node? Examples, comparisons, ... Anything that could help decide on the best way to approach this problem.

Related

how to make apache ignite scale linearly with increase in Number of nodes?

I'm running some tests and found that 1 node is faster and produces more results than 2 or 4 nodes. I'm not able to understand why this is happening.
I'm using partition_aware=True and lazy=True while writing and querying data to Ignite.
Here are some of the results I got. It's for a cross join of two 100k-row tables.
Results I got after running some queries
Different result sets for different Ignite topologies are an implicit indicator that your affinity collocation configuration is incorrect. You need to distribute your entries across the cluster in a particular way that allows tables to be joined locally. Make sure that leads and products have the same affinity key column, and use it for your join. This concept is called a collocated join; it helps to avoid additional network hops.
For this particular case it seems you are trying to calculate Levenshtein distance, and the only way to do that is a cross join, which is basically a Cartesian product of the tables. It means that for each row from the left table you'll need to traverse all the records from the right table (there are some possible optimisations, though). The only way to achieve that is to leverage non-collocated joins. But keep in mind that this implies additional network activity. Here's a rough estimate of how much we actually need.
Assume we want to compute the cross join of tables A and B. Let's also assume that the table A contains n rows and the table B contains m rows. In that case for a cluster with k nodes (we are not taking backups into account, they don't take part in SQL) we would come up with some complexity estimation in terms of network data transfer.
There are about n/k rows of table A on every node on average. For every node-local row in A there are approximately m*(k-1)/k rows in B (residing on the other nodes) to fetch through the network. Having k nodes in total, the required network activity is proportional to k * (n/k) * m*(k-1)/k = n*m*(k-1)/k. With a growing number of nodes it creeps up towards n*m (the entire dataset squared, since here n = m). And that's not really good. Having a smaller number of nodes actually decreases the network load in this scenario.
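To put rough numbers on this (my own illustration using the two 100k-row tables from the question, not part of the original answer):

n = m = 100,000
k = 2:  n*m*(k-1)/k = 100,000 * 100,000 * 1/2 = 5.0 * 10^9 remote row fetches
k = 4:  n*m*(k-1)/k = 100,000 * 100,000 * 3/4 = 7.5 * 10^9 remote row fetches

So going from 2 to 4 nodes increases, rather than decreases, the network traffic for this particular query.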
In a nutshell:
try enabling distributed joins; it will fix the result set size (see the sketch below)
as for the query being slower on more nodes, it's difficult to say what's going on without profiling and query execution plans
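As a concrete illustration of the "enable distributed joins" suggestion, here is a hedged Java sketch using SqlFieldsQuery.setDistributedJoins; the cache name (leadsCache) and the table/column names (LEADS, PRODUCTS, ID) are assumptions, not taken from the question.

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class DistributedJoinExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<?, ?> cache = ignite.cache("leadsCache");

            SqlFieldsQuery qry = new SqlFieldsQuery(
                    "SELECT l.ID, p.ID FROM LEADS l CROSS JOIN PRODUCTS p")
                    .setDistributedJoins(true)   // allow non-collocated joins
                    .setLazy(true);              // stream results to limit memory use

            try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
                cursor.forEach(row -> { /* process one joined row */ });
            }
        }
    }
}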

how to rebalance a random binary search tree

Here is the situation: there's a balanced binary search tree which may be accessed by tens of threads. When I need to insert or delete a node, I don't want to lock the whole tree because of the concurrency, so as time goes on it becomes unbalanced again. When the tree is not so busy, I finally get a chance to lock it and rebalance it. How can I do this?
Or is there a better data structure I can use?
You can actually rebalance it using the Day-Stout-Warren algorithm. It's linear in the number of nodes, so it might take a while. Moreover, this approach raises a question: what if, during the interval when you don't rebalance the tree that's being read, it quickly becomes severely unbalanced, and all subsequent reads are done in, say, O(N) instead of O(log N)? Is it OK to have this loss of performance for hours in order to not lock things? Are you sure there will be a performance win?
If you can tolerate a lack of linearizability (i.e. you write a value but when you search for it immediately afterwards it's not found; it will be there eventually, but 100 ms to 10 s might pass), you can implement a "copy on write" tree: all writes are done by one thread (with rebalancing), and you periodically clone the tree into a read-only copy that can be used by the reading threads without any concurrency control; you just need to publish it atomically. This can be done especially fast if the tree is implemented on top of a contiguous memory chunk that can be copied as a whole and freed/garbage-collected as a whole.
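A minimal sketch of that copy-on-write idea, assuming a single writer thread and using TreeMap (a red-black tree) as the balanced tree; the class and method names are made up for illustration.

import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotTree<K extends Comparable<K>, V> {

    // Writer-owned tree; only the single writer thread touches it.
    private final TreeMap<K, V> writeTree = new TreeMap<>();

    // Atomically published read-only snapshot used by the reader threads.
    private final AtomicReference<NavigableMap<K, V>> snapshot =
            new AtomicReference<NavigableMap<K, V>>(Collections.emptyNavigableMap());

    // Called only by the writer thread.
    public void put(K key, V value) {
        writeTree.put(key, value);
    }

    // Periodically clone the writer's tree and publish it; readers see the new
    // version on their next read (eventually consistent, not linearizable).
    public void publish() {
        snapshot.set(Collections.unmodifiableNavigableMap(new TreeMap<>(writeTree)));
    }

    // Safe to call from any reader thread, no locking required.
    public V get(K key) {
        return snapshot.get().get(key);
    }
}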
Another option is to use a concurrent skip list: it gives logarithmic average-case search/delete/insert time and is more easily parallelizable. There is a standard lock-free implementation for Java if you happen to use it. You can find more information about concurrent skip lists and balanced search trees here. In particular, you'll find mentions of a chromatic tree, a binary search tree that is optimized for concurrent rebalancing.
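For reference, the standard lock-free implementation shipped with Java is java.util.concurrent.ConcurrentSkipListMap; a trivial usage sketch:

import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListDemo {
    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        map.put(42, "value");             // thread-safe, no external locking needed
        System.out.println(map.get(42));  // expected O(log n) average lookups
        map.remove(42);
    }
}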

How can I guarantee sequential order in multi-server Oracle RAC environment

We are using a timestamp to ensure that entries in a log table are recorded sequentially, but we have found a potential flaw. Say, for example, we have two nodes in our RAC and the node timestamps are 1000 ms apart. Our app server inserts two log entries within 30 ms of each other. The first insert is serviced by Node1 and the second by Node2. With a 1000 ms difference between the two nodes, the timestamps could show the log entries occurring in the wrong order! (I would just use a sequence, but our sequences are cached for performance reasons... )
NTP sync doesn't help this situation because NTP has a fault tolerance of 128ms -- which leaves the door open for records to be recorded out of order when they occur more frequently than that.
I have a feeling I'm looking at this problem the wrong way. My ultimate goal is to be able to retrieve the actual sequence that log entries are recorded. It doesn't have to be by a timestamp column.
An Oracle sequence with ORDER specified is guaranteed to return numbers in order across a RAC cluster. So
create sequence my_seq
start with 1
increment by 1
order;
Now, in order to do this, you're going to be doing a fair amount of inter-node communication to ensure that access to the sequence is serialized appropriately. That's going to make it significantly more expensive than a normal sequence. If you need to guarantee order, though, it's probably the most efficient approach you're going to have.
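For completeness, a hedged JDBC sketch of how the ordered sequence might be used from the application; the connection string and the table/column names (app_log, log_id, message) are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class OrderedLogInsert {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//rac-scan:1521/mydb", "app_user", "secret")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO app_log (log_id, message) VALUES (my_seq.NEXTVAL, ?)")) {
                ps.setString(1, "something happened");
                ps.executeUpdate();
            }
            // What is ordered is the sequence value assigned at insert time, not
            // the commit time (see the point about commits below).
            con.commit();
        }
    }
}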
Bear in mind that a timestamp attached to a row is generated at the time of the insert or update, but the actual change to the database takes place when the commit happens; depending on the complexity of the transactions, row 1 might get inserted before row 2 but get committed after it.
The only thing I am aware of in Oracle that guarantees order across the nodes is the SCN that Oracle attaches to the transaction, by which transactions in a RAC environment can be ordered for things like Streams replication.
1000 ms? That's one second, isn't it? IMHO that's a lot. If you really need precise time, then simply give up on the idea of global time. Generate timestamps on the log server and assume that each log server has its own local time. Read up on Lamport timestamps if you need some theory. But maybe the source of your problem is somewhere else: RAC synchronises time between nodes and would log a discrepancy that large.
If two consecutive events are logged by two different connections, is the same thread using both connections? Or are those events passed to background threads which then write into the database? I.e., is it logged sequentially or in parallel?
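To illustrate the Lamport timestamps mentioned above (my own sketch, not tied to Oracle or the poster's log table):

import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {

    private final AtomicLong counter = new AtomicLong();

    // Call when recording a local event (e.g. writing a log entry); the returned
    // value is the entry's logical timestamp.
    public long tick() {
        return counter.incrementAndGet();
    }

    // Call when observing an event that carries another node's timestamp.
    public long onReceive(long remoteTime) {
        return counter.updateAndGet(local -> Math.max(local, remoteTime) + 1);
    }
}

Ordering entries by such logical timestamps preserves causality even when the nodes' wall clocks disagree.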

Neo4j: Delete lots of nodes does not improve performance

I have two graphs: one has 15k nodes and is a subgraph of the other, which has 30k nodes. To get the smaller one I took the bigger one and deleted some nodes and their relationships. Now I ran some performance tests on both graphs, using the same queries on both, and I was surprised that the performance on the bigger graph is better. I do not know the reason. I read that the records of deleted nodes are reserved for reuse when new nodes are inserted, but is this the true reason? I am using version 2.1.2.
If you deleted and inserted in one go, then your deleted records are still unused.
But at your small graph size this shouldn't matter. I think you have a different problem; please share all the code/queries that run slowly, as well as more information about your data model and graph.

Set NSTreeController with a capacity?

In my OSX app, I'm using an NSTreeController to keep track of any changes to a document. The tree controller enables versioning by acting as source control, which means that documents can create their own branches, etc.
It works fine so far. The problem is that every change to the document adds an NSTreeNode to the tree. Which means that after a few hours of use, the tree has accumulated many nodes, which means tons of objects in memory.
Is there a way I can create an NSTreeController with a capacity (like you'd give to an NSArray) which will automatically trim child nodes? If not, what's the best way to manually flush nodes at an appropriate interval so memory usage doesn't bloat?
