How to handle a queue in Neo4J? - data-structures

In my Neo4J database I have a series of queues of cards implemented via doubly-linked lists. The data structure is displayed in the following figure (SVG graph of queue generated using Alistair Jones' Arrows online tool):
Since these are queues, I always add new items from the TAIL of the queue. I know that the double relationships (next/previous) is not necessary, but they simplify traversal in both direction, so I prefer to have them.
Inserting a new node
This is the query that I am using to insert a new "card":
MATCH (currentList:List)-[currentTailRel:TailCard]->(currentTail:Card) WHERE ID(currentList) = {{LIST_ID}}
CREATE (currentList)-[newTailRel:TailCard]->(newCard:Card { title: {{TITLE}}, description: {{DESCRIPTION}} })
CREATE (newCard)-[newPrevRel:PreviousCard]->(currentTail)
CREATE (currentTail)-[newNextRel:NextCard]->(newCard)
DELETE currentTailRel
WITH count(newCard) as countNewCard
WHERE countNewCard = 0
MATCH (emptyList:List)-[fakeTailRel:TailCard]->(emptyList),
(emptyList)-[fakeHeadRel:HeadCard]->(emptyList)
WHERE ID(emptyList) = {{LIST_ID}}
WITH emptyList, fakeTailRel, fakeHeadRel
CREATE (emptyList)-[:TailCard]->(newCard:Card { title: {{TITLE}}, description: {{DESCRIPTION}} })
CREATE (emptyList)-[:HeadCard]->(newCard)
DELETE fakeTailRel, fakeHeadRel
RETURN true
The query can be broken in two parts. In the first part:
MATCH (currentList:List)-[currentTailRel:TailCard]->(currentTail:Card) WHERE ID(currentList) = {{LIST_ID}}
CREATE (currentList)-[newTailRel:TailCard]->(newCard:Card { title: {{TITLE}}, description: {{DESCRIPTION}} })
CREATE (newCard)-[newPrevRel:PreviousCard]->(currentTail)
CREATE (currentTail)-[newNextRel:NextCard]->(newCard)
DELETE currentTailRel
I handle the general case of adding a card to a queue that already has other cards.
In the second part:
WITH count(newCard) as countNewCard
WHERE countNewCard = 0
MATCH (emptyList:List)-[fakeTailRel:TailCard]->(emptyList),
(emptyList)-[fakeHeadRel:HeadCard]->(emptyList)
WHERE ID(emptyList) = {{LIST_ID}}
WITH emptyList, fakeTailRel, fakeHeadRel
CREATE (emptyList)-[:TailCard]->(newCard:Card { title: {{TITLE}}, description: {{DESCRIPTION}} })
CREATE (emptyList)-[:HeadCard]->(newCard)
DELETE fakeTailRel, fakeHeadRel
RETURN true
I handle the case in which there are no cards in the queue. In that case the (emptyList) node has two relationships of type HeadCard and TailCard pointing to itself (I call them fake tail and fake head).
This seems to be working. Being a noob at this though, I have a feeling that I am overthinking things and that there might be a more elegant and straightforward way to achieve this. One thing I'd like to understand how to do in a better/simpler way, for example, is how to separate the two subqueries. I'd also like to be able to return the newly created node in both cases, if possible.
Archiving an existing node
Here is how I am removing nodes from the queue. I never want to simply delete nodes, I'd rather add them to an archive node so that, in case of need, they can be recovered.
I have identified these cases:
When the node to be archived is in the middle of the queue
// archive a node in the middle of a doubly-linked list
MATCH (before:Card)-[n1:NextCard]->(middle:Card)-[n2:NextCard]->(after:Card)
WHERE ID(middle)=48
CREATE (before)-[:NextCard]->(after)
CREATE (after)-[:PreviousCard]->(before)
WITH middle, before, after
MATCH (middle)-[r]-(n)
DELETE r
WITH middle, before, after
MATCH (before)<-[:NextCard*]-(c:Card)<-[:HeadCard]-(l:List)<-[:NextList*]-(fl:List)<-[:HeadList]-(p:Project)-[:ArchiveList]->(archive:List)
CREATE (archive)-[r:Archived { archivedOn : timestamp() }]->(middle)
RETURN middle
When the node to be archived is the head of the queue
// archive the head node of a doubly-linked list
MATCH (list:List)-[h1:HeadCard]->(head:Card)-[n1:NextCard]->(second:Card)
WHERE ID(head)=48
CREATE (list)-[:HeadCard]->(second)
WITH head, list
MATCH (head)-[r]-(n)
DELETE r
WITH head, list
MATCH (list)<-[:NextList*]-(fl:List)<-[:HeadList]-(p:Project)-[:ArchiveList]->(archive:List)
CREATE (archive)-[r:Archived { archivedOn : timestamp() }]->(head)
RETURN head
When the node to be archived is the tail of the queue
// archive the tail node of a doubly-linked list
MATCH (list:List)-[t1:TailCard]->(tail:Card)-[p1:PreviousCard]->(nextToLast:Card)
WHERE ID(tail)=48
CREATE (list)-[:TailCard]->(nextToLast)
WITH tail, list
MATCH (tail)-[r]-(n)
DELETE r
WITH tail, list
MATCH (list)<-[:NextList*]-(fl:List)<-[:HeadList]-(p:Project)-[:ArchiveList]->(archive:List)
CREATE (archive)-[r:Archived { archivedOn : timestamp() }]->(tail)
RETURN tail
When the node to be archived is the only node in the queue
// archive the one and only node in the doubly-linked list
MATCH (list:List)-[tc:TailCard]->(only:Card)<-[hc:HeadCard]-(list:List)
WHERE ID(only)=48
CREATE (list)-[:TailCard]->(list)
CREATE (list)-[:HeadCard]->(list)
WITH only, list
MATCH (only)-[r]-(n)
DELETE r
WITH only, list
MATCH (list)<-[:NextList*]-(fl:List)<-[:HeadList]-(p:Project)-[:ArchiveList]->(archive:List)
CREATE (archive)-[r:Archived { archivedOn : timestamp() }]->(only)
RETURN only
I have tried in many ways to combine the following cypher queries into one, using WITH statements, but I was unsuccessful. My current plan is to run all 4 queries one after the other. Only one will actually do something (i.e. archive the node).
Any suggestions to make this better and more streamlined? I am even open to restructuring the data structure since this is a sandbox project that I created for myself to learn Angular and Neo4J, so the ultimate goal is to learn how to do thing better :)
Maybe the data structure itself could be improved? Given how complicated is to insert/archive a node at the end of the queue, I can only imagine how hard is going to be to move elements in the queue (one of the requirements of my self project is to be able to reorder elements in the queue whenever needed).
EDIT:
Still working on trying to combine those 4 queries.
I got this together:
MATCH (theCard:Card) WHERE ID(theCard)=22
OPTIONAL MATCH (before:Card)-[btc:NEXT_CARD]->(theCard:Card)-[tca:NEXT_CARD]->(after:Card)
OPTIONAL MATCH (listOfOne:List)-[lootc:TAIL_CARD]->(theCard:Card)<-[tcloo:HEAD_CARD]-(listOfOne:List)
OPTIONAL MATCH (listToHead:List)-[lthtc:HEAD_CARD]->(theCard:Card)-[tcs:NEXT_CARD]->(second:Card)
OPTIONAL MATCH (listToTail:List)-[ltttc:TAIL_CARD]->(theCard:Card)-[tcntl:PREV_CARD]->(nextToLast:Card)
RETURN theCard, before, btc, tca, after, listOfOne, lootc, tcloo, listToHead, lthtc, tcs, second, listToTail, ltttc, tcntl, nextToLast
which is return NULLs when something is not found, and nodes/relationship when something is found. I thought this could be a good starting point, so I added the following:
MATCH (theCard:Card) WHERE ID(theCard)=22
OPTIONAL MATCH (before:Card)-[btc:NEXT_CARD]->(theCard:Card)-[tca:NEXT_CARD]->(after:Card)
OPTIONAL MATCH (listOfOne:List)-[lootc:TAIL_CARD]->(theCard:Card)<-[tcloo:HEAD_CARD]-(listOfOne:List)
OPTIONAL MATCH (listToHead:List)-[lthtc:HEAD_CARD]->(theCard:Card)-[tcs:NEXT_CARD]->(second:Card)
OPTIONAL MATCH (listToTail:List)-[ltttc:TAIL_CARD]->(theCard:Card)-[tcntl:PREV_CARD]->(nextToLast:Card)
WITH theCard,
CASE WHEN before IS NULL THEN [] ELSE COLLECT(before) END AS beforeList,
before, btc, tca, after,
listOfOne, lootc, tcloo, listToHead, lthtc, tcs, second, listToTail, ltttc, tcntl, nextToLast
FOREACH (value IN beforeList | CREATE (before)-[:NEXT_CARD]->(after))
FOREACH (value IN beforeList | CREATE (after)-[:PREV_CARD]->(before))
FOREACH (value IN beforeList | DELETE btc)
FOREACH (value IN beforeList | DELETE tca)
RETURN theCard
When I executed this (with an ID chosen to make before=NULL, the fan of my laptop start spinning like crazy, the query never returns and eventually the neo4j browser says that it has lost connection with the server. The only way to end the query is to stop the server.
So I changed the query to the simpler:
MATCH (theCard:Card) WHERE ID(theCard)=22
OPTIONAL MATCH (before:Card)-[btc:NEXT_CARD]->(theCard:Card)-[tca:NEXT_CARD]->(after:Card)
OPTIONAL MATCH (listOfOne:List)-[lootc:TAIL_CARD]->(theCard:Card)<-[tcloo:HEAD_CARD]-(listOfOne:List)
OPTIONAL MATCH (listToHead:List)-[lthtc:HEAD_CARD]->(theCard:Card)-[tcs:NEXT_CARD]->(second:Card)
OPTIONAL MATCH (listToTail:List)-[ltttc:TAIL_CARD]->(theCard:Card)-[tcntl:PREV_CARD]->(nextToLast:Card)
RETURN theCard,
CASE WHEN before IS NULL THEN [] ELSE COLLECT(before) END AS beforeList,
before, btc, tca, after,
listOfOne, lootc, tcloo, listToHead, lthtc, tcs, second, listToTail, ltttc, tcntl, nextToLast
And I still end up in an infinite loop or something...
So I guess the line CASE WHEN before IS NULL THEN [] ELSE COLLECT(before) END AS beforeList was not a good idea... Any suggestions on how to proceed from here? Am I on the wrong path?
A SOLUTION?
Finally, after much research, I found a way to write a single query that takes care of all possible scenarios. I don't know if this is the best way to achieve what I am trying to achieve, but it seems elegant and compact enough to me. What do you think?
// first let's get a hold of the card we want to archive
MATCH (theCard:Card) WHERE ID(theCard)=44
// next, let's get a hold of the correspondent archive list node, since we need to move the card in that list
OPTIONAL MATCH (theCard)<-[:NEXT_CARD|HEAD_CARD*]-(theList:List)<-[:NEXT_LIST|HEAD_LIST*]-(theProject:Project)-[:ARCHIVE_LIST]->(theArchive:List)
// let's check if we are in the case where the card to be archived is in the middle of a list
OPTIONAL MATCH (before:Card)-[btc:NEXT_CARD]->(theCard:Card)-[tca:NEXT_CARD]->(after:Card)
OPTIONAL MATCH (next:Card)-[ntc:PREV_CARD]->(theCard:Card)-[tcp:PREV_CARD]->(previous:Card)
// let's check if the card to be archived is the only card in the list
OPTIONAL MATCH (listOfOne:List)-[lootc:TAIL_CARD]->(theCard:Card)<-[tcloo:HEAD_CARD]-(listOfOne:List)
// let's check if the card to be archived is at the head of the list
OPTIONAL MATCH (listToHead:List)-[lthtc:HEAD_CARD]->(theCard:Card)-[tcs:NEXT_CARD]->(second:Card)-[stc:PREV_CARD]->(theCard:Card)
// let's check if the card to be archived is at the tail of the list
OPTIONAL MATCH (listToTail:List)-[ltttc:TAIL_CARD]->(theCard:Card)-[tcntl:PREV_CARD]->(nextToLast:Card)-[ntltc:NEXT_CARD]->(theCard:Card)
WITH
theCard, theList, theProject, theArchive,
CASE WHEN theArchive IS NULL THEN [] ELSE [(theArchive)] END AS archives,
CASE WHEN before IS NULL THEN [] ELSE [(before)] END AS befores,
before, btc, tca, after,
CASE WHEN next IS NULL THEN [] ELSE [(next)] END AS nexts,
next, ntc, tcp, previous,
CASE WHEN listOfOne IS NULL THEN [] ELSE [(listOfOne)] END AS listsOfOne,
listOfOne, lootc, tcloo,
CASE WHEN listToHead IS NULL THEN [] ELSE [(listToHead)] END AS listsToHead,
listToHead, lthtc, tcs, second, stc,
CASE WHEN listToTail IS NULL THEN [] ELSE [(listToTail)] END AS listsToTail,
listToTail, ltttc, tcntl, nextToLast, ntltc
// let's handle the case in which the archived card was in the middle of a list
FOREACH (value IN befores |
CREATE (before)-[:NEXT_CARD]->(after)
CREATE (after)-[:PREV_CARD]->(before)
DELETE btc, tca)
FOREACH (value IN nexts | DELETE ntc, tcp)
// let's handle the case in which the archived card was the one and only card in the list
FOREACH (value IN listsOfOne |
CREATE (listOfOne)-[:HEAD_CARD]->(listOfOne)
CREATE (listOfOne)-[:TAIL_CARD]->(listOfOne)
DELETE lootc, tcloo)
// let's handle the case in which the archived card was at the head of the list
FOREACH (value IN listsToHead |
CREATE (listToHead)-[:HEAD_CARD]->(second)
DELETE lthtc, tcs, stc)
// let's handle the case in which the archived card was at the tail of the list
FOREACH (value IN listsToTail |
CREATE (listToTail)-[:TAIL_CARD]->(nextToLast)
DELETE ltttc, tcntl, ntltc)
// finally, let's move the card in the archive
// first get a hold of the archive list to which we want to add the card
WITH
theCard,
theArchive
// first get a hold of the list to which we want to add the new card
OPTIONAL MATCH (theArchive)-[tact:TAIL_CARD]->(currentTail:Card)
// check if the list is empty
OPTIONAL MATCH (theArchive)-[tata1:TAIL_CARD]->(theArchive)-[tata2:HEAD_CARD]->(theArchive)
WITH
theArchive, theCard,
CASE WHEN currentTail IS NULL THEN [] ELSE [(currentTail)] END AS currentTails,
currentTail, tact,
CASE WHEN tata1 IS NULL THEN [] ELSE [(theArchive)] END AS emptyLists,
tata1, tata2
// handle the case in which the list already had at least one card
FOREACH (value IN currentTails |
CREATE (theArchive)-[:TAIL_CARD]->(theCard)
CREATE (theCard)-[:PREV_CARD]->(currentTail)
CREATE (currentTail)-[:NEXT_CARD]->(theCard)
DELETE tact)
// handle the case in which the list was empty
FOREACH (value IN emptyLists |
CREATE (theArchive)-[:TAIL_CARD]->(theCard)
CREATE (theArchive)-[:HEAD_CARD]->(theCard)
DELETE tata1, tata2)
RETURN theCard
LAST EDIT
Following Wes advice I decided to change the way each of the queues in my application was handled, adding two extra nodes, the head and the tail.
Inserting a New Card
Moving the concepts of head and tail from simple relationships to nodes allows to have a single case when inserting a new card. Even in the special case of an empty queue…
all we have to do to add a new card to the tail of the queue is:
find the (previous) node connected by a [PREV_CARD] and a [NEXT_CARD] relationships to the (tail) node of the queue
create a (newCard) node
connect the (newCard) node to the (tail) node with both a [PREV_CARD] and a [NEXT_CARD] relationships
connect the (newCard) node to the (previous) node with both a [PREV_CARD] and a [NEXT_CARD] relationships
finally delete the original [PREV_CARD] and a [NEXT_CARD] relationships that connected the (previous) node to the (tail) node of the queue
which translates into the following cypher query:
MATCH (theList:List)-[tlt:TAIL_CARD]->(tail)-[tp:PREV_CARD]->(previous)-[pt:NEXT_CARD]->(tail)
WHERE ID(theList)={{listId}}
WITH theList, tail, tp, pt, previous
CREATE (newCard:Card { title: "Card Title", description: "" })
CREATE (tail)-[:PREV_CARD]->(newCard)-[:NEXT_CARD]->(tail)
CREATE (newCard)-[:PREV_CARD]->(previous)-[:NEXT_CARD]->(newCard)
DELETE tp,pt
RETURN newCard
Archiving a Card
Now let's reconsider the use case in which we want to archive a card. Let's review the architecture:
We have:
each project has a queue of lists
each project has an archive queue to store all archived cards
each list has a queue of cards
In the previous queue architecture I had 4 different scenarios, depending in whether the card to be archived was the head, the tail, or a card in between or if it was the last card left in the quee.
Now, with the introduction of the head and tail nodes, there is only one scenario, because the head and the tail node are there to stay, even in the case in which the queue is empty:
we need to find the (previous) and the (next) nodes, immediately before and after (theCard) node, which is the node that we want to archive
then, we need to connect (previous) and (next) with both a [NEXT_CARD] and a [PREV_CARD] relationship
then, we need to delete all the relationships that were connecting (theCard) to the (previous) and (next) nodes
The resulting cypher query can be subdivided in three distinct parts. The first part is in charge of finding (theArchive) node, given the ID of (theCard) node:
MATCH (theCard)<-[:NEXT_CARD|HEAD_CARD*]-(l:List)<-[:NEXT_LIST*]-(h)<-[:HEAD_LIST]-(p:Project)-[:ARCHIVE]->(theArchive:Archive)
WHERE ID(theCard)={{cardId}}
Next, we execute the logic that I described few lines earlier:
WITH theCard, theArchive
MATCH (previous)-[ptc:NEXT_CARD]->(theCard)-[tcn:NEXT_CARD]->(next)-[ntc:PREV_CARD]->(theCard)-[tcp:PREV_CARD]->(previous)
WITH theCard, theArchive, previous, next, ptc, tcn, ntc, tcp
CREATE (previous)-[:NEXT_CARD]->(next)-[:PREV_CARD]->(previous)
DELETE ptc, tcn, ntc, tcp
Finally, we insert (theCard) at the tail of the archive queue:
WITH theCard, theArchive
MATCH (theArchive)-[tat:TAIL_CARD]->(archiveTail)-[tp:PREV_CARD]->(archivePrevious)-[pt:NEXT_CARD]->(archiveTail)
WITH theCard, theArchive, archiveTail, tp, pt, archivePrevious
CREATE (archiveTail)-[:PREV_CARD]->(theCard)-[:NEXT_CARD]->(archiveTail)
CREATE (theCard)-[:PREV_CARD]->(archivePrevious)-[:NEXT_CARD]->(theCard)
DELETE tp,pt
RETURN theCard
I hope you find this last edit interesting as I found working through this exercise. I want to thank again Wes for his remote help (via Twitter and Stack Overflow) in this interesting (at least to me) experiment.

A nice problem--doubly-linked lists in the graph. I recently worked on a similar concept where I needed to keep track of a list, but tried to come up with a way to avoid the difficulties of needing to know how to handle the end of the list edge cases. The result of my effort was this graph gist about skip lists in cypher.
Translated to a doubly-linked list, this would look like:
CREATE
(head:Head),
(tail:Tail),
(head)-[:NEXT]->(tail),
(tail)-[:PREV]->(head),
(a:Archive)-[:NEXT]->(:ArchiveTail)-[:PREV]->(a);
; // we need something to start at, and an archive list
Then you could queue nodes with the simple:
MATCH (tail:Tail)-[p:PREV]->(prev),
(tail)<-[n:NEXT]-(prev)
CREATE (new {text:"new card"})<-[:NEXT]-(prev),
(new)-[:PREV]->(prev),
(new)<-[:PREV]-(tail),
(new)-[:NEXT]->(tail)
DELETE p, n
And archive nodes with the fairly simple:
MATCH (toArchive)-[nn:NEXT]->(next),
(toArchive)<-[np:PREV]-(next),
(toArchive)<-[pn:NEXT]-(prev),
(toArchive)-[pp:PREV]->(prev)
WHERE toArchive.text = "new card 2"
CREATE (prev)-[:NEXT]->(next)-[:PREV]->(prev)
DELETE nn, np, pn, pp
WITH toArchive
MATCH (archive:Archive)-[n:NEXT]->(first)-[p:PREV]->(archive)
CREATE (archive)-[:NEXT]->(toArchive)<-[:PREV]-(first),
(archive)<-[:PREV]-(toArchive)-[:NEXT]->(first)
DELETE n, p
Your use cases are actually much easier than the skip list algorithmically, because you avoid most needs for varlength paths by keeping the tail to queue cards to the end directly. Hopefully others implementing similar ideas will find your SO question useful.

Your solution seems nice enough and serves your purpose too. I just add some suggestions for you.. that.. when you check the card to be archived is in the middle of a list and confirm that it is in the middle, than you don't need to check whether it is at the head or tail of the list and vice versa.

Related

Hashmap (O(1)) supporting joker/match-all keys

The title is not so clear, because I cannot put my problem in a sentence (If you have a better title for this question, please suggest). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline | Free Baggage |
===================================================
| NYC | London | American | 20KG |
---------------------------------------------------
| NYC | * | Southwest | 30KG |
---------------------------------------------------
| * | * | Southwest | 25KG |
---------------------------------------------------
| * | LA | * | 20KG |
---------------------------------------------------
| * | * | * | 15KG |
---------------------------------------------------
and so on ...
This table describes free baggage amount that the airlines provide in different routes. You can see that some rows have * value, meaning that they match all possible values (those values are not known necessarily).
So we have a large list of baggage rules (like the table above) and a large list of flights (which their origin, destination and airline is known), and we intend to find the baggage amount for each one of flights in the most efficient way (iterating the list is not an efficient way, obviously, as it will cost an O(N) computation). It is possible to exist more than one result for each flight, but we will assume that in this case either the first matching or the most specific one will be preferred (whichever is simpler for you to continue with).
If there was not * signs in the table, the problem would be easy, and we could use a Hashmap or Dictionary with a Tuple of values as a key. But with presence of those * (lets say match-all) keys, it is not so straight forward to provide a general solution for that.
Please note that the above example was just an example, and I need a solution that can be used for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method having time complexity equal or close to O(1) like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments, the more I think about it, and the more it looks like a relational database with indexes rather than an hashmap...
A trivial, quite easy solution could be something like an In-memory SQlite database. But it would probably be something in O(log2(n)), and not O(1). The main advantage is that it's easy to set up, and IF performances are good enough, it could be the final solution.
Here, key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
From scratch, I can't think about any solution that, having N rows and M columns, isn't at least in O(M)... But usually, you'll have way less columns than rows. Quickly - I may have skipped a detail, I write that on-the-fly - I can propose you this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think about this as a primary key in databases, and we'll call it PK. Knowing PK gives you instantly, in O(1), the required data. You'll have N rows grand total.
For each row NOT containing any *, you'll insert in a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact results. It will be in O(1), BUT what you requested may not be within... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks, don't care as long as both are consistents).
This hash contains at most N entries. Without any joker, it will act near as a standard hashmap, but for the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index, dedicated to this column.
The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
For each row, the index entry will be: ( <VECDATA value>, PK ). The container is stored in a vector of indexes, INDEX[i] (with 0<=i<M).
Same as MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...
Searching a row
So, user search for a given tuple.
Search it in MAINHASH. If found, return it, search done.
Upgrade (see below): search also in CACHE before going to step #2.
For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returns a vector of PK, EXACT[i]) AND for * (returns another vector of PK, FUZZY[i]).
With these two vectors, build another (temporary) hash TMPHASH, associating ( PK, integer COUNT ). It quite simple: COUNT is initialized to 1 if entry comes from EXACT, and 0 if it comes from FUZZY.
For next column, build EXACT and FUZZY (see #2). But instead of making a new TMPHASH, you'll MERGE the results into rather than creating a new temporary hash.
Method is: if TMPHASH doesn't have this PK entry, trash this entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, reinject it in TMPHASH.
Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, then you don't have any suitable answer. Return that to user. If it contains only one entry, same: return to user directly.
For more than one element in TMPHASH:
Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated to the current maximum for COUNT.
Developper's choice: in case of multiple COUNT at the same maximum value, you can either return them all, return the first one, or the last one.
COUNT if obviously always stricly lower than M - otherwise, you would have found the tuple in MAINHASH. This value, compared to M, can give a confidence mark to your result (=100*COUNT/M% of confidence).
You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
Since it would be way too complicated to update properly CACHE when adding/modifying something in VECDATA, simply purge CACHE when it occurs. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by allowing to redefine operators and having all base containers available, but it should work.
Exact matches / cached matches are in O(1). Fuzzy search is in O(n.M), n being the number of matching rows (and 0<=n<N, of course).
Without further researchs, I can't see anything better than that. It will consume an obscene amount of memory, but you said that it won't be an issue.
I would recommend doing this with Tries that have a little data decorated. For routes, you want to know the lowest route ID so we can match to the first available route. For flights you want to track how many flights there are left to match.
What this will allow you to do, for instance, is partway through the match ONLY ONCE realize that flights from city1 to city2 might be matching routes that start off city1, city2, or city1, * or *, city2, or *, * without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref
class Flight:
def __init__(self, fields, flight_no):
self.fields = fields
self.flight_no = flight_no
class Route:
def __init__(self, route_id, fields, baggage):
self.route_id = route_id
self.fields = fields
self.baggage = baggage
class SearchTrie:
def __init__(self, value=0, item=None, parent=None):
# value = # unmatched flights for flights
# value = lowest route id for routes.
self.value = value
self.item = item
self.trie = {}
self.parent = None
if parent:
self.parent = weakref.ref(parent)
def add_flight (self, flight, i=0):
self.value += 1
fields = flight.fields
if i < len(fields):
if fields[i] not in self.trie:
self.trie[fields[i]] = SearchTrie(0, None, self)
self.trie[fields[i]].add_flight(flight, i+1)
else:
self.item = flight
def remove_flight(self):
self.value -= 1
if self.parent and self.parent():
self.parent().remove_flight()
def add_route (self, route, i=0):
route_id = route.route_id
fields = route.fields
if i < len(fields):
if fields[i] not in self.trie:
self.trie[fields[i]] = SearchTrie(route_id)
self.trie[fields[i]].add_route(route, i+1)
else:
self.item = route
def match_flight_baggage(route_search, flight_search):
# Construct a heap of one search to do.
tmp_id = 0
todo = [((0, tmp_id), route_search, flight_search)]
# This will hold by flight number, baggage.
matched = {}
while 0 < len(todo):
priority, route_search, flight_search = heapq.heappop(todo)
if 0 == flight_search.value: # There are no flights left to match
# Already matched all flights.
pass
elif flight_search.item is not None:
# We found a match!
matched[flight_search.item.flight_no] = route_search.item.baggage
flight_search.remove_flight()
else:
for key, r_search in route_search.trie.items():
if key == '*': # Found wildcard.
for a_search in flight_search.trie.values():
if 0 < a_search.value:
heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
tmp_id += 1
elif key in flight_search.trie and 0 < flight_search.trie[key].value:
heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
tmp_id += 1
return matched
# Sample data - the id is the position.
route_data = [
["NYC", "London", "American", "20KG"],
["NYC", "*", "Southwest", "30KG"],
["*", "*", "Southwest", "25KG"],
["*", "LA", "*", "20KG"],
["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
data = route_data[i]
routes.append(Route(i, [data[0], data[1], data[2]], data[3]))
flight_data = [
["NYC", "London", "American"],
["NYC", "Dallas", "Southwest"],
["Dallas", "Houston", "Southwest"],
["Denver", "LA", "American"],
["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
data = flight_data[i]
flights.append(Flight([data[0], data[1], data[2]], i))
# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
flight_search.add_flight(flight)
route_search = SearchTrie()
for route in routes:
route_search.add_route(route)
print(route_search.match_flight_baggage(flight_search))
As Wisblade notices in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M) which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 8 possibilities to check in the hashmap -- not bad! and possibly you find an answer much earlier.

How to get level(depth) number of two connected nodes in neo4j

I'm using neo4j as a graph database to store user's connections detail into this. here I want to show the level of one user with respect to another user in their connections like Linkedin. for example- first layer connection, second layer connection, third layer and above the third layer shows 3+. but I don't know how this happens using neo4j. i searched for this but couldn't find any solution for this. if anybody knows about this then please help me to implement this functionality.
To find the shortest "connection level" between 2 specific people, just get the shortest path and add 1:
MATCH path = shortestpath((p1:Person)-[*..]-(p2:Person))
WHERE p1.id = 1 AND p2.id = 2
RETURN LENGTH(path) + 1 AS level
NOTE: You may want to put a reasonable upper bound on the variable-length relationship pattern (e.g., [*..6]) to avoid having the query taking too long or running out of memory in a large DB). You should probably ignore very distant connections anyway.
it would be something like this
// get all persons (or users)
MATCH (p:Person)
// create a set of unique combinations , assuring that you do
// not do double work
WITH COLLECT(p) AS personList
UNWIND personList AS personA
UNWIND personList AS personB
WITH personA,personB
WHERE id(personA) < id(personB)
// find the shortest path between any two nodes
MATCH path=shortestPath( (personA)-[:LINKED_TO*]-(personB) )
// return the distance ( = path length) between the two nodes
RETURN personA.name AS nameA,
personB.name AS nameB,
CASE WHEN length(path) > 3 THEN '3+'
ELSE toString(length(path))
END AS distance

Merge sort with insertion function

I have two sorted linked list l1 and l2. I wanted to merge l2 into l1 while keeping l1 sorted. My return statement is l1 sorted and l2 is an empty list. How do I approach this problem by using insert_before, insert_after, and remove_node function? Any ideas?
If you know how linked lists are implemented, you can easily just iterate through both of them, using something like the approach below. If you don't know too much about linked lists you might want to read up on those first.
if(l1 == null || l2 == null){
return l1==null ? l2 : l1;
indexA = indexB = 0;
elemA = l1.get(0);
elemB = l2.get(0);
while(indexA<l1.size() || indexB<l2.size()){
if(indexA==l1.size()){
l1.insert_after(l1.size()-1,elemB);
indexA++;
indexB++;
if(indexB<l2.size()){
elemB = l2.get(indexB);
}
continue;
}
if(indexB==l2.size()){
break;
}
if(elemA<elemB){
indexA++;
if(indexA<l1.size()){
elemA = l1.get(indexA);
}
continue;
}
if(elemA>elemB){
l1.insert_before(indexA,elemB);
indexA++;
indexB++;
if(indexB<l2.size()){
elemB = l2.get(indexB);
}
}
return l1;
this is a somehow java like implementation, it iterates through both lists and merges them together on the go. I decided to only return l1 as the empty list would be somewhat useless right?, all the inefficient get() calls could be avoided by writing an own simple LinkedList class to gain access to the next and last fields, which would greatly facilitate navigation.
If you could provide additional information about the methods you have available or the language this has to be written in I could probably provide you with better explanations.
Edit: Explanation
first both elemA /elemB are given the value of the first element in the linked lists then the while loop is used to iterate through them until the index values that are used to keep track of where we are in the lists are both too high (e.g. ==size of list)
Now the first two if statements are to check wheter one of the lists is already completely iterated through, if this is the case for the first list, the next element of the second list is just added behind the last element of l1, as it must be greater than it, otherwise indexA would not have been increased. (In this if statement indexA is also incremented because the size of the list is getting bigger and indexA is compared to l1.size() ), finally the continue statement skips to the next iteration of the loop.
The second if statement in the while loop tests if the second list has already completely been iterated through, if this is true there is nothing more to merge so the loop is stopped by the break statement.
below that is more or less classic mergesort;
if the current element of l1 is smaller than the current element of l2, just go to the next element of l1 and go to next iteration of loop, otherwise insert the current element of l2 before elemA and iterate further here indexA is also incremented because otherwise it would not point to the right element anymore as an element was inserted before it.
is this sufficient?
Edit: Added testing if either of the lists is null as suggested by #itwasntme
The functions insert_before, insert_after, and remove_node are not clearly defined in the question. So assuming one of the parameters is a reference to the "list", is the other parameter an index, an iterator, or a pointer to a node? If the lists are doubly linked lists, then it makes sense for the second parameter to be an iterator or a pointer to a node (each node has a pointer to previous node, so there's no need to scan according to an index to find the prior node).
In the case of Visual Studio's std::list::sort, it uses iterators and std::list::splice to move nodes within a list, using iterators to track the beginning and ending of sub-lists during a merge sort. The version of std::list::splice used by std::list::sort combines remove_node and insert_before into a single function.

optimizing the data structure implementation

There is a stream of random characters coming like 'a''b''c''a'... and so on. At any given point in time when I query I need to get the first non repeating character. For example, for the input "abca", 'b' should be returned since a is repeated and the first non repeating character is 'b'.
There needs to be two methods, one for inserting and one for querying.
My solution is to have a linkedList to store the incoming stream characters. While I get the next character, I just compare with all the current characters and if present I will not insert into the end of linkedlist, else I will insert at the end. By this approach, the query will take O(1) since I will get the first element on the linkedlist and insert will take O(n) since I need to compare from the first element till the last element in the worst case.
Is there any better performing way?
Either you haven't explained your algorithm well or it won't return the correct result. In the example a b a, would your algorithm return a (because it is the first element in the linked list)?
Anyway, here is a modification that improves performance. The idea is to use a hash map from characters to (doubly) linked list nodes. This map can be used to determine if a character has already been inserted and to get to the required node quickly. We should allow a null value for the map target (instead of the list node) to express a character that has ocurred more than once already.
The insertion method works as follows:
Check if the map contains the current character (O(1)). If not, add it to the end of the list and add a reference to the map (O(1)).
If the character is already in the map: Check if the pointed to node is null (O(1)). If so, just ignore it. If it is not, remove the pointed to node from the list and update the reference to a null value (O(1)).
Overall, a O(1) operation.
The query works as in your previous solution.
Here is a C# implementation. It's basically a 1:1 translation of the above explanation:
class StreamAnalyzer
{
LinkedList<char> characterList = new LinkedList<char>();
Dictionary<char, LinkedListNode<char>> characterMap
= new Dictionary<char, LinkedListNode<char>>();
public void AddCharacter(char c)
{
LinkedListNode<char> referencedNode;
if (characterMap.TryGetValue(c, out referencedNode))
{
if(referencedNode != null)
{
characterList.Remove(referencedNode);
characterMap[c] = null;
}
}
else
{
var node = new LinkedListNode<char>(c);
characterList.AddLast(node);
characterMap.Add(c, node);
}
}
public char? GetFirstNonRepeatingCharacter()
{
if (characterList.First == null)
return null;
else
return characterList.First.Value;
}
}

Data structure optimized for adding items (at end of list), iterating, and removing items

I need a data structure which will support the following operations in a performant manner:
Adding an item to the end of the list
Iterating through the list in the order the items were added to it (random access is not important)
Removing an item from the list
What type of data structure should I use? (I will post what I am currently thinking of as an answer below.)
You should use a linked list. Adding an item to the end of the list is O(1). Iterating is easy, and you can remove an item from any known position in the list in O(1) as well.
It sounds like a linked list, however there's a catch you need to consider. When you say "removing an item from the list", it depends on whether you have the "complete item" to remove, or just its value.
I will clarify: let's say your values are strings. You can construct a class/struct containing a string and two linking pointers (forwards and backwards). When given such a class, it's very easy to remove it from the list in O(1). In pseudo code, removing item c looks like this (please disregard validation tests):
c.backwards = c.forwards
if c.backwards = null: head = c.forwards
if c.forwards = null: tail = c.backwards
delete c
However, if you wish to delete the item containing the string "hello", that would take O(n) because you would need to iterate through the list.
If that's the case I would recommend using a combination of a linked list and and hash table for O(1) lookup. Inserting to the end of the list (pseudo code):
new_item = new Item(value = some_string, backwards = tail, forwards = null)
tail.forwards = new_item
tail = new_item
hash.add(key = some_string, value = new_item)
Scanning through the list is just scanning through the linked list, no problems:
i = head
while i != null:
... do something with i.value ...
i = i.forwards
Removing an item from the list by value (pseudo code, no validation testing):
item_to_remove = hash.find_by_key(some_string)
if (item_to_remove != null):
hash.delete_key(some_string)
item_to_remove.backwards = item_to_remove.forwards
if item_to_remove.forwards = null: tail = item_to_remove.backwards
if item_to_remove.backwards = null: head = item_to_remove.forwards
delete item_to_remove
I am thinking of using a simple list of the items. When a new item is added I will just add it to the end of the list. To remove an item, I won't actually remove it from the list, I will just mark it as deleted, and skip over it when iterating through the items.
I will keep track of the number of deleted items in the list, and when more than half of the items are deleted, I will create a new list without the deleted items.
Circular & Doubly-Linked List. It satisfies all 3 requirements:
Adding an item to the end of the list: O(1). By add to Head->prev. It supports Iterating through the list in the same order in which they were added. You can remove any element.
Assuming Java (other languages have similar structures, but I found the JavaDocs first):
ArrayList if you have the index of the item you want to delete
LinkedHashMap if you only have the item and not its position

Resources