Is there any standard way to keep NetFlow data (aggregated by ports and destination IP, distinguished by source IP)?
Data input: NetFlow records (source IP, timestamp, octets); updates are very frequent
Request input: IP, range (two timestamps)
Request output: number of octets
Is it possible to get O(log n) or better for both storing data and answering requests? How?
Have a (hash) map of IP to a binary search tree indexed by timestamp.
To count the number of elements between two timestamps efficiently, you could have each node store the number of nodes in its left child's subtree (similar to a rope). You can then determine the rank of both the start and the end timestamp in the BST; the difference gives the number of elements in between.
The map lookup is expected O(1), the BST queries are O(log n) each, giving O(log n) total.
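A minimal sketch of this idea in Python, assuming an unbalanced BST for brevity (a self-balancing tree is needed to guarantee the O(log n) bounds). Here each node is augmented with its subtree's octet sum, a straightforward variation of the subtree-count augmentation described above, so the original octet query can be answered directly:

    class Node:
        def __init__(self, ts, octets):
            self.ts = ts
            self.octets = octets
            self.subtree_octets = octets   # sum of octets in this subtree
            self.left = None
            self.right = None

    def insert(node, ts, octets):
        if node is None:
            return Node(ts, octets)
        node.subtree_octets += octets
        if ts < node.ts:
            node.left = insert(node.left, ts, octets)
        else:
            node.right = insert(node.right, ts, octets)
        return node

    def octets_before(node, ts):
        # Sum of octets for records with timestamp < ts.
        if node is None:
            return 0
        if ts <= node.ts:
            return octets_before(node.left, ts)
        left = node.left.subtree_octets if node.left else 0
        return left + node.octets + octets_before(node.right, ts)

    trees = {}  # source IP -> root of its timestamp-indexed tree

    def add_record(ip, ts, octets):
        trees[ip] = insert(trees.get(ip), ts, octets)

    def query(ip, start, end):
        # Octets with start <= timestamp < end for the given source IP.
        root = trees.get(ip)
        return octets_before(root, end) - octets_before(root, start) if root else 0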
I've a few million records (which are updated often) with 2 properties:
Timestamp
Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with k-d trees, but I see that some popular implementations support both range searches and finding the K nearest neighbors. My requirements are a bit peculiar here, so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.
If you invest in initially sorting a list of (record_name, timestamp) tuples by timestamp, and create a dictionary with record names as keys and (popularity_score, timestamp_list_idx) tuples as values, you will be able to:
Perform binary search for a particular timestamp in O(log n)
Identify the greater-than values in O(1), since they form the suffix of the sorted array
Look up the matching popularity score in O(1), since the scores are in a dictionary
Update a record's popularity score in O(1), thanks to the dictionary
Update a particular timestamp in O(1), by pulling the record's index from the tuple in the dictionary value
Suppose you have m records within the wanted timestamp range. You can then either:
Generate a max-heap from them by popularity, which takes O(m), and then perform k pops from that heap at O(k log m), since the root has to be re-established after every pop. The actions you want thus take O(m + k log m); assuming k << m, this runs in O(m).
Iterate over the m records while keeping a list of size k to track the top k popular records. After passing over all m records you will have the top k in the list. This also takes O(m).
Method 1 takes slightly more time than method 2 in terms of complexity, but if you suddenly want the (k+1)-th most popular record, you can just pop another item from the heap instead of passing over all m records again with a (k+1)-long list.
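A hedged sketch of the query side in Python, assuming `records` is the list of (record_name, timestamp) tuples kept sorted by timestamp and `scores` is the dictionary mapping record names to popularity scores (the names are illustrative; the key= form of bisect needs Python 3.10+):

    import bisect
    import heapq

    def top_k_after(records, scores, t, k):
        # records: list of (record_name, timestamp) tuples, sorted by timestamp
        # scores:  dict mapping record_name -> popularity score
        # Binary search for the first record with timestamp > t: O(log n).
        start = bisect.bisect_right(records, t, key=lambda r: r[1])
        candidates = records[start:]                    # the m matching records
        # Method 1: build a max-heap by popularity in O(m), then pop k times.
        heap = [(-scores[name], name) for name, _ in candidates]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]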
I want to find an efficient data structure that can handle the following use case.
I can add new elements to this data structure, e.g.
I call the add() API, e.g. add([2,3,4,5,3]), and the data structure then stores [2,3,3,4,5]. I can query some target and get back how many stored numbers are smaller than it, e.g. query(4) returns 3 (one 2 and two 3s). The frequencies of calling add and query are of the same order.
My first thought was a segment tree; however, the input can be any int value, so the space would be O(2^32).
Could you give me some advice about which data structure I should use?
You can do this using an order statistic tree, which is a kind of binary search tree where each node also stores the cardinality of its own subtree. Inserting into an order statistic tree still takes O(log n) time, because it's a binary search tree, although the insert operation is a little more complicated because it has to keep the cardinalities of each node up-to-date.
Computing the number of members less than a given target also takes O(log n) time; start at the root node:
If the target is less than or equal to the root node's value, then recurse on the left subtree.
Otherwise, return the left child's cardinality, plus one for the current node (or its multiplicity, if duplicates share a node), plus the result of recursing on the right subtree.
The base case is that you always return 0 for an empty subtree.
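A minimal sketch of such a tree in Python, assuming an unbalanced BST for brevity (a self-balancing variant such as a red-black tree is needed to guarantee O(log n)); duplicates are handled by storing a per-node multiplicity:

    class Node:
        def __init__(self, value):
            self.value = value
            self.count = 1      # multiplicity of this value (handles duplicates)
            self.size = 1       # cardinality of the subtree rooted here
            self.left = None
            self.right = None

    def insert(node, value):
        if node is None:
            return Node(value)
        node.size += 1
        if value == node.value:
            node.count += 1
        elif value < node.value:
            node.left = insert(node.left, value)
        else:
            node.right = insert(node.right, value)
        return node

    def count_less_than(node, target):
        if node is None:
            return 0                       # empty subtree contributes nothing
        if target <= node.value:
            return count_less_than(node.left, target)
        left_size = node.left.size if node.left else 0
        return left_size + node.count + count_less_than(node.right, target)

    root = None
    for x in [2, 3, 4, 5, 3]:
        root = insert(root, x)
    print(count_less_than(root, 4))        # 3 (one 2 and two 3s)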
Is there general pseudocode or a related data structure for getting the nth value of a B-tree? For example, the eighth value of the tree [1,4,9,9,11,11,12,13] is 13.
If I have some values sorted in a B-tree, I would like to find the nth value without having to go through the entire tree. Is there a better structure for this problem? The data could be updated at any time.
You are looking for an order statistic tree. The idea is: in addition to any data stored in the nodes, also store the size of each node's subtree, and keep those sizes updated on insertions and deletions.
Since you "touch" O(log n) nodes for each insert/delete operation, keeping the sizes up to date preserves the O(log n) behavior of these operations.
FindKth() is then done by eliminating subtrees whose largest index is still smaller than k and moving on to the next one. Since you don't need to descend into every subtree, only into the required one (checking the nodes on the path to that element), you "touch" only O(log n) nodes, which makes this operation O(log n) as well.
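A hedged sketch of FindKth on a binary order statistic tree (1-indexed), assuming each node has value, left, right, and size fields; the same idea carries over to a B-tree by keeping a subtree size per child pointer:

    def find_kth(node, k):
        left_size = node.left.size if node.left else 0
        if k <= left_size:
            return find_kth(node.left, k)               # kth value is in the left subtree
        if k == left_size + 1:
            return node.value                           # this node is the kth value
        return find_kth(node.right, k - left_size - 1)  # skip left subtree and this node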
I want to augment a binary search tree so that search, insertion and deletion are still supported in O(h) time, and then implement an algorithm that finds the sum of all node values in a given range.
You can add an additional data structure to your BST class, specifically a HashMap or Hashtable. The keys are the distinct numbers your BST contains and the values are the number of occurrences of each. search(...) is not impacted, but insert(...) and delete(...) need slight code changes.
Insert
When adding a node to the BST, check whether its value exists in the HashMap as a key. If it does, increment its occurrence count by 1. If it doesn't, add it to the HashMap with an initial value of 1.
Delete
When deleting, decrement the occurrence count in the HashMap (assuming you aren't being asked to delete a node that doesn't exist).
Sum
Now for the sum function
sum(int start, int end)
Iterate over the range and check your HashMap to see which numbers exist in it and with how many occurrences. Build the sum by adding up each value in the range that is present in the map, multiplied by its number of occurrences.
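A minimal sketch of the occurrence map and the sum method in Python; the names (on_insert, on_delete, range_sum) are illustrative and would be called from the BST's own insert/delete:

    from collections import defaultdict

    counts = defaultdict(int)   # value -> number of occurrences in the BST

    def on_insert(value):
        counts[value] += 1      # called alongside the BST insert

    def on_delete(value):
        counts[value] -= 1      # called alongside the BST delete
        if counts[value] == 0:
            del counts[value]

    def range_sum(start, end):
        # O(range size): walk every integer in [start, end] and add
        # value * occurrences for the ones present in the map.
        return sum(v * counts[v] for v in range(start, end + 1) if v in counts)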
Complexities
Space: O(n)
Time of sum method: O(range size).
All other method time complexity isn't impacted.
You didn't mention a space constraint, so hopefully this is OK. I am very interested to see if you can somehow use the properties of a BST to solve this more efficiently; nothing comes to mind for me.
We need to maintain mobileNumber and its location in memory. The challenge is that we have more than 5 million users, and storing the location for each user would mean a hash map of 5 million records.
To solve this problem, we have to work with ranges.
We are given ranges of phone numbers like
range1 start="9899123446" end="9912345678" location="a"
range2 start="9912345679" end="9999999999" location="b"
A number can belong to single location only.
We need a data structure to store these ranges in the memory.
It has to support two functions
findLocation(Integer number): returns the name of the location to which the number belongs.
changeLocation(Integer number, String range): changes the location of the number from its old location to the new one.
This is completely in memory design.
I am planning to use a tree structure where each node contains (startofrange, endofrange, location).
I will keep the nodes in sorted order. I have not finalized anything yet.
The main problem is this: when the second function is called to change a location, say moving 9899123448 to location b, the range1 node should split into 3 nodes: 1st node (9899123446, 9899123447, a), 2nd node (9899123448, 9899123448, b), 3rd node (9899123449, 9912345678, a).
Please suggest a suitable approach.
Thanks in advance
Normally you would use a specialized data structure to store ranges and implement the queries, e.g. an interval tree.
However, since phone number ranges do not overlap, you can just store the ranges in a standard tree-based data structure (a binary search tree, AVL tree, red-black tree, or B-tree would all work), sorted only by [begin].
For findLocation(number), use the corresponding tree search algorithm to find the last element whose [begin] value is not greater than the number, then check its [end] value to verify that the number lies in that range. If a match is found, return the location; otherwise the number is not in any range.
For changeLocation() operation:
Find the old node containing the number
If an existing node is found in step 1, delete it and insert new nodes
If no existing node is found, insert a new node and try to merge it with adjacent nodes.
I am assuming you are using the same operation for simply adding new nodes.
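A minimal sketch in Python, assuming the ranges are kept in a list of (begin, end, location) tuples sorted by begin, with binary search standing in for the balanced-tree lookup; only the split case from the question's example is handled:

    import bisect

    ranges = [
        (9899123446, 9912345678, "a"),
        (9912345679, 9999999999, "b"),
    ]

    def find_location(number):
        # Find the last range whose begin is <= number, then verify the end.
        i = bisect.bisect_right(ranges, (number, float("inf"), "")) - 1
        if i >= 0 and ranges[i][0] <= number <= ranges[i][1]:
            return ranges[i][2]
        return None

    def change_location(number, new_location):
        # Delete the old range and insert up to three replacement ranges,
        # splitting around the single number that moves.
        i = bisect.bisect_right(ranges, (number, float("inf"), "")) - 1
        if i < 0 or not (ranges[i][0] <= number <= ranges[i][1]):
            return
        begin, end, old_location = ranges.pop(i)
        pieces = []
        if begin <= number - 1:
            pieces.append((begin, number - 1, old_location))
        pieces.append((number, number, new_location))
        if number + 1 <= end:
            pieces.append((number + 1, end, old_location))
        ranges[i:i] = pieces   # keeps the list sorted by begin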
More practically, you can store all the entries in a database and build an index on [begin].
First of all, a range is [begin; end; location].
Use two structures:
A sorted array to store the range begins
A hash table to access ends and locations by begin
Apply the following algorithm:
Use binary search to find the nearest begin value that is not greater than the number
Use the hash table to find the end and location for that begin, and verify that the number does not exceed the end
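A short sketch of this variant in Python, assuming begins is the sorted array of range begins and by_begin is the hash table mapping each begin to its (end, location):

    import bisect

    begins = [9899123446, 9912345679]
    by_begin = {
        9899123446: (9912345678, "a"),
        9912345679: (9999999999, "b"),
    }

    def find_location(number):
        i = bisect.bisect_right(begins, number) - 1   # nearest begin <= number
        if i < 0:
            return None
        end, location = by_begin[begins[i]]
        return location if number <= end else None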