I am using dynatree to load geographical locations in a hierarchical fashion. I have to programmatically select a large number of nodes depending on the response from a web service, and rendering the selection in the GUI takes a long time: in Firefox at least 3 minutes, and in IE 8 I get a slow-script error. I use the following code to select the nodes in a loop:
tree.getNodeByKey(data).select()
Any help would be appreciated.
If the server could set the select attribute of the nodes when sending them to the client, this would be more efficient, of course.
If that is not an option, you may consider another pattern, assuming you have an array of keys that should be selected:
Use tree.visit() to iterate over all nodes, and call node.select() if node.data.id is a member of the array.
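For illustration, a minimal sketch of that pattern (TypeScript-flavored; responseKeys stands in for the keys your web service returns, and enableUpdate(), if your dynatree version provides it, suppresses the per-select re-rendering that usually dominates the time):

declare const $: any; // jQuery, which dynatree builds on
declare const responseKeys: string[]; // keys returned by the web service

const tree = $("#tree").dynatree("getTree");
const wanted = new Set<string>(responseKeys); // O(1) membership tests

tree.enableUpdate(false); // suppress redraws during the batch
tree.visit((node: any) => {
  // node.data.key is what getNodeByKey() matches on; use node.data.id
  // instead if your nodes carry an id field as in the answer above.
  if (wanted.has(node.data.key)) {
    node.select(true);
  }
});
tree.enableUpdate(true); // render once at the end

This way the tree is walked once with O(1) lookups, instead of one getNodeByKey() search per selected key.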
I need to display a cryptocurrency price graph similar to what is done on CoinMarketCap: https://coinmarketcap.com/currencies/bitcoin/
There could be gigabytes of data for one currency pair over a long period of time, so sending all the data to the client is not an option.
After doing some research I ended up using the Douglas-Peucker line-approximation algorithm: https://www.codeproject.com/Articles/18936/A-C-Implementation-of-Douglas-Peucker-Line-Appro It reduces the number of points sent to the client, but there's one problem: every time new data arrives I have to go through all the data on the server again, and since I'd like to update the client in real time, that takes a lot of resources.
So I'm thinking about some kind of progressive algorithm where, say, if I need to display data for the last month I can split the data into 5-minute intervals, preprocess only the last interval, and when it's completed, remove the first one. I'm debating between customising the Douglas-Peucker algorithm (though I'm not sure it fits this scenario) and finding an algorithm designed for this purpose (any hint would be highly appreciated).
Constantly re-computing the entire set of reduction points when new data arrives would change your graph continuously, so the graph would lack consistency: the graph seen by one user would differ from the graph seen by another, the graph would change when a user refreshes the page (this shouldn't happen!), and even after a server/application shutdown your data needs to be consistent with what it was before.
This is how I would approach it:
Your already-reduced points should stay as they are. Suppose you are getting data every second and you have computed reduced points for a 5-minute-interval graph: save those points in a limiting (bounded) queue. Then gather all the per-second data for the next 5 minutes, perform the reduction operation on those 300 data points, and add the resulting reduced point to the queue.
I would make the queue thread-safe: the main thread returns the data points in the queue whenever there is an API call, and a worker thread computes the reduction point for each 5-minute window once the data for the entire interval is available.
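A minimal sketch of that idea (hypothetical names; the per-window reduction is a placeholder average where Douglas-Peucker or a regression could be plugged in):

interface Point { t: number; price: number; }

class BoundedQueue<T> {
  private items: T[] = [];
  constructor(private capacity: number) {}
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) this.items.shift(); // drop oldest
  }
  snapshot(): T[] { return [...this.items]; }
}

const WINDOW = 300; // 5 minutes of 1-second ticks
const reduced = new BoundedQueue<Point>(8640); // one month of 5-minute points
let current: Point[] = [];

// Placeholder reduction: the average price over the window.
function reduceWindow(pts: Point[]): Point {
  const avg = pts.reduce((sum, p) => sum + p.price, 0) / pts.length;
  return { t: pts[pts.length - 1].t, price: avg };
}

function onTick(p: Point): void {
  current.push(p);
  if (current.length === WINDOW) {
    reduced.push(reduceWindow(current)); // API calls serve reduced.snapshot()
    current = [];
  }
}

Already-reduced points are never recomputed, so every client sees the same history.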
I'd use a tree.
Each node contains "precision" and "average" values.
"precision" means the date-range. For example: 1 minute, 10 minutes, 1 day, 1 month, etc. This also means a level in the tree.
"average" is the value that best represents the price for a range. You can use a simple average, a linear regression, or whatever you decide as "best".
So if you need 600 points (say that's your window size), you can find the precision with prec = total_date_range / 600, with some rounding to your existing ranges.
Once you have prec, you just need to retrieve the nodes at that level.
Since there are gigabytes of data, I'd slice them into std::vector objects. The lowest tree nodes would store ids into these vectors; the rest of the nodes could likewise be implemented with indices into vectors.
Updating with new data then only requires updating one branch (or even creating a new one) starting from the root, touching relatively few sub-nodes.
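A sketch of the level selection (made-up granularities; the point is just rounding the ideal spacing up to an existing level):

// Available precisions, finest first, in milliseconds.
const LEVELS_MS: number[] = [
  60_000,        // 1 minute
  600_000,       // 10 minutes
  86_400_000,    // 1 day
  2_592_000_000, // ~1 month (30 days)
];

// Pick the finest precision whose spacing is at least rangeMs / points,
// so querying that level returns at most `points` nodes.
function pickPrecision(rangeMs: number, points = 600): number {
  const ideal = rangeMs / points;
  return LEVELS_MS.find((ms) => ms >= ideal) ?? LEVELS_MS[LEVELS_MS.length - 1];
}

// A 30-day window at 600 points gives an ideal spacing of ~72 minutes,
// so the 1-day level is the first one coarse enough.
console.log(pickPrecision(30 * 86_400_000)); // 86400000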
I have an RDD called myRdd: RDD[(Long, String)] (the Long is an index obtained with zipWithIndex()) with a number of elements, but I need to cut it down to a specific number of elements for the final result.
I am wondering which is the better way to do this:
myRdd.take(num)
or
myRdd.filterByRange(0, num)
I don't care about the order of the selected elements, but I do care about the performance.
Any suggestions? Any other way to do this? Thank you!
take is an action and filterByRange is a transformation. An action sends the results to the driver node, while a transformation is not executed until an action is called.
The take method takes the first n elements of the RDD and sends them back to the driver. filterByRange is a little more sophisticated, since it keeps the elements whose keys fall between the specified bounds.
I'd say there are not many differences between them in terms of performance. If you just want to send the results to the driver without caring about order, use take. However, if you want to benefit from distributed computation and don't need to send the results back to the driver, use filterByRange and then call an action.
Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps.
Calculate <string (url) -> int (sum of visits across all distributed maps)> and get the top ten in the combined map.
Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.
I have now come across quite a few questions of this type, where processing needs to be done over a large distributed system. I can't think of or find a suitable answer.
All I could think of is brute force, which in one way or another violates the given constraint.
It says you can't use map-reduce directly, which is a hint that the author of the question wants you to think about how map-reduce works, so we will just mimic its actions:
pre-processing: let R be the number of servers in the cluster, and give each server a unique id from 0, 1, 2, ..., R-1.
(map) For each (string, id) pair, send the tuple to the server whose id is hash(string) % R.
(reduce) Once step 2 is done (simple control communication), produce the (string, count) pairs of the top 10 strings per server. Note that the tuples considered are those sent to this particular server in step 2.
(map) Each server then sends its top 10 to one server (let it be server 0). This should be fine, since there are only 10*R such records.
(reduce) Server 0 will yield the top 10 across the network.
Notes:
The problem with this algorithm, as with most big-data algorithms that don't use a framework, is handling failing servers. MapReduce takes care of that for you.
The above algorithm can be translated into a two-phase map-reduce algorithm pretty straightforwardly.
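A single-process simulation of those steps (hypothetical names; real servers would exchange the tuples over the network instead of sharing memory):

// Each "server" holds a local url -> visits map. Routing every url to
// the bucket hash(url) % R guarantees each url is summed in exactly one
// place, so the global top 10 must appear among the per-bucket top 10s.
function hash(s: string): number {
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

function topTen(servers: Map<string, number>[]): [string, number][] {
  const R = servers.length;

  // (map) route each (url, visits) tuple to server hash(url) % R
  const buckets: Map<string, number>[] =
    Array.from({ length: R }, () => new Map());
  for (const server of servers) {
    for (const [url, visits] of server) {
      const b = buckets[hash(url) % R];
      b.set(url, (b.get(url) ?? 0) + visits);
    }
  }

  // (reduce) local top 10 per bucket, (map) collected on "server 0"
  const merged: [string, number][] = [];
  for (const b of buckets) {
    merged.push(...[...b.entries()].sort((a, z) => z[1] - a[1]).slice(0, 10));
  }

  // (reduce) global top 10
  return merged.sort((a, z) => z[1] - a[1]).slice(0, 10);
}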
In the worst case, any algorithm that does not transmit the whole frequency table is going to fail: we can construct a trivial case where the global top 10 all sit near the bottom of every individual machine's list.
If we assume that the frequencies of URIs follow Zipf's law, we can come up with effective solutions. One such solution follows.
Each machine sends its top K elements, where K depends solely on the available bandwidth. One master machine aggregates the frequencies and finds the 10th-highest aggregated frequency value, "V10" (note that this is a lower bound: since the global top 10 may not be in every machine's top K, the sums are incomplete).
In the next step, every machine sends a list of the URIs whose local frequency is at least V10/M (where M is the number of machines). The union of all such lists is sent back to every machine, and each machine in turn sends back its frequencies for that list. The master aggregates these into the final top-10 list.
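A sketch of the two rounds, simulated with in-memory maps (hypothetical names). The threshold is sound because any URI with global count >= V10 must have local count >= V10/M on at least one of the M machines:

function topK(m: Map<string, number>, k: number): [string, number][] {
  return [...m.entries()].sort((a, z) => z[1] - a[1]).slice(0, k);
}

function globalTopTen(machines: Map<string, number>[], k: number): [string, number][] {
  const M = machines.length;

  // Round 1: sum each machine's top K; the 10th partial sum is V10.
  const partial = new Map<string, number>();
  for (const m of machines)
    for (const [uri, n] of topK(m, k))
      partial.set(uri, (partial.get(uri) ?? 0) + n);
  const v10 = [...partial.values()].sort((a, z) => z - a)[9] ?? 0;

  // Round 2: collect every URI whose local frequency is >= V10 / M ...
  const candidates = new Set<string>();
  for (const m of machines)
    for (const [uri, n] of m)
      if (n >= v10 / M) candidates.add(uri);

  // ... then sum exact frequencies for just those candidates.
  const exact = new Map<string, number>();
  for (const uri of candidates) {
    let total = 0;
    for (const m of machines) total += m.get(uri) ?? 0;
    exact.set(uri, total);
  }
  return [...exact.entries()].sort((a, z) => z[1] - a[1]).slice(0, 10);
}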
A DVR needs to store a list of programs to record. Each program has a starting time and duration. This data needs to be stored in a way that allows the system to quickly determine if a new recording request conflicts with existing scheduled recordings.
The issue is that merely checking for a show with a conflicting start time is inadequate, because the end of a longer program can overlap a shorter one. I suppose one could create a data structure that tracks the availability of each time slice, perhaps at half-hour granularity, but that fails if we cannot assume all shows start and end on half-hour boundaries, and tracking at minute granularity seems inefficient in both storage and lookup.
Is there a data structure that allows one to query by range, where you supply the lower and upper bound and it returns a collection of all elements that fall within or overlap that range?
An interval tree (maybe using the augmented tree data structure?) does exactly what you're looking for. You'd enter all scheduled recordings into the tree and when a new request comes in, check whether it overlaps any of the existing intervals. Both this lookup and adding a new request take O(log(n)) time, where n is the number of intervals currently stored.
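A minimal sketch of such a tree (unbalanced for brevity; putting the same max-endpoint augmentation on a self-balancing BST such as a red-black tree gives the O(log n) guarantee). Intervals are half-open [start, end) so back-to-back recordings don't conflict:

interface Interval { start: number; end: number; }

class IvNode {
  max: number; // largest `end` in this subtree (the augmentation)
  left: IvNode | null = null;
  right: IvNode | null = null;
  constructor(public iv: Interval) { this.max = iv.end; }
}

class IntervalTree {
  private root: IvNode | null = null;

  insert(iv: Interval): void {
    this.root = this.insertAt(this.root, iv);
  }

  private insertAt(node: IvNode | null, iv: Interval): IvNode {
    if (!node) return new IvNode(iv);
    if (iv.start < node.iv.start) node.left = this.insertAt(node.left, iv);
    else node.right = this.insertAt(node.right, iv);
    node.max = Math.max(node.max, iv.end); // maintain the augmentation
    return node;
  }

  // Returns one stored interval overlapping [start, end), or null.
  findOverlap(start: number, end: number): Interval | null {
    let node = this.root;
    while (node) {
      if (node.iv.start < end && start < node.iv.end) return node.iv;
      // Go left only if the left subtree can still hold an interval
      // ending after `start`; otherwise any overlap must be right.
      node = node.left && node.left.max > start ? node.left : node.right;
    }
    return null;
  }
}

// Usage: reject a recording request that conflicts with the schedule.
const schedule = new IntervalTree();
schedule.insert({ start: 60, end: 120 });
console.log(schedule.findOverlap(90, 150));  // { start: 60, end: 120 }
console.log(schedule.findOverlap(120, 180)); // null: back-to-back is fine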
I need to store an unordered set of items in a manner that allows for fast
Insert
Membership testing (and/or intersection)
Random subset retrieval
Redis seems like a great candidate for this kind of storage, but as I read the docs, there's not one data type that fits this perfectly. Having a SUBSET command for the Set type would be perfect.
What's the best way to store and query this kind of data structure?
In what way does the regular Redis set not meet your criteria? Inserts and membership testing/intersection are obviously built in. Sets also have SRANDMEMBER to retrieve a random member of a set. You could call it multiple times to retrieve a subset of items (though there is the potential to get the same member back multiple times).
If the size of the set is large, and the size of the subset is small, this likely would not be that big of a deal. It gets trickier as the size of the subset grows compared to the size of the overall set (though eventually it gets cheaper to just randomly select the items that you don't want in the subset and then just do a set difference on it).
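One detail worth adding: since Redis 2.6, SRANDMEMBER also accepts a count argument, and a positive count returns that many distinct members in a single round trip, which sidesteps the duplicate problem entirely. A sketch with the ioredis client (key names are made up):

import Redis from "ioredis";

const redis = new Redis(); // assumes a Redis server on localhost

async function demo(): Promise<void> {
  await redis.sadd("items", "a", "b", "c", "d", "e"); // fast inserts

  const isMember = await redis.sismember("items", "c"); // membership: 1
  const common = await redis.sinter("items", "other");  // intersection

  // Random subset: positive count => distinct members (Redis >= 2.6)
  const subset = await redis.srandmember("items", 3);

  console.log(isMember, common, subset);
}

demo().finally(() => redis.disconnect());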