d3js force large number of nodes - d3.js

pl. help me with this noob questions. I want to show a network with large number (70000) of nodes, and 2.1 million links in force layout. Looking for a good and scalable way to do this.
How do we actually show such large nodes practically, can we do some kind of approximation and show semantically same network (e.g: http://www.visualcomplexity.com/vc/project.cfm?id=76 )
How do we actually reduce such data in back end [ say using KDE ? We cannot afford to use science.js in front end as the volume is large ]
Initial view can be the network with pre-determined locations of the nodes or clusters. How do we predertmine the locations in back end, before sending the data to d3js. Do we have to use topojson ?
Any such examples are available using d3js (and a backend - say java, python etc) ?

Sorry about the question, but do you really need to show all that information in one shot?
If you really need it, have first a look with Gephi and see what it looks like, then pass to the next step.
If you see that you can focus on specific nodes or patterns at the beginning and then explore the result of the chart, probably this is the best solution from a performance point of view.
In case the discovery approach works but you are still having troubles with many items on the screen, just control the force layout with a time based threshold. It's not perfect but it will work for hundred nodes.
Next step
If you decide to go anyway on this path, I would recommend the followings:
Aggregate: that's probably the most useful thing you can do here: let the user interact with the data and dig in it to see more in detail. That is the best solution if you have to serve many clients.
Do not run the force directed layout on the front end with the entire network as is: it will eat all the browser resources for at least tens of minutes in any case.
Compute the layout on the back end - e.g. using JUNG or Gephi core itself in Java or NetworkX in Python - and then just display the result.
Cache the result of the point above as well: they are many even for the server if you have many clients, so cache it.
When the user drag the network, hide the links: it should speed up the computation ( sigmajs uses this trick)

Related

Cytoscape vs STRING for long list of proteins

I am mid-way through my university project, and I have run into an issue. I have a long list of around 1000 proteins that I wanted to analyse in STRING, however, my list is too large. I decided to try and utilise Cytoscape (and downloaded the stringApp), but the networks generated are still very messy. I've attached a screenshot here. Is there any way to improve the presentation of the network by downloading any Cytoscape apps or by tweaking the settings?
Thanks in advance
Well, the short answer is "no". A slightly longer answer is "it depends".
Showing a hairball really isn't helpful, usually, so you need to refine things somewhat. What is your data source (i.e. where did the 1000 proteins come from)? What do you hope to see in the network? If you are looking for particular groups of proteins (e.g. complexes), you would probably want to use MCL to cluster them first. If you have some other data you want to map, such as transcriptomic or proteomic data, you could refine your network based on fold change or abundance values.
All that being said, somethings you might try. First, you are seeing the "fast" version of the network. Try clicking on the show graphics details button (the diamond in the network view tool bar). That will give you the full graphics details. Second, you might try spreading the network out a bit by using the Layout->Layout Tools. Turn off the "Selected Only" and then adjust the scale. Finally, depending on your biological question, you might want to eliminate proteins that are only present in the nucleus or cytoplasm, or are only in lung tissue. This is all possible using the sliders provided by the stringApp's Results Panel.
-- scooter

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server(servers if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
This is obviously quite easy to develop the naive algorithm for, but then I am trying to learn to scale this so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and the possibility of handling a very large (500k +) nodes and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times
How do I start facing these challenges? I looked at hadoop and spark, but they seem have high latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-process that section of the graph (a kind of distributed dynamic programming approach), but im not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You have immediate information for when to roll back a transaction because it of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.

How to implement a spreadsheet in a browser?

I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused on what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work with cells being the unique key and the value being the object in the cell? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph but I was unsure of how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that its in a browser as compared to a separate application add any difficulties?
Thanks!
For an interview this is a good question. If this was asked as an actual task in your job, then there would be a simple answer of use a third party component, there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively you can go the data route, talk about data structures to hold a spreadsheet, talk exactly about links between cells for formulas, talk about how to detect and deal with circular references, talk about how in a browser you have less control over memory and for very large spreadsheets you could run into problems earlier. You can talk about what is available in JavaScript vs a native language and how this impacts the data structures and calculations. Also along with data, a big important issue with spreadsheets is numerical accuracy and floating point number calculations. Floating point numbers are made to be fast but are not necessarily accurate in extreme levels of precision and this leads to a lot of confusing questions. I believe very recently Excel switched to their own representation of a fixed decimal number as it's now viable to due spreadsheet level calculations without using the built-in floating point calculations. You can also talk about data structures and calculation and how they affect performance. In a browser you don't have threads (yet) so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script. You need to break up the calculation.
Finally you can run form the user experience angle. How is the experience in a browser different from a native application? What are the advantages and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (example, associate your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported).
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer that is specifically looking for the answer they want and in that case you're pretty much out of luck unless you're telepathic.
You can say hopelessly too much about this. I'd probably start with:
If most of the cells are filled, use a simply 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant have each cell be a vertex and each reference to another cell be a directed edge. This would allow you to do checks for circular references fairly easily, and allow for efficiently updating of all cells that need to change.
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are current visible)
How are you sending updates to the server (are you sending every change or keeping a collection of changed cells and only sending updates on save, or are you not storing changes separately and just sending the whole grid across during save)
Auto-save should probably be considered as well
Will you have an "undo", will this only be local, if not, how will you handle this on the server and how will you send through the updates
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
Looking at the CSS cursor property just begs for one to create
a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose built for tabular
data.
Resizing cell height and width is achievable with offsetX and
offsetY.
Storing the data is trivial. It can be Mongo, mySQL, Firebase,
...whatever. On blur, send update.
Javascrip/ECMA is more than capable of delivering all the Excel built-in
functions. Did I mention web workers?
Need to increment letters as in column ID's? I got you covered.
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.

How to handle large numbers of pushpins in Bing Maps

I am using Bing Maps with Ajax and I have about 80,000 locations to drop pushpins into. The purpose of the feature is to allow a user to search for restaurants in Louisiana and click the pushpin to see the health inspection information.
Obviously it doesn't do much good to have 80,000 pins on the map at one time, but I am struggling to find the best solution to this problem. Another problem is that the distance between these locations is very small (All 80,000 are in Louisiana). I know I could use clustering to keep from cluttering the map, but it seems like that would still cause performance problems.
What I am currently trying to do is to simply not show any pins until a certain zoom level and then only show the pins within the current view. The way I am currently attempting to do that is by using the viewchangeend event to find the zoom level and the boundaries of the map and then querying the database (through a web service) for any points in that range.
It feels like I am going about this the wrong way. Is there a better way to manage this large amount of data? Would it be better to try to load all points initially and then have the data on hand without having to hit my web service every time the map moves. If so, how would I go about it?
I haven't been able to find answers to my questions, which usually means that I am asking the wrong questions. If anyone could help me figure out the right question it would be greatly appreciated.
Well, I've implemented a slightly different approach to this. It was just a fun exercise, but I'm displaying all my data (about 140.000 points) in Bing Maps using the HTML5 canvas.
I previously load all the data to the client. Then, I've optimized the drawing process so much that I've attached it to the "Viewchange" event (which fires all the time during the view change process).
I've blogged about this. You can check it here.
My example does not have interaction on it but could be easily done (should be a nice topic for a blog post). You would have thus to handle the events manually and search for the corresponding points yourself or, if the amount of points to draw and/or the zoom level was below some threshold, show regular pushpins.
Anyway, another option, if you're not restricted to Bing Maps, is to use the likes of Leaflet. It allows you to create a Canvas Layer which is a tile-based layer but rendered in client-side using HTML5 canvas. It opens a new range of possibilities. Check for example this map in GisCloud.
Yet another option, although more suitable to static data, is using a technique called UTFGrid. The lads that developed it can certainly explain it better than me, but it scales for as many points as you want with a fenomenal performance. It consists on having a tile layer with your info, and an accompanying json file with something like an "ascii-art" file describing the features on the tiles. Then, using a library called wax it provides complete mouse-over, mouse-click events on it, without any performance impact whatsoever.
I've also blogged about it.
I think clustering would be your best bet if you can get away with using it. You say that you tried using clustering but it still caused performance problems? I went to test it out with 80000 data points at the V7 Interactive SDK and it seems to perform fine. Test it out yourself by going to the link and change the line in the Load module - clustering tab:
TestDataGenerator.GenerateData(100,dataCallback);
to
TestDataGenerator.GenerateData(80000,dataCallback);
then hit the Run button. The performance seems acceptable to me with that many data points.

How can I store a graph and run page rank like analytics on it hbase?

Sorry if this question seems a bit complex but I think its all related so I wanted try to get the answer in one shot. Basically I have a layered graph*, that has various sets of data that are connected to only the next set of data(so set1 has vertexes that have edges to set2, and so on but set1 has nothing connecting to set3 or anything other than set2. It might be relevant not sure). Generally, you can think of my data as one massive family tree(every set I add about a billion nodes) that I keep loading new generations with every new set(families create new families and no edges go backwards).
I have an Hbase/hadoop system running and I know how to use java to add columns and values, but what I don't know how to do is:
add data to hbase in a graph type format(since its hbase, I want to load it in a way that I can add a ton of data and it'll scale..unlike other databases that limit graphs to the size of the system). I know how to add data but don't understand how to do it in a scalable graph way.
Once the graph is loaded I want to know how to apply some kind of analytics to it. Pagerank is popular so I thought I would say it, but pretty much anything that is based on processing a graph.
I guess the simplified way of asking the question is how to do I specifically get a graph into hbase and once its there how do I analyze it? Is there a tutorial? There's a lot of hbase information on the internet(I read the hbase book) but I could not find anything specific to graphs. I found, giraph, but I don't think it can connect to hbase(yet). Seeing how hadoop/hbase are versions of mapreduce/bigtables I suspect there is a way to process graphs I'm just not having luck finding anything.
*A layered graph is a directed graph with a level for different set of vertex's, like so: http://en.wikipedia.org/wiki/Layered_graph_drawing
I think this question on SO could help:
https://stackoverflow.com/questions/9865738/is-it-possible-to-store-graphs-hbase-if-so-how-do-you-model-the-database-to-sup/9867563#9867563
This part of my answer to this question might be of use.
Using HBase/Accumulo as input to giraph has been submitted recently (7
Mar 2012) as a new feature request to Giraph: HBase/Accumulo Input
and Output formats (GIRAPH-153)
We use giraph in this way, it only store minimum data in each vertex, and then run the graph algorithm with giraph, then we assemble the result with rich data using pig, for page rank algo, each vertex only needs to store vertex id, rank, thus it could scale to almost billion level.

Resources