I'm planning a D3.js application that will display a network graph (i.e. nodes and edges--not a line plot or bar chart, etc.). Only some nodes and edges need to be displayed at any given moment, and the attributes of nodes and edges will change too--all in response to user interaction. So far, so good--I know that d3.js can do this sort of thing, as illustrated by the force-collapsible example and the health and wealth of nations example. It would be simplest to keep all of the data in a single JSON or XML object.
I'm worried that if my application loads all of the data needed for all parts of the network at any time, I'll overwhelm the user's system. A typical network will have 35000 nodes, with attributes that vary at up to 5000 timesteps. (This is about 4GB in a GEXF format XML file with unnecessary whitespace removed.)
Is there a way to request only part of a JSON or XML object, i.e. only those parts of the tree that I need at a given time? Or will I have to do something more complicated? Any pointers to options to investigate would be appreciated.
(This might be FAQ, but it's one of those things that's difficult to search on.)
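As an aside on what "request only part of the object" usually ends up meaning in practice: a static JSON or XML file is normally fetched whole, so partial loading tends to require a small server in front of the data. Below is a minimal sketch, assuming a Flask server, an invented /graph endpoint, and an invented data layout, of returning just the requested nodes at one timestep; a D3 client could then call d3.json("/graph?ids=n1,n2&t=42") whenever the visible subset changes.

```python
# Minimal sketch (not the asker's code): keep the full graph in server memory and
# return only the requested nodes at a single timestep. Flask, the /graph route,
# the query parameters, and the data layout are all assumptions for illustration.
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed layout: {"nodes": {id: {"edges": [...], "attrs": {timestep: {...}}}}}
with open("network.json") as f:
    GRAPH = json.load(f)

@app.route("/graph")
def graph_slice():
    ids = request.args.get("ids", "").split(",")   # e.g. ?ids=n1,n2&t=42
    t = request.args.get("t", "0")
    nodes = {}
    for node_id in ids:
        node = GRAPH["nodes"].get(node_id)
        if node:
            nodes[node_id] = {
                "edges": node["edges"],
                "attrs": node["attrs"].get(t, {}),  # one timestep's attributes only
            }
    return jsonify(nodes)
```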
Related
I am working on a project that involves many clients connecting to a server (or servers, if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge at any time and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc.).
The naive algorithm for this is obviously quite easy to develop, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
With a constantly updating graph, I need to process the whole graph every time someone requests information, which will increase computation time and latency quite a bit.
With a very large graph, the computation time and latency will obviously be a lot higher. (I read that some companies remedy this by batch processing a ton of results and storing them with an index for later use, but since my graph is constantly updated and users want the most up-to-date info, this is not a viable solution.)
A large number of users requesting information will put quite a load on the servers, since they have to process the graph that many times.
How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to have either high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You do have immediate information about when to roll back a transaction because of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
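To make the difference concrete, here is a toy sketch (mine, not from the question) of option 2: writers serialize against each other, while readers answer queries from a snapshot that is allowed to trail the live graph by a bounded number of seconds. All class and constant names are invented.

```python
# Toy illustration of "everything transacted more than X seconds ago" semantics.
import copy
import threading
import time

MAX_LAG = 5  # seconds a published snapshot may trail the live graph (the "X" above)

class SnapshotGraph:
    def __init__(self):
        self._live = {"nodes": set(), "edges": set()}
        self._snapshot = copy.deepcopy(self._live)
        self._snapshot_time = time.time()
        self._lock = threading.Lock()

    def add_edge(self, u, v):
        with self._lock:                   # writes serialize against each other
            self._live["nodes"].update((u, v))
            self._live["edges"].add((u, v))

    def read_view(self):
        # Queries run against the last published snapshot; it is refreshed only
        # once it has grown stale. A real system would refresh it off the request
        # path rather than copying a large graph while holding the lock.
        with self._lock:
            if time.time() - self._snapshot_time > MAX_LAG:
                self._snapshot = copy.deepcopy(self._live)
                self._snapshot_time = time.time()
        return self._snapshot
```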
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
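As a rough illustration of that split, here is a bare-bones sketch (names invented, no networking) where the server only stores raw edges and pushes change notifications, and each client keeps its own copy and does the computation locally.

```python
# Sketch of "server holds raw data and notifies; clients compute".
from collections import defaultdict, deque

class RawGraphServer:
    """Stores raw edges only and notifies subscribers of every change."""
    def __init__(self):
        self.edges = []            # raw data: list of (u, v) pairs
        self.subscribers = []      # callables invoked with each new edge

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def add_edge(self, u, v):
        self.edges.append((u, v))
        for notify in self.subscribers:   # push the delta, not any computed result
            notify(u, v)

class Client:
    """Keeps a local copy of the raw data and does all computation itself."""
    def __init__(self, server):
        self.adj = defaultdict(set)
        for u, v in server.edges:         # initial sync of raw data
            self.on_edge(u, v)
        server.subscribe(self.on_edge)

    def on_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def shortest_path_length(self, a, b):
        # Plain BFS over the client's local copy; the server never runs this.
        seen, queue = {a}, deque([(a, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == b:
                return dist
            for nxt in self.adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None
```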
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, the client is 3rd-party software you have no control over and it expects processed data, not raw data), THEN you do have a bit of an issue, so get a badass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestion on a scaled configuration.
Working in Xcode, Cocoa, Objective C
I'm building an app which has an SQLite database holding the data. The data is a daily summary of about 6 float values over a period of potentially 40 years, although more usually 10 years.
I am writing routines which will graph the data into an NSView. The user has several options in the UI, such as whether to draw a line or bar graph, the time period to graph, whether the data is weekly or daily, etc.
There are two main functions to write here: one for updating the graph settings and one for getting data from the database in a form which the graph can handle (a nested array of dates and values).
The question I have is whether it is best to load the full set of graphable data and have the graph 'decide' which slices of data to graph, or whether to submit multiple requests to the database each time the user selects an option.
For example, if the full set of graphable data were loaded, then if the user selects the weekly option, the graph's drawRect method could simply iterate over every 7th entry in the array. Alternatively, I could ask the database to re-submit an array of graphable data.
I hope this makes sense
I think it's best to select only the data you need, rather than letting the graph decide.
There's a lot more to consider than just the quantity of data you'll be reading. There is also the number of memory allocations, how much memory will actually be used, and the memory that is used "behind the scenes" by the allocator. There is also backing store paging.
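As a sketch of the "select only what you need" approach, here is the query pattern written with Python's sqlite3 for brevity; the same SQL applies from Objective-C through the SQLite C API. The table and column names (daily_summary, day, value1) are assumptions for illustration.

```python
# Let SQLite do the slicing and the weekly thinning instead of loading the full
# 40-year history and skipping rows in drawRect.
import sqlite3

def weekly_series(db_path, start_day, end_day):
    """Return one averaged value per week for the requested date range."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT MIN(day) AS week_start, AVG(value1) AS avg_value
        FROM daily_summary
        WHERE day BETWEEN ? AND ?
        GROUP BY CAST(julianday(day) / 7 AS INTEGER)  -- one row per week
        ORDER BY week_start
        """,
        (start_day, end_day),
    ).fetchall()
    conn.close()
    return rows
```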
Please help me with these noob questions. I want to show a network with a large number (70,000) of nodes and 2.1 million links in a force layout, and I am looking for a good and scalable way to do this.
How do we actually show such a large number of nodes in practice? Can we do some kind of approximation and show a semantically equivalent network (e.g. http://www.visualcomplexity.com/vc/project.cfm?id=76)?
How do we reduce such data in the back end (say using KDE? We cannot afford to use science.js in the front end, as the volume is large)?
The initial view can be the network with pre-determined locations of the nodes or clusters. How do we predetermine the locations in the back end before sending the data to d3js? Do we have to use topojson?
Are any such examples available using d3js (and a back end, say Java, Python, etc.)?
Sorry about the question, but do you really need to show all that information in one shot?
If you really need it, first have a look at it with Gephi and see what it looks like, then move to the next step.
If you find that you can focus on specific nodes or patterns at the beginning and then explore the rest of the chart from there, that is probably the best solution from a performance point of view.
If the discovery approach works but you are still having trouble with many items on the screen, just control the force layout with a time-based threshold. It's not perfect, but it will work for a few hundred nodes.
Next step
If you decide to go down this path anyway, I would recommend the following:
Aggregate: that's probably the most useful thing you can do here. Let the user interact with the data and dig into it to see more detail. This is the best solution if you have to serve many clients.
Do not run the force-directed layout on the front end with the entire network as is: it will eat all the browser's resources for at least tens of minutes in any case.
Compute the layout on the back end - e.g. using JUNG or Gephi core itself in Java, or NetworkX in Python - and then just display the result (see the sketch after this list).
Cache the result of the point above as well: the layout runs are expensive even for the server if you have many clients, so cache them.
When the user drags the network, hide the links: it should speed up the computation (sigma.js uses this trick).
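Here is the sketch referred to above: a back-end layout pass with NetworkX (one of the libraries mentioned) whose cached JSON output is all that D3 ever receives. The file names and the output shape are assumptions.

```python
# Compute the force-directed layout once on the server, then serve positions only.
import json

import networkx as nx

G = nx.read_edgelist("edges.txt")          # the large graph lives server-side

# The expensive force-directed pass runs here, not in the browser.
pos = nx.spring_layout(G, iterations=50)

payload = {
    "nodes": [{"id": n, "x": float(x), "y": float(y)} for n, (x, y) in pos.items()],
    "links": [{"source": u, "target": v} for u, v in G.edges()],
}

with open("layout.json", "w") as f:        # cache this file and serve it to D3
    json.dump(payload, f)
```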
Is there an open data format for representing such GIS data as roads, localities, sublocalities, countries, buildings, etc.?
I expect that such a format would define an address structure and names for the components of an address.
What I need is a data format to return in response to reverse geocoding requests.
I looked for it on the Internet, but it seems that every geocoding provider defines its own format.
Should I design my own format?
Does my question make any sense at all? (I'm a newbie to GIS).
In case I have not made myself clear: I'm not looking for data formats such as GeoJSON, GML or WKT, since they define geometry and don't define any address structure.
UPD. I'm experimenting with different geocoding services and trying to isolate them in a separate module. I need to provide one common interface for all of them, and I don't want to make up one more data format (because on the one hand I don't fully understand the domain, and on the other hand the field itself seems to be well studied). The module's responsibility is to take a partial address (or coordinates) like "96, Dubininskaya, Moscow" and to return a data structure containing house number (96), street name (Dubininskaya), sublocality (Danilovsky rn), city (Moscow), administrative area (Moskovskaya oblast), country (Russia). The problem is that in different countries there might be more/fewer divisions (more/fewer address components), and I need to unify these components across countries.
Nope, there is not, unfortunately.
Why, you may ask?
Because different nations and countries have vastly different formats and requirements for storing addresses.
Here in the UK, for example, defining a postcode has quite a complex set of rules, whereas ZIP codes in the US are simple five-digit numbers (optionally extended with a four-digit suffix).
Then you have to consider the question of what exactly constitutes an address. Again, this differs not just from country to country, but sometimes drastically within the same territory.
For example (here in the UK):
Smith and Sons Butchers
10 High street
Some town
Mr smith
10 High street
Some town
The Occupier
10 High Street
Some Town
Smith and Sons Butchers
High Street
Some Town
These are all valid addresses in the UK, and in all cases the post would arrive at the correct destination; a GPS, however, may have trouble.
A GPS database might be set up so that each building is a square bit of geometry, with the ID being the house number.
That would give us the ability to say exactly where number 10 is, which means the last lookup above is immediately going to fail.
Plots may be indexed by the name of a business; again, that's fine until you start using personal names or generic titles.
There's so much variation that it's simply not possible to create one unified format that can encompass every possible rule required to allow any application on the planet to format any geocoded address correctly.
So how do we solve the problem?
Simple, by narrowing your scope.
Deal ONLY with a specific set of defined entities that you need to work with.
Hold only the information you need to describe what you need to describe (Always remember YAGNI* here)
Use standard data transmission formats such as JSON, XML and CSV; this reduces the work you'll have to do on code you don't control to get it to read your data output.
(* YAGNI = You ain't gonna need it)
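As a sketch of what "narrowing your scope" can look like in practice, here is one small, self-defined structure that a geocoding wrapper module could return regardless of which provider answered. The field names follow the components listed in the question and are not any standard.

```python
# One unified address shape for the wrapper module; every field is optional because
# different countries have more or fewer divisions (field names are assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Address:
    country: Optional[str] = None        # e.g. "Russia"
    admin_area: Optional[str] = None     # e.g. "Moskovskaya oblast"
    locality: Optional[str] = None       # e.g. "Moscow"
    sublocality: Optional[str] = None    # e.g. "Danilovsky rn"
    street: Optional[str] = None         # e.g. "Dubininskaya"
    house_number: Optional[str] = None   # e.g. "96"
    # Countries with extra or missing divisions simply leave fields unset or
    # collapse them; consuming code only ever sees these six components.
```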
Now, to dig in a little deeper:
When it comes to actual GIS data, there are a lot of standard file formats; the three most common are:
Esri Shape Files (*.shp)
Keyhole Markup Language (*.kml)
Comma separated values (*.csv)
All of the mainstay GIS packages, free and paid for, can work with any of these 3 file types, and many more.
Shapefiles are by far the most common ones you're going to come across; just about every bit of geospatial data I've come across in my years in IT has been in a shapefile. I would, however, NOT recommend storing your data in them for processing: they are quite a complex format, often slow and sequential to access.
If your geometry files are to be consumed by other systems, however, you can't go wrong with them.
They also have the added bonus that you can attach attributes to each item of data too, such as address details, names etc.
The problem is, there is no standard as to what you would call the attribute columns or what you would include, and, probably more drastically, the column names are restricted to UPPERCASE and limited to 10 characters in length.
KML files are another quite universally recognized format, and because they're XML-based and used by Google, you can include a lot of extra data in them that is technically self-describing to the machine reading it.
Unfortunately, file sizes can be incredibly bulky even for just a handful of simple geometries. This trade-off does mean, though, that they are pretty easy to handle in just about any programming language on the planet.
And that brings us to the humble CSV.
The mainstay of data transfer (not just geospatial) ever since time began.
If you can put your data in a database table or a spreadsheet, then you can put it in a CSV file.
Again, there are no standards, other than how columns may or may not be quoted and what the separator is, but readers have to know ahead of time what each column represents.
Also, there's no "pre-made" geographic storage element (in fact, there are no data types at all), so your reading application will also need to know ahead of time what the column data types are meant to be so it can parse them appropriately.
On the plus side, however, EVERYTHING can read them; whether they can make sense of them is a different story.
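To illustrate that last point about CSV, here is a small sketch where the reader is told, ahead of time, what each column means and what type it is; the column names and file layout are invented.

```python
# Reading a CSV of points: nothing in the file declares the column types, so the
# reader carries that knowledge itself (column names here are assumptions).
import csv

COLUMNS = {"id": str, "name": str, "lat": float, "lon": float}   # agreed in advance

def read_points(path):
    points = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Apply the agreed types to each column of the row.
            points.append({col: cast(row[col]) for col, cast in COLUMNS.items()})
    return points
```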
What do you think the performance difference would be?
20,000 nodes
Each node has a Link field. The number of values ranges from 50 to 200. The links will have no title.
OR
20,000 nodes
Each node will have the links in the body field as plain text with filtered HTML, like so:
http://link1.com
http://link2.com
http://link3.com
http://link4.com
http://link5.com
http://link6.com
http://link7.com
http://link8.com
http://link9.com
http://link10.com
It really depends on how and what you are going to use them for. I doubt you are going to display 20,000 nodes at once. It's really hard to say much about performance without a specific use case, and even then you have to take caching and whatnot into consideration as well.
In any regard, CCK will probably always be a tiny bit slower, because you are extracting multiple values instead of a single value, which makes the query a tiny bit more complex. I doubt that you will be able to measure that on your Drupal site, though.
Another thing to keep in mind is that using CCK fields will give you added flexibility, as it integrates well with Views. So you can easily pull out the links and format them in different ways.