d3.js not being able to visualize a large dataset

I need some suggestions on using d3.js for visualizing big data. I am pulling data from HBase and storing it in a JSON file for visualization with d3.js. When I pull a few hours of data, the JSON file is around 100MB and d3.js can visualize it easily, although filtering with dc.js and crossfilter is a little slow. But when I pull a week's worth of data, the JSON file grows to more than 1GB; when I try to visualize it with d3.js, dc.js and crossfilter, the visualization does not work properly and filtering is no longer possible. Can anyone tell me whether there is a good solution to this, or do I need to work on a different platform instead of d3?

I definitely agree with what both Mark and Gordon have said before. But I must add what I have learnt in the past months as I scaled up a dc.js dashboard to deal with pretty big datasets.
One bottleneck, as pointed out, is the size of your dataset once it translates into thousands of SVG/DOM or Canvas elements. Canvas is lighter on the browser, but you still have a huge number of elements in memory, each with its attributes, click events, etc.
The second bottleneck is the complexity of your data. The responsiveness of dc.js depends not only on d3.js, but also on crossfilter.js. If you inspect the Crossfilter example dashboard, you will see that the size of the data they use is quite impressive: over 230,000 entries. However, the complexity of that data is rather low: just five variables per entry. Keeping your datasets simple helps a lot with scaling up. Keep in mind that five variables per entry here means about one million values in the browser's memory during visualization.
Final point: you mention that you pull the data in JSON format. While that is very handy in JavaScript, parsing and validating big JSON files is quite demanding, and JSON is not the most compact format either. The Crossfilter example data is formatted as a really simple and tight CSV file.
In summary, you will have to find the sweet spot between size and complexity of your data. One million data values (size times complexity) is perfectly feasible. Increase that by one order of magnitude and your application might still be usable.
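To make the "small and simple" advice concrete, here is a minimal sketch of loading a tight CSV and indexing only the variables you actually chart. The file name and columns (flights.csv, date, delay, distance) are made up for illustration, and it assumes d3 v5+ (promise-based d3.csv) and the crossfilter2 package.

    // Minimal sketch: load a compact CSV and keep only a few variables per row.
    d3.csv("flights.csv", d => ({
      date: new Date(d.date),   // parse once, up front
      delay: +d.delay,          // coerce strings to numbers
      distance: +d.distance
    })).then(rows => {
      const cf = crossfilter(rows);                   // all rows live in memory
      const delayDim = cf.dimension(d => d.delay);    // one dimension per chart
      const delayGroup = delayDim.group(v => Math.floor(v / 10) * 10); // 10-minute bins
      // dc.js charts would be wired to delayDim / delayGroup here.
    });

Dropping unused columns at parse time is the cheapest way to cut the size-times-complexity product mentioned above.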

As @Mark says, canvas versus DOM rendering is one thing to consider. For sure, the biggest expense in web visualization is DOM elements.
However, to some extent crossfilter can mitigate this by aggregating the data into a smaller number of visual elements. It can get you up into the hundreds of thousands of rows of data. 1GB might be pushing it, but 100s of megabytes is possible.
But you do need to be aware of what level you are aggregating at. For example, if it's a week of time-series data, bucketing by the hour is probably a reasonable visualization, giving 7*24 = 168 points. You won't actually be able to perceive many more points than that, so it is pointless to ask the browser to draw thousands of elements.
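As a rough sketch of that hour-bucketing idea, using d3.timeHour from d3-time to floor timestamps and assuming each record has a ts field (my placeholder name):

    // Sketch: bucket a week of raw rows into hourly bins so the chart only
    // ever draws about 7 * 24 = 168 points, however many rows there are.
    const cf = crossfilter(rows);
    const hourDim = cf.dimension(d => d3.timeHour(new Date(d.ts))); // floor to the hour
    const perHour = hourDim.group().reduceCount();
    // A dc.js line or bar chart bound to hourDim + perHour stays responsive,
    // because the number of SVG elements is bounded by the bucket count.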

Related

Balanced Tree structure for storing and updating 3D points

I'm on an odyssey to find a good tree structure to store and update my application data.
The data are positions in 3 dimensions (x, y, z).
They need to be updated and queried by range quickly (every 30 milliseconds). The queries would be, for example: "get all the points around (2,3,4) within a radius of 100cm".
The data is always in main memory.
Could someone recommend a good type of tree that meets these requirements?
KD-trees wouldn't work for me because they are not made to be updated at this speed; I would have to rebuild the whole tree on every update.
BKD-trees wouldn't work for me either because they are made to store data on disk (not in main memory).
Apparently R-trees are also designed to store the data in the leaves.
If you need fast updates as well as range queries, in-memory, I can recommend either a grid index or the PH-tree.
A grid index is essentially a 2D/3D array of buckets. The grid is laid over your data space and you just store your data in the bucket (= grid cell) where your point falls. For range queries you simply check all entries in all buckets that overlap with your query range.
It takes a bit of trial and error to find the best grid size.
In my experience this is the best solution in 2D with 1000 points or less. I have no experience with 3D grid indexes.
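For illustration, here is a minimal in-memory 3D grid index sketch in JavaScript. The cell size, the string key for cell coordinates and the brute-force distance check inside candidate cells are my own choices, not a reference implementation.

    // Minimal 3D grid index: buckets keyed by integer cell coordinates.
    class GridIndex {
      constructor(cellSize) {
        this.cellSize = cellSize;
        this.cells = new Map();              // "cx,cy,cz" -> array of points
      }
      key(x, y, z) {
        const c = this.cellSize;
        return `${Math.floor(x / c)},${Math.floor(y / c)},${Math.floor(z / c)}`;
      }
      insert(p) {                            // p = {x, y, z, ...}
        const k = this.key(p.x, p.y, p.z);
        if (!this.cells.has(k)) this.cells.set(k, []);
        this.cells.get(k).push(p);
      }
      remove(p) {
        const bucket = this.cells.get(this.key(p.x, p.y, p.z));
        if (!bucket) return;
        const i = bucket.indexOf(p);
        if (i >= 0) bucket.splice(i, 1);
      }
      // "All points within `radius` of (x, y, z)": scan only overlapping cells.
      rangeQuery(x, y, z, radius) {
        const c = this.cellSize, r2 = radius * radius, result = [];
        const lo = v => Math.floor((v - radius) / c);
        const hi = v => Math.floor((v + radius) / c);
        for (let cx = lo(x); cx <= hi(x); cx++)
          for (let cy = lo(y); cy <= hi(y); cy++)
            for (let cz = lo(z); cz <= hi(z); cz++) {
              for (const p of this.cells.get(`${cx},${cy},${cz}`) || []) {
                const dx = p.x - x, dy = p.y - y, dz = p.z - z;
                if (dx * dx + dy * dy + dz * dz <= r2) result.push(p);
              }
            }
        return result;
      }
    }

Updating a point is just remove + insert; with a cell size close to the typical query radius you rarely touch more than a handful of buckets per query.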
For larger datasets I recommend the PH-tree (disclaimer: self-advertisement). Updates are much faster than with R-trees, and deletion is as fast as insertion. There is no rebalancing (as happens with R-trees or some kd-trees), so insertion/deletion times are quite predictable (rebalancing is neither needed nor possible; imbalance is inherently limited).
Range queries (= window queries) are a bit slower than R-trees, but the difference almost disappears for very small ranges (windows).
It is available in Java and C++.

Reading Jennifer5 Monitor

I am using Jennifer5 to monitor my web services, but I am confused about the information on the monitor. I have attached an image; if you look at the circled part of the graphs, they show future times for the current day and already contain some data. Is this data an average of past data, or is some algorithm applied to past data to predict possible future values? I cannot say what those values are exactly.
It was the data of the previous day, as described for one of the charts in the Jennifer5 manual.

SSAS incremental dimension processing

I have a large dimension and it is taking more and more time to process it. I would like to decrease the processing time as much as possible.
There are literally hundreds of articles on how to process SSAS objects as efficiently and quickly as possible.
There are lots of tips and tricks that one can apply to speed up dimension and cube processing. I have applied all, or at least a big majority, of them and I am still not happy with the result.
I have a large dimension built on top of a table.
It has around 60 million records and it keeps growing fast.
Rows are either added to it or deleted from it; updates are not possible.
I am looking for a solution that will allow me to perform incremental processing of my dimension.
I know that the data from previous months will not change. I would like to do something similar to partitioning of my cube, but on the dimension.
I am using SQL Server 2012 and, to my knowledge, dimension partitioning is not supported.
I am currently using ProcessUpdate on my dimension. I tried processing by attribute and by table, but both give almost the same result. I have hierarchies and relationships, some set to rigid, and I only use those attributes that are truly needed, etc.
ProcessUpdate has to read all the records in a dimension, even those that I know have not changed. Is there a way to partition a dimension? If I could tell SSAS to only process the last 3-4 weeks of data in my dimension and not touch the rest, it would greatly speed up my processing time.
I would appreciate your help.
OK, so I did a bit of research and I can confirm that incremental dimension processing is not supported.
It is possible to do ProcessAdd on a dimension, but if you have records that were deleted or updated you cannot use it.
It would be a useful thing to have, but MS hasn't developed it and I don't think it will.
Incremental processing of any table is, however, possible in tabular cubes.
So if you have a similar requirement and your cube is not too complex, then creating a tabular cube is the way to go.

D3: What are the most expensive operations?

I was rewriting my code just now and it feels orders of magnitude slower. Previously it was pretty much instant; now my animations take 4 seconds to react to mouse hovers.
I tried removing transitions and not having opacity changes but it's still really slow.
Though it is more readable. - -;
The only things I did were splitting large functions into smaller, more logical ones, reordering the grouping, and using new selections. What could cause such a huge difference in speed? My dataset isn't large either: 16KB.
edit: I also split up my monolithic huge chain.
edit2: I fiddled around with my code a bit, and it seems that switching to nodeGroup.append("path") caused it to be much slower than svg.append("path"). The inelegant thing about this, though, is that I have to transform the drawn paths to the middle when using svg, while the entire group is already transformed. Can anyone shed some insight on group.append vs svg.append?
edit3: Additionally, I was using opacity:0 to hide all my path lines before redrawing, which caused it to become slower and slower because those lines were never removed. I switched to remove().
Without data it is hard to work with or suggest a solution. You don't need to share private data, but it helps to generate some fake data with the same structure. It's also not clear where your performance hit comes from if we can't see how many DOM elements you are trying to make/interact with.
As for obvious things that stand out: you are not doing things in a data-driven way when drawing your segments. Any time you see a for loop, it is a hint that you are not using d3's selections when you could.
You should bind listEdges to your paths and draw them from within the selection; it's fine to transform them to the center from there. Also, you shouldn't use d3.select when you can use nodeGroup.select; that way you don't need to traverse the entire page when searching for your circles.
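A minimal sketch of that data join, assuming listEdges and nodeGroup exist in your code and lineGenerator is a placeholder for whatever path generator you already use:

    // Sketch: bind listEdges to paths inside the already-transformed group,
    // instead of looping and appending to the top-level svg.
    const paths = nodeGroup.selectAll("path.edge")
      .data(listEdges);

    paths.enter().append("path")
      .attr("class", "edge")
      .merge(paths)                        // d3 v4+ general update pattern
      .attr("d", d => lineGenerator(d));   // no per-element d3.select() needed

    paths.exit().remove();                 // drop stale paths instead of hiding them

The exit().remove() line is also the fix from your edit3: paths that are merely hidden with opacity:0 still sit in the DOM and keep costing memory and rendering time.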

Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me how I would go about sorting terabytes of data using a programming language.
I'm a C++ guy and, of course, we spoke about merge sort; one possible technique is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
"But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?"
Yes. Last time I checked Google processed over 20 petabytes of data daily.
"Why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it? That is my question in a nutshell."
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers: Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus if you want it sorted by product, you would need to sort the whole dataset.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation I used was by Prof. Daniel Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted by user id, and each line contained a user id followed by the user id of the person following them. However, in my case I wanted to know who follows whom, so I had to sort it again by the second user id on each line.
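As a small illustration of that re-sort (not Lemire's library, just the comparator idea), assuming whitespace-separated numeric ids:

    // Sketch: order "userId followerId" lines by the second id.
    // An external merge sort applies the same comparator to each in-memory
    // chunk and again during the final k-way merge of the sorted chunks.
    function bySecondId(a, b) {
      const secondA = Number(a.trim().split(/\s+/)[1]);
      const secondB = Number(b.trim().split(/\s+/)[1]);
      return secondA - secondB;
    }

    // For a chunk that fits in memory: lines.sort(bySecondId);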
However, for sorting 1 TB I would use map-reduce with Hadoop.
Sorting is the default step between the map and reduce phases. Thus I would make the map function the identity, keep a trivial identity reducer so that the framework's sort actually runs, and set up a streaming job.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function runs, its output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, I would make that element the key in XXX and the whole line the value in the map output.
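To make that concrete for a Hadoop streaming job (the Node.js mapper and the choice of field are my own illustration, not from the original answer), the mapper only has to emit key<TAB>line and let the framework's shuffle sort by key:

    #!/usr/bin/env node
    // Sketch of a streaming mapper: emit "sortKey \t originalLine" so the
    // Hadoop shuffle sorts records by that key. The key here is assumed to
    // be the second whitespace-separated field of each input line.
    const readline = require("readline");
    const rl = readline.createInterface({ input: process.stdin });

    rl.on("line", line => {
      const fields = line.trim().split(/\s+/);
      const key = fields[1] || "";             // sort key (assumption: second field)
      process.stdout.write(`${key}\t${line}\n`);
    });

The streaming job would use this script as the -mapper and an identity command such as cat as the -reducer; a single reducer (or a total-order partitioner) is needed if you want one globally sorted output.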
Yes, some companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
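As a tiny sketch of that idea (keys sorted, records left in place; the field name is illustrative):

    // Sketch: a sorted index over keys, leaving the records themselves unsorted.
    const records = [{ id: 42, v: "c" }, { id: 7, v: "a" }, { id: 19, v: "b" }];
    const index = records
      .map((r, pos) => ({ key: r.id, pos }))   // extract (key, position) pairs
      .sort((a, b) => a.key - b.key);          // sort only the keys
    // Walk the data in key order without ever reordering `records`:
    for (const { pos } of index) console.log(records[pos]);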
Yes, some companies do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years, which is every change in the price offering, real deal prices (trades, a.k.a. prints), etc. For highly volatile instruments such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments covering the last couple of years. Not to mention news that they correlate with the market, weather conditions and even the moon phase. So yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.
Scientific datasets can easily run into terabytes. You may sort them and store them one way (say, by date) when you gather the data. However, at some point someone will want the data sorted by another key, e.g. by latitude if you're working with data about the Earth.
Big companies do sort tera- and petabytes of data regularly; I've worked for more than one such company. As Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but also key extraction, enriching, etc.) at massive scale. Despite all that, there might be situations where you will need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
For security/privacy policies, certain fields in the log files needed to be encrypted before the data could be moved on for further processing. That meant a custom encryption algorithm was applied to each row. However, since the same field value appears hundreds of times in a file, it was more efficient to sort the file first, encrypt each distinct value once, and reuse the cached result for every repeated value.
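A minimal sketch of that sort-then-cache idea (encryptField and the row layout are placeholders, not the project's actual code):

    // Sketch: sort rows by the sensitive field, then encrypt each distinct
    // value only once and reuse the cached ciphertext for repeats.
    // `encryptField` stands in for the custom encryption routine.
    function encryptColumn(rows, fieldIndex, encryptField) {
      rows.sort((a, b) => a[fieldIndex].localeCompare(b[fieldIndex]));
      let lastPlain = null;
      let lastCipher = null;
      for (const row of rows) {
        if (row[fieldIndex] !== lastPlain) {      // new value: encrypt once
          lastPlain = row[fieldIndex];
          lastCipher = encryptField(lastPlain);
        }
        row[fieldIndex] = lastCipher;             // repeated value: reuse the cache
      }
      return rows;
    }

Because the rows are sorted, equal values are adjacent, so remembering only the last plaintext/ciphertext pair is enough of a cache.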
