How to handle a large dataset in d3.js

I have an 11 MB data set, and it's slow to load every time the page loads.
d3.csv("https://s3.amazonaws.com/vidaio/QHP_Individual_Medical_Landscape.csv", function(data) {
// drawing code...
});
I know that crossfilter can be used to slice and dice the data once it's loaded in the browser, but before that point the dataset is still big, and I only use an aggregation of the data. It seems like I should pre-process the data on the server before sending it to the client, perhaps by running crossfilter on the server side. Any suggestions on how to handle/process a large dataset for d3?

Is your data dynamic? If it's not, then you can certainly aggregate it and store the result on your server. The aggregation would only be required once. Even if the data is dynamic, if the changes are infrequent then you could benefit from aggregating only when the data changes and caching that result. If you have highly dynamic data such that you'll have to aggregate it fresh with every page load, then doing it on the server vs. the client could depend on how many simultaneous users you expect. A lot of simultaneous users might bring your server to its knees. OTOH, if you have a small number of users, then your server probably (possibly?) has more horsepower than your users' browsers, in which case it will be able to perform the aggregation faster than the browser. Also keep in mind the bandwidth cost of sending 11 MB to your users. Might not be a big deal ... unless they're loading the page a lot and doing it on mobile devices.
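If you do aggregate on the server, the moving parts are small. Below is a minimal sketch, assuming a Node.js/Express server, the d3-dsv package for parsing, and a CSV column named State to group by; none of those specifics come from the question, so adjust the file name, column and endpoint to your stack and data.
// Sketch only: aggregate the 11 MB CSV once on the server, cache the result
// in memory, and serve the much smaller JSON to the d3 front end.
const express = require('express');
const fs = require('fs');
const { csvParse } = require('d3-dsv');

const app = express();
let cached = null; // aggregated result, computed on first request

app.get('/summary.json', (req, res) => {
  if (!cached) {
    const rows = csvParse(fs.readFileSync('plans.csv', 'utf8')); // assumed file name
    cached = {};
    for (const row of rows) {
      // Example aggregation: count plans per state (column name is an assumption).
      cached[row.State] = (cached[row.State] || 0) + 1;
    }
  }
  res.json(cached); // a few KB instead of 11 MB
});

app.listen(3000);
The client would then call d3.json("/summary.json", ...) instead of pulling the full CSV; if the data changes, clear cached (or recompute on a schedule) to refresh the aggregation.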

Try simplifying the data (also suggested in the comment from Stephen Thomas).
Try pre-parsing the data into JSON. This will likely result in a larger file (more network time) but less parsing overhead (lower client CPU). If your problem is the parsing, this could save time.
Break the data up by some kind of sharding key, such as year. Limit the initial load to that shard, then load the other data files on demand as needed.
Break up the data by time but show everything in the UI: load the charts for the default view (such as the most recent timeframe), then asynchronously add the additional files as they arrive (or once they have all arrived). A sketch of that last approach follows this list.
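A minimal sketch of the shard-by-time idea, mirroring the d3.csv callback style used in the question. It assumes the data has already been split into per-year CSV files on the server; the file names and the draw/addToCharts helpers are illustrative, not part of the original question.
// Sketch only: render the most recent year immediately, then pull older
// shards in the background and merge them into the charts as they arrive.
d3.csv("data-2015.csv", function(recent) {
  draw(recent); // hypothetical function that renders the default view

  ["data-2014.csv", "data-2013.csv"].forEach(function(file) {
    d3.csv(file, function(older) {
      addToCharts(older); // hypothetical function that merges the new rows in
    });
  });
});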

How about server-side (gzip) compression? Text like CSV compresses well, so the transfer should be a fraction of the original 11 MB, and the browser will decompress it transparently in the background.
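If the CSV is served from Node, enabling this is one middleware call. The sketch below assumes Express and the compression package; most other web servers (nginx, Apache) have an equivalent built-in setting.
// Sketch only: gzip responses (including the static CSV) whenever the client
// advertises support for it.
const express = require('express');
const compression = require('compression');

const app = express();
app.use(compression());            // compresses responses on the fly
app.use(express.static('public')); // the CSV under public/ is now gzipped in transit
app.listen(3000);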

Related

Laravel pagination in Data Table

I am using the DataTables plugin in Laravel. I have about 3000 records in one of my tables.
But when I load that page it pulls all 3000 records into the browser and then creates the pagination, which slows down the page load.
How do I fix this, or what is the correct way to do it?
Use server-side processing.
Get help from some Laravel Packages. Such as Yajra's: https://yajrabox.com/docs/laravel-datatables/
Generally you can solve pagination either on the front end, the back end (server or database side), or a combination of both.
Server-side processing without a package would mean using TOP/FETCH (or LIMIT/OFFSET) in your queries so that only the requested page of rows is returned from your server.
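For reference, the client side of server-side processing is just a flag in the DataTables initialisation. The sketch below is minimal; the /users/data endpoint and the column names are assumptions to be replaced with whatever your Laravel route (for example a Yajra DataTables controller) actually exposes.
// Sketch only: DataTables requests one page at a time and sends the
// paging/sorting/search parameters with each request.
$('#users-table').DataTable({
  processing: true,
  serverSide: true,
  ajax: '/users/data', // server returns only the current page of rows
  columns: [
    { data: 'id' },
    { data: 'name' },
    { data: 'email' }
  ]
});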

You could also load a small amount (say 20) and then when the user scrolls to the bottom of the list, load another 20 or so. I mention the inclusion of front end processing as well because I’m not sure what your use cases are, but I imagine it’s pretty rare any given user actually needs to see 3000 rows at a time.
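A rough sketch of that load-as-you-scroll approach, in plain JavaScript. The /users endpoint, its page/size parameters, and the appendRows helper are assumptions; wire them to whatever your Laravel route actually provides.
// Sketch only: fetch 20 rows at a time and request the next batch when the
// user nears the bottom of the page.
let page = 1;
const size = 20;

async function loadMore() {
  const res = await fetch(`/users?page=${page}&size=${size}`);
  const rows = await res.json();
  appendRows(rows); // hypothetical helper that appends <tr> elements to the table
  page += 1;
}

window.addEventListener('scroll', () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.offsetHeight - 100;
  if (nearBottom) loadMore();
});

loadMore(); // initial batch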

Given that DataTables has built-in functionality for paginating data, I think that @tersakyan is essentially correct: what you want is some form of back-end filtering or pagination of rows to limit what's being sent to the front end.

I don’t know whether that package works for you or what your setup looks like, but pagination can also be achieved directly from the database via SQL (using TOP/FETCH, for example) or implemented in a controller or service by tracking pages of data and “loading a page at a time”, both from the server and then into the table. All you would need is a unique key to associate each "set of pages" with a specific request.
But for performance, you want to avoid both large data requests and operations on large sets of data. So the more you limit how much data is being grabbed or processed at any stage of your application using it, the more performant your application will be in principle.




Lambda Architecture - Why batch layer

I am going through the lambda architecture and understanding how it can be used to build fault tolerant big data systems.
I am wondering how the batch layer is useful when everything can be stored in the realtime view and the results generated from it. Is it because realtime storage can't hold all of the data, or because, if it did, it would no longer be realtime, since the time taken to retrieve the data depends on how much data is stored?
Why batch layer
To save Time and Money!
It basically has two functions:
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
"Everything can be stored in the realtime view and the results generated out of it" - NOT TRUE
The above is certainly possible, but not feasible: the data could be hundreds or thousands of petabytes, and generating results could take time... a lot of time!
Key here, is to attain low-latency queries over large dataset. Batch layer is used for creating batch views (queries served with low-latency) and realtime layer is used for recent/updated data which is usually small. Now, any ad-hoc query can be answered by merging results from batch views and real-time views instead of computing over all the master dataset.
Also, think of a query (same query?) running again and again over huge dataset.. loss of time and money!
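To make the merge concrete, here is a tiny illustrative sketch (not from any particular framework; the view objects and the function name are assumptions): the batch view answers the bulk of the query from precomputed results, and the realtime view covers only what has arrived since the last batch run.
// Sketch only: serving a query by merging a precomputed batch view with a
// small realtime view, instead of recomputing over the whole master dataset.
const batchView = new Map();    // precomputed offline over the master dataset
const realtimeView = new Map(); // covers only recent, not-yet-batched events

function pageViews(url) {
  // Low-latency merged answer: batch result plus the recent delta.
  return (batchView.get(url) || 0) + (realtimeView.get(url) || 0);
}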
Further to the answer provided by @karthik manchala, data processing can be handled in three ways: batch, interactive and real-time/streaming.
I believe your reference to real-time is more about interactive response than streaming, as not all use cases are streaming-related.
Interactive responses are those where the response can be expected anywhere from sub-second to a few seconds or minutes, depending on the use case. The key here is to understand that processing is done on data at rest, i.e. data already stored on a storage medium. The user interacts with the system while it processes and hence waits for the response. All the efforts around Hive on Tez, Impala, Spark core etc. are to address this and make the responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time - for example Twitter feeds, click streams etc. - and processing needs to happen as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy lifting needs to be done on a huge dataset beforehand, so that the user is led to believe that the responses they see are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing would run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index gets a response with sub-second latency. As you can see, the indexing cannot be done in real time because there may be huge amounts of data. The same is the case with Google search, where indexing is done in batch mode and the results are presented interactively.
All three modes of data processing are likely involved in any organisation grappling with data challenges. The Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data processing requirements.
You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. You can use Kafka in the right configuration as the master dataset storage and save the computed data in a database as your view.
If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace your old view.
It is possible to use only the realtime view as the main storage for ad-hoc queries, but as already mentioned in other answers, if you have a lot of data it is faster to keep batch processing and stream processing separate instead of running batch jobs as stream jobs. It depends on the size of your data.
Also, it is cheaper to have storage like HDFS instead of a database for batch computing.
And the last point: in many cases you have different algorithms for batch and stream processing, so you need to keep them separate. But basically it is possible to use only the "realtime view" as your batch and stream layer without using Kafka as the master dataset. It depends on your use case.

Core Data or SQLite for fast search?

This is a description of the application I want to build and I'm not sure whether to use Core Data or Sqlite (or something else?):
Single user, desktop, not networked, only one frontend accessing the data storage
User occasionally enters some data, no bulk data importing or large data inserts
Simple data model: an entity with up to 20-30 attributes
User searches the data (about 50k records max.)
Search takes place mostly on attribute values; I'm not looking up keys here but searching for text in the values
Writing the data is not something I see as critical; it doesn't happen very often and involves small amounts of data. The text search in the attributes has to be blazingly fast, as a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.

IndexedDB Access Speed and Efficiency

I'm developing an RPG in Dart, and I'm going to use IndexedDB for data persistence.
I will have two databases: one for read-only access and one for read-write access where save games will be stored. I was just wondering if I should read the required data directly from the database or cache it in Maps. I could potentially have several hundred records that need to be pulled from the read-only database (enemies, game maps etc.), and I thought pulling everything from the database might be less efficient than using Dart's Maps.
Oh, also each database will be stored in a map. Object Stores will be nested maps inside that map.
Should I read directly from the database, or should I put everything into a Map and read from there?
EDIT: Forgot to mention, the read-only database will be initialised with data from a JSON file located on the user's machine, not through AJAX.
I am confident that hundreds of records will present you no issue in IndexedDB. IDB was designed with that kind of scale in mind, and its async APIs -- while vexing for novices -- make sure your app stays responsive by design.
I am working on a demo designed to push IDB further than it should go, and have some easy-to-reach statistics for you. These are gets on a single index in a single store on a database.
Gets are blazing fast in IndexedDB. The issue with IDB at scale is typically writes.
One thousand success callbacks (plus one complete callback) came back in under a second; ten thousand took about 5 seconds; more than fifty thousand fired in less than a minute.
Writes are much slower: bursty at first, then slow after minutes and dog slow after hours. That's with any schema, but you'd likely have multiple indexes on location (both latitude and longitude at least, I imagine), so your writes will be especially slow (more indexes mean more work to maintain them on inserts and updates).
The schema used for those stats matters just as much as the stats themselves: make sure to design your schema according to how you need to access it.
I would go with direct database access, monitor the performance, and then optimize where notable gains are to be expected. Premature optimization is seldom a good idea.
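For reference, a minimal sketch of what the direct-access path looks like, with an optional in-memory Map in front of it. The question uses Dart, but the underlying IndexedDB calls are the same; this is plain JavaScript, and the store/index names are assumptions.
// Sketch only: a direct get on an index, with an optional Map cache added
// only if profiling later shows it is needed.
const cache = new Map();

function getEnemy(db, name, callback) {
  if (cache.has(name)) {
    callback(cache.get(name)); // served from memory, skips the database entirely
    return;
  }
  const tx = db.transaction('enemies', 'readonly');
  const request = tx.objectStore('enemies').index('by_name').get(name);
  request.onsuccess = () => {
    cache.set(name, request.result); // remember it for next time
    callback(request.result);
  };
}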

In Memory Caching of Dataset

I am planning to do some in-memory caching of my data for operations in my web service. This data would basically be lookup values which do not change frequently. I was planning to get all of that data into DataSets (multiple tables) and keep them until the data changes on the DB side. The catch is that some of my data never changes, while some may change quite frequently. Any ideas?
I would probably cache it at the DataTable level; then each table could have its own caching rules (expiration time, last updated, etc.).
