Strategies to reduce DOM elements of large data sets - dc.js

I have a large dataset that I want to display using dc.js. The number of entries far exceeds the available drawing space in pixels on the screen, so it does not make sense to render 20k points on a 500px wide chart; doing so also slows down the browser.
I read the performance tweaks section of the wiki and thought of some other things:
Aggregating groups using crossfilter (e.g. chunk the dataset in 500 groups if I have a 500px wide svg)
simplify my data using a Douglas–Peucker or Visvalingam’s algorithm
dc.js offers a neat rangeChart for range selection, which I want to use.
But the more I zoom in with the rangeChart, the more detail I want to show. I don't know how to get the zoom level of the chart and aggregate a group 'on the fly'. Perhaps someone has a thought about this.
I created a CodePen as an example.

This comes up a lot so I've added a focus dynamic interval example.
It's a refinement of the same techniques in the switching time intervals example, except here we determine which d3 time interval to use based on the extent of the brush in the range chart.
Unfortunately I don't have time to tune it right now, so let's iterate on this. IMO it's almost but not quite fast enough - it could sample even fewer points, but I used the built-in time intervals. When you see a jaggy line in the dc line chart, it's usually because you are displaying too many points - there should be dozens, not hundreds, and never thousands.
The idea is to spawn different groups for different time intervals. Here we'll define a few intervals and the threshold, in milliseconds, at which we should use that interval:
var groups_by_min_interval = [
{
name: 'minutes',
threshold: 60*60*1000,
interval: d3.timeMinute
}, {
name: 'seconds',
threshold: 60*1000,
interval: d3.timeSecond
}, {
name: 'milliseconds',
threshold: 0,
interval: d3.timeMillisecond
}
];
Again, there should be more here - since we will generate the groups dynamically and cache them, it's okay to have a bunch. (It will probably hog memory at some point, but gigabytes are OK in JS these days.)
When we need a group, we'll generate it by using the d3 interval function, which produces the floor, and then reduce total and count:
function make_group(interval) {
return dimension.group(interval).reduce(
function(p, v) {
p.count++;
p.total += v.value;
return p;
},
function(p, v) {
p.count--;
p.total -= v.value; // the remove function must subtract what add contributed
return p;
},
function() {
return {count: 0, total: 0};
}
);
}
Accordingly we will tell the charts to compute the average in their valueAccessors:
chart.valueAccessor(kv => kv.value.total / kv.value.count)
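One edge case worth guarding against: a bin whose count has dropped to zero (because everything in it was filtered out) makes the accessor above compute 0/0, which is NaN. Below is a minimal sketch of a guarded accessor. The name safe_average and the fallback value of 0 are my own choices for illustration, not part of dc.js.

```javascript
// Hypothetical helper: average of a {count, total} bin, with a fallback
// for empty bins so the chart never receives NaN.
function safe_average(kv) {
    return kv.value.count > 0 ? kv.value.total / kv.value.count : 0;
}

// Made-up bins shaped like the output of the count/total reduce.
var sampleBins = [
    {key: 1, value: {count: 4, total: 10}},
    {key: 2, value: {count: 0, total: 0}}
];
var averages = sampleBins.map(safe_average);
```

It could be passed in the same way as the arrow function, e.g. chart.valueAccessor(safe_average).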
Here's the fun part: when we need a group, we'll scan this list until we find the first spec whose threshold is less than the current extent in milliseconds:
function choose_group(extent) {
var d = extent[1].getTime() - extent[0].getTime();
var found = groups_by_min_interval.find(mg => mg.threshold < d);
console.log('interval ' + d + ' is more than ' + found.threshold + ' ms; choosing ' + found.name +
' for ' + found.interval.range(extent[0], extent[1]).length + ' points');
if(!found.group)
found.group = make_group(found.interval);
return found.group;
}
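The selection logic can be tried without d3 or crossfilter. The sketch below keeps only the threshold lookup and adds a fallback for the case where nothing matches (a zero-width extent, for instance, where find returns undefined). The names specs and choose_spec_name are hypothetical, and the fallback is an assumption about what you would want in that case.

```javascript
// Specs must be ordered from largest threshold to smallest, because
// Array.prototype.find returns the first match.
var specs = [
    {name: 'minutes', threshold: 60 * 60 * 1000},
    {name: 'seconds', threshold: 60 * 1000},
    {name: 'milliseconds', threshold: 0}
];

// Hypothetical helper: name of the spec chosen for an extent, falling
// back to the finest spec when no threshold is below the extent width.
function choose_spec_name(extent) {
    var d = extent[1].getTime() - extent[0].getTime();
    var found = specs.find(function(s) { return s.threshold < d; });
    return (found || specs[specs.length - 1]).name;
}

var twoHours = [new Date(0), new Date(2 * 60 * 60 * 1000)];
var fiveMinutes = [new Date(0), new Date(5 * 60 * 1000)];
var zeroWidth = [new Date(0), new Date(0)];
```

Here twoHours selects 'minutes', fiveMinutes selects 'seconds', and zeroWidth falls through to 'milliseconds' instead of throwing.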
Hook this up to the filtered event of the range chart:
rangeChart.on('filtered.dynamic-interval', function(_, filter) {
chart.group(choose_group(filter || fullDomain));
});
I've run out of time for now. Please ask any questions, and we'll refine this further. We will need custom time intervals (like a 10th of a second), and I am failing to find that example right now. There is a good way to do it.
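Until that example turns up, here is one possible stopgap, sketched under the assumption that a plain floor function is enough: crossfilter's dimension.group() accepts any function that maps a dimension value to a group key, so it does not have to be a d3 interval. The helper name floor_to_ms is hypothetical.

```javascript
// Hypothetical helper: build a floor function for an arbitrary step in
// milliseconds, e.g. 100 for tenths of a second.
function floor_to_ms(step) {
    return function(date) {
        return new Date(Math.floor(date.getTime() / step) * step);
    };
}

var tenthOfSecond = floor_to_ms(100);
var floored = tenthOfSecond(new Date(1234)); // 1234 ms after the epoch
```

Something like dimension.group(floor_to_ms(100)) could then stand in for the interval in make_group. Note that a plain function has no .range() method, so the logging line in choose_group would need adjusting. d3's own interval.every(), as in d3.timeMillisecond.every(100), may be the more idiomatic route, but I have not checked it against this example.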
Note: I have one-upped you and increased the number of points by an order of magnitude to half a million. This may be too much for older computers, but on a 2017 computer it proves that data quantity is not the problem, DOM elements are.

Related

How to show only limited number of records in box plot dc.js

I want to show the most recent 10 bins for box plot.
If a filter is applied to the bar chart or line chart, the box plot should show the most recent 10 records according to those filters.
I made a dimension by date (ordinal), but I am unable to get the result.
I didn't understand how to do it with a fake group. I am new to dc.js.
A picture of the scenario is attached. Let me know if anyone needs more detail to help me.
In the image, I tried a solution using a time scale.
You can do this with two fake groups, one to remove the empty box plots, and one to take the last N elements of the resulting data.
Removing empty box plots:
function remove_empty_array_bins(group) {
return {
all: function() {
return group.all().filter(d => d.value.length);
}
};
}
This just filters the bins, removing any where the .value array is of length 0.
Taking the last N elements:
function cap_group(group, N) {
return {
all: function() {
var all = group.all();
// clamp at 0: a negative start would make slice count from the end
return all.slice(Math.max(0, all.length - N));
}
};
}
This is essentially what the cap mixin does, except without creating a bin for "others" (which is somewhat tricky).
We fetch the data from the original group, see how long it is, and then slice that array from all.length - N to the end.
Chain these fake groups together when passing them to the chart:
chart
.group(cap_group(remove_empty_array_bins(closeGroup), 5))
I'm using 5 instead of 10 because I have a smaller data set to work with.
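To see the chaining without a chart, the two wrappers can be run against a stand-in group. Everything about stubGroup below is made up for illustration; the only thing a fake group needs is an all() method returning {key, value} bins. This version of cap_group clamps the start index at zero, which matters when N exceeds the number of bins, since a negative start index makes slice count from the end.

```javascript
// Stand-in for a crossfilter group with box-plot style array values.
var stubGroup = {
    all: function() {
        return [
            {key: 1, value: [3, 4]},
            {key: 2, value: []},
            {key: 3, value: [5]},
            {key: 4, value: [6, 7]}
        ];
    }
};

function remove_empty_array_bins(group) {
    return {
        all: function() {
            return group.all().filter(function(d) { return d.value.length; });
        }
    };
}

function cap_group(group, N) {
    return {
        all: function() {
            var all = group.all();
            // Clamp so that N larger than the bin count returns every bin.
            return all.slice(Math.max(0, all.length - N));
        }
    };
}

var lastTwo = cap_group(remove_empty_array_bins(stubGroup), 2).all();
var lastTen = cap_group(remove_empty_array_bins(stubGroup), 10).all();
```

lastTwo holds the bins with keys 3 and 4, and lastTen holds all three non-empty bins.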
Demo fiddle.
This example uses a "real" time scale rather than ordinal dates. There are a few ways to do ordinal dates, but if your group is still sorted from low to high dates, this should still work.
If not, you'll have to edit your question to include an example of the code you are using to generate the ordinal date group.

dc.js rowChart to Filter by max key

I have a dashboard where I'm showing Headcount over time. One is a line Graph that shows headcount over time period, the other is a rowChart that is split by HCLevel1 - that is simply there to allow users to filter.
I would like the rowChart to show Heads for the latest period within the date filter (rather than showing the full sum of heads for the full period which would be wrong).
I can do this by combining two fields into a dimension, but the problem with this is that when I use the rowChart to filter by business, I only see one month in the line chart - whereas I'd like to see the full period that's filtered. I can't work out how I could do this with a fake group, because the rowChart's dimension/key is HCLevel1.
My data is formatted like this:
var data = [
{
"HCLevel1": "Commercial",
"HCLevel2": "Portfolio TH",
"Period": 201407,
"Heads": 720
},
I've tried to use this custom reduce (picked up from another SO question) but it doesn't work correctly (minus values, incorrect values etc).
function reduceAddAvgPeriods(p, v) {
if (v.Period in p.periodsArray) {
p.periodsArray[v.Period] += v.Heads;
} else {
p.periodsArray[v.Period] = 0;
p.periodCount++;
}
p.heads += v.Heads;
return p;
}
Currently, my jsfiddle example is combining 2 fields for the dimension, but as you can see, I can't then filter using the rowChart to show me the full period on the line chart.
I can use reductio to give me the average, but I'd like to provide actual Heads value for most recent date filtered.
https://jsfiddle.net/kevinelphick/4ybekqey/3/
I hope this is possible, any help would be much appreciated, thanks!
I glanced at this a few days ago, but it took me a little while to figure out. Tricky!
We can restrict the design by considering these two facts:
1. We want to filter the row chart by "Level". That's simply
var dimLevel = cf.dimension(function (d) { return d.HCLevel1 || ''; });
2. A group does not observe its own dimension's filters. So we probably want to use the dimension from #1 to produce the data (the group) for the row chart.
Given these two restrictions, maybe we can dimension and group by level, but inside the bins of the group, keep track of the periods that contribute to that bin?
This is a common pattern often used for stacked charts:
var levelPeriodGroup = dimLevel.group().reduce(
function(p, v) {
p[v.Period] = (p[v.Period] || 0) + v.Heads;
return p;
},
function(p, v) {
p[v.Period] -= v.Heads;
return p;
},
function() {
return {};
}
);
Here, we'll just 'peel off' the top stack, dropping any zeros:
function last_period(group, maxPeriod) {
return {
all: function() {
var max = maxPeriod();
return group.all().map(function(kv) {
return {key: kv.key, value: kv.value[max]};
}).filter(function(kv) {
return kv.value > 0;
});
}
};
}
To keep last_period somewhat general, maxPeriod is now a function, which we'll define like this:
function max_period() {
return dimPeriod.top(1)[0].Period;
}
Bringing it all together and supplying it to the row chart:
rowChart
.group(last_period(levelPeriodGroup, max_period))
.dimension(dimLevel)
.elasticX(true);
Since the period is no longer part of the labels of the chart, we can put it in a headline:
<h4>Last Period: <span id="last-period"></span></h4>
and update it whenever the row chart is drawn:
rowChart.on('pretransition', function(chart) {
d3.select('#last-period').text(max_period());
});
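For a quick check of what the fake group returns, last_period can be run against a stand-in group. The bins below are invented for illustration (only the 'Commercial' level and the 201407 period echo the question's sample row), and the fixed maxPeriod function replaces the dimPeriod.top(1) lookup.

```javascript
// Stand-in for levelPeriodGroup: each bin maps period -> total heads.
var stubLevelGroup = {
    all: function() {
        return [
            {key: 'Commercial', value: {201406: 700, 201407: 720}},
            {key: 'Operations', value: {201406: 50, 201407: 0}},
            {key: 'Finance', value: {201406: 30}}
        ];
    }
};

function last_period(group, maxPeriod) {
    return {
        all: function() {
            var max = maxPeriod();
            return group.all().map(function(kv) {
                return {key: kv.key, value: kv.value[max]};
            }).filter(function(kv) {
                // Drops zeros and also bins with no entry for this period
                // (undefined > 0 is false).
                return kv.value > 0;
            });
        }
    };
}

var rows = last_period(stubLevelGroup, function() { return 201407; }).all();
```

Only the 'Commercial' row survives: 'Operations' is zero in the last period and 'Finance' has no entry for it.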

Clicking on rowchart (dc.js) changes the percentage

I need to solve a problem with dc and crossfilter, I have two rowcharts in which I show the calculated percentage of each row as:
(d.value/ndx.groupAll().reduceCount().value()*100).toFixed(1)
When you click on a row in the first chart, the text changes to 100% and does not maintain the old percentage value, also the percentages of the rows of the same chart where the row was selected change.
Is it possible to keep the original percentage when I click, while still affecting the other charts that were not clicked?
regards
thank you very much
First off, you probably don't want to call ndx.groupAll() inside of the calculation for the percentages, since that will be called many times. This method creates an object which will get updated every time a filter changes.
Now, there are three ways to interpret your specific question. I think the first case is the most likely, but the other two are also legitimate, so I'll address all three.
Percentages affected by other charts
Clearly you don't want the percentage affected by filtering the current chart. You almost never want that. But it often makes sense to have the percentage label affected by filtering on other charts, so that all the bars in the row chart add up to 100%.
The subtle difference between dimension.groupAll() and crossfilter.groupAll() is that the former will not observe that dimension's filters, whereas the latter observes all filters. If we use the row chart dimension's groupAll, it will observe the other filters but not filters on this chart:
var totalGroup = rowDim.groupAll().reduceCount();
rowChart.label(function(kv) {
return kv.key + ' (' + (kv.value/totalGroup.value()*100).toFixed(1) + '%)';
});
That's probably what you want, but reading your question literally suggests two other possible answers. So read on if that's not what you were looking for.
Percentages out of the constant total, but affected by other filters
Crossfilter doesn't have any particular way to calculate unfiltered totals, but if we want to use the unfiltered total, we can capture the value before any filters are applied.
So:
var total = rowDim.groupAll().reduceCount().value(); // call value() once, before filtering
rowChart.label(function(kv) {
return kv.key + ' (' + (kv.value/total*100).toFixed(1) + '%)';
});
In this case, the percentages will always show the portion out of the full, unfiltered, total denominator, but the numerators will reflect filters on other charts.
Percentages not affected by filtering at all
If you really want to just completely freeze the percentages and show unfiltered percentages, not affected by any filtering, we'll have to do a little extra work to capture those values.
(This is similar to what you need to do if you want to show a "shadow" of the unfiltered bars behind them.)
We'll copy all the group data into a map we can use to look up the values:
var rowUnfilteredAll = rowGroup.all().reduce(function(p, kv) {
p[kv.key] = kv.value;
return p;
}, {});
Now the label code is similar to before, but we lookup values instead of reading them from the bound data:
var total = rowDim.groupAll().reduceCount().value(); // call value() once, before filtering
rowChart.label(function(kv) {
return kv.key + ' (' + (rowUnfilteredAll[kv.key]/total*100).toFixed(1) + '%)';
});
(There might be a simpler way to just freeze the labels, but this is what came to mind.)
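The reason the copy is needed is that, as far as I understand it, crossfilter updates the bins returned by group.all() in place when filters change. The sketch below imitates that with a plain array. The bin values, the snapshot helper, and frozen_label are all made up for illustration.

```javascript
// Stand-in for rowGroup.all() before any filter is applied.
var liveBins = [
    {key: 'a', value: 30},
    {key: 'b', value: 50},
    {key: 'c', value: 20}
];

// Hypothetical helper: copy the current values into a key -> value map.
function snapshot(bins) {
    return bins.reduce(function(p, kv) {
        p[kv.key] = kv.value;
        return p;
    }, {});
}

var frozen = snapshot(liveBins);
var frozenTotal = liveBins.reduce(function(sum, kv) { return sum + kv.value; }, 0);

function frozen_label(key) {
    return key + ' (' + (frozen[key] / frozenTotal * 100).toFixed(1) + '%)';
}

// Imitate a later filter changing the live bin in place.
liveBins[0].value = 5;
var labelAfterFilter = frozen_label('a');
```

Because the numbers were copied, labelAfterFilter still reflects the unfiltered share for 'a'.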

crossfilter "double grouping" where key is the value of another reduction

Here is my data about mac address. It is recorded per minute. For each minute, I have many unique Mac addresses.
mac_add,created_time
18:59:36:12:23:33,2016-12-07 00:00:00.000
1c:e1:92:34:d7:46,2016-12-07 00:00:00.000
2c:f0:ee:86:bd:51,2016-12-07 00:00:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:00:00.000
...
18:59:36:12:23:33,2016-12-07 00:01:00.000
1c:cd:e5:1e:99:78,2016-12-07 00:01:00.000
1c:e1:92:34:d7:46,2016-12-07 00:01:00.000
5c:cf:7f:22:01:df,2016-12-07 00:01:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:01:00.000
...
I would like to create 2 bar charts using dc.js and crossfilter. Please refer to the image for the charts.
The first bar chart is easy enough to create. It is brushable. I created the "created_time" dimension, and created a group and reduceCount by "mac_add", such as below:
var moveTime = ndx.dimension(function (d) {
return d.dd; //# this is the created_time
});
var timeGroup = moveTime.group().reduceCount(function (d) {
return d.mac_add;
});
var visitorChart = dc.barChart('#visitor-no-bar');
visitorChart.width(990)
.height(350)
.margins({ top: 0, right: 50, bottom: 20, left: 40 })
.dimension(moveTime)
.group(timeGroup)
.centerBar(true)
.gap(1)
.elasticY(true)
.x(d3.time.scale().domain([new Date(2016, 11, 7), new Date(2016, 11, 13)]))
.round(d3.time.minute.round)
.xUnits(d3.time.minute);
visitorChart.render();
The problem is the second bar chart. Each row of the data equals one minute, so I can count the rows for each mac address to get its time length, by creating another dimension on "mac_add" and doing a reduceCount on it. The goal is then to group the time lengths into 30-minute bins, so we can see how many mac addresses have a time length of 30 minutes or less, how many have between 30 minutes and 1 hour, how many have between 1 hour and 1.5 hours, etc.
Please correct me if I am wrong. Logically, I think the dimension of the second bar chart should be the time-length group (such as <30 min, <1 hr, <1.5 hr, etc.). But the time-length groups themselves are not fixed; they depend on the brush selection of the first chart. Maybe there is only a 30-minute bin, maybe only a 1.5-hour bin, maybe both 1.5 hours and 2 hours, etc.
So I am really confused about what parameters to pass to the second bar chart, and how to get them (how to group already-grouped data). Please help me understand the solution.
Regards,
Marvin
I think we've called this a "double grouping" in the past, but I can't find the previous questions.
Setting up the groups
I'd start with a regular crossfilter group for the mac addresses, and then produce a fake group to aggregate by count of minutes.
var minutesPerMacDim = ndx.dimension(function(d) { return d.mac_add; }),
minutesPerMacGroup = minutesPerMacDim.group();
function bin_keys_by_value(group, bin_value) {
var _bins;
return {
all: function() {
var bins = {};
group.all().forEach(function(kv) {
var valk = bin_value(kv.value);
bins[valk] = bins[valk] || [];
bins[valk].push(kv.key);
});
_bins = bins;
// note: Object.keys returning numerical order here might not
// work everywhere, but I couldn't find a browser where it didn't
return Object.keys(bins).map(function(bin) {
return {key: bin, value: bins[bin].length};
})
},
bins: function() {
return _bins;
}
};
}
function bin_30_mins(v) {
return 30 * Math.ceil(v/30);
}
var macsPerMinuteCount = bin_keys_by_value(minutesPerMacGroup, bin_30_mins);
This will retain the mac addresses for each time bin, which we'll need for filtering later. It's uncommon to add a non-standard method bins to a fake group, but I can't think of an efficient way to retain that information, given that the filtering interface will only give us access to the keys.
Since the function takes a binning function, we could even use a threshold scale if we wanted more complicated bins than just rounding up to the nearest 30 minutes. A quantize scale is a more general way to do the rounding shown above.
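The binning step can be checked on a stand-in group. The sketch below is a reduced variant (no cached bins method) that also converts the keys back to numbers and sorts them explicitly, instead of relying on the order in which Object.keys returns them. The mac keys and minute counts are invented, and the name bin_keys_by_value_sorted is hypothetical.

```javascript
// Stand-in for minutesPerMacGroup: key = mac address, value = minutes seen.
var stubMacGroup = {
    all: function() {
        return [
            {key: 'aa', value: 12},
            {key: 'bb', value: 45},
            {key: 'cc', value: 30},
            {key: 'dd', value: 61}
        ];
    }
};

function bin_30_mins(v) {
    return 30 * Math.ceil(v / 30);
}

// Hypothetical variant: numeric keys, sorted ascending.
function bin_keys_by_value_sorted(group, bin_value) {
    return {
        all: function() {
            var bins = {};
            group.all().forEach(function(kv) {
                var valk = bin_value(kv.value);
                bins[valk] = bins[valk] || [];
                bins[valk].push(kv.key);
            });
            return Object.keys(bins).map(Number)
                .sort(function(a, b) { return a - b; })
                .map(function(k) {
                    return {key: k, value: bins[k].length};
                });
        }
    };
}

var binned = bin_keys_by_value_sorted(stubMacGroup, bin_30_mins).all();
```

With this input, two mac addresses land in the 30 bin and one each in 60 and 90.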
Setting up the chart
Using this data to drive a chart is simple: we can use the dimension and fake group as usual.
chart
.dimension(minutesPerMacDim)
.group(macsPerMinuteCount)
Setting up the chart so that it can filter is a bit more complicated:
chart.filterHandler(function(dimension, filters) {
if(filters.length === 0)
dimension.filter(null);
else {
var bins = chart.group().bins(); // retrieve cached bins
var macs = filters.map(function(key) { return bins[key]; })
macs = Array.prototype.concat.apply([], macs);
var macset = d3.set(macs);
dimension.filterFunction(function(key) {
return macset.has(key);
})
}
})
Recall that we're using a dimension which is keyed on mac addresses; this is good because we want to filter on mac addresses. But the chart is receiving minute-counts for its keys, and the filters will contain those keys, like 30, 60, 90, etc. So we need to supply a filterHandler which takes minute-count keys and filters the dimension based on those.
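The translation from minute-count keys to a mac address predicate can be isolated and tested on its own. The sketch below uses the native Set rather than d3.set, which I believe is no longer available in recent d3 releases. The helper name macs_predicate and the sample bins are hypothetical.

```javascript
// Hypothetical helper: given the cached bins (minute-count -> mac list)
// and the selected filter keys, build the function that would be passed
// to dimension.filterFunction.
function macs_predicate(bins, filters) {
    var macs = new Set();
    filters.forEach(function(key) {
        // Tolerate a filter key with no cached bin.
        (bins[key] || []).forEach(function(mac) {
            macs.add(mac);
        });
    });
    return function(mac) {
        return macs.has(mac);
    };
}

var sampleBins = {30: ['aa', 'cc'], 60: ['bb'], 90: ['dd']};
var keep = macs_predicate(sampleBins, [30, 90]);
```

Inside the filterHandler this would be used as dimension.filterFunction(macs_predicate(chart.group().bins(), filters)).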
Note 1: This is all untested, so if it doesn't work, please post an example as a fiddle or bl.ock - there are fiddles and blocks you can fork to get started on the main page.
Note 2: Strictly speaking, this is not measuring the length of connections: it's counting the total number of minutes connected. Not sure if this matters to you. If a user disconnects and then reconnects within the timeframe, the two sessions will be counted as one. I think you'd have to preprocess to get duration.
EDIT: Based on your fiddle (thank you!) the code above does seem to work. It's just a matter of setting up the x scale and xUnits properly.
chart2
.x(d3.scale.linear().domain([60,1440]))
.xUnits(function(start, end) {
return (end-start)/30;
})
A linear scale will do just fine here - I wouldn't try to quantize that scale, since the 30-minute divisions are already set up. We do need to set the xUnits so that dc.js knows how wide to make the bars.
I'm not sure why elasticX didn't work here, but the <30 bin completely dwarfed everything else, so I thought it was best to leave that out.
Fork of your fiddle: https://jsfiddle.net/gordonwoodhull/2a8ow1ay/2/

d3 linechart - Show 0 on the y-axis without passing in all points?

I have a line chart. Its purpose is to show the amount of transactions per user over a given time period.
To do this I'm getting the dates of all users' transactions. I'm working off this example: http://bl.ocks.org/mbostock/3884955 and have the line chart rendering fine.
My x-axis is time and the y-axis is number of transactions. The problem I have is to do with displaying dates when there is no activity.
Say I have 4 transactions on Tuesday and 5 transactions on Thursday. I need to show that there have been 0 transactions on Wednesday. As no data exists in my database explicitly stating that a user has made no transactions on Wednesday, do I need to pass in the Wednesday time (and all other times, depending on the timeframe) with a 0 value? Or can I do it with d3? I can't seem to find any examples that fit my problem.
This seems like a pretty common issue, so I worked up an example implementation here: http://jsfiddle.net/nrabinowitz/dhW2F/2/
Relevant code:
// get the min/max dates
var extent = d3.extent(data, function(d) { return d.date; }),
// hash the existing days for easy lookup
dateHash = data.reduce(function(agg, d) {
agg[d.date] = true;
return agg;
}, {}),
// note that this leverages some "get all headers but date" processing
// already present in the example
headers = color.domain();
// make even intervals
d3.time.days(extent[0], extent[1])
// drop the existing ones
.filter(function(date) {
return !dateHash[date];
})
// and push them into the array
.forEach(function(date) {
var emptyRow = { date: date };
headers.forEach(function(header) {
emptyRow[header] = null;
});
data.push(emptyRow);
});
// re-sort the data
data.sort(function(a, b) { return d3.ascending(a.date, b.date); });
As you can see, it's a bit convoluted, but seems to work well - you make an array of evenly spaced dates using the handy d3.time.days method (an alias for d3.time.day.range), filter out those dates already present in your data, and use the remaining ones to push empty rows. One downside is that performance could be slow for a big dataset - and this assumes full rows are empty, rather than different empty dates in different series.
An alternate representation, with gaps (using line.defined) instead of zero points, is here: http://jsfiddle.net/nrabinowitz/dhW2F/3/
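For a single series, the same fill-the-gaps idea can be sketched without d3. This version assumes every date sits exactly on UTC midnight and that each row has a count field; both are assumptions for illustration, as is the helper name fill_missing_days. Missing days get an explicit zero.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: return a sorted copy of rows with a zero-count
// row added for every UTC day between the first and last date.
function fill_missing_days(rows) {
    var seen = {};
    var times = rows.map(function(r) {
        seen[r.date.getTime()] = true;
        return r.date.getTime();
    });
    var min = Math.min.apply(null, times);
    var max = Math.max.apply(null, times);
    var filled = rows.slice();
    for (var t = min; t <= max; t += DAY_MS) {
        if (!seen[t]) {
            filled.push({date: new Date(t), count: 0});
        }
    }
    return filled.sort(function(a, b) { return a.date - b.date; });
}

var transactions = [
    {date: new Date(Date.UTC(2013, 0, 1)), count: 4}, // a Tuesday
    {date: new Date(Date.UTC(2013, 0, 3)), count: 5}  // the following Thursday
];
var filledRows = fill_missing_days(transactions);
```

filledRows has three entries, with a zero-count row for the Wednesday in between.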
