crossfilter "double grouping" where key is the value of another reduction - dc.js

Here is my data about MAC addresses. It is recorded per minute. For each minute, I have many unique MAC addresses.
mac_add,created_time
18:59:36:12:23:33,2016-12-07 00:00:00.000
1c:e1:92:34:d7:46,2016-12-07 00:00:00.000
2c:f0:ee:86:bd:51,2016-12-07 00:00:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:00:00.000
...
18:59:36:12:23:33,2016-12-07 00:01:00.000
1c:cd:e5:1e:99:78,2016-12-07 00:01:00.000
1c:e1:92:34:d7:46,2016-12-07 00:01:00.000
5c:cf:7f:22:01:df,2016-12-07 00:01:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:01:00.000
...
I would like to create 2 bar charts using dc.js and crossfilter. Please refer to the image for the charts.
The first bar chart is easy enough to create. It is brushable. I created the "created_time" dimension, and created a group and reduceCount by "mac_add", such as below:
var moveTime = ndx.dimension(function (d) {
return d.dd; //# this is the created_time
});
var timeGroup = moveTime.group().reduceCount(function (d) {
return d.mac_add;
});
var visitorChart = dc.barChart('#visitor-no-bar');
visitorChart.width(990)
.height(350)
.margins({ top: 0, right: 50, bottom: 20, left: 40 })
.dimension(moveTime)
.group(timeGroup)
.centerBar(true)
.gap(1)
.elasticY(true)
.x(d3.time.scale().domain([new Date(2016, 11, 7), new Date(2016, 11, 13)]))
.round(d3.time.minute.round)
.xUnits(d3.time.minute);
visitorChart.render();
The problem is the second bar chart. Each row of the data represents one minute, so I can count the rows for each mac address to get its total time length: create another dimension on "mac_add" and use reduceCount to get the time length. The goal is then to group those time lengths into 30-minute bins, so we can see how many mac addresses have a time length of 30 minutes or less, how many have a time length between 30 minutes and 1 hour, how many between 1 hour and 1.5 hours, and so on.
Please correct me if I am wrong. Logically, I think the dimension of the second bar chart should be the time-length bin (such as <30 min, <1 hr, <1.5 hr, etc.). But the time-length bins themselves are not fixed; they depend on the brush selection in the first chart. Maybe the selection only produces the 30-minute bin, maybe only the 1.5-hour bin, maybe both the 1.5-hour and 2-hour bins, and so on.
So I am really confused about which parameters to pass to the second bar chart, and how to obtain them (how to group already-grouped data). Please help by explaining the solution.
Regards,
Marvin

I think we've called this a "double grouping" in the past, but I can't find the previous questions.
Setting up the groups
I'd start with a regular crossfilter group for the mac addresses, and then produce a fake group to aggregate by count of minutes.
var minutesPerMacDim = ndx.dimension(function(d) { return d.mac_add; }),
minutesPerMacGroup = minutesPerMacDim.group();
function bin_keys_by_value(group, bin_value) {
var _bins;
return {
all: function() {
var bins = {};
group.all().forEach(function(kv) {
var valk = bin_value(kv.value);
bins[valk] = bins[valk] || [];
bins[valk].push(kv.key);
});
_bins = bins;
// note: Object.keys returning numerical order here might not
// work everywhere, but I couldn't find a browser where it didn't
return Object.keys(bins).map(function(bin) {
return {key: bin, value: bins[bin].length};
})
},
bins: function() {
return _bins;
}
};
}
// round a minute count up to the nearest multiple of 30
function bin_30_mins(v) {
return 30 * Math.ceil(v/30);
}
var macsPerMinuteCount = bin_keys_by_value(minutesPerMacGroup, bin_30_mins);
This will retain the mac addresses for each time bin, which we'll need for filtering later. It's uncommon to add a non-standard method bins to a fake group, but I can't think of an efficient way to retain that information, given that the filtering interface will only give us access to the keys.
Since the function takes a binning function, we could even use a threshold scale if we wanted more complicated bins than just rounding up to the nearest 30 minutes. A quantize scale is a more general way to do the rounding shown above.
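To make the threshold idea concrete, here is a minimal sketch in plain JavaScript, without d3. The helper name make_threshold_binner and the breakpoints are hypothetical, not part of the answer's tested code: each value is mapped to the smallest breakpoint that is greater than or equal to it, and anything larger lands in a final Infinity bin.

```javascript
// Hypothetical helper: build a bin_value function from sorted breakpoints.
// Each value is assigned to the smallest breakpoint >= value.
function make_threshold_binner(breaks) {
    return function(v) {
        for (var i = 0; i < breaks.length; ++i) {
            if (v <= breaks[i])
                return breaks[i];
        }
        return Infinity; // larger than every breakpoint
    };
}

// Assumed, uneven bins: up to 30 min, up to 1 h, up to 4 h, up to a day
var bin_uneven = make_threshold_binner([30, 60, 240, 1440]);
```

It would be passed in place of the 30-minute function, e.g. bin_keys_by_value(minutesPerMacGroup, bin_uneven). With uneven bins like these, an ordinal x scale is probably a better fit than a linear one.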
Setting up the chart
Using this data to drive a chart is simple: we can use the dimension and fake group as usual.
chart
.dimension(minutesPerMacDim)
.group(macsPerMinuteCount)
Setting up the chart so that it can filter is a bit more complicated:
chart.filterHandler(function(dimension, filters) {
if(filters.length === 0)
dimension.filter(null);
else {
var bins = chart.group().bins(); // retrieve cached bins
var macs = filters.map(function(key) { return bins[key]; })
macs = Array.prototype.concat.apply([], macs);
var macset = d3.set(macs);
dimension.filterFunction(function(key) {
return macset.has(key);
})
}
})
Recall that we're using a dimension which is keyed on mac addresses; this is good because we want to filter on mac addresses. But the chart is receiving minute-counts for its keys, and the filters will contain those keys, like 30, 60, 90, etc. So we need to supply a filterHandler which takes minute-count keys and filters the dimension based on those.
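The key translation can be tried in isolation. This is only a sketch with made-up data: macs_for_filters is a hypothetical helper, and it uses a native Set where the handler above uses d3.set. The bins object has the same shape as the one cached by the fake group.

```javascript
// Hypothetical helper: expand selected bin keys into the set of mac
// addresses they contain. `bins` has the shape cached by the fake group:
// { binKey: [mac, mac, ...] }.
function macs_for_filters(bins, filters) {
    var macs = new Set();
    filters.forEach(function(key) {
        // a key with no cached bin contributes nothing
        (bins[key] || []).forEach(function(mac) {
            macs.add(mac);
        });
    });
    return macs;
}

// Made-up sample bins
var sample_bins = {
    30: ['18:59:36:12:23:33', '1c:e1:92:34:d7:46'],
    60: ['2c:f0:ee:86:bd:51'],
    90: ['5c:cf:7f:d3:2e:ce']
};
var selected = macs_for_filters(sample_bins, [30, 90]);
```

The resulting set is what the function passed to dimension.filterFunction would test each mac address against.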
Note 1: This is all untested, so if it doesn't work, please post an example as a fiddle or bl.ock - there are fiddles and blocks you can fork to get started on the main page.
Note 2: Strictly speaking, this is not measuring the length of connections: it's counting the total number of minutes connected. Not sure if this matters to you. If a user disconnects and then reconnects within the timeframe, the two sessions will be counted as one. I think you'd have to preprocess to get duration.
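If real session durations do matter, the preprocessing could look roughly like the sketch below. Everything here is an assumption for illustration: the helper name session_lengths, the numeric minute field (minutes since some epoch, which you would derive from created_time), and the rule that a gap longer than max_gap_minutes starts a new session.

```javascript
// Hypothetical preprocessing: split each mac address's minutes into
// sessions, starting a new session whenever the gap between consecutive
// records exceeds max_gap_minutes.
// records: [{mac_add: string, minute: number}]
function session_lengths(records, max_gap_minutes) {
    var by_mac = {};
    records.forEach(function(r) {
        (by_mac[r.mac_add] = by_mac[r.mac_add] || []).push(r.minute);
    });
    var sessions = [];
    Object.keys(by_mac).forEach(function(mac) {
        var minutes = by_mac[mac].sort(function(a, b) { return a - b; });
        var start = minutes[0], prev = minutes[0];
        for (var i = 1; i < minutes.length; ++i) {
            if (minutes[i] - prev > max_gap_minutes) {
                // gap found: close the current session and start a new one
                sessions.push({mac_add: mac, start: start, length: prev - start + 1});
                start = minutes[i];
            }
            prev = minutes[i];
        }
        sessions.push({mac_add: mac, start: start, length: prev - start + 1});
    });
    return sessions;
}

// Made-up records: 'aa' is seen for 3 minutes, disappears, then returns
var sample = [
    {mac_add: 'aa', minute: 0}, {mac_add: 'aa', minute: 1},
    {mac_add: 'aa', minute: 2}, {mac_add: 'aa', minute: 10},
    {mac_add: 'aa', minute: 11}, {mac_add: 'bb', minute: 5}
];
var sessions = session_lengths(sample, 1);
```

The session rows could then be loaded into crossfilter with a dimension on length, which would make the second chart an ordinary binned bar chart, at the cost of no longer sharing a dataset with the per-minute chart.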
EDIT: Based on your fiddle (thank you!) the code above does seem to work. It's just a matter of setting up the x scale and xUnits properly.
chart2
.x(d3.scale.linear().domain([60,1440]))
.xUnits(function(start, end) {
return (end-start)/30;
})
A linear scale will do just fine here - I wouldn't try to quantize that scale, since the 30-minute divisions are already set up. We do need to set the xUnits so that dc.js knows how wide to make the bars.
I'm not sure why elasticX didn't work here, but the <30 bin completely dwarfed everything else, so I thought it was best to leave that out.
Fork of your fiddle: https://jsfiddle.net/gordonwoodhull/2a8ow1ay/2/

Related

Strategies to reduce DOM elements of large data sets

I have a large dataset that I want to display using dc.js. The amount of entries exceeds the available drawing space in pixels on the screen by far. So it does not make sense to render 20k points on a 500px wide chart and also slows down the browser.
I read the Performance tweak section of the wiki and thought of some other things:
Aggregating groups using crossfilter (e.g. chunk the dataset in 500 groups if I have a 500px wide svg)
simplify my data using a Douglas–Peucker or Visvalingam’s algorithm
dc.js offers a neat rangeChart that can be used to display range selection that I want to use.
But the more I zoom in on the rangeChart, the more detail I want to show. I don't know how to get the zoom level of the chart and aggregate a group 'on the fly'. Perhaps someone has a thought about this.
I created a CodePen as an example.
This comes up a lot so I've added a focus dynamic interval example.
It's a refinement of the same techniques in the switching time intervals example, except here we determine which d3 time interval to use based on the extent of the brush in the range chart.
Unfortunately I don't have time to tune it right now, so let's iterate on this. IMO it's almost but not quite fast enough - it could sample even fewer points but I used the built-in time intervals. When you see a jaggy line in the dc line chart it's usually because you are displaying too many points - there should be dozens not hundreds and never thousands.
The idea is to spawn different groups for different time intervals. Here we'll define a few intervals and the threshold, in milliseconds, at which we should use that interval:
var groups_by_min_interval = [
{
name: 'minutes',
threshold: 60*60*1000,
interval: d3.timeMinute
}, {
name: 'seconds',
threshold: 60*1000,
interval: d3.timeSecond
}, {
name: 'milliseconds',
threshold: 0,
interval: d3.timeMillisecond
}
];
Again, there should be more here - since we will generate the groups dynamically and cache them, it's okay to have a bunch. (It will probably hog memory at some point, but gigabytes are OK in JS these days.)
When we need a group, we'll generate it by using the d3 interval function, which produces the floor, and then reduce total and count:
function make_group(interval) {
return dimension.group(interval).reduce(
function(p, v) {
p.count++;
p.total += v.value;
return p;
},
function(p, v) {
p.count--;
p.total -= v.value;
return p;
},
function() {
return {count: 0, total: 0};
}
);
}
Accordingly we will tell the charts to compute the average in their valueAccessors:
chart.valueAccessor(kv => kv.value.total / kv.value.count)
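One thing to watch with an average accessor is a bin that a filter has emptied: its count is 0, so total / count is NaN. Here is a small standalone sketch of count/total reducers with a guarded accessor. The names avg_add, avg_remove, avg_init and average are hypothetical, and the forEach loop only simulates what crossfilter does when rows enter and leave a bin.

```javascript
// Reducers tracking count and total, so the average can be derived later.
function avg_add(p, v) {
    p.count++;
    p.total += v.value;
    return p;
}
function avg_remove(p, v) {
    p.count--;
    p.total -= v.value;
    return p;
}
function avg_init() {
    return {count: 0, total: 0};
}
// Accessor that returns 0 instead of NaN for bins emptied by a filter
function average(kv) {
    return kv.value.count ? kv.value.total / kv.value.count : 0;
}

// Simulate crossfilter adding three rows to a bin, then removing one
var bin = avg_init();
[{value: 2}, {value: 4}, {value: 9}].forEach(function(row) {
    avg_add(bin, row);
});
avg_remove(bin, {value: 9});
var result = average({key: 'bin', value: bin});
var empty_result = average({key: 'empty', value: avg_init()});
```

Whether an empty bin should draw as 0 or be dropped from the line is a design choice; returning 0 is just the simplest option.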
Here's the fun part: when we need a group, we'll scan this list until we find the first spec whose threshold is less than the current extent in milliseconds:
function choose_group(extent) {
var d = extent[1].getTime() - extent[0].getTime();
var found = groups_by_min_interval.find(mg => mg.threshold < d);
console.log('interval ' + d + ' is more than ' + found.threshold + ' ms; choosing ' + found.name +
' for ' + found.interval.range(extent[0], extent[1]).length + ' points');
if(!found.group)
found.group = make_group(found.interval);
return found.group;
}
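One edge case worth a sketch: Array.prototype.find returns undefined when nothing matches, which happens here if the extent has zero width, since no threshold is less than 0. The version below is a standalone illustration of the selection rule with a fallback to the finest spec. The helper choose_spec and the extra 'hours' entry are assumptions, and the specs carry only names so the snippet runs without d3.

```javascript
// Hypothetical specs, ordered from coarsest to finest; thresholds in ms
var specs = [
    {name: 'hours',        threshold: 24 * 60 * 60 * 1000},
    {name: 'minutes',      threshold: 60 * 60 * 1000},
    {name: 'seconds',      threshold: 60 * 1000},
    {name: 'milliseconds', threshold: 0}
];

// Pick the first spec whose threshold is below the extent width,
// falling back to the finest spec when none matches (a width of 0).
function choose_spec(specs, width_ms) {
    for (var i = 0; i < specs.length; ++i) {
        if (specs[i].threshold < width_ms)
            return specs[i];
    }
    return specs[specs.length - 1];
}
```

The order of the list matters: because the first match wins, the specs have to run from the largest threshold to the smallest.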
Hook this up to the filtered event of the range chart:
rangeChart.on('filtered.dynamic-interval', function(_, filter) {
chart.group(choose_group(filter || fullDomain));
});
I've run out of time for now. Please ask any questions, and we'll refine this further. We will need custom time intervals (like a 10th of a second) and I am failing to find that example right now. There is a good way to do it.
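Until that example turns up, here is one possible stopgap, offered as an assumption rather than the approach alluded to above. crossfilter's dimension.group() accepts any function that maps a dimension value to a group key, so a plain flooring function can stand in for a d3 interval. The helper floor_to_ms is hypothetical.

```javascript
// Hypothetical helper: floor a Date to a multiple of step_ms milliseconds.
// The returned function has the same shape as a d3 interval's floor:
// Date in, floored Date out.
function floor_to_ms(step_ms) {
    return function(date) {
        return new Date(Math.floor(date.getTime() / step_ms) * step_ms);
    };
}

// A tenth of a second
var tenth_of_second = floor_to_ms(100);
var floored = tenth_of_second(new Date(1234567));
```

Note that this function has no range method, so the console.log in choose_group, which calls found.interval.range, would need adjusting for such an entry. I believe d3-time's interval.every(step), e.g. d3.timeMillisecond.every(100), may also cover this case, but I have not tested it here.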
Note: I have one-upped you and increased the number of points by an order of magnitude to half a million. This may be too much for older computers, but on a 2017 computer it proves that data quantity is not the problem, DOM elements are.

dc.js Composite Graph - Plot New Line for Each Person

Good Evening Everyone,
I'm trying to take the data from a database full of hour reports (name, timestamp, hours worked, etc.) and create a plot using dc.js to visualize the data. I would like the timestamp to be on the x-axis, the sum of hours for the particular timestamp on the y-axis, and a new bar graph for each unique name all on the same chart.
It appears based on my objectives that using crossfilter.js the timestamp should be my 'dimension' and then the sum of hours should be my 'group'.
Question 1, how would I then use the dimension and group to further split the data based on the person's name and then create a bar graph to add to my composite graph? I would like for the crossfilter.js functionality to remain intact so that if I add a date range tool or some other user controllable filter, everything updates accordingly.
Question 2, my timestamps are in MySQL datetime format: YYYY-mm-dd HH:MM:SS so how would I go about dropping precision? For instance, if I want to combine all entries from the same day into one entry (day precision) or combine all entries in one month into a single entry (month precision).
Thanks in advance!
---- Added on 2017/01/28 16:06
To further clarify, I'm referencing the Crossfilter & DC APIs alongside the DC NASDAQ and Composite examples. The Composite example has shown me how to place multiple line/bar charts on a single graph. On the composite chart I've created, each of the bar charts I've added a dimension based off of the timestamps in the data-set. Now I'm trying to figure out how to define the groups for each. I want each bar chart to represent the total time worked per timestamp.
For example, I have five people in my database, so I want there to be five bar charts within the single composite chart. Today all five submitted reports saying they worked 8 hours, so now all five bar charts should show a mark at 01/28/2017 on the x-axis and 8 hours on the y-axis.
var parseDate = d3.time.format('%Y-%m-%d %H:%M:%S').parse;
data.forEach(function(d) {
d.timestamp = parseDate(d.timestamp);
});
var ndx = crossfilter(data);
var writtenDimension = ndx.dimension(function(d) {
return d.timestamp;
});
var hoursSumGroup = writtenDimension.group().reduceSum(function(d) {
return d.time_total;
});
var minDate = parseDate('2017-01-01 00:00:00');
var maxDate = parseDate('2017-01-31 23:59:59');
var mybarChart = dc.compositeChart("#my_chart");
mybarChart
.width(window.innerWidth)
.height(480)
.x(d3.time.scale().domain([minDate,maxDate]))
.brushOn(false)
.clipPadding(10)
.yAxisLabel("This is the Y Axis!")
.compose([
dc.barChart(mybarChart)
.dimension(writtenDimension)
.colors('red')
.group(hoursSumGroup, "Top Line")
]);
So based on what I have right now and the example I've provided, in the compose section I should have 5 charts because there are 5 people (obviously this needs to be dynamic in the end) and each of those charts should only show the timestamp: total_time data for that person.
At this point I don't know how to further breakup the group hoursSumGroup based on each person and this is where my Question #1 comes in and I need help figuring out.
Question #2 above is that I want to make sure that the code is both dynamic (more people can be handled without code change), when minDate and maxDate are later tied to user input fields, the charts update automatically (I assume through adjusting the dimension variable in some way), and if I add a names filter that if I unselect names that the chart will update by removing the data for that person.
A Question #3 that I'm now realizing I'll want to figure out is how to get the person's name to show up in the pointer tooltip (the title) along with timestamp and total_time values.
There are a number of ways to go about this, but I think the easiest thing to do is to create a custom reduction which reduces each person into a sub-bin.
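First off, addressing question #2, you'll want to set up your dimension based on the time interval you're interested in. For instance, if you're looking at hours (for day or month precision, substitute d3.time.day and d3.time.days, or d3.time.month and d3.time.months):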
var writtenDimension = ndx.dimension(function(d) {
return d3.time.hour(d.timestamp);
});
chart.xUnits(d3.time.hours);
This will cause each timestamp to be rounded down to the nearest hour, and tell the chart to calculate the bar width accordingly.
Next, here's a custom reduction (from the FAQ) which will create an object for each reduced value, with values for each person's name:
var hoursSumGroup = writtenDimension.group().reduce(
function(p, v) { // add
p[v.name] = (p[v.name] || 0) + v.time_total;
return p;
},
function(p, v) { // remove
p[v.name] -= v.time_total;
return p;
},
function() { // init
return {};
});
I did not go with the series example I mentioned in the comments, because I think composite keys can be difficult to deal with. That's another option, and I'll expand my answer if that's necessary.
Next, we can feed the composite line charts with value accessors that can fetch the value by name.
Assume we have an array names.
compositeChart.shareTitle(false);
compositeChart.compose(
names.map(function(name) {
return dc.lineChart(compositeChart)
.dimension(writtenDimension)
.colors('red')
.group(hoursSumGroup)
.valueAccessor(function(kv) {
return kv.value[name];
})
.title(function(kv) {
return name + ' ' + kv.key + ': ' + kv.value[name];
});
}));
Again, it wouldn't make sense to use bar charts here, because they would obscure each other.
If you filter a name elsewhere, it will cause the line for the name to drop to zero. Having the line disappear entirely would probably not be so simple.
The above shareTitle(false) ensures that the child charts will draw their own titles; the title functions just add the current name to those titles (which would usually just be key:value).
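The answer assumes an array names already exists. One way to build it, sketched here with made-up rows, is to collect the distinct name values from the raw data before charting. The helper unique_names is hypothetical.

```javascript
// Collect the distinct values of the `name` field, in first-seen order.
function unique_names(rows) {
    var seen = {}, names = [];
    rows.forEach(function(row) {
        if (!Object.prototype.hasOwnProperty.call(seen, row.name)) {
            seen[row.name] = true;
            names.push(row.name);
        }
    });
    return names;
}

// Made-up rows in the shape described in the question
var rows = [
    {name: 'Ann', time_total: 8},
    {name: 'Bob', time_total: 8},
    {name: 'Ann', time_total: 4}
];
var names = unique_names(rows);
```

Computing the list once from the unfiltered data keeps the set of child charts stable, which matches the behavior described above where a filtered-out person's line drops to zero rather than disappearing.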

Prevent a graph from recalculating its own percentages

I have three Row Charts and my code calculates and updates the percentages for each chart whenever a user first lands on the page or clicks a rectangle bar of a chart.  This is how it calculates the percentages
posChart:
% Position= unique StoreNumber counts per Position / unique StoreNumber counts for all POSITIONs
deptChart:
% Departments= POSITION counts per DEPARTMENT/POSITION counts for all DEPARTMENTs
stateChart:
% States= unique StoreNumber counts per STATE / unique StoreNumber counts for all STATEs
What I want is when a user clicks a rectangle bar of a rowChart such as “COUNTS BY STATE”, it should NOT update/recalculate the percentages for that chart (it should not affect its own percentages), however, percentages should be recalculated for the other two charts i.e. “COUNTS BY DEPARTMENT” and “COUNTS BY POSITION”.  The Same scenario holds for the other charts as well. This is what I want
If a user clicks a
“COUNTS BY DEPARTMENT” chart --> recalculate percentages for “COUNTS BY POSITION” and “COUNTS BY STATE” charts
“COUNTS BY POSITION” chart --> recalculate percentages for “COUNTS BY DISTRIBUTOR” and “COUNTS BY STATE” charts
Please Help!!
link:http://jsfiddle.net/mfi_login/z860sz69/
Thanks for the reply.
There is a problem with the solution you provided. I am looking for the global total for all filters but I don’t want those totals to be changed when user clicks on a current chart's rectangular bar.
e.g.
if there are two different POSITIONS (Supervisor, Account Manager) with the same StoreNumber (3), then I want StoreNumber to be counted as 1 not 2
If we take an example of Account Manager % calculation (COUNTS BY POSITION chart)
total unique StoreNumbers=3
Total Account Manager POSITIONs=2
% = 2/3=66%
Is there a way to redraw the other two charts without touching the current one?
It seems to me that what you really want is to use the total of the chart's groups, not the overall total. If you use the overall total then all filters will be observed, but if you use the total for the current group, it will not observe any filters on the current chart.
This will have the effect you want - it's not about preventing any calculations, but about making sure each chart is affected only by the filters on the other charts.
So, instead of bin_counter, let's define sum_group and sum_group_xep:
function sum_group(group, acc) {
acc = acc || function(kv) { return kv.value; };
return d3.sum(group.all().filter(function(kv) {
return acc(kv) > 0;
}), acc);
}
function sum_group_xep(group) {
return sum_group(group, function(kv) {
return kv.value.exceptionCount;
});
}
And we'll use these for each chart, so e.g.:
posChart
.valueAccessor(function (d) {
//gets the total unique store numbers for selected filters
var value=sum_group_xep(posGrp)
var percent=value>0?(d.value.exceptionCount/value)*100:0
//this returns the x-axis percentages
return percent
})
deptChart
.valueAccessor(function (d) {
var total=sum_group(deptGrp)
var percent=d.value?(d.value/total)*100:0
return percent
})
stateChart
.valueAccessor(function (d) {
var value=sum_group_xep(stateGrp);
return value>0?(d.value.exceptionCount/value)*100:0
})
... along with the other 6 places these are used. There's probably a better way to organize this without so much duplication of code, but I didn't really think about it!
Fork of your fiddle: http://jsfiddle.net/gordonwoodhull/yggohcpv/8/
EDIT: Reductio might have better shortcuts for this, but I think the principle of dividing by the total of the values in the current chart's group, rather than using a groupAll which observes all filters, is the right start.
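The principle can be checked without crossfilter or d3. In this sketch, stateGrp is a stub with made-up values that only mimics the all() shape of a group, and sum_group_plain and percentages are hypothetical helpers. In a real setup the total stays put when you click the chart itself, because a crossfilter group does not observe filters on its own dimension.

```javascript
// Stub with the same all() shape as a crossfilter group (made-up values)
var stateGrp = {
    all: function() {
        return [
            {key: 'CA', value: 6},
            {key: 'NY', value: 3},
            {key: 'TX', value: 0},
            {key: 'WA', value: 1}
        ];
    }
};

// Sum the positive values of a group, without d3
function sum_group_plain(group) {
    return group.all().reduce(function(total, kv) {
        return kv.value > 0 ? total + kv.value : total;
    }, 0);
}

// Percentage of each bin relative to its own group's total
function percentages(group) {
    var total = sum_group_plain(group);
    return group.all().map(function(kv) {
        return {key: kv.key, value: total > 0 ? kv.value * 100 / total : 0};
    });
}

var pct = percentages(stateGrp);
```

Computing the total once per redraw, as percentages does, would also avoid re-summing the group inside every valueAccessor call.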

How to show "missing" rows in a rowChart using crossfilter and dc.js?

I'm using code similar to that in the dc.js annotated example:
var ndx = crossfilter(data);
...
var dayName=["0.Sun","1.Mon","2.Tue","3.Wed","4.Thu","5.Fri","6.Sat"];
var dayOfWeek = ndx.dimension(function (d) {
var day = d.dd.getDay();
return dayName[day];
});
var dayOfWeekGroup = dayOfWeek.group();
var dayOfWeekChart = dc.rowChart("#day-of-week-chart");
dayOfWeekChart.width(180)
.height(180)
.group(dayOfWeekGroup)
.label(function(d){return d.key.substr(2);})
.dimension(dayOfWeek);
The issue I've got is that only days of the week present in the data are displayed in my rowChart, and there's no guarantee every day will be represented in all of my data sets.
This is desirable behaviour for many types of categories, but it's a bit disconcerting to omit them for short and well-known lists like day and month names and I'd rather an empty row was included instead.
For a barChart, I can use .xUnits(dc.units.ordinal) and something like .x(d3.scale.ordinal().domain(dayName)).
Is there some way to do the same thing for a rowChart so that all days of the week are displayed, whether present in data or not?
From my understanding of the crossfilter library, I need to do this at the chart level, and the dimension is OK as is. I've been digging around in the dc.js 1.6.0 api reference, and the d3 scales documentation but haven't had any luck finding what I'm looking for.
Solution
Based on @Gordon's answer, I've added the following function:
function ordinal_groups(keys, group) {
return {
all: function () {
var values = {};
group.all().forEach(function(d, i) {
values[d.key] = d.value;
});
var g = [];
keys.forEach(function(key) {
g.push({key: key,
value: values[key] || 0});
});
return g;
}
};
}
Calling this as follows will fill in any missing rows with 0s:
.group(ordinal_groups(dayName, dayOfWeekGroup))
Actually, I think you are better off making sure that the groups exist before passing them off to dc.js.
One way to do this is the "fake group" pattern described here:
https://github.com/dc-js/dc.js/wiki/FAQ#filter-the-data-before-its-charted
This way you can make sure the extra entries are created every time the data changes.
Are you saying that you tried adding the extra entries to the ordinal domain and they still weren't represented in the row chart, whereas this did work for bar charts? That sounds like a bug to me. Specifically, it looks like support for ordinal domains needs to be added to the row chart.

d3 linechart - Show 0 on the y-axis without passing in all points?

I have a line chart. Its purpose is to show the amount of transactions per user over a given time period.
To do this I'm getting the dates of all users' transactions. I'm working off this example: http://bl.ocks.org/mbostock/3884955 and have the line chart rendering fine.
My x-axis is time and the y-axis is number of transactions. The problem I have is to do with displaying dates when there is no activity.
Say I have 4 transactions on Tuesday and 5 transactions on Thursday. I need to show that there have been 0 transactions on Wednesday. As no data exists in my database explicitly stating that a user has made no transactions on Wednesday, do I need to pass in the Wednesday time (and all other times, depending on the timeframe) with a 0 value? Or can I do it with d3? I can't seem to find any examples that fit my problem.
This seems like a pretty common issue, so I worked up an example implementation here: http://jsfiddle.net/nrabinowitz/dhW2F/2/
Relevant code:
// get the min/max dates
var extent = d3.extent(data, function(d) { return d.date; }),
// hash the existing days for easy lookup
dateHash = data.reduce(function(agg, d) {
agg[d.date] = true;
return agg;
}, {}),
// note that this leverages some "get all headers but date" processing
// already present in the example
headers = color.domain();
// make even intervals
d3.time.days(extent[0], extent[1])
// drop the existing ones
.filter(function(date) {
return !dateHash[date];
})
// and push them into the array
.forEach(function(date) {
var emptyRow = { date: date };
headers.forEach(function(header) {
emptyRow[header] = null;
});
data.push(emptyRow);
});
// re-sort the data
data.sort(function(a, b) { return d3.ascending(a.date, b.date); });
As you can see, it's a bit convoluted, but seems to work well - you make an array of evenly spaced dates using the handy d3.time.days method (an alias for d3.time.day.range), filter out those dates already present in your data, and use the remaining ones to push empty rows. One downside is that performance could be slow for a big dataset - and this assumes full rows are empty, rather than different empty dates in different series.
An alternate representation, with gaps (using line.defined) instead of zero points, is here: http://jsfiddle.net/nrabinowitz/dhW2F/3/
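For readers who want the gap-filling idea without d3, here is a minimal sketch under stated assumptions: a single series with a hypothetical count field, dates that fall exactly on UTC midnights, and a fixed 24-hour step (which is only safe in UTC, since local days can be 23 or 25 hours long around daylight saving changes). The helper fill_missing_days is not from the answer, and unlike d3.time.days it includes the last date.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: return rows covering every UTC day between the first
// and last date in `rows`, using a count of 0 for days with no data.
// rows: [{date: Date, count: number}], assumed to fall on UTC midnights.
function fill_missing_days(rows) {
    if (rows.length === 0) return [];
    var by_time = {};
    var min = Infinity, max = -Infinity;
    rows.forEach(function(row) {
        var t = row.date.getTime();
        by_time[t] = row; // index existing rows by timestamp
        if (t < min) min = t;
        if (t > max) max = t;
    });
    var filled = [];
    for (var t = min; t <= max; t += DAY_MS) {
        filled.push(by_time[t] || {date: new Date(t), count: 0});
    }
    return filled;
}

// Made-up data matching the question's example
var rows = [
    {date: new Date(Date.UTC(2013, 0, 1)), count: 4}, // a Tuesday
    {date: new Date(Date.UTC(2013, 0, 3)), count: 5}  // a Thursday
];
var filled = fill_missing_days(rows);
```

Because the loop walks the range in order, the result comes out sorted and no separate sort step is needed.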
