How can I sum binned time series using d3.js? - d3.js

I want a simple graph like:
The data I have is a simple list of transactions with two properties:
timestamp
amount
I tried d3.layout.histogram().bins() but it seems it only supports counting the transactions.
I mustn't be the only one looking for that, am I ?

Ok, so the IRC folks helped me out and pointed to nest, which works great (this is CoffeeScript):
nested_data = d3.nest()
.key((d) -> d3.time.day(d.timestamp))
.rollup((a) -> d3.sum(a, (d) -> d.amount))
.entries(incoming_data) # An array of {timestamp: ..., amount: ...} objects
# Optional
nested_data.map (d) ->
d.date = new Date(d.key)
The trick here is d3.time.day which takes a timestamp, and tells you which day (12 a.m. in the night) that timestamp belongs to. This function and the other ones like d3.time.week, etc.. can bin timeseries very well.
The other trick is the nest().rollup() function, which after being grouped by key(), sum all of the events on a given day.
Last thing I wanted, was to interpolate empty values on the days where I had no transactions. This is the last part of the code:
# Interpolate empty vals
nested_data.sort((a, b) -> d3.descending(a.date, b.date))
ex = d3.extent(nested_data, (d) -> d.date)
each_day = d3.time.days(ex[0], ex[1])
# Build a hashmap with the days we have
data_hash = {}
angular.forEach(data, (d) ->
data_hash[d.date] = d.values
)
# Build a new array for each day, including those where we didn't have transactions
new_data = []
angular.forEach(each_day, (d) ->
val = 0
if data_hash[d]
val = data_hash[d]
new_data.push({date: d, values: val})
)
final_data = new_data
Hope this helps somebody!

The histogram code doesn't support this, but you can easily do the binning yourself. Assuming that you have a date and a count for each transaction, you can bin by day like this.
var bins = {};
transactions.forEach(function(t) {
var key = t.date.toDateString();
bins[key] = bins[key] || 0;
bins[key] += t.amount;
});
You can obviously parse the date string back into a date if you need it; the point of using .toDateString() here is that the time part is chopped off and everything binned by day. If you want to bin by another time interval, you can use the same technique and extract a different part of the date.

Related

When using D3.js, is there a method to automatically get the closest `key` that has data?

When we use D3.js, let's say for simplicity, we have data, which are stock prices:
2021-03-18 $38.10
2021-03-19 $38.60
2021-03-22 $38.80
and we use D3 to plot a line chart for the stock price, and then move the mouse around to "hover" above the prices, and it'd show the price for that date.
Right now I am using
d3.select("svg").on("mousemove", (ev) => {
const hour = xScale.invert(ev.offsetX - dimensions.margin.left).getHours();
to get the hour of where the user is hovering on. The xScale is a scale function scaleTime() from domain to range, and xScale.invert is the function that convert the range back to domain.
If the hour is 12pm or later, I consider it the next day, and if it is before or equal, I consider it the same day. This is because the stock price of 2021-03-19 is considered to be at 12:00am (the midnight), so if I am getting to 9pm, for example, the mouse cursor is really close to the next day.
And then, let's say I identified that it is 2021-03-20, then I check whether there is stock price data. But since it was a Saturday and has no stock data, I use a function to check by the following method:
I first would go back to 2021-03-19 and see if there is a stock price. (I first build a lookup table to map date to data). If there is, then use it.
But if there isn't, I just use the delta of 1 day and move further and further, so it would go to 2021-03-21 and then increment the delta and use -delta to check for 2021-03-18, so I just use a point and go "later" and "before" with an increasing delta, until I am able to find a price
In other words, I have a "first candidate" and a "second candidate". If the first candidate has data, then use it. Otherwise, try the second candidate. If still not work, then work from the first candidate and use delta of 1 day and move "later" or "before", and if not work, use a delta of 2 days, and 3 days, until I am able to find a date with data.
Then I use this price to show on screen, to report what the date and price is
But this method is a bit low level. Does D3.js already have a method to directly do that: to spit out an invert number, which is closest to the key that has data in the dataset?
There are several functions provided by d3.js which can be used, depending on the exact situation:
1. You operate in screen space and want a mapping of the current mouse position on the closest point of the visualization which represents a single data object
In that case, you would probably want to use d3-delaunay.
d3-delaunay is a fast library for computing the Voronoi diagram of a
set of two-dimensional points. One can use delaunay.find to identify the data point closest to the pointer. Here is one example.
2. If you operate in the data domain (e.g. because you have already inverted the mouse position to the data domain)
As #Gerardo Furtado points out, you can use d3.bisect.
d3.bisect finds the position into which a given value can be inserted
into a sorted array while maintaining sorted order. If the value
already exists in the array, d3.bisect will find its position
efficiently. Here is one
example.
See also: D3: What is a Bisector? and d3.bisector using Date() Object does not resolve
Another option d3.js provides is d3.scaleThreshold.
Threshold scales allow you to map arbitrary subsets of the domain to discrete values in the range. The input domain is still continuous, and divided into slices based on a set of threshold values.
The idea is the following:
You create a d3.scaleThreshold to map any date (= continuous domain) to the fixed set of valid dates given your data by mapping it to the closest date. For that you have to specify the domain as an array of n - 1 dates which are residing in between the n valid dates. The range is the array of the valid dates.
It might not be as efficient as d3.bisect depending on your data.
const data_original = [{ date: "2021-03-18", value: "38.10"},
{ date: "2021-03-19", value: "38.60"},
{ date: "2021-03-22", value: "38.80"},
];
const data_types_converted = data_original.map(d => ({"date": new Date(d.date), "value": +d.value}));
const data_just_dates = data_types_converted.map(d => d.date);
let newDate = new Date("2021-03-17");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-18");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-19");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-20");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-21");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-22");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
newDate = new Date("2021-03-23");
console.log(newDate + " -> " + getClosestDate(newDate, data_just_dates));
function getClosestDate(newDate, validDates) {
const domain = [];
let midday_local;
let midday_UTC;
validDates.forEach((d,i) => {
if (i < validDates.length - 1) {
midday_local = new Date((validDates[i].getTime() + validDates[i + 1].getTime()) / 2); // midday in local time
midday_UTC = convertDateToUTC(midday_local); // midday in UTC time
domain.push(midday_UTC);
}
});
const scale = d3.scaleThreshold()
.domain(domain)
.range(validDates);
return scale(newDate)
}
function convertDateToUTC(date) {
return new Date(
date.getUTCFullYear(),
date.getUTCMonth(),
date.getUTCDate(),
date.getUTCHours(),
date.getUTCMinutes(),
date.getUTCSeconds()
);
}
<script src="https://d3js.org/d3.v6.min.js"></script>

How to find min and max value in a array of objects

I am new to D3 and Javascript community. I have an array of objects(countries) and each object consists of years represented by key and its value(incomes).
How to get the minimum and maximum value of incomes across all the objects and its key to set the color legend?
Please find the link https://observablehq.com/d/287af954b6aaf349 of the project, hope it would help to explain the question.
Thank you for reading the questions.
To get the global maximum, you can use
const max = d3.max(
countries.map(country =>
d3.max(Object.values(country))));
// 179000
What it does is iterate over the countries, grabbing from each country the maximum of all the values. Then, it takes the maximum of these maxima and returns that.
To also get the corresponding year, it becomes more complex. You can use Object.entries to get the key value pair, instead of just the value.
You can then manually loop over each country, updating the max value, and the corresponding year as you go
data.reduce((max, country) => {
const newMax = d3.max(Object.values(country));
if(newMax > max.value) {
return {
year: Object.entries(country).find(([year, value]) => value === newMax)[0],
value: newMax,
};
}
return max;
}, {value: 0});
// {year: "2040", value: 179000}
Another useful method backed in d3 is the d3.extent() function check this article

Sort a range or array based on two columns that contain the date and time

Currently I'm trying to create a Google Apps Script for Google Sheets which will allow adding weekly recurring events, batchwise, for upcoming events. My colleagues will then make minor changes to these added events (e.g. make date and time corrections, change the contact person, add materials neccessary for the event and so forth).
So far, I have written the following script:
function CopyWeeklyEventRows() {
var ss = SpreadsheetApp.getActiveSheet();
var repeatingWeeks = ss.getRange(5,1).getValue(); // gets how many weeks it should repeat
var startDate = ss.getRange(6, 1).getValue(); // gets the start date
var startWeekday = startDate.getDay(); // gives the weekday of the start date
var regWeek = ss.getRange(9, 2, 4, 7).getValues(); // gets the regular week data
var regWeekdays = new Array(regWeek.length); // creates an array to store the weekdays of the regWeek
var ArrayStartDate = new Array(startDate); // helps to store the We
for (var i = 0; i < regWeek.length; i++){ // calculates the difference between startWeekday and each regWeekdays
regWeekdays[i] = regWeek[i][1].getDay() - startWeekday;
Logger.log(regWeekdays[i]);
// Add 7 to move to the next week and avoid negative values
if (regWeekdays[i] < 0) {
regWeekdays[i] = regWeekdays[i] + 7;
}
// Add days according to difference between startWeekday and each regWeekdays
regWeek[i][0] = new Date(ArrayStartDate[0].getTime() + regWeekdays[i]*3600000*24);
}
// I'm struggling with this line. The array regWeek is not sorted:
//regWeek.sort([{ column: 1, ascending: true }]);
ss.getRange(ss.getLastRow() + 1, 2, 4, 7).setValues(regWeek); // copies weekly events after the last row
}
It allows to add one week of recurring events to the overview section of the spreadsheet based on a start date. If the start date is a Tuesday, the regular week is added starting from a Tuesday. However, the rows are not sorted according to the dates:
.
How can the rows be sorted by ascending date (followed by time) before adding them to the overview?
My search for similar questions revealed Google Script sort 2D Array by any column which is the closest hit I've found. The same error message is shown when running my script with the sort line. I don't understand the difference between Range and array yet which might help to solve the issue.
To give you a broader picture, here's what I'm currently working on:
I've noticed that the format will not necessarily remain when adding
new recurring events. So far I haven't found the rule and formatted by
hand in a second step.
A drawback is currently that the weekly recurring events section is
fixed. I've tried to find the last filled entry and use it to set the
range of regWeek, but got stuck.
Use the column A to exclude recurring events from the addition
process using a dropdown.
Allow my colleagues to add an event to the recurring events using a
dropdown (e.g. A26). This event should then be added with sorting to
the right day of the week and start time. The sorting will come in
handy.
Thanks in advance for your input regarding the sorting as well as suggestions on how to improve the code in general.
A demo version of the spreadsheet
UpdateV01:
Here the code lines which copy and sort (first by date, then by time)
ss.getRange(ss.getLastRow()+1,2,4,7).setValues(regWeek); // copies weekly events after the last row
ss.getRange(ss.getLastRow()-3,2,4,7).sort([{column: 2, ascending: true}, {column: 4, ascending: true}]); // sorts only the copied weekly events chronologically
As #tehhowch pointed out, this is slow. Better to sort BEFORE writing.
I will implement this method and post it here.
UpdateV02:
regWeek.sort(function (r1, r2) {
// sorts ascending on the third column, which is index 2
return r1[2] - r2[2];
});
regWeek.sort(function (r1, r2) {
// r1 and r2 are elements in the regWeek array, i.e.
// they are each a row array if regWeek is an array of arrays:
// Sort ascending on the first column, which is index 0:
// if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
return r1[0] - r2[0];
});
UpdateV03:
Here an attempt to repeat the recurring events over several weeks. Don't know yet how to include the push for the whole "week".
// Repeat week for "A5" times and add to start/end date
for (var j = 0; j < repeatingWeeks; j++){
for (var i = 0; i < numFilledRows; i++){
regWeekRepeated[i+j*6][0] = new Date(regWeek[i][0].getTime() + j*7*3600000*24); // <-This line leads to an error message
regWeekRepeated[i+j*6][3] = new Date(regWeek[i][3].getTime() + j*7*3600000*24);
}
}
My question was answered and I was able to make the code work as intended.
Given your comment - you want to sort the written chunk - you have two methods available. One is to sort written data after writing, by using the Spreadsheet service's Range#sort(sortObject) method. The other is to sort the data before writing, using the JavaScript Array#sort(sortFunction()) method.
Currently, your sort code //regWeek.sort([{ column: 1, ascending: true }]); is attempting to sort a JavaScript array, using the sorting object expected by the Spreadsheet service. Thus, you can simply chain this .sort(...) call to your write call, as Range#setValues() returns the same Range, allowing repeated Range method calling (e.g. to set values, then apply formatting, etc.).
This looks like:
ss.getRange(ss.getLastRow() + 1, 2, regWeek.length, regWeek[0].length)
.setValues(regWeek)
/* other "chainable" Range methods you want to apply to
the cells you just wrote to. */
.sort([{column: 1, ascending: true}, ...]);
Here I have updated the range you access to reference the data you are attempting to write - regWeek - so that it is always the correct size to hold the data. I've also visually broken apart the one-liner so you can better see the "chaining" that is happening between Spreadsheet service calls.
The other method - sorting before writing - will be faster, especially as the size and complexity of the sort increases. The idea behind sorting a range is you need to use a function that returns a negative value when the first index's value should come before the second's, a positive value when the first index's value should come after the second's, and a zero value if they are equivalent. This means a function that returns a boolean is NOT going to sort as one thinks, since false and 0 are equivalent in Javascript, while true and 1 are also equivalent.
Your sort looks like this, assuming regWeek is an array of arrays and you are sorting on numeric values (or at least values which will cast to numbers, like Dates).
regWeek.sort(function (r1, r2) {
// r1 and r2 are elements in the regWeek array, i.e.
// they are each a row array if regWeek is an array of arrays:
// Sort ascending on the first column, which is index 0:
// if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
return r1[0] - r2[0];
});
I strongly recommend reviewing the Array#sort documentation.
You could sort the "Weekly Events" range before you set the regWeek variable. Then the range would be in the order you want before you process it. Or you could sort the whole "Overview" range after setting the data. Here's a quick function you can call to sort the range by multiple columns. You can of course tweak it to sort the "Weekly Events" range instead of the "Overview" range.
function sortRng() {
var ss = SpreadsheetApp.getActiveSheet();
var firstRow = 22; var firstCol = 1;
var numRows = ss.getLastRow() - firstRow + 1;
var numCols = ss.getLastColumn();
var overviewRng = ss.getRange(firstRow, firstCol, numRows, numCols);
Logger.log(overviewRng.getA1Notation());
overviewRng.sort([{column: 2, ascending: true}, {column: 4, ascending: true}]);
}
As for getting the number of filled rows in the Weekly Events section, you need to search a column that will always have data if any row has data (like the start date column b), loop through the values and the first time it finds a blank, return that number. That will give you the number of rows that it needs to copy. Warning: if you don't have at least one blank value in column B between the Weekly Events section and the Overview section, you will probably get unwanted results.
function getNumFilledRows() {
var ss = SpreadsheetApp.getActiveSheet();
var eventFirstRow = 9; var numFilledRows = 0;
var colToCheck = 'B';//the StartDate col which should always have data if the row is filled
var vals = ss.getRange(colToCheck + eventFirstRow + ":" + colToCheck).getValues();
for (i = 0; i < vals.length; i++) {
if (vals[i][0] == '') {
numFilledRows = i;
break;
}
}
Logger.log(numFilledRows);
return numFilledRows;
}
EDIT:
If you just want to sort the array in javascript before writing, and you want to sort by Start Date first, then by Time of day, you could make a temporary array, and add a column to each row that is date and time combined. array.sort() sorts dates alphabetically, so you would need to convert that date to an integer. Then you could sort the array by the new column, then delete the new column from each row. I included a function that does this below. It could be a lot more compact but I thought it might be more legible like this.
function sortDates() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var vals = ss.getActiveSheet().getRange('B22:H34').getDisplayValues(); //get display values because getValues returns time as weird date 1899 and wrong time.
var theDate = new Date(); var newArray = []; var theHour = ''; var theMinutes = '';
var theTime = '';
//Create a new array that inserts date and time as the first column in each row
vals.forEach(function(aRow) {
theTime = aRow[2];//hardcoded - assumes time is the third column that you grabbed
//get the hours (before colon) as a number
theHour = Number(theTime.substring(0,theTime.indexOf(':')));
//get the minutes(after colon) as a number
theMinutes = Number(theTime.substring(theTime.indexOf(':')+1));
theDate = new Date(aRow[0]);//hardcoded - assumes date is the first column you grabbed.
theDate.setHours(theHour);
theDate.setMinutes(theMinutes);
aRow.unshift(theDate.getTime()); //Add the date and time as integer to the first item in the aRow array for sorting purposes.
newArray.push(aRow);
});
//Sort the newArray based on the first item of each row (date and time as number)
newArray.sort((function(index){
return function(a, b){
return (a[index] === b[index] ? 0 : (a[index] < b[index] ? -1 : 1));
};})(0));
//Remove the first column of each row (date and time combined) that we added in the first step
newArray.forEach(function(aRow) {
aRow.shift();
});
Logger.log(newArray);
}

Spark - How to count number of records by key

This is probably an easy problem but basically I have a dataset where I am to count the number of females for each country. Ultimately I want to group each count by the country but I am unsure of what to use for the value since there is not a count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using a reduceByKey() but that requires a key-value pair and I only want to count the key and make a counter as the value. How do I go about this?
val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")
Here is where I am stuck. The country is index 13 in the dataset also.
The output should something look like this:
(Australia, 201000)
(America, 420000)
etc
Any help would be great.
Thanks
You're nearly there! All you need is a countByValue:
val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Prints (Australia, 230), (America, 23242), etc.
(In your example, I assume you meant x(10) rather than x._10)
All together:
sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x(10) == "Female")
.map(_(13))
.countByValue()
Have you considered manipulating your RDD using the Dataframes API ?
It looks like you're loading a CSV file, which you can do with spark-csv.
Then it's a simple matter (if your CSV is titled with the obvious column names) of:
import com.databricks.spark.csv._
val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits by field
.filter($"gender" === "Female")
.groupBy("country").count().show()
If you want to go deeper in this kind of manipulation, here's the guide:
https://spark.apache.org/docs/latest/sql-programming-guide.html
You can easily create a key, it doesn't have to be in the file/database. For example:
val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x._10 == "Female")
.map(x => (x._13, x._10)) // <<<< here you generate a new key
.groupByKey();

Labeled LDA learn in Stanford Topic Modeling Toolbox

It's ok when I run the example-6-llda-learn.scala as follows:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
CaseFolder() ~> // lowercase everything
WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
MinimumLengthFilter(3) // take terms with >=3 characters
}
val text = {
source ~> // read from the source file
Column(4) ~> // select column containing text
TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
TermCounter() ~> // collect counts (needed below)
TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
// define fields from the dataset we are going to slice against
val labels = {
source ~> // read from the source file
Column(2) ~> // take column two, the year
TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
TermCounter() ~> // collect label counts
TermMinimumDocumentCountFilter(10) // filter labels in < 10 docs
}
val dataset = LabeledLDADataset(text, labels);
// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);
// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But it's not ok when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
And the method of CVB0 cost much memory.I train a corpus of 10,000 documents with about 10 labels each document,it will cost 30G memory.
I've encountered the same situation and indeed I believe it's a bug. Check GIbbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, from line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
(countTopicTerm(z)(term)+termSmooth) /
(countTopic(z)+termSmoothDenom);
doc.labels is self-explanatory, and doc.theta records the distribution (counts, actually) of its labels, which has the same size as doc.labels.
zI is index variable iterating doc.labels, while the value z gets the actual label number. Here comes the problem: it's possible this documents has only one label - say 1000 - therefore zI is 0 and z is 1000, then doc.theta(z) gets out of range.
I suppose the solution would be to modify doc.theta(z) to doc.theta(zI).
(I'm trying to check whether the results would be meaningful, anyway this bug has made me not so confident in this toolbox.)

Resources