When users of a web app upload a CSV, I want to display on screen a sample of the uploaded data. They could upload 2 to 20 million rows, so I want to limit the number being read by d3 (for speed) and displayed (for style) to, let's say, 100 rows.
Is this possible?
The documentation didn't make the answer apparent to me. I see that it says
An optional row conversion function may be specified to map and filter row objects to a more-specific representation;
But I don't really understand these row conversion functions or other filters: can they only be applied after the file is read, or can they be used to limit the rows being read?
In your row (or accessor) function, use the index (the second argument) to limit the number of rows:
If a row conversion function is specified, the specified function is invoked for each row, being passed an object representing the current row (d), the index (i) starting at zero for the first non-header row, and the array of column names. (emphasis mine)
For instance, in the following example, I'm limiting the number of rows to 5, even if the CSV has 10 rows (I'm using a <pre> element to simulate the CSV because I can't upload a real CSV using the Stack snippet):
var data = d3.csvParse(d3.select("#csv").text(), row);
function row(d, i) {
if (i < 5) return d;
}
console.log("data length is " + data.length);
console.log("data is " + JSON.stringify(data));
pre {
display: none;
}
<script src="https://d3js.org/d3.v4.min.js"></script>
<pre id="csv">foo,bar
12,23
13,22
43,66
3,4
66,55
43,48
32,11
11,11
21,23
78,17</pre>
Bear in mind that if the CSV in fact has 20 million rows, as you said, you'll still have to wait until the whole file is downloaded. The row function only limits the data array created when you parse the CSV: it will not magically stop the download/parsing process when you reach a certain row.
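If you do need to avoid reading the whole upload, one possibility (not part of the answer above; the "#upload" input id and the 64 KB chunk size are made-up assumptions for illustration) is to slice the uploaded File and hand only the first chunk to d3.csvParse:
// Sketch: parse only the first chunk of the uploaded file, so the
// remaining millions of rows are never read into memory.
d3.select("#upload").on("change", function() {
  var file = this.files[0];
  var reader = new FileReader();
  reader.onload = function() {
    // The last line of the chunk may be cut off, so drop it before sampling.
    var rows = d3.csvParse(reader.result);
    var sample = rows.slice(0, Math.min(100, rows.length - 1));
    console.log("showing " + sample.length + " rows");
  };
  reader.readAsText(file.slice(0, 64 * 1024)); // read only the first 64 KB
});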
I am trying to complete a mortality table, using loops in Visual FoxPro. I have run into one difficulty: the math operation involves summing all the data in a column for the remaining rows, and this needs to be incorporated into a loop. The strategy I thought would work, nesting a SUM REST function inside the SCAN REST function, was not successful, and I haven't found a good alternative approach.
In FoxPro, I can successfully use the SCAN function as follows, say:
Go 1
Replace survivors WITH 1000000
SCATTER NAME oprev
SKIP
SCAN rest
replace survivors WITH (1 - oprev.prob) * oprev.survivors
SCATTER NAME oprev
ENDSCAN
(to take the mortality rates in a table and use it to compute number of survivors at each age)
Or, say:
Replace Yearslived WITH 0
SCATTER NAME oprev1
SKIP
SCAN rest
replace Yearslived WITH (oprev1.survivors + survivors) * 0.5
SCATTER NAME oprev1
ENDSCAN
In order to complete a mortality table I want to use the Yearslived and survivors data (which were produced using the SCANs above) to get life expectancy data as follows. Say we have the simplified table:
SURVIVORS YEARSLIVED LIFEEXP
100 0 ?
80 90 ?
60 70 ?
40 50 ?
20 30 ?
0 10 ?
Then each LIFEEXP record should be the sum of the remaining YEARSLIVED records divided by the corresponding Survivors record, i.e:
LIFEEXP (1) = (90+70+50+30+10)/100
LIFEEXP (2) = (70+50+30+10)/80
...and so on.
I attempted to do this with a similar SCAN approach - see below:
Go 1
SCATTER NAME Oprev2
SCAN rest
replace lifeexp WITH ((SUM yearslived Rest) - oprev2.yearslived) / oprev2.survivors
SCATTER NAME oprev2
ENDSCAN
But here I get the error message "Function name is missing)." Help tells me this is probably because the function contains too many arguments.
So I then also tried to break things down and first use SCAN just to get all of my SUM REST data, as follows:
SCAN rest
SUM yearslived REST
ENDSCAN
... in the hope that I could get this data, define it as a variable, and create a simpler SCAN function above. However, I seem to be doing something wrong here as well, as instead of getting all necessary sums (first the sum of rows 2 to end, then 3 to end, etc.), I only get one sum, of all the yearslived data. In other words, using the sample data, I am given just 250, instead of the list 250, 160, 90, 40, 10.
What am I doing wrong? And more generally, how can I create a loop in FoxPro that includes a function where you sum up all the remaining data in a specific column over and over again (first the 2nd through last record, then the 3rd through last record, and so on)?
Any help will be much appreciated!
TM
Well, you are really hiding the important details: your table's structure, sample data, and desired output. Without them this is mostly guesswork, which I hope has a good chance of being right.
You seem to be trying to do something like this:
Create Cursor Mortality (Survivors i, YearsLived i, LifeExp b)
Local ix, oprev1
For ix=100 To 0 Step -20
Insert Into Mortality (Survivors, YearsLived) Values (m.ix,0)
Endfor
Locate
Survivors = Mortality.Survivors
Skip
Scan Rest
Replace YearsLived With (m.Survivors + Mortality.Survivors) * 0.5
Survivors = Mortality.Survivors
Endscan
*** Here is the part that deals with your sum problem
Local nRecNo, nSum
Scan
* Save current record number
nRecNo = Recno()
Skip
* Sum REST after skipping to next row
Sum YearsLived Rest To nSum
* Position back to row where we started
Go m.nRecNo
* Do the replacement
Replace LifeExp With Iif(Survivors=0,0,m.nSum/Survivors)
* ENDSCAN would implicitly move to next record
Endscan
* We are done. Go first record and browse
Locate
Browse
While there are N ways to do this in VFP, this is one xBase approach, and a relatively simple one to understand, IMHO.
Where did you go wrong?
Well, you tried to use SUM as if it were a function, but it is a command. There is a SUM() aggregate function in SQL, but here you are using the xBase SUM command.
EDIT: And BTW in this code:
SCAN rest
SUM yearslived REST
ENDSCAN
What you are doing is starting a SCAN with a scope of REST, and inside the loop you are using another scoped command:
SUM yearslived REST
This does the summing over the REST of the records and leaves the record pointer at the bottom. ENDSCAN then advances it to EOF(), so the loop body effectively runs only for the first record.
I am using SAS/IML in SAS Enterprise Guide for the first time, and want to do the following:
Read some datasets into IML matrices
Average the matrices
Turn the resulting IML matrix back into a SAS data set
My input data sets look something like the following (this is dummy data - the actual sets are larger). The format of the input data sets is also the format I want from the output data sets.
data_set0: d_1 d_2 d_3
1 2 3
4 5 6
7 8 9
I proceed as follows:
proc iml;
/* set the names of the migration matrix columns */
varNames = {"d_1","d_2","d_3"};
/* 1. transform input data sets into matrices */
USE data_set_0;
READ all var _ALL_ into data_set0_matrix[colname=varNames];
CLOSE data_set_0;
USE data_set_1;
READ all var _ALL_ into data_set1_matrix[colname=varNames];
CLOSE data_set_1;
USE data_set_2;
READ all var _ALL_ into data_set2_matrix[colname=varNames];
CLOSE data_set_2;
USE data_set_3;
READ all var _ALL_ into data_set3_matrix[colname=varNames];
CLOSE data_set_3;
/* 2. find the average matrix */
matrix_sum = (data_set0_matrix + data_set1_matrix +
data_set2_matrix + data_set3_matrix)/4;
/* 3. turn the resulting IML matrix back into a SAS data set */
create output_data from matrix_sum[colname=varNames];
append from matrix_sum;
close output_data;
quit;
I've been trying loads of stuff, but nothing seems to work for me. The error I currently get reads:
ERROR: Matrix matrix_sum has not been set to a value
What am I doing wrong? Thanks up front for the help.
The above code works. In the full version of this code (this is simplified for readability) I had misnamed one of my variables.
I'll leave the question up in case somebody else wants to use SAS / IML to find an average matrix.
I am doing an iterative calculation in Maple and I want to store the resulting data (which comes in a column matrix) from each iteration in a specific column of an Excel file. For example, my data is
mydat||1:= <<11,12,13,14>>:
mydat||2:= <<21,22,23,24>>:
mydat||3:= <<31,32,33,34>>:
and so on.
I am trying to export each of them into an Excel file, and I want each data set to be stored in consecutive columns of the same file. For example, mydat||1 goes to column A, mydat||2 goes to column B, and so on. I tried something like the following.
with(ExcelTools):
for k from 1 to 3 do
Export(mydat||k, "data.xlsx", "Sheet1", "A:C"): #The problem is selecting the range.
end do:
How do I select the range appropriately here? Is there any other method to export the data and store in the way that I explained above?
There are a couple of ways to do this. The easiest is certainly to put all of your data into one data structure and then export that. For example:
mydat1:= <<11,12,13,14>>:
mydat2:= <<21,22,23,24>>:
mydat3:= <<31,32,33,34>>:
mydata := Matrix( < mydat1 | mydat2 | mydat3 > );
This stores your data in a Matrix where mydat1 is the first column, mydat2 is the second column, etc. With the data in this form, either ExcelTools:-Export or the more generic Export command will work:
ExcelTools:-Export( mydata, "data.xlsx" );
Export( "data.xlsx", mydata );
Now since you mention that you are doing an iterative calculation, you may want to write the results out column by column. Here's another method that doesn't involve the creation of another data structure to house the results. This does assume that the data in mydat1 through mydat3 has been created before the loop.
for i to 3 do
ExcelTools:-Export( cat(`mydat`,i), "data.xlsx", 1, ["A1","B1","C1"][i] );
end do;
If you want to write the data out to a file as you are building it, then just do the Export call after the creation of each of the columns, i.e.
ExcelTools:-Export( mydat1, "data.xlsx", 1, "A1" );
Note that I removed the "||" characters. These are used in Maple for concatenation and caused some issues with the second method.
I'm a beginner in Talend Open Studio, and I'm trying to do the transformation below.
From a SQL Table that contains:
DeltaStock Date
------------------------
+50 (initial stock) J0
+80 J1
-30 J2
... ...
I want to produce this table:
Stock Date
-----------
50 J0
130 J1
100 J2
... ...
Do you think this could be possible using TOS? I thought of using tAggregateRow, but I didn't find it appropriate to my issue.
There's probably an easier way to do this using the tMemorizeRows component but the first thought that comes to mind is to use the globalMap to store a rolling sum.
In Talend it is possible to store an object (any value or any type) in the globalMap so that it can be retrieved later on in the job. This is used automatically if you ever use a tFlowToIterate component which allows you to retrieve the values for that row that is being iterated on from the globalMap.
A very basic sample job might look like this:
In this we have a tJava component that only initialises the rolling sum in the globalMap with the following code:
//Initialise the rollingSum global variable
globalMap.put("rollingSum", 0);
After this we connect this component onSubjobOk to make sure we only carry on if we've managed to put the rollingSum into the globalMap.
I then provide my data using a tFixedFlowInput component which allows me to easily hardcode some values for this example job. You could easily replace this with any input. I have used your sample input data from the question:
We then process the data using a tJavaRow which will do some transformations on the data row by row. I've used the following code which works for this example:
// Pattern and Matcher come from java.util.regex; in a tJavaRow, add
// "import java.util.regex.*;" under Advanced settings > Import.
//Initialise the operator and the value variables
String operator = "";
Integer value = 0;
//Get the current rolling sum
Integer rollingSum = (Integer) globalMap.get("rollingSum");
//Extract the operator
Pattern p = Pattern.compile("^([+-])([0-9]+)$");
Matcher m = p.matcher(input_row.deltaStock);
//If we have any matches from the regular expression search then extract the operator and the value
if (m.find()) {
operator = m.group(1);
value = Integer.parseInt(m.group(2));
}
//Conditional to use the operator
if ("+".equals(operator)) {
rollingSum += value;
} else if ("-".equals(operator)) {
rollingSum -= value;
} else {
System.out.println("The operator provided wasn't a + or a -");
}
//Put the new rollingSum back into the globalMap
globalMap.put("rollingSum", rollingSum);
//Output the data
output_row.stock = rollingSum;
output_row.date = input_row.date;
There's quite a lot going on there but basically it starts by getting the current rollingSum from the globalMap.
Next, it uses a regular expression to split up the deltaStock string into an operator and a value. From this it uses the operator provided (plus or minus) to either add the deltaStock to the rollingSum or subtract the deltaStock from the rollingSum.
After this it then adds the new rollingSum back into the globalMap and outputs the 2 columns of stock and date (unchanged).
In my sample job I then output the data using a tLogRow which will print the values of the data to the console. I typically select the table formatting option in it and in this case I get the following output:
.-----+----.
|tLogRow_8 |
|=----+---=|
|stock|date|
|=----+---=|
|50 |J0 |
|130 |J1 |
|100 |J2 |
'-----+----'
Which should be what you were looking for.
You should be able to do it in Talend Open Studio.
I attach here an image with the JOB, the content of the tJavaRow and the execution result.
Under the tFixedFlowInput used to simulate the input, I left a tJDBCInput that you should use to read the data from your DB. Hopefully you can use a specific tXXXInput for your DB instead of the generic JDBC one.
Here is some simple code in the tJavaRow.
//Code generated according to input schema and output schema
output_row.delta = input_row.delta;
output_row.date = input_row.date;
// "rollingSum" must already be in the globalMap (e.g. initialised to 0 in a
// tJava component run before this subjob), otherwise get() returns null.
output_row.rollingSum = Integer.parseInt(globalMap.get("rollingSum").toString());
int delta = Integer.parseInt(input_row.delta);
output_row.rollingSum += delta;
// Save rolling SUM for next round
globalMap.put("rollingSum", output_row.rollingSum);
Beware of the exceptions from parseInt(); you should handle them the way you feel is right. In my projects I usually have a SafeParse library that does not throw exceptions but instead returns a default value that I can pass in together with the value to be parsed.
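As an illustration only (this is a hypothetical sketch of the idea, not the actual library), such a helper could look like this:
public final class SafeParse {
    private SafeParse() {}

    // Returns defaultValue instead of throwing when the input is null or not a valid integer.
    public static int parseInt(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
In the tJavaRow above you would then call something like SafeParse.parseInt(input_row.delta, 0) instead of Integer.parseInt() directly.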
I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but actual data should be sorted based on time.
I would like to generate a set of records which would be input to graph plotting function which needs continuous time series. I would like to fill in missing values, i.e. if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate entry for "2013-07-09T19:18Z" with predefined value.
My thoughts on doing this:
Use MIN and MAX to find the start and end date in the series
Write a UDF which takes min and max and returns a relation with the missing timestamps
Join above 2 relations
I cannot get my head around how to implement this in Pig, though. Would appreciate any help.
Thanks!
Generate another file using a script (outside Pig) with all timestamps between MIN and MAX, including MIN and MAX. Load this as a second data set. Here is a sample that I used from your data set; note that I filled in only a few of the gaps, not all of them (a sketch of one way to generate such a file follows the sample).
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
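For illustration, a minimal sketch of such a generator script (assuming minute granularity, the same timestamp format as the input, and d1,d2 as placeholder values; the MIN and MAX shown are taken from the sample data):
# Sketch: print one row per minute between MIN and MAX, in the same
# format as the input, with placeholder field values.
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%MZ"
start = datetime.strptime("2013-07-09T01:17Z", FMT)  # MIN from the sample data
end = datetime.strptime("2013-07-09T21:17Z", FMT)    # MAX from the sample data

t = start
while t <= end:
    print(t.strftime(FMT) + ",d1,d2")
    t += timedelta(minutes=1)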
Do a COGROUP on the original dataset and the generated dataset above. Use a nested FOREACH GENERATE to write the output dataset: if the bag from the first dataset is empty, use the values from the second dataset to generate the output record, otherwise use the first dataset. Here is the piece of code I used on these two datasets.
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
x = COUNT(Org_Set);
y = (x == 0? (Default_set.fl1, Default_set.fl2): (Org_Set.fl1, Org_Set.fl2));
GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
If you need further clarification or help, let me know.
In addition to @Rags' answer, you could use the STREAM x THROUGH command and a simple awk script (similar to this one) to generate the date range once you have the min and max dates. Something similar to the following (untested! You might need to put the awk script on a single line with semicolon command delimiters, or better, ship it as a script file):
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
split($1,s,"/")
split($2,e,"/")
st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
for (i=st;i<=et;i+=60*60*24) print strftime("%Y/%m/%d",i)
}'`;