DolphinDB: how to divide a time period based on given conditions?

Suppose the given stock data are in 1-minute intervals. I'm trying to split the time series so that each window covers roughly 1.5 million shares traded per stock, which yields data records with time windows of varying length. Here "around every 1.5 million" means the volume at a time point is added to the current window if doing so brings the cumulative volume closer to 1.5 million; otherwise it is not added and a new window begins. For example, if the cumulative volume is 1.4 million and the next bar trades 0.3 million shares, adding it gives 1.7 million, overshooting the target by 0.2 million, which is more than the 0.1 million shortfall, so the bar starts a new window instead.

It can be achieved with the following script. The key to grouping the data lies in the expression: iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill()
// Define an accumulate function `calcCumVol`. If the value at this point belongs in the current group,
// it returns the accumulated volume. Otherwise it starts a new group and returns the volume at the current point.
def calcCumVol(target, a, b){
    newVal = a + b
    if(newVal < target) return newVal
    else if(newVal - target > target - a) return b
    else return newVal
}
// import data
t = loadText("f:/DolphinDB/sample.csv")
// The key lies in the expression iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill()
// If the accumulated value == volume, a new group begins.
// If it is the start of a new group, record the current time; otherwise leave it NULL and fill it with the function `ffill`. As a result, all records in the same group share the same start time.
output = select first(wind_code) as wind_code, first(date) as date, sum(volume) as sum_volume, last(time) as endTime from t group by iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill() as startTime
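To see the grouping logic in isolation, accumulate can first be applied to a small volume vector (a minimal sketch; the volume numbers are invented for illustration):

v = [800000, 600000, 300000, 900000, 700000]
accumulate(calcCumVol{1500000}, v)
// expected result: [800000, 1400000, 300000, 1200000, 700000]
// wherever the result equals the input element itself, a new group starts at that point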

Related

Is there any better way to check if the same data is present in a table in .NET Core 3.1?

I'm pulling data from a third-party API that runs multiple times a day. If the same data is already present in the table, the record should be ignored; if there are any changes, the record should be updated; and anything new in the received JSON should be inserted.
I'm using the code below to insert new data.
var input = JsonConvert.DeserializeObject<List<DeserializeLookup>>(resultJson).ToList();
var entryset = input.Select(y => new Lookup
{
    lookupType = "JOBCODE",
    code = y.Code,
    description = y.Description,
    isNew = true,
    lastUpdatedDate = DateTime.UtcNow
}).ToList();
await _context.Lookup.AddRangeAsync(entryset);
await _context.SaveChangesAsync();
But after the first run, when the API runs again, it inserts the same data into the table again, so duplicate entries accumulate. To handle this, I used a foreach loop as below before inserting data into the table.
foreach (var item in input)
{
    if (!_context.Lookup.Any(r => r.code == item.Code))
    {
        //above insert code
    }
}
But this doesn't work as expected either. Also, the API takes a long time to run with the foreach loop in place. Is there a solution to this in .NET Core 3.1?
It will be better if you try it this way: collect the new items first, then insert them in a single batch.
List<Lookup> newList = new();
foreach (var item in input)
{
    if (!_context.Lookup.Any(r => r.code == item.Code))
    {
        // map the incoming record to the entity type before queuing it for insert
        newList.Add(new Lookup
        {
            lookupType = "JOBCODE",
            code = item.Code,
            description = item.Description,
            isNew = true,
            lastUpdatedDate = DateTime.UtcNow
        });
    }
}
await _context.Lookup.AddRangeAsync(newList);
await _context.SaveChangesAsync();
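Since _context.Lookup.Any(...) inside the loop issues one database query per item, a further speed-up is to fetch the existing codes once and test against an in-memory set. A minimal sketch under the same model assumptions:

// One round trip: load all existing codes into memory.
var existingCodes = new HashSet<string>(_context.Lookup.Select(r => r.code));
// Filter in memory instead of querying the database per item.
var newItems = input.Where(item => !existingCodes.Contains(item.Code)).ToList();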
I'm on my phone, so forgive me for not being able to format the code in my response. The solution to your problem is something I actually just encountered myself while syncing data from an Azure Function and a third-party app into a SQL database.
Depending on your table schema, you need one column with a unique identifier. Make this column a primary key (the first step to preventing duplicates). Here's a resource for that: https://www.w3schools.com/sql/sql_primarykey.ASP
The next step is your stored procedure. You'll need to perform what's commonly referred to as an UPSERT: merging a table with the incoming data on a specified column (whichever is your primary key).
That would look something like this:
MERGE Table_1 AS T1
USING Incoming_Data AS source
ON T1.column1 = source.column1
-- you can use an AND / OR operator here to match on additional values or combinations
WHEN MATCHED THEN
    UPDATE SET T1.column2 = source.column2
    -- etc. for more columns
WHEN NOT MATCHED THEN
    INSERT (column1, column2, column3)
    VALUES (source.column1, source.column2, source.column3);
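To run such a MERGE from .NET Core, one option is to wrap it in a stored procedure and call it through EF Core. A minimal sketch; dbo.UpsertLookup and its parameters are hypothetical names, not part of the question's schema:

// Hypothetical stored procedure wrapping the MERGE above.
foreach (var item in input)
{
    await _context.Database.ExecuteSqlRawAsync(
        "EXEC dbo.UpsertLookup @code = {0}, @description = {1}",
        item.Code, item.Description);
}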
First of all, you should decouple the format in which you receive your data from your actual data handling. In your case: get rid of the JSON before you actually interpret the data.
Alas, I haven't got a clue what your data represents, so let's assume your data is a sequence of customer Orders. When you get new data, you want to add all new Orders, and you want to update changed Orders.
So somewhere you have a method with input your json data, and as output a sequence of Orders:
IEnumerable<Order> InterpretJsonData(string jsonData)
{
    ...
}
You know JSON better than I do; besides, this conversion is a bit beside your question.
You wrote:
So, if the same data is present in the table it should ignore that record, else if there are any changes it should update that record or insert a new record
You need an Equality Comparer
To detect whether there are added or changed customer Orders, you need something to detect whether Order A equals Order B. There must be at least one unique field by which you can identify an Order, even if all the other values of the Order are changed.
This unique value is usually called the primary key, or the Id. I assume your Orders have an Id.
So if your new Order data contains an Id that was not available before, then you are certain that the Order was Added.
If your new Order data has an Id that was already in previously processed Orders, then you have to check the other values to detect whether it was changed.
For this you need equality comparers: one that says two Orders are equal if they have the same Id, and one that checks all values for equality.
A standard pattern is to derive your comparer from class EqualityComparer<Order>
class OrderComparer : EqualityComparer<Order>
{
    public static IEqualityComparer<Order> ByValue = new OrderComparer();
    ... // TODO implement
}
First I'll show you how to use this to detect additions and changes; then I'll show you how to implement it.
Somewhere you have access to the already processed Orders:
IEnumerable<Order> GetProcessedOrders() {...}
var jsondata = FetchNewJsonOrderData();
// convert the jsonData into a sequence of Orders
IEnumerable<Order> orders = this.InterpretJsonData(jsondata);
To detect which Orders are added or changed, you could make a Dictionary of the already processed Orders and check the incoming Orders one by one to see if they have changed:
IEqualityComparer<Order> comparer = OrderComparer.ByValue;
Dictionary<int, Order> processedOrders = this.GetProcessedOrders()
    .ToDictionary(order => order.Id);

foreach (Order order in orders)
{
    if (processedOrders.TryGetValue(order.Id, out Order originalOrder))
    {
        // order already existed. Is it changed?
        if (!comparer.Equals(order, originalOrder))
        {
            // unequal!
            this.ProcessChangedOrder(order);
            // remember the changed values of this Order
            processedOrders[order.Id] = order;
        }
        // else: no changes, nothing to do
    }
    else
    {
        // Added!
        this.ProcessAddedOrder(order);
        processedOrders.Add(order.Id, order);
    }
}
Immediately after processing the changed or added Order, I remember its new value, because the same Order might be changed again.
If you want this in a LINQ fashion, you have to GroupJoin the Orders with the processed Orders, to get "Orders with their zero or more previously processed Orders" (there will usually be zero or one previously processed Order).
var ordersWithPreviouslyProcessedOrder = orders.GroupJoin(this.GetProcessedOrders(),
    order => order.Id,                   // from every Order take the Id
    processedOrder => processedOrder.Id, // from every previously processed Order take the Id
    // parameter resultSelector: from every Order, with its zero or more previously
    // processed Orders, make one new:
    (order, previouslyProcessedOrders) => new
    {
        Order = order,
        ProcessedOrder = previouslyProcessedOrders.FirstOrDefault(),
    })
    .ToList();
I use GroupJoin instead of Join, because this way I also get the "Orders that have no previously processed Orders" (= new Orders). If you used a simple Join, you would not get them.
I do a ToList, so that in the next statements the group join is not done twice:
var addedOrders = ordersWithPreviouslyProcessedOrder
    .Where(orderCombi => orderCombi.ProcessedOrder == null);
var changedOrders = ordersWithPreviouslyProcessedOrder
    .Where(orderCombi => !comparer.Equals(orderCombi.Order, orderCombi.ProcessedOrder));
Implementation of "Compare by Value"
// equal if all values equal
public override bool Equals(Order x, Order y)
{
    if (x == null) return y == null; // true if both null, false if x null but y not null
    if (y == null) return false;     // because x is not null
    if (Object.ReferenceEquals(x, y)) return true;
    if (x.GetType() != y.GetType()) return false;
    // compare all properties one by one:
    return x.Id == y.Id
        && x.Date == y.Date
        && ...
}
For GetHashCode there is one rule: if X equals Y, then they must have the same hash code. If they are not equal, there is no rule, but lookups are more efficient if unequal objects tend to have different hash codes. Make a trade-off between calculation speed and hash code uniqueness.
In this case: If two Orders are equal, then I am certain that they have the same Id. For speed I don't check the other properties.
public override int GetHashCode(Order x)
{
    if (x == null)
        return 0x34339d98; // just a hash code for all null Orders
    else
        return x.Id.GetHashCode();
}
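Putting the pieces together, the comparer might look like this (a compile-ready sketch; the Date property and the null-Order hash constant are illustrative assumptions):

class OrderComparer : EqualityComparer<Order>
{
    public static IEqualityComparer<Order> ByValue = new OrderComparer();

    public override bool Equals(Order x, Order y)
    {
        if (x == null) return y == null;
        if (y == null) return false;
        if (Object.ReferenceEquals(x, y)) return true;
        if (x.GetType() != y.GetType()) return false;
        return x.Id == y.Id
            && x.Date == y.Date; // extend with the remaining Order properties
    }

    public override int GetHashCode(Order x)
    {
        return x == null ? 0x34339d98 : x.Id.GetHashCode();
    }
}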

DAX IF measure - return fixed value

This should be a very simple requirement, but it seems impossible to implement in DAX.
Data model: a User lookup table joined to many "Cards" linked to each user.
I have a measure setup to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1, 16, <measureA>)
If User.boolean = 1, I want to return a fixed value of 16, effectively bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition; it throws an error.
I can modify measureA itself to return 0 if User.boolean = 1:
<measureA> =
CALCULATE (
    COUNTROWS ( CardUser ),
    FILTER ( User, User[boolean] <> 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX; you just need to learn the "X" functions (aka "iterators"):
Measure B =
SUMX (
    VALUES ( User[boolean] ),
    IF ( User[boolean], [Measure A], 16 )
)
The VALUES function generates a list of the distinct User[boolean] values (1 and 0 in this case). SUMX then iterates this list and applies the IF logic to each value.
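Note that the question asks for 16 only when User.boolean = 1; under that reading, the IF branches would simply be swapped (a sketch, same assumptions as above):

Measure B =
SUMX (
    VALUES ( User[boolean] ),
    IF ( User[boolean] = 1, 16, [Measure A] )
)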

Sort a range or array based on two columns that contain the date and time

Currently I'm trying to create a Google Apps Script for Google Sheets that allows adding weekly recurring events, batchwise, for upcoming events. My colleagues will then make minor changes to these added events (e.g., correct dates and times, change the contact person, add materials necessary for the event, and so forth).
So far, I have written the following script:
function CopyWeeklyEventRows() {
  var ss = SpreadsheetApp.getActiveSheet();
  var repeatingWeeks = ss.getRange(5, 1).getValue(); // gets how many weeks it should repeat
  var startDate = ss.getRange(6, 1).getValue(); // gets the start date
  var startWeekday = startDate.getDay(); // gives the weekday of the start date
  var regWeek = ss.getRange(9, 2, 4, 7).getValues(); // gets the regular week data
  var regWeekdays = new Array(regWeek.length); // stores the weekday offsets of the regWeek rows
  var ArrayStartDate = new Array(startDate); // wraps the start date in an array
  for (var i = 0; i < regWeek.length; i++) { // calculates the difference between startWeekday and each regWeekdays
    regWeekdays[i] = regWeek[i][1].getDay() - startWeekday;
    Logger.log(regWeekdays[i]);
    // Add 7 to move to the next week and avoid negative values
    if (regWeekdays[i] < 0) {
      regWeekdays[i] = regWeekdays[i] + 7;
    }
    // Add days according to the difference between startWeekday and each regWeekdays
    regWeek[i][0] = new Date(ArrayStartDate[0].getTime() + regWeekdays[i] * 24 * 3600000);
  }
  // I'm struggling with this line. The array regWeek is not sorted:
  //regWeek.sort([{ column: 1, ascending: true }]);
  ss.getRange(ss.getLastRow() + 1, 2, 4, 7).setValues(regWeek); // copies weekly events after the last row
}
It allows adding one week of recurring events to the overview section of the spreadsheet based on a start date. If the start date is a Tuesday, the regular week is added starting from a Tuesday. However, the rows are not sorted according to the dates.
How can the rows be sorted by ascending date (followed by time) before adding them to the overview?
My search for similar questions revealed "Google Script sort 2D Array by any column", which is the closest hit I've found. The same error message is shown when running my script with the sort line. I don't yet understand the difference between a Range and an array, which might be the key to solving the issue.
To give you a broader picture, here's what I'm currently working on:
I've noticed that the format will not necessarily remain when adding new recurring events. So far I haven't found the rule and have formatted by hand in a second step.
A drawback is currently that the weekly recurring events section is fixed. I've tried to find the last filled entry and use it to set the range of regWeek, but got stuck.
Use column A to exclude recurring events from the addition process using a dropdown.
Allow my colleagues to add an event to the recurring events using a dropdown (e.g. A26). This event should then be added, with sorting, to the right day of the week and start time. The sorting will come in handy.
Thanks in advance for your input regarding the sorting as well as suggestions on how to improve the code in general.
A demo version of the spreadsheet
UpdateV01:
Here are the code lines which copy and sort (first by date, then by time):
ss.getRange(ss.getLastRow()+1,2,4,7).setValues(regWeek); // copies weekly events after the last row
ss.getRange(ss.getLastRow()-3,2,4,7).sort([{column: 2, ascending: true}, {column: 4, ascending: true}]); // sorts only the copied weekly events chronologically
As @tehhowch pointed out, this is slow. It is better to sort BEFORE writing.
I will implement this method and post it here.
UpdateV02:
regWeek.sort(function (r1, r2) {
  // sorts ascending on the third column, which is index 2
  return r1[2] - r2[2];
});
regWeek.sort(function (r1, r2) {
  // r1 and r2 are elements in the regWeek array, i.e.
  // they are each a row array if regWeek is an array of arrays:
  // Sort ascending on the first column, which is index 0:
  // if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
  return r1[0] - r2[0];
});
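This two-pass approach relies on the sort being stable, which older Apps Script runtimes do not guarantee. A single comparator that sorts by date and breaks ties by time avoids that assumption (a sketch, assuming both columns hold Date values or numbers):

regWeek.sort(function (r1, r2) {
  // primary key: date in column index 0; tiebreaker: time in column index 2
  return (r1[0] - r2[0]) || (r1[2] - r2[2]);
});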
UpdateV03:
Here is an attempt to repeat the recurring events over several weeks. I don't know yet how to include the push for the whole "week"; see the sketch after the code.
// Repeat week for "A5" times and add to start/end date
for (var j = 0; j < repeatingWeeks; j++) {
  for (var i = 0; i < numFilledRows; i++) {
    regWeekRepeated[i + j*6][0] = new Date(regWeek[i][0].getTime() + j*7*24*3600000); // <- This line leads to an error message
    regWeekRepeated[i + j*6][3] = new Date(regWeek[i][3].getTime() + j*7*24*3600000);
  }
}
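One way to avoid writing into rows of regWeekRepeated that do not exist yet is to build the array with push, copying each source row first (a sketch; it assumes the start date sits in column index 0 and the end date in column index 3, as above):

var regWeekRepeated = [];
for (var j = 0; j < repeatingWeeks; j++) {
  for (var i = 0; i < numFilledRows; i++) {
    var row = regWeek[i].slice(); // copy the row so the source week stays untouched
    row[0] = new Date(regWeek[i][0].getTime() + j*7*24*3600000);
    row[3] = new Date(regWeek[i][3].getTime() + j*7*24*3600000);
    regWeekRepeated.push(row);
  }
}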
My question was answered and I was able to make the code work as intended.
Given your comment - you want to sort the written chunk - you have two methods available. One is to sort the data after writing it, using the Spreadsheet service's Range#sort(sortObject) method. The other is to sort the data before writing, using the JavaScript Array#sort(sortFunction) method.
Currently, your sort code //regWeek.sort([{ column: 1, ascending: true }]); is attempting to sort a JavaScript array using the sorting object expected by the Spreadsheet service. Instead, you can simply chain a .sort(...) call to your write call, since Range#setValues() returns the same Range, allowing repeated Range method calls (e.g. to set values, then apply formatting, etc.).
This looks like:
ss.getRange(ss.getLastRow() + 1, 2, regWeek.length, regWeek[0].length)
  .setValues(regWeek)
  /* other "chainable" Range methods you want to apply to
     the cells you just wrote to. */
  .sort([{column: 1, ascending: true}, ...]);
Here I have updated the accessed range to reference the data you are attempting to write - regWeek - so that it is always the correct size to hold the data. I've also visually broken apart the one-liner so you can better see the "chaining" that is happening between Spreadsheet service calls.
The other method - sorting before writing - will be faster, especially as the size and complexity of the sort increase. The idea behind sorting an array is that you need a comparison function that returns a negative value when the first element should come before the second, a positive value when the first element should come after the second, and zero when they are equivalent. This means a function that returns a boolean will NOT sort as one expects, since false and 0 are equivalent in JavaScript, while true and 1 are also equivalent.
Your sort looks like this, assuming regWeek is an array of arrays and you are sorting on numeric values (or at least values which will cast to numbers, like Dates).
regWeek.sort(function (r1, r2) {
  // r1 and r2 are elements in the regWeek array, i.e.
  // they are each a row array if regWeek is an array of arrays:
  // Sort ascending on the first column, which is index 0:
  // if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
  return r1[0] - r2[0];
});
I strongly recommend reviewing the Array#sort documentation.
You could sort the "Weekly Events" range before you set the regWeek variable. Then the range would be in the order you want before you process it. Or you could sort the whole "Overview" range after setting the data. Here's a quick function you can call to sort the range by multiple columns. You can of course tweak it to sort the "Weekly Events" range instead of the "Overview" range.
function sortRng() {
  var ss = SpreadsheetApp.getActiveSheet();
  var firstRow = 22;
  var firstCol = 1;
  var numRows = ss.getLastRow() - firstRow + 1;
  var numCols = ss.getLastColumn();
  var overviewRng = ss.getRange(firstRow, firstCol, numRows, numCols);
  Logger.log(overviewRng.getA1Notation());
  overviewRng.sort([{column: 2, ascending: true}, {column: 4, ascending: true}]);
}
As for getting the number of filled rows in the Weekly Events section: you need to search a column that will always have data if any row has data (like the start date column B), loop through the values, and return the index of the first blank found. That gives the number of rows that need to be copied. Warning: if you don't have at least one blank value in column B between the Weekly Events section and the Overview section, you will probably get unwanted results.
function getNumFilledRows() {
  var ss = SpreadsheetApp.getActiveSheet();
  var eventFirstRow = 9;
  var numFilledRows = 0;
  var colToCheck = 'B'; // the StartDate col, which should always have data if the row is filled
  var vals = ss.getRange(colToCheck + eventFirstRow + ":" + colToCheck).getValues();
  for (var i = 0; i < vals.length; i++) {
    if (vals[i][0] == '') {
      numFilledRows = i;
      break;
    }
  }
  Logger.log(numFilledRows);
  return numFilledRows;
}
EDIT:
If you just want to sort the array in JavaScript before writing, sorting by start date first and then by time of day, you could make a temporary array and add a column to each row that combines date and time. Array#sort() sorts dates alphabetically by default, so you need to convert each date to an integer. Then you can sort the array by the new column and finally delete the new column from each row. I included a function that does this below. It could be a lot more compact, but I thought it might be more legible like this.
function sortDates() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  // get display values, because getValues returns times as odd 1899 dates with the wrong time
  var vals = ss.getActiveSheet().getRange('B22:H34').getDisplayValues();
  var newArray = [];
  // Create a new array that inserts date and time as the first column in each row
  vals.forEach(function(aRow) {
    var theTime = aRow[2]; // hardcoded - assumes time is the third column that you grabbed
    // get the hours (before the colon) as a number
    var theHour = Number(theTime.substring(0, theTime.indexOf(':')));
    // get the minutes (after the colon) as a number
    var theMinutes = Number(theTime.substring(theTime.indexOf(':') + 1));
    var theDate = new Date(aRow[0]); // hardcoded - assumes date is the first column you grabbed
    theDate.setHours(theHour);
    theDate.setMinutes(theMinutes);
    // Add the date and time as an integer as the first item in the aRow array, for sorting purposes
    aRow.unshift(theDate.getTime());
    newArray.push(aRow);
  });
  // Sort the newArray based on the first item of each row (date and time as a number)
  newArray.sort((function(index) {
    return function(a, b) {
      return (a[index] === b[index] ? 0 : (a[index] < b[index] ? -1 : 1));
    };
  })(0));
  // Remove the first column of each row (date and time combined) that we added in the first step
  newArray.forEach(function(aRow) {
    aRow.shift();
  });
  Logger.log(newArray);
}

How to make the filter function work on dates in SparkR

'u' is a DataFrame containing ID = 1, 2, 3, ... and time = "2010-01-01", "2012-04-06", ...
ID and time are of type string, so I convert 'time' to type 'Date':
u$time <- cast(u[[2]], "Date")
I now want the first time in u:
first <- first(u$time)
I now make a new time by adding 150 days to the first time:
cluster <- first + 150
I now want to make a subset: a new 'u' where the times are within the first 150 days.
ucluster <- filter(u, u$time < cluster)
But this can't run in SparkR; I get the message "returnstatus==0 is not TRUE".
The problem with your approach is that first(u$time) returns a Column rather than a date value. If you instead take the first row and store its time in first, everything works fine:
df <- data.frame(ID=c(1,2,3,4),time=c("2010-01-01", "2012-04-06", "2010-04-12", "2012-04-09"))
u <- createDataFrame(sqlContext,df)
u$time <- cast(u[[2]], "Date")
first <- take(u,1)$time
cluster <- first + 150
ucluster <- filter(u, u$time < cluster)
collect(ucluster)
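If the data are not sorted by time, the 'first' row is not necessarily the earliest one; computing the minimum instead may be closer to the intent (a sketch under that assumption):

# take the earliest time in u, then add 150 days to form the cutoff
minDate <- collect(agg(u, min(u$time)))[1, 1]
ucluster <- filter(u, u$time < minDate + 150)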

Stata: how to get the observation value 5 minutes ahead with gapped time data

I have high-frequency data from a limit order book in Stata. Time does not have a regular interval, and some observations occur at the same time (to the millisecond). For each observation I need the midpoint 5 minutes later in a separate column. So for observation 1 the midpoint would be 10.49, because the last midpoint at or before 09:05:02.579 is 10.49.
How to do this in Stata?
datetime midpoint
12/02/2012 09:00:02.579 10.5125
12/02/2012 09:00:03.471 10.5125
12/02/2012 09:00:03.471 10.5125
12/02/2012 09:00:03.471 10.51
12/02/2012 09:00:03.471 10.51
12/02/2012 09:00:03.549 10.505
12/02/2012 09:00:03.549 10.5075
......
12/02/2012 09:04:59.785 10.495
12/02/2012 09:05:00.829 10.4925
12/02/2012 09:05:01.209 10.49
12/02/2012 09:05:03.057 10.4875
12/02/2012 09:05:05.055 10.485
.....
My approach would be:
1. generate a new data set shifted by five minutes
2. append this shifted data set
3. find the closest before and after observations to your five-minute delta
4. use some criterion to pick the better of these two values
You specified "closest", but you might want to add other criteria depending on your book. Also, you mentioned more than one value at a given millisecond tick; without more information I'm not sure how to handle that. Do you want to combine those midpoints first? Or are they different stocks?
Here's some code that implements the basics of the approach above.
clear
version 11.2
set seed 2001

* generate some data
set obs 100000
generate double dt = ///
    tc(02dec2012 09:00:00.000) + 1000*_n + int(100*rnormal())
format dt %tcDDmonCCYY_HH:MM:SS.sss
sort dt
generate midpt = 100
replace midpt = ///
    round(midpt[_n - 1] + 0.1*rnormal(), 0.005) if (_n != 1)

* add back future midpts
preserve
tempfile future
rename midpt fmidpt
rename dt fdt
generate double dt = fdt - tc(00:05:00.000)
save `future'
restore
append using `future'

* generate midpoints before and after 5 minutes in the future
sort dt
foreach v of varlist fdt fmidpt {
    clonevar `v'_b = `v'
    replace `v'_b = `v'_b[_n - 1] if missing(`v'_b)
}
gsort -dt
foreach v of varlist fdt fmidpt {
    clonevar `v'_a = `v'
    replace `v'_a = `v'_a[_n - 1] if missing(`v'_a)
}
format fdt* %tcDDmonCCYY_HH:MM:SS.sss

* use some algorithm to pick correct value
sort dt
generate choose_b = ///
    ((dt + tc(00:05:00.000)) - fdt_b) < (fdt_a - (dt + tc(00:05:00.000)))
generate fdt_c = cond(choose_b, fdt_b, fdt_a)
generate fmidpt_c = cond(choose_b, fmidpt_b, fmidpt_a)
format fdt_c %tcDDmonCCYY_HH:MM:SS.sss
// Construct a variable to look for in the dataset
gen double midpoint_5 = (datetime + 5*60000)
format midpoint_5 %tcNN/DD/CCYY_HH:MM:SS.sss

// will contain the closest observation number and midpoint 5 minutes ahead
gen _t = .
gen double midpoint_at5 = .

// How many observations in the sample?
local N = _N

// We will use these variables to skip some observations in the loop
egen obs_in_minute = count(minutes_filter), by(minutes_filter)
egen max_obs_in_minute = max(obs_in_minute)

set more off
// For each observation
forvalues i = 1/`N' {
    // If it is a trade
    if type[`i'] == "Trade" {
        // Set the time to look up in the data
        local lookup = midpoint_5[`i']
        // The time should be between the min and max(*5)
        local min = `i' + obs_in_minute[`i'] // this might cause errors
        local max = `i' + max_obs_in_minute[`i']*5
        // For each of these observations
        forvalues j = `min'/`max' {
            // Check if the lookup date is smaller than the datetime of the observation
            if `lookup' < datetime[`j'] {
                // Set the observation ID to the lookup ID 1 observation before
                quietly replace _t = `j' - 1 in `i'
                // Set the midpoint to the lookup ID 1 observation before
                quietly replace midpoint_at5 = midpoint[`j' - 1] in `i'
                // We have found the closest 5th min ahead... now stop the loop and continue to the next observation
                continue, break
            }
        }
        // This is to indicate where we are in the loop
        display "`i'/`N'"
    }
}
