How to make the filter function work on dates in SparkR - sparkr

`u` is a DataFrame containing ID = 1, 2, 3, ... and time = "2010-01-01", "2012-04-06", ...
ID and time both have type string, so I convert time to Date:
u$time <- cast(u[[2]], "Date")
I now want the first time in u:
first <- first(u$time)
I then make a new time by adding 150 days to the first time:
cluster <- first + 150
Now I want to make a subset: a new u containing only the times from the first 150 days.
ucluster <- filter(u, u$time < cluster)
But this fails in SparkR with the message "returnstatus==0 is not TRUE".

The problem with your approach is that first(u$time) returns a one-item Column rather than a date value, so cluster is a Column as well and the comparison fails. If you instead take the first row and store its time in first, everything works fine:
df <- data.frame(ID=c(1,2,3,4),time=c("2010-01-01", "2012-04-06", "2010-04-12", "2012-04-09"))
u <- createDataFrame(sqlContext,df)
u$time <- cast(u[[2]], "Date")
first <- take(u,1)$time
cluster <- first + 150
ucluster <- filter(u, u$time < cluster)
collect(ucluster)
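For clarity, here is the same windowing logic sketched in plain Python, using the sample dates from the example above. The point is that first must be an actual date value so that first + 150 days is also a date:

```python
from datetime import date, timedelta

# Sample times matching the example DataFrame above.
times = [date(2010, 1, 1), date(2012, 4, 6), date(2010, 4, 12), date(2012, 4, 9)]

first = times[0]                       # the time of the first row, an actual date value
cluster = first + timedelta(days=150)  # the cutoff: first date + 150 days
ucluster = [t for t in times if t < cluster]  # keep only times inside the window
```

Only the two 2010 dates fall within 150 days of 2010-01-01; the 2012 dates are filtered out.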


Sort Range By Column Automatically

What I want to do: sort a range of data by the active column via shortcut key(s). I do this regularly and want to automate. However, the size and location of the range changes each time I do this.
My approach:
1. Select a cell in the range of interest that's in the first row and the specific column I want to sort by; call it C4.
2. Select down to the last row of the range.
3. Select left and/or right to all columns of the range.
4. Get the column number / address of C4 (which should be 3, since C = column 3).
5. Sort the range by the column of C4.
I've been able to accomplish steps 1 to 3. I'm struggling with step 4. Below is what I've got. I suspect that if I can get step 4 sorted, then the last line will accomplish step 5.
function sortCol() {
  // KGF - 22,49
  // Purpose: automatically select a range of data and sort by the column of the active cell
  // Status - wip; haven't determined how to get the column # of the active cell
  // Last updated: 22-12-4
  var spreadsheet = SpreadsheetApp.getActive();
  var currentCell = spreadsheet.getCurrentCell();
  let cc1 = currentCell;
  spreadsheet.getSelection().getNextDataRange(SpreadsheetApp.Direction.DOWN).activate();
  spreadsheet.getActiveRange().getDataRegion(SpreadsheetApp.Dimension.COLUMNS).activate();
  // getColumn() (note the parentheses) returns the 1-based column number of the cell;
  // without them, cc1.getColumn is only a reference to the method. 22,49
  let cc1_col = cc1.getColumn();
  console.log(cc1_col);
  Logger.log(cc1_col);
  cc1.activateAsCurrentCell();
  spreadsheet.getActiveRange().sort({column: cc1_col, ascending: true});
}
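Step 5 can be stated language-neutrally: given the active cell's 1-based column number, sort the rows of the selected range by that column. A minimal Python sketch of that sort step (the function name and sample rows are illustrative, not part of the Apps Script API):

```python
def sort_by_column(rows, col, ascending=True):
    """Sort a 2D range (list of rows) by the 1-based column number `col`."""
    return sorted(rows, key=lambda r: r[col - 1], reverse=not ascending)

rows = [["b", 2], ["a", 3], ["c", 1]]
by_first = sort_by_column(rows, 1)   # sort by column 1 (the letters)
by_second = sort_by_column(rows, 2)  # sort by column 2 (the numbers)
```

This mirrors what `Range.sort({column: cc1_col, ascending: true})` does on the active range.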

Get Data of a month from a dated column

Respected all, I have a column in the database whose type is DATE, and I want to fetch data for a given month, supplying only the month and year as input. I tried, but I got errors; please check my query:
SELECT
M.INV_NUM, M.GD_NUM, M.INV_DATE, M.QTY1,
D.ITEM_CODE, D.HS_CODE, R.QNTY, R.waste_per,
R.WASTE_QNTY, (R.WASTE_QNTY+R.QNTY) TOTAL_CONSUMED
FROM
DOCT_EXPT_SALE_MST M,
DOCT_EXPT_SALE_RAW R,
DOCT_EXPT_SALE_DTL D
WHERE
R.SALE_DET_ID = D.SALE_DET_ID
AND D.SALE_ID = M.SALE_ID
AND M.INV_DATE BETWEEN TO_DATE(TRUNC('072022','MMYYYY')) AND TO_DATE(TRUNC('072022','MMYYYY'))--TO_NUMBER(TO_DATE(TO_CHAR('01072022','DDMMYYYY'))) AND TO_NUMBER(TO_DATE(TO_CHAR('31072022','DDMMYYYY')))
AND M.COMP_CODE = 3;
I tried many things but all in vain. If anybody can help me with this, I shall be very thankful. My database is 11g.
If you are being passed the string '072022' then you can do:
AND M.INV_DATE >= TO_DATE('072022','MMYYYY')
AND M.INV_DATE < ADD_MONTHS(TO_DATE('072022','MMYYYY'), 1)
The TO_DATE('072022','MMYYYY') clause will give you midnight on the first day of that month, so 2022-07-01 00:00:00.
The ADD_MONTHS(TO_DATE('072022','MMYYYY'), 1) clause will take that date and add one month, giving 2022-08-01 00:00:00.
The two comparisons will then find all dates in your column which are greater than or equal to 2022-07-01 00:00:00 and less than 2022-08-01 00:00:00 - which covers all possible dates and times during that month.
So your query would be (switching to ANSI joins!):
SELECT
M.INV_NUM, M.GD_NUM, M.INV_DATE, M.QTY1,
D.ITEM_CODE, D.HS_CODE, R.QNTY, R.waste_per,
R.WASTE_QNTY, (R.WASTE_QNTY+R.QNTY) TOTAL_CONSUMED
FROM
DOCT_EXPT_SALE_MST M
JOIN
DOCT_EXPT_SALE_DTL D ON D.SALE_ID = M.SALE_ID
JOIN
DOCT_EXPT_SALE_RAW R ON R.SALE_DET_ID = D.SALE_DET_ID
WHERE
M.INV_DATE >= TO_DATE('072022','MMYYYY')
AND M.INV_DATE < ADD_MONTHS(TO_DATE('072022','MMYYYY'), 1)
AND M.COMP_CODE = 3;
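The half-open interval [first of the month, first of the next month) that the two predicates build can be sketched in Python. The helper name `month_bounds` is illustrative; it parses a 'MMYYYY' string the way TO_DATE(:p, 'MMYYYY') does and returns the pair of boundary dates:

```python
from datetime import date

def month_bounds(mmyyyy):
    """Given 'MMYYYY' (e.g. '072022'), return (start, end) where
    start is the first day of that month and end is the first day
    of the next month, so a row matches when start <= d < end."""
    month, year = int(mmyyyy[:2]), int(mmyyyy[2:])
    start = date(year, month, 1)
    # Roll December over into January of the next year.
    end = date(year + (month == 12), month % 12 + 1, 1)
    return start, end

start, end = month_bounds('072022')  # July 2022: [2022-07-01, 2022-08-01)
```

Any INV_DATE with start <= INV_DATE < end falls in July 2022, including rows with a time-of-day component, which a BETWEEN on two midnight values would mishandle.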

DolphinDB: how to divide time period based on given conditions?

Suppose the given stock data are in 1-minute intervals. I'm trying to perform a time split at around every 1.5 million shares traded for each stock, so as to obtain data records with different time windows. Here "around every 1.5 million" means that the value at a time point should be added only if doing so brings sum_volume closer to 1.5 million; otherwise it should start a new window.
It can be achieved by the following script. The key to grouping the data lies in the expression: iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill()
// Define an accumulate function `calcCumVol`. If the value at this point should be included in the current group, the function returns the accumulated volume. Otherwise it starts a new group and returns the volume at the current point.
def calcCumVol(target, a, b){
    newVal = a + b
    if(newVal < target) return newVal
    else if(newVal - target > target - a) return b
    else return newVal
}
// Import the data
t = loadText("f:/DolphinDB/sample.csv")
// The key lies in the expression iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill()
// If the accumulated value == volume, a new group begins.
// If it is the start of a new group, record the current time; otherwise leave it NULL and fill it with `ffill`. The rows in the same group therefore all share the same start time.
output = select first(wind_code) as wind_code, first(date) as date, sum(volume) as sum_volume, last(time) as endTime from t group by iif(accumulate(calcCumVol{1500000}, volume) == volume, time, NULL).ffill() as startTime
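The grouping rule can be checked in plain Python. This is a sketch, not DolphinDB code: the helper names are illustrative, and the "a fresh group starts whenever the accumulate step returns the bare volume" condition mirrors the iif(accumulate(...) == volume, ...) trick in the script above:

```python
def calc_cum_vol(target, acc, vol):
    """Mirror of the accumulate function: return the running volume if this
    point joins the current group, or the bare volume to start a new group."""
    new_val = acc + vol
    if new_val < target:
        return new_val
    # Overshoot vs. undershoot: start a new group only if including vol
    # would land further from the target than stopping before it.
    if new_val - target > target - acc:
        return vol
    return new_val

def split_by_volume(volumes, target):
    """Split a volume series into groups of roughly `target` total volume."""
    groups, current, acc = [], [], 0
    for v in volumes:
        acc = calc_cum_vol(target, acc, v)
        if acc == v and current:  # accumulator was reset: a new group begins
            groups.append(current)
            current = []
        current.append(v)
    if current:
        groups.append(current)
    return groups

result = split_by_volume([4, 4, 4, 4, 4], target=10)  # small target for illustration
```

With a target of 10, the third 4 is still included (12 overshoots by 2, exactly as far as 8 undershoots, so it stays), and the fourth 4 starts a new group.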

issues returning pyspark dataframe using for loop

I am applying a for loop in PySpark. How can I get the actual values in the dataframe? I am doing dataframe joins and filtering too.
I haven't added a dataset here; I need the approach or pseudocode just to figure out what I am doing wrong.
Help is really appreciated, I have been stuck for a long time.
values1 = values.collect()
temp1 = []
for index, row in enumerate(sorted(values1, key=lambda x: x.w_vote, reverse=False)):
    tmp = data_int.filter(data_int.w_vote >= row.w_vote)
    # Left join service types to results
    it1 = dt.join(master_info, dt.value == master_info.value, 'left').drop(dt.value)
    print(tmp)
    it1 = it1.withColumn('iteration', F.lit('index')).otherwise(it1.iteration1)
    it1 = it1.collect()[index]
    # concatenate the results to the final hh list
    temp1.append(it1)
    print('iterations left:', total_values - (index + 1), "Threshold:", row.w_vote)
The problem I am facing is the output of temp1 comes as below
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 240 Threshold: 0.1
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 239 Threshold: 0.2
Why are my actual values not getting displayed in the output as a list?
print applied to a DataFrame executes the __repr__ method of the dataframe, which is what you get. If you want to print the content of the dataframe, use either show to display the first 20 rows, or collect to retrieve the full dataframe.
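A plain-Python illustration of why this happens, using a hypothetical Frame class (not the real Spark API) whose __repr__ describes the schema the way a Spark DataFrame's does:

```python
class Frame:
    """Toy stand-in for a Spark DataFrame: __repr__ shows the schema,
    collect() materializes the rows."""
    def __init__(self, rows, schema):
        self.rows, self.schema = rows, schema
    def __repr__(self):
        return "DataFrame[" + ", ".join(self.schema) + "]"
    def collect(self):
        return self.rows

df = Frame([(1, 0.1), (2, 0.2)], ["value_x: bigint", "w_vote: double"])
schema_text = repr(df)   # what print(df) would show: the schema line
rows = df.collect()      # the actual row values
```

print(df) goes through __repr__ and shows only the schema line, exactly like the DataFrame[value_x: bigint, ...] output in the question; collect() is what returns the values themselves.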

SSRS First, Second, Third, etc?

=First(Fields!PrimeContractor.Value, "DataSet1") + ", " + Last(Fields!PrimeContractor.Value, "DataSet1")
This works to get the first and last values from a field into one single cell, but how do I get everything else in between? I tried "Second", but that is a time function, so I know that doesn't work.
You can use LookupSet to get the selected values from a dataset, then use Join to put them all together:
=Join(LookupSet(1, 1, Fields!PrimeContractor.Value, "DataSet1"), ", ")
Since you want all records, use 1 and 1 for the first two arguments (1 = 1 is always true). This reads as:
Look up the records where 1 = 1 and return the PrimeContractor values from the DataSet1 dataset.
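Conceptually, LookupSet with an always-true match collects every value and Join concatenates them, the same as a comma-join over a list. A sketch in Python with made-up contractor names:

```python
# Stand-in for LookupSet(1, 1, Fields!PrimeContractor.Value, "DataSet1"):
# with an always-true match, it simply returns every value in the field.
contractors = ["Acme", "Globex", "Initech"]

# Stand-in for Join(..., ", "): concatenate the values with a separator.
joined = ", ".join(contractors)
```

This yields one string containing the first value, the last value, and everything in between.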
