Aggregating data in Upsolver and upserting the aggregated output into Athena - upsolver

I'm getting a Kafka stream which I need to aggregate and load into Athena. As each event arrives, the aggregates should update to reflect the new event. I want to re-use this aggregated data for multiple outputs, so I used an Upsolver intermediate output to first construct the aggregated data, and then created multiple Athena and Redshift outputs that upsert from this aggregated intermediate output. Since ingestion happens per minute, the issue is that each time a new event arrives, it overrides the aggregates with only the data from that minute, as opposed to the total aggregates over all data processed so far.

An intermediate Upsolver output will by default only aggregate data as it is ingested, so each one-minute batch of events gets aggregated on its own. Since you want to aggregate all data so far, you need to add a WINDOW clause.
SELECT id, col1, count(col2)
FROM table
GROUP BY id, col1
WINDOW 31 DAYS
You can use whatever window you need. In this case, the intermediate output will maintain aggregates over the past 31 days.
Example: with a 31-day window, let's say these are the events you received. date is the attribute in your event that shows when the event actually happened, time is the Upsolver attribute that shows when the event was ingested into Upsolver, and you are counting id.
id=1 date=11/30/2022 time=11/30/2022 count=1
id=1 date=12/15/2022 time=12/15/2022 count=2
id=1 date=12/20/2022 time=12/20/2022 count=3
id=1 date=12/31/2022 time=12/31/2022 count=4
Since every event here falls within the 31-day window, the count keeps updating: 1, 2, 3, 4. Now let's say a very late event arrives. It may be an impossible situation in your business case, but it still happens: even though no event was supposed to arrive more than 31 days late, one event arrives really late.
id=1 date=12/10/2022 time=2/1/2023
This is a really, really late event that was not accounted for. Maybe it will never happen, but if it does, the output will re-aggregate only the 31 days leading up to 2/1/2023 and produce a new count of just 1, since this is the only event inside that window. You can add the WHERE clause below to make it fail-safe.
SELECT id, col1, count(col2)
FROM table
WHERE DATE_DIFF_PRECISE('day', TO_DATE("date"), TO_DATE(time)) <= 31
GROUP BY id, col1
WINDOW 31 DAYS
// 31 here should match your WINDOW; change accordingly

Related

Different results when the aggregation output table is a keyedTable vs. a keyedStreamTable

When the output table that receives the aggregation results is a keyedTable versus a keyedStreamTable, the results are different.
When the aggregation engine uses a keyedTable versus a keyedStreamTable to receive its results, the effect differs: the former receives the results correctly, but cannot be used as a data source for a larger-period aggregation; the latter does not aggregate at all, and only captures the first tick record of each minute.
The code executed by the GUI is as follows:
barColNames=`ActionTime`InstrumentID`Open`High`Low`Close`Volume`Amount`OpenPosition`AvgPrice`TradingDay
barColTypes=[TIMESTAMP,SYMBOL,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT,DOUBLE,DOUBLE,DOUBLE,DATE]
// Choose one of the following two lines of code; the results turn out to be inconsistent.
/////////// Generate a 1-minute K line table (barsMin01). This is an empty table.
share keyedTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
//////// The line above works for aggregation, but the table cannot be used as a data source for other periods.
share keyedStreamTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
//////// The line above has no aggregation effect; only the first tick of every minute is captured.
////////// define the aggregation metrics
metrics=<[first(LastPrice), max(LastPrice), min(LastPrice), last(LastPrice), sum(Volume), sum(Amount), sum(OpenPosition), sum(Amount)/sum(Volume)/300, last(TradingDay) ]>
//////////// Aggregation engine: generate the 1-minute K line
nMin01=1*60000
tsAggrKlineMin01 = createTimeSeriesAggregator(name="aggr_kline_min01", windowSize=nMin01, step=nMin01, metrics=metrics, dummyTable=ticks, outputTable=barsMin01, timeColumn=`ActionTime, keyColumn=`InstrumentID,updateTime=500, useWindowStartTime=true)
/////////// subscribe and the 1-min k line will be generated
subscribeTable(tableName="ticks", actionName="act_tsaggr_min01", offset=0, handler=append!{getStreamEngine("aggr_kline_min01")}, batchSize=1000, throttle=1, hash=0, msgAsTable=true)
There are some differences between keyedTable and keyedStreamTable:
keyedTable: when a new record is added to the table, the system automatically checks its primary key. If it matches the primary key of an existing record, the existing record is updated.
keyedStreamTable: when a new record is added to the table, the system automatically checks its primary key. If it matches the primary key of an existing record, the existing record is not updated and the new record is ignored.
That is, one of them is for updating and the other is for filtering.
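This is not DolphinDB, but as a rough analogy in SQLite upsert syntax (a made-up table bars keyed on sym), the two behaviours look like this:
CREATE TABLE bars (sym TEXT PRIMARY KEY, close REAL);
-- keyedTable-like: a row with a duplicate key updates the existing row
INSERT INTO bars (sym, close) VALUES ('A', 10.0)
ON CONFLICT(sym) DO UPDATE SET close = excluded.close;
-- keyedStreamTable-like: a row with a duplicate key is ignored,
-- so only the first row seen for each key is kept
INSERT INTO bars (sym, close) VALUES ('A', 10.0)
ON CONFLICT(sym) DO NOTHING;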
The behaviour you describe, where the keyedStreamTable "does not aggregate, but only captures the first record of the tick data in each minute", happens precisely because you set updateTime=500 in createTimeSeriesAggregator. If updateTime is specified, the calculation may be triggered multiple times within the current window.
Since you are using a keyedStreamTable to receive the results here, updateTime cannot be used. If you want to force a trigger, you can specify the forceTrigger parameter instead.

How to count events by minute in BigQuery

Many years ago I knew SQL quite well, but apparently it's been so long that I've lost my skills and knowledge.
I have a number of tables that each track a given event with additional metadata. One piece of metadata is a timestamp in UTC format (2021-08-11 17:27:27.916007 UTC).
Now I need to count how many times the event occurred per minute.
Col 1, Col2
EventName, Timestamp in UTC
I am trying to recall my past knowledge and also how to apply that to BQ. Any help is appreciated.
If I'm understanding correctly, you could transform your timestamp into minutes and then group by it.
SELECT count(*) AS number_events,
FLOOR(UNIX_SECONDS(your_timestamp)/60) AS minute
FROM your_table
GROUP BY FLOOR(UNIX_SECONDS(your_timestamp)/60)
So it transforms your timestamps to Unix seconds, divides by 60 to get minutes, and applies FLOOR() to drop the decimals left after the division.
If you have multiple types of events in the same table, just add the event name to the SELECT and to the GROUP BY.
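For example, a sketch reusing the placeholder names your_table and your_timestamp from above, plus the EventName column from the question:
SELECT EventName,
       FLOOR(UNIX_SECONDS(your_timestamp)/60) AS minute,
       count(*) AS number_events
FROM your_table
GROUP BY EventName, minute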
The first step would be to group by the event column; then the timestamps can be counted:
SELECT EventName, count(Timestamp) AS number_events
FROM your_table
GROUP BY 1
Depending on your data, some more transformation has to be done, e.g. ignoring the seconds in the timestamp and keeping only the full minutes, as done in the answer from Javier Montón.
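If you would rather keep a readable timestamp for each minute instead of a Unix minute number, BigQuery's TIMESTAMP_TRUNC can do that truncation; a sketch with the same placeholder names:
SELECT EventName,
       TIMESTAMP_TRUNC(your_timestamp, MINUTE) AS minute,
       count(*) AS number_events
FROM your_table
GROUP BY EventName, minute
ORDER BY minute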

How should I construct an Oracle function, or other code, to measure time for business processes between two records

I need suggestions for designing tables and records in Oracle to handle business processes and statuses, and to report the times between statuses.
We have a transaction table that records a serially numbered record id, a document id, a date and time, and a status. Statuses indicate where a document is in the approval process, i.e. the task that needs to be done on the document. There are up to 40 statuses, showing both who needs to approve and what task is being done. So there is a document header (parent) record and multiple status (child) records.
The challenge is to analyze where the bottlenecks are, which tasks are taking the longest, etc.
From a business point of view, a task receives a document, and we have the date and time this happens. We do not have a release or finish date and time for the current task; all we have is the next task's start date and time. Note that a document can only have one status at a time.
For reasons I won't go into, we cannot use ETL to create an end date and time for a status, although I think that is the solution.
Part of the challenge is that statuses are not strictly consecutive and do not follow a fixed order. Some statuses can start, stop, and later in the process start again.
What I would like to report, on a weekly or monthly basis, is the time each status record takes: end date-time minus start date-time. Can anyone suggest a function or another way to accomplish this?
I don't need specific code; an example in pseudocode, or just an outline of how to solve this, would do. Then I could figure out the code.
You can use an AFTER INSERT and AFTER UPDATE trigger on the transaction table to record every change in a LOG_TABLE: transaction id, last status, new status, who approved, change date-time (use the TIMESTAMP data type if fractional seconds matter), terminal, session ID, username.
For inserts you need to define an "insert status" that is different from the other 40 statuses. For example, if statuses are numeric, the "insert status" can be -1 (minus one), so the last status is -1 and the new status is the status of the record inserted into the transaction table.
With this LOG_TABLE you can develop a package with functions to calculate the time between status changes, display all changes, display the last change, etc.
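A minimal sketch of that idea, with hypothetical names (doc_transaction, record_id, status, log_table); treat it as an outline to adapt rather than a finished implementation:
-- Hypothetical log table; add terminal, session ID and so on as needed
CREATE TABLE log_table (
  transaction_id NUMBER,
  old_status     NUMBER,
  new_status     NUMBER,
  changed_by     VARCHAR2(128),
  change_ts      TIMESTAMP
);

CREATE OR REPLACE TRIGGER trg_doc_transaction_log
AFTER INSERT OR UPDATE ON doc_transaction
FOR EACH ROW
DECLARE
  l_old_status NUMBER;
BEGIN
  IF INSERTING THEN
    l_old_status := -1;            -- the "insert status" convention described above
  ELSE
    l_old_status := :OLD.status;
  END IF;

  INSERT INTO log_table (transaction_id, old_status, new_status, changed_by, change_ts)
  VALUES (:NEW.record_id, l_old_status, :NEW.status,
          SYS_CONTEXT('USERENV', 'SESSION_USER'), SYSTIMESTAMP);
END;
/

-- Time spent in each status: the next change's timestamp minus this one's
SELECT transaction_id,
       new_status,
       change_ts,
       LEAD(change_ts) OVER (PARTITION BY transaction_id ORDER BY change_ts) - change_ts
         AS time_in_status
FROM log_table;
The final query is the kind of thing the package functions could wrap, grouping by week or month as needed.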

Track the rows which were updated or encrypted

I want to scrub (or encrypt) the email information in a few tables for records that are older than a few years.
I am planning to do this as part of a job; the next time I run the job, how can I skip the rows that have already been scrubbed or encrypted?
I am looking for an approach with good performance.
"I want to scrub(or encrypt) the email information from a few tables which are older than a few years"
I hope this means you have a date column on these tables which you can use to determine which ones need to be scrubbed. The most efficient way of tackling the job is to track that date in an operational table, recording the most recent date scrubbed.
For example, you have ten years' worth of data and you need to scrub records that are more than four years old. This would work:
update t23
set email = null
where date_created < add_months(sysdate, -48);
But it seems like you want to batch things up. So build a tracking table, which at its simplest would be
create table tracker (
    last_date_scrubbed date
);
Populate last_date_scrubbed with a really old date, say date '2010-01-01'.
Now you can write a query like this
update t23
set email = null
where date_created
< (select last_date_scrubbed + interval '1' year from tracker);
That will clean all records older than 2011. Then increment the date in the tracker table by one year and run the query again to clean the records from 2011. Repeat until you reach your target state of cleanliness, at which point you can switch to running the query monthly, with an interval of one month, or whatever suits.
Obviously you should proceduralize this. A procedure is the best way to encapsulate the steps and make sure everything is kept in step. Also you can use the database scheduler to run the procedure.
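A minimal sketch of such a procedure, reusing the t23 and tracker names from above (DBMS_OUTPUT is only for illustration; record the row count however you prefer):
CREATE OR REPLACE PROCEDURE scrub_next_batch AS
  l_cutoff DATE;
BEGIN
  -- Advance the cutoff by one batch (a year here; shrink it to a month later)
  SELECT last_date_scrubbed + interval '1' year
  INTO   l_cutoff
  FROM   tracker;

  UPDATE t23
  SET    email = null
  WHERE  date_created < l_cutoff;

  DBMS_OUTPUT.PUT_LINE('Rows scrubbed this run: ' || SQL%ROWCOUNT);

  UPDATE tracker
  SET    last_date_scrubbed = l_cutoff;

  COMMIT;
END scrub_next_batch;
/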
"there is one downside to this approach. I thought that you want to be free upon choosing which rows to be updated."
I don't see any requirement to track which individual rows have been scrubbed. After all, the end state is that every record older than a certain date has been scrubbed. When I have done jobs like this before, all anybody wanted to know was "how many rows have we done so far and how many have we still got to do?", which can be answered by tracking SQL%ROWCOUNT for each run.
For the best performance, you can add a flag column to your main table, a column like IsEncrypted. Then, every time you run a query against the "not encrypted" rows, you simply use a WHERE condition on that column to restrict the query to those rows. There are other ways, though.
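A minimal sketch of that flag-column idea; the table name emails_tbl, the date_created column, and the 'N'/'Y' values are assumptions for illustration:
ALTER TABLE emails_tbl ADD (is_encrypted CHAR(1) DEFAULT 'N' NOT NULL);

-- Scrub only rows that are old enough and not yet processed,
-- then flag them so the next run skips them
UPDATE emails_tbl
SET    email        = null,   -- or your encryption function
       is_encrypted = 'Y'
WHERE  is_encrypted = 'N'
AND    date_created < add_months(sysdate, -48);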
EDIT
Another way is to create a logger table. Basically, this table records any extra information you want about a given ID in another table. Create a table called EncryptionLogger with at least two columns: EmailTableId and IsEncrypted. Then, in any query, you can simply get the rows WHERE their IDs are NOT IN this table.
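A minimal sketch of that logger-table idea; emails_tbl and its id column are assumptions, and NOT IN is safe here because EmailTableId is a non-nullable primary key (NOT EXISTS is an equivalent alternative):
CREATE TABLE EncryptionLogger (
  EmailTableId NUMBER PRIMARY KEY,
  IsEncrypted  CHAR(1) DEFAULT 'Y' NOT NULL
);

-- Rows still to be processed are those with no entry in the logger table
SELECT e.*
FROM   emails_tbl e
WHERE  e.id NOT IN (SELECT l.EmailTableId FROM EncryptionLogger l);
After scrubbing or encrypting a row, insert its id into EncryptionLogger so subsequent runs skip it.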

Efficient way to find lots of "most recent" events in sqlite3

I've got an sqlite3 database that contains events. Each event is either the "on" or the "off" of something happening, and contains the time of the event, what the event is, and some miscellaneous data which varies by event.
I want to query to find that last event of each type. So far this is the query I have come up with:
SELECT * from event where name='event1on' or name='event1off' ORDER BY t DESC LIMIT 1
This works, but it is slow when I have a lot of events I want to find the latest one of. I suspect this is because for each SELECT a full scan of the database must be made (several million rows), but I am at a loss to find a more efficient way to do this.
If you have SQLite 3.7.11 or later, you can use max to select other fields from the record that contains the maximum value:
SELECT *, max(t) FROM event GROUP BY name
To speed up this query, try creating a single composite index on the name and t fields.
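For example (the index name is arbitrary):
CREATE INDEX IF NOT EXISTS idx_event_name_t ON event(name, t);
With a composite index on (name, t), SQLite can pick the maximum t for each name from the index instead of scanning the whole table.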
