I have a test table in ClickHouse that I am using to prototype storage of FX price data.
The columns in this particular table are something like:
timestamp DateTime64(6),
bank_name String,
tob_bid Float32,
tob_ask Float32
What I'd like to achieve is a rolling array of the last quotes from each bank over time.
So, for example, if the table contained data like:
2023-01-22 17:25:23.368889, 'LP1', 1.06782, 1.06784
2023-01-22 17:25:27.393059, 'LP1', 1.06781, 1.06784
2023-01-22 17:25:27.345757, 'LP2', 1.06780, 1.06787
2023-01-22 17:25:27.236824, 'LP3', 1.06781, 1.06785
2023-01-22 17:25:23.321132, 'LP2', 1.06779, 1.06785
2023-01-22 17:25:23.391159, 'LP1', 1.06780, 1.06782
2023-01-22 17:25:38.520492, 'LP3', 1.06779, 1.06783
I would like the results to be
2023-01-22 17:25:23.368889, [ 'LP1' ], [ 1.06782 ], [ 1.06784 ]
2023-01-22 17:25:27.393059, [ 'LP1' ], [ 1.06781 ], [ 1.06784 ]
2023-01-22 17:25:27.345757, [ 'LP1', 'LP2' ], [ 1.06781, 1.06780 ], [ 1.06784, 1.06787 ]
2023-01-22 17:25:27.236824, [ 'LP1', 'LP2', 'LP3' ], [ 1.06781, 1.06780, 1.06781 ], [ 1.06784, 1.06787, 1.06785 ]
2023-01-22 17:25:23.321132, [ 'LP1', 'LP2', 'LP3' ], [ 1.06781, 1.06779, 1.06781 ], [ 1.06784, 1.06785, 1.06785 ]
2023-01-22 17:25:23.391159, [ 'LP1', 'LP2', 'LP3' ], [ 1.06780, 1.06779, 1.06781 ], [ 1.06782, 1.06785, 1.06785 ]
2023-01-22 17:25:38.520492, [ 'LP1', 'LP2', 'LP3' ], [ 1.06780, 1.06779, 1.06779 ], [ 1.06782, 1.06785, 1.06783 ]
I.e. at every timestamp, the arrays update with the latest quote from each unique bank_name.
Is something like this possible in ClickHouse?
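For reference, a minimal sketch of the table and sample rows (table name tt matches the queries below; the MergeTree engine and ordering key are just illustrative choices):
CREATE TABLE tt
(
    timestamp DateTime64(6),
    bank_name String,
    tob_bid   Float32,
    tob_ask   Float32
)
ENGINE = MergeTree
ORDER BY timestamp;

INSERT INTO tt VALUES
    ('2023-01-22 17:25:23.368889', 'LP1', 1.06782, 1.06784),
    ('2023-01-22 17:25:27.393059', 'LP1', 1.06781, 1.06784),
    ('2023-01-22 17:25:27.345757', 'LP2', 1.06780, 1.06787),
    ('2023-01-22 17:25:27.236824', 'LP3', 1.06781, 1.06785),
    ('2023-01-22 17:25:23.321132', 'LP2', 1.06779, 1.06785),
    ('2023-01-22 17:25:23.391159', 'LP1', 1.06780, 1.06782),
    ('2023-01-22 17:25:38.520492', 'LP3', 1.06779, 1.06783);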
Many thanks
Pretty sure this can be simplified a lot, but it does the trick:
WITH
(
    arrayMap(
        i -> i = 1,
        arrayReverse(arrayEnumerateUniq(arrayReverse(
            groupArray(bank_name) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        )))
    )
) AS mask
SELECT
    timestamp,
    arrayFilter((x, y) -> y, groupArray(bank_name) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), mask) AS bank_names,
    arrayFilter((x, y) -> y, groupArray(tob_bid) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), mask) AS tob_bids,
    arrayFilter((x, y) -> y, groupArray(tob_ask) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), mask) AS tob_asks
FROM tt
ORDER BY timestamp ASC
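To see what the mask does: arrayEnumerateUniq numbers repeated values by occurrence, so reversing the array, keeping only rank 1, and reversing back flags the last occurrence of each bank_name. A standalone sketch with a literal array (the values here are my own example):
-- returns [0, 1, 1, 1]: the earlier 'LP1' is masked out, so only its latest quote survives the arrayFilter
SELECT arrayMap(i -> i = 1, arrayReverse(arrayEnumerateUniq(arrayReverse(['LP1', 'LP2', 'LP1', 'LP3'])))) AS mask;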
Something like this?
SELECT
timestamp,
groupArray(bank_name) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bank_names,
groupArray(tob_bid) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tob_bids,
groupArray(tob_ask) OVER (ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tob_asks
FROM tt
ORDER BY timestamp ASC;
Related
I'm trying to load a specific dataset from the SciVal REST API into an Oracle database table. Below is the JSON payload that I'm trying to manipulate.
{
  "metrics": [{
    "metricType": "ScholarlyOutput",
    "valueByYear": {
      "2017": 4,
      "2018": 0,
      "2019": 3,
      "2020": 1,
      "2021": 1
    }
  }],
  "author": {
    "link": {
      "#ref": "self",
      "#href": "https://api.elsevier.com/analytics/scival/author/123456789?apiKey=xxxxxxxxxx&httpAccept=text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",
      "#type": "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2"
    },
    "name": "Citizen, John",
    "id": 123456789,
    "uri": "Author/123456789"
  }
}
I'm able to query the 'author' bit with the below SQL.
SELECT jt.*
FROM TABLE d,
JSON_TABLE(d.column format json, '$.author' COLUMNS (
"id" VARCHAR2 PATH '$.id',
"name" VARCHAR2 PATH '$.name')
) jt;
However, I'm not able to get the 'valueByYear' value. I've tried the following:
SELECT jt.*
FROM TABLE d,
JSON_TABLE
(d.column, '$.metrics[*]' COLUMNS
(
"metric_Type" VARCHAR2 PATH '$.metricType'
,"Value_By_Year" NUMBER PATH '$.valueByYear'
NESTED PATH '$.valueByYear[1]' COLUMNS
("2021" NUMBER PATH '$.valueByYear[1]'
)
)
) jt;
I would appreciate it if you could let me know what I'm missing here. I'm after the latest 'year' value.
You can use:
SELECT jt.*
FROM table_name d,
JSON_TABLE(
d.column_name format json,
'$'
COLUMNS (
id VARCHAR2 PATH '$.author.id',
name VARCHAR2 PATH '$.author.name',
NESTED PATH '$.metrics[*]' COLUMNS (
metricType VARCHAR2(30) PATH '$.metricType',
value2021 NUMBER PATH '$.valueByYear."2021"'
)
)
) jt;
Which, for the sample data:
CREATE TABLE table_name (
column_name CLOB CHECK (column_name IS JSON)
);
INSERT INTO table_name (column_name) VALUES (
'{
"metrics": [{
"metricType": "ScholarlyOutput",
"valueByYear": {
"2017": 4,
"2018": 0,
"2019": 3,
"2020": 1,
"2021": 1
}
}],
"author": {
"link": {
"#ref": "self",
"#href": "https://api.elsevier.com/analytics/scival/author/123456789?apiKey=xxxxxxxxxx&httpAccept=text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",
"#type": "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2"
},
"name": "Citizen, John",
"id": 123456789,
"uri": "Author/123456789"
}
}'
);
Outputs:
ID         NAME           METRICTYPE       VALUE2021
123456789  Citizen, John  ScholarlyOutput  1
db<>fiddle here
If you want to do it dynamically in PL/SQL, then you can create the types:
CREATE TYPE scival_row IS OBJECT(
name VARCHAR2(100),
id NUMBER(12),
metricType VARCHAR2(50),
year NUMBER(4),
value NUMBER
);
CREATE TYPE scival_tbl IS TABLE OF scival_row;
and then the pipelined function:
CREATE FUNCTION parseScival(
  i_json CLOB,
  i_year NUMBER
) RETURN scival_tbl PIPELINED DETERMINISTIC
IS
  v_obj     JSON_OBJECT_T := JSON_OBJECT_T.parse(i_json);
  v_author  JSON_OBJECT_T := v_obj.get_Object('author');
  v_name    VARCHAR2(100) := v_author.get_String('name');
  v_id      NUMBER(12)    := v_author.get_Number('id');
  v_metrics JSON_ARRAY_T  := v_obj.get_Array('metrics');
  v_metric  JSON_OBJECT_T;
BEGIN
  FOR i IN 0 .. v_metrics.Get_Size - 1 LOOP
    v_metric := TREAT(v_metrics.get(i) AS JSON_OBJECT_T);
    PIPE ROW(
      scival_row(
        v_name,
        v_id,
        v_metric.get_string('metricType'),
        i_year,
        v_metric.get_object('valueByYear').get_number(i_year)
      )
    );
  END LOOP;
  RETURN;
END;
/
Then you can use the query:
SELECT j.*
FROM table_name t
CROSS APPLY TABLE(parseScival(t.column_name, 2021)) j
Which outputs:
NAME           ID         METRICTYPE       YEAR  VALUE
Citizen, John  123456789  ScholarlyOutput  2021  1
db<>fiddle here
I have a multidimensional array that contains employee salaries by salary year, with a value for each month of that year. I want to insert the salaries for different years as different rows, with the month values in their respective columns. The database table has one year column and 12 month columns. Please guide me on how I should insert the employees' salaries as separate rows in the table.
My multidimensional array structure is like this:
Array
(
[2016] => Array
(
[jan] => 15000
[feb] => 15000
[mar] => 15000
[apr] => 15000
[may] => 15000
[jun] => 15000
[jul] => 15000
[aug] => 15000
[sep] => 15000
[oct] => 15000
[nov] => 15000
[dec] => 15000
)
[2017] => Array
(
[jan] => 20000
[feb] => 20000
[mar] => 20000
[apr] => 20000
[may] => 20000
[jun] => 20000
[jul] => 20000
[aug] => 20000
[sep] => 20000
[oct] => 20000
[nov] => 20000
[dec] => 20000
)
)
You must flatten your array; you need an array like:
$data = [
['year'=>'2016', 'month'=>'1', 'salary' => 15000],
['year'=>'2016', 'month'=>'2', 'salary' => 15000],
// ... and so on
];
Then you can just insert using your model like:
YourSalaryModel::insert($data);
Q: Why aren't you saving them (or didn't you save them) at that point in time, i.e. Jan 2017? (But that's an aside.)
I would have a salaries table with a date column (e.g. 2016-01-01), a user_id, and a salary column (int, or float/double, depending on whether the values are always whole numbers).
In your example, it is a case of doing two nested loops:
foreach ($salaries as $year => $months) {
foreach ($months as $month => $salary) {
// carbon parse to create a date
//insert into the table
}
}
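A sketch of the salaries table described above, assuming MySQL (which Laravel commonly uses); all names and types here are assumptions:
-- hypothetical structure based on the description above: one row per user per month
CREATE TABLE salaries (
    id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    salary_date DATE NOT NULL,          -- e.g. 2016-01-01 for "Jan 2016"
    salary      DECIMAL(10, 2) NOT NULL
);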
SQLite DB with a single table and 60,000,000 records. The time to run a simple query is more than 100 seconds.
I've tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MSSQL.
Should I split the table (say, a different table for each pointID, of which there are a few hundred, or a different table for each month, so that each table has at most 10,000,000 records)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: following @Raymond Nijland's suggestion, the execution plan is:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)
EXECUTE LIST SUBQUERY 1
USE TEMP B-TREE FOR ORDER BY
Thanks to him, and using this data, I've changed the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, it's solved).
After @RaymondNijland suggested checking the execution plan, I changed the query to:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This query gives the same results as the other one, but it's 120 times faster (it reduces the number of records before sorting).
I've had little luck searching for this over a couple of days.
If my avro schema for data in a hive table is:
{
"type" : "record",
"name" : "messages",
"namespace" : "com.company.messages",
"fields" : [ {
"name" : "timeStamp",
"type" : "long",
"logicalType" : "timestamp-millis"
}, {
…
and I use presto to query this, I do not get formatted timestamps.
select "timestamp", typeof("timestamp") as type,
current_timestamp as "current_timestamp", typeof(current_timestamp) as current_type
from db.messages limit 1
timestamp      type    current_timestamp                    current_type
1497210701839  bigint  2017-06-14 09:32:43.098 Asia/Seoul   timestamp with time zone
I thought it would be a non-issue then to convert them to timestamps with millisecond precision, but I'm finding I have no clear way to do that.
select cast("timestamp" as timestamp) from db.messages limit 1
line 1:16: Cannot cast bigint to timestamp
Also, they've changed Presto's timestamp casting to always assume the source is in seconds:
https://issues.apache.org/jira/browse/HIVE-3454
So if I use from_unixtime(), I have to chop off the milliseconds, or else it gives me a very distant date:
select from_unixtime("timestamp") as "timestamp" from db.messages limit 1
timestamp
+49414-08-06 07:15:35.000
Surely someone else who works with Presto more often knows how to express the conversion properly. (I can't restart the Presto or Hive servers to force the timezone to UTC either, by the way.)
I didn't find a direct conversion from a Java timestamp (number of milliseconds since 1970) to a timestamp, but it can be done with from_unixtime and adding the milliseconds as an interval:
presto> with t as (select cast('1497435766032' as bigint) a)
-> select from_unixtime(a / 1000) + parse_duration(cast((a % 1000) as varchar) || 'ms') from t;
_col0
-------------------------
2017-06-14 12:22:46.032
(1 row)
(admittedly cumbersome, but works)
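An alternative sketch (my own, not part of the answer above): Presto's date_add supports a 'millisecond' unit, so the remainder can be added directly, assuming the value is epoch milliseconds as in the example:
-- from_unixtime(a / 1000) gives the whole seconds; date_add puts back the a % 1000 milliseconds
with t as (select cast('1497435766032' as bigint) a)
select date_add('millisecond', a % 1000, from_unixtime(a / 1000)) from t;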
select from_unixtime(cast(event_time as bigint) / 1000000) + parse_duration(cast(((cast(event_time as bigint) % 1000000) / 1000) as varchar) || 'ms') from TableName limit 10;
Ideally, I'd like something packaged like SAS proc compare that can give me:
The count of rows for each dataset
The count of rows that exist in one dataset, but not the other
Variables that exist in one dataset, but not the other
Variables that do not have the same format in the two files (I realize this would be rare for AVRO files, but would be helpful to know quickly without deciphering errors)
The total number of mismatching rows for each column, and a presentation of all the mismatches for a column or any 20 mismatches (whichever is smallest)
I've worked out one way to make sure the datasets are equivalent, but it is pretty inefficient. Let's assume we have two AVRO files with 100 rows and 5 columns (one key and four float features). If we join the tables and create new variables that are the difference between the matching features from the two datasets, then any non-zero difference is a mismatch in the data. From there it would be pretty easy to determine the entire list of requirements above, but it just seems like there may be more efficient ways.
AVRO files store the schema and data separately. This means that besides the AVRO file with the data, you should have a schema file, usually something like *.avsc. This way your task can be split into 3 parts:
Compare the schemas. This way you can find fields that have different data types, fields that exist in only one of the files, and so on. This task is very easy and can be done outside of Hadoop, for instance in Python:
import json

schema1 = json.load(open('schema1.avsc'))
schema2 = json.load(open('schema2.avsc'))

def print_cross(s1set, s2set, message):
    for s in s1set:
        if not s in s2set:
            print message % s

s1names = set([ field['name'] for field in schema1['fields'] ])
s2names = set([ field['name'] for field in schema2['fields'] ])
print_cross(s1names, s2names, 'Field "%s" exists in table1 and does not exist in table2')
print_cross(s2names, s1names, 'Field "%s" exists in table2 and does not exist in table1')

def print_cross2(s1dict, s2dict, message):
    for s in s1dict:
        if s in s2dict:
            if s1dict[s] != s2dict[s]:
                print message % (s, s1dict[s], s2dict[s])

s1types = dict(zip([ field['name'] for field in schema1['fields'] ], [ str(field['type']) for field in schema1['fields'] ]))
s2types = dict(zip([ field['name'] for field in schema2['fields'] ], [ str(field['type']) for field in schema2['fields'] ]))
print_cross2(s1types, s2types, 'Field "%s" has type "%s" in table1 and type "%s" in table2')
Here's an example of the schemas:
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int"]},
{"name": "favorite_color", "type": ["string", "null"]},
{"name": "test", "type": "int"}
]
}
Here's the output:
[localhost:temp]$ python compare.py
Field "test" exists in table2 and does not exist in table1
Field "favorite_number" has type "[u'int', u'null']" in table1 and type "[u'int']" intable2
If the schemas are equal (and you probably don't need to compare the data if they are not), then you can compare the data in the following way. An easy approach that works in any case: calculate an MD5 hash for each row and join the two tables on the value of this hash. It will give you the number of rows that are the same in both tables, the number of rows specific to table1, and the number of rows specific to table2. This can easily be done in Hive; here's the code of an MD5 UDF: https://gist.github.com/dataminelab/1050002
For a field-by-field comparison, you have to know the primary key of the table, join the two tables on the primary key, and compare the fields.
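A minimal sketch of that primary-key diagnosis in Hive SQL, assuming the example User schema above with name as the key; the table names table1/table2 and the md5() function (built in to recent Hive versions, or the UDF linked above) are assumptions:
-- full outer join on the assumed key (name) and classify each key
-- note: concat_ws skips NULLs, so a real comparison should encode NULLs explicitly
SELECT
  coalesce(t1.name, t2.name) AS name,
  CASE
    WHEN t2.name IS NULL THEN 'PK exists only in table1'
    WHEN t1.name IS NULL THEN 'PK exists only in table2'
    WHEN md5(concat_ws('|', cast(t1.favorite_number AS string), t1.favorite_color))
      <> md5(concat_ws('|', cast(t2.favorite_number AS string), t2.favorite_color))
      THEN 'data does not match'
    ELSE 'match'
  END AS diagnosis
FROM table1 t1
FULL OUTER JOIN table2 t2 ON t1.name = t2.name;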
I've previously developed comparison functions for tables, and they usually looked like this:
Check that both tables exist and are available
Compare their schemas. If there are any mismatches in the schemas, stop
If the primary key is specified:
Join both tables on the primary key using a full outer join
Calculate an MD5 hash for each row
Output the primary keys with a diagnosis (PK exists only in table1, PK exists only in table2, PK exists in both tables but the data does not match)
Get a sample of 100 rows from each problematic class, join with both tables, and output to a "mismatch example" table
If the primary key is not specified:
Calculate an MD5 hash for each row
Full outer join table1 with table2 on the MD5 hash value (see the sketch after this list)
Count the number of matching rows, the number of rows that exist only in table1, and the number that exist only in table2
Get a 100-row sample of each mismatch type and output it to a "mismatch example" table
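A minimal sketch of the no-primary-key row counts in Hive SQL, again assuming the example User schema, table names table1/table2, and an available md5() function; NULL and exact-duplicate handling are simplified here:
-- hash every row of each table, then full outer join on the hash
-- note: concat_ws skips NULLs, and duplicate rows would need grouping in a real implementation
WITH h1 AS (
  SELECT md5(concat_ws('|', name, cast(favorite_number AS string), favorite_color)) AS row_hash
  FROM table1
),
h2 AS (
  SELECT md5(concat_ws('|', name, cast(favorite_number AS string), favorite_color)) AS row_hash
  FROM table2
)
SELECT
  sum(CASE WHEN h1.row_hash IS NOT NULL AND h2.row_hash IS NOT NULL THEN 1 ELSE 0 END) AS rows_in_both,
  sum(CASE WHEN h2.row_hash IS NULL THEN 1 ELSE 0 END) AS rows_only_in_table1,
  sum(CASE WHEN h1.row_hash IS NULL THEN 1 ELSE 0 END) AS rows_only_in_table2
FROM h1
FULL OUTER JOIN h2 ON h1.row_hash = h2.row_hash;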
Usually, developing and debugging such a function takes 4-5 business days.