ClickHouse data prep for a multi-level Sankey chart in Superset

Is it possible to get the data into the format shown below? My table is in this format:
cust_id | event_x | status | timestamp
1       | event_1 | succ   | t
1       | event_2 | succ   | t+1
1       | event_3 | succ   | t+3
1       | event_4 | succ   | t+4
1       | event_5 | succ   | t+5
and I want the data in this format, to create a multi-level Sankey chart in Superset:
event_1 | event_2
event_2 | event_3
event_3 | event_4
event_4 | event_5
Please help.

Try using:
SELECT DISTINCT sankey_events
FROM (
    SELECT
        cust_id,
        arrayStringConcat(groupArray(event_x), '|') AS sankey_events
    FROM table
    GROUP BY cust_id
)
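That yields one '|'-separated event path per customer. If the Superset Sankey chart instead needs explicit source/target pairs per level, as in the desired output above, a sketch along the following lines may help (events_table is a placeholder for your table name, it assumes a ClickHouse version with arrayZip, and the inner ORDER BY matters because groupArray only preserves the order of the rows it receives):
SELECT
    cust_id,
    pair.1 AS source_event,   -- e.g. event_1
    pair.2 AS target_event    -- e.g. event_2
FROM
(
    SELECT
        cust_id,
        groupArray(event_x) AS events
    FROM
    (
        SELECT cust_id, event_x, timestamp
        FROM events_table
        ORDER BY cust_id, timestamp
    )
    GROUP BY cust_id
)
-- pair each event with the one that follows it: (e1,e2), (e2,e3), ...
ARRAY JOIN arrayZip(arrayPopBack(events), arrayPopFront(events)) AS pair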

Related

Extracting XML from a CLOB column but getting non-distinct multiple values in a single cell

Can you please advise on the below?
I'm extracting XML from a CLOB column, but it returns multiple non-distinct values in a single cell. I need the values in a column (one per row), or just the first or last node value.
Select d.b_id,
       xmltype('<rds><startRD>'||u.con_data||'</startRD></rds>').extract('//rds/startRD/text()').getStringVal() start_RD
From sp_d_a d
Left Outer Join c1_u u
  On d.b_id = u.b_id
Where d.b_id In ('4017')
It returns this:
353.000000524.000000527.306000722.430000
I need it as separate rows in a column, like this:
b_id | startRD
==================
12 | 353.000000
12 | 524.000000
12 | 527.306000
12 | 722.430000
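One common way to split the concatenated node values into separate rows is XMLTABLE. The following is only a sketch built on the query from the question; it assumes the assembled document really contains several startRD text nodes (which the concatenated output suggests), and the VARCHAR2 size is a guess:
Select d.b_id, x.start_rd
From sp_d_a d
Left Outer Join c1_u u
  On d.b_id = u.b_id
Cross Join XMLTable(
    '/rds/startRD'
    Passing xmltype('<rds><startRD>'||u.con_data||'</startRD></rds>')
    Columns start_rd Varchar2(30) Path 'text()'
) x
Where d.b_id In ('4017');
Once each value sits in its own row, keeping only the first or last node value is just a matter of ordering (for example MIN/MAX or ROW_NUMBER over the rows per b_id).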

How many ROS containers would be created if I load a table with 100 rows in a 3-node cluster?

I have a 3-node cluster. There is 1 database and 1 table. I have not created a projection. If I load 100 rows into the table using the COPY command, then:
How many projections would be created? I suspect only 1 superprojection; am I correct?
If I am using segmentation, would that distribute the data evenly (~33 rows) per node? Does that mean I now have 3 Read Optimised Storage (ROS) containers, one per node, so the projection has 3 ROSes?
If I use a K-safety value of 1, then a copy of each ROS (a buddy) would be stored on another node? So do I have 6 ROSes now, each containing ~33 rows?
Well, let's play out the scenario ...
You will see that you get a projection and its identical buddy projection ...
And you can query the catalogue to count the rows and identify the projections ...
-- load a file with 100 random generated rows into table example;
-- generate the rows from within Vertica, and export to file
-- then create a new table and see what the projections look like
CREATE TABLE rows100 AS
SELECT
  (ARRAY['Ann','Lucy','Mary','Bob','Matt'])[RANDOMINT(5)] AS fname,
  (ARRAY['Lee','Ross','Smith','Davis'])[RANDOMINT(4)]     AS lname,
  '2001-01-01'::DATE + RANDOMINT(365*10)                  AS hdate,
  (10000 + RANDOM()*9000)::NUMERIC(7,2)                   AS salary
FROM (
  SELECT tm FROM (
    SELECT now() + INTERVAL ' 1 second'   AS t UNION ALL
    SELECT now() + INTERVAL '100 seconds' AS t -- creates 100 rows
  ) x TIMESERIES tm AS '1 second' OVER(ORDER BY t)
) y
;
-- set field separator to vertical bar (the default, actually...)
\pset fieldsep '|'
-- toggle to tuples only .. no column names and no row count
\tuples_only
-- spool to example.bsv - in bar-separated-value format
\o example.bsv
SELECT * FROM rows100;
-- spool to file off - closes output file
\o
-- create a table without bothering with projections matching the test data
DROP TABLE IF EXISTS example;
CREATE TABLE example LIKE rows100;
-- load the new table ...
COPY example FROM LOCAL 'example.bsv';
-- check the nodes ..
SELECT node_name FROM nodes;
-- out node_name
-- out ----------------
-- out v_sbx_node0001
-- out v_sbx_node0002
-- out v_sbx_node0003
SELECT
node_name
, projection_schema
, anchor_table_name
, projection_name
, row_count
FROM v_monitor.projection_storage
WHERE anchor_table_name='example'
ORDER BY projection_name, node_name
;
-- out node_name | projection_schema | anchor_table_name | projection_name | row_count
-- out ----------------+-------------------+-------------------+-----------------+-----------
-- out v_sbx_node0001 | public | example | example_b0 | 38
-- out v_sbx_node0002 | public | example | example_b0 | 32
-- out v_sbx_node0003 | public | example | example_b0 | 30
-- out v_sbx_node0001 | public | example | example_b1 | 30
-- out v_sbx_node0002 | public | example | example_b1 | 38
-- out v_sbx_node0003 | public | example | example_b1 | 32
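The projection_storage view above answers the row-count part of the question. To count the actual ROS containers, a sketch like the following, against v_monitor.storage_containers, should work on recent Vertica versions (the exact column set can vary a little by release). A single COPY typically produces a small number of ROS containers per projection per node, and the Tuple Mover merges them later during mergeout.
SELECT
  node_name
, projection_name
, COUNT(*)             AS ros_container_count
, SUM(total_row_count) AS row_count
FROM v_monitor.storage_containers
WHERE projection_name LIKE 'example%'
GROUP BY node_name, projection_name
ORDER BY projection_name, node_name
;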

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? I.e., consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous record. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using CASE to take the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The first step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert sample data into the original table daily_employee_balance:
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table containing a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is a normal industry standard, and you can probably download sample data for one from the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for the entire month of November 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table of historical data that you keep inserting into incrementally. You can look into data warehousing to get the proper definitions, but I call this a fact table: f_employee_balance.
This will re-create the original table with the missing dates filled in, populating each missing balance with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select
  q1.employee_id,
  q1.iso_date,
  nvl(last_value(r.balance, true)            -- carry the last known balance forward
        over (partition by q1.employee_id
              order by q1.iso_date
              rows between unbounded preceding and current row),
      0) as balance,                         -- initial dates get a 0 balance
  month,
  year
from (
  select distinct
    r.employee_id,
    d.iso_date as iso_date,
    d.month,
    d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year
from (
  select
    employee_id,
    row_number() over (partition by employee_id, year, month) as row_num,
    avg(balance)  over (partition by employee_id, year, month) as monthly_avg,
    month,
    year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4 (a sketch of that is shown below); this would save you from creating the extra table. When you are in the big data world, you don't worry much about spending extra disk space or development time: you can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concepts and Hive analytical queries.
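For reference, here is a rough sketch of steps 3 and 4 folded into a single query, using only the daily_employee_balance and dimension_date tables created above (untested here, but it is just the two queries composed):
select employee_id, monthly_avg, month, year
from (
  select
    employee_id,
    row_number() over (partition by employee_id, year, month) as row_num,
    avg(balance)  over (partition by employee_id, year, month) as monthly_avg,
    month,
    year
  from (
    -- same logic as the f_employee_balance table from step 3
    select
      q1.employee_id,
      q1.iso_date,
      nvl(last_value(r.balance, true)
            over (partition by q1.employee_id
                  order by q1.iso_date
                  rows between unbounded preceding and current row),
          0) as balance,
      q1.month,
      q1.year
    from (
      select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
      from daily_employee_balance r, dimension_date d
    ) q1
    left outer join daily_employee_balance r
      on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
  ) f
) q2
where row_num = 1
order by year, month;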

How to get multiple values in the same cell in Oracle

I have a table in Oracle with two columns. In the first column, there are sometimes duplicate values that correspond to different values in the second column. How can I write a query that shows only the unique values of the first column together with all possible values from the second column?
The table looks somewhat like this:
COLUMN_1 | COLUMN_2
NUMBER_1 | 4
NUMBER_2 | 4
NUMBER_3 | 1
NUMBER_3 | 6
NUMBER_4 | 3
NUMBER_4 | 4
NUMBER_4 | 5
NUMBER_4 | 6
You can use LISTAGG() if you are on Oracle 11g or higher, like this:
SELECT
COLUMN_1,
LISTAGG(COLUMN_2, '|') WITHIN GROUP (ORDER BY COLUMN_2) "ListValues"
FROM table1
GROUP BY COLUMN_1
Otherwise, see this link for an alternative for lower versions:
Oracle equivalent of MySQL group_concat
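One common pre-11g workaround along the lines discussed there uses XMLAGG; a rough sketch (untested here, with the same table and columns as above) would be:
-- build tiny XML fragments, flatten them to text, then trim the trailing separator
SELECT
  COLUMN_1,
  RTRIM(XMLAGG(XMLELEMENT(e, COLUMN_2 || '|') ORDER BY COLUMN_2).EXTRACT('//text()').GetClobVal(), '|') AS "ListValues"
FROM table1
GROUP BY COLUMN_1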

Oracle: identify non-unique values in a CLOB column of a table

I want to identify all rows whose content in a clob column is not unique.
The query I use is:
select
id,
clobtext
from
table t
where
(select count(*) from table innerT where dbms_lob.compare(innerT.clobtext, t.clobtext) = 0)>1
However, this query is very slow. Any suggestions to speed it up? I already tried using the dbms_lob.getlength function to eliminate more rows in the subquery, but it didn't really improve the performance (it feels the same).
To make it clearer, an example:
table
ID | clobtext
1 | a
2 | b
3 | c
4 | d
5 | a
6 | d
After running the query, I'd like to get (order doesn't matter):
1 | a
4 | d
5 | a
6 | d
In the past I've generated checksums (in my C# code) for each CLOB.
Whilst this incurs a one-off increase in I/O (to generate the checksum),
subsequent scans will be quicker, and you can index the value too.
TK has a good PL/SQL example here:
Ask Tom
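The same checksum idea can also stay inside the database. A rough sketch, assuming EXECUTE privilege on DBMS_CRYPTO and using your_table as a stand-in for the real table name; rows that share a hash can still be double-checked with dbms_lob.compare, but the hash does the heavy lifting:
SELECT id, clobtext
FROM (
  SELECT
    id,
    clobtext,
    -- 2 = DBMS_CRYPTO.HASH_MD5 (package constants can't be referenced directly in plain SQL)
    COUNT(*) OVER (PARTITION BY DBMS_CRYPTO.HASH(clobtext, 2)) AS dup_count
  FROM your_table
)
WHERE dup_count > 1;
Persisting the hash in its own indexed column (for example, maintained by a trigger) gives the one-off cost mentioned above while keeping later scans cheap.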
