How to do set union operations on arrays? (distinct)

Here's what I want to accomplish:
I want an array[string] column in a SingleStore columnar table, and to do a set union operation on this column over some time range/conditions.
I don't know how to express this in SingleStore SQL. SingleStore doesn't support array columns, so I've resorted to using a JSON column.
Any pointers?
Sample table
| key | locations |
| ----- | --------------------------------- |
| alice | ["sanjose", "sancarlos", "miami"] |
| alice | ["sanjose","milpitas"] |
| alice | ["miami"] |
| alice | ["sanmateo","sanfrancisco"] |
| alice | ["redwoodshores","sanfrancisco"] |
| bob | ["sanjose","milpitas"] |
| bob | ["freemont","onioncity"] |
| bob | ["onioncity"] |
| bob | ["newark","milpitas"] |
| bob | ["sanjose"] |
| bob | ["santacruz"] |
Expected output
| key | array_agg(distinct elem) OR array_distinct(flatten(array_agg(locations))) |
| ----- | ------------------------------------------------------------------------------ |
| alice | [ miami, milpitas, redwoodshores, sancarlos, sanfrancisco, sanjose, sanmateo ] |
| bob | [ freemont, milpitas, newark, onioncity, sanjose, santacruz ] |
Any pointers on how I can accomplish this with SingleStore?
How I would do this in other DBs/Frameworks:
Postgres - It was relatively simple to construct this query. I can create an array column and do array_agg(distinct elem) - http://www.sqlfiddle.com/#!17/58628/1
Spark - as simple as array_distinct( array_agg( col ) )
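For reference, here is roughly what the Postgres version looks like (a minimal sketch, assuming a hypothetical table t(key text, locations text[]) like in the fiddle above):
select key, array_agg(distinct elem order by elem) as locations
from t, unnest(locations) as elem
group by key;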

You can use the JSON_AGG function to achieve this.
This function does not support DISTINCT (i.e. json_agg(distinct key_col) is not allowed), so the DISTINCT is applied inside the CTE instead: JSON_TO_ARRAY turns each JSON array into a SingleStore array, TABLE() unnests it into one row per element (exposed as table_col), SELECT DISTINCT de-duplicates the (key_col, element) pairs, and JSON_AGG then reassembles one array per key.
Docs: https://docs.singlestore.com/managed-service/en/reference/sql-reference/json-functions/json_agg.html
Sample Code
create table json_array_example (key_col text, locations json);
insert into json_array_example values('alice', '["sanjose", "sancarlos", "miami"]');
insert into json_array_example values('alice', '["sanjose","milpitas"]');
insert into json_array_example values('alice', '["miami"]');
insert into json_array_example values('alice', '["sanmateo","sanfrancisco"]');
insert into json_array_example values('alice', '["redwoodshores","sanfrancisco"]');
insert into json_array_example values('bob' , '["sanjose","milpitas"]');
insert into json_array_example values('bob' , '["freemont","onioncity"]');
insert into json_array_example values('bob' , '["onioncity"]');
insert into json_array_example values('bob' , '["newark","milpitas"]');
insert into json_array_example values('bob' , '["sanjose"]');
insert into json_array_example values('bob' , '["santacruz"]');
WITH t AS (
    SELECT DISTINCT key_col, table_col AS locations
    FROM json_array_example
    JOIN TABLE(JSON_TO_ARRAY(locations))
)
SELECT key_col, JSON_AGG(locations) AS locations
FROM t
GROUP BY key_col;

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before and After picture of partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but won't drop the partition. The table is internal/managed.
I tried different ways mentioned on Stack but it is just not working for me.
Help.
You don't need to set a variable; SET hivevar just stores the literal text current_date() without evaluating it, so the DROP ends up comparing against a string that never matches a real partition value. You can drop the partition directly with SQL:
Alter table prod_db.products
drop partition (load_date= current_date());
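If your Hive version rejects a function call inside the partition spec, a hedged fallback is a literal date value (the value below is just one of the partitions listed above):
ALTER TABLE prod_db.products DROP IF EXISTS PARTITION (load_date='2022-04-25');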

See number of live/dead tuples in MonetDB

I'm trying to get some precise row counts for all tables, given that some have deleted rows. I have been using sys.storage.count. But this seems to count the deleted ones also.
I assume using sys.storage would be simpler and faster than looping through count(*) queries, though both strategies may be fine in practice.
Maybe there is some column that counts modifications so I could just subtract the two counts?
If all you need to know is the number of actual rows in a table, I'd recommend just using a count(*) query. It's very fast. Even if you have N tables, it's easy to do a count(*) for each table.
sys.storage gives you information from the raw storage. With that, you can get pretty low-level information, but it has some rough edges. sys.storage.count returns the count in the storage, so, indeed, it includes deleted rows, since they are not physically removed. As of the Jul2021 version of MonetDB, deleted rows are automatically overwritten by new inserts (i.e. auto-vacuuming). So, to get the actual row count, you need to look up the 'deletes' from sys.deltas('<schema>', '<table>'). For instance:
sql>create table tbl (id int, city string);
operation successful
sql>insert into tbl values (1, 'London'), (2, 'Paris'), (3, 'Barcelona');
3 affected rows
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 2 | Paris |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
After we delete one row, the actual row count is sys.storage.count - sys.deltas ('sys', 'tbl').deletes:
sql>delete from tbl where id = 2;
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 3 | Barcelona |
+------+-----------+
2 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 1 |
| 15570 | 1 |
+-------+---------+
2 tuples
After we insert a new row, the deleted row is overwritten:
sql>insert into tbl values (4, 'Praag');
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 4 | Praag |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
So, the formula to compute the actual row count (sys.storage.count - sys.deltas ('sys', 'tbl').deletes) is generally applicable. sys.deltas() keeps stats for every column of a table, but the count and deletes are table wide, so you only need to check one column.
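For example, a minimal sketch that applies the formula in a single query for one table, following the session above (it assumes the Jul2021+ behaviour described and reads the table-wide deletes figure from an arbitrary column):
select (select count from sys.storage where schema='sys' and table='tbl' and column='id')
     - (select deletes from sys.deltas('sys', 'tbl') limit 1) as live_rows;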

Querying HIVE Metadata

I need to query the following table and view information from my Apache HIVE cluster:
Each row needs to contain the following:
TABLE SCHEMA
TABLE NAME
TABLE DESCRIPTION
COLUMN NAME
COLUMN DATA TYPE
COLUMN LENGTH
COLUMN PRECISION
COLUMN SCALE
NULL OR NOT NULL
PRIMARY KEY INDICATOR
This can be easily queried from most RDBMS (metadata tables/views), but I am struggling to find much information about the equivalent metadata tables/views in HIVE.
Please help :)
This information is available from the Hive metastore. The below example query is for a MySQL-backed metastore (Hive version 1.2).
SELECT
DBS.NAME AS TABLE_SCHEMA,
TBLS.TBL_NAME AS TABLE_NAME,
TBL_COMMENTS.TBL_COMMENT AS TABLE_DESCRIPTION,
COLUMNS_V2.COLUMN_NAME AS COLUMN_NAME,
COLUMNS_V2.TYPE_NAME AS COLUMN_DATA_TYPE_DETAILS
FROM DBS
JOIN TBLS ON DBS.DB_ID = TBLS.DB_ID
JOIN SDS ON TBLS.SD_ID = SDS.SD_ID
JOIN COLUMNS_V2 ON COLUMNS_V2.CD_ID = SDS.CD_ID
JOIN
(
SELECT DISTINCT TBL_ID, TBL_COMMENT
FROM
(
SELECT TBLS.TBL_ID TBL_ID, TABLE_PARAMS.PARAM_KEY, TABLE_PARAMS.PARAM_VALUE, CASE WHEN TABLE_PARAMS.PARAM_KEY = 'comment' THEN TABLE_PARAMS.PARAM_VALUE ELSE '' END TBL_COMMENT
FROM TBLS JOIN TABLE_PARAMS
ON TBLS.TBL_ID = TABLE_PARAMS.TBL_ID
) TBL_COMMENTS_INTERNAL
) TBL_COMMENTS
ON TBLS.TBL_ID = TBL_COMMENTS.TBL_ID;
Sample output:
+--------------+----------------------+-----------------------+-------------------+------------------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_DESCRIPTION | COLUMN_NAME | COLUMN_DATA_TYPE_DETAILS |
+--------------+----------------------+-----------------------+-------------------+------------------------------+
| default | temp003 | This is temp003 table | col1 | string |
| default | temp003 | This is temp003 table | col2 | array<string> |
| default | temp003 | This is temp003 table | col3 | array<string> |
| default | temp003 | This is temp003 table | col4 | int |
| default | temp003 | This is temp003 table | col5 | decimal(10,2) |
| default | temp004 | | col11 | string |
| default | temp004 | | col21 | array<string> |
| default | temp004 | | col31 | array<string> |
| default | temp004 | | col41 | int |
| default | temp004 | | col51 | decimal(10,2) |
+--------------+----------------------+-----------------------+-------------------+------------------------------+
Metastore tables referenced in the query:
DBS: Details of databases/schemas.
TBLS: Details of tables.
COLUMNS_V2: Details about columns.
SDS: Details about storage.
TABLE_PARAMS: Details about table parameters (key-value pairs)
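Note that a Hive 1.2 metastore does not track NOT NULL or primary key constraints, and column length/precision/scale are not stored separately either; they are embedded in COLUMNS_V2.TYPE_NAME. A hedged sketch for pulling precision and scale out of decimal columns with MySQL string functions:
SELECT COLUMN_NAME,
       TYPE_NAME,
       CASE WHEN TYPE_NAME LIKE 'decimal(%'
            THEN SUBSTRING_INDEX(SUBSTRING_INDEX(TYPE_NAME, '(', -1), ',', 1) END AS COLUMN_PRECISION,
       CASE WHEN TYPE_NAME LIKE 'decimal(%'
            THEN REPLACE(SUBSTRING_INDEX(TYPE_NAME, ',', -1), ')', '') END AS COLUMN_SCALE
FROM COLUMNS_V2;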

insert id number only in sql

I have a SQL Server table like this
+----+-----------+------------+
| id | acoount | date |
+----+-----------+------------+
| | John | 2/6/2016 |
| | John | 2/6/2016 |
| | John | 4/6/2016 |
| | John | 4/6/2016 |
| | Andi | 5/6/2016 |
| | Steve | 4/6/2016 |
+----+-----------+------------+
I want to insert the id column like this.
+-----------+-----------+------------+
| id | acoount | date |
+-----------+-----------+------------+
| 020616001 | John | 2/6/2016 |
| 020616002 | John | 2/6/2016 |
| 040616001 | John | 4/6/2016 |
| 040616002 | John | 4/6/2016 |
| 050616001 | Andi | 5/6/2016 |
| 040616003 | Steve | 4/6/2016 |
+-----------+-----------+------------+
I want to generate the id number from the date provided, like this: 02+06+16 (from the date) + 001 = 020616001. Rows with the same date increment the id by 1.
I have tried but still failed.
I want to make it in Oracle SQL Developer.
Someone help me.
Thanks.
Try the SQL below as per the given data; it's written for SQL Server 2012.
select REPLACE(CONVERT(VARCHAR(8), CONVERT(datetime, t.[date], 103), 3), '/', '')
+RIGHT('00'+convert(varchar(3),row_number()over(partition by account,[date] order by t.[date])),3) as ID,
t.account,
t.date
from (values ('John','2/6/2016'),
('John','2/6/2016'),
('John','4/6/2016'),
('John','4/6/2016'),
('Andi','5/6/2016'),
('Steve','4/6/2016'))T(account,[date])
Update your actual table with a statement like the one below. A window function can't be used directly in SET, so go through an updatable CTE (column names follow the question's table; your_table is a placeholder for the real table name).
;WITH cte AS (SELECT id, acoount, [date],
       ROW_NUMBER() OVER (PARTITION BY acoount, [date] ORDER BY [date]) AS rn FROM your_table)
UPDATE cte SET id = REPLACE(CONVERT(VARCHAR(8), CONVERT(datetime, [date], 103), 3), '/', '')
                  + RIGHT('00' + CONVERT(VARCHAR(3), rn), 3);
MySql
I can give you the logic for the 020616001 part right now; for the "same date, id + 1" part I still have to work it out and will let you know.
insert into table_name (id)
select concat(
    if(length(day(current_date)) > 1, day(current_date), concat(0, day(current_date))),
    if(length(month(current_date)) > 1, month(current_date), concat(0, month(current_date))),
    right(year(current_date), 2),
    '001'
) as id
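If MySQL 8+ is available, here is a hedged sketch that builds the whole id (ddmmyy prefix plus the per-date sequence) from each row's own date instead of current_date; your_table is a placeholder, and the date is assumed to be stored as dd/mm/yyyy text:
select concat(
    lpad(day(d), 2, '0'),                                         -- dd
    lpad(month(d), 2, '0'),                                        -- mm
    right(year(d), 2),                                             -- yy
    lpad(row_number() over (partition by d order by d), 3, '0')    -- per-date sequence
) as id, acoount, `date`
from (select acoount, `date`, str_to_date(`date`, '%d/%m/%Y') as d from your_table) t;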
You cannot convert your dates column to datetime the normal way because it is in dd/mm/yyyy format.
Try this:
declare #t table(acoount varchar(50),dates varchar(20))
insert into #t values
('John','2/6/2016')
,('John','2/6/2016')
,('John','4/6/2016')
,('John','4/6/2016')
,('Andi','5/6/2016')
,('Steve','4/6/2016')
;With CTE as
(
    select *,
           SUBSTRING(dates, 0, charindex('/', dates)) dd,
           SUBSTRING(stuff(dates, 1, charindex('/', dates), ''), 0,
                     charindex('/', stuff(dates, 1, charindex('/', dates), ''))) MM,
           right(dates, 2) yy
    from #t
),
CTE1 as
(
    select *,
           ROW_NUMBER() over (partition by yy, mm, dd order by yy, mm, dd) rn
    from cte c
)
select *,
       REPLICATE('0', 2 - len(dd)) + cast(dd as varchar(2))
       + REPLICATE('0', 2 - len(MM)) + cast(MM as varchar(2))
       + yy + REPLICATE('0', 3 - len(rn)) + cast(rn as varchar(2))
from cte1

LISTAGG function with two columns

I have one table like this (report)
--------------------------------------------------
| user_id | Department | Position | Record_id |
--------------------------------------------------
| 1 | Science | Professor | 1001 |
| 1 | Maths | | 1002 |
| 1 | History | Teacher | 1003 |
| 2 | Science | Professor | 1004 |
| 2 | Chemistry | Assistant | 1005 |
--------------------------------------------------
I'd like to have the following result
---------------------------------------------------------
| user_id | Department+Position |
---------------------------------------------------------
| 1 | Science,Professor;Maths, ; History,Teacher |
| 2 | Science, Professor; Chemistry, Assistant |
---------------------------------------------------------
That means I need to preserve the empty space as ' ' as you can see in the result table.
Now, I know how to use the LISTAGG function, but only for one column. However, I can't figure out how to do it for two columns at the same time. Here is my query:
SELECT user_id, LISTAGG(department, ';') WITHIN GROUP (ORDER BY record_id)
FROM report
Thanks in advance :-)
It just requires judicious use of concatenation within the aggregation:
select user_id
, listagg(department || ',' || coalesce(position, ' '), '; ')
within group ( order by record_id )
from report
group by user_id
i.e. aggregate the concatenation of department, a comma, and position, replacing position with a space when it is NULL.
