Get checksum (cityHash64) of first n rows of ClickHouse table - clickhouse

According to https://clickhouse.tech/docs/en/sql-reference/functions/hash-functions/,
I can get a checksum of entire table this way:
SELECT groupBitXor(cityHash64(*)) FROM table
What is the most accurate way to get a checksum of first N rows of a table?
As an example, I'm using a table with GenerateRandom engine as stated here.
CREATE TABLE test (name String, value UInt32) ENGINE = GenerateRandom(1, 5, 3)
I tried using LIMIT clause, but with no luck yet.

Consider using sub-query:
SELECT groupBitXor(cityHash64(*))
FROM (
SELECT *
FROM table
LIMIT x)
SELECT groupBitXor(cityHash64(*))
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
/*
┌─groupBitXor(cityHash64(number))─┐
│ 9791317254842948406 │
└─────────────────────────────────┘
*/

Related

argMax of two columns in clickhouse

Is that possible to get id for maximum value of timestamp and duration. I am looking for query like
SELECT name, argMax(id, (timestamp, duration)) FROM tables GROUP BY name
It's unclear what you mean by the maximum.
Clickhouse is able to compare tuples from the left to the right
https://clickhouse.com/docs/en/sql-reference/data-types/tuple/
select (2022, 1, 1) > (2021, 12, 31);
┌─greater((2022, 1, 1), (2021, 12, 31))─┐
│ 1 │
└───────────────────────────────────────┘
In this case you should use
SELECT name, argMax(id, (timestamp, duration))
FROM tables
GROUP BY name
And Clickhouse has a function greatest https://clickhouse.com/docs/en/sql-reference/functions/other-functions/#greatesta-b
select greatest(2021, 2023);
┌─greatest(2021, 2023)─┐
│ 2023 │
└──────────────────────┘
Then you should use
SELECT name, argMax(id, greatest(timestamp, duration))
FROM tables
GROUP BY name

Trying to display top 3 amount from a table using sql query in oracle 11g..column is of varchar type

Am trying to list top 3 records from atable based on some amount stored in a column FTE_TMUSD which is of varchar datatype
below is the query i tried
SELECT *FROM
(
SELECT * FROM FSE_TM_ENTRY
ORDER BY FTE_TMUSD desc
)
WHERE rownum <= 3
ORDER BY FTE_TMUSD DESC ;
o/p i got
972,9680,963 -->FTE_TMUSD values which are not displayed in desc
I am expecting an o/p which will display the top 3 records of values
That should work; inline view is ordered by FTE_TMUSD in descending order, and you're selecting values from it.
What looks suspicious are values you specified as the result. It appears that FTE_TMUSD's datatype is VARCHAR2 (ah, yes - it is, you said so). It means that values are sorted as strings, not numbers - and it seems that you expect numbers. So, apply TO_NUMBER to that column. Note that it'll fail if column contains anything but numbers (for example, if there's a value 972C).
Also, an alternative to your query might be use of analytic functions, such as row_number:
with temp as
(select f.*,
row_number() over (order by to_number(f.fte_tmusd) desc) rn
from fse_tm_entry f
)
select *
from temp
where rn <= 3;

How to set range for limit clause in hive

How to set range for limit clause in hive , I have tried the below query but failed with syntax error . Can someone please help
select * from table limit 1000,2000;
You can use Row_Number window function and set the range limit.
Below Query will result only the first 20 records from the table
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 0 and rowid <=20;
Using Between operator to specify range
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid between 0 and 20;
To fetch rows from 20 to 40 then increase the lower/upper bound values
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 20 and rowid <=40;
The LIMIT clause is used to set a ceiling on the number of rows in the result set. You are getting a syntax error because of an incorrect usage of this HQL clause.
The query could be written as the following to return no more than 2000 rows:
SELECT * FROM table LIMIT 2000;
You could also write it like so to return no more than 1000 rows:
SELECT * FROM table LIMIT 1000;
However you cannot combine both into the same argument for LIMIT. The LIMIT argument must evaluate to a constant value.
I will try and expand on this information a bit to try and help solve your problem. If you are attempting to "paginate" your results the following may be of use.
FIRST I would recommend against leaning on HQL for pagination, in most situations that would be more efficiently implemented on the application logic side (query large result set, cache what you need, paginate with application logic). If you have no choice but to pull out ranges of rows you can get the desired effect through a combination of the LIMIT, ORDER BY, and OFFSET clauses.
LIMIT : This will limit your result set to a maximum number of row
ORDER BY: This will sort/order your result set based on one or more columns
OFFSET: This will start your result set at a certain row after the logical first entry in the table.
You may combine these three clauses to effectively query "pages" of your table. For example the following three queries show how to get the first 3 blocks of data from a table where each block contains 1000 rows and the target table's 'column1' is used to determine logical order.
SELECT title as "Page 1", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 0;
SELECT title as "Page 2", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 1000;
SELECT title as "Page 3", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 2000;
Each query declares 'column1' as the sorting value with ORDER BY. The queries will return no more than 1000 rows due to the LIMIT clause. Each result set will start at a different row due to the OFFSET being incremented by the "page size" for each query.
I am not sure what you are trying to achieve, but ...
That will return the 1001 and the 2001 record in the query results set only if you are using hive a hive version greater than 2.0.0
hive --version
(https://issues.apache.org/jira/browse/HIVE-11531)
Limit in Hive gives 'n' number of records randomly. It's not to print a range of records.
You may use order by in conjunction with limit to get what you want

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index
in(select Fixed_Accident_Index from Accidents.CleanedFilledCombined
group by Fixed_Accident_Index
having count(Fixed_Accident_Index) >1);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when having too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn’t have any records - below variation of Jordan’s answer will work well enough with writing over same table or new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos,
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In more generic case - with complex schema with records/netsed fields, etc. - above approach can be a challenge.
I would propose to try using Tabledata: insertAll API with rows[].insertId set to respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery
Of course, this will involve some client side coding - so might be not relevant for this particular question.
I havent tried this approach by myself either but feel it might be interesting to try :o)
If you have a large-size partitioned table, and only have duplicates in a certain partition range. You don't want to overscan nor process the whole table. use the MERGE SQL below with predicates on partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The Where True is neccesary because qualify needs a where, group by or having clause
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you backup the original table before you run this ^^
I don't recommend to use ROW NUMBER() OVER() approach if possible since you may run into BigQuery memory limits and get unexpected errors.
Update BigQuery schema with new table column as bq_uuid making it NULLABLE and type STRING

Create duplicate rows by running same command 5 times for example
insert into beginner-290513.917834811114.messages (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check if duplicate entries exist
select * from beginner-290513.917834811114.messages where id = 19999
Use generate uuid function to generate uuid corresponding to each message

UPDATE beginner-290513.917834811114.messages
SET bq_uuid = GENERATE_UUID()
where id>0
Clean duplicate entries
DELETE FROM beginner-290513.917834811114.messages
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM beginner-290513.917834811114.messages ) t
WHERE t.row_num > 1 );

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows, how I can randomly return 20 of those records from the entire 100 rows.
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the Table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two ways were introduced
1) using order by DBMS_RANDOM.VALUE clause
2) using sample([%]) function
The first way has advantage in 'CORRECTNESS' which means you will never fail get result if it actually exists, while in the second way you may get no result even though it has cases satisfying the query condition since information is reduced during sampling.
The second way has advantage in 'EFFICIENT' which mean you will get result faster and give light load to your database.
I was given an warning from DBA that my query using the first way gives loads to the database
You can choose one of two ways according to your interest!
In case of huge tables standard way with sorting by dbms_random.value is not effective because you need to scan whole table and dbms_random.value is pretty slow function and requires context switches. For such cases, there are 3 additional methods:
1: Use sample clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
ie get 1% of all blocks, then sort them randomly and return just 1 row.
2: if you have an index/primary key on the column with normal distribution, you can get min and max values, get random value in this range and get first row with a value greater or equal than that randomly generated value.
Example:
--big table with 1 mln rows with primary key on ID with normal distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: get random table block, generate rowid and get row from the table by this rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting the lot of them randomly ordered and selecting the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find Random 50% records from table ?
when we want percent wise randomly data
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;

Resources