How to Improve Cross Join Performance in Hive TEZ? - performance

I have a Hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded set of 52 records.
To achieve this I am doing a cross join like:
select *
from table1 join table2
ON 1 = 1;
This is taking 5 hours to run with the highest possible memory parameters.
Is there a shorter or easier way to achieve this in less time?

Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The table is small (52 records) and should fit into memory. The map-join operator will load the small table into the distributed cache, and each mapper container will use it to process data in memory, which is much faster than a common join.
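A few related settings control when Hive performs this conversion. A minimal sketch, where the size threshold value is only an illustrative assumption (defaults vary by Hive version):
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- small-table size threshold in bytes; 10 MB here is an assumed value
set hive.auto.convert.join.noconditionaltask.size=10000000;
select *
from table1 cross join table2;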

Your query is slow because a cross join (Cartesian product) is processed by a single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner join, so as to utilize the map-side join optimization.
with t1 as (
select col1, col2, ..., 0 as k from table1
)
, t2 as (
select col3, col4, ..., 0 as k from table2
)
select *
from t1 join t2
on t1.k = t2.k
Now each table (CTE) has a fake column called k with the identical value 0, so the query works just like a cross join while only a map-side join operation takes place.
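To confirm the rewrite avoids the single-reducer Cartesian product, you can inspect the plan with EXPLAIN; with map-join conversion enabled you should see a Map Join Operator instead of a shuffle to one reducer (column names below are the placeholders from the query above):
explain
select *
from (select col1, 0 as k from table1) t1
join (select col3, 0 as k from table2) t2
on t1.k = t2.k;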

Related

Hive queries are constantly failing. How to optimally join very big tables?

I'm trying to (inner) join two tables on HDFS partitioned by 'day' (date) for multiple days (say 2 weeks). Both tables have 100s of columns, but I'm only trying to query 10s of them. Each day has more than a billion rows.
My Hive query looks like the following.
INSERT OVERWRITE TABLE join1 partition (day)
SELECT a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from (
select x1, x2, x3... xn
from table1
where day between day1 and day2
) a
join (
select x1, y1, y2,... ym, day
from table2 where day between day1 and day2
) b
on a.x1=b.x1;
First problem: it takes a really long time (12+ hours) to do this join, even for a smaller period (1-7 days).
Second problem: it fails every time I try to do it for more than 10 days or so. It uses around 504 mappers and 250 reducers, which is the default (I've also tried with 500 reducers).
I know this error is not real (What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask), but even the real error wasn't very useful (sorry I can't get it now).
What could be the reason for this crashing? Can anyone suggest a better way to join such huge tables?
This is too long for a comment.
Some databases have problems when optimizing subqueries. I could imagine that this is a problem with Hive. So, I would recommend:
select a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from table1 a join
table2 b
on a.x1 = b.x1
where a.day between day1 and day2 and
b.day between day1 and day2;
I also wonder if you want a condition a.day = b.day in the on clause. Using the existing partitioning key in the join should help performance.
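A sketch of what that would look like, keeping the placeholder column names and date bounds from the question:
select a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from table1 a join
table2 b
on a.x1 = b.x1 and a.day = b.day
where a.day between day1 and day2;
Joining on the partition key lets Hive prune partitions on both sides and stops rows from different days being matched against each other.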
About the error:
Since you are using dynamic partitioning on join1, did you correctly set the maximum number of partitions that can be created?
About the speed:
Are your table1 and table2 defined like this?
CREATE TABLE table1 (
x1 string,
x2 string,
...
) PARTITIONED BY (day int)
CLUSTERED BY (x1)
SORTED BY (x1) INTO 400 BUCKETS;
This table is partitioned by day, so accessing any day requires reading only the corresponding partition and not the whole table. This will speed up your inner queries.
It also uses bucketing, so when you are joining on x1, all the rows with the same x1 value are stored together in the same bucket, which will speed up your join considerably. The difference is visible only when the join can be performed at the Map stage (thanks to bucketing).
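As a further sketch: if both tables are bucketed and sorted on x1 into the same number of buckets, the following standard Hive settings allow the join to run as a bucket map join or sort-merge bucket join at the Map stage (check your Hive version's documentation before relying on them):
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;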

How to define if table is a good candidate for a clustered columnstore index?

I have read (here, here and here) about clustered columnstore indexes introduced in SQL Server 2014. Basically, now:
Columnstore indexes can be updated
The table schema can be modified (without dropping the columnstore index)
The structure of the base table can be columnar
Space is saved by compression (with a columnstore index, you can save between 40 and 50 percent of the initial space used for the table)
In addition, they support:
Row mode and Batch mode processing
BULK INSERT statement
More data types
As I understand it, there are some restrictions, like:
Unsupported data types
Other indexes cannot be created
But as it is said:
With a clustered columnstore index, all filter possibilities are already covered; the Query Processor, using Segment Elimination, will be able to consider only the segments required by the query clauses. On the columns where it cannot apply Segment Elimination, all scans will be faster than B-Tree index scans because the data is compressed, so fewer I/O operations will be required.
I am interested in the following:
Does the statement above say that a clustered columnstore index is always better for extracting data than a B-Tree index when a lot of duplicated values exist?
What about the performance of a clustered columnstore index versus a non-clustered B-Tree covering index, when the table has many columns, for example?
Can I have a combination of clustered and non-clustered columnstore indexes on one table?
And most importantly, can anyone tell me how to determine whether a table is a good candidate for a columnstore index?
It is said that the best candidates are tables on which update/delete/insert operations are not performed often. For example, I have a table with a storage size above 17 GB (about 70 million rows) where new records are inserted and deleted constantly. On the other hand, a lot of queries use its columns. Or I have a table with a storage size of about 40 GB (about 60 million rows) with many inserts performed each day; it is not queried often, but I want to reduce its size.
I know the answer is mostly in running production tests but before that I need to pick the better candidates.
One of the most important restrictions for Clustered Columnstore indexes is their locking behavior; you can find some details here: http://www.nikoport.com/2013/07/07/clustered-columnstore-indexes-part-8-locking/
Regarding your questions:
1) Does the statement above say that a clustered columnstore index is always better for extracting data than a B-Tree index when a lot of duplicated values exist?
Not only are duplicates scanned faster in Batch Mode, but the data-reading mechanisms for Columnstore Indexes are also more effective when reading all data out of a Segment.
2) What about the performance of a clustered columnstore index versus a non-clustered B-Tree covering index, when the table has many columns, for example?
Columnstore Indexes have significantly better compression than the Page or Row compression available for the Row Store; Batch Mode makes the biggest difference on the processing side; and, as already mentioned, even reading equally-sized pages and extents should be faster for Columnstore Indexes.
3) Can I have a combination of clustered and non-clustered columnstore indexes on one table?
No, at the moment this is impossible.
4) ... can anyone tell how to determine whether a table is a good candidate for a columnstore index?
Any table that you scan and process in large amounts (over 1 million rows), or maybe even a table with over 100K rows that is scanned entirely, might be a candidate to consider.
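For a table that qualifies, creating the index itself is simple. A minimal sketch, where dbo.FactSales is a hypothetical table name; note that in SQL Server 2014 all other indexes on the table must be dropped first:
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;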
There are some restrictions on the used technologies related to the table where you want to build Clustered Columnstore indexes, here is a query that I am using:
select object_schema_name( t.object_id ) as 'Schema'
, object_name (t.object_id) as 'Table'
, sum(p.rows) as 'Row Count'
, cast( sum(a.total_pages) * 8.0 / 1024. / 1024
as decimal(16,3)) as 'size in GB'
, (select count(*) from sys.columns as col
where t.object_id = col.object_id ) as 'Cols Count'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
UPPER(tp.name) in ('VARCHAR','NVARCHAR')
) as 'String Columns'
, (select sum(col.max_length)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id
) as 'Cols Max Length'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
(UPPER(tp.name) in ('TEXT','NTEXT','TIMESTAMP','HIERARCHYID','SQL_VARIANT','XML','GEOGRAPHY','GEOMETRY') OR
(UPPER(tp.name) in ('VARCHAR','NVARCHAR') and (col.max_length = 8000 or col.max_length = -1))
)
) as 'Unsupported Columns'
, (select count(*)
from sys.objects
where type = 'PK' AND parent_object_id = t.object_id ) as 'Primary Key'
, (select count(*)
from sys.objects
where type = 'F' AND parent_object_id = t.object_id ) as 'Foreign Keys'
, (select count(*)
from sys.objects
where type in ('UQ','D','C') AND parent_object_id = t.object_id ) as 'Constraints'
, (select count(*)
from sys.objects
where type in ('TA','TR') AND parent_object_id = t.object_id ) as 'Triggers'
, t.is_tracked_by_cdc as 'CDC'
, t.is_memory_optimized as 'Hekaton'
, t.is_replicated as 'Replication'
, coalesce(t.filestream_data_space_id,0,1) as 'FileStream'
, t.is_filetable as 'FileTable'
from sys.tables t
inner join sys.partitions as p
ON t.object_id = p.object_id
INNER JOIN sys.allocation_units as a
ON p.partition_id = a.container_id
where p.data_compression in (0,1,2) -- None, Row, Page
group by t.object_id, t.is_tracked_by_cdc, t.is_memory_optimized, t.is_filetable, t.is_replicated, t.filestream_data_space_id
having sum(p.rows) > 1000000
order by sum(p.rows) desc
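As a rough way to read the script's output: tables with a large Row Count, zero Unsupported Columns, and no Primary Key, Foreign Keys, Constraints, Triggers, CDC, Hekaton, Replication, FileStream or FileTable flags are the ones on which SQL Server 2014 will actually let you build a Clustered Columnstore index; among those, the largest and most scan-heavy tables are the strongest candidates.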

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assuming a table with 100 rows, how can I randomly return 20 of those records from the entire 100 rows?
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two approaches were introduced:
1) using an ORDER BY DBMS_RANDOM.VALUE clause
2) using the SAMPLE([%]) clause
The first approach has the advantage of correctness: you will never fail to get a result if one actually exists. With the second approach, you may get no result even though rows satisfying the query condition exist, since information is discarded during sampling.
The second approach has the advantage of efficiency: you will get results faster and place a lighter load on your database. I was once warned by a DBA that a query of mine using the first approach was loading the database.
You can choose between the two approaches according to your needs!
For huge tables, the standard approach of sorting by dbms_random.value is not effective because you need to scan the whole table, and dbms_random.value is a fairly slow function that requires context switches. For such cases, there are three additional methods:
1: Use the SAMPLE clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
i.e. get 1% of all blocks, then sort them randomly and return just one row.
2: if you have an index/primary key on a column with a normal distribution, you can get the min and max values, generate a random value in this range, and take the first row with a value greater than or equal to that randomly generated value.
Example:
-- big table with 1 million rows, with a primary key on ID with normal distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: get a random table block, generate a rowid, and fetch the row from the table by that rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select /*+ rule */ file#, block#, objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows, I think you'd be better off selecting all of them in random order and taking the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick one random row from each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How do you select a random 50% of the records from a table?
For when you want a percentage of rows chosen randomly:
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;

Oracle SQL: How to SELECT N records for each "group" / "cluster"

I've got a table big_table with 4 million records, clustered into 40 groups through a column called "process_type_cod". The list of values that this column may assume is in a second table. Let's call it small_table.
So, we have big_table with a NOT NULL FK called process_type_cod that points to small_table (assume the column name is the same in both tables).
I want N records (e.g. 10) from big_table for each record of small_table.
I.e.
10 records from big_table related to the first record of small_table
UNION
10 different records from big_table related to the second record of small_table, and so on.
Is it possible to obtain this with a single SQL query?
I recommend an analytical function such as rank() or row_number(). You could do this with hard-coded unions, but the analytical function does all the hard work for you.
select *
from
(
select
bt.col_a,
bt.col_b,
bt.process_type_cod,
row_number() over ( partition by process_type_cod order by col_a nulls last ) rank
from small_table st
inner join big_table bt
on st.process_type_cod = bt.process_type_cod
)
where rank < 11
;
You may not even need that join since big_table has all of the types you care about. In that case, just change the 'from clause' to use big_table and drop the join.
What this does is run the query and then number the rows within each partition using the 'order by' in the analytic clause. For a given group (here we partitioned by process_type_cod), a numerical row number (i.e. 1, 2, 3, 4, 5, n+1...) is assigned to each record consecutively. In the outer where clause, just filter for the records with a row number of at most N (here, rank < 11 keeps 10 rows per group).
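For reference, here is a minimal sketch of the no-join variant suggested above, assuming every process_type_cod in big_table is of interest:
select *
from
(
select
bt.col_a,
bt.col_b,
bt.process_type_cod,
row_number() over ( partition by process_type_cod order by col_a nulls last ) rank
from big_table bt
)
where rank < 11
;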

What does PARTITION BY 1 mean?

For a pair of cursors where the total number of rows in the result set is required immediately after the first FETCH, I came up (after some trial and error) with the query below
SELECT
col_a,
col_b,
col_c,
COUNT(*) OVER( PARTITION BY 1 ) AS rows_in_result
FROM
myTable JOIN theirTable ON
myTable.col_a = theirTable.col_z
GROUP BY
col_a, col_b, col_c
ORDER BY
col_b
Now when the output of the query is X rows, rows_in_result reflects this accurately.
What does PARTITION BY 1 mean?
I think it probably tells the database to partition the results into pieces of 1-row each
It is an unusual use of PARTITION BY. What it does is put everything into the same partition so that if the query returns 123 rows altogether, then the value of rows_in_result on each row will be 123 (as its alias implies).
It is therefore equivalent to the more concise:
COUNT(*) OVER ()
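A quick way to see the equivalence on databases that treat the 1 as a constant (t is a hypothetical table):
SELECT col_a,
COUNT(*) OVER ()               AS total_rows,
COUNT(*) OVER (PARTITION BY 1) AS total_rows_p1
FROM t;
Both columns carry the same value, the total number of rows, on every row.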
Databases are quite free to add restrictions to the OVER() clause. Sometimes PARTITION BY [...] and/or ORDER BY [...] are mandatory clauses, depending on the aggregate function. PARTITION BY 1 may just be a dummy clause used to satisfy the syntax. The following two are usually equivalent:
[aggregate function] OVER ()
[aggregate function] OVER (PARTITION BY 1)
Note, though, that Sybase SQL Anywhere and CUBRID interpret this 1 as a column index reference, similar to what is possible in the ORDER BY [...] clause. This might appear a bit surprising, as it imposes an evaluation order on the query's projection. In your case, this would then mean that the following are equivalent:
COUNT(*) OVER (PARTITION BY 1)
COUNT(*) OVER (PARTITION BY col_a)
This curious deviation from other databases' interpretation allows for referencing more complex grouping expressions.
