Hive queries are constantly failing. How to optimally join very big tables? - hadoop

I'm trying to (inner) join two tables on HDFS partitioned by 'day' (date) for multiple days (say 2 weeks). Both tables have 100s of columns, but I'm only trying to query 10s of them. Each day has more than a billion rows.
My HIVE query looks like following.
INSERT OVERWRITE TABLE join1 partition (day)
SELECT a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from (
select x1, x2, x3... xn
from table1
where day between day1 and day2
) a
join (
select x1, y1, y2,... ym, day
from table2 where day between day1 and day2
) b
on a.x1=b.x1;
First problem- it takes a real long time (12+ hours) to do this join even for smaller period (1-7 days).
Second problem- it fails every time I try to do it for more than 10 days or so. It uses around 504 mappers and 250 reducers which is the default (I've also tried with 500 reducers).
I know this error is not real (What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask), but even the real error wasn't very useful (sorry I can't get it now).
What could be the reason for this crashing? Can anyone suggest a better way to join such huge tables?

This is too long for a comment.
Some databases have problems when optimizing subqueries. I could imagine that this is a problem with Hive. So, I would recommend:
select a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from table1 a join
table2 b
on a.x1 = b.x1
where a.day between a.day1 and a.day2 and
b.day between b.day1 and b.day2;
I also wonder if you want a condition a.day = b.day in the on clause. Using the existing partitioning key in the join should help performance.

About the error:
Since you are using dynamic partitioning on join1, did you set correctly the max number of partition which can be created?
About the speed:
Are yours table1 and table2 defined like this ?
CREATE table1 (
x1 string,
x2 string,
:
) PARTITIONED BY ( day int )
CLUSTERED BY ( 'x1' )
SORTED BY ( x1 ) INTO 400 BUCKETS;
This table is partitioned by day, so that accessing any day requires only access the corresponding partition and not the whole file. This will speed up your inner queries.
It uses also bucketing, so when you are doing joins on x1, all the rows with the same x1 values are sticked together in the same place, this will speed up your join, don't mind a such great delta. Only if join is made at Map stage ( thanks to bucketing ) the difference is visible.

Related

How to Improve Cross Join Performance in Hive TEZ?

I have a hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded 52 records.
For achieving this I am doing a cross join like
select *
from table1 join table 2
ON 1 = 1;
This is taking 5 hours to run with the highest possible memory parameters.
Is there any other short or easier way to achieve this in less time ?
Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The table is small (52 records) and should fit into memory. Map-join operator will load small table into the distributed cache and each reducer container will use it to process data in memory, much faster than common-join.
Your query is slow because a cross-join(Cartesian product) is processed by ONE single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner-join, so as to utilize map-side join optimization.
with t1 as (
selct col1, col2,..., 0 as k from table1
)
,t2 as (
selct col3, col4,..., 0 as k from table2
)
selct
*
from t1 join t2
on t1.k = t2.k
Now each table (CTE) has a fake column called k with identical value 0. So it works just like a cross-join while only a map-side join operation takes place.

Oracle tuning for query with query annidate

i am trying to better a query. I have a dataset of ticket opened. Every ticket has different rows, every row rappresent an update of the ticket. There is a field (dt_update) that differs it every row.
I have this indexs in the st_remedy_full_light.
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Now, the query is performed in 8 second. Is high for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan
How i could to better this query?
P.S. This is just a part of a big query
Additional information:
- The table st_remedy_full_light contain 529.507 rows
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much easier to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) group by, no self joins.
Don't bother about indexed access, as you'll probably find a full table scan is most appropriate here. Unless the table is really wide and a composite index on all columns used (remedy_inc_id,dt_update,assignment) would be significantly quicker to read than the table.

How to define if table is a good candidate for a clustered columnstore index?

I have read (here,here and here) about clustered columnstore indexes introduced in SQL Server 2014. Basically, now:
Column store indexes can be updatable
Table schema can be modified (without drop column store indexes)
Structure of the base table can be columnar
Space saved by compression effects (with a column store index, you
can save between 40 to 50 percent of initial space used for the
table)
In addition, they support:
Row mode and Batch mode processing
BULK INSERT statement
More data types
AS I have understood there are some restrictions, like:
Unsupported data types
Other indexes cannot be created
But as it is said:
With a clustered column store index, all filter possibilities are
already covered; Query Processor, using Segment Elimination, will be
able to consider only the segments required by the query clauses. On
the columns where it cannot apply the Segment Elimination, all scans
will be faster than B-Tree index scans because data are compressed so
less I/O operations will be required.
I am interested in the following:
Does the statement above say that a clustered column store index is always better for extracting data than a B-Tree index when a lot of duplicated values exist?
What about the performance between clustered column store index and non-clustered B-Tree covering index, when the table has many columns for example?
Can I have a combination of clustered and non-clustered columnstores indexes on one table?
And most importantly, can anyone tell how to determine whether a table is a good candidate for a columned stored index?
It is said that the best candidates are tables for which update/delete/insert operations are not performed often. For example, I have a table with storage size above 17 GB (about 70 millions rows) and new records are inserted and deleted constantly. On the other hand, a lot of queries using its columns are performed. Or I have a table with storage size about 40 GB (about 60 millions rows) with many inserts performed each day - it is not queried often but I want to reduce its size.
I know the answer is mostly in running production tests but before that I need to pick the better candidates.
One of the most important restrictions for Clustered Columnstore is their locking, you can find some details over here: http://www.nikoport.com/2013/07/07/clustered-columnstore-indexes-part-8-locking/
Regarding your questions:
1) Does the statement above say that a clustered column store index is always better for extracting data then a B-Tree index when a lot of duplicated values exist
Not only duplicates are faster scanned by Batch Mode, but for data reading the mechanisms for Columnstore Indexes are more effective, when reading all data out of a Segment.
2) What about the performance between clustered column store index and non-clustered B-Tree covering index, when the table has many columns for example
Columnstore Index has a significantly better compression than Page or Row, available for the Row Store, Batch Mode shall make the biggest difference on the processing side and as already mentioned even reading of the equally-sized pages & extents should be faster for Columnstore Indexes
3) Can I have a combination of clustered and non clustered columnstores indexes on one table
No, at the moment this is impossible.
4) ... can anyone tell how to define if a table is a good candidate for a columned stored index?
Any table which you are scanning & processing in big amounts (over 1 million rows), or maybe even whole table with over 100K scanned entirely might be a candidate to consider.
There are some restrictions on the used technologies related to the table where you want to build Clustered Columnstore indexes, here is a query that I am using:
select object_schema_name( t.object_id ) as 'Schema'
, object_name (t.object_id) as 'Table'
, sum(p.rows) as 'Row Count'
, cast( sum(a.total_pages) * 8.0 / 1024. / 1024
as decimal(16,3)) as 'size in GB'
, (select count(*) from sys.columns as col
where t.object_id = col.object_id ) as 'Cols Count'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
UPPER(tp.name) in ('VARCHAR','NVARCHAR')
) as 'String Columns'
, (select sum(col.max_length)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id
) as 'Cols Max Length'
, (select count(*)
from sys.columns as col
join sys.types as tp
on col.system_type_id = tp.system_type_id
where t.object_id = col.object_id and
(UPPER(tp.name) in ('TEXT','NTEXT','TIMESTAMP','HIERARCHYID','SQL_VARIANT','XML','GEOGRAPHY','GEOMETRY') OR
(UPPER(tp.name) in ('VARCHAR','NVARCHAR') and (col.max_length = 8000 or col.max_length = -1))
)
) as 'Unsupported Columns'
, (select count(*)
from sys.objects
where type = 'PK' AND parent_object_id = t.object_id ) as 'Primary Key'
, (select count(*)
from sys.objects
where type = 'F' AND parent_object_id = t.object_id ) as 'Foreign Keys'
, (select count(*)
from sys.objects
where type in ('UQ','D','C') AND parent_object_id = t.object_id ) as 'Constraints'
, (select count(*)
from sys.objects
where type in ('TA','TR') AND parent_object_id = t.object_id ) as 'Triggers'
, t.is_tracked_by_cdc as 'CDC'
, t.is_memory_optimized as 'Hekaton'
, t.is_replicated as 'Replication'
, coalesce(t.filestream_data_space_id,0,1) as 'FileStream'
, t.is_filetable as 'FileTable'
from sys.tables t
inner join sys.partitions as p
ON t.object_id = p.object_id
INNER JOIN sys.allocation_units as a
ON p.partition_id = a.container_id
where p.data_compression in (0,1,2) -- None, Row, Page
group by t.object_id, t.is_tracked_by_cdc, t.is_memory_optimized, t.is_filetable, t.is_replicated, t.filestream_data_space_id
having sum(p.rows) > 1000000
order by sum(p.rows) desc

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQLTuning Advisor says that there's an expensive Cartesian product, which makes sense given that the query currently could generate up to 8500 x 18000 records. Here's the query in its stripped down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, disallowing weekends and federal holidays, so that I can sort on the elapsed time. Also, the pagination for the table is done server-side, so we can't just load into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on Closeddate - Origdate + 1 <= n --you could even create a function-based index
order by t.id, t.origdate;

How to otimize select from several tables with millions of rows

Have the following tables (Oracle 10g):
catalog (
id NUMBER PRIMARY KEY,
name VARCHAR2(255),
owner NUMBER,
root NUMBER REFERENCES catalog(id)
...
)
university (
id NUMBER PRIMARY KEY,
...
)
securitygroup (
id NUMBER PRIMARY KEY
...
)
catalog_securitygroup (
catalog REFERENCES catalog(id),
securitygroup REFERENCES securitygroup(id)
)
catalog_university (
catalog REFERENCES catalog(id),
university REFERENCES university(id)
)
Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.
I need to select any 50 rows from catalog with specified root ordered by name for current university and current securitygroup. There is a query:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Where 100 - some catalog, 200 - some securitygroup, 300 - some university. This query return 50 rows from ~ 170 000 in 3 minutes.
But next query return this rows in 2 sec:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
I build next indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).
Plan for first query (using PLSQL Developer):
http://habreffect.ru/66c/f25faa5f8/plan2.jpg
Plan for second query:
http://habreffect.ru/f91/86e780cc7/plan1.jpg
What are the ways to optimize the query I have?
The indexes that can be useful and should be considered deal with
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
So the following fields can be interesting for indexes
c: id, root
cs: catalog, securitygroup
cu: catalog, university
So, try creating
(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)
and
(catalog_university.catalog, catalog_university.university)
EDIT:
I missed the ORDER BY - these fields should also be considered, so
(catalog.name, catalog.id)
might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))
EDIT2
Although another question is accepted I'll provide some more food for thought.
I have created some test data and run some benchmarks.
The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:
test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
?column? | ?column? | ?column?
----------+----------+----------
500000 | 1497501 | 500000
(1 row)
Database is postgres 8.4, default ubuntu install, hardware i5, 4GRAM
First I rewrote the query to
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50
AND cs.catalog = c.id
AND cu.catalog = c.id
AND cs.securitygroup < 200
AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100
note: the conditions are turned into less then to maintain comparable number of intermediate rows (the above query would return 198,801 rows without the LIMIT clause)
If run as above, without any extra indexes (save for PKs and foreign keys) it runs in 556 ms on a cold database (this is actually indication that I oversimplified the sample data somehow - I would be happier if I had 2-4s here without resorting to less then operators)
This bring me to my point - any straight query that only joins and filters (certain number of tables) and returns only a certain number of the records should run under 1s on any decent database without need to use cursors or to denormalize data (one of these days I'll have to write a post on that).
Furthermore, if a query is returning only 50 rows and does simple equality joins and restrictive equality conditions it should run even much faster.
Now let's see if I add some indexes, the biggest potential in queries like this is usually the sort order, so let me try that:
CREATE INDEX test1 ON catalog (name, id);
This makes execution time on the query - 22ms on a cold database.
And that's the point - if you are trying to get only a page of data, you should only get a page of data and execution times of queries such as this on normalized data with proper indexes should take less then 100ms on decent hardware.
I hope I didn't oversimplify the case to the point of no comparison (as I stated before some simplification is present as I don't know the cardinality of relationships between catalog and the many-to-many tables).
So, the conclusion is
if I were you I would not stop tweaking indexes (and the SQL) until I get the performance of the query to go below 200ms as rule of the thumb.
only if I would find an objective explanation why it can't go below such value I would resort to denormalisation and/or cursors, etc...
First I assume that your University and SecurityGroup tables are rather small. You posted the size of the large tables but it's really the other sizes that are part of the problem
Your problem is from the fact that you can't join the smallest tables first. Your join order should be from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one or the other, to a big table, to another big table and then with that large intermediate result you have to go to a small table.
If you always have current_univ and current_secgrp and root as inputs you want to use them to filter as soon as possible. The only way to do that is to change your schema some. In fact, you can leave the existing tables in place if you have to but you'll be adding to the space with this suggestion.
You've normalized the data very well. That's great for speed of update... not so great for querying. We denormalize to speed querying (that's the whole reason for datawarehouses (ok that and history)). Build a single mapping table with the following columns.
Univ_id, SecGrp_ID, Root, catalog_id. Make it an index organized table of the first 3 columns as pk.
Now when you query that index with all three PK values, you'll finish that index scan with a complete list of allowable catalog Id, now it's just a single join to the cat table to get the cat item details and you're off an running.
The Oracle cost-based optimizer makes use of all the information that it has to decide what the best access paths are for the data and what the least costly methods are for getting that data. So below are some random points related to your question.
The first three tables that you've listed all have primary keys. Do the other tables (catalog_university and catalog_securitygroup) also have primary keys on them?? A primary key defines a column or set of columns that are non-null and unique and are very important in a relational database.
Oracle generally enforces a primary key by generating a unique index on the given columns. The Oracle optimizer is more likely to make use of a unique index if it available as it is more likely to be more selective.
If possible an index that contains unique values should be defined as unique (CREATE UNIQUE INDEX...) and this will provide the optimizer with more information.
The additional indexes that you have provided are no more selective than the existing indexes. For example, the index on (catalog.id, catalog.name, catalog.owner) is unique but is less useful than the existing primary key index on (catalog.id). If a query is written to select on the catalog.name column, it is possible to do and index skip scan but this starts being costly (and most not even be possible in this case).
Since you are trying to select based in the catalog.root column, it might be worth adding an index on that column. This would mean that it could quickly find the relevant rows from the catalog table. The timing for the second query could be a bit misleading. It might be taking 2 seconds to find 50 matching rows from catalog, but these could easily be the first 50 rows from the catalog table..... finding 50 that match all your conditions might take longer, and not just because you need to join to other tables to get them. I would always use create table as select without restricting on rownum when trying to performance tune. With a complex query I would generally care about how long it take to get all the rows back... and a simple select with rownum can be misleading
Everything about Oracle performance tuning is about providing the optimizer enough information and the right tools (indexes, constraints, etc) to do its job properly. For this reason it's important to get optimizer statistics using something like DBMS_STATS.GATHER_TABLE_STATS(). Indexes should have stats gathered automatically in Oracle 10g or later.
Somehow this grew into quite a long answer about the Oracle optimizer. Hopefully some of it answers your question. Here is a summary of what is said above:
Give the optimizer as much information as possible, e.g if index is unique then declare it as such.
Add indexes on your access paths
Find the correct times for queries without limiting by rowwnum. It will always be quicker to find the first 50 M&Ms in a jar than finding the first 50 red M&Ms
Gather optimizer stats
Add unique/primary keys on all tables where they exist.
The use of rownum is wrong and causes all the rows to be processed. It will process all the rows, assigned them all a row number, and then find those between 0 and 50. When you want to look for in the explain plan is COUNT STOPKEY rather than just count
The query below should be an improvement as it will only get the first 50 rows... but there is still the issue of the joins to look at too:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
where rownum <= 50
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Also, assuming this for a web page or something similar, maybe there is a better way to handle this than just running the query again to get the data for the next page.
try to declare a cursor. I dont know oracle, but in SqlServer would look like this:
declare #result
table (
id numeric,
name varchar(255)
);
declare __dyn_select_cursor cursor LOCAL SCROLL DYNAMIC for
--Select
select distinct
c.id, c.name
From [catalog] c
inner join university u
on u.catalog = c.id
and u.university = 300
inner join catalog_securitygroup s
on s.catalog = c.id
and s.securitygroup = 200
Where
c.root = 100
Order by name
--Cursor
declare #id numeric;
declare #name varchar(255);
open __dyn_select_cursor;
fetch relative 1 from __dyn_select_cursor into #id,#name declare #maxrowscount int
set #maxrowscount = 50
while (##fetch_status = 0 and #maxrowscount <> 0)
begin
insert into #result values (#id, #name);
set #maxrowscount = #maxrowscount - 1;
fetch next from __dyn_select_cursor into #id, #name;
end
close __dyn_select_cursor;
deallocate __dyn_select_cursor;
--Select temp, final result
select
id,
name
from #result;

Resources