Hive SELECT col, COUNT(*) mismatch - hadoop

Let me start by saying that I am very new to Hive, so I'm not sure what information folks will need to help me out. Please let me know what would be useful. Also, while I'd usually create a small dataset to reproduce the problem, I think this problem has to do with the scale of my dataset, because I can't reproduce it on a smaller one. Let me know if you have suggestions to make this easier to answer.
Okay, now that that's out of the way, here's my problem. I have a huge dataset, partitioned by month, with about 500 million rows per month. I have a column with an ID number in it (I'll call it idcol), and I want to closely examine a couple of examples: some where an ID is repeated many times and some where it is repeated very few times. So, I used this:
SELECT idcol, COUNT(*) FROM table WHERE month = 7 GROUP BY idcol LIMIT 10;
And got:
000005185884381 13
000035323848000 24
000017027256315 531
000010121767109 54
000039844553332 3
000013731352481 309
000024387407996 3
000028461234451 67
000016564844672 1
000032933040806 17
So, I went to investigate the first idcol value with a count of 3, with:
SELECT * FROM table WHERE month = 7 AND idcol = '000039844553332';
I expected to see just 3 rows, but ended up with 469 rows found! That was strange enough, but then I just happened to run the original line of code above but with LIMIT 5 instead and ended up with:
000005185884381 13
000017027256315 75
000010121767109 25
000013731352481 59
000024387407996 1
And, it may be hard to see because the idcol values are so long, but 000017027256315 ended up with a count of 531 when I did LIMIT 10 and just 75 when I did LIMIT 5.
What am I missing?! How can I get a correct count of just a small number of values so I can investigate further?!
BTW my first thought was to make the counting part a sub-query, but that didn't change a thing. I used:
SELECT * FROM (SELECT idcol, COUNT(*) FROM table WHERE month = 7 GROUP BY idcol) x LIMIT 10;
...same EXACT results

Most likely the counts are being computed from statistics. See here for the bug and the related discussion. Try turning that off first:
SET hive.compute.query.using.stats=false;
If this doesn't fix it, try the ANALYZE command before running the count(*):
ANALYZE TABLE table_name PARTITION(month) COMPUTE STATISTICS;

Related

Time range query algorithm

I have a table with ID, and start & end time in milliseconds
ID  Start  End
1   0      15
2   17     23
3   23     30
4   35     45
and so on.
I have a query to find records within the range 18 to 28. The query should select rows whose time range overlaps the query range. For the above query, records 2 and 3 are valid.
My approach would be
select * from table where
start between 18 and 28 or -- record 3 is selected
end between 18 and 28; -- record 2 is selected
That's good enough.
Then I have another case: find records within the range 5 to 10.
The above query won't return anything, so I add an extra condition.
select * from table where
start between 5 and 10 or
end between 5 and 10 or
(start < 5 and end > 10); -- record 1 is selected
My question: is my approach correct, or is there a well-known algorithm that takes care of this problem?
I am pretty sure there are other questions of a similar nature, but I couldn't think of the right keywords to find them.
Thanks.
To check whether one range intersects another, a single predicate is enough. For the query range 5 to 10:
start < 10 AND end > 5
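As a quick check of that predicate against the sample table from the question, here is a minimal Python sketch. The function and variable names are only illustrative. Note that the strict inequalities mean two ranges that merely touch at an endpoint are not reported; switch to <= and >= if touching ranges should count as overlapping.

```python
def overlaps(start, end, q_start, q_end):
    """Same test as `start < q_end AND end > q_start` in SQL."""
    return start < q_end and end > q_start


# Sample rows from the question: id -> (start, end)
records = {1: (0, 15), 2: (17, 23), 3: (23, 30), 4: (35, 45)}


def matching_ids(q_start, q_end):
    """Ids of all records whose range overlaps the query range."""
    return [rid for rid, (s, e) in records.items()
            if overlaps(s, e, q_start, q_end)]


print(matching_ids(18, 28))  # [2, 3]
print(matching_ids(5, 10))   # [1]
```

Both cases from the question are covered by the one condition, including the "record fully contains the query range" case that needed the extra clause.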

Simulate pipelined order by in oracle 11g

I have been working on an application that is integrated with Spring and Hibernate 4.X.X, with transactions managed by JTA in a WebLogic application server. After 3 years, one table out of the 100 in my DB holds about 40 million records. The DB is Oracle 11g. Because of the growth of this table, the response time of one query has risen to about 5 minutes.
I customized the query, put it into SQL Developer, and ran the query advisor to get index suggestions. After applying them, the response time dropped to 2 minutes. Even so, this response time does not satisfy the customer. To clarify, here is the query:
select *
from (select (count(storehouse0_.ID) over()) as col_0_0_,
storehouse3_.storeHouse_ID as col_1_0_,
(DBPK_PUB_STOREHOUSE.get_Storehouse_Title(storehouse5_.id, 1)) as col_2_0_,
storehouse5_.Organization_Code as col_3_0_,
publicgood1_.Goods_Item_Id as col_4_0_,
storehouse0_.storeHouse_Inventory_Id as col_5_0_,
storehouse0_.Id as col_6_0_,
storehouse3_.samapel_Item_Id as col_7_0_,
samapelite10_.MAINNAME as col_8_0_,
publicgood1_.serial_Number as col_9_0_,
publicgood1_1_.production_Year as col_10_0_,
samapelpar2_.ID_SourceInfo as col_11_0_,
samapelpar2_.Pn as col_12_0_,
storehouse3_.expire_Date as col_13_0_,
publicgood1_1_.Status_Id as col_14_0_,
baseinform12_.Topic as col_15_0_,
publicgood1_.public_Num as col_16_0_,
cast(publicgood1_1_.goods_Status as number(10, 0)) as col_17_0_,
publicgood1_1_.goods_Status as col_18_0_,
publicgood1_1_.deleted as col_19_0_
from amd.Core_StoreHouse_Inventory_Item storehouse0_,
amd.Core_STOREHOUSE_INVENTORY storehouse3_,
amd.Core_STOREHOUSE storehouse5_,
amd.SMP_SAMAPEL_CODE samapelite10_
cross join amd.Core_Goods_Item_Public publicgood1_
inner join amd.Core_Goods_Item publicgood1_1_
on publicgood1_.Goods_Item_Id = publicgood1_1_.Id
left outer join amd.SMP_SOURCEINFO samapelpar2_
on publicgood1_1_.Samapel_Part_Number_Id =
samapelpar2_.ID_SourceInfo, amd.App_BaseInformation
baseinform12_
where not exists
(select ssec.samapelITem_id
from core_security_samapelitem ssec
inner join core_goods_item g
on ssec.samapelitem_id = g.samapel_item_id
where not exists (SELECT aa.groupid
FROM app_actiongroup aa
where aa.groupid in
(select au.groupid
from app_usergroup au
where au.userid = 1)
and aa.actionid = 9054)
and ssec.isenable = 1
and storehouse0_.goods_Item_ID = g.id)
and not exists
(select *
from CORE_POWER_SECURITY cps
where not exists (SELECT aa.groupid
FROM app_actiongroup aa
where aa.groupid in
(select au.groupid
from app_usergroup au
where au.userid = 1)
and aa.actionid = 9055)
and cps.inventory_id =
storehouse0_.storeHouse_Inventory_Id
and cps.goodsitemtype = 6)
and storehouse0_.storeHouse_Inventory_Id = storehouse3_.Id
and storehouse3_.storeHouse_ID = storehouse5_.Id
and storehouse3_.samapel_Item_Id = samapelite10_.MAINCODE
and publicgood1_1_.Status_Id = baseinform12_.ID
and 1 <> 2
and storehouse0_.goods_Item_ID = publicgood1_.Goods_Item_Id
and publicgood1_1_.edited = 0
and publicgood1_1_.deleted = 0
and (exists (select storehouse13_.Id
from amd.Core_STOREHOUSE storehouse13_
cross join amd.core_power power16_
cross join amd.core_power power17_
where storehouse5_.powerID = power16_.Id
and storehouse13_.powerID = power17_.Id
and (storehouse13_.Id in (741684217))
and storehouse13_.storeHouseType = 2
and (power16_.hierarchiCode like
power17_.hierarchiCode || '%')) or
(storehouse3_.storeHouse_ID in (741684217)) and
storehouse5_.storeHouseType = 1)
and (storehouse5_.storeHouse_Status not in (2, 3))
order by storehouse3_.samapel_Item_Id)
where rownum <= 10
[Note: This query is generated by Hibernate].
It is clear that ordering 40 million rows takes a lot of time.
I found the main issue with this query. When I omitted the “order by” and ran the query, the response time dropped to about 5 seconds. I wondered why the “order by” affects the response time so much.
(Somebody may think that partitioning this table, or using another Oracle facility, would give a better response time. That may be right, but my focus here is the “order by” performance.) In any case, I cannot omit the “order by”, because the customer needs the ordering and it is necessary for paging. I found a solution in which I order only the records that are needed. When Oracle has to sort 40 million records, it naturally takes a long time, so I replaced the “order by” over the whole table with a “where” clause that narrows the rows first. With this replacement the response time dropped from 2 minutes to about 5 seconds, which is very exciting for me.
I explain my solution with an example. Could anybody who reads this post tell me whether this solution is good, whether I should use it, or whether there is another solution that I do not know about?
Here is my solution.
Let’s assume that there are two tables, as below:
Post table
Id Others fields
1
2
3
4
5
… …
Post_comment table
Id post_id
1 5
2 5
3 5
4 5
6 5
7 2
8 2
9 2
10 3
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 4
27 4
There is a form that shows the result of join between POST table and POST_COMMENT table.
I show both queries: one with an “order by” over all records of the table, and one with an “order by” over only the records that are needed. The results of the two queries are exactly the same, but the response time of the second approach is better.
Assume that the page size is 10 and you are on page 3.
The first query, with the “order by” over all records of the table:
select *
from (Select res.*, rownum as rownum_
from (Select * from POST_COMMENT Order by post_id asc) res
Where rownum <= 30)
where rownum_ > 20
The second solution:
Before executing the main query, I run the query below:
select *
from (select post_id, count(id) from POST_COMMENT group by post_id)
order by post_id asc
Its result is the following:
Post_id Count(id) Sum(count(id))
1 15 15
2 3 18
3 1 19
4 2 21
5 5 26
Note that the third column, "Sum(count(id))", is calculated after that query. Each entry of this column is the running total of Count(id) up to and including that row.
A formula then determines which post_id values must be selected:
pageSize = 10, pageNumber = 3
from: (pageNumber - 1) * pageSize = 2 * 10 = 20
to: (pageNumber - 1) * pageSize + pageSize = 20 + 10 = 30
So I need the posts whose rows fall in the range (20, 30] of Sum(count(id)). That gives only two post_id values, 4 and 5. Accordingly, the main query of the second approach is:
select *
from (select rownum as rownum_, res.*
from (select *
from (select * from POST_COMMENT where post_id in (4, 5))
order by post_id asc) res
where rownum <= 30)
where rownum_ > 20
If you compare the two queries, you will see the main difference: the second query selects only the records of POST_COMMENT whose post_id is 4 or 5, and then orders just those records rather than all records of the table.
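The group selection described above can be sketched in Python as follows. This is only an illustration under the post's assumptions (per-post counts already fetched and sorted by post_id); the function name and return shape are hypothetical. One detail worth checking in the filtered query: once the rows are restricted to post_id 4 and 5, row numbers start again from 1, so the page window has to be shifted by the running total of the groups that were skipped (19 in this example) rather than reusing 20 and 30.

```python
from itertools import accumulate


def groups_for_page(counts, page_size, page_number):
    """counts: list of (post_id, row_count), sorted by post_id.

    Returns (post_ids, offset): the groups that overlap the requested page,
    and how many rows to skip inside the filtered, ordered result.
    """
    lo = (page_number - 1) * page_size   # rows before the page
    hi = lo + page_size                  # last row of the page
    running = accumulate(c for _, c in counts)
    selected, skipped_before = [], 0
    for (post_id, count), total in zip(counts, running):
        group_start = total - count      # rows before this group
        if total <= lo:
            skipped_before = total       # group lies entirely before the page
        elif group_start < hi:
            selected.append(post_id)     # group overlaps the page
    return selected, lo - skipped_before


counts = [(1, 15), (2, 3), (3, 1), (4, 2), (5, 5)]
print(groups_for_page(counts, 10, 3))  # ([4, 5], 1)
```

For page 3 this selects post_id 4 and 5 and says to skip 1 row of the filtered result, then take up to 10 rows.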
After posting this, I searched further and was finally redirected to HERE. I reached a response time that is very exciting for me: it dropped from 3 minutes to less than 3 seconds. Note that I used only one tip from all of the query optimization guidelines on that site: "Duplicate constant condition for different tables whenever possible".
Note: Before applying this tip, there were already some indexes on the fields used in the where clause and the order by.

Split amount into multiple rows if amount>=$10M or <=$-10B

I have a table in an Oracle database which may contain amounts >= $10M or <= $-10B.
If the value is greater than or equal to $10M, I need to break it into one or more 99999999.99 chunks and also include the remainder.
If the value is less than or equal to $-10B, I need to break it into one or more 999999999.99 chunks and also include the remainder.
Your question is somewhat unreadable, and you did not provide examples, but here is something to start with, which may help you or someone with a similar problem.
Let's say you have this data and you want to divide amounts into chunks not greater than 999:
id amount
-- ------
1 1500
2 800
3 2500
This query:
select id, amount,
case when level=floor(amount/999)+1 then mod(amount, 999) else 999 end chunk
from data
connect by level<=floor(amount/999)+1
and prior id = id and prior dbms_random.value is not null
...divides amounts, last row contains remainder. Output is:
ID AMOUNT CHUNK
------ ---------- ----------
1 1500 999
1 1500 501
2 800 800
3 2500 999
3 2500 999
3 2500 502
SQLFiddle demo
Edit: full query according to additional explanations:
select id, amount,
case
when amount>=0 and level=floor(amount/9999999.99)+1 then mod(amount, 9999999.99)
when amount>=0 then 9999999.99
when level=floor(-amount/999999999.99)+1 then -mod(-amount, 999999999.99)
else -999999999.99
end chunk
from data
connect by ((amount>=0 and level<=floor(amount/9999999.99)+1)
or (amount<0 and level<=floor(-amount/999999999.99)+1))
and prior id = id and prior dbms_random.value is not null
SQLFiddle
Please adjust numbers for positive and negative borders (9999999.99 and 999999999.99) according to your needs.
There are more possible solutions (recursive CTE query, PLSQL procedure, maybe others), this hierarchical query is one of them.
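For comparison, the same splitting logic can be sketched outside the database. This is a minimal Python illustration, not a replacement for the query; the function name is hypothetical, and Decimal is used so that cent values stay exact. Unlike the hierarchical query, which emits a zero remainder row when the amount is an exact multiple of the limit, this sketch stops at the last full chunk.

```python
from decimal import Decimal


def split_amount(amount, limit):
    """Split amount into chunks of at most `limit` in absolute value.

    Full chunks come first and the remainder comes last; negative amounts
    yield negative chunks.
    """
    amount, limit = Decimal(str(amount)), Decimal(str(limit))
    sign = -1 if amount < 0 else 1
    remaining = abs(amount)
    chunks = []
    while remaining > limit:
        chunks.append(sign * limit)
        remaining -= limit
    chunks.append(sign * remaining)      # remainder (or the only chunk)
    return chunks


for amount in (1500, 800, 2500):
    print(amount, split_amount(amount, 999))
```

With the sample data this gives 999 and 501 for 1500, a single 800 for 800, and 999, 999 and 502 for 2500, matching the query output above.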

How to know if a record DOESN'T exists on a table in Oracle

I've been dealing with this for a couple of hours and I can't find a way to get the answer.
I have a table with a maximum of 4 records per product (let's call it that) for a given period (a column holding a number). I'm trying to return the products that DO NOT have a particular CONSUMPTION_TYPE_ID, but it doesn't work.
I'll explain it simply. I have a table with these fields (there are more, but these are enough):
product_id - CONSUMPTION_TYPE_ID - consumption_period
123 103 1
123 104 1
123 107 1
123 108 1
I need to return the ones that don't have one particular type of consumption. Let's say that type 107 is missing (the row doesn't exist); the select query should show the other 3, or whichever are present. I don't mind doing the same select 4 times; I could also use a cursor and a loop to check every one. The point is that a query with "not in" or "not exists" doesn't work. It gives me a result like the one below, but when I query by "consumption_period" it shows me the supposedly missing CONSUMPTION_TYPE_ID, because the "not in" clause is only hiding the rows.
This is what I need:
select * from t1 where CONSUMPTION_TYPE_ID != 108;
product_id - CONSUMPTION_TYPE_ID - consumption_period
123 103 1
123 104 1
123 107 1
I hope you can help me with this. I'm stuck; it may be simple, but I'm having one of those days. Thanks in advance for any help.
You probably should've posted that NOT EXISTS query that doesn't work, because that is the right way to do this.
If I got your requirements right: all products that do not have a record for a specific consumption_type_id.
SELECT DISTINCT product_id
FROM t1 t
WHERE NOT EXISTS
(SELECT 1 FROM t1
WHERE t.product_id = product_id
AND Consumption_Type_ID = ?)
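To see the NOT EXISTS query above in action, here is a small sketch using Python's built-in sqlite3 module in place of Oracle. Product 456 is a hypothetical extra row set added so that one product really lacks type 107, and the helper name is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (product_id INTEGER, consumption_type_id INTEGER, "
    "consumption_period INTEGER)"
)
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        (123, 103, 1), (123, 104, 1), (123, 107, 1), (123, 108, 1),
        # Hypothetical second product with no row for type 107.
        (456, 103, 1), (456, 104, 1), (456, 108, 1),
    ],
)


def products_missing(type_id):
    """Products that have no row at all for the given consumption type."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.product_id
        FROM t1 t
        WHERE NOT EXISTS (SELECT 1
                          FROM t1 x
                          WHERE x.product_id = t.product_id
                            AND x.consumption_type_id = ?)
        ORDER BY t.product_id
        """,
        (type_id,),
    ).fetchall()
    return [r[0] for r in rows]


print(products_missing(107))  # [456]
print(products_missing(108))  # []
```

The correlation on product_id is what makes this different from a plain `!= 107` filter, which only hides individual rows.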
The obvious answer here is to search for CONSUMPTION_TYPE_ID = 108 and have the surrounding code check for a lack of rows, rather than the existence of rows.
If you really need a row return for each consumption_type_id that's not in this table, then you should probably be selecting from the lookup table for consumption_type_id:
select *
from consumption_type ct
where not exists (select *
from t1
where t1.consumption_type_id = ct.consumption_type_id)
and ct.consumption_type_id = 108

Oracle - using multiple exists to check record availability

I have a situation in my application where I display the count of data matching different criteria. Since the performance of counting degrades as the database grows, we decided to show only availability information using the exists clause.
Below is my table structure
Table: DocInfo
---------------------------------------
DocId number
DocName varchar(250)
DocStatus number
SignedBy number
ForwardedBy number
ForwardCount number
DocOwner number
MgrID number
ProjectId number
The current query which does the counting is like this
SELECT NVL(SUM(CASE
WHEN (DocStatus IN (1150,1155,1170,1182,1190) AND
DocOwner=56366 AND
ForwardCount=0)
THEN 1
ELSE 0
END), 0) "ForReview",
NVL(SUM(CASE
WHEN (DocStatus IN (1200) And
MgrID = 56366 AND
ForwardCount = 0 )
THEN 1
ELSE 0
END), 0) "Accepted" ,
NVL(SUM(CASE
WHEN (DocStatus IN (1150,1155,1170,1182,1190) AND
DocOwner=56366 AND
MgrID = 0 )
THEN 1
ELSE 0
END), 0) "Waiting"
FROM DocInfo
WHERE ProjectId = 313 and
(DocOwner = 56366 or MgrID = 56366)
I need to change the counting to an exists clause so that I can show whether documents are available or not in each category.
Since this change is meant to improve performance, running this as separate queries is also not advisable. Please help me; I have run out of my limited knowledge.
Sorry for leaving out what I have already tried.
I have changed the above query to a union with an exists clause in each branch, like below.
SELECT 'ForReview' AS A
FROM DUAL
WHERE EXISTS (SELECT NULL
FROM DocInfo
WHERE ProjectId = 313 and
(DocOwner = 56366 or MgrID = 56366) and
(DocStatus IN (1150,1155,1170,1182,1190) AND
DocOwner=56366 AND
ForwardCount=0))
UNION
SELECT 'Accepted' AS A
FROM DUAL
WHERE EXISTS (SELECT NULL
FROM DocInfo
WHERE ProjectId = 313 and
(DocOwner = 56366 or MgrID = 56366) and
(DocStatus IN (1200) And
MgrID = 56366 AND
ForwardCount = 0 ))
UNION
SELECT 'Waiting' AS A
FROM DUAL
WHERE EXISTS (SELECT NULL
FROM DocInfo
WHERE ProjectId = 313 and
(DocOwner = 56366 or MgrID = 56366) and
(DocStatus IN (1150,1155,1170,1182,1190) AND
DocOwner=56366 AND
MgrID = 0))
I have mentioned only 3 conditions, whereas my actual application has 8 different criteria to be added into this query. so when i have 8 Exists clauses, it runs internally as 8 different queries, and in effect it takes more time - single segment in the entire union query takes only 560 ms whereas all queries together takes around 7 seconds to generate the output.
Since my requirement is only to identify the availability of any such record, I do not want to navigate through the entire record set and count it.
Is there any way to optimize or rewrite this query?
Thank you.
"so when i have 8 Exists clauses, it runs internally as 8 different
queries, and in effect it takes more time - single segment in the
entire union query takes only 560 ms whereas all queries together
takes around 7 seconds to generate the output."
Surprise, surprise. Running what amounts to the same query eight times will not be faster than running that query once.
Now it is true that EXISTS can be faster, because it only needs to find a single row which matches the given criteria, rather than retrieving an entire data set. However you have just shifted the retrieved data into the WHERE clause so the database still has to do the same amount of work. In fact, it is apparently doing a lot more work, because 7s > (560ms * 8).
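One way to return availability flags while still running a single query is to keep the shape of the original statement and replace each SUM with a MAX over the same CASE expression. The sketch below uses Python's sqlite3 module with three made-up rows, purely for illustration. It still reads every row that passes the WHERE clause, so it should not be assumed to be faster than the counting query; it only changes what is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocInfo (DocId INTEGER, DocStatus INTEGER, "
    "ForwardCount INTEGER, DocOwner INTEGER, MgrID INTEGER, ProjectId INTEGER)"
)
# Hypothetical sample rows, not taken from the question.
conn.executemany(
    "INSERT INTO DocInfo VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1150, 0, 56366, 7, 313),   # matches ForReview
        (2, 1200, 1, 9, 56366, 313),   # status 1200 but ForwardCount <> 0
        (3, 1170, 2, 56366, 0, 313),   # matches Waiting
    ],
)

row = conn.execute(
    """
    SELECT
      COALESCE(MAX(CASE WHEN DocStatus IN (1150,1155,1170,1182,1190)
                         AND DocOwner = 56366 AND ForwardCount = 0
                        THEN 1 ELSE 0 END), 0),
      COALESCE(MAX(CASE WHEN DocStatus IN (1200)
                         AND MgrID = 56366 AND ForwardCount = 0
                        THEN 1 ELSE 0 END), 0),
      COALESCE(MAX(CASE WHEN DocStatus IN (1150,1155,1170,1182,1190)
                         AND DocOwner = 56366 AND MgrID = 0
                        THEN 1 ELSE 0 END), 0)
    FROM DocInfo
    WHERE ProjectId = 313
      AND (DocOwner = 56366 OR MgrID = 56366)
    """
).fetchone()

flags = dict(zip(("ForReview", "Accepted", "Waiting"), row))
print(flags)  # {'ForReview': 1, 'Accepted': 0, 'Waiting': 1}
```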
To solve your problem properly you need to understand how the database works and how to tune it. Find out more.
For a start, define a tuning goal. Your original query takes half a second to run: that's not lightning fast but it is pretty quick. Why is this a problem? How quickly do you want it to run?
Next, run an EXPLAIN PLAN. Is the query using indexes? How efficient is its index usage? What percentage of the rows are being selected?
Now you also need to understand your data. Is the selected data evenly distributed throughout the table or are there clusters? Do some projects, owners or managers have more records than others? How does that distribution affect performance?
Please bear in mind, tuning is a science and it is complicated: there are whole books on the subject and some people make very fine livings as performance troubleshooters. It requires a lot of information about your system, both knowledge of what your application does and low-level information on which activities your database is doing. We can help you in your quest to find a more performant solution but we cannot just look at a shonky query and tell you how to re-write so it runs quicker.
