Presto for loop - performance

I am new to Presto and would like to know whether there is any way to write a for loop. I have a query that aggregates some data date by date, and when I run it, it fails with the error: exceeded max memory size of 30GB.
I can use other suggestions if looping is not an option.
The query I am using:
select dt as DATE_KPI, brand, count(distinct concat(cast(post_visid_high as varchar),
cast(post_visid_low as varchar))) as kpi_value
from hive.adobe.tbl
where dt >= date '2017-05-15' and dt <= date '2017-06-13'
group by 1,2

Assuming you are using Hive, you can write the source data to a table bucketed on brand, and then process groups of buckets with WHERE "$bucket" % 32 = <N>.
Otherwise, you can split the query into n queries and process 1/n of the brands in each one. You can use WHERE abs(from_big_endian_64(xxhash64(to_utf8(brand)))) % 32 = <N> to bucketize the brands.
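The "split into n queries" idea can be sketched as an ordinary client-side loop. This is a minimal illustration only, using Python's built-in sqlite3 as a stand-in for Presto: the table contents, the shortened column names, the brand_bucket function (zlib.crc32 instead of xxhash64) and the bucket count of 4 are all assumptions made for the demo. Because the query groups by brand, every brand lands in exactly one bucket, so the per-bucket results can simply be concatenated.

```python
import sqlite3
import zlib

N_BUCKETS = 4  # the answer uses 32; 4 keeps the demo small

conn = sqlite3.connect(":memory:")
# Stand-in for Presto's xxhash64: any deterministic hash of brand works for bucketing.
conn.create_function(
    "brand_bucket", 1, lambda b: zlib.crc32(b.encode("utf-8")) % N_BUCKETS
)

conn.execute("CREATE TABLE tbl (dt TEXT, brand TEXT, visid_high INTEGER, visid_low INTEGER)")
rows = [
    ("2017-05-15", "alpha", 1, 1), ("2017-05-15", "alpha", 1, 1),
    ("2017-05-15", "alpha", 1, 2), ("2017-05-15", "beta", 2, 1),
    ("2017-05-16", "beta", 2, 1), ("2017-05-16", "gamma", 3, 1),
    ("2017-05-16", "gamma", 3, 2), ("2017-05-16", "delta", 4, 1),
]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)

# Same shape as the question's query; {extra} is where the bucket filter goes.
# A '-' separator is added so that (1, 12) and (11, 2) stay distinct.
KPI_SQL = """
    SELECT dt, brand,
           COUNT(DISTINCT CAST(visid_high AS TEXT) || '-' || CAST(visid_low AS TEXT)) AS kpi_value
    FROM tbl
    WHERE dt >= '2017-05-15' AND dt <= '2017-06-13' {extra}
    GROUP BY dt, brand
"""

# One big query (the version that runs out of memory on the real cluster).
full = conn.execute(KPI_SQL.format(extra="")).fetchall()

# The "for loop": one smaller query per bucket, results appended together.
pieces = []
for n in range(N_BUCKETS):
    bucket_sql = KPI_SQL.format(extra="AND brand_bucket(brand) = ?")
    pieces.extend(conn.execute(bucket_sql, (n,)).fetchall())

combined_matches = sorted(pieces) == sorted(full)
print(combined_matches)
```

On a real cluster the loop body would submit the Presto query with the hash predicate from the answer and write each piece to a results table.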

Related

Hadoop Hive MAX gives multiple results

I am trying to get the maximum value of a count while selecting two columns, srcip and the maximum, but every time I include srcip I have to add GROUP BY srcip at the end, and the result looks as if the max was not applied at all.
When I write the query like this, it gives me the correct max value, but I want to select srcip as well.
Select max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t;
But when I include srcip in the select, I get a result as if there were no max function:
Select srcip,max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t
group by srcip;
I would expect a single result from this, but I get multiple rows.
Does anyone have any ideas?
You can use ORDER BY count1 DESC with LIMIT 1 to get the srcip with the maximum count.
SELECT srcip, count(srcip) as count1
from data group by srcip
ORDER BY count1 DESC LIMIT 1
Let's say you have data like this.
Table
Let's see what happens to the data when you run the following query.
Query
SELECT srcip,count(srcip) as count1 from data group by srcip
Output: table1
Now let's see what happens when you run your outer query on the table above.
Select srcip,max(count1) as maximum from table1 group by srcip
Same Output
The reason is that your query selects srcip and the maximum count from each srcip group. There are 3 groups, so you get 3 rows.
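The difference between the two approaches can be reproduced with a small runnable sketch. It uses Python's built-in sqlite3 rather than Hive, and the three sample IP addresses are made up for illustration; the two SQL statements are the ones discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (srcip TEXT)")
# Hypothetical sample: 10.0.0.1 appears 3 times, 10.0.0.2 twice, 10.0.0.3 once.
sample = ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2 + ["10.0.0.3"]
conn.executemany("INSERT INTO data VALUES (?)", [(ip,) for ip in sample])

# Grouping the outer query by srcip yields one "maximum" per group: 3 rows.
per_group = conn.execute("""
    SELECT srcip, MAX(count1) AS maximum
    FROM (SELECT srcip, COUNT(srcip) AS count1 FROM data GROUP BY srcip) t
    GROUP BY srcip
    ORDER BY srcip
""").fetchall()

# Sorting the counts and keeping the first row yields the single top srcip.
top = conn.execute("""
    SELECT srcip, COUNT(srcip) AS count1
    FROM data GROUP BY srcip
    ORDER BY count1 DESC LIMIT 1
""").fetchall()

print(per_group)
print(top)
```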
The query below returns exactly one row with the max count and the associated srcip. It is written for the expected result; I would suggest reviewing basic SQL and the earlier comments first, and then moving on to Hive analytical queries.
Some people could argue that there is a better way to optimize this query for your expected result, but this should give you some motivation to look further into Hive analytical queries.
select srcip, count1 as maximum from (select srcip, count1, row_number() over (order by count1 desc) as row_num from (select srcip, count(srcip) as count1 from data group by srcip) t) q1 where row_num = 1;

How to set range for limit clause in hive

How do I set a range for the LIMIT clause in Hive? I tried the query below, but it failed with a syntax error. Can someone please help?
select * from table limit 1000,2000;
You can use the ROW_NUMBER window function and filter on a range of row numbers.
The query below returns only the first 20 records from the table:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 0 and rowid <=20;
Using the BETWEEN operator to specify the range:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid between 0 and 20;
To fetch rows 21 to 40, increase the lower and upper bound values:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 20 and rowid <=40;
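The row-number range filter can be tried out locally. The sketch below is an illustration under assumptions, not Hive itself: it runs the same query shape through Python's built-in sqlite3 (window functions need SQLite 3.25 or later), with a hypothetical 100-row table named tab and the alias row_num in place of rowid, which is a reserved column name in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 101)]
)

def rows_in_range(lower, upper):
    """Return rows whose row number r satisfies lower < r <= upper."""
    sql = """
        SELECT id, val FROM (
            SELECT id, val, ROW_NUMBER() OVER (ORDER BY id) AS row_num FROM tab
        ) t
        WHERE row_num > ? AND row_num <= ?
        ORDER BY row_num
    """
    return conn.execute(sql, (lower, upper)).fetchall()

first_block = rows_in_range(0, 20)    # rows 1..20
second_block = rows_in_range(20, 40)  # rows 21..40
print(first_block[0], second_block[0])
```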
The LIMIT clause is used to set a ceiling on the number of rows in the result set. You are getting a syntax error because of an incorrect usage of this HQL clause.
The query could be written as the following to return no more than 2000 rows:
SELECT * FROM table LIMIT 2000;
You could also write it like so to return no more than 1000 rows:
SELECT * FROM table LIMIT 1000;
However you cannot combine both into the same argument for LIMIT. The LIMIT argument must evaluate to a constant value.
I will try to expand on this a bit to help solve your problem. If you are attempting to "paginate" your results, the following may be of use.
First, I would recommend against leaning on HQL for pagination; in most situations that is more efficiently implemented in the application logic (query a large result set, cache what you need, paginate in the application). If you have no choice but to pull out ranges of rows, you can get the desired effect through a combination of the LIMIT, ORDER BY, and OFFSET clauses.
LIMIT: This will limit your result set to a maximum number of rows.
ORDER BY: This will sort/order your result set based on one or more columns
OFFSET: This will start your result set at a certain row after the logical first entry in the table.
You may combine these three clauses to effectively query "pages" of your table. For example the following three queries show how to get the first 3 blocks of data from a table where each block contains 1000 rows and the target table's 'column1' is used to determine logical order.
SELECT title as "Page 1", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 0;
SELECT title as "Page 2", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 1000;
SELECT title as "Page 3", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 2000;
Each query declares 'column1' as the sorting value with ORDER BY. The queries will return no more than 1000 rows due to the LIMIT clause. Each result set will start at a different row due to the OFFSET being incremented by the "page size" for each query.
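The three "page" queries follow one pattern, OFFSET = (page - 1) * page size, which can be wrapped in a small helper. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive; the table t, its 2500 rows, and the fetch_page helper are hypothetical.

```python
import sqlite3

PAGE_SIZE = 1000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column1 INTEGER, column2 TEXT)")
# Rows are inserted in reverse order to show that ORDER BY, not insertion
# order, decides which rows end up on which page.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, f"v{i}") for i in range(2500, 0, -1)]
)

def fetch_page(page):
    """Return page number `page` (1-based), ordered by column1."""
    offset = (page - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT column1, column2 FROM t ORDER BY column1 LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()

page1, page2, page3 = fetch_page(1), fetch_page(2), fetch_page(3)
print(len(page1), len(page2), len(page3))
```

The last page is simply shorter than the page size, which is a convenient signal that there is nothing further to fetch.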
I am not sure what you are trying to achieve, but ...
That syntax works only if you are using Hive 2.0.0 or later, where LIMIT 1000,2000 skips the first 1000 rows and returns up to the next 2000.
hive --version
(https://issues.apache.org/jira/browse/HIVE-11531)
LIMIT in Hive returns an arbitrary 'n' records; it does not select a range of records.
You can use ORDER BY together with LIMIT to get what you want.

cassandra long latency for query if many rows in result

Example table schema:
CREATE TABLE tbl (
key int,
seq int,
name text,
PRIMARY KEY (key, seq));
For each key there are many rows (suppose 1000K).
Suppose I want to query the content for a specific key. My query is:
select * from tbl where key = 'key1'
(I actually use the C++ driver in my program, with the paging interface.)
The result contains 1000K rows, and the query takes about 10 s.
I think the data for each key is stored together on disk, so it should be very fast to return.
Why does it take so long?
Is there any way to optimize it?
Why does it take so long?
Your query returns almost 1000K = 1,000,000 rows. That is why it takes so long.
Is there any way to optimize it?
Yes, there is.
Try using LIMIT and pivot-based pagination in the query.
From the table definition, it seems you have a clustering key, seq, which you can use to optimize your query. Assuming the clustering key (seq) has the default ascending order, change your query to:
select * from tbl where key = 'key1' and seq > [pivot] limit 100
Replace [pivot] with the last seq value of your previous result set. For the first query, use Integer.MIN_VALUE as [pivot].
For example:
select * from tbl where key = 'key1' and seq > -100 limit 100
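The pivot loop can be sketched end to end as follows. This is an illustration only: it uses Python's built-in sqlite3 in place of Cassandra and its driver, an integer key, 1050 made-up rows, and a hypothetical iter_pages helper. The point is the control flow: each query starts after the last seq seen, so no page rescans rows that were already returned.

```python
import sqlite3

PAGE_SIZE = 100

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (key INTEGER, seq INTEGER, name TEXT, PRIMARY KEY (key, seq))"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)", [(1, s, f"name{s}") for s in range(1050)]
)

def iter_pages(key):
    """Yield pages of at most PAGE_SIZE rows for one key, in seq order."""
    pivot = -2**31  # same value as Java's Integer.MIN_VALUE
    while True:
        page = conn.execute(
            "SELECT key, seq, name FROM tbl "
            "WHERE key = ? AND seq > ? ORDER BY seq LIMIT ?",
            (key, pivot, PAGE_SIZE),
        ).fetchall()
        if not page:
            return
        yield page
        pivot = page[-1][1]  # last seq of this page becomes the next pivot

pages = list(iter_pages(1))
print(len(pages), sum(len(p) for p in pages))
```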

SQLite SELECT with max() performance

I have a table with about 1.5 million rows and three columns. Column 'timestamp' is of type REAL and indexed. I am accessing the SQLite database via PHP PDO.
The following three selects run in less than a millisecond:
select timestamp from trades
select timestamp + 1 from trades
select max(timestamp) from trades
The following select needs almost half a second:
select max(timestamp) + 1 from trades
Why is that?
EDIT:
Lasse asked for an "explain query plan". I ran it through a PHP PDO query, since I have no direct access to the SQLite3 command-line tool at the moment. I guess it does not matter; here is the result:
explain query plan select max(timestamp) + 1 from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SCAN TABLE trades (~1000000 rows)
explain query plan select max(timestamp) from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SEARCH TABLE trades USING COVERING INDEX tradesTimestampIdx (~1 rows)
The reason this query
select max(timestamp) + 1 from trades
takes so long is that, as the query plan in the question shows, the engine answers it with a full table scan (SCAN TABLE trades) instead of an index lookup. The index shortcut for MAX appears to be applied only when the aggregate is the entire result expression, so wrapping it in + 1 disables it, at least in the SQLite version that produced this plan.
In the query
select timestamp + 1 from trades
you are doing a calculation for each record, but the engine only needs to scan the entire table once. And in this query
select max(timestamp) from trades
the engine does not need to scan the table at all: the plan in the question shows a single lookup in the covering index tradesTimestampIdx.
From the SQLite documentation:
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table.
I emphasized "might" from the documentation, because it appears that a full table scan may still be necessary for a query of the form SELECT MAX(x)+1 FROM table, as the plan above shows, or whenever column x is not the left-most column of an index.
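One way to check this on a given installation is to print the plans yourself. The sketch below uses Python's built-in sqlite3 with a small made-up trades table (the price and volume columns are assumptions; the question only says there are three columns). It also tries a commonly suggested rewrite that keeps the bare MAX inside a scalar subquery and adds 1 outside it. Whether either form uses the index depends on the SQLite version, so the plans are printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (timestamp REAL, price REAL, volume REAL)")
conn.execute("CREATE INDEX tradesTimestampIdx ON trades (timestamp)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [(float(i), 1.0, 1.0) for i in range(1, 1001)],
)

direct = "SELECT max(timestamp) + 1 FROM trades"
# Rewrite: the inner query is a plain single-MAX query; the + 1 happens outside.
wrapped = "SELECT (SELECT max(timestamp) FROM trades) + 1"

for sql in (direct, wrapped):
    print(sql)
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", row)  # look for SCAN versus SEARCH ... USING COVERING INDEX

direct_value = conn.execute(direct).fetchone()[0]
wrapped_value = conn.execute(wrapped).fetchone()[0]
print(direct_value, wrapped_value)
```

Both statements return the same value; only the access path may differ.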

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQLTuning Advisor says that there's an expensive Cartesian product, which makes sense given that the query currently could generate up to 8500 x 18000 records. Here's the query in its stripped down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, disallowing weekends and federal holidays, so that I can sort on the elapsed time. Also, the pagination for the table is done server-side, so we can't just load into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on n <= closeddate - origdate + 1 -- you could even create a function-based index
order by t.id, t.origdate;
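The rewrite only moves terms around: origdate + n - 1 <= closeddate is the same condition as n <= closeddate - origdate + 1, with n isolated on one side. A small sketch can confirm that both forms return the same rows. It is an illustration under assumptions: Python's built-in sqlite3 replaces Oracle, a recursive CTE named gen replaces CONNECT BY, dates are plain integer day numbers, and the three sample records are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, origdate INTEGER, closeddate INTEGER)")
# Hypothetical records; dates are day numbers so that date + n is plain addition.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 100, 102), (2, 200, 200), (3, 300, 304)],
)

# gen(n) plays the role of the row generator: n = 1..10.
GENERATOR_SQL = """
    WITH RECURSIVE gen(n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM gen WHERE n < 10
    )
    SELECT t.id, t.origdate + gen.n - 1 AS day
    FROM my_table t
    JOIN gen ON {condition}
    ORDER BY t.id, day
"""

original = "t.origdate + gen.n - 1 <= t.closeddate"
rewritten = "gen.n <= t.closeddate - t.origdate + 1"

original_rows = conn.execute(GENERATOR_SQL.format(condition=original)).fetchall()
rewritten_rows = conn.execute(GENERATOR_SQL.format(condition=rewritten)).fetchall()

print(original_rows == rewritten_rows, len(rewritten_rows))
```

Each record produces one row per day between origdate and closeddate inclusive, so the row count is the sum of the record ages rather than records times generator size; whether the optimizer exploits that still depends on the database.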
