Hive: Exception when using LAG with window function

Hive: Exception when using LAG with window function - hadoop

I'm trying to calculate a time difference between 2 rows and applied the solution from this SO question. However I get an exception:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling
> statement: FAILED: SemanticException Failed to breakup Windowing
> invocations into Groups. At least 1 group must only depend on input
> columns. Also check for circular dependencies. Underlying error:
> Expecting left window frame boundary for function
> LAG((tok_table_or_col time), 1, 0) Window
> Spec=[PartitioningSpec=[partitionColumns=[(tok_table_or_col
> client_id)]orderColumns=[(tok_table_or_col time) ASC
> NULLS_FIRST]]window(type=ROWS, start=1 PRECEDING, end=currentRow)] as
> LAG_window_0 to be unbounded. Found : 1
HiveQL:
SELECT id, loc, LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time ROWS 1 PRECEDING) - time AS response_time FROM mytable
How to I fix this? What is the issue?
EDIT:
Sample data:
id loc time
0 1 1414250523591
0 1 1414250523655
1 2 1414250523655
1 2 1414250523661
1 3 1414250523661
1 3 1414250523662
And what I want is the difference of time between rows with same id and loc (always pairs of 2).
EDIT2: I should also mention I'm new to hadoop/hive ecosystem.
So as the error said, the window should be unbounded. So I just removed the ROWS clause and now at least it is doing something but it still is wrong. So I just wanted to check what the LAG value actually is:
SELECT id, loc, LAG(time, 1) OVER (PARTITION BY id, loc ORDER BY time) AS lag_col FROM mytable
And I get this as output:
id loc lag_col
1 2 null
1 2 -1
1 3 null
1 3 -1
The null is clear because I removed the default value but why -1? Are the large values in time column leading to somekind of overflow? Column is defined as bigint so it should actually fit without problem but maybe there is a conversion to int during the query?

Related

Delta Detection in Oracle table with 2 billion records using composite key

Initial Load on Day 1
id
key
fkid
1
0
100
1
1
200
2
0
300
Load on Day 2
id
key
fkid
1
0
100
1
1
200
2
0
300
3
1
400
4
0
500
Need to find delta records
Load on Day 2
id
key
address
3
1
400
4
0
500
Problem Statement
Need to find delta records in minimum time with following facts
1: I have to process around 2 billion records initially from a table as mentioned below
2: Also need to find delta with minimal time so that I can process it quickly
Questions :
1: Will it be a time consuming process to identify delta especially during production downtime ?
2: How long should it take to identify delta with 3 numeric columns in a table out of which
id & key forms a composite key.
Solution tried :
1: Use full join and extract delta with case nvl condition but looks to be costly.
nvl(node1.id, node2.id) id,
nvl(node1.key, node2.key) key,
nvl(node1.fkid, node2.fkid) fkid
FROM
TABLE_DAY_1 node1
FULL JOIN TABLE_DAY_2 node2 ON node2.id = node1.id
WHERE
node2.id IS NULL
OR node1.id IS NULL;```

You need two separate statements to handle this, one to detect new & changed rows, a separate one to detect deleted rows.
While it is cumberson to write, the fastest comparison is field-by-field, so:
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ *
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id = node2.id(+)
AND (node2.id IS NULL -- new rows
OR node1.col1 <> node2.col2 -- changed val on non-nullable col
OR NVL(node1.col3,' ') <> NVL(node2.col3,' ') -- changed val on nullable string
OR NVL(node1.col4,-1) <> NVL(node2.col4,-1) -- changed val on nullable numeric, etc..
)
Then for deleted rows:
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ node2.id
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id(+) = node2.id
AND node1.id IS NULL -- deleted rows
You will want to make sure Oracle does a full table scan. If you have lots of CPUs and parallel query is enabled on your database, make sure the query uses parallel query (hence the hint). And you want a hash join between them. Work with your DBA to ensure you have enough temporary space to pull this off, and enough PGA to at least handle this with a single pass workarea rather than multipass.

Using oracle loop to concatanete strings

I have someting like this
id day descrition
1 1 hi
1 1 today
1 1 is a beautifull
1 1 day
1 2 exemplo
1 2 for
1 2 this case
I need to do a funtion that for each day concatenate the descrtiomn colunm and return the result like this
id day descrition
1 1 hi today is a beautifull thay
1 2 exemplo for this case
Anny ideia about how can i do this usisng a loop in a function in oracle

You need a way of determining which order the values should be aggregated. The snippet below will rely on the implicit order in which Oracle reads the rows from the datafiles - if you have row movement enabled then you may get inconsistent results as the rows can be read in different orders as they are relocated in the underlying datafiles.
SELECT LISTAGG( description, ' ' ) WITHIN GROUP ( ORDER BY ROWNUM ) AS description
FROM your_table
GROUP BY id, day
It would be better to have another column that stores the order within each day.

multiply records acrross a rows - pl/sql [duplicate]

In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL, MySQL ? If not is there a workaround that would give this behaviour?

By MUL do you mean progressive multiplication of values?
Even with 100 rows of some small size (say 10s), your MUL(column) is going to overflow any data type! With such a high probability of mis/ab-use, and very limited scope for use, it does not need to be a SQL Standard. As others have shown there are mathematical ways of working it out, just as there are many many ways to do tricky calculations in SQL just using standard (and common-use) methods.
Sample data:
Column
1
2
4
8
COUNT : 4 items (1 for each non-null)
SUM : 1 + 2 + 4 + 8 = 15
AVG : 3.75 (SUM/COUNT)
MUL : 1 x 2 x 4 x 8 ? ( =64 )
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Care when using EXP/LOG in SQL Server, watch the return type http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number), and in cases where the result grows too large to turn it back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query
* LOG(0) and LOG(-ve) are undefined. The below shows only how to handle this in SQL Server. Equivalents can be found for the other SQL flavours, using the same concept
create table MUL(data int)
insert MUL select 1 yourColumn union all
select 2 union all
select 4 union all
select 8 union all
select -2 union all
select 0
select CASE WHEN MIN(abs(data)) = 0 then 0 ELSE
EXP(SUM(Log(abs(nullif(data,0))))) -- the base mathematics
* round(0.5-count(nullif(sign(sign(data)+0.5),1))%2,0) -- pairs up negatives
END
from MUL
Ingredients:
taking the abs() of data, if the min is 0, multiplying by whatever else is futile, the result is 0
When data is 0, NULLIF converts it to null. The abs(), log() both return null, causing it to be precluded from sum()
If data is not 0, abs allows us to multiple a negative number using the LOG method - we will keep track of the negativity elsewhere
Working out the final sign
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
again use NULLIF to remove from COUNT() the 1's, since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
more mathematical tricks: we take 1 or 0 off 0.5, so that the above becomes
--> (0.5-1=-0.5=>round to -1) if there is an odd number of negative numbers
--> (0.5-0= 0.5=>round to 1) if there is an even number of negative numbers
we multiple this final 1/-1 against the SUM-PRODUCT value for the real result

No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable

I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.

With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about mssql and oracle, but i bet they have options to create custom aggregates as well.

You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0 that will fail when using LOG. I wrote a solution in this question that deals with this

Using CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
;WITH cte AS
(
SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
FROM Foo
WHERE Id=1
UNION ALL
SELECT ff.Id, cte.multiply*ff.Val as multiply, ff.rn FROM
(SELECT f.Id, f.Val, (row_number() over (order by f.Id)) as rn
FROM Foo f) ff
INNER JOIN cte
ON ff.rn -1= cte.rn
)
SELECT * FROM cte

Not sure about Oracle or sql-server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
| 961 | 9610 |
+-----------+--------------+
1 row in set (0.00 sec)

DateWise Query with sum count in Oracle

Below is the one is my actual resulset in oracle database
TIMESTAMP SUCESS FAILURE
26-01-2017 1 0
31-01-2017 0 1
If i select from 26-01-2017 to 31-01-2017 .Query has to return like this below
expected resultset
Timestamp 26-01-2017 27-01-2017 28-01-2017 29-01-2017 30-01-2017 31-01-2017
Sucess 1 0 0 0 0 0 |
Failure 0 0 0 0 0 0
Please can anyone give me suggestions to write logic for above expected resultset?

You would need a PIVOT (I made an assumption that you always have either success or failure):
select * from (
select decode(success, 1, 'success', 'failure') as res_name,
success+failure as res,
to_char(time_stamp, 'DD-MM-YYYY') ts
from your_table)
pivot (max(res) for ts in ('26-01-2017', '27-01-2017', '28-01-2017', '29-01-2017', '30-01-2017', '31-012017'))
List of columns is always defined up front, so if you need a variable list of columns you need either generate above query or use PIVOT XML. With PIVOT XML you can use subquery instead of predefined list of variables, but you get XML back.

Checking time difference, MS Access

I have an MS Access Database table that records communication status of values from several meters. The data is logged directly to the table, but I need to make sure that the table is populating. From the sample data you can see that the Comm columns doesn't read false or 0, so I want to return a log whenever the difference between now and "Date / Time" is greater than 5 minutes.
Date / Time FCB Comm BOF Comm EAF Comm FGP Comm
9/6/2011 10:29:10 1 1 1 1
9/6/2011 10:28:01 1 1 1 1
9/6/2011 10:27:11 1 1 1 1
9/6/2011 10:26:20 1 1 1 1
9/2/2011 08:17:01 1 1 1 1
9/2/2011 08:16:10 1 1 1 1
9/2/2011 08:15:02 1 1 1 1
9/2/2011 08:14:08 1 1 1 1
I wanted to know if anyone could tell me if this could like a reasonable query to run?
SELECT Data.[Date / Time], Data.[Ford Chiller Building Comm Okay],
Data.[Basic Oxygen Furnace Comm Okay], Data.[Electro-Arc Furnace Comm Okay],
Data.[J-9 Shop Comm Okay], Data.[Ford Glass Plant Comm Okay]
FROM Data
where DateDiff("n",now(), Data.[Date / Time] ) < 5;

You need something running continuously that generates a notification whenever expected data doesn't appear, and there's a couple of approaches you can take to do that.
One is to continuously run a query like the one you have above, checking the most recent date in the table against the value of the now() function.
Another approach is to take the latest date in your table, wait (sleep) for 5 minutes, and then check the table again for any newer entries. My expectation is that this approach will generate fewer hits on your table.
You could also just check the most recent date every 5 minutes regardless of the previous time checked and see if data hasn't come in.
You need to set up your notification loop first, then you can experiment with different approaches.
all you should really need to do is return the number of rows in the table whose timestamp is within 5 minutes of now(). You shouldn't need additional row detail, just is the count 0 or not?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hive: Exception when using LAG with window function - hadoop

Related

Delta Detection in Oracle table with 2 billion records using composite key

Using oracle loop to concatanete strings

multiply records acrross a rows - pl/sql [duplicate]

DateWise Query with sum count in Oracle

Checking time difference, MS Access

Categories

Resources