SQL Server: Combine multiple rows into one row - windows

I've looked at a few other similar questions, but none of them fits the particular situation I find myself in.
I am a relative beginner at SQL.
I am writing a query to create a report. I have read-only access to this DB. I am trying to combine three rows into one row. Any method that only requires read access will work.
That being said, the three rows I have were obtained by a very long sub-query. Here is the outer shell:
SELECT Availability,
Start_Date,
End_Date
FROM (
-- long subquery goes here (it is several UNION ALLs)
...
) AS dual
Here are the rows:
Availability | Start_Date | End_Date
-------------------------------------
99.983       | NULL       | NULL
NULL         | 1/10/2013  | NULL
NULL         | NULL       | 1/28/2013
What I am trying to do is combine the three rows into one row, like so:
Availability | Start_Date | End_Date
-------------------------------------
99.983       | 1/10/2013  | 1/28/2013
I am aware that I could use COALESCE() to put them in one column, but I would prefer to keep the three separate columns.
I can't create or use stored procedures.
Is it possible to do this? Can I have an example for the general case?

Have you tried using an aggregate function:
SELECT max(Availability) Availability,
max(Start_Date) Start_Date,
max(End_Date) End_Date
FROM (
-- long subquery goes here (it is several UNION ALLs)
...
) AS dual
If you have additional columns, you would need to add a GROUP BY clause, for example:
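A hedged sketch of that variant, where Some_Key is a hypothetical extra column identifying which rows belong together:
SELECT Some_Key,
max(Availability) Availability,
max(Start_Date) Start_Date,
max(End_Date) End_Date
FROM (
-- long subquery goes here (it is several UNION ALLs)
...
) AS dual
GROUP BY Some_Key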

Related

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? I.e., considering a table with the following data:
+----------+----------+----------+
| Employee | Date     | Balance  |
+----------+----------+----------+
| John     | 20181029 | 1800.2   |
| John     | 20181105 | 2937.74  |
| John     | 20181106 | 3000     |
| John     | 20181110 | 1500     |
| John     | 20181119 | -755.5   |
| John     | 20181120 | -800     |
| John     | 20181121 | 1200     |
| John     | 20181122 | -400     |
| John     | 20181123 | -900     |
| John     | 20181202 | -1300    |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous record. In the above data, John had a balance of 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row per date/balance change, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using CASE to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table where you will have a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is standard industry practice, and you can probably find sample data for one on the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table with historical data into which you keep inserting incrementally. You can look into data warehousing for the proper definitions, but I call this a fact table - f_employee_balance.
This will re-create the original table with missing dates and populate the missing balance with earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4 together; this would save you from creating an extra table (see the sketch below). When you are in the big data world, you don't worry much about wasting extra disk space or development time. You can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concepts and Hive analytical queries.
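For reference, a hedged sketch of what the combined version might look like, reusing the table and column names from the steps above (untested):
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id, year, month) as row_num,
avg(balance) over (partition by employee_id, year, month) as monthly_avg,
month, year
from (
--step 3 logic inlined: fill in the missing dates and carry the last known balance forward
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true)
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row), 0) as balance,
q1.month, q1.year
from (
select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
from daily_employee_balance r, dimension_date d ) q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
) filled
) q2
where row_num = 1
order by year, month;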

What's the purpose of using the OVER and RANK keywords in Hive SQL?

What is the meaning/purpose of using the OVER and RANK keywords in Hive SQL?
select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name;
+------+---------+---------------+
| rank | name    | net_worth     |
+------+---------+---------------+
| 1    | Solomon | 2000000000.00 |
| 2    | Croesus | 1000000000.00 |
| 2    | Midas   | 1000000000.00 |
| 4    | Crassus | 500000000.00  |
| 5    | Scrooge | 80000000.00   |
+------+---------+---------------+
The OVER clause is powerful in that you can have aggregates over different ranges ("windowing"), whether you use a GROUP BY or not.
The OVER clause defines a window, or user-specified set of rows, within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.
The OVER clause can be used in association with aggregate functions and ranking functions. It determines the partitioning and ordering of the records before the aggregate or ranking function is applied.
Suppose you use only the rank() function: how would SQL know on what basis the rank should be calculated? Say the table has three columns, name, net_worth and net_profit, and the name with the highest net_profit should get the first rank. You have to tell SQL to calculate the rank on the basis of the highest net_profit.
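A sketch of that case, assuming a hypothetical net_profit column on the wealth table from the question:
select rank() over (order by net_profit desc) as rank, name, net_worth, net_profit
from wealth
order by rank, name;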
over() works on a "window" of attributes.
In your example, select rank() over (order by net_worth desc), you have instructed it to rank the rows by the net_worth column in descending order. For that reason, the ranking is done in descending order of net_worth.
over() is especially powerful when used along with partition by.
Have a look at this article, which provides good examples to understand the concepts.
If you have a sales table with Territory and Sales Amount, you can rank by Sales Amount across the whole table, or create a partition by Territory and rank the Sales Amount within each Territory, as sketched below.
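A minimal sketch of both options, assuming a hypothetical table sales with columns territory and sales_amount:
select territory,
sales_amount,
rank() over (order by sales_amount desc) as overall_rank,
rank() over (partition by territory order by sales_amount desc) as territory_rank
from sales;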
Have a look at this article to get an understanding of WindowingAndAnalytics. It will explain how to use aggregate functions in HiveQL.

Oracle Domain index and sorting

The following query performs very poorly, due to the "order by". My goal is to get only a small subset of the resultset (using ROWNUM, for example). However, when I add "order by" it goes through the entire resultset performing an index lookup for each record, which makes it extremely slow. Without sorting the query is about 100 times faster when I limit the resultset to, for example, 1000 records.
QUERY:
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field;
THIS IS HOW I CREATED THE INDEX:
CREATE INDEX myindex ON mytable (text_field) INDEXTYPE IS ctxsys.context FILTER BY another_field
EXECUTION PLAN:
---------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
| 2 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
|* 3 | DOMAIN INDEX | MYINDEX |
---------------------------------------------------------------
I also tried CTXCAT instead of CONTEXT, with no improvement. I think the problem is that when I want the results sorted (only the top 1000), it performs an index lookup for each record in the "entire" resultset. Is there a way to avoid that?
Thank you.
To have the ordering applied before the rownum filter, you need to use an in-line view:
SELECT text_field
from (
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field
)
where rownum <= 1000;
With your index in place Oracle should optimise this to do as little work as possible. You should see 'sort order by stopkey' and 'count stopkey' steps in the plan, which is Oracle being clever and knowing it only needs to get 1000 values from the index.
If you don't use the in-line view but just add the rownum to your original query it will still optimise it but as you state it will order the first 1000 random (or indeterminate, anyway) rows it finds, because of the sequence of operations it performs.
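For contrast, a sketch of the form to avoid: here ROWNUM is evaluated before ORDER BY, so Oracle sorts whichever 1000 rows it happens to fetch first.
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
and rownum <= 1000
order by another_field;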

Oracle PLSQL: Performance issue when using TABLE functions

I am currently facing a performance problem when using TABLE functions. Let me explain.
I am working with Oracle types and one of them is defined like below:
create or replace TYPE TYPESTRUCTURE AS OBJECT
(
ATTR1 VARCHAR2(30),
ATTR2 VARCHAR2(20),
ATTR3 VARCHAR2(20),
ATTR4 VARCHAR2(20),
ATTR5 VARCHAR2(20),
ATTR6 VARCHAR2(20),
ATTR7 VARCHAR2(20),
ATTR8 VARCHAR2(20),
ATTR9 VARCHAR2(20),
ATTR10 VARCHAR2(20),
ATTR11 VARCHAR2(20),
ATTR12 VARCHAR2(20),
ATTR13 VARCHAR2(10),
ATTR14 VARCHAR2(50),
ATTR15 VARCHAR2(13)
);
Then I have one table of this type like:
create or replace TYPE TYPESTRUCTURE_ARRAY AS TABLE OF TYPESTRUCTURE ;
I have one procedure with the following variables:
arr TYPESTRUCTURE_ARRAY;
arr2 TYPESTRUCTURE_ARRAY;
ARR contains only a single instance of TYPESTRUCTURE, with all its attributes set to NULL except ATTR4, which is set to 'ABC'.
ARR2 is completely empty.
Here comes the part which is giving me the performance issue.
The purpose is to take some values from a view (depending on the value of ATTR4) and fill them into the same or a similar structure. So I do the following:
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
The code here works correctly, except for the fact that it takes 15 seconds to execute the query...
This query fills around 20 instances of TYPESTRUCTURE (or rows) into ARR2.
It might look like there is simply a lot of data in the view. But what strikes me as strange is that if I change the query and hardcode the value, as below, then it is completely fast (milliseconds):
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM (SELECT 'ABC' ATTR4 FROM DUAL) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
In this new query I am hardcoding the value directly, but keeping the join, in order to test something as similar as possible to the query above but without the TABLE() function.
So here is my question: is it possible that this TABLE() function is creating such a big delay with only a single record inside? I would like some advice on what is wrong in my approach and whether there may be some other way to achieve this...
Thanks!!
This problem is likely caused by a poor optimizer estimate for the number of rows returned by the TABLE function. The CARDINALITY or DYNAMIC_SAMPLING hints may be the best way to solve the problem.
Cardinality estimate
Oracle gathers statistics on tables and indexes in order to estimate the cost of accessing those objects. The most important estimate is how many rows will be returned by an object. Procedural code does not have statistics, by default, and Oracle does not make any attempt to parse the code and estimate how many rows will be produced. Whenever Oracle sees a procedural row source it uses a static number. On my database, the number is 16360. On most databases the estimate is 8192, as beherenow pointed out.
explain plan for
select * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 16360 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
Fix #1: CARDINALITY hint
As beherenow suggested, the CARDINALITY hint can solve this problem by statically telling Oracle how many rows to estimate.
explain plan for
select /*+ cardinality(1) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
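Applied to the query from the question, the hint might look something like this sketch (naming the PARS alias in the hint is an assumption about the preferred syntax; the value 1 reflects the single-element collection described in the question):
SELECT /*+ cardinality(pars, 1) */ TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;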
Fix #2: DYNAMIC_SAMPLING hint
A more "official" solution is to use the DYNAMIC_SAMPLING hint. This hint tells Oracle to sample some data at run time before it builds the explain plan. This adds some cost to building the explain plan, but it will return the true number of rows. This may work much better if you don't know the number ahead of time.
explain plan for
select /*+ dynamic_sampling(2) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 3 |
--------------------------------------------------------------
But what's really slow?
We don't know exactly what was slow in your query. But whenever things are slow, it's usually best to focus on the worst cardinality estimate. Row estimates are never perfect, but being off by several orders of magnitude can have a huge impact on an execution plan. In the simplest case it may change an index range scan into a full table scan.

Use Oracle unnested VARRAY's instead of IN operator

Let's say users have 1 - n accounts in a system. When they query the database, they may choose to select from m accounts, with m between 1 and n. Typically the SQL generated to fetch their data is something like
SELECT ... FROM ... WHERE account_id IN (?, ?, ..., ?)
So depending on the number of accounts a user has, this will cause a new hard parse in Oracle, and a new execution plan, etc. Now there are a lot of queries like that and hence a lot of hard parses, and maybe the cursor/plan cache will fill up quite early, resulting in even more hard parses.
Instead, I could also write something like this
-- use any of these
CREATE TYPE numbers AS VARRAY(1000) of NUMBER(38);
CREATE TYPE numbers AS TABLE OF NUMBER(38);
SELECT ... FROM ... WHERE account_id IN (
SELECT column_value FROM TABLE(?)
)
-- or
SELECT ... FROM ... JOIN (
SELECT column_value FROM TABLE(?)
) ON column_value = account_id
And use JDBC to bind a java.sql.Array (i.e. an oracle.sql.ARRAY) to the single bind variable. Clearly, this will result in fewer hard parses and fewer cursors in the cache for functionally equivalent queries. But is there any general performance drawback, or any other issues that I might run into?
E.g: Does bind variable peeking work in a similar fashion for varrays or nested tables? Because the amount of data associated with every account may differ greatly.
I'm using Oracle 11g in this case, but I think the question is interesting for any Oracle version.
I suggest you try a plain old join like in
SELECT Col1, Col2
FROM ACCOUNTS ACCT,
TABLE TAB
WHERE ACCT.User = :ParamUser
AND TAB.account_id = ACCT.account_id;
An alternative could be a table subquery
SELECT Col1, Col2
FROM (
SELECT account_id
FROM ACCOUNTS
WHERE User = :ParamUser
) ACCT,
TABLE TAB
WHERE TAB.account_id = ACCT.account_id;
or a where subquery
SELECT Col1, Col2
FROM TABLE TAB
WHERE TAB.account_id IN
(
SELECT account_id
FROM ACCOUNTS
WHERE User = :ParamUser
);
The first one should be better for performance, but you had better check them all with explain plan.
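A minimal sketch of how to compare them with EXPLAIN PLAN, shown here for the first variant (substitute your real table names, and repeat for the others):
explain plan for
SELECT Col1, Col2
FROM ACCOUNTS ACCT,
TABLE TAB
WHERE ACCT.User = :ParamUser
AND TAB.account_id = ACCT.account_id;

select * from table(dbms_xplan.display);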
Looking at V$SQL_BIND_CAPTURE in a 10g database, I have a few rows where the datatype is VARRAY or NESTED_TABLE; the actual bind values were not captured. In an 11g database, there is just one such row, but it also shows that the bind value is not captured. So I suspect that bind value peeking essentially does not happen for user-defined types.
In my experience, the main problem you run into using nested tables or varrays in this way is that the optimizer does not have a good estimate of the cardinality, which could lead it to generate bad plans. But there is an (undocumented?) CARDINALITY hint that might be helpful. The problem with that is, if you calculate the actual cardinality of the nested table and include that in the query, you're back to having multiple distinct query texts. Perhaps if you expect that most or all users will have at most 10 accounts, using the hint to indicate that as the cardinality would be helpful. Of course, I'd try it without the hint first; you may not have an issue here at all.
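If you do go the hint route, it could be placed in the subquery that reads the collection, something like this sketch of the form from the question (the alias t and the value 10 are illustrative assumptions):
SELECT ... FROM ... WHERE account_id IN (
SELECT /*+ cardinality(t, 10) */ column_value FROM TABLE(?) t
)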
(I also think that perhaps Miguel's answer is the right way to go.)
For a medium-sized list (several thousand items) I would use this approach:
First, generate a prepared statement with an XMLTABLE joined to your main table.
For instance:
String myQuery = "SELECT ...
+" FROM ACCOUNTS A,"
+ "XMLTABLE('tab/row' passing XMLTYPE(?) COLUMNS id NUMBER path 'id') t
+ "WHERE A.account_id = t.id"
Then loop through your data and build a StringBuffer with content like this:
StringBuffer idList = new StringBuffer("<tab><row><id>101</id></row><row><id>907</id></row> ...</tab>");
Eventually, prepare and submit your statement, then fetch the results:
PreparedStatement myStatement = connection.prepareStatement(myQuery);
myStatement.setString(1, idList.toString());
ResultSet rs = myStatement.executeQuery();
while (rs.next()) {...}
Using this approach it is also possible to pass a multi-valued list, as in the select statement
SELECT * FROM TABLE t WHERE (t.COL1, t.COL2) in (SELECT X.COL1, X.COL2 FROM X);
In my experience performance is pretty good, and the approach is flexible enough to be used in very complex query scenarios.
The only limit is the size of the string passed to the DB, but I suppose it is possible to use a CLOB in place of a String for an arbitrarily long XML wrapper around the input list.
This problem of binding a variable number of items into an IN list seems to come up a lot in various forms. One option is to concatenate the IDs into a comma-separated string and bind that, and then use a bit of a trick to split it into a table you can join against, e.g.:
with bound_inlist
as
(
select
substr(txt,
instr (txt, ',', 1, level ) + 1,
instr (txt, ',', 1, level+1) - instr (txt, ',', 1, level) -1 )
as token
from (select ','||:txt||',' txt from dual)
connect by level <= length(:txt)-length(replace(:txt,',',''))+1
)
select *
from bound_inlist a, actual_table b
where a.token = b.token
Bind variable peeking is going to be a problem though.
Does the query plan actually change for a larger number of accounts, i.e. would it be more efficient to move from an index to a full table scan in some cases, or is it borderline? As someone else suggested, you could use the CARDINALITY hint to indicate how many IDs are being bound; the following test case proves this actually works:
create table actual_table (id integer, padding varchar2(100));
create unique index actual_table_idx on actual_table(id);
insert into actual_table
select level, 'this is just some padding for '||level
from dual connect by level <= 1000;
explain plan for
with bound_inlist
as
(
select /*+ CARDINALITY(10) */
substr(txt,
instr (txt, ',', 1, level ) + 1,
instr (txt, ',', 1, level+1) - instr (txt, ',', 1, level) -1 )
as token
from (select ','||:txt||',' txt from dual)
connect by level <= length(:txt)-length(replace(:txt,',',''))+1
)
select *
from bound_inlist a, actual_table b
where a.token = b.id;
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 840 | 2 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | | | | |
| 2 | NESTED LOOPS | | 10 | 840 | 2 (0)| 00:00:01 |
| 3 | VIEW | | 10 | 190 | 2 (0)| 00:00:01 |
|* 4 | CONNECT BY WITHOUT FILTERING| | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 6 | INDEX UNIQUE SCAN | ACTUAL_TABLE_IDX | 1 | | 0 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID | ACTUAL_TABLE | 1 | 65 | 0 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Another option is to always use n bind variables in every query. Use null for m+1 to n.
Oracle ignores repeated items in the expression_list. Your queries will perform the same way and there will be fewer hard parses. But there will be extra overhead to bind all the variables and transfer the data. Unfortunately I have no idea what the overall effect on performance would be; you'd have to test it.
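A sketch of the idea, with n fixed at 10: NULL is bound to the unused positions, and NULL never matches anything in an IN list.
SELECT ... FROM ... WHERE account_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)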
