Select data between two dates - Hadoop

I'm working in Hive on the data set below:
+++++++++++++++++++++++++++++++++++++++++++
code dateJ capa
+++++++++++++++++++++++++++++++++++++++++++
1988 2015-08-22 23
0470 2015-07-26 455
... ..... ...
5884 2015-08-01 54
4587 2015-06-05 100
I would like to select "code" from the table for rows whose dateJ falls between two dates. The query below works:
SELECT code FROM tabl WHERE dateJ BETWEEN '2015-06-05' AND '2015-08-22'
But when I use nested subqueries for the bounds, it doesn't work:
SELECT code FROM tabl WHERE dateJ BETWEEN (SELECT MIN(dateJ) FROM tabl) and (SELECT MAX(dateJ) FROM tabl)
Could anybody help me fix the second query? Hive doesn't seem to support subqueries used this way.
Thx

I found a solution. Here it is :
select code from tabl,
(select min(dateJ) mindate, max(dateJ) maxdate from tabl) tmp
where dateJ between tmp.mindate and tmp.maxdate
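The derived-table join above can be tried out locally. The following is only a sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made for runnability; SQLite accepts the original scalar subqueries too, so this illustrates the shape of the rewrite rather than Hive's behavior). The sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabl (code TEXT, dateJ TEXT, capa INTEGER)")
conn.executemany(
    "INSERT INTO tabl VALUES (?, ?, ?)",
    [
        ("1988", "2015-08-22", 23),
        ("0470", "2015-07-26", 455),
        ("5884", "2015-08-01", 54),
        ("4587", "2015-06-05", 100),
    ],
)

# Same shape as the answer: a one-row derived table holds both bounds and is
# cross-joined to the base table, so BETWEEN only compares plain columns.
query = """
SELECT code
FROM tabl
CROSS JOIN (SELECT MIN(dateJ) AS mindate, MAX(dateJ) AS maxdate FROM tabl) tmp
WHERE dateJ BETWEEN tmp.mindate AND tmp.maxdate
ORDER BY dateJ
"""
codes = [row[0] for row in conn.execute(query)]
print(codes)
```

Because the derived table always yields exactly one row, the cross join does not multiply the rows of tabl.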

This functionality is not supported in Hive, unfortunately. You can only use a select statement in the where clause in conjunction with IN, NOT IN, EXISTS, and NOT EXISTS. If you can find a way to construct the functionality you need using those and joins, then that's the way to go.

Related

How to modify Oracle SQL query to Snowflake

Using Oracle SQL, there is a function, noted below, that will allow you to create a "list" of names, phone numbers, etc., without using multiple DUAL queries and UNION/UNION ALL to get more than one record.
The query below produces a list in this case of 10 names.
SELECT COLUMN_VALUE USERNAME
FROM TABLE(SYS.DBMS_DEBUG_VC2COLL(
'WARNER,JEFF',
'MALITO,CARL',
'MOODY,JEANNE',
'PHILLIPS,HUGH & KELLY',
'PATSANTARAS,VICTORIA',
'BROWN,ROLAND',
'RADOSEVICH,MIKE',
'RIDER,JACK',
'MACLEOD,LENARD',
'SCOTT,DAN' ))
However, when trying to run this same query in Snowflake, it will not work.
I receive this error: SQL compilation error: Invalid identifier SYS.DBMS_DEBUG_VC2COLL
Is there a "Snowflake version" of this query that can be used?
Here are some options, you can see which works best for you.
This works if you can get your SQL to look similar:
SELECT $1::VARCHAR AS column_value
FROM (VALUES ('WARNER,JEFF'), ('MACLEOD,LENARD'), ('SCOTT,DAN'));
This also works if you can get your list to be in a single string, delimited by a pipe or similar:
SELECT value::VARCHAR AS column_value
FROM LATERAL FLATTEN(INPUT=>SPLIT('WARNER,JEFF|MACLEOD,LENARD|SCOTT,DAN', '|'));
If you have the strings in the format 'a','b' and find it painful to do one of the above, I'd do something like this:
SELECT value::VARCHAR AS column_value
FROM LATERAL FLATTEN(INPUT=>SPLIT(ARRAY_TO_STRING(ARRAY_CONSTRUCT('WARNER,JEFF', 'MALITO,CARL', 'MOODY,JEANNE'), '|'), '|'));
Similar to the above suggestions, you can try this:
SELECT VALUE::VARCHAR as column_name
FROM TABLE(FLATTEN(INPUT => ARRAY_CONSTRUCT('WARNER,JEFF', 'MALITO,CARL', 'MOODY,JEANNE'), MODE => 'array'));

Nested Subquery Limitations in Oracle

So, I have been doing a fair amount of reading on this in various forums and resource sites but have not yet found a solution I believe applies to my case. Also, I can't believe how difficult this is proving to be; I would think this kind of query would be fairly common.
Essentially what I am doing here is querying two historical tables (tbl_b and tbl_c), via union, for a specific milestone date - for which there may be multiple results... I then wish to find the most recent of these results, using max. This date is then returned as a column in the main query.
My problem is that, in the 3rd tier subquery, I need to reference an identifier value from the table in the top query (tbl_a).
I know that correlated subqueries can only reference their immediate parent query, so I am stuck.
Edit 1
The target date I am searching for will most likely, but not necessarily, be unique within the result set. It is a timestamp of the data record. I am looking for the most recent entry in the history that correlates to each column in tbl_a. Creating an SQL Fiddle for this.
See sample below:
select tbl_a.col_a,
tbl_a.col_b,
(
select max(target_date)
from
(
select tbl_b.target_date
from tbl_b
where tbl_b.tbl_a_id = tbl_a.id and
tbl_b.flag = 1 and
tbl_b.milestone_id = tbl_a.milestone_id
union
select tbl_c.target_date
from tbl_c
where tbl_c.tbl_a_id = tbl_a.id and
tbl_c.flag = 1 and
tbl_c.milestone_id = tbl_a.milestone_id
) most_recent_target_date
)
from tbl_a
Convert this query to a join, in this way:
select tbl_a.col_a,
tbl_a.col_b,
max(most_recent_target_date.target_date)
from tbl_a
join (
select tbl_b.target_date, tbl_b.date_id
from tbl_b
where tbl_b.flag = 1
union all
select tbl_c.target_date, tbl_c.date_id
from tbl_c
where tbl_c.flag = 1
) most_recent_target_date
ON tbl_a.date_id = most_recent_target_date.date_id
GROUP BY tbl_a.col_a,
tbl_a.col_b
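The answer joins on a date_id column that does not appear in the question. As a hedged variant, the sketch below applies the same join-and-aggregate idea using the question's own correlation keys (tbl_a_id and milestone_id). It runs on Python's sqlite3 rather than Oracle, and the column types and sample rows are invented for illustration. A LEFT JOIN is used here (my choice, not the answer's) so that tbl_a rows with no history still appear, as they would with the original scalar subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_a (id INTEGER, milestone_id INTEGER, col_a TEXT, col_b TEXT);
CREATE TABLE tbl_b (tbl_a_id INTEGER, milestone_id INTEGER, flag INTEGER, target_date TEXT);
CREATE TABLE tbl_c (tbl_a_id INTEGER, milestone_id INTEGER, flag INTEGER, target_date TEXT);
-- Hypothetical sample data
INSERT INTO tbl_a VALUES (1, 10, 'a1', 'b1'), (2, 20, 'a2', 'b2'), (3, 30, 'a3', 'b3');
INSERT INTO tbl_b VALUES (1, 10, 1, '2015-01-05'), (1, 10, 0, '2015-09-09'), (2, 20, 1, '2015-02-01');
INSERT INTO tbl_c VALUES (1, 10, 1, '2015-03-15'), (2, 99, 1, '2015-12-31');
""")

# The union of both history tables is built once, without any reference to
# tbl_a; the correlation moves into the join condition instead.
query = """
SELECT a.id, a.col_a, a.col_b, MAX(h.target_date) AS most_recent_target_date
FROM tbl_a a
LEFT JOIN (
    SELECT tbl_a_id, milestone_id, target_date FROM tbl_b WHERE flag = 1
    UNION ALL
    SELECT tbl_a_id, milestone_id, target_date FROM tbl_c WHERE flag = 1
) h
  ON h.tbl_a_id = a.id AND h.milestone_id = a.milestone_id
GROUP BY a.id, a.col_a, a.col_b
ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Grouping by a.id as well as the displayed columns keeps two tbl_a rows with identical col_a and col_b from being merged into one.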

What is the simplest query to display unique values in each column with their count?

Suppose I have a table like this:
id name addr_line_1 addr_line_2 rec_ins_dt rec_updt_dt
and I want to show the output as follows:
rec_ins_dt rec_ins_dt_count rec_updt_dt rec_updt_dt_count
How can I achieve this result using a single query? I understand this can be done by creating temp tables and then joining the two temp tables together, but I want to use a single query.
Following are the additional limitations while executing this query :
Input data : 1 billion rows
Memory : 4 GB
Please consider platform as Oracle or Netezza. Thank you for your inputs.
SELECT
rec_ins_dt , COUNT(*) OVER (PARTITION BY rec_ins_dt) AS rec_ins_dt_count,
rec_updt_dt , COUNT(*) OVER (PARTITION BY rec_updt_dt) AS rec_updt_dt_count
FROM <your-table>;
Oracle Version
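To see what the windowed counts return, here is a small sketch on Python's sqlite3 (a stand-in for Oracle or Netezza; it needs an SQLite build with window functions, 3.25 or later). The three sample rows are invented. The DISTINCT is my addition: without it the query returns one row per input row, which would be a billion rows for the table described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, rec_ins_dt TEXT, rec_updt_dt TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", "2015-02-01"),
        (2, "2015-01-01", "2015-02-02"),
        (3, "2015-01-02", "2015-02-02"),
    ],
)

# Each COUNT(*) is computed over its own partition, independently of the other
# column; DISTINCT then collapses identical output rows.
query = """
SELECT DISTINCT
    rec_ins_dt,  COUNT(*) OVER (PARTITION BY rec_ins_dt)  AS rec_ins_dt_count,
    rec_updt_dt, COUNT(*) OVER (PARTITION BY rec_updt_dt) AS rec_updt_dt_count
FROM t
ORDER BY rec_ins_dt, rec_updt_dt
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that the result lists each distinct pair of dates that occurs together, so a given rec_ins_dt value can still appear on more than one row.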

Transforming hive IN subselect query combined with WHERE replacement

I know that an IN query has to be replaced with a left semi join (e.g. Hive doesn't support in, exists. How do I write the following query?), but I don't know how to combine it with a WHERE clause:
SELECT *
from foo
WHERE userId IN
(SELECT distinct(userId) FROM foo WHERE x=true ORDER BY RAND() LIMIT 100);
thanks.
EDIT: Changed query. Intention is to create a random sample of entries (statistics wise).
(Posting alternative approach for completeness.)
To sample a set of records from a table, you can use Hive's TABLESAMPLE syntax. For example, to select a random sample of 100 distinct userIds you would use:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(100 ROWS);
The syntax allows you to specify your sample size in different ways. The following is also valid:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(1 PERCENT);
For more details, check out the manual page for this topic.
Once you have your sample of userId's, you can use Manuel Aldana's earlier answer to select the corresponding records from your original table.
select id from foo
left semi join
(SELECT id_2 FROM bar WHERE x=true RAND() LIMIT 100) x
ON foo.id=x.id_2
Should be like this.
I just don't understand this part : x=true RAND()
Also, this doesn't handle nulls just like your query.

Display Records through SQL in Oracle

I ran the following query in an Oracle database and it produced the following output:
Query: select id,name from member where name like 'A%';
ID Name
261 A....
706 Aaa.......
327 Ab.....
and more...
This query returns 50 records, and I want to display 10 records at a time to the user.
Since ID is not populated in auto-increment fashion, I cannot use the BETWEEN operator on it, and ROWNUM doesn't help much either.
Kindly Help.
Regards,
Ankit Agarwal
SELECT ID, Name
from (
select id,name, ROW_NUMBER() over( order by name) r
from member
where name like 'A%'
)
WHERE R between FromRowNum AND ToRowNum;
See http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:76812348057
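The FromRowNum and ToRowNum placeholders in the query above can be derived from a page number. The sketch below shows one way to do that, using Python's sqlite3 as a stand-in for Oracle (it needs SQLite 3.25 or later for ROW_NUMBER). The fetch_page helper, the page size default, and the sample member rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (id INTEGER, name TEXT)")
# Hypothetical data: 25 names starting with 'A' plus one that does not match,
# with ids deliberately not in name order.
names = [f"A{n:02d}" for n in range(1, 26)] + ["Bob"]
conn.executemany(
    "INSERT INTO member VALUES (?, ?)",
    [(1000 - i * 7, name) for i, name in enumerate(names)],
)

def fetch_page(page, page_size=10):
    """Return the rows of the given 1-based page, ordered by name."""
    first = (page - 1) * page_size + 1   # plays the role of FromRowNum
    last = page * page_size              # plays the role of ToRowNum
    query = """
    SELECT id, name
    FROM (
        SELECT id, name, ROW_NUMBER() OVER (ORDER BY name) AS r
        FROM member
        WHERE name LIKE 'A%'
    )
    WHERE r BETWEEN ? AND ?
    ORDER BY r
    """
    return conn.execute(query, (first, last)).fetchall()

page_one = fetch_page(1)
page_three = fetch_page(3)
print(page_one)
print(page_three)
```

The row numbers are assigned after the filter on name, so pages stay contiguous no matter how the id values are distributed.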
