How can I calculate the total number of timestamps for a given ID and code in Hive - hadoop

We have code and products across different domains. We are tracking different timestamps at which the products transmit data. This how my data looks:
| timestamp_s | Product | Code | _c3 |
| 2017-01-01 01:18:04.40736 | A | 119 | 1 |
| 2017-01-01 01:18:05.20419 | A | 119 | 1 |
| 2017-01-01 01:18:11.21268 | A | 119 | 1 |
| 2017-01-01 10:48:22.52147 | A | 119 | 1 |
I want to find the total timestamps recorded for a product and code across different timestamps. Meaning like in the above case, the output count of timestamps for 2017-01-01 should be 4 for a unique product A and code 119 combination.
Basically, I want to see the total number of timestamp records for each day (meaning, the total count of all timestamp record(s) for 2017-01-01 as in the above case).

Use simple groupby+count distinct:
select count(distinct timestamp_s) ts_count, Product, Code
from table_name
group by Product, Code
Or if you need to have additional column with count on the same grain (without group by) use analytic function:
select timestamp_s, count(distinct timestamp_s) over(partition by Product, Code) ts_count, Product, Code
from table_name

Related

How to write huge table data to file | Informatica 10.x

I have created a Informatica flow
where I need to read data from table that to only one column which contain empids.
But the column might contain duplicate need to write distinct values to file from below query
Query :
select distinct
emp_id
from
employee
where
empid not in
(
select distinct
custid
from
customer
);
I have added the above query in Source Qualifier
employee table contains : 5 million records and customer table contains : 20 billion records
My Informatica is still running not got completed - 6 hours over till now and nothing is written to file because of huge data in both tables
Following is my query plan
--------------------------------------------------------------------
Id | Operation | Name |
--------------------------------------------------------------------
0 | SELECT STATEMENT | |
1 | AX COORDINATOR | |
2 | AX SEND QC (RANDOM) | :AQ10002 |
3 | HASH UNIQUE | |
4 | AX RECEIVE | |
5 | AX SEND HASH | :AQ10001 |
6 | HASH UNIQUE | |
7 | HASH JOIN ANTI | |
8 | AX RECEIVE | |
9 | AX SEND PARTITION (KEY) | :AQ10000 |
10 | AX SELECTOR | |
11 | INDEX FAST FULL SCAN | PK_EMP_ID |
12 | AX PARTITION RANGE ALL | |
13 | INDEX FAST FULL SCAN | PK_CUST_ID |
--------------------------------------------------------------------
Sample table data :
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output :
123
678
Any solution is much appreciated !!!
It seems to me the SQL is the problem. If you dont have anything like sorter/aggregator, you dont have to worry about infa.
SQL seems to be having expensive operations. You can try below -
select emp_id
from employee
where not exists
(select 1 from customer where custid =emp_id)
This should be little faster because
you arent doing a subquery to get distinct from a 20billion customer table.
you dont need to use any distinct in first select because you are selecting from emp table where that emp id is unique. And not exist will make sure no duplicates coming out of first select.
You can also use left join +where but i think it will be expensive because of join-induced duplicates.
I would start with partitioning the customer table by hash or range(customer_id) or insert_date, this would speed up your inline select substantially.
Also try this:
select emp_id from employee
minus
select emp_id from employee e, customer c
where e.emp_id=c.custid;

AWS QuickSight filtering based on result of a query or other dataset

I want to create an analysis table in AWS Quicksight that shows the quantity sold in a given month and it's subsequent month based on users who made a purchase on the given month.
Let's say I have a dataset called user_orders with the following data:
+---------+----------+------------+
| user_id | quantity | order_date |
+---------+----------+------------+
| 1 | 2 | 2020-04-01 |
+---------+----------+------------+
| 1 | 3 | 2020-04-02 |
+---------+----------+------------+
| 1 | 1 | 2020-05-23 |
+---------+----------+------------+
| 1 | 2 | 2020-06-02 |
+---------+----------+------------+
| 2 | 1 | 2020-05-03 |
+---------+----------+------------+
| 2 | 1 | 2020-05-04 |
+---------+----------+------------+
| 3 | 2 | 2020-04-07 |
+---------+----------+------------+
| 3 | 1 | 2020-04-10 |
+---------+----------+------------+
| 3 | 1 | 2020-06-23 |
+---------+----------+------------+
For example, using the table above I want to be able to show how many quantities sold in April, May, June, and so on (max 12 months) by users who made a purchase in April.
The resulting table should look like this:
+-----------+----------+
| | quantity |
+-----------+----------+
| 04-2020 | 8 |
+-----------+----------+
| 05-2020 | 1 |
+-----------+----------+
| 06-2020 | 3 |
+-----------+----------+
8 sold in April because user_id 1 made 5 purchase and user_id 3 made 3 purchase while user_id 2 did not make any purchase.
There is only 1 item sold in May because only user_id 1 made the purchase in May, but also made a purchase in April. user_id 2 also made a purchase in May but didn't in April so it's not counted.
I can make the table above using PHP and MySQL fairly easily using the following code:
# first get all the user ids who made a purchase in April
$user_ids = sql_query("SELECT DISTINCT user_id FROM user_orders WHERE order_date BETWEEN '2020-04-01' AND '2020-04-30'");
# get the quantity sold for each month by users who made a purchase in April
$purchases = sql_query("SELECT MONTH(order_date), SUM(quantity) FROM user_orders WHERE user_id IN ({$user_ids}) AND order_date BETWEEN '2020-04-01' AND '2021-03-31' GROUP BY MONTH(order_date);")
(Obviously, April is just an example, I'd like to be able to change the starting month dynamically using QuickSight control)
As my above example shown, it requires two queries to perform this analysis. First, is to get the user_ids of the users, and the next is to actually get the quantity sold by the users.
I have been trying to achieve this using Quicksight for the last 3 days but hasn't found any way yet.
I hope someone can point me in the right direction.
Thank you!
You can achieve this by creating a calculated field like this and filtering on it
distinctCountOver(ifelse(truncDate('MM', {order_Date}) = parseDate('2020-04-01'), 1, NULL), [{user_id}], PRE_AGG)
(ofcourse, you can change the parseDate portion to be your date parameter)
Now, lets say the name of the above calculated field is SpecificMonthUser. You can add a filter sum(SpecificMonthUser) != 0.
And then create a pivot table visualization with OrderDate, user id in the rows and sum(quantity) in the values. You should get the desired result.

Oracle 11g insert into select from a table with duplicate rows

I have one table that need to split into several other tables.
But the main table is just like a transitive table.
I dump data from a excel into it (from 5k to 200k rows) , and using insert into select, split into the correct tables (Five different tables).
However, the latest dataset that my client sent has records with duplicates values.
The primary key usually is ENI for my table. But even this record is duplicated because the same company can be a customer and a service provider, so they have two different registers but use the same ENI.
What i have so far.
I found a script that uses merge and modified it to find same eni and update the same main_id to all
|Main_id| ENI | company_name| Type
| 1 | 1864 | JOHN | C
| 2 | 351485 | JOEL | C
| 3 | 16546 | MICHEL | C
| 2 | 351485 | JOEL J. | S
| 1 | 1864 | JOHN E. E. | C
Main_id: Primarykey that the main BD uses
ENI: Unique company number
Type: 'C' - COSTUMER 'S' - SERVICE PROVIDERR
Some Cases it can have the same type. just like id 1
there are several other Columns...
What i need:
insert any of the main_id my other script already sorted, and set a flag on the others that they were not inserted. i cant delete any data i'll need to send these info to the costumer validate.
or i just simply cant make this way and go back to the good old excel
Edit: as a question below this is a example
|Main_id| ENI | company_name| Type| RANK|
| 1 | 1864 | JOHN | C | 1 |
| 2 | 351485 | JOEL | C | 1 |
| 3 | 16546 | MICHEL | C | 1 |
| 2 | 351485 | JOEL J. | S | 2 |
| 1 | 1864 | JOHN E. E. | C | 2 |
RANK - would be like the 1864 appears 2 times,
1st one found gets 1 second 2 and so on. i tryed using
RANK() OVER (PARTITION BY MAIN_ID ORDER BY ENI)
RANK() OVER (PARTITION BY company_name ORDER BY ENI)
Thanks to TEJASH i was able to come up with this solution
MERGE INTO TABLEA S
USING (Select ROWID AS ID,
row_number() Over(partition by eniorder by eni, type) as RANK_DUPLICATED
From TABLEA
) T
ON (S.ROWID = T.ID)
WHEN MATCHED THEN UPDATE SET S.RANK_DUPLICATED= T.RANK_DUPLICATED;
As far as I understood your problem, you just need to know the duplicate based on 2 columns. You can achieve it using analytical function as follows:
Select t.*,
row_number() Over(partition by main_id, eni order by company_name) as rnk
From your_table t

(Nested?) Select statement with MAX and WHERE clause

I'm cranking my head on a set of data in order to generate a report from a Oracle DB.
Data are in two tables:
SUPPLY
DEVICE
There is only one column that links the two tables:
SUPPLY.DEVICE_ID
DEVICE.ID
In SUPPLY, there are these data: (Markdown is not working well. it's supposed to show a table)
| DEVICE_ID | COLOR_TYPE | SERIAL | UNINSTALL_DATE |
|----------- |------------ |-------------- |--------------------- |
| 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
| 758 | 4 | CAP67543 | 30/09/2016,11:00:00 |
In DEVICE, there are columns that I've to select all (more or less), but each row is unique.
What i need to achieve is:
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
JOINED with
more or less all the columns in DEVICE.
At the end I should have something like this:
| ACCOUNT_CODE | MODEL | DEVICE.SERIAL | DEVICE_ID | COLOR_TYPE | SUPPLY.SERIAL | UNINSTALL_DATE |
|-------------- |------- |--------------- |----------- |------------ |--------------- |--------------------- |
| BUSTO | MS410 | LM753 | 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| MACCHI | MX310 | XC876 | 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| ASL_COMO | MX711 | AB123 | 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| ASL_COMO | MX711 | AB123 | 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| ASL_VARESE | X950 | DE8745 | 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
So far, using a nested select like:
SELECT DEVICE_ID,COLOR_TYPE,SERIAL,UNINSTALL_DATE FROM
(SELECT SELECT DEVICE_ID,COLOR_TYPE,SERIAL,UNINSTALL_DATE
FROM SUPPLY WHERE DEVICE_ID = '123456' ORDER BY UNINSTALL_DATE DESC)
WHERE ROWNUM <= 1
I managed to get the highest value on the UNISTALL_DATE column after trying MAX(UNISTALL_DATE) or HIGHEST(UNISTALL_DATE).
I tried also:
SELECT SUPPLY.DEVICE_ID, SUPPLY.COLOR_TYPE, ....
FROM SUPPLY,DEVICE WHERE SUPPLY.DEVICE_ID = DEVICE.ID
and it works, but gives me ALL the items, basically it's a merge of the two tables.
When I try to narrow the data selected, i get errors or a empty result.
I'm starting to wonder that it's not possible to obtain this data and i'm starting to export the data in excel and work from there, but I wish someone can help me before giving up...
Thank you in advance.
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
Use ROW_NUMBER function in this way:
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
This query marks most recent rows with RN=1
JOINED with more or less all the columns in DEVICE.
Just join the above query to DEVICE table
SELECT d.*,
x.COLOR_TYPE,
x.SERIAL,
x.UNINSTALL_DATE
FROM (
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
) x
JOIN DEVICE d
ON d.DEVICE_ID = x.DEVICE_ID AND x.RN=1
OK - so you could group by device_id, color_type and select max(uninstall_date) as well, and join to the other table. But you would miss the serial value for the most recent row (for each combination of device_id, color_type).
There are a few ways to fix that. Your attempt with rownum was close, but the problem is that you need to order within each "group" (by device_id, color_type) and get the first row from each group. I am sure someone will post a solution along those lines, using either row_number() or rank() or perhaps the analytic version of max(uninstall_date).
When you just need the "top" row from each group, you can use keep (dense_rank first/last) - which may be slightly more efficient - like so:
select device_id, color_type,
max(serial) keep (dense_rank last order by uninstall_date) as serial,
max(uninstall_date) as uninstall_date
from supply
group by device_id, color_type
;
and then join to the other table. NOTE: dense_rank last will pick up the row OR ROWS with the most recent (max) date for each group. If there are ties, that is more than one row; the serial will then be the max (in lexicographical order) among those rows with the most recent date. You can also select min, or add some order so you pick a specific one (you didn't discuss this possibility).
SELECT
d.ACCOUNT_CODE, d.DNS_HOST_NAME,d.IP_ADDRESS,d.MODEL_NAME,d.OVERRIDE_SERIAL_NUMBER,d.SERIAL_NUMBER,
s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
SELECT s.DEVICE_ID, s.LAST_LEVEL_READ, s.SERIAL_NUMBER,TRUNC(s.UNINSTALL_TIME), row_number()
OVER (
PARTITION BY DEVICE_ID, COLOR
ORDER BY UNINSTALL_TIME DESC
) As RN
FROM SUPPLY s
WHERE s.UNINSTALL_TIME IS NOT NULL AND s.SERIAL_NUMBER IS NOT NULL
)
JOIN DEVICE d
ON d.ID = s.DEVICE_ID AND s.RN=1;
#krokodilko: thank you very much for your help. First query works. Modified it in order to remove junk, putting real columns name i need (yesterday evening i had no access to the DB) and getting only the data I need.
Unfortunately, when I join the two tables as you suggested I get error:
ORA-00904: "S"."RN": invalid identifier
00904. 00000 - "%s: invalid identifier"
If i remove s. before RN, the ORA-00904 moves back to s.DEVICE_ID.

Hive Contiguous Date Ranges

I am using Hive and I would like to take a table with a historical list of customers, subscription events, and subscription types and summarize by contiguous runs of subscription types for each customer.
Example Input (db.cust_hist):
customer_id | eff_dt | exp_dt | sub_cd | sub_type
---------------------------------------------------------
1 | 02/01/2015 | 03/01/2015 | active | A
1 | 03/01/2015 | 04/01/2015 | active | A
1 | 03/15/2015 | 12/31/9999 | cancel | A
1 | 04/01/2015 | 05/01/2015 | active | A
1 | 05/01/2015 | 06/01/2015 | active | A
1 | 02/01/2015 | 03/01/2015 | active | B
1 | 03/01/2015 | 04/01/2015 | active | B
The sub_cd in this case refers to the type of event that is effective over the date range for that row. For example, the user canceled their A subscription type on 3/15 and resumed on 4/01.
The output I'm trying to get looks like this (db.cust_snapshot):
customer_id | eff_dt | exp_dt | sub_type
------------------------------------------------
1 | 02/01/2015 | 03/15/2015 | A
1 | 04/01/2015 | 06/01/2015 | A
1 | 02/01/2015 | 04/01/2015 | B
and reflects the gap in coverage.
From what I have read in this link from BetterAtOracle (specific to SQL) which does a very good job of laying things out, I need to use row numbers and a lagging window, but I can't seem to apply it to my situation in Hive (perhaps because of the 12/31/9999 notation/subscription code?)
I tried:
SELECT customer_id
, eff_dt
, exp_dt
, sub_cd
, sub_type
, CASE WHEN DATEDIFF(TO_DATE(eff_dt), TO_DATE(lag(exp_dt) OVER (PARTITION BY customer_id, sub_type ORDER BY eff_dt)) <=1 THEN NULL
ELSE row_number() OVER(PARTITION BY customer_id, sub_type ORDER BY eff_dt)
END) as grp
FROM db.cust_hist
ORDER BY TO_DATE(eff_dt)
As you can see, I haven't applied the subscription event code. This sort of gets me there as I can start to see different groups based on subscription type, but I feel like I'm stuck from here on out.
Any help or pointers would be greatly appreciated. Before this task, I never understood the true power of ranks, rows, lag, and other window functions!

Resources