sort_array order by a different column, Hive - sorting

I have two columns, one of products, and one of the dates they were bought. I am able to order the dates by applying the sort_array(dates) function, but I want to be able to sort_array(products) by the purchase date.
Is there a way to do that in Hive?
The table tablename is:
ClientID Product Date
100 Shampoo 2016-01-02
101 Book 2016-02-04
100 Conditioner 2015-12-31
101 Bookmark 2016-07-10
100 Cream 2016-02-12
101 Book2 2016-01-03
Then, getting one row per customer:
select
clientID,
COLLECT_LIST(Product) as Prod_List,
sort_array(COLLECT_LIST(date)) as Date_Order
from tablename
group by 1;
Which gives:
ClientID Prod_List Date_Order
100 ["Shampoo","Conditioner","Cream"] ["2015-12-31","2016-01-02","2016-02-12"]
101 ["Book","Bookmark","Book2"] ["2016-01-03","2016-02-04","2016-07-10"]
But what I want is the order of the products to be tied to the correct chronological order of purchases.

It is possible to do it using only built-in functions, but it is not a pretty sight :-) The trick is to prefix each product with its purchase date (date:product), sort those combined strings (which sorts them chronologically), and then strip the date: prefixes with regexp_replace.
select clientid
,split(regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',cast(date as string),product)))),'[^:]*:([^,]*(,|$))','$1'),',') as prod_list
,sort_array(collect_list(date)) as date_order
from tablename
group by clientid
;
+----------+-----------------------------------+------------------------------------------+
| clientid | prod_list | date_order |
+----------+-----------------------------------+------------------------------------------+
| 100 | ["Conditioner","Shampoo","Cream"] | ["2015-12-31","2016-01-02","2016-02-12"] |
| 101 | ["Book2","Book","Bookmark"] | ["2016-01-03","2016-02-04","2016-07-10"] |
+----------+-----------------------------------+------------------------------------------+
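A more readable alternative is to sort the rows before collecting them. Note this is a sketch that relies on collect_list preserving the order of the sorted subquery, which is common practice in Hive but not formally guaranteed:
select clientid
,collect_list(product) as prod_list
,collect_list(date) as date_order
from (select clientid, product, date
from tablename
distribute by clientid
sort by clientid, date) t
group by clientid
;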

Related

Nested Queries with Select Statement

I am new to Oracle. I have a table where multiple restriction groups are assigned to users. Every user belongs to a userregionID.
I have to display a list of userregionID where users are assigned more than 1 restriction group.
My Tables
user - Id, userregionid
userRestriction - userId, restrictionGroup
For example,
User Table
Id | UserRegionId
EID-999 | 12345
EID-888 | 12345
D-900 | 2322
F-943 | 6767
UserRestriction Table
UserId | RestrictionGroup
EID-999 | A1
EID-888 | B1
EID-999 | C1
F-943 | Z1
F-943 | X1
So, my output should come like
UserRegionId | Count of Users having restriction Group >1
12345 | 1
6767 | 1
because users EID-999 and F-943 belong to userregionId 12345 and 6767 respectively, and each is assigned more than 1 restriction group.
My Effort
I have written a query that displays the list of users having > 1 restrictionGroup within the same userregionID,
but I am clueless about how to convert this query into a nested query that fetches only the userregionID and count from the entire database.
My query
select distinct ec.userId, e.userregionid,
count(distinct ec.restrictionGroup) over (partition by ec.userId)
from user e, userRestriction ec
where e.userregionid = '12345' and e.Id= ec.userId
You might not need a nested query here; an INNER JOIN as below can help you (the user table appears as USR because USER is a reserved word in Oracle).
select u.userregionid, count(ur.userId)
from userRestriction ur, USR u
where ur.userId=u.id
group by ur.userId , u.userregionid
having count(ur.userId) >1;
PS: A DB-Fiddle here can help you to visualize.
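If you need exactly the output in the question (one row per userregionid with the number of users that have more than one restriction group), a nested query along these lines should work. This is a sketch reusing the USR naming from above; adjust names to your schema:
select u.userregionid, count(*) as multi_group_users
from USR u
join (select userId
from userRestriction
group by userId
having count(distinct restrictionGroup) > 1) m
on m.userId = u.id
group by u.userregionid;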

How to load data to same Hive table if file has different number of columns

I have a main table (Employee) with 10 columns, and I can load data into it using load data inpath '/file1.txt' into table Employee.
My question is how to handle the same table (Employee) when my file file2.txt has the same columns but column 3 and column 5 are missing. If I load it directly, the last columns will be NULL NULL; instead, the 3rd column and the 5th column should be loaded as NULL.
Suppose I have a table Employee and I want to load the file1.txt and file2.txt to table.
file1.txt
==========
id name sal deptid state country
1 aaa 1000 01 TS india
2 bbb 2000 02 AP india
3 ccc 3000 03 BGL india
file2.txt
id name deptid country
1 second 001 US
2 third 002 ENG
3 forth 003 AUS
In file2.txt we are missing 2 columns, i.e. sal and state.
We need to use the same Employee table; how do we handle it?
I'm not aware of any way to create a table backed by data files with a non-homogeneous structure. What you can do, however, is define separate tables for the different column configurations and then define a view that queries both.
I think it's easier if I provide an example. I will use two tables of people, both have a column for name, but one stores height as well, while the other stores weight instead:
> create table table1(name string, height int);
> insert into table1 values ('Alice', 178), ('Charlie', 185);
> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);
> create view people as
> select name, height, NULL as weight from table1
> union all
> select name, NULL as height, weight from table2;
> select * from people order by name;
+---------+--------+--------+
| name | height | weight |
+---------+--------+--------+
| Alice | 178 | NULL |
| Bob | NULL | 98 |
| Charlie | 185 | NULL |
| Denise | NULL | 52 |
+---------+--------+--------+
Or as a closer example to your problem, let's say that one table has name, height and weight, while the other only has name and weight, thereby height is "missing from the middle":
> create table table1(name string, height int, weight int);
> insert into table1 values ('Alice', 178, 55), ('Charlie', 185, 78);
> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);
> create view people as
> select name, height, weight from table1
> union all
> select name, NULL as height, weight from table2;
> select * from people order by name;
+---------+--------+--------+
| name | height | weight |
+---------+--------+--------+
| Alice | 178 | 55 |
| Bob | NULL | 98 |
| Charlie | 185 | 78 |
| Denise | NULL | 52 |
+---------+--------+--------+
Be sure to use union all and not just union, because the latter tries to remove duplicate rows, which makes it very expensive.
It seems like there is no way to directly load into specified columns.
As such, this is what you probably need to do:
1. Load data inpath into a (temporary?) table that matches the file.
2. Insert into the relevant columns of the final table by selecting the contents of the previous table (see the sketch below).
The situation is very similar to this question which covers the opposite scenario (you only want to load a few columns).
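For illustration, a minimal sketch of that two-step load for file2.txt, assuming the six-column layout of the example, tab-delimited files, and a staging table named employee_stage (adjust types and the delimiter to your real schema):
-- staging table matching file2.txt's four columns
create table employee_stage (id int, name string, deptid string, country string)
row format delimited fields terminated by '\t';
load data inpath '/file2.txt' into table employee_stage;
-- fill the missing sal and state columns with typed NULLs
insert into table Employee
select id, name, cast(null as int) as sal, deptid, cast(null as string) as state, country
from employee_stage;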

Build adjacency matrix from list of weighted edges in BigQuery

Related issue:
How to create dummy variable columns for thousands of categories in Google BigQuery
I have a table of weighted edges, i.e. a list of user-item ratings; it looks like this:
| userId | itemId | rating
| 001 | 001 | 5.0
| 001 | 002 | 4.0
| 002 | 001 | 4.5
| 002 | 002 | 3.0
I want to convert this weighted edge list into an adjacency matrix:
| userId | item001 | item002
| 001 | 5.0 | 4.0
| 002 | 4.5 | 3.0
According to this post, we can do it in two steps: the first step is to generate a query from the matrix entries' values, and the second step is to run the query generated in the first step.
But my question is how to extract the rating value and use it in the IF() statement? My intuition is to put a nested query inside the IF() statement, like:
IF(itemId = blah,
(select rating
from mytable
where
userId = blahblah
and itemId = blah),
0)
But this query looks too expensive. Can someone give me an example?
Thanks
Unless I am missing something, it is quite similar to the post you referenced.
Step 1 - generate query
SELECT 'SELECT userID, ' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(itemId = "' + STRING(itemId) + '", rating, 0)) AS item' + STRING(itemId)
)
+ ' FROM YourTable GROUP BY userId'
FROM (
SELECT itemId
FROM YourTable
GROUP BY itemId
)
Step 2 - run generated query
SELECT
userID,
SUM(IF(itemId = "001", rating, 0)) AS item001,
SUM(IF(itemId = "002", rating, 0)) AS item002
FROM YourTable
GROUP BY userId
Result as expected
userID item001 item002
001 5.0 4.0
002 4.5 3.0
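As a side note, in today's BigQuery Standard SQL the same generate-then-run pattern fits into one script via EXECUTE IMMEDIATE. A sketch, assuming itemId is a STRING column and mydataset.ratings is a placeholder table name:
DECLARE pivot_sql STRING;
SET pivot_sql = (
SELECT CONCAT(
'SELECT userId, ',
STRING_AGG(FORMAT('SUM(IF(itemId = "%s", rating, 0)) AS item%s', itemId, itemId), ', '),
' FROM mydataset.ratings GROUP BY userId')
FROM (SELECT DISTINCT itemId FROM mydataset.ratings)
);
EXECUTE IMMEDIATE pivot_sql;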

Inner join with 3 tables and 2 sum groups

I have 3 tables: COMPANY, TRAINING TICKET, and TEST SESSION.
COMPANY table:
COMPANY CODE | COMPANY NAME
192 ABC ENTERPRISE
299 XYZ ENTERPRISE
TRAINING TICKET table:
TICKET ID | COMPANY CODE | START DATE
2900 192 2015-02-02
3939 192 2015-03-03
4399 299 2015-03-02
TEST SESSION table:
TEST CODE | TICKET ID | COMPANY CODE | CERTIFIED
1221 2900 192 YES
2821 3939 192 NULL
3922 4399 299 YES
I need something like this:
C. CODE | COMPANY NAME | 1ST START DATE | TRAINING TICKET TOTAL | CERTIFIED TOTAL
192 ABC ENTERPRISE 2015-02-02 2 1
299 XYZ ENTERPRISE 2015-03-02 1 1
Is it possible?
My SQL statement is:
Select *, count(TICKET.CCODE) AS TICKET_TOTAL, count(TEST.CODE) AS CERT_TOTAL
from TICKET
Inner Join COMPANY on TICKET.CCODE = COMPANY.CCODE
Inner Join TEST on COMPANY.CCODE = TEST.CCODE
Group by (TICKET.CCODE),(TEST.CCODE)
Order by TICKET_TOTAL DESC
but both counts are always equal (same result for TICKET_TOTAL and CERT_TOTAL) and the totals are wrong: the expected result is TICKET_TOTAL = 21 and CERT_TOTAL = 28, but I got 523 for the top company.
I got the answer:
Select COMPANY.CODE, COMPANY.NAME,
MIN(TICKET.STARTDATE) AS FIRST_START_DATE, count(TICKET.TICKETID) AS TICKET_TOTAL,
count(TEST.CERTIFIED) AS CERT_TOTAL
from COMPANY
INNER JOIN TICKET ON COMPANY.CODE = TICKET.CCODE
LEFT JOIN TEST ON TICKET.TICKETID = TEST.TICKET
Group by COMPANY.CODE, COMPANY.NAME
ORDER BY TICKET_TOTAL DESC
1- Reorder and start the statement from the COMPANY table
2- MIN(TICKET.STARTDATE) to get the first start date (use MAX to get the last start date if necessary)
3- Change the Inner Join to a Left Join (because some companies have a ticket in the TICKET table but do not have a test in the TEST table)
Hope this can help someone in the future!
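One caveat: if a ticket can ever have more than one row in the TEST table, the plain counts above get inflated by the join. A hedged variant of the same query using distinct counts (column names as in the answer) guards against that:
Select COMPANY.CODE, COMPANY.NAME,
MIN(TICKET.STARTDATE) AS FIRST_START_DATE,
count(distinct TICKET.TICKETID) AS TICKET_TOTAL,
count(distinct case when TEST.CERTIFIED = 'YES' then TEST.CODE end) AS CERT_TOTAL
from COMPANY
INNER JOIN TICKET ON COMPANY.CODE = TICKET.CCODE
LEFT JOIN TEST ON TICKET.TICKETID = TEST.TICKET
Group by COMPANY.CODE, COMPANY.NAME
ORDER BY TICKET_TOTAL DESC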

How many Includes can I use on an ObjectSet in Entity Framework and retain performance?

I am using the following LINQ query for my profile page:
var userData = from u in db.Users
.Include("UserSkills.Skill")
.Include("UserIdeas.IdeaThings")
.Include("UserInterests.Interest")
.Include("UserMessengers.Messenger")
.Include("UserFriends.User.UserSkills.Skill")
.Include("UserFriends1.User1.UserSkills.Skill")
.Include("UserFriends.User.UserIdeas")
.Include("UserFriends1.User1.UserIdeas")
where u.UserId == userId
select u;
It has a long object graph and uses many Includes. It is running perfectly right now, but will it impact performance much once the site has many users?
Should I do it in some other way?
A query with Includes returns a single result set, and the number of Includes affects how big a data set is transferred from the database server to the web server. Example:
Suppose we have an entity Customer (Id, Name, Address) and an entity Order (Id, CustomerId, Date). Now we want to query a customer with her orders:
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date
---------------------------------------------------
1 | A | XYZ | 1 | 1 | 1.1.
1 | A | XYZ | 2 | 1 | 2.1.
It means that the Customer's data are repeated for each Order. Now let's extend the example with two more entities: OrderLine (Id, OrderId, ProductId, Quantity) and Product (Id, Name). Now we want to query a customer with her orders, order lines and products:
var customer = context.Customers
.Include("Orders.OrderLines.Product")
.SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date | OrderLineId | LOrderId | LProductId | Quantity | ProductId | ProductName
------------------------------------------------------------------------------------------------------------------------------
1 | A | XYZ | 1 | 1 | 1.1. | 1 | 1 | 1 | 5 | 1 | AA
1 | A | XYZ | 1 | 1 | 1.1. | 2 | 1 | 2 | 2 | 2 | BB
1 | A | XYZ | 2 | 1 | 2.1. | 3 | 2 | 1 | 4 | 1 | AA
1 | A | XYZ | 2 | 1 | 2.1. | 4 | 2 | 3 | 6 | 3 | CC
As you can see, the data becomes heavily duplicated. Generally, each Include of a reference navigation property (Product in the example) adds new columns, and each Include of a collection navigation property (Orders and OrderLines in the example) adds new columns and duplicates the already created rows for each row in the included collection.
It means that your example can easily produce hundreds of columns and thousands of rows, which is a lot of data to transfer. The correct approach is to create performance tests, and if the results do not satisfy your expectations, modify your query and load navigation properties separately with their own queries or with the LoadProperty method.
Example of separate queries:
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == 1);
var orderLines = context.OrderLines
.Include("Product")
.Where(l => l.Order.Customer.Id == 1)
.ToList();
Example of LoadProperty:
var customer = context.Customers
.SingleOrDefault(c => c.Id == 1);
context.LoadProperty(customer, c => c.Orders);
Also, you should always load only the data you really need.
Edit: I just created a proposal on Data UserVoice to support an additional eager-loading strategy, where eager-loaded data would be passed in an additional result set (created by a separate query within the same database roundtrip). If you find this improvement interesting, don't forget to vote for the proposal.
You can improve the performance of many Includes by making 2 or more small data requests to the database, as below.
In my experience, about 2 Includes per query is the maximum; more than that gives really bad performance.
var userData = db.Users
.Include("UserSkills.Skill")
.Include("UserIdeas.IdeaThings")
.FirstOrDefault(u => u.UserId == userId);
userData = db.Users
.Include("UserFriends.User.UserSkills.Skill")
.Include("UserFriends1.User1.UserSkills.Skill")
.FirstOrDefault(u => u.UserId == userId);
The above brings smaller data sets from the database at the cost of more trips to the database.
Yes it will. Avoid using Include if it expands multiple detail rows on a master table row.
I believe EF converts the query into one large join instead of several queries. Therefore, you'll end up duplicating your master table data over every row of the details table.
For example: Master -> Details. Say, master has 100 rows, Details has 5000 rows (50 for each master).
If you lazy-load the details, you return 100 rows (size: master) + 5000 rows (size: details).
If you use .Include("Details"), you return 5000 rows (size: master + details). Essentially, the master portion is duplicated over 50 times.
It multiplies upwards if you include multiple tables.
Check the SQL generated by EF.
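To make the duplication concrete, the eager-loaded query is shaped roughly like the single join below. This is a sketch of the shape only, not EF's actual generated SQL, which is considerably more verbose:
-- rough shape of the SQL behind context.Customers.Include("Orders") for customer 1;
-- every Customers column repeats once per matching Orders row
SELECT c.Id, c.Name, c.Address, o.Id AS OrderId, o.CustomerId, o.Date
FROM Customers c
LEFT OUTER JOIN Orders o ON o.CustomerId = c.Id
WHERE c.Id = 1;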
I would recommend you to perform load tests and measure the performance of the site under stress. If you are performing complex queries on each request you may consider caching some results.
The result of Include may change: it depends on the entity that calls the Include method.
Like the example proposed by Ladislav Mrnka, suppose that we have an entity
Customer (Id, Name, Address)
that map to this table:
Id | Name | Address
-----------------------
C1 | Paul | XYZ
and an entity Order (Id, CustomerId, Total)
that map to this table:
Id | CustomerId | Total
-----------------------
O1 | C1 | 10.00
O2 | C1 | 13.00
The relation is one Customer to many Orders
Example 1: Customer => Orders
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == "C1");
The LINQ will be translated into a fairly complex SQL query.
In this case the query produces two records, and the information about the customer is replicated.
Customer.Id | Customer.Name | Order.Id | Order.Total
-----------------------------------------------------------
C1 | Paul | O1 | 10.00
C1 | Paul | O2 | 13.00
Example 2: Order => Customer
var order = context.Orders
.Include("Customers")
.SingleOrDefault(c => c.Id == "O1");
The LINQ will be translated into a simple SQL join.
In this case the query produces only one record, with no duplication of information:
Order.Id | Order.Total | Customer.Id | Customer.Name
-----------------------------------------------------------
O1 | 10.00 | C1 | Paul
