In BigQuery, using data from GA, I am trying to find the pagetype (based on pagepath) with the maximum number of hits within each session for a user. This will be used to determine which pagetype had the most activity in a session (I want only one, hence the max).
Using ROW_NUMBER to assign a rank to each pagetype within a session and filtering for rank 1 works for one user. When I try to run it on the bigger dataset (~400 GB), I get the 'Resources exceeded....' error.
I'm new to BigQuery and would appreciate any tips to optimize this code.
SELECT
  userid,
  sessionid,
  pagetype,
  hits
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY userid, sessionid ORDER BY hits DESC) rnk
  FROM (
    SELECT
      userid,
      sessionid,
      pagetype,
      COUNT(1) AS hits
    FROM
      [xxxxxxx]
    GROUP BY
      userid,
      sessionid,
      pagetype
    ORDER BY
      sessionid,
      hits DESC ) )
WHERE
  rnk = 1
Using standard SQL, you can write a query such as:
#standardSQL
SELECT
  first_session.*
FROM (
  SELECT
    ARRAY_AGG(
      STRUCT(userid, sessionid, pagetype, hits)
      ORDER BY hits DESC LIMIT 1
    )[OFFSET(0)] AS first_session
  FROM (
    SELECT
      userid,
      sessionid,
      pagetype,
      COUNT(*) AS hits
    FROM `xxxxxxx`
    GROUP BY
      userid,
      sessionid,
      pagetype
  )
  GROUP BY userid, sessionid
);
This builds a struct with the relevant columns for each (userid, sessionid) group and keeps only the top one, as determined by the highest hit count.
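For comparison, here is a sketch of the same logic using ROW_NUMBER in standard SQL (untested; the ARRAY_AGG form above is generally preferred because the per-group LIMIT 1 lets BigQuery drop rows early instead of ranking every row):
#standardSQL
SELECT userid, sessionid, pagetype, hits
FROM (
  SELECT
    userid,
    sessionid,
    pagetype,
    COUNT(*) AS hits,
    ROW_NUMBER() OVER (PARTITION BY userid, sessionid ORDER BY COUNT(*) DESC) AS rnk
  FROM `xxxxxxx`
  GROUP BY userid, sessionid, pagetype
)
WHERE rnk = 1;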
I want to create a materialized view in ClickHouse that stores the final result of an aggregation function. The best practice is to store the aggregation state and calculate the final result at query time, but that is too costly at query time for my use case.
Base table:
CREATE TABLE IF NOT EXISTS active_events
(
    `event_name` LowCardinality(String),
    `user_id` String,
    `post_id` String
)
ENGINE = MergeTree -- engine/ordering not shown in the question; assumed here for completeness
ORDER BY (event_name, post_id)
My current materialization:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
    `post_id` String,
    `event_name` LowCardinality(String),
    `unique_users_state` AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
POPULATE AS
SELECT
    post_id,
    event_name,
    uniqState(user_id) AS unique_users_state
FROM active_events
GROUP BY post_id, event_name;
And then at query time, I can use uniqMerge to calculate the exact number of users who've done a certain event.
I don't mind a small delay in the materialization, but I want the final result to be calculated during ingestion rather than at query time.
Here's the query:
SELECT post_id, sumIf(total, event_name = 'click') / sumIf(total, event_name = 'impression') as ctr
FROM (
SELECT post_id, event_name, uniqMerge(unique_users_state) as total
FROM inventory
WHERE event_name IN ('click', 'impression')
GROUP BY post_id, event_name
) as res
GROUP BY post_id
HAVING ctr > 0.1
ORDER BY ctr DESC
It's literally impossible.
Imagine you insert some user_id, say 3456, into the table. How many unique users is that? One. But you cannot store the number 1, because if you insert 3456 again the answer must still be 1. So CH stores states, and those states are HLL (HyperLogLog) structures that are never fully aggregated/finalized, because you may query GROUP BY event_name, GROUP BY event_name, post_id, or with no GROUP BY at all.
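For example, the same stored states from your inventory view can be merged at any granularity at query time, which is exactly why they cannot be finalized at insert time:
SELECT event_name, post_id, uniqMerge(unique_users_state) FROM inventory GROUP BY event_name, post_id;
SELECT event_name, uniqMerge(unique_users_state) FROM inventory GROUP BY event_name;
SELECT uniqMerge(unique_users_state) FROM inventory; -- no GROUP BY at all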
There is another reason why your query can be slow: my guess is that the issue is index_granularity, and CH reads a lot of excessive data from disk when you filter with WHERE event_name = ...
It can be solved like this:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
    `post_id` String,
    `event_name` LowCardinality(String),
    `unique_users_state` AggregateFunction(uniq, String) CODEC(NONE) -- uniq is not compressible in 99% of cases
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256 -- 256 instead of the default 8192
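With event_name leading the ORDER BY and the smaller granularity, a filtered merge such as the following (a sketch using the same columns as in your view) touches far fewer granules:
SELECT post_id, uniqMerge(unique_users_state) AS total
FROM inventory
WHERE event_name = 'click'
GROUP BY post_id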
Another approach is to use a different HLL function, because uniq is too heavy.
Try this:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
    `post_id` String,
    `event_name` LowCardinality(String),
    `unique_users_state` AggregateFunction(uniqCombined64(14), String) CODEC(NONE)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256
POPULATE AS
SELECT
    post_id,
    event_name,
    uniqCombined64State(14)(user_id) AS unique_users_state
FROM active_events
GROUP BY post_id, event_name;
select uniqCombined64Merge(14)(unique_users_state)
from inventory
where event_name = ...
Attention: you need to use (14) in all three places: uniqCombined64(14) / uniqCombined64State(14) / uniqCombined64Merge(14).
uniqCombined64(14) is less accurate than uniq, but in some cases it is able to work 10-100x faster, with an error rate below 5%.
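For completeness, the CTR query from the question would then look roughly like this (a sketch assuming the uniqCombined64(14) version of the view above):
SELECT post_id,
       sumIf(total, event_name = 'click') / sumIf(total, event_name = 'impression') AS ctr
FROM (
    SELECT post_id, event_name, uniqCombined64Merge(14)(unique_users_state) AS total
    FROM inventory
    WHERE event_name IN ('click', 'impression')
    GROUP BY post_id, event_name
)
GROUP BY post_id
HAVING ctr > 0.1
ORDER BY ctr DESC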
Let me explain the question.
I have two tables which have 3 columns with the same data types. Together the 3 columns form a key/ID, if you like, but the names of the columns are different in the two tables.
Now I am creating queries with these 3 columns for both tables. I've managed to get these results independently.
For example:
SELECT ID, FirstColumn, sum(SecondColumn)
FROM (SELECT ABC||DEF||GHI AS ID, FirstTable.*
FROM FirstTable
WHERE ThirdColumn = *1st condition*)
GROUP BY ID, FirstColumn
;
SELECT ID, SomeColumn, sum(AnotherColumn)
FROM (SELECT JKM||OPQ||RST AS ID, SecondTable.*
FROM SecondTable
WHERE AlsoSomeColumn = *2nd condition*)
GROUP BY ID, SomeColumn
;
So I make very similar queries for the two tables. I know the results share a certain number of rows with the same ID attribute, the one I've just created in the queries. I need to check which rows in one result are not in the other query's result, and vice versa.
Do I have to make temporary tables or views from the queries? Or maybe join the two tables in a specific way and run only one query on them?
As a beginner, I don't have any experience using the result of one query as the input for the next. I'm interested in the cleanest, most elegant way to do this.
No, you most probably don't need any "temporary" tables; the WITH factoring clause (subquery factoring) would help.
Here's an example:
with
first_query as
(select id, first_column, ...
from (select ABC||DEF||GHI as id, ...)
),
second_query as
(select id, some_column, ...
from (select JKM||OPQ||RST as id, ...)
)
select id from first_query
minus
select id from second_query;
For the other result you'd just swap the two queries, e.g.
with ... <the same as above>
select id from second_query
minus
select id from first_query
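Putting it together with the column names from the question (the *condition* placeholders left as-is), the full statement could look roughly like this, untested:
with
first_query as
  (select id, FirstColumn, sum(SecondColumn) as total_second
   from (select ABC||DEF||GHI as id, FirstTable.*
         from FirstTable
         where ThirdColumn = *1st condition*)
   group by id, FirstColumn
  ),
second_query as
  (select id, SomeColumn, sum(AnotherColumn) as total_another
   from (select JKM||OPQ||RST as id, SecondTable.*
         from SecondTable
         where AlsoSomeColumn = *2nd condition*)
   group by id, SomeColumn
  )
select id from first_query
minus
select id from second_query;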
I am using CDH 5.4.4 (Cloudera edition). I have a CSV file in an HDFS location, and my requirement is to perform real-time SQL queries on the Hadoop environment (OLTP).
So I decided to go with Impala. I created a metastore table over the CSV file, then executed the query in the Impala editor (within the HUE application).
When I execute the query below, I get an error like:
"AnalysisException: all DISTINCT aggregate functions need to have the
same set of parameters as count(DISTINCT City); deviating function:
count(DISTINCT Country)".
CSV File
OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark
Problematic query
Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID
Problem:
I am unable to execute an Impala query with more than one DISTINCT aggregate in it. I have searched the internet and the suggested workaround is the NDV() function, but NDV only returns an approximate count of distinct values, and I need the exact unique count for more than one field.
Expectation:
What is the best way to get an exact unique count for more than one field? Kindly modify the above query to work with Impala.
Note: This is not my original table; I have replicated it for the forum question.
I've had the same problem in Impala. Here is my workaround:
SELECT CustomerID
,sum(nr_of_cities)
,sum(nr_of_countries)
FROM (
SELECT CustomerID
,Count(DISTINCT City) AS nr_of_cities
,0 AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
UNION ALL
SELECT CustomerID
,0 AS nr_of_cities
,Count(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
) AS aa
GROUP BY CustomerID
I think this can be done cleaner (untested):
WITH
countries AS
(
SELECT CustomerID
,COUNT(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY 1
)
,
cities AS
(
SELECT CustomerID
,COUNT(DISTINCT City) AS nr_of_cities
FROM CustomerOrders
GROUP BY 1
)
SELECT CustomerID
,nr_of_cities
,nr_of_countries
FROM cities INNER JOIN countries USING (CustomerID)
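As a sanity check against the sample CSV above, both approaches should return the same exact counts:
CustomerID | nr_of_cities | nr_of_countries
Cust01     | 2            | 2
Cust02     | 1            | 1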
I have a Hive table named 'Login'. It contains the following columns:
UserID | UserName | UserIP | UserCountry | Date
For a particular day (all logins of that day), I want to find the UserIDs that were accessed from a country (UserCountry) from which the user has never accessed his account before, OR from an IP (UserIP) from which the account has never been accessed before.
I would start with an EXCEPT where I remove the prior countries and IPs:
select userid, usercountry, userip
from table
where date=xx
except
select userid, usercountry, userip
from table
where date<xx
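If your Hive version does not support EXCEPT (it was only added in relatively recent Hive releases), a roughly equivalent sketch, untested, uses NOT EXISTS instead:
select t1.userid, t1.usercountry, t1.userip
from Login t1
where t1.date = xx
and not exists (
  select 1
  from Login t2
  where t2.userid = t1.userid
  and t2.usercountry = t1.usercountry
  and t2.userip = t1.userip
  and t2.date < xx
);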
I think that the best way is the GROUP BY clause!
You say "has never been accessed before", which means COUNT = 1.
To find the IPs used only once:
select UserId, UserIP, COUNT(UserIP)
FROM Login
WHERE Date = yourdate
GROUP BY UserIP, UserId
HAVING COUNT(UserIP) = 1
To find the countries used only once:
select UserId, UserCountry, COUNT(UserCountry)
FROM Login
WHERE Date = yourdate
GROUP BY UserCountry, UserId
HAVING COUNT(UserCountry) = 1
A LEFT OUTER JOIN will be able to satisfy your requirement in Hive.
select t1.userid, t1.usercountry, t1.userip
from table t1
LEFT OUTER JOIN
table t2
ON (t1.userid = t2.userid)
WHERE t1.date = xx and
t2.date < xx and
(t2.usercountry IS NULL or
t2.userip IS NULL);
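A variant of the same left-join idea (a sketch, untested, using the Login table name from the question) checks the country and IP histories separately, so a NULL on the history side really flags a value the user has never used before day xx:
select t1.userid, t1.usercountry, t1.userip
from Login t1
LEFT OUTER JOIN (select distinct userid, usercountry from Login where date < xx) c
ON (t1.userid = c.userid and t1.usercountry = c.usercountry)
LEFT OUTER JOIN (select distinct userid, userip from Login where date < xx) i
ON (t1.userid = i.userid and t1.userip = i.userip)
WHERE t1.date = xx
AND (c.userid IS NULL OR i.userid IS NULL);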
Hope this helps...
I need help! For example, there are four tables: cars, users, departments and join_user_department. The last table is used for an M:N relation between the user and department tables, because some users have limited access. I need to get the number of cars in the departments a user has access to. The cars table has a department_id column. If the table join_user_department doesn't have any record for a user_id, this means that he has access to all departments and the select query must run without any condition. I need to do something like this:
declare
DEP_NUM number;    -- count of departments where the user has access
CARS_COUNT number; -- count of cars
BEGIN
SELECT COUNT(*) into DEP_NUM from join_user_departments where user_id = ?;
SELECT COUNT(*) into CARS_COUNT FROM cars where
IF (DEP_NUM != 0) -- it means that the user's access is limited
THEN department_id IN (select dep_id from join_user_departments where user_id = ?);
END;
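A minimal sketch of that conditional expressed in a single statement (untested, keeping the ? placeholder for the user id from the pseudocode above):
SELECT COUNT(*) INTO CARS_COUNT
FROM cars c
WHERE NOT EXISTS (SELECT 1 FROM join_user_departments j WHERE j.user_id = ?) -- no rows: unlimited access
OR c.department_id IN (SELECT dep_id FROM join_user_departments j2 WHERE j2.user_id = ?);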
A user either has access to all cars (I'm assuming all cars are tied to a department, and the user has access to all departments) or the user has limited access. You can use a UNION ALL to bring these two groups together, and group by user to do a final count. I've cross joined the users with unlimited access to the cars table to associate them with all cars:
(UPDATED to also count the departments)
select user_id,
count(distinct department_id) as dept_count,
count(distinct car_id) as car_count
from (
select ud.user_id, ud.department_id, c.car_id
from user_departments ud
join cars c on c.department_id = ud.department_id
UNION ALL
select u.user_id, v.department_id, v.car_id
from user u
cross join (
select d.department_id, c.car_id
from department d
join cars c on c.department_id = d.department_id
) v
where not exists (
select 1 from user_departments ud
where ud.user_id = u.user_id
)
)
group by user_id
A UNION ALL is more efficient than a UNION; a UNION looks for records that fall into both groups and throws out duplicates. Since each user falls into one bucket or the other, UNION ALL should do the trick (doing a distinct count in the outer query also rules out duplicates).
"If the table join_user_department doesn’t have any record by user_id
this means that he have access to all departments"
This seems like very bad practice. Essentially you are using the absence of records to simulate the presence of access rights. Very messy. What happens if there is a user who has no access to a car from any department? Perhaps the current business logic doesn't allow this, but you have a "data model" which won't let you implement such a scenario without changing your application's logic.