I'm trying to apply a maximum aggregate function to a table by grouping on some fields and projecting. Can I refer to other non-grouping fields in the original table in the aggregating projection?
As an example, I have a table blah with schema (user_id: long, order_id: long, product_id: long, gender: chararray, size: int), where user_id, order_id and product_id form a composite key, but the same user_id and order_id can appear in multiple rows. To get the maximum size for each order I use
result_table = foreach (group blah by (user_id, order_id)) generate
    FLATTEN(group) as (user_id, order_id),
    MAX(blah.size) as max_size;
Is there some way I can also add product_id to the creation of result_table, so that I have a table containing user_id, order_id, product_id and max_size (max_size would be duplicated over differing product_ids)?
If I could refer to the product_id specific to each grouped user_id and order_id, I could save myself a MapReduce job by not joining back with the original table to access this field. Thanks guys.
Pig is well suited for such things: it has bags, and those enable it to do things which in SQL would require extra joins.
If you do the following:
grp = group blah by (user_id, order_id);
describe grp;
you will see that there is a bag with a schema identical to the schema of "blah" (something like group: (user_id: long, order_id: long), blah: {(user_id: long, order_id: long, product_id: long, gender: chararray, size: int)}). That is a very powerful thing, as it allows us to create an output with all of the original rows carrying group summaries in each row, without using inner joins:
grp = group blah by (user_id, order_id);
result_table = foreach grp generate
    FLATTEN(blah.(user_id, order_id, product_id)), -- flatten the bag created by the group
    MAX(blah.size) as max_size;
If the same product_id appears multiple times within a (user_id, order_id) group, there will be duplicates. To avoid them, we can use a DISTINCT nested inside the FOREACH:
grp = group blah by (user_id, order_id);
result_table = foreach grp {
    dist = distinct blah.(user_id, order_id, product_id); -- remove duplicates
    generate flatten(dist), MAX(blah.size) as max_size;
}
It will be done in a single MapReduce job.
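To make the behaviour concrete, here is a tiny made-up example (the rows are purely illustrative):
-- hypothetical input rows in blah (user_id, order_id, product_id, gender, size):
-- (1, 10, 100, 'M', 5)
-- (1, 10, 200, 'F', 7)
--
-- result_table then has one row per distinct (user_id, order_id, product_id),
-- with the group maximum repeated on each of them:
-- (1, 10, 100, 7)
-- (1, 10, 200, 7)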
I want to create a materialized view in ClickHouse that stores the final result of an aggregate function. The best practice is to store the state and calculate the final result at query time, but that is too costly to do at query time for my use case.
Base table:
CREATE TABLE IF NOT EXISTS active_events
(
`event_name` LowCardinality(String),
`user_id` String,
`post_id` String
)
ENGINE = MergeTree -- assumption: engine and ordering key were omitted in the original
ORDER BY tuple()
My current materialization:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
POPULATE AS
SELECT
post_id,
event_name,
uniqState(user_id) unique_users_state
FROM active_events
GROUP BY post_id, event_name;
And then at query time, I can use uniqMerge to calculate the exact number of users who've done a certain event.
I don't mind a small delay in the materialization, but I want the final result to be calculated during ingestion rather than at query time.
Here's the query:
SELECT post_id, sumIf(total, event_name = 'click') / sumIf(total, event_name = 'impression') as ctr
FROM (
SELECT post_id, event_name, uniqMerge(unique_users_state) as total
FROM inventory
WHERE event_name IN ('click', 'impression')
GROUP BY post_id, event_name
) as res
GROUP BY post_id
HAVING ctr > 0.1
ORDER BY ctr DESC
It's literally impossible to fully calculate the result during ingestion.
Imagine you insert some user_id, say 3456, into the table. How many uniques is that? One. But you cannot store the number 1, because if you insert 3456 again the count should still be 1. So ClickHouse stores states, and they are HLL (HyperLogLog) structures that are not fully aggregated/calculated. That is also what lets you query with GROUP BY event_name, or GROUP BY event_name, post_id, or with no GROUP BY at all.
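That flexibility is exactly why states are stored: the same states can be merged later at any grouping level. For example, against the inventory view above:
SELECT event_name, uniqMerge(unique_users_state) AS unique_users
FROM inventory
GROUP BY event_name;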
Another problem is why your query is slow. My guess is that the issue is index_granularity: CH reads a lot of excessive data from disk if you query with WHERE event_name = ...
It can be solved like this
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniq, String) CODEC(NONE) -- uniq is not compressible in 99% cases
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256 -- 256 instead of the default 8192
Another approach is to use a different HLL function, because uniq is too heavy.
Try this:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniqCombined64(14), String) CODEC(NONE)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256
POPULATE AS
SELECT
post_id,
event_name,
uniqCombined64State(14)(user_id) unique_users_state
FROM active_events
GROUP BY post_id, event_name;
SELECT uniqCombined64Merge(14)(unique_users_state)
FROM inventory
WHERE event_name = ...
Attention: you need to use (14) in all 3 places: uniqCombined64(14) / uniqCombined64State(14) / uniqCombined64Merge(14).
uniqCombined64(14) is less accurate than uniq, but it is able to work 10-100× faster in some cases, with an error rate < 5%.
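For reference, the CTR query from the question stays the same apart from the merge function (a sketch, assuming the uniqCombined64(14) version of the view):
SELECT post_id, sumIf(total, event_name = 'click') / sumIf(total, event_name = 'impression') AS ctr
FROM (
    SELECT post_id, event_name, uniqCombined64Merge(14)(unique_users_state) AS total
    FROM inventory
    WHERE event_name IN ('click', 'impression')
    GROUP BY post_id, event_name
) AS res
GROUP BY post_id
HAVING ctr > 0.1
ORDER BY ctr DESC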
Currently I am trying to find all the unique indexes defined on a table in an Oracle database whose columns are all NOT NULL. What I mean by that is that Oracle allows creating unique indexes on columns which are defined as nullable.
So if my table has two unique indexes, I want to retrieve the particular unique index in which every column has a NOT NULL constraint.
I did come up with this query:
select ind.index_name, ind_col.column_name, ind.index_type, ind.uniqueness
from sys.dba_indexes ind
inner join sys.dba_ind_columns ind_col on ind.owner = ind_col.index_owner and ind.index_name = ind_col.index_name
where ind.owner in ('ISADRM') and ind.table_name in ('TH_RHELOR') and ind.uniqueness IN ('UNIQUE')
The above query gives me all the unique indexes with their associated columns, but I am not sure how I should join it with ALL_TAB_COLS, which holds the nullability data for all the columns of a table.
I tried joining that view with the indexes and tried a subquery as well, but I'm not getting appropriate results. Any suggestions on this would be appreciated.
Analytic functions and inline views can help.
The analytic functions let you return detailed data but also build a summary over that data, based on separate windows. The detailed results include index owner, index name, and column name, while the counts are per index owner and index name.
The first inline view joins the three tables, returns the detailed information, and has analytic functions to generate the count of all columns and the count of all nullable columns. The second inline view only selects rows where those two counts are equal.
--Unique indexes and columns where every column is NOT NULL.
select owner, index_name, column_name
from
(
--All relevant columns and counts of columns and not null columns.
select
dba_indexes.owner,
dba_indexes.index_name,
dba_tab_columns.column_name,
dba_tab_columns.nullable,
count(*) over (partition by dba_indexes.owner, dba_indexes.index_name) total_columns,
sum(case when nullable = 'N' then 1 else 0 end)
over (partition by dba_indexes.owner, dba_indexes.index_name) total_not_null_columns
from dba_indexes
join dba_ind_columns
on dba_indexes.owner = dba_ind_columns.index_owner
and dba_indexes.index_name = dba_ind_columns.index_name
join dba_tab_columns
on dba_ind_columns.table_owner = dba_tab_columns.owner --also match the owner, in case the same table name exists in more than one schema
and dba_ind_columns.table_name = dba_tab_columns.table_name
and dba_ind_columns.column_name = dba_tab_columns.column_name
where dba_indexes.owner = user
and dba_indexes.uniqueness = 'UNIQUE'
)
where total_columns = total_not_null_columns
order by 1,2,3;
Analytic functions and inline views are tricky but they're very powerful once you learn how to use them.
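If the windowing part is hard to follow, here is the same pattern in isolation, with hypothetical names (some_table, grp, flag): keep only the groups in which every row satisfies a predicate.
--Generic shape of the "every row in the group matches" test.
select *
from
(
    select t.*,
        count(*) over (partition by t.grp) total_rows,
        sum(case when t.flag = 'N' then 1 else 0 end)
            over (partition by t.grp) matching_rows
    from some_table t
)
where total_rows = matching_rows;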
Let me explain the question.
I have two tables which have 3 columns with the same data types. The 3 columns form a key/ID, if you like, but the names of the columns differ between the tables.
Now I am writing queries with these 3 columns for both tables. I've managed to get results for each table independently.
For example:
SELECT ID, FirstColumn, sum(SecondColumn)
FROM (SELECT ABC||DEF||GHI AS ID, FirstTable.*
FROM FirstTable
WHERE ThirdColumn = *1st condition*)
GROUP BY ID, FirstColumn
;
SELECT ID, SomeColumn, sum(AnotherColumn)
FROM (SELECT JKM||OPQ||RST AS ID, SecondTable.*
FROM SecondTable
WHERE AlsoSomeColumn = *2nd condition*)
GROUP BY ID, SomeColumn
;
So I make two very similar queries against two different tables. I know the results share a certain number of rows on the ID attribute, the one I've just created in the queries. I need to check which rows in one result are not in the other query's result, and vice versa.
Do I have to make temporary tables or views from the queries? Maybe join the two tables in a specific way and only run one query on them?
As a beginner I don't have any experience with using one query's results as input for the next query. I'm interested in the cleanest, most elegant way to do this.
No, you most probably don't need any temporary tables; a WITH clause (subquery factoring) will help.
Here's an example:
with
first_query as
(select id, first_column, ...
from (select ABC||DEF||GHI as id, ...)
),
second_query as
(select id, some_column, ...
from (select JKM||OPQ||RST as id, ...)
)
select id from first_query
minus
select id from second_query;
For the other direction you'd just swap the two queries, e.g.
with ... <the same as above>
select id from second_query
minus
select id from first_query;
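Filled in with the names from your queries, the whole statement could look like this (a sketch: the *condition* placeholders are kept from the question and need real values):
with
first_query as
(
    select ID, FirstColumn, sum(SecondColumn) as sum_second
    from (select ABC||DEF||GHI as ID, FirstTable.*
          from FirstTable
          where ThirdColumn = *1st condition*)
    group by ID, FirstColumn
),
second_query as
(
    select ID, SomeColumn, sum(AnotherColumn) as sum_another
    from (select JKM||OPQ||RST as ID, SecondTable.*
          from SecondTable
          where AlsoSomeColumn = *2nd condition*)
    group by ID, SomeColumn
)
select ID from first_query
minus
select ID from second_query;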
I have List 1 with the following schema
{customerId: int,storeId: int,products: {(prodId: int,name: chararray)}}
a Customer List with the following schema
{uniqueId: int,customerId: int,name: chararray}
a Store List with the following schema
{uniqueId: int,storeNum: int,name: chararray}
and a Product List with the following schema
{uniqueId: int,sku: int,productName: chararray}
Now I want to look up the customerId, storeId and prodId of each item in List 1 against the other lists to check whether the ids are valid or not. The valid items have to be stored in one file and the invalid items in another.
As Pig is very new to me, this feels very complex to do. Please suggest a good approach for this job using Apache Pig.
First of all, load all your data (think of these as tables):
cust_data = LOAD '/your/path/to/customer/data' USING PigStorage() as (uniqueId: int, customerId: int, name: chararray);
store_data = LOAD '/your/path/to/store/data' USING PigStorage() as (uniqueId: int, storeNum: int, name: chararray);
product_data = LOAD '/your/path/to/product/data' USING PigStorage() as (uniqueId: int, sku: int, productName: chararray);
You can check the schema of your loaded data with
DESCRIBE cust_data;
DESCRIBE store_data;
DESCRIBE product_data;
JOIN the customer and store data first using uniqueId (we are doing an equijoin):
cust_store_join = JOIN cust_data BY uniqueId, store_data BY uniqueId;
then generate your columns:
cust_store = FOREACH cust_store_join GENERATE
    cust_data::uniqueId as uniqueId,
    cust_data::customerId as customerId,
    cust_data::name as cust_name,
    store_data::storeNum as storeNum,
    store_data::name as store_name;
Now JOIN the customer/store data with the product data using uniqueId (again an equijoin):
cust_store_product_join = JOIN cust_store BY uniqueId, product_data BY uniqueId;
finally, generate all your desired columns:
customer_store_product = FOREACH cust_store_product_join GENERATE
    cust_store::uniqueId as uniqueId,
    cust_store::customerId as customerId,
    cust_store::cust_name as cust_name,
    cust_store::storeNum as storeNum,
    product_data::sku as sku,
    product_data::productName as productName;
Now store your desired columns in your local/HDFS directory. The STORE command below will store all the matching uniqueIds from all three tables, i.e. customer, store and product:
STORE customer_store_product INTO '/your/output/path' USING PigStorage(',');
Similarly, you can join your List 1 schema, generate columns, and store the data using the same logic, for instance along the lines of the sketch below.
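A minimal sketch of that last step, assuming List 1 has been flattened to one product per row (e.g. via FOREACH ... FLATTEN(products)); the path and the list1_flat alias are illustrative, and the same pattern extends to storeId (against store_data) and prodId (against product_data):
list1_flat = LOAD '/your/path/to/list1/data' USING PigStorage()
             AS (customerId: int, storeId: int, prodId: int, prodName: chararray);

-- a LEFT OUTER JOIN keeps every List 1 row; the right side is null for unknown ids
joined = JOIN list1_flat BY customerId LEFT OUTER, cust_data BY customerId;

-- split matched and unmatched rows into separate relations
SPLIT joined INTO valid IF cust_data::customerId IS NOT NULL,
                  invalid IF cust_data::customerId IS NULL;

STORE valid INTO '/your/output/valid' USING PigStorage(',');
STORE invalid INTO '/your/output/invalid' USING PigStorage(',');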
Hope this helps
I have a query like the one below (table names etc. changed to keep the actual data private):
SELECT inv.*, TRUNC(sysdate)
FROM Invoice inv
WHERE (inv.carrier,inv.pro,inv.ndate) IN
(
SELECT carrier,pro,n_dt FROM Order where TRUNC(Order.cr_dt) = TRUNC(sysdate)
)
I am selecting records from Invoice based on Order, i.e. all records from Invoice which match order records for today, based on those 3 columns.
Now I want to select Order_Num from Order in my select query as well, so that I can insert the whole thing into a totally separate table, let's say orderedInvoices.
insert into orderedInvoices(seq_no,..same columns as Inv...,Cr_dt)
(
SELECT Order.Order_Num, inv.*, TRUNC(sysdate)
FROM Invoice inv
WHERE (inv.carrier,inv.pro,inv.ndate) IN
(
SELECT carrier,pro,n_dt FROM Order where TRUNC(Order.cr_dt) = TRUNC(sysdate)
)
)
How do I select that Order_Num in the main query for each record of that subquery?
P.S. I understand that trunc(cr_dt) will not use an index on cr_dt (if there is one), but I couldn't select the records unless I omitted the time part of it. :(
If the table ORDER [1] is unique on CARRIER, PRO and N_DT, you can use a JOIN instead of IN to restrict your records; it'll also enable you to select whatever data you want from either table:
select ord.order_num, inv.*, trunc(sysdate)
from Invoice inv
join order ord
on inv.carrier = ord.carrier
and inv.pro = ord.pro
and inv.ndate = ord.n_dt
where trunc(ord.cr_dt) = trunc(sysdate)
If it's not unique then you have to use DISTINCT to deduplicate your record set.
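For example, only the first line of the query above needs to change:
select distinct ord.order_num, inv.*, trunc(sysdate)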
Though using TRUNC() on CR_DT prevents a plain index on that column from being used, you can create a function-based index if you do need one:
create index i_order_trunc_cr_dt on order (trunc(cr_dt));
[1] This is a really bad name for a table, as ORDER is a reserved keyword; consider using ORDERS instead.