Joining two Kafka engine tables in ClickHouse

I am trying to join two tables as follows:
CREATE MATERIALIZED VIEW db.data_v ON CLUSTER shard1 TO db.table
AS
SELECT
    JSON_VALUE(bq.message, '$.after.id') AS bid,
    JSON_VALUE(lq.message, '$.after.brand_id') AS brand_id,
    JSON_VALUE(lq.message, '$.after.id') AS id
FROM db.table1_queue AS lq
JOIN db.table2_queue AS bq
    ON JSON_VALUE(bq.message, '$.after.id') = JSON_VALUE(lq.message, '$.after.brand_id')
However, I got an empty result:
0 rows in set. Elapsed: 0.006 sec.

Related

I/O issue with PowerCenter Informatica in Oracle

I have two tables in Oracle and I have to synchronize values (Field column) between the tables. I'm using Informatica PowerCenter for this synchronization operation. The source qualifier query causes high I/O usage and I need to solve it.
Table1
Table1 has about 20M rows. Field in Table1 holds the authoritative value. The Timestamp field holds the create & update date, and the table has daily partitions.

Id | Field | Timestamp
1  | A     | 2017-05-12 03:13:40
2  | B     | 2002-11-01 07:30:46
3  | C     | 2008-03-03 03:26:29
Table2
Table2 has about 500M rows. Field in Table2 should stay as synchronized as possible with Field in Table1. The Timestamp field holds the create & update date, and the table has daily partitions. Table2 is also the target in the mapping.

Id  | Table1_Id | Field | Timestamp           | Action
100 | 1         | A     | 2005-09-30 03:20:41 | Nothing
101 | 1         | B     | 2015-06-29 09:41:44 | Update Field as A
102 | 1         | C     | 2016-01-10 23:35:49 | Update Field as A
103 | 2         | A     | 2019-05-08 07:42:46 | Update Field as B
104 | 2         | B     | 2003-06-02 11:23:57 | Nothing
105 | 2         | C     | 2021-09-21 12:04:24 | Update Field as B
106 | 3         | A     | 2022-01-23 01:17:18 | Update Field as C
107 | 3         | B     | 2008-04-24 15:17:25 | Update Field as C
108 | 3         | C     | 2010-01-15 07:20:13 | Nothing
Mapping Queries
Source Qualifier Query
SELECT *
FROM Table1 t1, Table2 t2
WHERE t1.Id = t2.Table1_Id AND t1.Field <> t2.Field
Update Transformation Query
UPDATE Table2
SET
Field = :tu.Field,
Timestamp = SYSDATE
WHERE Id = :tu.Id
You can use the approach below.
SQ - your SQL is correct and you can keep it if it works for you, but add a <> condition on the partition-date key column. Alternatively, you can use this SQL to speed it up:
SELECT *
FROM Table2 t2
INNER JOIN Table1 t3 ON t3.Id = t2.Table1_Id
LEFT OUTER JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    AND t1.Field = t2.Field
    AND t1.partition_date = t2.partition_date -- you did not mention a partition_date column, but I am assuming there is a separate column used for partitioning
WHERE t1.Id IS NULL -- an anti-join; <> is inefficient
Then, in your Informatica target definition for Table2, make sure partition_date is part of the key along with Id.
Then use an Update Strategy transformation set to DD_UPDATE. You can set the session to 'update' as well.
Also remove the target update override: it applies the update query to the whole table and can be inefficient and I/O intensive.
Informatica is good at updating data in batches through the Update Strategy; you can increase the commit interval to suit your performance needs.
You shouldn't try to update a 500M-row table in a single pass with plain SQL, but you can use PL/SQL to update in batches, as sketched below.
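A minimal PL/SQL sketch of such a batched update, using the table and column names from the question; the cursor query, the 50,000 batch size, and the commit-per-batch policy are illustrative assumptions, not part of the original answer:

DECLARE
  -- rows in Table2 whose Field has drifted from the authoritative value in Table1
  CURSOR c IS
    SELECT t2.Id, t1.Field
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    WHERE t1.Field <> t2.Field;
  TYPE t_id_tab    IS TABLE OF Table2.Id%TYPE;
  TYPE t_field_tab IS TABLE OF Table1.Field%TYPE;
  v_ids    t_id_tab;
  v_fields t_field_tab;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO v_ids, v_fields LIMIT 50000; -- batch size is illustrative
    EXIT WHEN v_ids.COUNT = 0;
    FORALL i IN 1 .. v_ids.COUNT  -- one bulk-bound statement per batch
      UPDATE Table2
         SET Field = v_fields(i),
             Timestamp = SYSDATE
       WHERE Id = v_ids(i);
    COMMIT; -- commit per batch to keep undo/redo bounded
  END LOOP;
  CLOSE c;
END;
/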

Optimize table does not run right after table mutation

I'm updating a table with table mutations like this:
ALTER TABLE T1
UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
After that, I send the OPTIMIZE ... FINAL command with clickhouse-client, like this:
OPTIMIZE TABLE T1 FINAL
Ok.
0 rows in set. Elapsed: 0.002 sec.
But it returns instantly (0.002 sec.) and I can see the rows are not updated yet.
After a couple of seconds (10-50), I run the OPTIMIZE ... FINAL command again, but this time it hangs until the table is optimized.
Is this the expected behavior of OPTIMIZE ... FINAL?
I can see the rows are not updated yet.
ALTER TABLE T1 UPDATE -- asynchronous
You should check that your mutation is done: select count() from system.mutations where not is_done;
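If the mutation seems stuck, you can also inspect the pending entries directly. A minimal sketch, using the table name from the question; these columns exist in system.mutations in recent ClickHouse versions:

SELECT mutation_id, command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE table = 'T1' AND NOT is_done;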
In newer versions, you can run mutations synchronously:
ALTER TABLE T1 UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
SETTINGS mutations_sync = 2
mutations_sync, 0, "Wait for synchronous execution of ALTER TABLE UPDATE/DELETE queries (mutations). 0 - execute asynchronously. 1 - wait current server. 2 - wait all replicas if they exist."
OPTIMIZE TABLE T1 FINAL
OPTIMIZE -- merge has no relation to mutations.
0 rows in set. Elapsed: 0.002 sec.
In some cases, OPTIMIZE cannot start and returns immediately.
Use optimize_throw_if_noop to find out the reason:
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;

How to obtain a random sample of 100K users with all their transactions in Hive?

I have a huge dataset containing information on millions of users and their purchases recorded for 1 year. Is there a way to create a random sample of 100K users (keeping all their individual purchases) from this data? Since a user can have more than one purchase, the sample will contain more than 100k records.
I was able to find the rand() function but it does not give me all records for the users.
I tried this query:
select *
from mytable
where rand()< 0.025 and mydate between '20140101' and '20141231'
distribute by rand()
sort by rand()
limit 100000
This query produces only 100k random records, not all the records for a random sample of 100k users.
Any suggestions on how to write a Hive query to obtain this result?
You should first create a table of 100,000 random userIds:
CREATE TABLE random_users AS
SELECT * FROM (SELECT DISTINCT userId FROM mytable) users
WHERE rand() < 0.025 LIMIT 100000;
Then you can do:
SELECT m.* FROM mytable m JOIN random_users r ON (m.userId = r.userId);
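If you also need the one-year window from your original query, the date filter can ride along on the purchases side of the join. A sketch reusing the column names from your query:

SELECT m.*
FROM mytable m
JOIN random_users r ON (m.userId = r.userId)
WHERE m.mydate BETWEEN '20140101' AND '20141231';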

Performance issue in Hive version 0.13.1

I use AWS EMR to run my Hive queries, and I have a performance issue while running Hive version 0.13.1.
The newer version of Hive took around 5 minutes to run over 10 rows of data, but the same script over 230,804 rows has been running for 2 days and is still not finished. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive> select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row in the 'b' table paired with every row in the 'a' table. So, for example, if 'a' has 5 rows and 'b' has 10 rows, you get the product: 5 multiplied by 10 = 50 rows returned. Your BETWEEN predicate is then applied as a filter over all of those combinations, not during the join itself.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to return a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on those 10 billion rows.
So, if you reduce the number of 'b' rows, you will get more benefit than reducing 'a'. In your example, if you can filter the ip_logs table (table 2), which I am guessing has more rows than your order-number table, it will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to reduce the number of comparisons BEFORE the WHERE clause is applied. Another way I have done this in Hive is to make a view with a smaller set of data and join against the view instead of the original table; a bucketing sketch is shown below.
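One concrete way to give this query an equality join key is to derive a coarse bucket from the order number on both sides. This is a sketch, not part of the original answer: the 1,000,000 bucket width is illustrative, and it assumes every [startorderno, endorderno] range fits inside a single bucket; ranges that cross a bucket boundary would have to be split into one row per bucket first.

SELECT b.itemcode
FROM (
  -- orderno is a string in foo, so cast it before bucketing
  SELECT CAST(orderno AS BIGINT) AS orderno,
         CAST(CAST(orderno AS BIGINT) / 1000000 AS BIGINT) AS bucket
  FROM foo
) a
JOIN (
  -- bucket each range by its start; assumes the range stays within one bucket
  SELECT itemcode, startorderno, endorderno,
         CAST(startorderno / 1000000 AS BIGINT) AS bucket
  FROM bar
) b ON a.bucket = b.bucket          -- equality key: no more cross product
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;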

Optimised Hive query with JOIN, having millions of records

I have 2 tables:
bpm_agent_data - 40 million records, 5 columns
bpm_loan_data - 20 million records, 5 columns
Now I ran a query in Hive:
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber) from bpm_agent_data JOIN bpm_loan_data where bpm_loan_data.id = bpm_agent_data.id;
It is taking a very long time to complete.
What is the ideal way to write the query in Hive so that the reducer does not take so much time?
Found the solution for the above query: replace WHERE with ON.
With only a WHERE clause, Hive joins the two tables without a condition (a cross product) and filters afterwards; with an ON condition, the equality key is applied during the join itself.
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber)
from bpm_agent_data JOIN bpm_loan_data ON (bpm_loan_data.id = bpm_agent_data.id);
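To confirm the cross product is gone, you can inspect the query plan with Hive's standard EXPLAIN command; the "Shuffle Join ... is a cross product" warning seen in the previous question's answer should no longer appear:

EXPLAIN
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber)
from bpm_agent_data JOIN bpm_loan_data ON (bpm_loan_data.id = bpm_agent_data.id);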
