One to Many outer join using KTable in Kafka - apache-kafka-streams

I have a table contact published to topic#1 with key as contact.id. Also a subtable is published to topic#2 with key as subtable.id. Now, I need to do a join like
select * from contact c outer join subtable st on c.id = st.id;
There is a 1..n relationship between the tables. How can I perform this aggregation, with or without lambdas?

As of version 1.0, the Kafka Streams API does not support 1:n KTable-KTable joins.
The only non-primary-key join supported is the KStream-GlobalKTable join.
More details about joins can be found in this blog post: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
Also, there is a JIRA for 1:n joins: https://issues.apache.org/jira/browse/KAFKA-3705
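Until that JIRA lands, a common workaround is to re-key the child records by their foreign key and aggregate them into a collection, turning the 1:n relationship into 1:1 so an ordinary outer join applies. A minimal sketch of those join semantics in plain Python (hypothetical field names and toy data, no Kafka involved):

```python
from collections import defaultdict

# Hypothetical materialized state of the two tables, keyed as described:
# topic#1 keyed by contact.id, topic#2 keyed by subtable.id, with a
# foreign-key field on each child row pointing back at its contact.
contacts = {1: {"name": "alice"}, 2: {"name": "bob"}}
subtable = {10: {"contact_id": 1, "v": "a"},
            11: {"contact_id": 1, "v": "b"},
            12: {"contact_id": 3, "v": "c"}}

# Step 1: re-key/aggregate the child rows by the foreign key (1:n -> 1:1).
children_by_contact = defaultdict(list)
for row in subtable.values():
    children_by_contact[row["contact_id"]].append(row["v"])

# Step 2: full outer join on contact id, now that both sides share a key.
def outer_join(left, right):
    out = {}
    for k in set(left) | set(right):
        out[k] = (left.get(k), right.get(k))
    return out

joined = outer_join(contacts, dict(children_by_contact))
```

In Kafka Streams terms, step 1 corresponds to `selectKey`/`groupByKey`/`aggregate` on the child stream, and step 2 to a KTable-KTable outer join on the resulting co-partitioned tables.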

Related

Chained CTEs in Redshift - How do I know which DIST KEY the CTE will inherit?

I have a view in Redshift which consists of lots of CTEs that are joined (chained) to each other. Inside these CTEs there are joins between multiple tables. If I then join to a CTE that itself contains a join of multiple tables, where do the SORT KEY and DIST KEY for that join come from? How does Redshift decide which table inside the CTE's join the CTE should inherit its DIST KEY and SORT KEY from? If at all?
For example, tbl1 has a DIST KEY on tbl_key, tbl2 has a DIST KEY on tbl_id, tbl3 has DIST KEY on tbl_key.
First, I create a CTE which is the join of tbl1 and tbl2.
With cte1 as (
Select tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id )
Second, I create a CTE that joins to the first CTE
With cte2 as (
Select cte1.*, tbl3.col3
From cte1
Join tbl3 using (tbl_key))
Now my question is, does CTE1 have a DIST KEY on tbl1's DIST KEY of tbl_key or tbl2's DIST KEY of tbl_id? or both? or neither?
In Redshift, CTEs are just used to simplify the reading of SQL. They are processed the same as subqueries, i.e. they are not materialized and therefore do not have their own DIST/SORT keys.
You could rewrite your code as (note that cte1 must also select tbl_key for the USING join to work)
Select cte1.*, tbl3.col3
From (Select tbl1.col1, tbl2.col2, tbl1.tbl_key
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id
) as cte1
Join tbl3 using (tbl_key)
which can be simplified further as
Select tbl1.col1, tbl2.col2, tbl3.col3
from tbl1
join tbl2 on tbl1.job_no = tbl2.job_id
join tbl3 using (tbl_key)
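The logical equivalence of the CTE form and the flattened form is easy to sanity-check with Python's sqlite3 (toy tables and values, not Redshift, so this demonstrates only that the two queries return the same rows, not anything about distribution behaviour):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl1 (col1 TEXT, job_no INT, tbl_key INT);
CREATE TABLE tbl2 (col2 TEXT, job_id INT);
CREATE TABLE tbl3 (col3 TEXT, tbl_key INT);
INSERT INTO tbl1 VALUES ('a', 1, 100);
INSERT INTO tbl2 VALUES ('b', 1);
INSERT INTO tbl3 VALUES ('c', 100);
""")

# CTE form (tbl_key added to cte1's select list so USING can resolve it).
with_cte = con.execute("""
    WITH cte1 AS (
        SELECT tbl1.col1, tbl2.col2, tbl1.tbl_key
        FROM tbl1 JOIN tbl2 ON tbl1.job_no = tbl2.job_id)
    SELECT cte1.col1, cte1.col2, tbl3.col3
    FROM cte1 JOIN tbl3 USING (tbl_key)
""").fetchall()

# Flattened form: the CTE inlined away entirely.
flattened = con.execute("""
    SELECT tbl1.col1, tbl2.col2, tbl3.col3
    FROM tbl1
    JOIN tbl2 ON tbl1.job_no = tbl2.job_id
    JOIN tbl3 USING (tbl_key)
""").fetchall()
```

Both queries produce the same result set, which is exactly why Redshift is free to treat the CTE as an inline subquery.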
If you are able to choose your dist/sort keys then you should consider which tables are the biggest and prioritise those accordingly.
For example, if tbl1 and tbl2 are large then it may make sense to have them distributed as you described.
However, if tbl2 and tbl3 are both large, it may make sense to distribute both on tbl_key.
When you issue a query Redshift will compile and optimize that query as it sees fit to achieve the best performance and be logically equivalent. Your CTEs look like subqueries to the compile / optimization process and the order in which the joins are performed may have no relation to how you wrote the query.
Redshift makes these optimization choices based on the table metadata that is created / updated by ANALYZE. If you want Redshift to make smart choices on how to join your tables together you will want your table metadata to be up to date. The query plan (including join order and data distribution) is set at query compile, it is not dynamically determined during execution.
One of the choices Redshift makes is how the intermediate data of the query is distributed (your question) but remember that these intermediate results can be for a modified join order. To see what order that Redshift plans to join your tables look at the EXPLAIN plan for the query. The more tables you are joining and the more complex your query, the more choices Redshift has and the less likely it is that the EXPLAIN plan will join in the order you specified. I've worked on clients' queries with dozens of joins and many nested levels of subquery and the EXPLAIN plan is often very different than the original query as written.
So Redshift tries to make smart choices about the join order and intermediate result distribution. For example, it will usually join small tables to large tables first and keep the distribution of the large table. But here "large" and "small" are based on post-WHERE-clause filtering and the guesses Redshift can make from metadata. The further a join is from the source table metadata (deep into the join tree), the more uncertain Redshift is about what the incoming and outgoing data of the join will look like.
Here the EXPLAIN plan can give you hints about what Redshift is "thinking": if you see a DS_DIST_INNER join, Redshift is moving the data of one table (or intermediate result set) to match the other. If you see DS_DIST_BOTH, Redshift is redistributing both sets of data to some new distribution (usually one of the join columns). It does this to avoid having only one slice with data and all others with nothing to do, which would be very inefficient.
To sum up: to see what Redshift plans to do with your joins, look at the EXPLAIN plan. You can also infer some information about intermediate result distribution from it, but it doesn't provide a complete map of what Redshift plans to do.

Dynamic aggregation in SQL (Hive)

I have two tables. Table A with 3 columns: userid, a start date, and end date. Table B with events and datetimestamps. I would like to aggregate Table B up to the datetimes between the start and end date based on Table A. So something like...
select a.userid, count(distinct b.eventid) as events
from table a
inner join table b
on a.userid=b.userid
and b.datetime between a.starttime and a.endtime
group by a.userid
But Hive doesn't like that... I'm using Hadoop HortonWorks. Would appreciate any guidance!
Move the BETWEEN condition to the WHERE clause, as only equality conditions are supported in join ON clauses prior to Hive 2.2.0.
From Hive documentation
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.
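So the rewrite is simply to keep the equality in ON and push the range predicate into WHERE, which for an inner join is logically equivalent. Hive isn't available here, but the equivalence is easy to check with Python's sqlite3, using the column names from the question and toy data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (userid INT, starttime TEXT, endtime TEXT);
CREATE TABLE b (userid INT, eventid INT, datetime TEXT);
INSERT INTO a VALUES (1, '2020-01-01', '2020-01-31');
INSERT INTO b VALUES (1, 100, '2020-01-15');
INSERT INTO b VALUES (1, 101, '2020-02-15');  -- outside the window
""")

# Equality condition stays in ON; the range predicate moves to WHERE,
# which pre-2.2.0 Hive accepts. For an INNER join the two forms are
# equivalent: the event outside the window is filtered out either way.
rows = con.execute("""
    SELECT a.userid, COUNT(DISTINCT b.eventid) AS events
    FROM a
    JOIN b ON a.userid = b.userid
    WHERE b.datetime BETWEEN a.starttime AND a.endtime
    GROUP BY a.userid
""").fetchall()
```

Only the in-window event is counted, so the query returns one row with a count of 1.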

Is it possible to compare two fields from different types on a Query DSL?

I'm new to ElasticSearch and I'm struggling with this question. Basically what I want to do is sort of like this (SQL Example):
SELECT A.id
FROM TableA A, TableB B
WHERE A.id = B.id;
I want a Query that returns all of the info from TableA, but only if the id from TableA is equal to an id from TableB.
I've read a lot of Query Filter fields and I think I might use the Term Field but I'm not sure how.
Thanks in advance!
This answer was given by Adrien Grand on an Elasticsearch group:
This SQL query is a join and in general elasticsearch does not support joins.
If the id field is your PK, you might be able to do it by indexing B as a child of A (using parent/child) and then searching for all documents in A that have a child in B.
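Besides parent/child, a common pragmatic alternative is an application-side two-step join: first fetch the ids from B, then query A with a terms filter on those ids. Sketched in plain Python with hypothetical documents standing in for the results of two separate Elasticsearch searches:

```python
# Documents as they might come back from two separate searches.
table_a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
table_b = [{"id": 2}, {"id": 3}, {"id": 4}]

# Step 1: collect the ids present in B (in Elasticsearch, a search on B,
# or a terms aggregation on the id field).
b_ids = {doc["id"] for doc in table_b}

# Step 2: keep only the A documents whose id appears in B (in
# Elasticsearch, a terms query on A built from the ids collected above).
matches = [doc for doc in table_a if doc["id"] in b_ids]
```

This scales only as far as the number of distinct ids in B, since they all have to fit in one terms query, but for modest id sets it avoids parent/child mapping changes entirely.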

Linq to Entity Query .Expand

I have the following tables:
TableA, TableB, TableC, TableD, TableE, with foreign-key relations
FK_AB (one-to-many), FK_BC (one-to-one), FK_CD (one-to-many), FK_DE (one-to-one), and navigation properties based on these foreign keys.
Now I want to query TableA and get the records from TableA, TableD and TableE whose Loadedby column equals "System". My query is like below:
var query = from A in Context.TableA.Expand("TableB/TableC/TableD").Expand("TableB/TableC/TableD/TableE")
where A.Loadedby=="System"
select A;
The above query works, but it returns all the records from TableD and TableE that are related to the TableA records satisfying A.Loadedby == "System"; the condition is not checked against the child tables. I want only the TableD and TableE records whose Loadedby value equals "System".
Can anyone tell me how to filter the child tables as well?
Currently OData only supports filters on the top level. So in the above example it can only filter rows from TableA. Inside expansions all the appropriate rows will always be included; there's no way to filter those right now.
You might be able to ask for the expanded entities separately with additional queries (with the right filter) and possibly use a batch to group all the queries into one request. But that depends on the actual query you need to send.
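If separate queries or batching aren't an option, the child filtering can also be done client-side after the expanded result comes back. A sketch of that filtering step in Python, using a hypothetical dict shape standing in for the materialized entities:

```python
# A TableA record as it might look with its expansions materialized
# (only the relevant shape is modelled here; real OData entities would
# be typed objects, but the filtering logic is the same).
record = {
    "Loadedby": "System",
    "TableB": [
        {"TableC": {"TableD": [
            {"Loadedby": "System", "TableE": {"Loadedby": "System"}},
            {"Loadedby": "User",   "TableE": {"Loadedby": "System"}},
        ]}},
    ],
}

def filter_children(rec):
    """Keep only the TableD rows (with their TableE) loaded by 'System'."""
    for b in rec["TableB"]:
        c = b["TableC"]
        c["TableD"] = [d for d in c["TableD"]
                       if d["Loadedby"] == "System"
                       and d["TableE"]["Loadedby"] == "System"]
    return rec

filtered = filter_children(record)
```

The obvious cost is that the unwanted child rows still travel over the wire; server-side filtering via separate queries is preferable when the expansions are large.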

Could I use LLBL Gen to JOIN columns between tables which don't have a relationship of primary-foreign keys?

I have the following DB structure
Table1 (ID1, Col1, Col2) and Table2 (ID2, Col3, Col4)
Table1 and Table2 are separate tables and don't have any relationship between them,
and I would like to achieve the following result
SELECT *
FROM
Table1 JOIN Table2 ON Table1.Col1= Table2.Col3
How could I achieve that using the LLBLGen Adapter?
Thanks.
Is there any specific reason you're not using LINQ statements?
Otherwise that would make it easy:
Linq to LLBLGen Pro:
http://www.llblgen.com/documentation/3.1/LLBLGen%20Pro%20RTF/hh_start.htm
http://www.llblgen.com/documentation/3.1/LLBLGen%20Pro%20RTF/Tutorials%20and%20examples/examples_howdoi.htm#linq
Other options:
Define model-only relations in the LLBLGen Pro UI (since 3.0, unfortunately only if the DB types and lengths match exactly)
Have a look at instantiating an EntityRelation (though due to LINQ my predicate skills are getting a bit rusty)