Pushdown in PolyBase

I have the following scenario. A dimension table, e.g. PRODUCT, is loaded into SQL Server 2016. A fact table, e.g. ORDER_ITEM, is loaded into Hadoop. I want to run an aggregate query across PRODUCT and ORDER_ITEM, e.g.
SELECT
P.PRODUCT_CATEGORY,
SUM(OI.AMOUNT)
FROM
HADOOP.ORDER_ITEM OI
JOIN RDBMS.PRODUCT P ON (OI.PRODUCT_ID = P.PRODUCT_ID)
GROUP BY
P.PRODUCT_CATEGORY
What is the behaviour?
(1) Does PolyBase broadcast the PRODUCT dimension into Hadoop, perform the join and aggregation there, and return the result?
(2) Does PolyBase move the ORDER_ITEM table to SQL Server and perform the join and aggregation there?
It's probably (2), but if someone has tried it out, let me know.

PolyBase never moves data out of SQL Server, regardless of the data volume. Depending on the statistics, PolyBase would either:
A) Stream the ORDER_ITEM table back to SQL Server, then compute the join and the aggregate there.
B) Push down a partial aggregate, SUM(OI.AMOUNT) GROUP BY OI.PRODUCT_ID, to Hadoop, stream the result set to SQL Server, then do the join and final aggregation within SQL Server.
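To see why option B returns the same totals as option A while streaming fewer rows, here is a minimal Python sketch. It only imitates the two plan shapes in memory; the sample rows, the category names, and the function names plan_a and plan_b are made up for illustration and are not anything PolyBase exposes.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two tables in the question.
order_item = [  # (product_id, amount): the fact table that lives in Hadoop
    (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5), (3, 4.0),
]
product = {1: "toys", 2: "toys", 3: "books"}  # product_id -> product_category


def plan_a(order_item, product):
    """Option A: stream every fact row, then join and aggregate."""
    totals = defaultdict(float)
    for product_id, amount in order_item:
        totals[product[product_id]] += amount
    return dict(totals), len(order_item)  # totals, rows streamed


def plan_b(order_item, product):
    """Option B: partial SUM per product_id first, then join and finish."""
    partial = defaultdict(float)
    for product_id, amount in order_item:  # would run on the Hadoop side
        partial[product_id] += amount
    totals = defaultdict(float)
    for product_id, subtotal in partial.items():  # would run in SQL Server
        totals[product[product_id]] += subtotal
    return dict(totals), len(partial)  # totals, rows streamed


totals_a, streamed_a = plan_a(order_item, product)
totals_b, streamed_b = plan_b(order_item, product)
print(totals_a == totals_b, streamed_a, streamed_b)
```

The partial aggregate pays off when there are many order rows per product; if nearly every row has a distinct PRODUCT_ID, option B streams almost as much as option A.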

Related

Chained CTEs in Redshift - How do I know which DIST KEY the CTE will inherit?

I have a view in Redshift which consists of lots of CTEs that are joined (chained) to each other. Inside these CTEs there are joins between multiple tables. If I then join to a CTE that contains a join of multiple tables, where do the SORT KEY and DIST KEY for that join come from? How does Redshift decide which table in the CTE's join the CTE should inherit its DIST KEY and SORT KEY from, if at all?
For example, tbl1 has a DIST KEY on tbl_key, tbl2 has a DIST KEY on tbl_id, tbl3 has DIST KEY on tbl_key.
First, I create a CTE which is the join of tbl1 and tbl2.
With cte1 as (
Select tbl1.tbl_key, tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id )
Second, I create a CTE that joins to the first CTE
With cte2 as (
Select cte1.*, tbl3.col3
From cte1
Join tbl3 using (tbl_key))
Now my question is: does cte1 have a DIST KEY on tbl1's DIST KEY of tbl_key, or on tbl2's DIST KEY of tbl_id? Or both? Or neither?
In Redshift, CTEs are just used to simplify the reading of SQL. They are processed just the same as subqueries, i.e. they are not physically materialized and therefore do not have their own dist/sort keys.
You could rewrite your code as
Select cte1.*, tbl3.col3
From (Select tbl1.tbl_key, tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id
) as cte1
Join tbl3 using (tbl_key)
which can be simplified further as
Select tbl1.tbl_key, tbl1.col1, tbl2.col2, tbl3.col3
from tbl1
join tbl2 on tbl1.job_no = tbl2.job_id
join tbl3 using (tbl_key)
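The logical equivalence of the CTE form and the flattened join can be checked with a small script. This sketch uses SQLite from Python purely as a stand-in: it says nothing about Redshift distribution, only that both query shapes return the same rows. The sample rows are invented, and tbl_key is selected inside cte1 (an assumption) so that the outer join has a column to join on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (tbl_key INTEGER, job_no INTEGER, col1 TEXT);
    CREATE TABLE tbl2 (tbl_id INTEGER, job_id INTEGER, col2 TEXT);
    CREATE TABLE tbl3 (tbl_key INTEGER, col3 TEXT);
    -- Hypothetical sample rows.
    INSERT INTO tbl1 VALUES (1, 100, 'a'), (2, 200, 'b'), (3, 300, 'c');
    INSERT INTO tbl2 VALUES (10, 100, 'x'), (20, 200, 'y');
    INSERT INTO tbl3 VALUES (1, 'p'), (2, 'q'), (3, 'r');
""")

# The query written with a CTE, as in the question.
cte_form = """
    WITH cte1 AS (
        SELECT tbl1.tbl_key, tbl1.col1, tbl2.col2
        FROM tbl1
        JOIN tbl2 ON tbl1.job_no = tbl2.job_id
    )
    SELECT cte1.tbl_key, cte1.col1, cte1.col2, tbl3.col3
    FROM cte1
    JOIN tbl3 ON cte1.tbl_key = tbl3.tbl_key
    ORDER BY 1
"""

# The same query flattened into a single three-table join.
flat_form = """
    SELECT tbl1.tbl_key, tbl1.col1, tbl2.col2, tbl3.col3
    FROM tbl1
    JOIN tbl2 ON tbl1.job_no = tbl2.job_id
    JOIN tbl3 ON tbl1.tbl_key = tbl3.tbl_key
    ORDER BY 1
"""

cte_rows = conn.execute(cte_form).fetchall()
flat_rows = conn.execute(flat_form).fetchall()
print(cte_rows == flat_rows)
```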
If you are able to choose your dist/sort keys, then you should consider which tables are the biggest and prioritise those accordingly.
For example, if tbl1 and tbl2 are large, then it may make sense to have them distributed as you described.
However, if tbl2 and tbl3 are both large, it may make sense to distribute both on tbl_key.
When you issue a query, Redshift will compile and optimize that query as it sees fit to achieve the best performance while remaining logically equivalent. Your CTEs look like subqueries to the compile / optimization process, and the order in which the joins are performed may have no relation to how you wrote the query.
Redshift makes these optimization choices based on the table metadata that is created / updated by ANALYZE. If you want Redshift to make smart choices on how to join your tables together, you will want your table metadata to be up to date. The query plan (including join order and data distribution) is set at query compile time; it is not dynamically determined during execution.
One of the choices Redshift makes is how the intermediate data of the query is distributed (your question), but remember that these intermediate results can be for a modified join order. To see the order in which Redshift plans to join your tables, look at the EXPLAIN plan for the query. The more tables you are joining and the more complex your query, the more choices Redshift has and the less likely it is that the EXPLAIN plan will join in the order you specified. I've worked on clients' queries with dozens of joins and many nested levels of subquery, and the EXPLAIN plan is often very different from the original query as written.
So Redshift is trying to make smart choices about the join order and intermediate result distribution. For example, it will usually join small tables to large tables first and keep the distribution of the large table. But here large and small are based on post-WHERE-clause filtering and the guesses Redshift can make based on metadata. The further a join is from the source table metadata (deep in the join tree), the more uncertain Redshift is about what the incoming and outgoing data of the join will look like.
Here the EXPLAIN plan can give you hints about what Redshift is "thinking": if you see a DIST INNER join (DS_DIST_INNER in the plan text), Redshift is moving the data of one table (or intermediate result set) to match the other. If you see DIST BOTH (DS_DIST_BOTH), then Redshift is redistributing both sets of data to some new distribution (usually one of the join-on columns). It does this to avoid having only 1 slice with data and all others with nothing to do, as this would be very inefficient.
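When a plan has many join steps, it can help to pull the redistribution labels out of the EXPLAIN text mechanically. The sketch below is only a text scan: the DS_ labels are the ones Redshift prints on join steps, but the sample plan text, its cost figures, and the function name join_redistribution are made up for illustration.

```python
import re

# Redistribution labels that Redshift attaches to join steps in EXPLAIN output.
DIST_LABELS = {
    "DS_DIST_NONE": "no redistribution, the join is already collocated",
    "DS_DIST_ALL_NONE": "no redistribution, the inner table uses DISTSTYLE ALL",
    "DS_DIST_INNER": "the inner side is redistributed",
    "DS_DIST_OUTER": "the outer side is redistributed",
    "DS_BCAST_INNER": "the inner side is broadcast to every compute node",
    "DS_DIST_ALL_INNER": "the whole inner side is sent to a single slice",
    "DS_DIST_BOTH": "both sides are redistributed",
}

LABEL_RE = re.compile(r"\bDS_[A-Z_]+\b")


def join_redistribution(explain_text):
    """Return (label, meaning) for each known label, in plan order."""
    return [
        (label, DIST_LABELS[label])
        for label in LABEL_RE.findall(explain_text)
        if label in DIST_LABELS
    ]


# Hypothetical EXPLAIN output; a real plan will differ in every number.
sample_plan = """
XN Hash Join DS_DIST_BOTH  (cost=0.00..1.00 rows=10 width=20)
  ->  XN Hash Join DS_DIST_INNER  (cost=0.00..1.00 rows=10 width=20)
        ->  XN Seq Scan on tbl1  (cost=0.00..0.10 rows=10 width=12)
"""

steps = join_redistribution(sample_plan)
for label, meaning in steps:
    print(label, "->", meaning)
```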
To sum up: to see what Redshift is planning to do with your joins, look at the EXPLAIN plan. You can also infer some information about intermediate result distribution from the EXPLAIN plan, but it doesn't provide a complete map of what Redshift plans to do.

Dynamic aggregation in SQL (Hive)

I have two tables. Table A with 3 columns: userid, a start date, and end date. Table B with events and datetimestamps. I would like to aggregate Table B up to the datetimes between the start and end date based on Table A. So something like...
select a.userid, count(distinct b.eventid) as events
from table a
inner join table b
on a.userid=b.userid
and b.datetime between a.starttime and a.endtime
group by a.userid
But Hive doesn't like that... I'm using Hadoop HortonWorks. Would appreciate any guidance!
Move the BETWEEN condition to the WHERE clause, since only equality conditions are supported in joins prior to Hive 2.2.0.
From the Hive documentation:
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.
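The rewritten query shape keeps only the equality in ON and moves the range test into WHERE, which for an inner join returns the same rows. A minimal sketch follows, using SQLite from Python as a stand-in for Hive, so it demonstrates only the query shape and not Hive itself. The table names table_a and table_b, the column name event_time (used here in place of datetime), and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (userid INTEGER, starttime TEXT, endtime TEXT);
    CREATE TABLE table_b (userid INTEGER, eventid TEXT, event_time TEXT);
    -- Hypothetical sample rows; ISO dates compare correctly as text.
    INSERT INTO table_a VALUES (1, '2020-01-01', '2020-01-31'),
                               (2, '2020-02-01', '2020-02-28');
    INSERT INTO table_b VALUES (1, 'e1', '2020-01-05'),
                               (1, 'e2', '2020-01-20'),
                               (1, 'e3', '2020-03-01'),
                               (2, 'e4', '2020-02-10'),
                               (2, 'e5', '2020-01-15');
""")

# Equality condition stays in ON; the range condition moves to WHERE.
query = """
    SELECT a.userid, COUNT(DISTINCT b.eventid) AS events
    FROM table_a a
    INNER JOIN table_b b
        ON a.userid = b.userid
    WHERE b.event_time BETWEEN a.starttime AND a.endtime
    GROUP BY a.userid
    ORDER BY a.userid
"""

rows = conn.execute(query).fetchall()
print(rows)
```

As with the original inner join, a user with no events inside their window does not appear in the result at all.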

Sql to join two foxpro tables

I have two foxpro files as detailed below
E:\F1\Table1.dbf {Id, Name, Address, City}
E:\F2\Table2.dbf {Id, qualifcn, marks}
How can I join them to get an ADODB record set with details from both tables?
Thanks and regards
Jojy
As I have asked other people with similar questions: is this a one-time need or an ongoing need?
For your general SQL syntax you might want to look at:
Inner and Outer SQL Joins
Especially - 4) Full Outer Join SQL Example
But if this is a one-time need, you can simply:
1. Manually create a new recipient table with ALL fields
2. Append Table1 into the new table
3. Set relation to ID into Table2
4. REPLACE the recipient table's 'extra' fields with the related Table2 values
After which your new recipient table has ALL field values from both tables.
Good Luck
I KNOW the following has worked with an OleDB connection, and the same principle may work for you. Since both your data components are on the same logical drive, just in different paths, you might be able to do it via a common root.
Instead of making a connection directly to the folder where the first data location is, make a connection to the common root path. Then, in your query, refer to the tables by their RELATIVE PATH.
Connect to E:\
Your query could be
select
T1.*,
T2.*
from
F1\Table1 T1
JOIN F2\Table2 T2
on T1.ID = T2.ID
where
...

How to force oracle to use index or ordered hints for remote joins

I'm using Oracle 11g. I have a query that joins a local table with remote tables using db links. I want the driving table to be the remote table, as I primarily filter on the remote table to get a few rows. I then want to join those rows with the local table.
The problem is that the optimizer ignores the ORDERED and INDEX hints and does a full table scan of the local table. I am using the right indexes and have generated statistics. When I run the queries individually against each table, they use the correct indexes, but with the join, the local table always gets a full table scan and acts as the driving table.
SELECT /*+ INDEX_RS_ASC(l) */
*
FROM remote_table@mylink r
JOIN local_table l USING (cont_id)
WHERE r.PRIME_VENDOR_ID = '12345'

How to join two tables in Oracle on a BLOB column?

How can I join two tables in Oracle on a BLOB column?
When this query is executed, the error message "SQL command not properly ended" appears:
select name,photo
from tbl1 join tbl2 on tbl1.photo = tbl2.photo
First, it seems very very odd to have a design where you are storing the same blob in two different tables and very odd that you would want to join on an image. That doesn't seem like a sensible design.
You've tagged this for Oracle 8i. That is an ancient version of Oracle that didn't support the SQL-99 join syntax. You would need to do the join in the where clause instead. You can't directly test for equality between two blob values, but you can use dbms_lob.compare:
select name, tbl1.photo
from tbl1,
tbl2
where dbms_lob.compare(tbl1.photo, tbl2.photo) = 0
This will be rather hideous from a performance perspective. You'll have to compare every photo from tbl1 against every photo from tbl2 and comparing two lobs isn't particularly quick. If you are really intent on comparing images, you are probably better off computing a hash, storing that in a separate column that is indexed, and then comparing the hashes rather than comparing the images directly.
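The hash idea can be sketched outside the database to show the shape of the comparison. This is a minimal Python illustration, not Oracle code: SHA-256 is one possible choice of hash, and the sample rows, the byte strings standing in for photos, and the function names photo_hash and join_on_photo are all hypothetical.

```python
import hashlib


def photo_hash(blob: bytes) -> str:
    """Digest that would be stored in a separate, indexed column."""
    return hashlib.sha256(blob).hexdigest()


# Hypothetical rows: (name, photo bytes) for each table.
tbl1 = [("alice", b"\x89PNG-cat"), ("bob", b"\x89PNG-dog")]
tbl2 = [("row-1", b"\x89PNG-dog"), ("row-2", b"\x89PNG-bird")]


def join_on_photo(left, right):
    """Match rows on the hash, then confirm with a byte comparison."""
    index = {}
    for name, blob in right:  # plays the role of the indexed hash column
        index.setdefault(photo_hash(blob), []).append((name, blob))
    matches = []
    for name, blob in left:
        for other_name, other_blob in index.get(photo_hash(blob), []):
            if blob == other_blob:  # guards against a hash collision
                matches.append((name, other_name))
    return matches


matches = join_on_photo(tbl1, tbl2)
print(matches)
```

Each photo is hashed once and looked up by digest, instead of comparing every photo in tbl1 against every photo in tbl2.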
The code:
SELECT
name, photo
FROM
tbl1 T1
INNER JOIN
tbl2 T2
ON
T1.photo = T2.photo
If that does not run, you will have to make a few changes to your table structure:
1. Add a new table named IMAGES with the columns (image_id, image_blob).
2. Change tbl1's blob column and tbl2's blob column to image_id.
3. Perform the join on the image_id column.
NOTE: You cannot perform GROUP BY, JOIN (any join), or CONCAT operations on the BLOB datatype.
SUGGESTION: Save the paths to the images in the database and keep the images themselves in a directory on that server, as saving images as BLOBs in the database is not good practice.
