I have three tables (Oracle source); let's call them tables 1, 2, and 3.
I would like to check a boolean field in table 1: if it is T, I want only data from table 2, and if it is F, I want only data from table 3.
Which transformation would be the most efficient for doing so, and how would I go about implementing it?
I'm experimenting with the Filter, Java, and Expression transformations, but since those evaluate on a row-by-row basis, it seems like overkill to run the check on every row instead of checking once and then reading from the appropriate table.
Both tables 2 and 3 have a field with the same name, and I want that field from just one of the tables, based on the condition.
Create a mapping similar to the following diagram:
src_TABLE2 --> sq_TABLE2 --|
                           |--> union --> (further processing)
src_TABLE3 --> sq_TABLE3 --|
and use the following Source Filter conditions* in the source qualifiers:
-- for sq_TABLE2
'T' = ( SELECT FLAG FROM TABLE1 )
-- for sq_TABLE3
'F' = ( SELECT FLAG FROM TABLE1 )
There are two source tables in the mapping, and the Union transformation merges data from the two pipelines. However, because of the filtering conditions, at any given time data will be retrieved from only one of the source tables.
* Source Filter is an attribute on the Properties tab of a source qualifier. The condition you set here is appended to the WHERE clause of the SELECT statement that is sent to the database.
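For illustration, the SQL that PowerCenter sends for sq_TABLE2 would then look roughly like this (the column list is a placeholder for whatever ports the qualifier projects):
SELECT TABLE2.FIELD1, TABLE2.FIELD2
FROM TABLE2
WHERE 'T' = ( SELECT FLAG FROM TABLE1 )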
I have a big calculation that joins together about 10 tables and calculates some values from the result. I want to write a function that allows me to replace one of the joined tables (let's call it Table A) with a table type I pass in as an input parameter.
I have defined row and table types for table A like
create or replace TYPE t_tableA_row AS OBJECT(*All Columns of Table A*);
create or replace TYPE t_tableA_table as TABLE OF t_tableA_row;
And the same for the types of the calculation I need as an output of the function.
My function looks like this:
create or replace FUNCTION calculation_VarInput (varTableA t_tableA_table)
RETURN t_calculationResult_table AS
result_ t_calculationResult_table;
BEGIN
SELECT t_calculationResult_row (*All Columns of Calculation Result*)
BULK COLLECT INTO result_
FROM (*The calculation*);
RETURN result_;
END;
If I test this function with the normal calculation that just uses Table A (ignoring the input parameter), it works fine and takes about 3 seconds. However, if I replace Table A with varTableA (the input parameter of Table A's table type), the calculation takes so long that I have never seen it finish.
When I use Table A, the calculation looks like this:
/*Inside the calculation*/
*a bunch of tables being joined*
JOIN TableA A ON A.Value = B.SomeOtherValue
JOIN *some other tables*
When I use varTableA, it's:
/*Inside the calculation*/
*a bunch of tables being joined*
JOIN TABLE(varTableA) A ON A.Value = B.SomeOtherValue
JOIN *some other tables*
Sorry for not posting the exact code, but the calculation is huge and would really bloat this post.
Any ideas why using the table type when joining makes the calculation so much slower when compared to using the actual table?
Your function encapsulates some selection logic and so hides information from the optimizer, which may lead it to make bad or inefficient decisions.
Oracle has gathered statistics for TableA, so the optimizer knows how many rows it has, which columns are indexed, and so on. Consequently it can figure out the best access path for the table. It has no stats for TABLE(varTableA), so it assumes it will return 8192 (i.e. 8K) rows. This could change the execution plan dramatically if the original TableA returned, say, 8 rows, or 80000. You can check this easily enough by running EXPLAIN PLAN for both versions of the query.
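In SQL*Plus or SQL Developer the check looks like this (the query text is a placeholder; run it once per variant and compare the Rows estimate on the step that reads the collection, which typically appears as COLLECTION ITERATOR PICKLER FETCH):
EXPLAIN PLAN FOR
*the query under test*;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);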
If that is the problem, add a /*+ CARDINALITY */ hint to the query that accurately reflects the number of rows in the function's result set. The hint (hint, not function) tells the optimizer the row count it should use in its calculations.
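A minimal sketch, assuming the collection typically holds about 100 rows (the alias and row count are illustrative, not taken from the question):
SELECT /*+ CARDINALITY(A 100) */ *columns*
FROM *a bunch of tables being joined*
JOIN TABLE(varTableA) A ON A.Value = B.SomeOtherValue
JOIN *some other tables*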
I don't want to actually change the values in my tables permanently, I just want to know what the calculation result would be if some values were different.
Why not use a view instead? A simple view which selects from TableA and applies the required modifications in its projection. Of course I know nothing about your data and how you want to manipulate it, so this may be impractical for all sorts of reasons. But it's where I would start.
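As a sketch, assuming a hypothetical AMOUNT column whose values you want to inflate by 10% for the what-if run (all column names here are made up):
CREATE OR REPLACE VIEW tableA_whatif AS
SELECT t.ID,                     -- key columns pass through unchanged
       t.Value,                  -- join column used by the calculation
       t.AMOUNT * 1.1 AS AMOUNT  -- the hypothetical modification
FROM TableA t;
The calculation then joins tableA_whatif instead of TableA; because the view is a simple projection over the real table, the optimizer can still use TableA's statistics and indexes.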
While benchmarking Impala against PrestoDB, we noticed that building pivot tables is quite difficult in Impala because it does not have CUBE operators like Presto does. Here are two examples in Presto:
The CUBE operator generates all possible grouping sets (i.e. a power set) for a given set of columns. For example, the query:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_state, destination_state);
is equivalent to:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state, destination_state),
(origin_state),
(destination_state),
());
Another example is the ROLLUP operator. Full documentation is here: https://prestodb.io/docs/current/sql/select.html.
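For reference, ROLLUP produces only the hierarchical prefixes of the grouping columns rather than the full power set:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_state, destination_state);

-- equivalent to GROUPING SETS (
--   (origin_state, destination_state),
--   (origin_state),
--   ());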
This is not just syntactic sugar: Presto performs a single table scan for the whole query, so with these operators you can build a pivot table in one request, whereas Impala needs to run two or three queries.
Is there a way we can do this with one query / table scan in Impala instead of three? Otherwise the performance becomes terrible when creating any type of pivot table.
You can use Impala window functions, but instead of a single-column output you will get three sum columns.
SELECT origin_state,
destination_state,
SUM(package_weight) OVER (PARTITION BY origin_state, destination_state) AS pkgwgrbyorganddest,
SUM(package_weight) OVER (PARTITION BY origin_state) AS pkgwgrbyorg,
SUM(package_weight) OVER (PARTITION BY destination_state) AS pkgwgrbydest
FROM shipping;
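Note that this form emits the three sums on every row of shipping rather than one row per grouping. If you need deduplicated output, one option (an untested sketch against the same shipping table) is to wrap the window query in a DISTINCT:
SELECT DISTINCT origin_state, destination_state,
       pkgwgrbyorganddest, pkgwgrbyorg, pkgwgrbydest
FROM (
  SELECT origin_state,
         destination_state,
         SUM(package_weight) OVER (PARTITION BY origin_state, destination_state) AS pkgwgrbyorganddest,
         SUM(package_weight) OVER (PARTITION BY origin_state) AS pkgwgrbyorg,
         SUM(package_weight) OVER (PARTITION BY destination_state) AS pkgwgrbydest
  FROM shipping
) t;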
New to SQL. Pardon me if this question is a basic one. Is there a way for me to do the following?
SELECT COLUMN1 as CUSTOM_NAME, <wildcard for remaining columns as is> from TABLE;
I only want COLUMN1 to appear once in the final result.
There is no way to make that kind of dynamic SELECT list with regular SQL*.
This is a good thing. Programming gets more difficult the more dynamic it is. Even the simple * syntax, while useful in many contexts, causes problems in production code. The Oracle SQL grammar is already more complicated than that of most traditional programming languages; adding a little meta-language to describe what queries return could be a nightmare.
*Well, you could create something using Oracle data cartridge, or DBMS_XMLGEN, or a trick with the PIVOT clause. But each of those solutions would be incredibly complicated and certainly not as simple as just typing the columns.
This is about as close as you will get. It is very handy for putting the important columns up front while still being able to scroll to the others if needed. Note that COLUMN1 will end up appearing twice.
SELECT COLUMN1 as CUSTOM_NAME,
aliasName.*
FROM TABLE aliasName;
If you have many columns, it might be worth generating the full column list automatically instead of relying on the * selector.
A two-step approach would be to generate the column list with your custom first N columns followed by the remaining columns in unspecified order, then use this generated list in your actual SELECT statement.
-- select comma separated column names from table with the first columns being in specified order
select
LISTAGG(column_name, ', ') WITHIN GROUP (
ORDER BY decode(column_name,
'FIRST_COLUMN_NAME', 1,
'SECOND_COLUMN_NAME', 2) asc) "Columns"
from user_tab_columns
where table_name = 'TABLE_NAME';
Replace TABLE_NAME, FIRST_COLUMN_NAME, and SECOND_COLUMN_NAME with your actual names, and adjust the list of explicit columns as needed.
Then execute the query and use the result, which should look like
FIRST_COLUMN_NAME, SECOND_COLUMN_NAME, OTHER_COLUMN_NAMES
Of course this is overhead for five-ish columns, but if you ever run into a company database with a three-digit number of columns, it can be interesting.
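The generated list is then pasted into the actual statement; with the placeholder names above, the final query would look like:
SELECT FIRST_COLUMN_NAME AS CUSTOM_NAME,
       SECOND_COLUMN_NAME,
       OTHER_COLUMN_NAMES
FROM TABLE_NAME;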
I need to work with a SQL result set in order to do some processing for each column (medians, standard deviations, several control statements included).
The SQL is dynamic, so I don't know the number of columns or rows.
First I tried to use temporary tables, views, etc. to store the results; however, I did not manage to overcome Oracle's 30-character limit on column names when using the SQL below:
create table (or view or global temporary table) as select * from (
SELECT
DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE,
SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ + DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI) <-- exceeds the 30 character limit
FROM DMTTBF_MAT_MATURATO_BILL_POS
WHERE DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE >= '201301'
GROUP BY DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE
)
My second choice was to use some PL/SQL types to store the entire table, so I could address it as in other programming languages (e.g. a matrix result[i][j]), but I could not find anything similar.
The third variant, reading and writing files: I have not tried it yet; I'm still hoping for a more elegant PL/SQL solution.
It's possible that I have the wrong approach here so any advice is more than welcome.
UPDATE: Modifying the input SQL is not an option. The program has to accept any select statement.
Note that you can alias both tables and fields. Using a table alias keeps references to it from producing walls of text in the query. Using one for a field gives it a new name in the output.
SELECT A.LONG_FIELD_NAME_HERE AS SHORTNAME
FROM REALLY_LONG_TABLE_NAME_HERE A
The automatic naming adds _1, _2, etc. to differentiate the same column name coming from different table references, which often pushes a field that was already borderline over the limit. Naming the fields yourself bypasses this.
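For example (the tables and columns here are made up), naming both NAME columns explicitly avoids the auto-generated NAME_1:
SELECT c.NAME AS CUSTOMER_NAME,
       v.NAME AS VENDOR_NAME
FROM CUSTOMERS c
JOIN VENDORS v ON v.VENDOR_ID = c.VENDOR_ID;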
You can also apply the alias in dynamic SQL:
sqlstr := 'create table (or view or global temporary table) as select * from (
SELECT
DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE,
SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ + DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI) AS "'||SUBSTR('SUM(DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI_CHZ +DMTTBF_MAT_MATURATO_BILL_POS.MAT_N_NUM_EVENTI)', 1, 30)
||'" FROM DMTTBF_MAT_MATURATO_BILL_POS
WHERE DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE >= ''201301''
GROUP BY DMTTBF_MAT_MATURATO_BILL_POS.MAT_V_COD_ANNOMESE
)';
I have a page that pulls together aggregate data from two different tables. I would like to perform these queries in parallel to reduce the latency without having to introduce a stored procedure that would do both.
For example, I currently have this:
ViewBag.TotalUsers = DB.Users.Count();
ViewBag.TotalPosts = DB.Posts.Count();
// Page displays both values but has two trips to the DB server
I'd like something akin to:
var info = DB.Select(db => new {
TotalUsers = db.Users.Count(),
TotalPosts = db.Posts.Count() });
// Page displays both values using one trip to DB server.
that would generate a query like this
SELECT (SELECT COUNT(*) FROM Users) AS TotalUsers,
(SELECT COUNT(*) FROM Posts) AS TotalPosts
Thus, I'm looking for a single query to hit the DB server. I'm not asking how to parallelize two separate queries using Tasks or Threads.
Obviously I could create a stored procedure that got back both values in a single trip, but I'd like to avoid that if possible as it's easier to add additional stats purely in code rather than having to keep refreshing the DB import.
Am I missing something? Is there a nice pattern in EF to say that you'd like several disparate values that can all be fetched in parallel?
This will return the counts using a single SELECT statement, but there is an important caveat. You'll notice that the EF-generated SQL uses cross joins, so there must be a table (not necessarily one of the ones you are counting) that is guaranteed to have rows in it; otherwise the query will return no results. This isn't an ideal solution, but I don't know that it's possible to generate the SQL in your example, since it doesn't have a FROM clause in the outer query.
The following code counts records in the Addresses and People tables in the Adventure Works database, and relies on StateProvinces to have at least 1 record:
var r = from x in StateProvinces.Top("1")
let ac = Addresses.Count()
let pc = People.Count()
select new { AddressCount = ac, PeopleCount = pc };
and this is the SQL that is produced:
SELECT
1 AS [C1],
[GroupBy1].[A1] AS [C2],
[GroupBy2].[A1] AS [C3]
FROM
(
SELECT TOP (1) [c].[StateProvinceID] AS [StateProvinceID]
FROM [Person].[StateProvince] AS [c]
) AS [Limit1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Address] AS [Extent2]
) AS [GroupBy1]
CROSS JOIN
(
SELECT COUNT(1) AS [A1]
FROM [Person].[Person] AS [Extent3]
) AS [GroupBy2]
and the results from the query when it's run in SSMS:
C1 C2 C3
----------- ----------- -----------
1 19614 19972
You should be able to accomplish what you want with Parallel LINQ (PLINQ). You can find an introduction here.
It seems like there's no good way to do this (yet) in EF4. You can either:
Use the technique described by adrift, which will generate a slightly awkward query.
Use ExecuteStoreQuery<T>, where T is some dummy class you create with property getters/setters matching the names of the columns from the query. The disadvantage of this approach is that you can't directly use your entity model and have to resort to raw SQL. In addition, you have to create these dummy entities.
Use a MultiQuery class that combines several queries into one. This is similar to NHibernate's futures, hinted at by StanK in the comments. It is a little hack-ish, and it doesn't seem to support scalar-valued queries (yet).