Pentaho data integration merge with inner join has 5000+ matches and left join has 0 matches - sorting

I want to merge an Excel file with a database query to add some fields in a transformation.
The merge with INNER seems to work well and returns 5000+ matches, but I need a LEFT JOIN to get the unmatched rows as well, and in that case the number of matched rows is 0.
Why are no rows matched when LEFT JOIN is used? Any ideas?
[Screenshots: Transformation; Merge; Sort rows (left side); Sort rows (right side)]

From your screenshots the setup looks fine, and I'm not sure why the Merge join step didn't work for you. You could try another way to do the left join, using the Stream lookup step instead; for me it works both ways. You can find the difference between the two steps here.
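As a rough illustration of why a lookup-style join sidesteps ordering problems, here is a minimal Python sketch (not Pentaho code). It assumes the lookup side fits in memory and has unique keys; `excel_rows`, `db_rows`, and the field names are hypothetical.

```python
# Hypothetical data: neither list is sorted by the join key.
excel_rows = [
    {"id": "B2", "name": "beta"},
    {"id": "A1", "name": "alpha"},
    {"id": "Z9", "name": "zeta"},
]
db_rows = [
    {"id": "A1", "region": "north"},
    {"id": "B2", "region": "south"},
]

def lookup_left_join(left, right, key, fields):
    """Left join through an in-memory dictionary; input order does not matter."""
    index = {row[key]: row for row in right}  # assumes unique keys on the right
    for row in left:
        match = index.get(row[key])
        # Unmatched left rows are kept, with the looked-up fields set to None.
        extra = {f: (match[f] if match else None) for f in fields}
        yield {**row, **extra}

joined = list(lookup_left_join(excel_rows, db_rows, "id", ["region"]))
for row in joined:
    print(row)
```

Every left row survives, and only the rows without a counterpart carry empty lookup fields, which is the behaviour expected from a left join.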

Related

Pentaho, trying to merge a small list of cities with a big list of person using same Key field

I'm trying to merge two lists. For both lists I added a sequence that counts to a maximum of 18, because I have a list of 18 cities.
This is my transformation:
Basically, I added the city_ID sequence to count up to a maximum of 18, since my text file has an "ID" field with a maximum of 18.
The idea is that "Merge Join 2" would merge every row with the same ID, repeating the city names on "csv file input 2", so that I don't have to generate the city names by hand.
This is the result when I merge on "Merge Join 2":
I tried the merge with a full outer join, and I made sure I had the right ID. What I want is for the city names to repeat down the rows when merging on "Merge Join 2".
This is my city list:
This is my people list after adding the city_ID sequence:
The Merge join step needs the input rows ordered by the key you are using to join, so you need a Sort step before each branch of the Merge join.
I assume that in "Merge join" you are joining by ID_Client. Since you add ID_Client to both files using a sequence, and the sequence delivers the rows in order, a sort is not strictly necessary there. In "Merge join 2", however, you are joining by ID on the rows coming from "Text file input" and by City_ID on the rows coming from "Merge join". So you need one Sort step after "Text file input" to sort by ID (not needed in this particular case, but good practice in case your file isn't ordered) and another Sort step after "Merge join" to sort the rows by City_ID.
Always add those Sort steps before a Merge join (unless you have already sorted earlier and nothing in between has changed the order), even if you think you don't need them because the files come ordered or you are querying a table with an ORDER BY. Pentaho uses Java internally, so for keys based on character or date columns, the order your language or database produces might not be the order Java uses internally, for example because of special characters.
Beware also that the Sort step by default treats upper and lower case as the same character ("A" the same as "a") unless you explicitly check the option to treat them as different, whereas the Merge join step treats them as different, so that default can produce mismatches.
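To make the ordering issue concrete, here is a minimal Python sketch of a sort-merge left join (an illustration only, not Pentaho's actual implementation). It assumes unique keys on the right side and compares keys case-sensitively; the function name and the sample keys are hypothetical.

```python
def merge_left_join(left, right):
    """Left join two key lists, assuming both are sorted in the SAME order.

    Keys are compared case-sensitively and right-side keys are assumed unique.
    Returns (left_key, matched_right_key_or_None) pairs.
    """
    out, j = [], 0
    for key in left:
        # Advance the right side while it is "behind" the current left key.
        while j < len(right) and right[j] < key:
            j += 1
        match = right[j] if j < len(right) and right[j] == key else None
        out.append((key, match))
    return out

keys = ["b", "C", "a", "D"]

# One side sorted case-insensitively, the other case-sensitively.
left_ci = sorted(keys, key=str.lower)   # ['a', 'b', 'C', 'D']
right_cs = sorted(keys)                 # ['C', 'D', 'a', 'b']
mismatched = merge_left_join(left_ci, right_cs)

# Both sides sorted with the same case-sensitive order.
consistent = merge_left_join(sorted(keys), sorted(keys))

print(mismatched)  # 'C' and 'D' find no partner
print(consistent)  # every key finds its partner
```

Both inputs contain exactly the same keys, yet the join silently loses matches when the two sides were sorted under different rules.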

How to speed-up a spatial join in BigQuery?

I have a BigQuery table with point records covering a whole country, and I need to assign a "censal zone" to each of them; the zone polygons are stored in another table. I've been trying to do so using a query like this one:
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON ST_CONTAINS(zone_polygon, point_geo)
The first table is quite large, so the query performs very inefficiently, as it compares every possible (point, censal zone) pair. However, both tables have a column identifying the municipality they belong to. So the question is: can I rewrite my query so that ST_CONTAINS(*) is evaluated only for (point, censal zone) pairs that belong to the same municipality, rather than comparing every censal zone in the country against each point? Can I do this without having to read points_table multiple times?
SELECT p.id_point, z.code_censal_zone
FROM `points_table` AS p
JOIN `zones_table` AS z
ON p.municipality = z.municipality
AND ST_CONTAINS(z.zone_geo, p.point_geo)
I'm quite new to BigQuery, so I don't really know whether a query like this would actually do what I am expecting, as I couldn't find anything in the documentation.
Thanks!
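The idea behind adding the municipality condition can be sketched in plain Python. This is only a conceptual illustration of restricting candidate pairs by an equality key, not a description of how BigQuery executes the query. The data is hypothetical, and zones are simplified to axis-aligned boxes so that the example stays self-contained.

```python
from collections import defaultdict

# Hypothetical data: each zone is a box (xmin, ymin, xmax, ymax).
zones = [
    {"code": "Z1", "municipality": "M1", "box": (0, 0, 5, 5)},
    {"code": "Z2", "municipality": "M1", "box": (5, 0, 10, 5)},
    {"code": "Z3", "municipality": "M2", "box": (0, 5, 10, 10)},
]
points = [
    {"id": 1, "municipality": "M1", "xy": (1, 1)},
    {"id": 2, "municipality": "M1", "xy": (7, 2)},
    {"id": 3, "municipality": "M2", "xy": (3, 8)},
]

def contains(box, xy):
    """Stand-in for a real geometric containment test."""
    xmin, ymin, xmax, ymax = box
    x, y = xy
    return xmin <= x < xmax and ymin <= y < ymax

# Group the (small) zones table by municipality once.
zones_by_muni = defaultdict(list)
for zone in zones:
    zones_by_muni[zone["municipality"]].append(zone)

def assign_zones(points, zones_by_muni):
    """Single pass over points; only same-municipality zones are tested."""
    results, tests = [], 0
    for p in points:
        for zone in zones_by_muni.get(p["municipality"], []):
            tests += 1
            if contains(zone["box"], p["xy"]):
                results.append((p["id"], zone["code"]))
    return results, tests

results, tests = assign_zones(points, zones_by_muni)
print(results, tests, "tests instead of", len(points) * len(zones))
```

The points are read once, and the number of containment tests drops from every (point, zone) pair to only the pairs sharing a municipality.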

Should I apply string manipulation after or before joining tables in Oracle

I have two tables that I need to inner join; one table has a relatively small number of records compared to the other. I need to apply some string manipulation to the smaller table. My question is: can I apply the string function after the join, or should I apply it in a subquery and then join the subselect to the bigger table?
An example would be something like this:
Option 1:
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "BIG_TABLE"
JOIN "SMALL_TABLE" ON ...
Option 2:
SELECT "NEW_COL"
FROM "BIG_TABLE"
JOIN
(
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "SMALL_TABLE"
) "T"
ON ...
Which is better for performance, option 1 or 2?
I am using Oracle 11g.
Regardless of how you structure the query, Oracle's optimizer is free to evaluate the function before or after the join. Assuming that the string manipulation is only done as part of the projection step (i.e. it is done only in the SELECT clause and is not used as a predicate in the WHERE clause), I would expect that Oracle would apply the SUBSTR before joining the tables if you used either formulation because it would then have to apply the function to fewer rows (though it can probably treat the SUBSTR as a deterministic call and cache the results if it applies the function after the join).
As with any query optimization question, the first step is always to generate a query plan and see if the different queries actually produce different plans. I would expect the plans to be identical and, thus, the performance to be identical. But there are any number of reasons that one of the two options might produce different plans on your system given your optimizer statistics, initialization parameters, etc.
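The "compare the plans" step can be tried out on a small scale. The sketch below uses SQLite through Python's `sqlite3` module purely as a stand-in, so the plans it prints say nothing about what Oracle 11g would choose; in Oracle the equivalent would be `EXPLAIN PLAN FOR ...` followed by `SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)`. The table layout, the join columns, and the `SUBSTR` arguments are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE small_table (id INTEGER PRIMARY KEY, col_name TEXT);
    CREATE TABLE big_table (id INTEGER PRIMARY KEY, small_id INTEGER);
    INSERT INTO small_table VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO big_table VALUES (1, 1), (2, 1), (3, 2);
""")

# Option 1: apply the string function in the outer SELECT.
option_1 = """
    SELECT SUBSTR(s.col_name, 1, 3) AS new_col
    FROM big_table b JOIN small_table s ON b.small_id = s.id
"""
# Option 2: apply it in a subquery, then join the subquery.
option_2 = """
    SELECT t.new_col
    FROM big_table b
    JOIN (SELECT id, SUBSTR(col_name, 1, 3) AS new_col FROM small_table) t
      ON b.small_id = t.id
"""

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_1, plan_2 = plan(option_1), plan(option_2)
rows_1 = sorted(conn.execute(option_1).fetchall())
rows_2 = sorted(conn.execute(option_2).fetchall())

print(plan_1)
print(plan_2)
print(rows_1 == rows_2)  # both formulations return the same rows
```

If the two plans come out the same, the formulation is a matter of readability; if they differ, timing both against realistic data is the next step.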
It is better to apply the operations before doing the join, and then join and query for the final result. This is part of query optimization.
In your case, doing so means fewer operations when joining, because you eliminate the useless rows beforehand.
Lots of examples here: http://beginner-sql-tutorial.com/sql-query-tuning.htm
and this is the best one I could find: http://www.cse.iitb.ac.in/~sudarsha/db-book/slide-dir/ch14.ppt

sort in ssis takes much time, if i do sort on ole db command condition split doesnt work

This is my problem.
I use 2 sources:
first query (select * from servera.databasea.tablea)
second query (select id, modifiedon from serverb.databaseb.tableb)
I sort the first query and sort the second query,
then merge join them as a left join.
The conditional split is: isnull(idtableb), then I do an insert (insert on server B);
!isnull(idtableb) && modifiedontableb<modifiedontablea, then update (on server B).
It works fine with a few rows, but I work with more than 50000, and the sort takes more than 2 hours and then fails with an error.
So my other approach was the following:
instead of the Sort, on the OLE DB source I right-clicked, chose Show Advanced Editor, and on
Input and Output Properties, on the OLE DB Source Output, I changed IsSorted to true;
on Output Columns, for id, I changed
SortKeyPosition to 1
(I didn't set anything for modifiedon).
I did these steps for both OLE DB sources
(the OLE DB source for server 1 and for server 2).
It works a lot faster: it finished in 5 minutes, but it always does the insert.
The conditional split doesn't work now, because every row goes to insert.
So in the conditional split I added a cast to (DT_DBDATE), and the result was the same (only inserts),
never going to update. After that I changed the modifiedon cast to (DT_DATE), and the result was still the same. So my question is:
(I don't want to use Sort) how can I make the conditional split work?
The Sort step takes long because you are running low on memory for the sort operation. This means it starts sorting on disk, which is horribly slow. One option is to use a third-party sorting component such as NSort.
Otherwise you can do the following:
For your MERGE to work, your inputs need to be sorted both in the query and via the SortKeyPosition. They also need to be sorted the same way.
Your queries should read:
SELECT * FROM servera.databasea.tablea ORDER BY id, modifiedon
SELECT id, modifiedon FROM serverb.databaseb.tableb ORDER BY id, modifiedon
Now set IsSorted to TRUE and set SortKeyPosition to 1 for id.
In your MERGE step, use id as the join key.
Now, in your conditional split, you can use your two output cases.
Please note: if you have MULTIPLE rows per id, you need something more to sort/join on, so that you don't get rows in the wrong order.
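The routing that the conditional split performs after the left join can be sketched in Python as follows. This is an illustration of the logic only, not SSIS code; the column names `id_a`, `id_b`, `modifiedon_a`, and `modifiedon_b` are hypothetical stand-ins for the joined columns.

```python
from datetime import datetime

def route(row):
    """Mimic the conditional split on one row produced by the left join."""
    if row["id_b"] is None:
        return "insert"      # no match on server B
    if row["modifiedon_b"] < row["modifiedon_a"]:
        return "update"      # server B holds an older version
    return "unchanged"       # falls through to the default output

joined_rows = [
    {"id_a": 1, "modifiedon_a": datetime(2020, 1, 2),
     "id_b": 1, "modifiedon_b": datetime(2020, 1, 1)},
    {"id_a": 2, "modifiedon_a": datetime(2020, 1, 1),
     "id_b": 2, "modifiedon_b": datetime(2020, 1, 1)},
    {"id_a": 3, "modifiedon_a": datetime(2020, 1, 5),
     "id_b": None, "modifiedon_b": None},
]

routes = [route(r) for r in joined_rows]
print(routes)
```

Seen this way, a split that sends every row to insert suggests that the join itself found no matches (the right-side id is null on every row), which is consistent with inputs that are flagged as sorted without actually being sorted in the query.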

Outer Joins with Subsonic 3.0

Does anyone know of a way to do a left outer join with SubSonic 3.0 or another way to approach this problem? What I am trying to accomplish is that I have one table for departments and another table for divisions. A department can have multiple divisions. I need to display a list of departments with the divisions it contains. Getting back a collection of departments which each contain a collection of divisions would be ideal, but I would take a flattened result table too.
Using the LINQ syntax seems to be broken (I am new to LINQ though and may be using it wrong), for example this throws an ArgumentException error:
var allDepartments = from div in Division.All()
join dept in Department.All() on div.DepartmentId equals dept.Id into divdept
select divdept;
So I figured I could fall back to using the SubSonic query syntax. This code however generates an INNER JOIN instead of an OUTER JOIN:
List<Department> allDepartments = new Select()
.From<Department>()
.LeftOuterJoin<Division>(DepartmentsTable.IdColumn, DivisionsTable.DepartmentIdColumn)
.ExecuteTypedList<Department>();
Any help would be appreciated. I am not having much luck with SubSonic 3. I really enjoyed using SubSonic 2 and may go back to that if I can't figure out something as basic as a left join.
"Getting back a collection of departments which each contain a collection of divisions would be ideal"
SubSonic does this for you (if you set up your relationships correctly in the database); just select all Departments:
var depts = Model.Department.All();
There will be a property in each item of depts named Divisions, which contains a collection of Division objects.
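For readers unfamiliar with the two result shapes mentioned in the question, here is a small Python sketch (not SubSonic or LINQ code) contrasting the nested collection with the flattened left outer join. The sample data and function names are hypothetical.

```python
departments = [{"id": 1, "name": "Sales"}, {"id": 2, "name": "Legal"}]
divisions = [
    {"id": 10, "department_id": 1, "name": "EMEA"},
    {"id": 11, "department_id": 1, "name": "APAC"},
]

def group_divisions(divisions):
    """Index divisions by the department they belong to."""
    by_dept = {}
    for div in divisions:
        by_dept.setdefault(div["department_id"], []).append(div)
    return by_dept

def nested(departments, divisions):
    """Each department carries its own (possibly empty) list of divisions."""
    by_dept = group_divisions(divisions)
    return [{**d, "divisions": by_dept.get(d["id"], [])} for d in departments]

def flattened(departments, divisions):
    """Left outer join: one row per pair, None when a department has no division."""
    by_dept = group_divisions(divisions)
    rows = []
    for d in departments:
        for div in by_dept.get(d["id"]) or [None]:
            rows.append((d["name"], div["name"] if div else None))
    return rows

print(nested(departments, divisions))
print(flattened(departments, divisions))
```

In both shapes the department without divisions ("Legal" here) is still present, which is what distinguishes the outer join from the inner join the question ran into.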

Resources