How to find unique values from a set of rows using Pentaho Kettle?

I have one denormalized table. I want to select all the values from one specific column of that table and load only the unique values from that column into a separate table.
How do I do this using Pentaho Spoon? Please note that I am a complete newbie to Spoon; the only thing I have built so far is a hello world transformation.
I have a table named 'Employees' which has lots of columns, as follows (unrelated columns omitted):
+-------------------------------------------------------+
Employees
+-------------------------------------------------------+
employee_number | employee_name | deputed_branch | phone
+-------------------------------------------------------+
Now I want to move only the unique branch names into a new table named 'branches' using Spoon.
The 'branches' table will look like the following:
+-------------------------------------------------------+
branches
+-------------------------------------------------------+
| branch_id | branch_name
+-------------------------------------------------------+
where branch_id will be unique and auto-incremented.
To connect the Employees and branches tables I will use an Employee_branch table, which will consist of the employee_number and branch_id columns.
Can anyone please tell me how to do this?
Thanks in advance !!

Can't you just do that in the SQL?
select distinct deputed_branch from employees
If not, then use either the Unique rows step (note that it requires sorted data) or the Group by step (which also requires sorted data).
Or use Memory group by if the number of rows is low (the data doesn't need to be sorted).
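As a rough sketch of the SQL route, the snippet below uses Python's built-in sqlite3 module as a stand-in database. This is an assumption, since the question does not say which database is behind Spoon. The sample rows are made up, and AUTOINCREMENT is SQLite's way of generating branch_id; other databases use sequences or identity columns instead.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE employees (
        employee_number INTEGER PRIMARY KEY,
        employee_name   TEXT,
        deputed_branch  TEXT,
        phone           TEXT
    );
    CREATE TABLE branches (
        branch_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        branch_name TEXT UNIQUE
    );
    CREATE TABLE employee_branch (
        employee_number INTEGER,
        branch_id       INTEGER
    );
""")

# Hypothetical sample data: two employees share the same branch.
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "Pune", "111"),
        (2, "Ravi", "Mumbai", "222"),
        (3, "Meena", "Pune", "333"),
    ],
)

# Load each distinct branch name once; branch_id is generated automatically.
cur.execute("""
    INSERT INTO branches (branch_name)
    SELECT DISTINCT deputed_branch FROM employees ORDER BY deputed_branch
""")

# Fill the link table by joining employees to branches on the branch name.
cur.execute("""
    INSERT INTO employee_branch (employee_number, branch_id)
    SELECT e.employee_number, b.branch_id
    FROM employees e
    JOIN branches b ON b.branch_name = e.deputed_branch
""")
conn.commit()

branch_names = [
    row[0]
    for row in cur.execute("SELECT branch_name FROM branches ORDER BY branch_name")
]
link_count = cur.execute("SELECT COUNT(*) FROM employee_branch").fetchone()[0]
print(branch_names, link_count)
```

In Spoon the same two statements could be placed in a Table input step (for the SELECT DISTINCT) feeding a Table output step, but the exact step configuration depends on your connection and is not shown here.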

Related

To update column details in base table 2 from a column of the staging table, matched against a column from base table 1

There's a total of three tables involved: one header base table, one material
base table, and one staging table.
I have created the staging table with 4 columns; the values will be
loaded from an uploaded CSV. Column 1 is batch_no, column 2 is for the
attribute.
>header base table(h) has batch_no and batch_id
>material base table(m) has batch_id, attr_m (empty, to be updated)
>staging table(s) has batch_no and attr_s
create table he (BATCH_ID number, BATCH_NO varchar2(30));
create table me (a6 varchar2(30), BATCH_id number);
create table s (batch_no varchar2(30), att varchar2(30));
I want to take values from attr_s and update attr_m against batch_no. How do I do that?
Here's my code, please help me fix this code, it doesn't work
update me
set a6 = (select att
from s where batch_no = (select he.batch_no
from he, s
where he.batch_no=s.batch_no))
error received:
single row subquery return multiple rows.
The update statement is applied to each individual row in ME. Therefore the assignment operation requires one scalar value to be returned from the subquery. Your subquery is returning multiple values, hence the error.
To fix this you need to further restrict the subquery so it returns one row for each row in ME. From your data model the only way to do this is with the BATCH_ID, like so:
update me
set a6 = (select att
from s where batch_no = (select he.batch_no
from he, s
where he.batch_no=s.batch_no
and he.batch_id = me.batch_id))
Such a solution will work provided that there is only one record in S which matches a given combination of (batch_no, batch_id). As you have not provided any sample data I can't verify that the above statement will actually solve your problem.
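The correlated update can be tried out on a small scale. The sketch below uses Python's sqlite3 module as a stand-in for Oracle, with made-up sample rows, and writes the subquery as a join between he and s, which is an equivalent rewording rather than the statement above verbatim. Note one difference: where Oracle raises an error if the subquery returns more than one row, SQLite silently takes the first row, so this sketch does not reproduce the original error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same three tables as in the question, with hypothetical sample rows.
cur.executescript("""
    CREATE TABLE he (batch_id INTEGER, batch_no TEXT);
    CREATE TABLE me (a6 TEXT, batch_id INTEGER);
    CREATE TABLE s  (batch_no TEXT, att TEXT);

    INSERT INTO he VALUES (1, 'B001'), (2, 'B002');
    INSERT INTO me VALUES (NULL, 1), (NULL, 2);
    INSERT INTO s  VALUES ('B001', 'red'), ('B002', 'blue');
""")

# The subquery is correlated on me.batch_id, so it is evaluated once per
# row of ME and returns the attribute for that row's batch only.
cur.execute("""
    UPDATE me
    SET a6 = (SELECT s.att
              FROM he
              JOIN s ON s.batch_no = he.batch_no
              WHERE he.batch_id = me.batch_id)
""")
conn.commit()

rows = cur.execute("SELECT batch_id, a6 FROM me ORDER BY batch_id").fetchall()
print(rows)
```

Rows in ME whose batch has no match in S would be set to NULL by this statement; adding a WHERE EXISTS clause with the same subquery is one way to leave such rows untouched.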

Delete data based on the count & timestamp using PL/SQL

I'm new to PL/SQL programming and I come from a DBA background. I have a requirement to delete data from both a main table and a reference table, following the logic below. We need to delete about 30M rows from the tables, so we are reducing the data based on the "State_ID" column.
The following conditions need to be considered:
1. As per the sample data given below (main table), sort the data by timestamp in descending order, keep the first 2 rows for each "State_id", and delete the rest of the data from both tables based on the "state_id" column.
2. select state_id,count(*) from maintable group by state_id order by timestamp desc Having count(*)>2;
So if state_id=1 has 5 rows, then 3 rows must be deleted, leaving the first 2 rows for state_id=1, and the same is repeated for the other state_id values.
The matching data should also be deleted from the reference table.
Could someone please help me with this issue? Thanks.
[Image: sample data from the main table]
You should be able to do each table delete as a single SQL command. Anything else would essentially force row-by-row processing, which is the last thing you want for that much data. Something like this:
delete from main_table m
where m.row_id not in (
with keep_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table)
-- filter on the window function result outside the query that computes it
select row_id from keep_me where id_row_number<3)
or
delete from main_table m
where m.row_id in (
with delete_me as (
select row_id,
row_number() over (partition by state_id
order by time_stamp desc) id_row_number
from main_table)
-- filter on the window function result outside the query that computes it
select row_id from delete_me where id_row_number>2)
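To see the row_number() approach working end to end, here is a small sketch using Python's sqlite3 module in place of Oracle (window functions need SQLite 3.25 or later). The row_id column and the sample rows are assumptions made for illustration. The row number is computed in an inner query and filtered in an outer one, because a window function result cannot be referenced in the WHERE clause of the same query block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical main table: state 1 has five rows, state 2 has three.
cur.executescript("""
    CREATE TABLE main_table (
        row_id     INTEGER PRIMARY KEY,
        state_id   INTEGER,
        time_stamp TEXT
    );
    INSERT INTO main_table VALUES
        (1, 1, '2020-01-01'), (2, 1, '2020-01-02'), (3, 1, '2020-01-03'),
        (4, 1, '2020-01-04'), (5, 1, '2020-01-05'),
        (6, 2, '2020-01-01'), (7, 2, '2020-01-02'), (8, 2, '2020-01-03');
""")

# Number the rows of each state from newest to oldest, then delete every
# row numbered 3 or higher, which keeps the two newest rows per state.
cur.execute("""
    DELETE FROM main_table
    WHERE row_id IN (
        SELECT row_id
        FROM (SELECT row_id,
                     ROW_NUMBER() OVER (PARTITION BY state_id
                                        ORDER BY time_stamp DESC) AS id_row_number
              FROM main_table)
        WHERE id_row_number > 2)
""")
conn.commit()

kept = cur.execute("SELECT row_id FROM main_table ORDER BY row_id").fetchall()
print(kept)
```

For the reference table, the same subquery could drive a second DELETE that runs before this one, matching on whatever key links the two tables; that key is not given in the question.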

Create a generic DB table

I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validated table for each Product. I want to create a generic table which will hold all the Product details plus one extra column called ProductIdentifier. The problem is that this generic table may end up holding millions of records, and fetching the data will take time.
Is there a better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
productIdentifier NUMBER,
value1 VARCHAR2(30),
value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
PARTITION p2 VALUES (2),
PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm

Validate person without value in date column

I have a table with several employees. They have the following columns empid,datecolumn1,is_valid.
Very few employees have more than one record in the table. If an employee has more than one record in the table, I would like to 'invalidate' one of the records based on the following condition:
1. If an employee has more than one record in the table, then the record with no value in datecolumn1 is valid (update is_valid to 1) and the record with a value in datecolumn1 is not valid (update is_valid to 0).
How do I accomplish this?
As Ben points out, you've stated that if datecolumn1 is NULL you want the is_valid column to be set to both 0 and 1. Assuming you fix that, you may need to adjust this CASE statement depending on which way you decide is correct.
UPDATE employees
SET is_valid = (CASE WHEN datecolumn1 IS NULL
THEN 1
ELSE 0
END)
WHERE empid IN (SELECT e.empid
FROM employees e
GROUP BY e.empid
HAVING COUNT(*) > 1)
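A quick way to check the behaviour of this update is to run it against a few rows. The sketch below does so with Python's sqlite3 module as a stand-in database; the sample rows are invented, and it assumes the reading of the question in which the record without a date is the valid one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical data: employee 1 has two records, employee 2 has one.
cur.executescript("""
    CREATE TABLE employees (empid INTEGER, datecolumn1 TEXT, is_valid INTEGER);
    INSERT INTO employees VALUES
        (1, NULL,         NULL),
        (1, '2012-05-01', NULL),
        (2, '2012-06-01', NULL);
""")

# Only employees with more than one record are touched; for those, the
# record without a date becomes valid and the dated record becomes invalid.
cur.execute("""
    UPDATE employees
    SET is_valid = CASE WHEN datecolumn1 IS NULL THEN 1 ELSE 0 END
    WHERE empid IN (SELECT e.empid
                    FROM employees e
                    GROUP BY e.empid
                    HAVING COUNT(*) > 1)
""")
conn.commit()

# SQLite sorts NULL first in ascending order, so the undated record leads.
rows = cur.execute(
    "SELECT empid, datecolumn1, is_valid FROM employees ORDER BY empid, datecolumn1"
).fetchall()
print(rows)
```

Employee 2 keeps its original is_valid value because the WHERE clause excludes employees with a single record.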
Create a staging table, and fill it with a SELECT on the original table with a GROUP BY on employee ID (or whatever your unique identifier is). Create a second staging table and fill it by selecting from the original table and excluding all rows that match rows in your grouped table. Now you have a table that contains only those people with multiple rows. In your original table, set is_valid to 0 on all rows that match an employee ID in the second staging table and also have no datecolumn1 (or perhaps that do have a datecolumn1; your question as of this writing is a bit unclear), and set is_valid to 1 on the others. Once done with that, delete the staging tables, and you should have what you need.
You could also do this with a single more complicated multiselect call, but I find it helpful to use staging tables when things get complicated.

two tables with the same sequence

Is it possible to have two tables with the same incrementing sequence?
I am trying to build a tree with ID, NAME, ParentID, and I have to join two tables.
If the two tables generate their IDs separately, the ID - ParentId tree scheme will not work.
Table A                Table B
ID | NAME | PID        ID | NAME | PID
1  | xpto | 0          1  | xpto | 1
If you are doing both inserts at the same time, you can use SEQUENCE.NEXTVAL for the insert into the first table to get a new ID, and then SEQUENCE.CURRVAL for the insert into the second table to reuse the same ID.
I found the answer: "Sequence numbers are generated independently of tables, so the same sequence can be used for one or for multiple tables."
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_6015.htm
Thanks for your help.
You could have a master table that is nothing but the sequence PK/FK and then have two child tables. Insert a row in the master just to get the sequence value and then use that value as the PK in the child tables. If the child tables share the same sequence, then why are they not one table?
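The master-table idea can be sketched as follows, using Python's sqlite3 module as a stand-in database. SQLite has no sequence objects, so an AUTOINCREMENT column plays the role of the sequence here. The table names id_master, table_a and table_b and the helper next_id are hypothetical, chosen only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# id_master exists only to hand out IDs; both child tables take their
# primary keys from it, so an ID is never used twice across the two tables.
cur.executescript("""
    CREATE TABLE id_master (id INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE table_a (
        id   INTEGER PRIMARY KEY REFERENCES id_master(id),
        name TEXT,
        pid  INTEGER
    );
    CREATE TABLE table_b (
        id   INTEGER PRIMARY KEY REFERENCES id_master(id),
        name TEXT,
        pid  INTEGER
    );
""")


def next_id(cursor):
    """Insert an empty row into the master table and return its new ID."""
    cursor.execute("INSERT INTO id_master DEFAULT VALUES")
    return cursor.lastrowid


# A root node in table A, then a child node in table B pointing at it.
a_id = next_id(cur)
cur.execute("INSERT INTO table_a VALUES (?, ?, ?)", (a_id, "xpto", 0))
b_id = next_id(cur)
cur.execute("INSERT INTO table_b VALUES (?, ?, ?)", (b_id, "child", a_id))
conn.commit()

# Combine both tables into one tree listing; IDs do not collide.
tree = cur.execute("""
    SELECT id, name, pid FROM table_a
    UNION ALL
    SELECT id, name, pid FROM table_b
    ORDER BY id
""").fetchall()
print(tree)
```

In Oracle a real sequence with NEXTVAL would replace id_master and next_id; the point of the sketch is only that one shared ID source keeps the ID - ParentID links unambiguous across both tables.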
create sequence v_seq
INCREMENT by 1
minvalue 1
maxvalue 10;
create table v_t_s_emp(v_id number,vname varchar2(10));
insert into v_t_s_emp values(v_seq.nextval,'krishna');
create table v_t_s_emp1(v_id number,vname varchar2(10));
insert into v_t_s_emp1 values(v_seq.nextval,'RAMesh');
commit;
select * from v_t_s_emp
union all
select * from v_t_s_emp1;
