Dump data into Hadoop with real-time encryption and hierarchical masking of some columns

Say I have a dataset/table (banking sector) which has the following details:
Name | Mob.no | AccountNo | Address | SSN | Salary....... |and so on.
john | 123456 | 987654321 | abx 123 | 1122 | 28000
I have to dump this into Hadoop, but while dumping I want the AccountNo and SSN columns to be encrypted as they are stored in HDFS.
This is the first part.
Now, when I am retrieving the results, decryption should happen first.
After that I want to mask some of the columns.
Say there are two people (a CEO and a Project Manager) viewing the results for john.
The CEO should be able to see all the details (columns) after decryption.
For the Project Manager, the AccountNo and Salary columns should be masked.
For example:
Name | Mob.no | AccountNo | Address | SSN | Salary....... |and so on.
john | 123456 | 9876xxxxx | abx 123 | 1122 | xxxxx
Is there any way to achieve this in Hadoop?
Encrypting columns of data while dumping into HDFS.
Masking columns based on hierarchy.
Any leads would be appreciated, since I am new to Hadoop.
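Neither step is a single built-in Hadoop feature: HDFS transparent encryption works at the file level rather than per column, and role-based column masking is typically handled by tools such as Apache Ranger on top of Hive. To make the two requirements concrete, here is a minimal Java sketch of the idea only (the class name, key handling, and masking rule are assumptions for illustration, not a Hadoop API):

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative sketch only: AES-encrypt sensitive columns before a record
// is written to HDFS, decrypt on read, then mask columns the viewer's
// role may not see.
public class ColumnProtector {
    private final SecretKeySpec key;

    public ColumnProtector(byte[] rawKey) {          // 16-, 24-, or 32-byte key
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    public String encrypt(String plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding"); // demo only; prefer GCM with an IV
        c.init(Cipher.ENCRYPT_MODE, key);
        return Base64.getEncoder()
                     .encodeToString(c.doFinal(plain.getBytes(StandardCharsets.UTF_8)));
    }

    public String decrypt(String enc) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key);
        return new String(c.doFinal(Base64.getDecoder().decode(enc)), StandardCharsets.UTF_8);
    }

    // Keep the first `visible` characters and replace the rest with 'x',
    // e.g. mask("987654321", 4) -> "9876xxxxx", mask("28000", 0) -> "xxxxx".
    public static String mask(String value, int visible) {
        StringBuilder sb = new StringBuilder(value);
        for (int i = visible; i < sb.length(); i++) sb.setCharAt(i, 'x');
        return sb.toString();
    }
}

On the read path you would decrypt every protected column first and then apply mask() only when the viewer's role is not CEO; the role itself has to come from whatever authentication layer sits in front of the cluster.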

Related

How can I merge multiple columns from two different files in Talend

Let's say I have multiple columns coming from two different files, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
And another one like this:
USERNAME | AGE |
Jonathan | 33 |
Mike | 41 |
And I want to merge the data of the columns that share the same name into one file, while keeping the columns that exist in only one file (empty where there is no data), like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
Jonathan | 33 | |
Mike | 41 | |
Sorry if the answer is obvious; I'm new to Talend. Thanks.
What tool is available to you?
The Append function in SAS, for example, can do this for you, and you can use the same append approach in Python, R, or whatever other language you intend to use.
For Talend:
Copy the complete first sub job ("subjob1 – copy me") and paste it to create a second sub job.
Link the two sub jobs using an OnSubjobOK link.
Open tFixedFlowInput, and change Records from first subjob to Records from second subjob.
Open tFileOutputDelimited on the new sub job, and tick Append.
Use a tUnite component to accomplish that.
Here is the link to the documentation: https://help.talend.com/r/fr-FR/8.0/orchestration/tunite
Your flow would be:
tFileInput1 (Excel or CSV) ------------------------------------------------|
                                                                            |-> tUnite -> tLogRow
tFileInput2 (Excel or CSV) -> tMap (add empty GENDER & CHILDREN fields) ---|
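For what it's worth, the logic that tMap plus tUnite perform here can be sketched in plain Java (the class name and inline data are made up for illustration): pad the narrower records to the full schema, then concatenate both inputs.

import java.util.*;

// Illustration of the tMap + tUnite logic: pad the narrower record to the
// full schema, then concatenate both inputs.
public class UnionFiles {
    static final List<String> SCHEMA = List.of("USERNAME", "AGE", "GENDER", "CHILDREN");

    // Fill any column missing from a row with an empty string.
    static Map<String, String> pad(Map<String, String> row) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String col : SCHEMA) out.put(col, row.getOrDefault(col, ""));
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> file1 = List.of(
            Map.of("USERNAME", "Joe", "AGE", "23", "GENDER", "male", "CHILDREN", "2"),
            Map.of("USERNAME", "Annie", "AGE", "45", "GENDER", "female", "CHILDREN", "5"));
        List<Map<String, String>> file2 = List.of(
            Map.of("USERNAME", "Jonathan", "AGE", "33"),
            Map.of("USERNAME", "Mike", "AGE", "41"));

        List<Map<String, String>> merged = new ArrayList<>();
        for (Map<String, String> r : file1) merged.add(pad(r));
        for (Map<String, String> r : file2) merged.add(pad(r)); // missing GENDER/CHILDREN become ""
        merged.forEach(System.out::println);
    }
}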

In SCD Type 2, how to find the latest record

I have a table and I have run the job as an SCD Type 2 load. The data after the first run is below:
no | name | loc
-----------------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
Then I ran the job a second time and loaded the data given below:
no | name | loc
-----------------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
1  | abc  | bang
There are no dates, flags, or run IDs here.
How do I find the second, updated record in this situation?
Thanks
I don't think you'll be able to distinguish between the updated record and the original record.
A dimension table using Type 2 SCD requires additional columns that describe the period in which each record is valid (or current), exactly for this reason.
The solution is to ensure your dimension table has these columns (Typically ValidFrom and ValidTo dates or date/times, and sometimes an IsCurrent flag for good measure). Your ETL process would then populate these columns as part of making the Type 2 updates.
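For example (dates made up for illustration), after the second run the two versions of record no = 1 would look something like this, and the latest record is simply the one with a null ValidTo (or IsCurrent = 'Y'):
no | name | loc  | ValidFrom  | ValidTo    | IsCurrent
------------------------------------------------------
1  | abc  | hyd  | 2024-01-01 | 2024-01-08 | N
1  | abc  | bang | 2024-01-08 | (null)     | Y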

How to merge two tables with same schema in Talend avoiding duplicates?

I have two tables TableA and TableB
TableA looks similar to the following:
customerId | name | email |telephone
-------------------------------------------------
00001 | Anne | anne@gmail.com | 123456
00002 | Ben | ben@gmail.com |
00003 | Ryan | ryan@yahoo.com |
TableB looks similar to the following:
customerId | name | email | telephone
---------------------------------------------------
76105 | Anne | anne@gmail.com |
89102 | Ben | ben@gmail.com | 567890
23390 | Ryan | ryan@yahoo.com | 756541
43769 | Abby | abby@yahoo.com | 890437
I'm trying to achieve the following two tables.
TableC
customerId | name | email |telephone
-------------------------------------------------
00001 | Anne | anne@gmail.com | 123456
00002 | Ben | ben@gmail.com | 567890
00003 | Ryan | ryan@yahoo.com | 756541
TableD
customerId | name | email |telephone
-------------------------------------------------
43769 | Abby | abby@yahoo.com | 890437
I was using a tMap with TableA as the main flow and TableB as the lookup. In the tMap I created an inner join between TableA and TableB using email as the join key. I wrote the inner join outputs to one table and the inner join rejects to another. However, I find some of the records missing in TableC.
What is the correct way to achieve this in Talend DI?
I think the choice of the main and the lookup flows impacts the reject catching. Here is what you need:
tMap: to do the join
tFixedFlowInput: to simulate your data
tLogRow: to display the output data
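To make the intended tMap behaviour concrete, here is a hedged plain-Java sketch of the same logic (the Customer record and hard-coded rows are just the sample data above, not a Talend API): inner-join the two tables on email, prefer the non-empty telephone for TableC, and collect the lookup rows that never matched as TableD.

import java.util.*;

// Stand-in for a Talend schema row.
record Customer(String id, String name, String email, String phone) {}

public class MergeTables {
    public static void main(String[] args) {
        List<Customer> tableA = List.of(
            new Customer("00001", "Anne", "anne@gmail.com", "123456"),
            new Customer("00002", "Ben", "ben@gmail.com", ""),
            new Customer("00003", "Ryan", "ryan@yahoo.com", ""));
        List<Customer> tableB = List.of(
            new Customer("76105", "Anne", "anne@gmail.com", ""),
            new Customer("89102", "Ben", "ben@gmail.com", "567890"),
            new Customer("23390", "Ryan", "ryan@yahoo.com", "756541"),
            new Customer("43769", "Abby", "abby@yahoo.com", "890437"));

        // Index the lookup table by the join key.
        Map<String, Customer> lookup = new LinkedHashMap<>();
        tableB.forEach(c -> lookup.put(c.email(), c));

        List<Customer> tableC = new ArrayList<>();
        for (Customer a : tableA) {
            Customer b = lookup.remove(a.email()); // consume matches
            if (b != null) {
                String phone = a.phone().isEmpty() ? b.phone() : a.phone();
                tableC.add(new Customer(a.id(), a.name(), a.email(), phone));
            }
        }
        List<Customer> tableD = new ArrayList<>(lookup.values()); // unmatched = rejects
        System.out.println("TableC: " + tableC);
        System.out.println("TableD: " + tableD);
    }
}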

Dynamically Identify Columns in External Tables

We have a process wherein we upload employee data from multiple legislations (e.g. US, Philippines, Latin America) via SQL*Loader.
This happens at least once a week. The current process is that they create a control file every time they load employee information, then load it into staging tables using SQL*Loader.
I was hoping to simplify the process by creating an external table and running a concurrent request to put the data into our staging tables.
There are two stumbling blocks I'm encountering:
There are some columns which are not being used by some legislations.
Example: US uses the column "Veteran_Information", while the Philippines and Latin America don't.
Philippines uses "SSS_Number" while US and Latin America don't.
Latin America uses a "Medical_Insurance" Column while US and Philippines don't.
Something like below:
US: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, VETERAN_INFORMATION
PHL: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, SSS_NUMBER
LAT: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, MEDICAL_INSURANCE
Business users don't use a standard CSV template/format.
Since the file is being sent by non-IT business users, they don't usually follow a prescribed format (a training/user issue, probably):
they often don't follow the correct order of columns
they often don't follow the correct number of columns
they often don't follow the correct names of columns
Something like below:
US: LEGISLATION, EMPLOYEE_ID, VETERAN_INFORMATION, DATE_OF_BIRTH, EMAIL_ADD
PHL: EMP_NUM, LEGISLATION, DOB, SSS_NUMBER, EMAIL_ADDRESS
LAT: LEGISLATION, PS_ID, BIRTH_DATE, EMAIL, MEDICAL_INSURANCE
Is there a way for external tables to identify the correct order and naming of columns even if they're not in the correct order or naming convention in the file?
Taking the column data from problem 2:
US: LEGISLATION | EMPLOYEE_ID | VETERAN_INFORMATION | DATE_OF_BIRTH | EMAIL_ADD
US | 111 | No | 1967 | vet@gmail.com
PHL: EMP_NUM | LEGISLATION | DOB | SSS_NUMBER | EMAIL_ADDRESS
222 | PHL | 1898 | 456789 | pinoy@gmail.com
LAT: LEGISLATION | PS_ID | BIRTH_DATE | EMAIL | MEDICAL_INSURANCE
HON | 333 | 1956 | hon@gmail.com | Yes
I would like it to be like this when it appears in the External Table:
LEGISLATION | EMPLOYEE_NUMBER | DATE_OF_BIRTH | VETERAN_INFORMATION | SSS_NUMBER | MEDICAL_INSURANCE | EMAIL_ADDRESS
US | 111 | 1967 | Y | (NULL) | (NULL) | vet@gmail.com
PHL | 222 | 1898 | (NULL) | 456789 | (NULL) | pinoy@gmail.com
HON | 333 | 1956 | (NULL) | (NULL) | Yes | hon@gmail.com
Is there a way for External Tables to do something like above?
Thanks in advance!
The simplest approach would be:
Use three distinct load scripts, one for each type of input (US, PHL, HON). Each script discards the other two record types, places the columns in the right position (possibly doing some transformation, like 'No' -> 'N'), and inserts NULL for columns that are not present for that record type.
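As an illustration of that routing logic only (the class name and column mappings below are assumptions drawn from the sample layouts, not an existing API), each "script" keeps one legislation and maps its column order onto the canonical staging schema:

import java.util.*;

// Route a row of one legislation onto the canonical staging schema,
// leaving the columns that legislation does not use as NULL.
public class RouteLegislation {
    static final List<String> STAGING = List.of("LEGISLATION", "EMPLOYEE_NUMBER",
        "DATE_OF_BIRTH", "VETERAN_INFORMATION", "SSS_NUMBER", "MEDICAL_INSURANCE", "EMAIL_ADDRESS");

    // Per-legislation mapping: position in the incoming file -> staging column.
    static final Map<String, List<String>> LAYOUT = Map.of(
        "US",  List.of("LEGISLATION", "EMPLOYEE_NUMBER", "VETERAN_INFORMATION", "DATE_OF_BIRTH", "EMAIL_ADDRESS"),
        "PHL", List.of("EMPLOYEE_NUMBER", "LEGISLATION", "DATE_OF_BIRTH", "SSS_NUMBER", "EMAIL_ADDRESS"),
        "LAT", List.of("LEGISLATION", "EMPLOYEE_NUMBER", "DATE_OF_BIRTH", "EMAIL_ADDRESS", "MEDICAL_INSURANCE"));

    static Map<String, String> route(String legislation, String csvLine) {
        String[] fields = csvLine.split(",");
        List<String> layout = LAYOUT.get(legislation);
        Map<String, String> row = new LinkedHashMap<>();
        STAGING.forEach(c -> row.put(c, null)); // unused columns stay NULL
        for (int i = 0; i < layout.size() && i < fields.length; i++)
            row.put(layout.get(i), fields[i].trim());
        return row;
    }

    public static void main(String[] args) {
        System.out.println(route("PHL", "222,PHL,1898,456789,pinoy@gmail.com"));
    }
}

This only works once each file's column order is known; it does not solve the second problem of users reordering or renaming columns arbitrarily, which is why the answer suggests one fixed script per agreed input type.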

How to repeat query in Oracle Forms upon dynamically changing ORDER_BY clause?

I have an Oracle Forms 6i form with a data block that consists of several columns.
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
------------------------------------------------------------------------------
The user can press F7 to enter query mode (for example, he or she types JOH% in the LAST_NAME field and H% in the DEPARTMENT field), then F8 to execute the query and see the results. In this example, all employees whose last name starts with JOH and who work in any department starting with H will be listed. Here is a sample output of that query:
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| MIKE | JOHN | HUMAN RES. | 05-MAY-82 | 02-FEB-95 | |
| BEN | JOHNATHAN | HOUSING | 23-APR-76 | 16-AUG-98 | |
| SMITH | JOHN | HOUSING | 11-DEC-78 | 30-JUL-91 | |
| | | | | | |
------------------------------------------------------------------------------
I then added a small button on top of each column to allow the user to sort the data by that column, via a WHEN-BUTTON-PRESSED trigger:
set_block_property('dept', order_by, 'first_name desc');
The good news is that the ORDER_BY does change. The bad news is that the user never notices the change, because he or she needs to run another query to see the output ordered by the selected column. In other words, the user will only notice the change on the next query he or she executes.
I tried to automatically execute the query upon changing the ORDER_BY clause like this:
set_block_property('dept', order_by, 'first_name desc');
go_block('EMPLOYEE');
do_key('EXECUTE_QUERY');
/* EXECUTE_QUERY -- same thing */
but what happens is that all data from the table is selected, ignoring the criteria that the user initially set while in query mode.
I also searched for a solution to this problem, and most of the suggestions deal with SYSTEM.LAST_QUERY and DEFAULT_WHERE. The problem is, LAST_QUERY can refer to a different block from a different form, one that is not valid for the currently displayed data block.
How can I do the following in just one button press:
1- Change the ORDER_BY clause of the currently active data block,
and: 2- Execute the last query that the user executed, using the same criteria that were set?
Any help will be highly appreciated.
You can get the last query of the block with the GET_BLOCK_PROPERTY built-in function:
GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
Another option is to provide separate search field(s) on the form, instead of using the QBE functionality.
