How can I merge multiple columns from two different files in Talend - ETL

Let's say I have multiple columns coming from two different files, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
| | |
And another one like this :
USERNAME | AGE |
Jonathan | 33 |
Mike | 41 |
I want to merge the data of the columns that have the same name into one, while keeping the data of the columns that exist in only one file, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
Jonathan | 33 | |
Mike | 41 | |
Sorry if the answer is obvious; I'm new to Talend. Thanks.

What tools are available to you?
The Append function in SAS, for example, can do this for you.
You can use the same append approach in Python, R, or any other language you intend to use.
For Talend:
Copy the complete subjob1 – copy me sub job and paste it to create a second sub job.
Link the two sub jobs using an onSubjobOK link.
Open tFixedFlowInput, and change Records from first subjob to Records from second subjob.
Open tFileOutputDelimited on the new sub job, and tick Append.

Use a tUnite component to accomplish that.
Here is the link to the documentation: https://help.talend.com/r/fr-FR/8.0/orchestration/tunite
Your flow would be:
tFileInput1 (Excel or CSV) -------------------------------------------------+
                                                                            +-> tUnite -> tLogRow
tFileInput2 (Excel or CSV) -> tMap (add empty GENDER and CHILDREN fields) --+
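The logic of that flow can be sketched outside Talend. This is a minimal Python illustration, not Talend code: the helper name `unite` and the in-memory rows are assumptions made for the example. It pads the rows of the second file with empty GENDER and CHILDREN values (the tMap step) and then appends them to the rows of the first file (the tUnite step).

```python
def unite(first_rows, second_rows, columns):
    """Append second_rows to first_rows, filling any missing column with ''."""
    united = []
    for row in list(first_rows) + list(second_rows):
        # Equivalent of the tMap step: every row gets the full target schema.
        united.append({col: row.get(col, "") for col in columns})
    return united


columns = ["USERNAME", "AGE", "GENDER", "CHILDREN"]
file1 = [
    {"USERNAME": "Joe", "AGE": "23", "GENDER": "male", "CHILDREN": "2"},
    {"USERNAME": "Annie", "AGE": "45", "GENDER": "female", "CHILDREN": "5"},
]
file2 = [
    {"USERNAME": "Jonathan", "AGE": "33"},
    {"USERNAME": "Mike", "AGE": "41"},
]

merged = unite(file1, file2, columns)
for row in merged:
    print(" | ".join(row[col] for col in columns))
```

The key point is that both inputs must share the same schema before they are united, which is why the second input goes through a mapping step first.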

Related

Airtable: Join two tables into a unified master output

I have data in two different linked tables in Airtable and I need to join them together. See example:
The PERSON table looks like:
Name | Classes
----------------
John | A,B,C,F
Sally | B,F
Max | B,C
While the linked CLASSES table looks like:
Class | Date | People
---------------------------
A | 1975 | John
B | 2000 | John,Sally,Max
C | 1823 | John,Max
D | 1492 |
E | 2020 |
F | 2010 | John,Sally
What I need is:
Person|Class|Date
--------------
John | A | 1975
John | B | 2000
John | C | 1823
John | F | 2010
Sally | B | 2000
Sally | F | 2010
Max | B | 2000
Max | C | 1823
How do I get this view / table as output?
The more I see questions like this with no answer, the more I realise that Airtable just isn't a database in any real sense.
This is a perfectly reasonable question about how to join two tables after those tables have been normalised. The answer? It can't be done, at least not easily!
So what is Airtable supposed to be used for? Building non-normalised databases, otherwise known as spreadsheets!
If you click a "Class" field value such as "A" or "B" in the "Person" table, a popup opens that shows the class details.
Or, if you really need that kind of table, my suggestion is this:
Create a new table called "xxx", then write code in the scripting block that populates the new table with data from the "Person" and "Class" tables.
PS: Scripting block is only supported in the "Pro" plan.
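The logic such a script needs is small. Below is a sketch in plain Python rather than Airtable scripting code, so it shows only the flattening step; the function name `flatten` and the in-memory lists standing in for the two tables are assumptions for illustration.

```python
# In-memory stand-ins for the PERSON and CLASSES tables from the question.
person_table = [
    {"Name": "John", "Classes": ["A", "B", "C", "F"]},
    {"Name": "Sally", "Classes": ["B", "F"]},
    {"Name": "Max", "Classes": ["B", "C"]},
]
classes_table = [
    {"Class": "A", "Date": 1975},
    {"Class": "B", "Date": 2000},
    {"Class": "C", "Date": 1823},
    {"Class": "D", "Date": 1492},
    {"Class": "E", "Date": 2020},
    {"Class": "F", "Date": 2010},
]


def flatten(people, classes):
    """Return one (person, class, date) row per person/class link."""
    date_by_class = {c["Class"]: c["Date"] for c in classes}
    return [
        (person["Name"], cls, date_by_class[cls])
        for person in people
        for cls in person["Classes"]
    ]


rows = flatten(person_table, classes_table)
for name, cls, date in rows:
    print(f"{name} | {cls} | {date}")
```

Classes with no people (D and E here) produce no rows, which matches the output requested in the question.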

Efficient way to join by levenshtein in Hive or Impala

I have two tables: one (NLIST) contains about 17K records, while the other (FNAMES) contains about 57K.
I would like to join the two by comparing the records using the Levenshtein formula.
Here is the example for the content of tables:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON(true)
WHERE levenshtein (FNAMES.NICKNAMES, NLIST.S_NAME) <=4
The above code ran for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?
Hive table performance depends on several factors:
Query engine
File format
Use vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server, you can try Impala, which is generally faster than Hive.
You can also fine-tune Impala, which should help this query run faster. See "Tuning Impala for Performance".
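Independently of engine tuning, the two points raised in the question (run time and a threshold that depends on word length) can be illustrated with a small Python sketch. It is not Hive or Impala code. The `max_ratio` parameter and the `best_match` helper are assumptions introduced for the example. The sketch relies on one fact: the Levenshtein distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by more than the allowed limit can be skipped before the expensive comparison.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(nickname, names, max_ratio=0.5):
    """Return (name, distance) for the closest name, or None.

    max_ratio is a hypothetical tuning knob: the allowed distance scales
    with the length of the longer string instead of being a fixed 4.
    """
    best = None
    for name in names:
        limit = int(max_ratio * max(len(nickname), len(name)))
        if abs(len(nickname) - len(name)) > limit:
            continue  # cheap pre-filter: distance >= length difference
        distance = levenshtein(nickname.lower(), name.lower())
        if distance <= limit and (best is None or distance < best[1]):
            best = (name, distance)
    return best


names = ["Avi", "Moshe", "David"]
for nickname in ["Avile", "Moshiko", "Avi"]:
    print(nickname, best_match(nickname, names))
```

In SQL the same pre-filter could be expressed as a condition on the difference of the two string lengths placed ahead of the levenshtein call, although whether the engine evaluates it first is not guaranteed. Note also that edit distance only captures spelling similarity, so a nickname that is spelled very differently from the name it stands for may not be matched by any threshold.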

Re: Transpose data using Linq

I have a table, in the following format,
|BallotNo | City | CandidateNo | Votes
|Box1 | AA | Cand1 | 1200
|Box1 | AA | Cand2 | 1500
|Box2 | BB | Cand1 | 2500
|Box2 | BB | Cand2 | 3600
Using LINQ, I want to get a result in the format
|Box1 |AA |Cand1 |1200 |Cand2 |1500
|Box2 |BB |Cand1 |2500 |Cand2 |3600
Thanks
You are looking for a grouping option.
As I understand it, you need to group by City, which is pretty easy; see http://msdn.microsoft.com/library/bb534492.aspx for how to use the GroupBy extension method.
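The grouping idea can be sketched as follows. This is Python rather than LINQ, so it only illustrates the shape of the operation; the function name `transpose` is an assumption, and the rows are grouped on both BallotNo and City since the two always appear together in the sample data.

```python
rows = [
    ("Box1", "AA", "Cand1", 1200),
    ("Box1", "AA", "Cand2", 1500),
    ("Box2", "BB", "Cand1", 2500),
    ("Box2", "BB", "Cand2", 3600),
]


def transpose(rows):
    """Group rows by (BallotNo, City) and lay candidates out side by side."""
    grouped = {}
    for ballot, city, candidate, votes in rows:
        # Dicts keep insertion order, so groups appear in first-seen order.
        grouped.setdefault((ballot, city), []).extend([candidate, votes])
    return [list(key) + values for key, values in grouped.items()]


result = transpose(rows)
for line in result:
    print(" | ".join(str(cell) for cell in line))
```

In LINQ the equivalent would be a GroupBy on the key columns followed by a projection that flattens each group into one row.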

How to repeat query in Oracle Forms upon dynamically changing ORDER_BY clause?

I have an Oracle Forms 6i form with a data block that consists of several columns.
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
------------------------------------------------------------------------------
The user can press F7 to enter Query Mode (for example, he/she types JOH% in the LAST_NAME field and H% in the DEPARTMENT field), then F8 to execute the query and see the results. In this example, all employees whose last name starts with JOH and who work in any department starting with H will be listed. Here is a sample output of that query:
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| MIKE | JOHN | HUMAN RES. | 05-MAY-82 | 02-FEB-95 | |
| BEN | JOHNATHAN | HOUSING | 23-APR-76 | 16-AUG-98 | |
| SMITH | JOHN | HOUSING | 11-DEC-78 | 30-JUL-91 | |
| | | | | | |
------------------------------------------------------------------------------
I then added a small button on top of each column to allow the user to sort the data by the desired column, by executing WHEN-BUTTON-PRESSED trigger:
set_block_property('dept', order_by, 'first_name desc');
The good news is that the ORDER_BY does change. The bad news is that the user never notices the change, because he/she needs to enter and execute another query to see the output ordered by the selected column. In other words, the user only notices the change in the next query he/she executes.
I tried to automatically execute the query upon changing the ORDER_BY clause like this:
set_block_property('dept', order_by, 'first_name desc');
go_block('EMPLOYEE');
do_key('EXECUTE_QUERY');
/* EXECUTE_QUERY -- same thing */
but what happens is that all data from the table is selected, ignoring the criteria that the user initially set in Query Mode.
I also searched for a solution to this problem, and most of the results deal with SYSTEM.LAST_QUERY and default_where. The problem is that last_query can refer to a different block from a different form, which is not valid for the currently displayed data block.
How can I do the following in just one button press:
1- Change the ORDER_BY clause of the currently active data block
2- Execute the last query that the user executed, using the same criteria that were set?
Any help will be highly appreciated.
You can get the last query of the block with get_block_property built-in function:
GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
Another option is to provide separate search field(s) on the form, instead of using the QBE functionality.

Select only the rows whose timestamp corresponds to the current month

I am starting to try some experiments using Google SpreadSheets as a DB and for that I am collecting data from different sources and inserting them via spreadsheets API into a sheet.
Each row has a value (Column B) and a timestamp (Column A).
+---------------------+------+
| ColA | ColB |
+---------------------+------+
| 13/10/2012 00:19:01 | 42 |
| 19/10/2012 00:29:01 | 100 |
| 21/10/2012 00:39:01 | 23 |
| 22/10/2012 00:29:01 | 1 |
| 23/10/2012 00:19:01 | 24 |
| 24/10/2012 00:19:01 | 4 |
| 31/10/2012 00:19:01 | 2 |
+---------------------+------+
What I am trying to do is programmatically add the sum of all values in Column B whose Column A timestamp falls in the current month into a different cell.
Is there any function that I can use for that? Or can anyone point me in the right direction on how to create a custom function that does something like this? I know how to do this using MySQL, but I couldn't find anything for Google Spreadsheets.
Thanks in advance for any tip in the right direction.
Would native spreadsheet functions do?
=ArrayFormula(SUMIF(TEXT(A:A;"MM/yyyy");TEXT(GoogleClock();"MM/yyyy");B:B))
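For readers who would rather compute the total in the program that feeds the sheet, the same month filter can be sketched in Python. This is an illustration only: the function name `sum_for_month` is an assumption, the rows are the sample data from the question, and the timestamps are assumed to be text in day/month/year format.

```python
from datetime import datetime

rows = [
    ("13/10/2012 00:19:01", 42),
    ("19/10/2012 00:29:01", 100),
    ("21/10/2012 00:39:01", 23),
    ("22/10/2012 00:29:01", 1),
    ("23/10/2012 00:19:01", 24),
    ("24/10/2012 00:19:01", 4),
    ("31/10/2012 00:19:01", 2),
]


def sum_for_month(rows, year, month):
    """Sum the values whose timestamp falls in the given year and month."""
    total = 0
    for stamp, value in rows:
        parsed = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        if parsed.year == year and parsed.month == month:
            total += value
    return total


# Total for the month covered by the sample data.
print(sum_for_month(rows, 2012, 10))

# Total for the current month, which is what the question asks for.
today = datetime.now()
print(sum_for_month(rows, today.year, today.month))
```

Comparing both year and month mirrors the "MM/yyyy" text comparison in the formula, so rows from the same month of a different year are not counted.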

Resources