Talend: How to fix "The code of method is exceeding the 65535 bytes limit" - etl

I have a set of 5 tables with approximately 2 million rows and 450 columns each.
My job looks like this:
tDBInput 1 ---tMap-----
tDBInput 2 ---tMap-----
tDBInput 3 ---tMap---tUnite---tDBOutput
tDBInput 4 ---tMap-----
tDBInput 5 ---tMap-----
These are my 5 tables that I'm trying to union. In each tMap I add an Id to trace which table the data comes from, and I reduce the number of columns (from 450 to 20).
Then I unite the 5 flows in one tUnite that loads a table in Truncate - Insert mode.
I'm trying to make it work but I always get the same error: "The code of method tDBInput-10Process is exceeding the 65535 bytes limit".

If you only use 20 of the 450 columns, you could select just those columns in each of your tDBInput components, instead of extracting all columns and filtering them in the tMap.
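For example, the query in each tDBInput could already project only the needed columns and tag the source, along these lines (the table and column names here are placeholders for your own):

SELECT
  'TABLE_1' AS source_id,  -- literal identifying which source table the row came from
  col_a,
  col_b,
  col_c                    -- ...list only the ~20 columns you actually load
FROM table_1

With only 20 columns flowing out of each tDBInput, the schemas and the Java code Talend generates for the subjob shrink considerably; very wide schemas are usually what push a generated method over the 65535-byte limit in the first place.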

Related

In Sqoop, what does "size" mean when used with the --split-limit argument?

From the Sqoop docs:
Using the --split-limit parameter places a limit on the size of the split section created. If the size of the split created is larger than the size specified in this parameter, then the splits would be resized to fit within this limit, and the number of splits will change according to that.
What does "size" refer to here. Can some one explain with a little example.
I was just reading this and I think it would be interpreted like this.
Example: the table has an INT primary key column called ID and 1000 rows, with ID values from 1 to 1000. If you set num-mappers to 50, then you would have 50 tasks, each trying to import 20 rows. The first query would have a predicate that says WHERE ID >= 1 AND ID <= 20, the 2nd mapper would say WHERE ID >= 21 AND ID <= 40... and so on.
If you also define split-limit then, depending on the size of the splits, this parameter may adjust the number of tasks used to sqoop the data.
For example, with num-mappers set to 50 and split-limit set to 10, you would now need 100 tasks to import 10 rows of data each to get all 1000 rows. Your first task would now say WHERE ID >= 1 AND ID <= 10.
In the case of a DateTime column, the size is based on seconds. So if you have 10 years of data with 1 row for every day, you would have about 3,653 rows of data. If you set num-mappers to 10, each task would try to sqoop about 365 days of data with a predicate that looked something like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-12-31'. But if you also set split-limit to something like 2592000 (the number of seconds in 30 days), then you would need about 122 tasks to sqoop the data, and the first task would have a predicate like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-01-30'.
These two examples have both used a 1:1 ratio for column value to row count. If either of these tables had 1000 rows per value in the split-by col then ALL of those rows would be sqooped as well.
Example with the DateTime col: if every day you have loaded 1000 rows for the last 10 years, you now have 3,653,000 rows; the predicates and the number of tasks would be the same, but the number of rows sqooped in each of those tasks would be 1000x more.
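As a rough illustration of how those options fit together on the command line (the connection string, credentials, table, column and target directory below are placeholders), the first example above would be launched with something like:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --table MYTABLE \
  --split-by ID \
  --num-mappers 50 \
  --split-limit 10 \
  --target-dir /user/me/mytable_import

Without --split-limit, the 50 mappers each cover a range of 20 ID values; with --split-limit 10, each split is capped at 10 ID values (or 10 seconds for a date/time split column), so the import runs as more, smaller tasks than the requested number of mappers, as described above.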

Convert string of a column to multiple rows

For data like below
Col1
----
1
23
34
124
Output should be like below
Out
1
2
3
4
I tried the hierarchical query below but it's giving repeated data:
select substr(col1, level, 1)
from table1
connect by level <= length(col1);
I can't use DISTINCT, as this is only a sample; the main table where I have to use this query has quite a lot of data.
Thanks
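A common fix (a sketch only, not verified against your data) is to restrict each CONNECT BY hierarchy to the row it started from, so the rows no longer cross-join; this assumes the Col1 values are distinct, otherwise use a primary key column in the PRIOR condition instead:

select substr(col1, level, 1) as out
from table1
connect by level <= length(col1)
       and prior col1 = col1
       -- the non-deterministic call stops Oracle from flagging a CONNECT BY loop
       and prior dbms_random.value is not null;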

Sampling issue with Hive

"all_members" is a table in hive with 10m rows and 1 column: "membership_nbr". I want to sample 3000 rows. This is what I have done:
hive> create table sample_members as select * from all_members limit 1;
hive> insert overwrite table sample_members select membership_nbr from all_members tablesample(3000 rows);
hive> select count(*) from sample_members;
OK 45000
The result won't change if I replace 3000 rows with 300 rows.
Am I doing something wrong?
Table sampling using tablesample(3000 rows) won't fetch 3000 rows from the entire table; instead it will fetch 3000 rows from each input split.
Your query probably ran 15 mappers, so each mapper fetched 3000 rows: 3000 * 15 = 45000 rows in total. Likewise, if you change 3000 rows to 300 rows, you will get 4500 rows as output after sampling.
So, for your requirement you have to give tablesample(200 rows); each mapper will then fetch 200 rows, and 15 mappers will fetch 3000 sampled rows in total.
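Following that arithmetic (and assuming the job really does use 15 mappers - the mapper count depends on the number of input splits), the corrected statements would be:

hive> insert overwrite table sample_members select membership_nbr from all_members tablesample(200 rows);
hive> select count(*) from sample_members;

The count should then come out at roughly 15 * 200 = 3000.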
Refer to the link below for the various types of sampling:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

Split amount into multiple rows if amount>=$10M or <=$-10B

I have a table in an Oracle database which may contain amounts >= $10M or <= $-10B.
If the value is greater than or equal to $10M, I need to break it into one or more 99999999.99 chunks and also include the remainder.
If the value is less than or equal to $-10B, I need to break it into one or more 999999999.99 chunks and also include the remainder.
Your question is somewhat unreadable and you did not provide examples, but here is something for a start, which may help you or someone with a similar problem.
Let's say you have this data and you want to divide amounts into chunks not greater than 999:
id amount
-- ------
1 1500
2 800
3 2500
This query:
select id, amount,
case when level=floor(amount/999)+1 then mod(amount, 999) else 999 end chunk
from data
connect by level<=floor(amount/999)+1
and prior id = id and prior dbms_random.value is not null
...divides the amounts; the last row for each id contains the remainder. Output is:
ID AMOUNT CHUNK
------ ---------- ----------
1 1500 999
1 1500 501
2 800 800
3 2500 999
3 2500 999
3 2500 502
SQLFiddle demo
Edit: full query according to additional explanations:
select id, amount,
case
when amount>=0 and level=floor(amount/9999999.99)+1 then mod(amount, 9999999.99)
when amount>=0 then 9999999.99
when level=floor(-amount/999999999.99)+1 then -mod(-amount, 999999999.99)
else -999999999.99
end chunk
from data
connect by ((amount>=0 and level<=floor(amount/9999999.99)+1)
or (amount<0 and level<=floor(-amount/999999999.99)+1))
and prior id = id and prior dbms_random.value is not null
SQLFiddle
Please adjust numbers for positive and negative borders (9999999.99 and 999999999.99) according to your needs.
There are more possible solutions (a recursive CTE query, a PL/SQL procedure, maybe others); this hierarchical query is just one of them.
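For comparison, here is a sketch of the recursive CTE variant for the simple 999 example above (Oracle 11gR2 or later, positive amounts only, not tested against your data):

with chunks (id, amount, remaining) as (
  select id, amount, amount from data
  union all
  select id, amount, remaining - 999
  from chunks
  where remaining > 999
)
select id, amount,
       -- emit 999 for full chunks and the leftover for the last one
       case when remaining > 999 then 999 else remaining end as chunk
from chunks
order by id, remaining desc;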

Hive script - How to transform a table / find the average of certain records according to one column's name?

I want to transform a Hive table by aggregating based on averages. However, I don't want the average of an entire column; I want the average of the records in that column that have the same value in another column.
Here's an example, easier than trying to explain:
TABLE I HAVE:
Timestamp CounterName CounterValue MaxCounterValue MinCounterValue
00:00     Counter1    3            3               1
00:00     Counter2    4            5               2
00:00     Counter3    1            4               1
00:00     Counter4    6            6               1
00:05     Counter1    3            5               2
00:05     Counter2    2            2               2
00:05     Counter3    4            5               4
00:05     Counter4    6            6               5
.......
TABLE I WANT:
CounterName AvgCounterValue MaxCounterValue MinCounterValue
Counter1    3               5               1
Counter2    3               5               2
Counter3    2.5             5               1
Counter4    6               6               1
So I have a bunch of counters, each of which has multiple records (one per 5-minute time period). Every time a counter is logged, it has a value, a max value during those 5 minutes, and a min value. I want to aggregate this huge table so that it has just one record for each counter, recording the overall average value for that counter across all the records in the table, along with the overall min/max value of the counter in the table.
The reason this is difficult is that all the documentation shows how to take the average of a whole column in a table - I don't know how to split it up into groups.
Here's the script I've started with:
FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT
HighCounters.CounterName AS CounterName,
HighCounters.CounterValue AS CounterValue,
HighCounters.MaxCounterValue AS MaxCounterValue,
HighCounters.MinCounterValue AS MinCounterValue
GROUP BY HighCounters.CounterName;
And I don't know where to go from there... any ideas? Thanks!!
I think I solved my own problem:
FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT
HighCounters.CounterName AS CounterName,
AVG(HighCounters.CounterValue) AS CounterValue,
MAX(HighCounters.MaxCounterValue) AS MaxCounterValue,
MIN(HighCounters.MinCounterValue) AS MinCounterValue
GROUP BY HighCounters.CounterName;
Does this look right to you?
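For reference, the same aggregation written as an ordinary INSERT ... SELECT (equivalent to the FROM-first form above) would be roughly:

INSERT OVERWRITE TABLE MdsHighCounters
SELECT CounterName,
       AVG(CounterValue)    AS CounterValue,
       MAX(MaxCounterValue) AS MaxCounterValue,
       MIN(MinCounterValue) AS MinCounterValue
FROM HighCounters
GROUP BY CounterName;

Both forms group the rows by CounterName and aggregate the value columns per counter, which is what the question asks for.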
