I am loading two datasets, A and B:
A= LOAD [datapath]
B= LOAD [datapath]
I want to JOIN all fields of both A and B by the id field. Both A and B have the common field id, plus other fields. When I perform the JOIN by id:
AB= JOIN A by id, B by id;
The resulting dataset AB includes two identical columns for the id field. However, it should show only one column for id. What am I doing wrong here?
That's the expected behaviour: when joining two datasets, all columns are included (even the ones you are joining by).
You can check this in the Pig documentation for JOIN.
If you want to drop a column, you can do it with a GENERATE statement. But first you need to know the position of the undesired column.
If that column is, for instance, the one at position $3:
C = FOREACH AB GENERATE $0, $1, $2, $4, $5 ...;
Edit from the comments
You can also use a GENERATE statement with field names, without knowing the positions. Example:
C = FOREACH AB GENERATE A::id AS id, A::foo AS foo, B::bar AS bar;
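If you are unsure of the exact prefixed field names after the join, DESCRIBE prints the joined schema so you can see which fields to keep:
DESCRIBE AB;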
I have a query which uses the view a as follows, and the query is extremely slow:
select *
from a
where a.id = 1 and a.name = 'Ann';
The view a is made up of another four views: b, c, d, e.
select b.id, c.name, c.age, e.town
from b,c,d,e
where c.name = b.name AND c.id = d.id AND d.name = e.name;
I have created an index named c_test on the table behind c, and I need it to be used when executing the first query.
Is this possible?
Are you really using this deprecated 1980s join syntax? You shouldn't. Use proper explicit joins (INNER JOIN in your case).
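For illustration, the view's comma join from above could be rewritten with explicit joins like this (a sketch using only the columns shown in the question, not tested against your schema):
SELECT b.id, c.name, c.age, e.town
FROM b
INNER JOIN c ON c.name = b.name
INNER JOIN d ON d.id = c.id
INNER JOIN e ON e.name = d.name;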
You are joining the two tables C and D on their IDs. That should mean they are 1:1 related. If not, "ID" is a misnomer, because an ID is supposed to identify a row.
Now let's look at the access route: You have the ID from table B and the name from tables B and C. We can tell from the column name that b.id is unique and Oracle guarantees this with a unique index, if the database is set up properly.
This means the DBMS will look for the B row with ID 1, find it instantly in the index, find the row instantly in the table, see the name and see whether it matches 'Ann'.
Hence the only thing that can be slow is joining C, D, and E. Joining on unique IDs is extremely fast. Joining on (non-unique?) names is only fast if you provide indexes on the names. I'd recommend the following indexes accordingly:
create index idx_c on c (name);
create index idx_e on e (name);
To get this faster still, use covering indexes instead:
create index idx_b on b (id, name);
create index idx_c on c (name, id, age);
create index idx_d on d (id, name);
create index idx_e on e (name, town);
I have two tables: A, B.
A has prisoner_id and prisoner_name columns.
B has all other info about prisoners, including a prisoner_name column.
First I select all of the data that I need from B:
WITH prisoner_datas AS
(SELECT prisoner_name, ... FROM B WHERE ...)
Then I want to know all of the ids for my prisoner_datas. To do this I need to combine the information by the prisoner_name column, because it is common to both tables.
I did the following:
SELECT A.prisoner_id, prisoner_datas.prisoner_name, prisoner_datas. ...,
FROM A, prisoner_datas
WHERE A.prisoner_name = prisoner_datas.prisoner_name
But it runs very slowly. How can I improve performance?
Add an index on the prisoner_name join column in the B table. Then the following join should have some performance improvement:
SELECT
A.prisoner_id,
B.prisoner_name -- plus any other columns you need from B
FROM A
INNER JOIN B
ON A.prisoner_name = B.prisoner_name
Note that I used explicit join syntax here. It isn't required, and the query plan might not change, but it makes the query easier to read. I don't think the CTE changes much; the missing index on the join column is what matters here.
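For reference, the index suggested above could be created like this (the index name is just illustrative):
CREATE INDEX idx_b_prisoner_name ON B (prisoner_name);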
I have a Pig relation like the one below:
FINAL = {input_md5::type: chararray, input_md5::name: chararray, input_md5::id: long, input_md5::age: chararray, test_1::type: chararray, test_2::name: chararray}
I am trying to store all of the input_md5 columns into a Hive table:
all of input_md5::type, input_md5::name, input_md5::id, and input_md5::age, but not test_1::type or test_2::name.
Is there any command in Pig which selects only the columns of input_md5? Something like below:
STORE= FOREACH FINAL GENERATE all input_md5::type .
I know that Pig has the FOREACH FINAL GENERATE input_md5::type AS type syntax, but I have many columns, so I cannot write AS for each one in my code.
Because when I try:
STORE= FOREACH FINAL GENERATE input_md5::type .. bus_input_md5::name;
Pig throws an error:
org.apache.hive.hcatalog.common.HCatException : 2007 : Invalid column position in partition schema : Expected column <type> at position 1, found column <input_md5::type>
Thanks in advance,
Resolved this issue; below is the fix.
Create a relation with some filter condition as below:
DUMMY_RELATION = FILTER SOURCE_TABLE BY type == '';
(I used a column named type here; you can filter by any column in the table, all that matters is that we get its schema.)
FINAL_DATASET = UNION DUMMY_RELATION, SCHEMA_1, SCHEMA_2;
(this new DUMMY_RELATION should be placed first in the union)
Now you no longer have the :: operator, and your column names will match the Hive table's column names, provided your source table (the one behind DUMMY_RELATION) and the target table have the same column order.
Thanks to myself :)
I implemented Neethu's example this way. It may have typos, but it shows how to implement the idea.
tableA = LOAD 'default.tableA' USING org.apache.hive.hcatalog.pig.HCatLoader();
tableB = LOAD 'default.tableB' USING org.apache.hive.hcatalog.pig.HCatLoader();
--load empty table
finalTable = LOAD 'default.finalTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
--example operations that end up with '::' in column names
g = group tableB by (id);
j = JOIN tableA by id LEFT, g by group;
result = foreach j generate tableA::id, tableA::col2, g::tableB;
--union empty finalTable and result
result2 = union finalTable, result;
--bob's your uncle
STORE result2 INTO 'finalTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
Thanks to Neethu!
I have two data sets (1M unique strings and 1B unique strings). I want to know how many strings are common to both sets, and I am wondering what the most efficient way is to get that number using Apache Pig.
You can first join both files, like below:
A = LOAD '/joindata1.txt' AS (a1:chararray, a2:int, a3:int);
B = LOAD '/joindata2.txt' AS (b1:chararray, b2:int);
X = JOIN A BY a1, B BY b1;
Then you can count the number of rows:
grouped_records = GROUP X ALL;
count_records = FOREACH grouped_records GENERATE COUNT(X);
Does this help with your problem?
Your case doesn't fall under either a replicated, merge, or skewed join, so you have to do a default join. In the map phase each record is annotated with its source; the join key is used as the shuffle key so that the same join key goes to the same reducer; then the leftmost input is cached in memory on the reducer side and the other input is streamed through to do the join. You can also improve your join with the usual join optimizations, such as filtering out NULLs before joining and keeping the table with the largest number of tuples per key as the last table in your query.
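A minimal sketch of those optimizations, reusing the A and B relations loaded above (the new aliases are just illustrative):
-- drop records with NULL join keys before the join
A_nonnull = FILTER A BY a1 IS NOT NULL;
B_nonnull = FILTER B BY b1 IS NOT NULL;
-- keep the relation with the most tuples per key as the last join input
X = JOIN A_nonnull BY a1, B_nonnull BY b1;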
Merge join: if your data is already sorted in both data sets, you can use a merge join.
Merged = join A by a1, B by b1 USING 'merge';
Skewed join: if the data is skewed and you need finer control over the allocation to reducers, you can use a skewed join.
Skewed = join A by a1, B by b1 USING 'skewed';
I have two tables:
A: (feature:chararray, value:float)
B: (multiplier:chararray, value:float)
where A is a table with thousands of rows and B has only one row.
What I wanna do is take all the rows in A and multiply A.value by B.value.
e.g.
A:[('f1', 1.5) , ('f2', 2.3)]
B:[('mul', 2)]
I'd like to produce a table C:
C: [('f1', 3), ('f2', 4.6)]
Is there an easy way to do so?
You can do a CROSS and a FOREACH ... GENERATE.
X = CROSS A, B;
Y = FOREACH X GENERATE A::feature, A::value * B::value;
The above code has not been tested.
If you are very sure that the second table has only one row, then take the first column of the second table, hard-code the same value as an extra last column in the first table, and then do an inner join; then you can easily multiply.
Let's say the first file is plain.txt:
f1,1.5
f2,2
and here is the second file, multi.txt:
mul,2
A = load '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS(feature:chararray,value:double);
B = load '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS(operation:chararray,no:int);
C = foreach A generate feature,value,'mul' as ope;
D = join C by ope, B by operation;
E = foreach D generate feature,(value*no) as multiplied_value;