Headers in Pig Output - hadoop

I have written a successful script for counting the total number of steps taken by pedestrians, and their highest step count. What I can't figure out is how to produce headers in the Pig output, so that the output looks neat and clean. Is there any way to produce headers while writing output? Following is my code:
register 'piggybank-0.15.0.jar';
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
part1 = LOAD '/home/cloudera/Pedestrian_Counts.csv' using CSVLoader(',') as (date_time, sensor_id: int, sensor_name: chararray, hourly_counts: int);
part2 = GROUP part1 BY (sensor_id, sensor_name);
part3 = FOREACH part2 GENERATE FLATTEN(group) AS (sensor_id, sensor_name), SUM(part1.hourly_counts), MAX(part1.hourly_counts);
STORE part3 into '/home/cloudera/pedestrian_result' using PigStorage('\t');
The first 5 lines of my output are as follows:
1 Bourke Street Mall (North) 49591633 5573
2 Bourke Street Mall (South) 67759939 7035
3 Melbourne Central 70973929 5890
4 Town Hall (West) 90274498 8052
5 Princes Bridge 58752043 7391
Can we place headers while writing output? Thanks in advance.

Either merge all the part-file data into a single file in the local file system that has the header information in it, or use a Hive table to store the output of this Pig script.
A Hive table used for storing the output will have its own schema.
You should be using HCatalog (HCat) for accessing Hive from Pig.
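For the Hive/HCatalog route, a minimal sketch, assuming a Hive table named pedestrian_result already exists with matching columns and that Pig is started with pig -useHCatalog (the table and column names here are only illustrative):
-- name every column so it lines up with the Hive table's schema
part4 = FOREACH part3 GENERATE sensor_id, sensor_name, $2 AS total_counts, $3 AS max_counts;
-- on older Hive/HCatalog versions the storer class is org.apache.hcatalog.pig.HCatStorer
STORE part4 INTO 'pedestrian_result' USING org.apache.hive.hcatalog.pig.HCatStorer();
The header then lives in the Hive table's schema rather than in the data files themselves.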

Related

Joining three csv files with different header in NiFi

I am new to NiFi. Currently I have a use case where I have to merge three CSV files that I extracted from XLSX files. I need to merge them on the basis of the NPI column present in all three CSV files.
files1.xlsx looks like-
NPI FULL_NAME LAST_NAME FIRST_NAME
1003002627 Arora Himanshu Arora Himanshu
1003007204 Arora Vishal Arora Vishal
files2.xlsx looks like-
NPI No Employee Number CHI/SL Acct-Unit
1003002627 147536 5812207
1003007204 185793 5854207
files3.xlsx looks like -
Individual NPI Group NPI Market
1003002627 1396935714 Houston
1003007204 1396935714 Houston
I want to left join my first CSV with the specific columns in the other CSV files, so that my desired output is:
NPI Full Name Employee Number Market
1003002627 Arora Himanshu 147536 Houston
1003007204 Arora Vishal 185793 Houston
I tried the QueryRecord processor, but I don't know how to combine three different schemas from three different CSV files. Any help on this will be highly appreciated.
NiFi doesn't have the ability to do what amounts to a SQL join on three different record sets like this out of the box, but there is a workaround here if you know Python:
Zip them up.
Ingest the zip file.
Use PutFile to put the zip to a standard location.
Write a Python script that will unzip the zip for you and do the merging.
Call the Python script immediately after PutFile with the ExecuteStreamCommand processor.
The ExecuteStreamCommand processor will read stdout, so just write your combined output to sys.stdout and it will get back into NiFi.

Comparing two large files having different order using Unix shell script

I have two text files, each of size 3.5 GB, that I want to compare using a Unix shell script. The files contain around 5 million records.
The layout of the files is as below.
<sysdate>
<Agent Name 1>
<Agent Address 1>
<Agent Address 2>
<Agent Address 3>
...
<Agent Name 2>
<Agent Address 1>
<Agent Address 2>
<Agent Address 3>
...
<Total number of records present>
Sample file.
<sysdate>
Sachin Tendulkar 11051973 M
AddrID1 AddrLn11 AddrLn12 City1 State1 Country1 Phn1 OffcAddr11 OffcAddr12 St1 Cntry1
AddrID2 AddrLn21 AddrLn22 City2 State2 Country2 Phn2 OffcAddr21 OffcAddr22 St2 Cntry2
...
Sourav Ganguly 04221975 M
AddrID1 AddrLn11 AddrLn12 City1 State1 Country1 Phn1 OffcAddr11 OffcAddr12 St1 Cntry1
AddrID2 AddrLn21 AddrLn22 City2 State2 Country2 Phn2 OffcAddr21 OffcAddr22 St2 Cntry2
...
<Total number of records present>
The order of the agent addresses in the two files is different. I need to find the records that are present in one file but not in the other, and also the mismatched records. I initially tried sorting the files using the Unix sort command, but it failed due to a server space issue. An ETL (Informatica) approach can also be considered.
Any help would be appreciated
You can use awk and start writing to a new file each time you match an agent name, giving that file the name of the agent (perhaps in subdirectories keyed on the first three characters). Next compare the directory trees produced from both input files (diff -r).
Another solution is to import all records into two different tables and use SQL to compare:
select name from table1 where name not in (select name from table2);
select name from table2 where name not in (select name from table1);
select name from table1
inner join table2 on table1.name=table2.name
where table1.address1 <> table2.address1
or table1.address2 <> table2.address2
...
In Informatica, load both the files.
Find the MD5 of each row by concatenating each column, for example:
MD5(COL1||COL2||COL3)
Now compare the MD5 values from the two files using a Joiner; this way you can find the matching and non-matching rows.
First of all, send an example of the 2nd file.
Why can't you sort the data with the Sorter transformation?
My approach would be to concatenate the first 3 columns (name, address1, address2) and make that the key, then use a Joiner transformation to match the data.
You can also do a Union transformation and after that an Aggregator transformation to count how many times the key you created is found:
if the count is equal to 2, it means the data is in both files;
if the count is equal to 1, it means the data is in just one file.
Send more info about the problem to be more specific.
Try to restructure your data first.
Keep adding the agent name and other fields to every address related to that agent. Use simple expression logic, such as a variable/counting methodology, to achieve this. By doing this your flat files will be compare-friendly and can be compared easily either in Unix or in Informatica.
Let me know if you are interested in this solution and I shall help you further.

Problems generating ordered files in Pig

I am facing two issues:
Report Files
I'm generating a Pig report whose output goes into several files: part-r-00000, part-r-00001, ... (These all come from the same relation; multiple reducers are producing the data, thus there are multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO $output USING PigStorage(',');
I'd like all of these to end up in one report, so what I end up doing, before storing the result using HBaseStorage, is sorting them with PARALLEL 1: report = ORDER report BY col1 PARALLEL 1. In other words I am forcing the number of reducers to 1, and therefore generating a single file as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO $output USING PigStorage(',');
Is there a better way of generating a single file output?
Group By
I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I mention PARALLEL 1, Pig sometimes decides to use several reducers to group the result. When I sum or count the data I then get incorrect results. For example:
Instead of seeing this:
part-r-00000:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
part-r-00001:
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
part-r-00000:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
So I end up doing my group as follows: grouped = GROUP data BY col PARALLEL 1
then I see the correct result.
I have a feeling I'm missing something.
Here is a pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val;
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group AS id, SUM(row.val);
STORE report INTO '$outpath' USING PigStorage...
EDIT, new answers based on the extra details you provided:
1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing a hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
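For reference, a rough sketch of that load side; the table, column family and field names below are made up for illustration:
-- '-loadKey true' returns the HBase row key as the first field of each tuple,
-- and HBase hands rows back sorted by that key, so no extra ORDER BY is needed
report = LOAD 'hbase://report_table'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col2 cf:col3', '-loadKey true')
         AS (col1:chararray, col2:chararray, col3:chararray);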
2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find for your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
OLD ANSWERS:
1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:
Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.
The way to do this is as follows:
C = JOIN A BY a1, B BY b1 USING 'merge';
2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?

assigning IDs to hadoop/PIG output data

I am working on a Pig script which performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total count and amount of the transactions).
My expected output should be two data files:
1) Rollup data - e.g. account A1 received 50 transactions from country AU.
2) Raw transactions - all 50 of the above transactions for A1.
My PIG script is currently creating output data source in following format
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g. ID 1 of the rollup record used as a foreign key for all 50 associated txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (they don't need to be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
here is the PIG script
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
after running this, I am getting
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to ensure you generate ids which contain a sufficient number of random bits so that the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits, where n = # of ids that need to be generated (taking log base 2, that is roughly 90 bits for a billion ids).
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
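Applied to the rollup output above, the composite-key idea looks roughly like this; the relation and field names are assumptions based on the sample layout, not the asker's actual aliases:
-- 'rollup' stands for the relation holding (Account, Country, TotalTxns, RawTransactions);
-- CONCAT expects chararrays, hence the casts
keyed = FOREACH rollup GENERATE
            CONCAT(CONCAT((chararray)account, '-'), (chararray)country) AS link_id,
            account, country, total_txns, raw_txns;
The same link_id can be attached to each raw transaction as well (for example by flattening the bag), giving you the link without inventing a numeric ID.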
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
In case your IDs are numbers and you cannot use UUIDs or other string-based IDs:
There is a library of UDFs by LinkedIn (DataFu) with a very useful UDF, Enumerate. So what you can do is group all records into a bag and pass the bag to Enumerate. Here is the code from the top of my head:
-- register the DataFu jar (0.0.9 or later, per the bug note above) and define the Enumerate UDF
REGISTER /UDF/datafu-0.0.9.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
inpt = LOAD '....' ....;
allGrp = GROUP inpt ALL;
withIds = FOREACH allGrp GENERATE FLATTEN(Enumerate(inpt));
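If you want the generated index named alongside the original columns (assuming the input has the five columns from the script in the question), you can project the flattened fields; Enumerate('1') appends a 1-based index as the last field of each tuple:
-- field positions follow the original txndata schema, plus the appended index
withNamedIds = FOREACH withIds GENERATE
                   $0 AS txnid, $1 AS sndr_acct, $2 AS sndr_cntry,
                   $3 AS rcvr_acct, $4 AS rcvr_cntry, $5 AS row_id;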

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids to another file that contains 2 columns of mappings like this (column 1 is our data, column 2 is a 3rd party's data):
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value, iterate over each id, and return the first mapped value from the passed-in mappings (thinking that is how it would logically work).
We are loading the data in PIG like this:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
there is an error during parsing: Invalid alias mappings in [data::title: chararray, data::category: chararray]
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a relation that is loaded into a UDF?
Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?
EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just want to map the first one. The UDF will iterate over the list of our category ids and will return when it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
I don't think you can do it this way in Pig.
A solution similar to what you wanted to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
output = FOREACH data GENERATE title, MAP_PRODUCT(category);
This will work if your mapping file is not too big. If it does not fit in memory, you will have to partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native join plus a nested FOREACH with ORDER BY/LIMIT 1 for each product.
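A rough sketch of that join-based alternative; the aliases and field names below are invented for illustration, and note that TOKENIZE alone does not preserve the position of each id, so to reliably get the first mapped id you would carry a position column and ORDER BY it inside the nested block, as suggested above:
-- one (title, our_id) row per id in the space-separated list
exploded = FOREACH data GENERATE title, FLATTEN(TOKENIZE(category)) AS our_id;
joined   = JOIN exploded BY our_id, mappings BY ourId;
grouped  = GROUP joined BY exploded::title;
-- keeps a single mapped value per product; add an ORDER BY on a position column
-- before the LIMIT to make it the first one rather than an arbitrary one
firsts   = FOREACH grouped {
               one = LIMIT joined 1;
               GENERATE group AS title, FLATTEN(one.theirId) AS their_id;
           };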
