Hadoop PIG output is not split in multiple files with PARALLEL operator - hadoop

Looks like I'm missing something. The number of reducers determines how many files are created in HDFS, but my data is not split across those files. What I noticed is that if I group by a key whose values are sequential, it works fine; for example, the data below splits nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code that I used, which works fine for one input and not for the other, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage ();
The above code creates two output part files. For the first input it splits the data nicely and puts key 1 in part-r-00000 and key 2 in part-r-00001. For the second input it also creates two part files, but all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files and adds all the data for key 1 in part-0 and all the data for key 3 in part-3 file. I found this behavior strange. BTW I'm using Cloudera CDH3B4.

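That's because the index of the reducer that a key goes to is determined as hash(key) % reducersCount. If the key is an integer, hash(key) == key. Keys 1 and 3 are both odd, so with two reducers they give the same remainder and end up on the same reducer, while with three reducers the remainders differ. When you have more data with more distinct keys, the keys will be distributed more or less evenly, so you shouldn't worry about it.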
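The rule described above can be tried out with a small Python sketch. It assumes integer keys with hash(key) == key, as the answer states; the function names are made up for illustration. The reducer indices it prints follow that simplified rule and are not guaranteed to match the part-file numbers Pig produces, since that depends on how Pig hashes the loaded key type.

```python
def reducer_for(key: int, num_reducers: int) -> int:
    """Reducer index for an integer key, assuming hash(key) == key."""
    return key % num_reducers


def partition(keys, num_reducers):
    """Distribute keys over reducers the way a hash partitioner would."""
    parts = {i: [] for i in range(num_reducers)}
    for key in keys:
        parts[reducer_for(key, num_reducers)].append(key)
    return parts


# Keys from the two inputs in the question.
first = partition([1, 2, 1, 2, 2], 2)   # keys 1 and 2, two reducers
second = partition([1, 3, 1, 3, 3], 2)  # keys 1 and 3, two reducers
third = partition([1, 3, 1, 3, 3], 3)   # keys 1 and 3, three reducers

print(first)   # {0: [2, 2, 2], 1: [1, 1]}
print(second)  # {0: [], 1: [1, 3, 1, 3, 3]}
print(third)   # {0: [3, 3, 3], 1: [1, 1], 2: []}
```

With two reducers, keys 1 and 3 collide and one reducer stays empty; with three reducers they separate again, which matches the behaviour reported in the note.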


Append multiple CSVs into one single file with Apache Nifi

I have a folder with CSV files that have the same first 3 columns and different last N columns. N is at least 2 and at most 11.
The last N columns have numbers as headers, for example:
File 1:
File 2:
Desired output:
Is there a way to append these files together with NiFi so that a new column gets created (even if I do not know the column name beforehand) when a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just appends the content of all my files together without taking the headers into account (all the headers are always appended).
What you could do is write a script that combines the rows and columns and run it with the ExecuteStreamCommand processor. This would allow you to write the custom logic in whatever language you want.
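As a rough idea of what such a script could contain, here is a minimal Python sketch. It assumes the desired output is a row-wise append whose header is the union of all input headers, with empty cells where a file lacks a column; the sample headers and values are invented, and how the script is wired into ExecuteStreamCommand (arguments, stdin) is left out.

```python
import csv
import io


def merge_csv_texts(csv_texts):
    """Append CSV contents row-wise, using the union of all headers.

    Columns keep first-seen order; missing cells are left empty.
    """
    header = []
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in header:
                header.append(name)
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample files: same first 3 columns, different numbered columns.
file1 = "id,name,date,1,2\na,x,2020,10,20\n"
file2 = "id,name,date,2,3\nb,y,2021,30,40\n"
merged = merge_csv_texts([file1, file2])
print(merged)
```

The header is written once, and a column that appears only in a later file is simply added at the end.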

How to process large file with one record dependent on another in MapReduce

I have a scenario where there is a really large file, and the record on line 1 might depend on data from line 1000, while lines 1 and 1000 may be part of separate splits. My understanding of the framework is that the record reader returns one key-value pair at a time to the mapper, and each pair is independent of the others. Since the file is divided into splits, and I want to keep it that way (i.e. making it non-splittable is not an option), can I handle this somehow, maybe by writing my own record reader, mapper or reducer?
Dependency is like -
Row1: a,b,c,d,e,f
Row2: x,y,z,p,q,r
Now x in Row2 needs to be used with, say, d in Row1 to get my desired output.
I think what you need is to implement a reduce-side join. Here you can see a better explanation of it: http://hadooped.blogspot.mx/2013/09/reduce-side-joins-in-java-map-reduce.html.
Both related values have to end up in the same reducer (determined by the key and the Partitioner), and they should be grouped together (GroupingComparator); you may also want a secondary sort to order the grouped values.
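The idea can be simulated in plain Python without Hadoop. This is only a sketch: link_key is a hypothetical rule that gives dependent rows the same join key (here it simply pairs consecutive lines, whereas the real rule depends on your data), and the shuffle function stands in for the partitioner, grouping comparator and secondary sort.

```python
from collections import defaultdict


def link_key(line_no, fields):
    """Hypothetical rule: lines 1-2 share key 0, lines 3-4 share key 1, ..."""
    return (line_no - 1) // 2


def mapper(line_no, line):
    """Emit (join key, (line number, fields)) so related rows share a key."""
    fields = line.split(",")
    return link_key(line_no, fields), (line_no, fields)


def shuffle(pairs):
    """Group mapper output by key and order each group by line number."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in groups.items()}


def reducer(key, values):
    """Combine field d of the first row with field x of the second row."""
    (_, row1), (_, row2) = values
    return key, row1[3] + row2[0]


lines = ["a,b,c,d,e,f", "x,y,z,p,q,r"]
pairs = [mapper(n, line) for n, line in enumerate(lines, start=1)]
results = [reducer(key, values) for key, values in shuffle(pairs).items()]
print(results)  # [(0, 'dx')]
```

Because the join happens in the reducer, it does not matter which input split each row originally came from.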

Create Rows depending on count in Informatica

I am new to the Informatica PowerCenter tool and am working on an assignment.
I have input data in a flat file.
data.csv contains
And Required output will be
output.csv should be like
That is, I need to create output rows depending on the value in a column. I tried it using a Java transformation and got the result.
Is there any other way to do it?
Please help.
Java transformation is a very good approach, but if you insist on an alternative implementation, you can use a helper table and a Joiner transformation.
Create a helper table and populate it with appropriate amount of rows (you need to know the maximum value that may appear in the input file).
There is one row with COUNTER=1, two rows with COUNTER=2, three rows with COUNTER=3, etc.
Use a Joiner transformation to join data from the input file and the helper table - since the latter contains multiple rows for a single COUNTER value, the input rows will be multiplied.
Depending on your RDBMS, you may be able to produce the contents of the helper table using a SQL query in a source qualifier.
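To see why the join multiplies the rows, here is a small Python sketch of the helper-table idea. It is an illustration of the logic only, not Informatica code; the input rows are invented, and the count is assumed to be the second field of each row.

```python
def build_helper(max_count):
    """Helper table: COUNTER value c appears in exactly c rows."""
    return [c for c in range(1, max_count + 1) for _ in range(c)]


def join_on_counter(input_rows, helper, count_index):
    """Equi-join input rows with helper rows on the count column."""
    return [row
            for row in input_rows
            for counter in helper
            if counter == row[count_index]]


# Hypothetical input: (name, count)
input_rows = [("A", 2), ("B", 3)]
helper = build_helper(3)
output_rows = join_on_counter(input_rows, helper, count_index=1)

print(helper)       # [1, 2, 2, 3, 3, 3]
print(output_rows)  # [('A', 2), ('A', 2), ('B', 3), ('B', 3), ('B', 3)]
```

Each input row matches as many helper rows as its count value, so it appears that many times in the output.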

MapReduce multiple mappers and reducers

I am planning to process CSV files, each around 1 MB in size, using MapReduce. They contain data like
lat , lng
18.123, 77.312
I want the mapper and reducer counts to depend on the number of input files.
For my mapper, the input is key -> filename, value -> 18.123, 77.312
But I need to calculate the distance between the first record and the second one,
i.e Geo Distance for 1st record -> 18.123, 77.312 and 18.434,77,456
2nd record -> 18.123, 77.312 and 18.434,77,456
In the mapper, one line of the CSV file is read at a time and passed to the reducer,
so I am thinking of concatenating the CSV file's records with a pipe sign (|) as the delimiter.
The CSV data above will become
18.123, 77.312|18.434,77,456|18,654,77,483|....|....|...
In the mapper I will get key -> filename and value -> all records of the CSV joined with the pipe sign.
In the reducer I will split it, calculate the distances and store them in MySQL.
Is this the correct approach, or is there some other way?
Thanks in advance.
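For reference, the reducer step described in the question (split the pipe-delimited value, then compute the distance from the first record to each following one) could look roughly like the Python sketch below. It assumes "Geo Distance" means great-circle distance via the haversine formula, uses a mean Earth radius of 6371 km, and assumes each record is a well-formed "lat, lng" pair; the sample coordinates are made up, and the MySQL write is omitted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parse_points(joined):
    """Split 'lat, lng|lat, lng|...' into a list of (lat, lng) floats."""
    points = []
    for record in joined.split("|"):
        lat, lng = (float(part) for part in record.split(","))
        points.append((lat, lng))
    return points


def distances_from_first(joined):
    """Distance from the first record to every following record."""
    points = parse_points(joined)
    first = points[0]
    return [haversine_km(*first, *point) for point in points[1:]]


# Hypothetical value as the reducer would receive it.
value = "18.123, 77.312|18.434, 77.456|18.654, 77.483"
for distance in distances_from_first(value):
    print(round(distance, 2), "km")
```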

Problems generating ordered files in Pig

I am facing two issues:
Report Files
I'm generating a Pig report, the output of which goes into several files: part-r-00000, part-r-00001, ... (These result from the same relation; multiple mappers are producing the data, so there are multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO $output USING PigStorage(',');
I'd like all of these to end up in one report, so before storing the result using HBaseStorage I sort it with PARALLEL 1: report = ORDER report BY col1 PARALLEL 1. In other words, I am forcing the number of reducers to 1 and therefore generating a single file, as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO $output USING PigStorage(',');
Is there a better way of generating a single file output?
Group By
I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers to group the result, and when I sum or count the data I get incorrect results. For example:
Instead of seeing this:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
So I end up doing my group as follows: grouped = GROUP data BY col PARALLEL 1
then I see the correct result.
I have a feeling I'm missing something.
Here is a pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group as id, SUM(val)
STORE report INTO '$outpath' USING PigStorage...
EDIT, new answers based on the extra details you provided:
1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing a hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find for your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
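The claim in point 2 can be checked with a small Python simulation. It is a sketch under the assumption of a plain hash partitioner, not Pig itself, and it uses only the first value column from the example in the question.

```python
from collections import defaultdict


def group_and_sum(rows, num_reducers):
    """Hash-partition rows by key, sum per key on each reducer, then merge."""
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for key, value in rows:
        # All rows with the same key hash to the same reducer.
        reducers[hash(key) % num_reducers][key] += value
    merged = {}
    for partial in reducers:
        merged.update(partial)  # keys never overlap between reducers
    return merged


rows = [
    ("grouped_col_val_1", 5),
    ("grouped_col_val_2", 1),
    ("grouped_col_val_1", 3),
    ("grouped_col_val_2", 5),
]

for n in (1, 2, 3, 4):
    print(n, group_and_sum(rows, n))
```

Whatever the number of reducers, each key is summed in exactly one place, so the totals are 8 and 6 every time.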
1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:
Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.
The way to do this is as follows:
C = JOIN A BY a1, B BY b1, C BY c1 USING 'merge';
2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?
