Split a file into 4 equal parts using Apache Pig - hadoop

I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file and so on, with the last 25 lines going to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the number of records in the file will be in the millions, and the previous steps that generate the file to be split already use Pig.

I did a bit of digging on this, because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really. In this example I was using the Country sample database offered for download on dev.mysql.com:
grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';
Then if we have a look at the directory in hdfs:
[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r-- 3 hive hdfs 0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r-- 3 hive hdfs 3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r-- 3 hive hdfs 4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r-- 3 hive hdfs 4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
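For the asker's four-way case the same pattern should work, though the parts come out only roughly equal because ORDER range-partitions on a sample of the key. A sketch, assuming the relation is still called data:
grunt> storeme = order data by $0 parallel 4;
grunt> store storeme into '/user/hive/countrysplit_parallel';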
Hope that helps.

You can use some of the Pig features below to achieve your desired result.
SPLIT function http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
MultiStorage class : https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Write custom PIG storage : https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions
You have to provide some condition based on your data.
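For example, a rough SPLIT-based sketch (this hard-codes the 100-line case from the question; in practice you would derive the boundaries from a count of the data, and the aliases q1-q4 are just illustrative):
A = LOAD 'file' USING PigStorage() AS (line:chararray);
B = RANK A;
SPLIT B INTO q1 IF rank_A <= 25,
             q2 IF (rank_A > 25 AND rank_A <= 50),
             q3 IF (rank_A > 50 AND rank_A <= 75),
             q4 OTHERWISE;
STORE q1 INTO 'file1';
STORE q2 INTO 'file2';
STORE q3 INTO 'file3';
STORE q4 INTO 'file4';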

This could do it, but there may be a better option.
A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A > 0 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
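One note: RANK adds the rank column to each tuple, so if the output files should contain only the original lines, project it away before each store, along the lines of:
C = FOREACH C GENERATE line;   -- likewise for D, E and F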

My requirement changed a bit: I have to store only the first 25% of the data into one file and the rest into another file. Here is the Pig script that worked for me (after the GROUP ... ALL, $0 is the total record count and $1 is the rank, so the first filter keeps roughly the first quarter).
ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1 <= $0/4;
rest_file = filter with_max by $1 > $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');

Related

How to count number of files under specific directory in hadoop?

I'm new to the MapReduce framework. I want to find out the number of files under a specific directory by providing the name of that directory.
e.g. Suppose we have 3 directories A, B and C, which have 20, 30 and 40 part-r files respectively. So I'm interested in writing a Hadoop job which will count the files/records in each directory, i.e. I want output in a .txt file formatted as below:
A is having 20 records
B is having 30 records
C is having 40 records
All these directories are present in HDFS.
The simplest/native approach is to use the built-in HDFS commands, in this case -count:
hdfs dfs -count /path/to/your/dir >> output.txt
Or if you prefer a mixed approach via Linux commands:
hadoop fs -ls /path/to/your/dir/* | wc -l >> output.txt
Finally the MapReduce version has already been answered here:
How do I count the number of files in HDFS from an MR job?
Code:
int count = 0;
FileSystem fs = FileSystem.get(getConf());
// set recursive to true if files in subdirectories should be counted as well
boolean recursive = false;
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    count++;
    ri.next();
}
System.out.println("The count is: " + count);

Advice to make my Pig code below simpler

Here is my code; I do two GROUP ALL operations and it works. My purpose is to generate the unique user count of all students together with their total scores, plus the unique user count of students located in CA. I'm wondering if there is a good way to simplify the code to use only one GROUP operation, or any other constructive ideas to make it simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore):
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique users, unique users in CA, sum of mathScore of all students, sum of verbScore of all students):
7 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA = filter data by lid == 1;
    student_CA_sum = SUM(student_CA.sid);
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the file (student) into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count, 0 as locationCACount:long, SUM(student_Other.mathScore) as mScoreCount, SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem I also encountered an issue with Pig, which I assume is caused by improper exception handling in the UNION command. It can actually hang your command-line prompt when you execute that command, without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try it with the input file below:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script to work correctly: the first field should be COUNT(data.sid), the total number of students, rather than SUM(student_CA.sid), which only coincidentally matched the student count for the original sample input.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA_sum = COUNT(data.sid);
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)

Pig Latin - foreach generate method does not work without the first field

I am facing a strange problem with Pig's GENERATE: if I do not use the first field, the generated data seems to be wrong. Is this the expected behaviour?
a = load '/input/temp2.txt' using PigStorage(' ','-tagFile') as (fname:chararray,line:chararray) ;
grunt> b = foreach a generate $1;
grunt> dump b;
(temp2.txt)
(temp2.txt)
grunt> c = foreach a generate $0,$1;
grunt> dump c;
(temp2.txt,field1,field2)
(temp2.txt,field1,field22)
$cat temp2.txt
field1,field2
field1,field22
pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
In the example I was expecting dump b to return the values from the data file instead of the file name.
In your example you use PigStorage(' ','-tagFile'), so each line is split by space.
Then:
$0 -> field1,field2
$1 -> nothing
Just use PigStorage(',','-tagFile') instead.
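A sketch of the corrected load (the field names f1 and f2 for the two comma-separated columns are just illustrative):
a = load '/input/temp2.txt' using PigStorage(',','-tagFile') as (fname:chararray, f1:chararray, f2:chararray);
b = foreach a generate $1;   -- should now return field1 rather than the file name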

Pig Latin - adding values from different bags?

I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate a RANK for the second set of values, but starting with a value greater than the max from the first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last, most important line, which adds the value from limited to rank_new from new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both of these work (limited holds a single tuple, so limited.$0 can be read as a scalar):
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
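For completeness, a rough end-to-end sketch of this approach; the final UNION, the cast and the output path are my additions, so treat it as a starting point rather than a tested script:
existing = LOAD 'max_rank.txt' USING PigStorage(',') AS (id: int, text: chararray);
new = LOAD 'max_rank_add.txt' USING PigStorage() AS (text2: chararray);
ordered = ORDER existing BY id DESC;
limited = LIMIT ordered 1;   -- single-row relation holding the current max id
new_rank = RANK new;
-- limited.$0 is read as a scalar because limited has exactly one tuple
rank_plus_max = FOREACH new_rank GENERATE (int)(limited.$0 + rank_new) AS id, text2 AS text;
all_rows = UNION existing, rank_plus_max;
STORE all_rows INTO 'max_rank_combined' USING PigStorage(',');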

How to limit the number of concurrent jobs started by a Pig script?

I am trying to implement a simple data processing flow for a POC in Pig, using the Hortonworks sandbox.
The idea is the following: there is a set of already processed data. A new data set should be added to the old data without duplicates.
For testing purposes I use very small data sets (less than 10 KB).
For the virtual machine I've allocated 4 GB of RAM and 2 of the 4 processor cores.
Here is my Pig script:
-- CONFIGURABLE PROPERTIES
%DEFAULT atbInput '/user/hue/ATB_Details/in/1'
%DEFAULT atbOutputBase '/user/hue/ATB_Details/out/1'
%DEFAULT atbPrevOutputBase '/user/hue/ATB_Details/in/empty'
%DEFAULT validData 'valid'
%DEFAULT invalidData 'invalid'
%DEFAULT billDateDimensionName 'tmlBillingDate'
%DEFAULT admissionDateDimensionName 'tmlAdmissionDate'
%DEFAULT dischargeDateDimensionName 'tmlDischargeDate'
%DEFAULT arPostDateDimensionName 'tmlARPostDate'
%DEFAULT patientTypeDimensionName 'dicPatientType'
%DEFAULT patientTypeCodeDimensionName 'dicPatientTypeCode'
REGISTER bdw-all-deps-1.0.jar;
DEFINE toDateDimension com.epam.bigdata.etl.udf.ToDateDimension();
DEFINE toCodeDimension com.epam.bigdata.etl.udf.ToCodeDimension();
DEFINE isValid com.epam.bigdata.etl.udf.atbdetails.IsValidFunc();
DEFINE isGarbage com.epam.bigdata.etl.udf.atbdetails.IsGarbageFunc();
DEFINE toAccounntBalanceCategory com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryFunc();
DEFINE isEndOfMonth com.epam.bigdata.etl.udf.IsLastDayOfMonthFunc();
DEFINE toBalanceCategoryId com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryIdFunc();
rawData = LOAD '$atbInput';
--CLEANSING
SPLIT rawData INTO garbage IF isGarbage($0),
cleanLines OTHERWISE;
splitRecords = FOREACH cleanLines GENERATE FLATTEN(STRSPLIT($0, '\\|'));
cleanData = FOREACH splitRecords GENERATE
$0 AS Id:LONG,
$1 AS FacilityName:CHARARRAY,
$2 AS SubFacilityName:CHARARRAY,
$3 AS PeriodDate:CHARARRAY,
$4 AS AccountNumber:CHARARRAY,
$5 AS RAC:CHARARRAY,
$6 AS ServiceTypeCode:CHARARRAY,
$7 AS ServiceType:CHARARRAY,
$8 AS AdmissionDate:CHARARRAY,
$9 AS DischargeDate:CHARARRAY,
$10 AS BillDate:CHARARRAY,
$11 AS PatientTypeCode:CHARARRAY,
$12 AS PatientType:CHARARRAY,
$13 AS InOutType:CHARARRAY,
$14 AS FinancialClassCode:CHARARRAY,
$15 AS FinancialClass:CHARARRAY,
$16 AS SystemIPGroupCode:CHARARRAY,
$17 AS SystemIPGroup:CHARARRAY,
$18 AS CurrentInsuranceCode:CHARARRAY,
$19 AS CurrentInsurance:CHARARRAY,
$20 AS InsuranceCode1:CHARARRAY,
$21 AS InsuranceBalance1:DOUBLE,
$22 AS InsuranceCode2:CHARARRAY,
$23 AS InsuranceBalance2:DOUBLE,
$24 AS InsuranceCode3:CHARARRAY,
$25 AS InsuranceBalance3:DOUBLE,
$26 AS InsuranceCode4:CHARARRAY,
$27 AS InsuranceBalance4:DOUBLE,
$28 AS InsuranceCode5:CHARARRAY,
$29 AS InsuranceBalance5:DOUBLE,
$30 AS AgingBucket:CHARARRAY,
$31 AS AccountBalance:DOUBLE,
$32 AS TotalCharges:DOUBLE,
$33 AS TotalPayments:DOUBLE,
$34 AS EstimatedRevenue:DOUBLE,
$35 AS CreateDateTime:CHARARRAY,
$36 AS UniqueFileId:LONG,
$37 AS PatientBalance:LONG,
$38 AS VendorCode:CHARARRAY;
--VALIDATION
SPLIT cleanData INTO validData IF isValid(*),
invalidData OTHERWISE;
--Dimension update--
--MACROS
DEFINE mergeDateDimension(validDataSet, dimensionFieldName, previousDimensionFile) RETURNS merged {
dates = FOREACH $validDataSet GENERATE $dimensionFieldName;
oldDimensions = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
id:LONG,
monthName:CHARARRAY,
monthId:INT,
year:INT,
fiscalYear:INT,
originalDate:CHARARRAY);
oldOriginalDates = FOREACH oldDimensions GENERATE originalDate;
allDates = UNION dates, oldOriginalDates;
uniqueDates = DISTINCT allDates;
$merged = FOREACH uniqueDates GENERATE toDateDimension($0);
};
DEFINE mergeCodeDimension(validDataSet, dimensionFieldName, previousDimensionFile, outputIdField) RETURNS merged {
newCodes = FOREACH $validDataSet GENERATE $dimensionFieldName as newCode;
oldDim = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
id:LONG,
code:CHARARRAY);
allCodes = COGROUP oldDim BY code, newCodes BY newCode;
grouped = FOREACH allCodes GENERATE
(IsEmpty(oldDim) ? 0L : SUM(oldDim.id)) as id,
group AS code;
ranked = RANK grouped BY id DESC, code DESC DENSE;
$merged = FOREACH ranked GENERATE
((id == 0L) ? $0 : id) as $outputIdField,
code AS $dimensionFieldName;
};
--DATE DIMENSIONS
billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';
admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';
dischDateDim = mergeDateDimension(validData, DischargeDate, '$atbPrevOutputBase/dimensions/$dischargeDateDimensionName');
STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';
arPostDateDim = mergeDateDimension(validData, PeriodDate, '$atbPrevOutputBase/dimensions/$arPostDateDimensionName');
STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';
--CODE DIMENSION
patientTypeDim = mergeCodeDimension(validData, PatientType, '$atbPrevOutputBase/dimensions/$patientTypeDimensionName', PatientTypeId);
STORE patientTypeDim INTO '$atbOutputBase/dimensions/$patientTypeDimensionName' USING PigStorage('|');
patientTypeCodeDim = mergeCodeDimension(validData, PatientTypeCode, '$atbPrevOutputBase/dimensions/$patientTypeCodeDimensionName', PatientTypeCodeId);
STORE patientTypeCodeDim INTO '$atbOutputBase/dimensions/$patientTypeCodeDimensionName' USING PigStorage('|');
The problem is that when I run this script it never completes (gets stuck).
In the Job Browser I can see one completed job and multiple jobs with 0% progress.
If I comment out the processing of the last three files, everything works fine (i.e. three parallel jobs succeed).
I've tried a few approaches to fix this issue:
The -no_multiquery Pig parameter - this allows the script to execute completely, using only one job at a time. The main disadvantage is the huge number of generated jobs (26) and a very long execution time (nearly 15 minutes for the described script and almost 40 minutes for a more complicated version).
Working only with the parts that I develop and test, by commenting out the other parts - this is not an option in the long term.
Changing the mapred.capacity-scheduler.maximum-system-jobs property in mapred-site.xml so that there would be fewer than three jobs at once, as described here.
Changing mapred.capacity-scheduler.queue.default.maximum-capacity in capacity-scheduler.xml to configure the default queue. This approach didn't work for me either.
Allocating more memory to the sandbox virtual machine and to the mappers and reducers - no effect.
So my question is: how can I limit the number of concurrent jobs started by a Pig script?
Or maybe there is another configuration fix that allows concurrent execution of multiple jobs?
[UPDATE]
If I run the same script with the same input data from the shell console, everything works fine.
So I assume that there is some issue with Hue.
[UPDATE]
If I start a more complex script from the console it also gets stuck, but in this case the number of parallel jobs is 8.
Last time we saw this it was because the cluster had only one map task.
You can use EXEC as described here:
http://pig.apache.org/docs/r0.11.1/perf.html#Implicit-Dependencies
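If I read the linked section correctly, an exec statement with no script argument forces everything defined so far to run before Pig plans the rest of the script, so a hedged sketch for the script above would interleave the STOREs with exec, for example:
billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';
-- force the jobs defined so far to run before the next dimension is planned
exec;
admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';
exec;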
