Pig Changing Schema to required type - hadoop

I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name Type Date Region Op Value
-----------------------------------------------------
john ab 20130106 D X 20
john ab 20130106 D C 19
john ab 20130106 D T 8
john ab 20130106 E C 854
john ab 20130106 E T 67
john ab 20130106 E X 98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name Type Date Region OpX OpC OpT
----------------------------------------------------------
john ab 20130106 D 20 19 8
john ab 20130106 E 98 854 67
Basically, split the Op column into 3 columns, one for each Op value. Each of these columns should contain the appropriate value from the Value column.
How can I do this in Pig?

One way to achieve the desired result:
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
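-- because of the ORDER BY above, the values in each group's bag arrive in op order (C, T, X);
-- BagToString joins them with ',' and STRSPLIT splits them back into three fields ($0=C, $1=T, $2=X)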
bs = STRSPLIT(BagToString(A.value, ','),',',3);
generate flatten(group) as (name, type, date, region),
bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
Update:
If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in the tuple v, then sort the tuple fields with a custom UDF to get the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
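-- prefix each value with its op (e.g. 'X20', 'C854') so the op is still known after BagToString/STRSPLIT;
-- TupleArrange then sorts the three strings and strips the prefixes to give the X, C, T order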
v = foreach A generate CONCAT(op, (chararray)value);
bs = STRSPLIT(BagToString(v, ','),',',3);
generate flatten(group) as (name, type, date, region),
flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending, i.e. C..., T..., X...
            result.set(0, tupleArr[2].substring(1)); // X value, op prefix stripped
            result.set(1, tupleArr[0].substring(1)); // C value
            result.set(2, tupleArr[1].substring(1)); // T value
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}

Related

Java 8: using Stream, find the latest status and, if it is 'F', replace the reason with the one from the corresponding status 'R' record

I have a List of objects with the fields shown below. I'm trying to replace the reason field of the status 'F' record with the reason field from the corresponding
status 'R' record. This needs to be done only when the latest record by timestamp per publisher has a status of 'F'.
auditedBy status time Reason
api-a S 10:20 'A1'
api-a R 10:25 'A2'
api-a F 10:30 'A3'
-----
api-b S 10:30 'B1'
api-b S 10:25 'B2'
api-b S 10:20 'B3'
-----
api-c S 10:20 'C1'
api-c S 10:30 'C1'
api-c F 10:40 'C3'
so the final output of the above data will be
auditedBy status time Reason
api-a S 10:20 'A1'
api-a R 10:25 'A2'
api-a F 10:30 'A2' ******
api-b S 10:30 'B1'
api-b S 10:25 'B2'
api-b S 10:20 'B3'
api-c S 10:20 'C1'
api-c S 10:30 'C1'
api-c F 10:40 'C3'
I'll provide my implementation below; please let me know if any improvements are needed.
First I create a map keyed by publisher, holding the latest record when its status is 'F':
api-a -> api-a,F,10:30,'A3'
api-c -> api-c,F,10:40,'C3'
Then I create a second map containing the latest status 'R' record for each publisher present in the previous map:
api-a -> api-a,R,10:25,'A2'
Finally I change the reason in the original list using both maps.
private List<DetailRecord> changeReasonFunction(List<DetailRecord> records) {
Map<String, DetailRecord> eventsFailed = records.stream()
.collect(toMap(DetailRecord::getAuditedBy, Function.identity(),
BinaryOperator.maxBy(Comparator.comparing(DetailRecord::getAuditTimestamp))))
.values()
.stream().filter(detRec -> FAILED.getStatus().equals(detRec.getStatus()))
.collect(toMap(DetailRecord::getAuditedBy, Function.identity()));
Map<String, DetailRecord> retryRecordForFailedEvents = records.stream()
.filter(rec -> eventsFailed.keySet().contains(rec.getAuditedBy()))
.filter(rec -> FAILED_RECOVERABLE_ERROR.getStatus().equals(rec.getStatus()))
.collect(toMap(DetailRecord::getAuditedBy, Function.identity(),
BinaryOperator.maxBy(Comparator.comparing(DetailRecord::getAuditTimestamp))));
records.forEach(rec -> {
if (eventsFailed.containsKey(rec.getAuditedBy()) &&
eventsFailed.get(rec.getAuditedBy()).getAuditTimestamp()
.compareTo(rec.getAuditTimestamp()) == 0) {
DetailRecord obj = retryRecordForFailedEvents.get(rec.getAuditedBy());
if (obj != null) {
rec.setExceptionReason(obj.getExceptionReason());
rec.setExceptionDetail(obj.getExceptionDetail());
}
}
});
return records;
}
You can try this with groupingBy collector like this:
private List<DetailRecord> changeReasonFunction(List<DetailRecord> records) {
return records.stream()
.collect(Collectors.groupingBy(DetailRecord::getAuditedBy))
.values().stream()
.flatMap(list -> {
Optional<DetailRecord> maxRecord = list.stream().max(Comparator.comparing(DetailRecord::getTime));
if (maxRecord.isPresent()
&& maxRecord.get().getStatus().equals("F")
&& list.stream().anyMatch(detailRecord -> detailRecord.getStatus().equals("R"))) {
maxRecord.get().setReason(
list.stream()
.filter(detailRecord -> detailRecord.getStatus().equals("R"))
.findAny().get().getReason());
}
return list.stream();
}).collect(Collectors.toList());
}
You'll need to modify it based on the datatypes you are using. I used:
public class DetailRecord {
private String auditedBy;
private String status;
private Instant time;
private String reason;
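// an all-args constructor, getters, setters and toString are assumed here (e.g. generated by the IDE or Lombok)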
}
Test:
List<DetailRecord> l = new ArrayList<>();
l.add(new DetailRecord("api-a", "S", Instant.parse("2019-01-01T10:20:00.00Z"), "A1"));
l.add(new DetailRecord("api-a", "R", Instant.parse("2019-01-01T10:25:00.00Z"), "A2"));
l.add(new DetailRecord("api-a", "F", Instant.parse("2019-01-01T10:30:00.00Z"), "A3"));
l.add(new DetailRecord("api-b", "S", Instant.parse("2019-01-01T10:30:00.00Z"), "B1"));
l.add(new DetailRecord("api-b", "S", Instant.parse("2019-01-01T10:25:00.00Z"), "B2"));
l.add(new DetailRecord("api-b", "S", Instant.parse("2019-01-01T10:20:00.00Z"), "B3"));
l.add(new DetailRecord("api-c", "S", Instant.parse("2019-01-01T10:20:00.00Z"), "C1"));
l.add(new DetailRecord("api-c", "S", Instant.parse("2019-01-01T10:30:00.00Z"), "C1"));
l.add(new DetailRecord("api-c", "F", Instant.parse("2019-01-01T10:40:00.00Z"), "C3"));
System.out.println(changeReasonFunction(l));
Output is as you expected.
I assume the same DetailRecord class as in the previous answer.
First, we group inputList by auditedBy:
Map<String, List<DetailRecord>> groupByCollection = inputList.stream()
.collect(Collectors.groupingBy(DetailRecord::getAuditedBy));
Now we can make the substitution within every list. Each list is sorted by time; the reduce step searches for the first record with status "R". Once found, it is carried along, and when a record with status "F" is subsequently encountered, its reason is replaced.
groupByCollection.entrySet().forEach(e -> e.getValue().stream()
.sorted(Comparator.comparing(DetailRecord::getTime))
.reduce(null, (DetailRecord a, DetailRecord v) -> {
if (a == null && v.getStatus().equals("R")) {
return v;
}
if (a != null && v.getStatus().equals("F")) {
v.setReason(a.getReason());
}
return a;
}, (a1, a2) -> a1)
);
groupByCollection.entrySet()
.forEach(e -> e.getValue().stream().forEach(System.out::println));
Example init list code:
List<String> inputStrings = Arrays.asList(
"api-a S 10:20 'A1'",
"api-a R 10:25 'A2'",
"api-a F 10:30 'A3'",
"api-b S 10:30 'B1'",
"api-b S 10:25 'B2'",
"api-b S 10:20 'B3'",
"api-c S 10:20 'C1'",
"api-c S 10:30 'C1'",
"api-c F 10:40 'C3'"
);
List<DetailRecord> inputList = inputStrings.stream()
.map(s -> s.split("\\s+"))
.map(s -> new DetailRecord(s[0], s[1], Instant.parse("2019-01-01T" + s[2] + ":00.00Z"), s[3].replaceAll("'", "")))
.collect(Collectors.toList());

Return results from data table in a sequence using linq

I'm fetching rows from an Excel sheet in my application that holds attendance records from the biometric machine. To get the best result I have to remove the redundant data. For that I have to manage check-in and check-out timings at regular intervals: first the check-in time for entering, then the check-out time for lunch, then another check-in on returning, and a last check-out for going home. Meanwhile the rows in Excel contain multiple check-ins and check-outs, as the employee tends to punch more than once for each.
I have managed to get the records from Excel and add them to a DataTable. Now, for the sequencing and sorting part, I'm struggling to achieve my desired result. Below is my code.
protected void btnSaveAttendance_Click(object sender, EventArgs e)
{
try
{
if (FileUpload1.HasFile && Path.GetExtension(FileUpload1.FileName) == ".xls")
{
using (var excel = new OfficeOpenXml.ExcelPackage(FileUpload1.PostedFile.InputStream))
{
var tbl = new DataTable();
var ws = excel.Workbook.Worksheets.First();
var hasHeader = true; // adjust accordingly
// add DataColumns to DataTable
foreach (var firstRowCell in ws.Cells[1, 1, 1, ws.Dimension.End.Column])
tbl.Columns.Add(hasHeader ? firstRowCell.Text
: String.Format("Column {0}", firstRowCell.Start.Column));
// add DataRows to DataTable
int startRow = hasHeader ? 2 : 1;
for (int rowNum = startRow; rowNum <= ws.Dimension.End.Row; rowNum++)
{
var wsRow = ws.Cells[rowNum, 1, rowNum, ws.Dimension.End.Column];
DataRow row = tbl.NewRow();
foreach (var cell in wsRow)
row[cell.Start.Column - 1] = cell.Text;
tbl.Rows.Add(row);
}
var distinctNames = (from row in tbl.AsEnumerable()
select row.Field<string>("Employee Code")).Distinct();
DataRow[] dataRows = tbl.Select().OrderBy(u => u["Employee Code"]).ToArray();
var ss = dataRows.Where(p => p.Field<string>("Employee Code") == "55").ToArray();
}
}
}
catch (Exception ex) { }
}
The result I'm getting is:
Employee Code Employee Name Date Time In / Out
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:48 IN
55 Alex 12/27/2018 13:49 IN
55 Alex 12/27/2018 18:08 OUT
I want the entries to alternate: first IN, then OUT, then IN, then OUT, so four rows would be generated in the result.
Expected result is:
Employee Code Employee Name Date Time In / Out
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:48 IN
55 Alex 12/27/2018 18:08 OUT
You can try a GroupBy on the result like below to collapse rows that share the same timestamp (this is written against a simple row type that exposes the combined date/time as DateTime; adapt the member access to your DataRow columns):
ss = ss.GroupBy(x => x.DateTime).Select(g => g.First()).ToArray();
Then build logic to drop successive duplicate In/Out entries, for example like below.
Here In is the field I use for the IN/OUT value:
var tt = new List<Row>();   // Row: whatever type your rows are
for (int i = 0; i < ss.Count(); i++)
{
    // keep a row only if it is the first one, or its In/Out value differs from the last row kept
    if (tt.Count == 0 || tt.Last().In != ss[i].In)
        tt.Add(ss[i]);
}

Pig sum fails with +ve and -ve values

I have below data
primary,first,second
1,393440.09,354096.08
1,4410533.33,3969479.99
1,-4803973.41,-4323576.07
I have to aggregate by the primary column and sum the first and second columns. Below is the script I am executing:
data_load = load <filelocation> using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (primary:double, first:double, second:double);
dataAgrr = group data_load by primary;
sumData = FOREACH dataAgrr GENERATE
group as data,
SUM(data_load.first) as first,
SUM(data_load.second) as second,
SUM(data_load.primary) as primary;
After executing, the output below is produced:
(1.0,0.009999999951105565,-5.820766091346741E-11,3.0)
But manually adding the values of the second column (354096.08 + 3969479.99 + -4323576.07) gives 0.
Pig uses Java "double" internally.
Testing with a sample code below
import java.math.BigDecimal;
public class TestSum {
public static void main(String[] args) {
double d1 = 354096.08;
double d2 = 3969479.99;
double d3 = -4323576.07;
System.err.println("Total in double is " + ((d3 + d2 ) + d1));
BigDecimal bd1 = new BigDecimal("354096.08");
BigDecimal bd2 = new BigDecimal("3969479.99");
BigDecimal bd3 = new BigDecimal("-4323576.07");
System.err.println("Total in BigDecimal is " + bd3.add(bd2).add(bd1));
}
}
This produces
Total in double is -5.820766091346741E-11
Total in BigDecimal is 0.00
If you need a better precision, you may want to try using "bigdecimal" instead of "double" in your script.
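For instance, here is a minimal sketch of the script above with the numeric columns declared as bigdecimal (this assumes Pig 0.12 or later, where the bigdecimal type and its exact-arithmetic SUM are available):
data_load = load <filelocation> using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (primary:bigdecimal, first:bigdecimal, second:bigdecimal);
dataAgrr = group data_load by primary;
sumData = FOREACH dataAgrr GENERATE
group as data,
SUM(data_load.first) as first,   -- exact decimal arithmetic, no binary rounding error
SUM(data_load.second) as second;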

How to remove duplicates on a column basis in Pig

Will anyone help me remove the old records from my CSV file and keep only the most recent record per key using Pig?
EX: input
Key1 sta DATE
XXXXX P38 17-10-2017
XXXXX P38 12-10-2017
YYYYY P38 11-10-2017
YYYYY P38 23-09-2017
YYYYY P38 14-09-2017
ZZZZZ P38 25-10-2017
ZZZZZ P38 10-10-2017
My expected output would be
Key1 sta DATE
XXXXX P38 17-10-2017
YYYYY P38 11-10-2017
ZZZZZ P38 25-10-2017
The header should also be included in the output.
Please suggest how I can achieve this.
A nested foreach can be used for this case:
A = LOAD '....' AS (key1:chararray, sta:chararray, date:chararray);
B =
FOREACH (GROUP A BY key1) {
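-- within each key1 group, sort the records newest-first and keep only the top one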
orderd = ORDER A BY date DESC;
ltsrow = LIMIT orderd 1;
GENERATE FLATTEN(ltsrow);
};
STORE B into 'output' using PigStorage('\t', '-schema');
To learn about nested foreach, look at this,
https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
https://community.mapr.com/thread/22034-apache-pig-nested-foreach-explaination
and on saving output with schema,
https://hadoopified.wordpress.com/2012/04/22/pigstorage-options-schema-and-source-tagging/
Below will work for you; note that the date is converted with ToDate, since ordering the raw dd-MM-yyyy strings would not give chronological order.
a = load 'pig.txt' USING PigStorage(' ') AS (name:chararray,code:chararray,x1:chararray);
b = FOREACH a GENERATE name,code,ToDate(x1,'dd-MM-yyyy') AS x1;
grpd = GROUP b BY name;
firstrecords = FOREACH grpd {
sorted = order b by x1 desc;
toprecord = limit sorted 1;
generate group,FLATTEN(toprecord);
};
dump firstrecords;

SUM, AVG in Pig are not working

I am analyzing Cluster user log files with the following code in pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for group and COUNT, but when I include AVG and SUM, it shows the error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types; all are fine. Do you have any suggestions about where I went wrong? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
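-- project the bag of cpu_used values once, then reuse it for both AVG and SUM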
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
Your Pig script is full of errors:
Do not use the same alias on both sides of = ;
When loading, specify your schema appropriately in the as (...) clause;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
CHANGE THIS TO
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the f_data on the left-hand side to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
run in local mode
dump after every line (see the sketch after this list)
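For example, a minimal local-mode debugging loop might look like this (the file path and column positions here are hypothetical placeholders, not your actual data):
-- start Pig in local mode while debugging: pig -x local debug.pig
t_data = load './log_files/sample.csv' using PigStorage(',');
dump t_data;    -- inspect the raw rows first
A = foreach t_data generate $0 as jobid, $52 as cpu_used;
dump A;         -- inspect every derived alias before adding the next step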
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
