I'm facing a data transformation issue:
I have the table below with 3 columns: client, event, timestamp.
I basically want to assign a sequence number to all events for a given client based on timestamp, which is the "Sequence" column I added below.
Client Event TimeStamp Sequence
C1 Ph 2014-01-30 12:15:23 1
C1 Me 2014-01-31 15:11:34 2
C1 Me 2014-01-31 17:16:05 3
C2 Me 2014-02-01 09:22:52 1
C2 Ph 2014-02-01 17:22:52 2
I can't figure out how to create this sequence number in Hive or Pig. Would you have any clue?
Thanks in advance!
Guillaume
Put the records for each client in a bag (by grouping on client), sort the tuples inside each bag by the TimeStamp field, and then use the Enumerate function.
Something like below (I did not execute the code, so you might need to clean it up a bit):
-- assuming the relation 'input' contains 3 columns: client, event, timestamp
-- Enumerate is assumed here to be DataFu's bag-enumeration UDF (REGISTER the DataFu jar first)
DEFINE Enumerate datafu.pig.bags.Enumerate('1');   -- '1' makes the numbering start at 1
input2 = GROUP input BY client;
input3 = FOREACH input2 {
    sorted = ORDER input BY timestamp;
    GENERATE FLATTEN(Enumerate(sorted));
};
We eventually modified the Enumerate source the following way, and it works great:
public void accumulate(Tuple arg0) throws IOException {
    // nevents, i, tampon, outputBag and count are instance fields of the UDF
    nevents = 13;
    i = nevents + 1;
    DataBag inputBag = (DataBag) arg0.get(0);
    Tuple t2 = TupleFactory.getInstance().newTuple();
    for (Tuple t : inputBag) {
        Tuple t1 = TupleFactory.getInstance().newTuple(t.getAll());
        tampon = t1.get(2).toString();
        if (tampon.equals("NA souscription Credit Conso")) {
            // this event starts a new output tuple; flush the previous one if it is still open
            if (i <= nevents) {
                outputBag.add(t2);
                t2 = TupleFactory.getInstance().newTuple();
            }
            i = 0;
            t2.append(t1.get(0).toString());
            t2.append(t1.get(1).toString());
            t2.append(t1.get(2).toString());
            i++;
        } else if (i < nevents) {
            t2.append(tampon);
            i++;
        } else if (i == nevents) {
            // the output tuple reached its maximum size; emit it and start a new one
            t2.append(tampon);
            outputBag.add(t2);
            i++;
            t2 = TupleFactory.getInstance().newTuple();
        }
        // proactively spill the output bag to disk every million input tuples
        if (count % 1000000 == 0) {
            outputBag.spill();
            count = 0;
        }
        count++;
    }
    if (t2.size() != 0) {
        outputBag.add(t2);
    }
}
I have the following dataset:
Movies: moviename, genre1, genre2, genre3, ..., genre19
(All the genres above have values 0 or 1; 1 indicates that the movie is of that genre.)
Now I want to find which movie(s) have the fewest genres.
I tried the below Pig script:
items = load 'path' using PigStorage('|') as (mName:chararray,g1:int,g2:int,g3:int,g4:int,g5:int,g6:int,g7:int,g8:int,g9:int,g10:int,g11:int,g12:int,g13:int,g14:int,g15:int,g16:int,g17:int,g18:int,g19:int);
sumGenre = foreach items generate mName, g1+g2+g3+g4+g5+g6+g7+g8+g9+g10+g11+g12+g13+g14+g15+g16+g17+g18+g19 as sumOfGenres;
groupAll = group sumGenre All;
In the next step, by using MIN(sumGenre.sumOfGenres), I can get the minimum genre count, but what I am looking for is the movie name(s) with the least number of genres, alongside the number of genres of that movie.
Can someone please help?
1. Is there any other, easier way to get the sum of g1+g2+...+g19?
2. Also, how do I output the movie(s) that have the least number of genres?
After the groupAll:
minGenre = foreach groupAll generate MIN(sumGenre.sumOfGenres) as minG;
Then do a left outer join between minGenre by minG and sumGenre by sumOfGenres to get the list of movies having the least number of genres:
leastGenreMovies = join minGenre by minG left outer, sumGenre by sumOfGenres;
Hope this will help.
For a dynamic row field sum you can use a UDF like this:
import java.io.IOException;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DynRowSum extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple v) throws IOException {
        List<Object> olist = v.getAll();
        int sum = 0;
        int cnt = 0;
        for (Object o : olist) {
            cnt++;
            // skip the first field (the movie name) and sum the remaining genre flags
            if (cnt != 1) {
                sum += (Integer) o;
            }
        }
        return sum;
    }
}
In Pig, update the script like this (after REGISTERing the jar that contains DynRowSum):
grunt> sumGenre = foreach items generate mName, DynRowSum(*) as sumOfGenres;
The advantage here is that if the number of genres increases or decreases, the code stays the same.
Alternatively, without listing all 19 genre fields:
a = LOAD 'path';
b = FOREACH a GENERATE FLATTEN(STRSPLIT($0, '\\|'));
c = FOREACH b GENERATE $0 AS movie, FLATTEN(TOBAG(*)) AS genre;
d = FILTER c BY movie != genre;                        -- drop the row where the "genre" value is the movie name itself
d2 = FOREACH d GENERATE movie, (int) genre AS genre;   -- STRSPLIT yields chararrays, so cast before summing
e = GROUP d2 BY movie;
f = FOREACH e GENERATE group, SUM(d2.genre) AS genreCount;
i = ORDER f BY genreCount;
j = LIMIT i 1;
Let's assume that rowkey 1 has values for f1:c1 and f1:c2,
whereas rowkey 2 has a value for f1:c1 only; row 2 doesn't have f1:c2.
How do I recognize such rows (the ones without that column populated)?
If you want to check this row by row, then try something like this:
HTable t = new HTable(conf, "yourTable");
ResultScanner scanner = t.getScanner(new Scan());
for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
    // getValue() returns null when the row has no cell for that column
    byte[] value = rr.getValue(Bytes.toBytes("YourFamily"), Bytes.toBytes("YourQualifier"));
    if (value != null && Bytes.equals(value, Bytes.toBytes("d"))) {
        // do something
    } else {
        continue; // if you want to skip the row
    }
}
See the Result class.
result.getFamilyMap() is one more way, but it's not recommended due to performance; see the documentation of that method.
However, HTableDescriptor already has a hasFamily method:
boolean hasFamily(byte[] familyName)
Checks to see if this table contains the given column family
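A minimal sketch of the per-row check (reusing the HTable handle t from the snippet above and the f1/c2 names from the question): Result.containsColumn reports whether a row has a cell for a given column.
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("f1"));
ResultScanner colScanner = t.getScanner(scan);
for (Result r : colScanner) {
    // containsColumn is false when the row has no cell for f1:c2
    if (!r.containsColumn(Bytes.toBytes("f1"), Bytes.toBytes("c2"))) {
        System.out.println("Row without f1:c2 -> " + Bytes.toString(r.getRow()));
    }
}
colScanner.close();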
I have a remote table which gets updated daily. I need to fetch the new rows and insert them into my local table. I wish to achieve this without using a POJO for the table, so that the whole process is fast as well as scalable to accommodate any more tables that I may need to add to this process.
Is it possible to eliminate the POJO? Will it be more efficient than using a POJO?
Including the code that I have written:
int columnCount = rs.getMetaData().getColumnCount();
if (rs.next()) {
    // copy the column values of this row into a plain list of strings
    aListObj = new ArrayList<String>();
    for (int i = 1; i <= columnCount; i++) {
        aListObj.add(rs.getObject(i).toString());
    }
} else {
    // no row fetched: release the resources
    if (rs != null) {
        rs.close();
    }
    if (srcConn != null) {
        srcConn.close();
    }
}
return aListObj;
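For illustration, a minimal sketch of one POJO-free way to do the copy: read each row generically via ResultSetMetaData and feed the values straight into a batched INSERT on the local table. The connection variables (srcConn, destConn) and table names below are placeholders, not from the original code:
try (PreparedStatement select = srcConn.prepareStatement("SELECT * FROM remote_table");
     ResultSet rsCopy = select.executeQuery()) {
    int cols = rsCopy.getMetaData().getColumnCount();

    // build an INSERT with one '?' placeholder per source column
    StringBuilder sql = new StringBuilder("INSERT INTO local_table VALUES (");
    for (int i = 1; i <= cols; i++) {
        sql.append(i == 1 ? "?" : ", ?");
    }
    sql.append(")");

    try (PreparedStatement insert = destConn.prepareStatement(sql.toString())) {
        while (rsCopy.next()) {
            for (int i = 1; i <= cols; i++) {
                insert.setObject(i, rsCopy.getObject(i));   // no per-table POJO needed
            }
            insert.addBatch();
        }
        insert.executeBatch();
    }
}
Whether this beats a POJO-based version usually depends on batching and network round trips rather than on the POJO itself.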
I am a student learning how to use Pig scripts on the Hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log, and I am able to perform several queries and use the COUNT function... but no luck with the SUM function, which I really need in one case. The code I used is below:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C, I get a result displaying many records indicating the amount of data received/sent for the filter applied, e.g.:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show the total amount received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First make sure that the fields you are summing are of a numeric type such as int or long.
Use DESCRIBE A; to check the data types.
After that, since you have applied the filter and then grouped F1:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
you should sum over F1 instead of A inside the nested foreach:
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will see what I mean: the grouped relation contains bags of F1, not A.
I guess this should solve the error. Finally, check the logic of what you want in the result; I have not verified that. Hope this helps.
PS - I am also a student and new to PIG.
A lucky guess here, I'm new to Pig too :)
I'm not sure if SUM can be applied to a chararray (that would explain the error), so make rcvd and sent of type int and then generate the two sums over the grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
Hope I helped a little!
Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach grpd1 generate group, SUM(F1.rcvd), SUM(F1.sent);
dump C;
The input to the reducer is as follows:
key: 12
List<values>:
1,2,3,2013-12-23 10:21:44
1,2,3,2013-12-23 10:21:59
1,2,3,2013-12-23 10:22:07
The output needed is as follows:
1,2,3,2013-12-23 10:21:44,15
1,2,3,2013-12-23 10:21:59,8
1,2,3,2013-12-23 10:22:07,0
Please note the last column is 10:21:59 minus 10:21:44, i.e. Date(next) - Date(current), in seconds.
I tried loading the values into memory and subtracting, but it causes a Java heap memory issue; the data size for this key is huge (> 1 GB) and does not fit into main memory. Your help is highly appreciated.
Perhaps something along the lines of this pseudocode in your reduce() method:
long lastDate = 0;
Text lastValue = null;                      // assuming the values are Text
for (Text value : values) {
    long currentDate = parseDateIntoMillis(value.toString());   // your own helper: timestamp -> epoch millis
    if (lastValue != null) {
        // emit the previous value together with (date of this value - its own date), in seconds
        context.write(key, new Text(lastValue.toString() + "," + (currentDate - lastDate) / 1000));
    }
    lastDate = currentDate;
    lastValue = new Text(value);            // copy: Hadoop reuses the value object between iterations
}
// the last value has no successor, so its difference is 0
context.write(key, new Text(lastValue.toString() + ",0"));
Obviously there will be tidying up to do but the general idea is fairly simple.
Note that because each value's output needs the date of the next value, the loop writes the previous value on each iteration (so nothing is written on the first pass), and the extra write after the loop emits the final value with a difference of 0, so all values are accounted for.
If you have any questions feel free to ask away.
You can do it with the following code (a sketch; it assumes the reducer's values are Text and that the timestamp is the fourth comma-separated field):
@Override
protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    Date currentDate = null;
    LongWritable diff = new LongWritable();
    for (Text value : values) {
        Date nextDate;
        try {
            nextDate = fmt.parse(value.toString().split(",")[3]);
        } catch (ParseException e) {
            throw new IOException(e);
        }
        if (currentDate != null) {
            // difference in seconds between consecutive timestamps
            diff.set(Math.abs(nextDate.getTime() - currentDate.getTime()) / 1000);
            context.write(key, diff);
        }
        currentDate = nextDate;
    }
}