How we can combine multiple rows in a single row in Pig

I need to combine these multiple tuples into a single one using a Pig script. Could you please provide some guidelines?
dump requestFile;
Current Output
(Logging Transaction ID:21214,/var/log/tibco/,NESS-A-1-LPNameRequesttoNESS.log,tibcoTest log)
(Default Data:LP Name Request Message Executed Successfully)
(LoanPath Request ID: 88128640)
(RequestGroupID#: )
(SplitCount#: 2 )
(SplitIndex: 1)
(Correlation ID : 88128640-1 )
Desired output
(Logging Transaction ID:21214,/var/log/tibco/,NESS-A-1-LPNameRequesttoNESS.log,tibcoTest log,Default Data:LP Name Request Message Executed Successfully,LoanPath Request ID: 88128640,RequestGroupID#: ,SplitCount#: 2,SplitIndex: 1)
(Correlation ID : 88128640-1 )

What about:
requestFile = foreach requestFile generate flatten(tuple);
G = GROUP requestFile ALL;
F = FOREACH G generate requestFile;
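
The intended effect of flattening every tuple and then collecting everything with GROUP ... ALL can be modelled outside Pig. The following is a minimal Python sketch, assuming each dumped row is a tuple of fields; the `rows` sample and the `combine_rows` helper are hypothetical names used only for illustration.

```python
from itertools import chain

def combine_rows(rows):
    """Concatenate the fields of every row, in input order, into one tuple."""
    return tuple(chain.from_iterable(rows))

# Shortened sample taken from the question's "Current Output".
rows = [
    ("Logging Transaction ID:21214", "/var/log/tibco/"),
    ("Default Data:LP Name Request Message Executed Successfully",),
    ("LoanPath Request ID: 88128640",),
]
combined = combine_rows(rows)
```

Note that a Python list keeps its order, whereas a Pig bag carries no ordering guarantee, so the Pig version may need an explicit ordering step if field order matters.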

Related

ArangoDB: How to run 2 queries in parallel in community edition

Hi, I have written the 2 queries below and would like to run them in parallel rather than sequentially. Is it possible to execute them in parallel in the community edition of ArangoDB?
FOR d IN Transaction
FILTER d._to == "Account/123"
COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
totamnt = SUM(d.Amount),
daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
RETURN {
"Incoming Accounts": length ,
"Days Active": LENGTH(daysactive),
"Total Amount": totamnt
}
FOR d IN Transaction
FILTER d._from == "Account/123"
COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
totamnt = SUM(d.Amount),
daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
RETURN {
"Outgoing Accounts": length ,
"Days Active": LENGTH(daysactive),
"Total Amount": totamnt
}

Of course it is possible to run multiple requests in parallel. Just fire 2 curl calls to _api/cursor, or use 2 different arangosh shells.
Alternatively, run 2 curl calls in the same shell and set the x-arango-async header on each request to retrieve the results asynchronously, as documented here: https://www.arangodb.com/docs/stable/http/async-results-management.html#async-execution-and-later-result-retrieval
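
The "fire two requests at once" idea can also be sketched from a single client process. The Python example below is a rough sketch, not an official client: it posts each query to _api/cursor on its own thread. The endpoint `http://localhost:8529`, the `root` user with an empty password, and the helper names `post_cursor` and `run_in_parallel` are all assumptions to adapt to your setup.

```python
import base64
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values: adjust the endpoint and credentials for your own server.
BASE_URL = "http://localhost:8529"
USER, PASSWORD = "root", ""

def post_cursor(query):
    """Send one AQL query to the _api/cursor endpoint and return the parsed reply."""
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    request = urllib.request.Request(
        BASE_URL + "/_api/cursor",
        data=json.dumps({"query": query}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def run_in_parallel(queries, run_query=post_cursor):
    """Run every query on its own thread; results come back in input order."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(run_query, queries))
```

Calling `run_in_parallel([incoming_query, outgoing_query])` with the two AQL strings from the question would submit both without waiting for the first to finish; the `run_query` parameter exists only so that the transport can be swapped out.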

Snappydata - sql put into on jobserver don't aggregate values

I'm trying to create a jar to run in the snappy-job shell with streaming.
I have an aggregation function and it works perfectly in windows. But I need a table with one value for each key. Based on an example from GitHub I created a jar file, and now I have a problem with the put into SQL command.
My code for aggregation:
val resultStream: SchemaDStream = snsc.registerCQ("select publisher, cast(sum(bid)as int) as bidCount from " +
  "AggrStream window (duration 1 seconds, slide 1 seconds) group by publisher")
val conf = new ConnectionConfBuilder(snsc.snappySession).build()
resultStream.foreachDataFrame(df => {
  df.write.insertInto("windowsAgg")
  println("Data received in streaming window")
  df.show()
  println("Updating table updateTable")
  val conn = ConnectionUtil.getConnection(conf)
  val result = df.collect()
  val stmt = conn.prepareStatement("put into updateTable (publisher, bidCount) values " +
    "(?,?+(nvl((select bidCount from updateTable where publisher = ?),0)))")
  result.foreach(row => {
    println("row" + row)
    val publisher = row.getString(0)
    println("publisher " + publisher)
    val bidCount = row.getInt(1)
    println("bidcount : " + bidCount)
    stmt.setString(1, publisher)
    stmt.setInt(2, bidCount)
    stmt.setString(3, publisher)
    println("Prepared Statement after bind variables set: " + stmt.toString())
    stmt.addBatch()
  })
  stmt.executeBatch()
  conn.close()
})
snsc.start()
snsc.awaitTermination()
}
I have to update or insert into the table updateTable, but on update the current value has to be added to the one from the stream.
This is what I see when I execute the code:
select * from updateTable;
PUBLISHER |BIDCOUNT
--------------------------------------------
publisher333 |10
Then I sent a message to Kafka:
1488487984048,publisher333,adv1,web1,geo1,11,c1
and again select from updateTable:
select * from updateTable;
PUBLISHER |BIDCOUNT
--------------------------------------------
publisher333 |11
The bidCount value is overwritten instead of being added.
But when I execute the put into command from snappy-sql shell it works perfectly:
put into updateTable (publisher, bidcount) values ('publisher333',4+
(nvl((select bidCount from updateTable where publisher =
'publisher333'),0)));
1 row inserted/updated/deleted
snappy> select * from updateTable;
PUBLISHER |BIDCOUNT
--------------------------------------------
publisher333 |15
Could you help me with this case? Maybe someone has another solution for inserting or updating a value using SnappyData?
Thank you in advance.

In the streaming case the bidCount value is read from the tomi_update table, but in the snappy-sql case it is read from updateTable. Is this intentional? Maybe you wanted to use updateTable in both cases?

Pig- Unable to DUMP data

I have two datasets, one for movies and the other for ratings.
Movies Data looks like
MovieID#Title#Genre
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
Ratings Data looks like
UserID#MovieID#Ratings#RatingsTimestamp
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
My Script is as follows
1) movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#') AS (movieid:int,
moviename:chararray,moviegenere:chararray);
2) ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#') AS (Userid:int,
movieid:int,ratings:int,timestamp:long);
3) moviedata_ratingsdata_join = JOIN movies_data BY movieid, ratings_data BY movieid;
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data.movieid;
5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
AVG(moviedata_ratingsdata_join.ratings) AS Averageratings, (moviedata_ratingsdata_join.Userid) AS userid;
6) DUMP moviedata_ratingsdata_averagerating;
I am getting this error
2017-03-25 06:46:50,332 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_join_group: Local Rearrange[tuple]{int}(false) - scope-95 Operator Key: scope-95): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_averagerating: New For Each(false,false)[bag] - scope-83 Operator Key: scope-83): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,Toy Story (1995),Animation|Children's|Comedy), 2nd :(2,Jumanji (1995),Adventure|Children's|Fantasy) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
If I remove line 6, the script executes successfully.
Why can't I DUMP the relation generated in line 5?

Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.
Relations movies_data and ratings_data both have a column movieid. When forming the relation moviedata_ratingsdata_join_group, use the :: operator to identify which movieid column to use for GROUP.
So your line 4) would look like this:
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;

Null error for nonempty data using Generate in Pig/Hadoop

I am working on Task 2 in this link:
https://sites.google.com/site/hadoopbigdataoverview/certification-practice-exam
I used the code below
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',');
dump a
a_top = limit a 5
a_top shows the first 5 rows, and there are non-null values for each Year.
Then I type
a_clean = filter a BY NOT ($4=='NA');
aa = foreach a_clean generate a_clean.Year;
But that gives the error
ERROR 1200: null
What is wrong with this?
EDIT: I also tried
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',') AS (Year:chararray,Month:chararray,DayofMonth:chararray,DayOfWeek:chararray,DepTime:chararray,CRSDepTime:chararray,ArrTime:chararray,CRSArrTime:chararray,UniqueCarrier:chararray,FlightNum:chararray,TailNum:chararray,ActualElapsedTime:chararray,CRSElapsedTime:chararray,AirTime:chararray,ArrDelay:chararray,DepDelay:chararray,Origin:chararray,Dest:chararray,Distance:chararray,TaxiIn:chararray,TaxiOut:chararray,Cancelled:chararray,CancellationCode:chararray,Diverted:chararray,CarrierDelay:chararray,WeatherDelay:chararray,NASDelay:chararray,SecurityDelay:chararray,LateAircraftDelay:chararray);
and
aa = foreach a_clean generate a_clean.Year
but the error was
ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay), 2nd :(2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA)

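Inside FOREACH, refer to a field directly rather than through the relation name: Pig treats a_clean.Year as a scalar lookup on the whole relation, which is what triggers the "Scalar has more than one row" error. Since you have not specified the schema in the LOAD statement, you will have to refer to the columns by the order in which they occur. Year seems to be the first column, so try this:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate $0;
With the schema from your edit, generate Year (without the relation prefix) works as well.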

Get the count through iterate over Data Bag but condition should be different count for each value associated to that field

Below is the data I have. Its schema is:
student_name, question_number, actual_result (either false or Correct)
(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,false)
(a,q5,false)
What I want is, for each student (a and b), the total number of correct answers and the total number of false answers he/she has given.

For the use case shared, the Pig script below suffices.
Pig Script :
student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
    answers = student_data.actual_result;
    correct_answers = FILTER answers BY actual_result=='Correct';
    incorrect_answers = FILTER answers BY actual_result=='false';
    GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count ;
};
Input : student_data.csv :
b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false
Output : DUMP student_correct_answer_data:
-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
Ref : For more details on nested FOREACH, see:
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach

Use this:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray,result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
and the output will look like:
(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
Hope this is the output you are looking for
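
To make the grouping logic concrete, here is a small Python sketch of the same idea: group by the (name, result) pair and count the rows in each group. It is only a model of what GROUP ... BY (name,result) followed by COUNT computes, not Pig code; the `rows` list is the sample input from the question and the variable names are illustrative.

```python
from collections import Counter

# Sample input from the question: (student_name, question_number, actual_result)
rows = [
    ("b", "q1", "Correct"), ("a", "q1", "false"),
    ("b", "q2", "Correct"), ("a", "q2", "false"),
    ("b", "q3", "false"),   ("a", "q3", "Correct"),
    ("b", "q4", "false"),   ("a", "q4", "false"),
    ("b", "q5", "false"),   ("a", "q5", "false"),
]

# Key each row by (name, result) and count how many rows share that key.
counts = Counter((name, result) for name, _question, result in rows)

for (name, result), count in sorted(counts.items()):
    print((name, result, count))
```

A pair that never occurs (for example a student with no correct answers) simply has no entry here, just as it produces no row in the grouped Pig output; the nested FOREACH version above reports 0 for that case instead.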
