I need to do a lot of aggregations (around 9-10) on a large dataset in my PySpark code. I can approach it in two ways:
Single group by:
df.groupBy(col1, col2).agg({"col3":"sum", "col4":"avg", "col5":"min", "col6":"sum", "col7":"max", "col8":"avg", "col9":"sum"})
Group by and join:
temp1 = df.groupBy(col1, col2).agg({"col3":"sum"})
temp2 = df.groupBy(col1, col2).agg({"col4":"avg"})
temp3 = df.groupBy(col1, col2).agg({"col5":"min"})
.
.
.
temp9 = df.groupBy(col1, col2).agg({"col9":"sum"})
And then join all these 9 dataframes to get the final output.
Which one would be more efficient?
TL;DR Go with the first one.
It is not even a competition. Readability alone should be enough to reject the second solution, which is verbose and convoluted.
Not to mention that the execution plan is just a monstrosity (and this is with only three of the aggregations joined!):
== Physical Plan ==
*Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614, min(col5)#631L]
+- *SortMergeJoin [col1#512L, col2#513L], [col1#719L, col2#720L], Inner
:- *Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614]
: +- *SortMergeJoin [col1#512L, col2#513L], [col1#704L, col2#705L], Inner
: :- *Sort [col1#512L ASC NULLS FIRST, col2#513L ASC NULLS FIRST], false, 0
: : +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[sum(col3#514L)])
: : +- Exchange hashpartitioning(col1#512L, col2#513L, 200)
: : +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_sum(col3#514L)])
: : +- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L]
: : +- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
: : +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
: +- *Sort [col1#704L ASC NULLS FIRST, col2#705L ASC NULLS FIRST], false, 0
: +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[avg(col4#707L)])
: +- Exchange hashpartitioning(col1#704L, col2#705L, 200)
: +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[partial_avg(col4#707L)])
: +- *Project [_1#491L AS col1#704L, _2#492L AS col2#705L, _4#494L AS col4#707L]
: +- *Filter (isnotnull(_2#492L) && isnotnull(_1#491L))
: +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
+- *Sort [col1#719L ASC NULLS FIRST, col2#720L ASC NULLS FIRST], false, 0
+- *HashAggregate(keys=[col1#719L, col2#720L], functions=[min(col5#723L)])
+- Exchange hashpartitioning(col1#719L, col2#720L, 200)
+- *HashAggregate(keys=[col1#719L, col2#720L], functions=[partial_min(col5#723L)])
+- *Project [_1#491L AS col1#719L, _2#492L AS col2#720L, _5#495L AS col5#723L]
+- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
+- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
compared to plain aggregation (for all columns):
== Physical Plan ==
*HashAggregate(keys=[col1#512L, col2#513L], functions=[max(col7#518L), avg(col8#519L), sum(col3#514L), sum(col6#517L), sum(col9#520L), min(col5#516L), avg(col4#515L)])
+- Exchange hashpartitioning(col1#512L, col2#513L, 200)
+- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_max(col7#518L), partial_avg(col8#519L), partial_sum(col3#514L), partial_sum(col6#517L), partial_sum(col9#520L), partial_min(col5#516L), partial_avg(col4#515L)])
+- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L, _4#494L AS col4#515L, _5#495L AS col5#516L, _6#496L AS col6#517L, _7#497L AS col7#518L, _8#498L AS col8#519L, _9#499L AS col9#520L]
+- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
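For comparison, the single-pass version can also be written with explicit column functions rather than the dict form. A minimal sketch, assuming pyspark.sql.functions is imported as F and the column names from the question (the aliases are only illustrative):
from pyspark.sql import functions as F

result = df.groupBy("col1", "col2").agg(
    F.sum("col3").alias("sum_col3"),
    F.avg("col4").alias("avg_col4"),
    F.min("col5").alias("min_col5"),
    F.sum("col6").alias("sum_col6"),
    F.max("col7").alias("max_col7"),
    F.avg("col8").alias("avg_col8"),
    F.sum("col9").alias("sum_col9"),
)
This stays a single shuffle, exactly like the dict-based agg above, but gives you control over the output column names.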
Related
Here is my Table:
| id | type   | balance |
|----|--------|---------|
| 1  | credit | 2400    |
| 2  | credit | 4800    |
| 3  | debit  | 1200    |
The calculated amount should be 6000: (2400 + 4800 - 1200) = 6000.
How can I do this using Eloquent or a collection?
Using a Laravel collection and a single SQL query:
return Model::all()->reduce(function ($carry, $item) {
    return $item->type == 'credit'
        ? $carry + $item->balance
        : $carry - $item->balance;
}, 0);
You can do this using Eloquent:
Credits
$totalCredits = Model::where('type', 'credit')->sum('balance');
Debits
$totalDebits = Model::where('type', 'debit')->sum('balance');
Balances
$Total = $totalCredits - $totalDebits;
If you want only the SUM, then do this:
DB::table("table")->get()->sum("balance")
I am a newbie to Spark and need some help debugging very slow performance.
I am doing the transformations below, and it has been running for more than 2 hours.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext( sc )
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b33f7a0
scala> val t1_df = hiveContext.sql("select * from T1" )
scala> t1_df.registerTempTable( "T1" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates( Array("c1","c2","c3", "c4" ))
scala> t1_df1.registerTempTable( "ABC" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).count
[Stage 4:====================================================> (89 + 8) / 97]
I am using Spark 2.1.0 and reading data from Hive 2.1.1 on an Amazon VM cluster of 7 nodes, each with 250 GB RAM and 64 virtual cores. With these massive resources, I was expecting this simple query on 1.7 million records to fly, but it is painfully slow.
Any pointers would be of great help.
UPDATE:
Adding the explain plan:
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols
more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>
Although I think the count in your query will always be 0 (dropDuplicates keeps at least one row for every distinct (c1, c2, c3, c4), so every c1 in T1 still appears in ABC), you may try a left anti join instead, and don't forget to cache t1_df to avoid recomputing it multiple times:
val t1_df = hiveContext.sql("select * from T1" ).cache
t1_df
.join(
t1_df.dropDuplicates( Array("c1","c2","c3", "c4" )),
Seq("c1"),
"leftanti"
)
.count()
I created partitioned Parquet files on HDFS and created a Hive external table over them. When I query the table with a filter on the partitioning column, Spark scans all the partition files instead of just the matching partition. We are on Spark 1.6.0.
DataFrame:
df = hivecontext.createDataFrame([
("class1", "Economics", "name1", None),
("class2","Economics", "name2", 92),
("class2","CS", "name2", 92),
("class1","CS", "name1", 92)
], ["class","subject", "name", "marks"])
Creating the Parquet partitions:
hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
df.write.parquet("/transient/testing/students", mode="overwrite", partitionBy='subject')
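(The Hive external table mentioned above is not shown in the question. Purely as a hypothetical sketch, using the write path above and the table/column names from the query below, the DDL could look roughly like this, followed by a metastore repair so the partitions written by Spark are registered:)
hivecontext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS vatmatching_stage.students (
        class STRING,
        name STRING,
        marks BIGINT
    )
    PARTITIONED BY (subject STRING)
    STORED AS PARQUET
    LOCATION '/transient/testing/students'
""")
# Register the partition directories written by Spark with the metastore
hivecontext.sql("MSCK REPAIR TABLE vatmatching_stage.students")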
Query:
df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.show()
+------+-----+-----+---------+
| class| name|marks| subject|
+------+-----+-----+---------+
|class1|name1| 0|Economics|
|class2|name2| 92|Economics|
+------+-----+-----+---------+
df.explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('subject = Economics)
+- 'UnresolvedRelation `vatmatching_stage`.`students`, None
== Analyzed Logical Plan ==
class: string, name: string, marks: bigint, subject: string
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Subquery students
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Optimized Logical Plan ==
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Physical Plan ==
Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music
But if I run the same query in the Hive browser, I can see that Hive is doing partition pruning.
location                  hdfs://testing/students/subject=Economics
name                      vatmatching_stage.students
numFiles                  1
numRows                   -1
partition_columns         subject
partition_columns.types   string
Is this a limitation in Spark 1.6.0, or am I missing something here?
Found the root cause of this issue: the HiveContext used for querying the table doesn't have "spark.sql.hive.convertMetastoreParquet" set to "false". It is set to "true", the default value.
When I set it to "false", I can see that it uses partition pruning.
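A minimal sketch of the fix, assuming the same hivecontext object and table names used above:
# Let Hive's metastore handle the Parquet table so partition pruning applies (Spark 1.6.0)
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.explain(True)  # the physical plan should now only scan the subject=Economics path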
Is there a way to reshape the data in Pig?
The data looks like this:
| id | p1          | count |
|----|-------------|-------|
| 1  | "Accessory" | 3     |
| 1  | "clothing"  | 2     |
| 2  | "Books"     | 1     |
I want to reshape the data so that the output looks like this:
| id | Accessory | clothing | Books |
|----|-----------|----------|-------|
| 1  | 3         | 2        | 0     |
| 2  | 0         | 0        | 1     |
Can anyone please suggest a way to do this?
If it is a fixed set of product lines, the code below might help; otherwise you can go for a custom UDF to achieve the objective.
Input: a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig snippet:
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long, product_name:chararray, rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name == 'Accessory';
    clothing = FILTER test BY product_name == 'Clothing';
    books = FILTER test BY product_name == 'Books';
    GENERATE group AS product_id,
             (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
             (IsEmpty(clothing) ? '0' : BagToString(clothing.rec_cnt)) AS c_cnt,
             (IsEmpty(books) ? '0' : BagToString(books.rec_cnt)) AS b_cnt;
};
DUMP req_stats;
Output of DUMP req_stats:
(1,3,2,0)
(2,0,0,1)
I have a TableView which looks like the table below:
TableView<Transaction>
|----|------------------|---------|--------|--------------|---------------|
| id | Transaction date | Name    | type   | Debit Amount | Credit Amount |
|----|------------------|---------|--------|--------------|---------------|
| 1  | 21/02/2016       | Invoice | Credit |              | 12000         |
| 2  | 21/02/2016       | Payment | Debit  | 20000        |               |
|----|------------------|---------|--------|--------------|---------------|
                                           | Total Debit  | Total Credit  |
The data in the Debit Amount and Credit Amount columns comes from a single property of the Transaction object; the code snippet showing how those columns are populated is below:
tcCreditAmmout.setCellValueFactory(cellData -> {
    Transaction transaction = cellData.getValue();
    BigDecimal value = null;
    // Only CREDIT transactions contribute to the credit column; other rows stay empty
    if (transaction.getKindOfTransaction() == KindOfTransaction.CREDIT) {
        value = transaction.getAmountOfTransaction();
    }
    return new ReadOnlyObjectWrapper<BigDecimal>(value);
});
tcDebitAmmout.setCellValueFactory(cellData -> {
    Transaction transaction = cellData.getValue();
    BigDecimal value = null;
    // Only DEBIT transactions contribute to the debit column
    if (transaction.getKindOfTransaction() == KindOfTransaction.DEBIT) {
        value = transaction.getAmountOfTransaction();
    }
    return new ReadOnlyObjectWrapper<BigDecimal>(value);
});
I need to calculate Total Debit and Total Credit (see the table above) via JavaFX bindings every time the TableView items change, but I have no idea how to achieve this.
Note: Total Debit and Total Credit are Labels.
Assuming you have
TableView<Transaction> table = ... ;
Label totalDebit = ... ;
Label totalCredit = ... ;
then you just need:
totalDebit.textProperty().bind(Bindings.createObjectBinding(() ->
table.getItems().stream()
.filter(transaction -> transaction.getKindOfTransaction() == KindOfTransaction.DEBIT)
.map(Transaction::getAmountOfTransaction)
.reduce(BigDecimal.ZERO, BigDecimal::add),
table.getItems())
.asString("%.3f"));
and of course
totalCredit.textProperty().bind(Bindings.createObjectBinding(() ->
table.getItems().stream()
.filter(transaction -> transaction.getKindOfTransaction() == KindOfTransaction.CREDIT)
.map(Transaction::getAmountOfTransaction)
.reduce(BigDecimal.ZERO, BigDecimal::add),
table.getItems())
.asString("%.3f"));
If getAmountOfTransaction might change while the transaction is part of the table, then your table's items list must be constructed with an extractor.