I need to do a lot of aggregations (around 9-10) on a large dataset in my PySpark code. I can approach it in two ways:
Single group by:
df.groupBy(col1, col2).agg({"col3":"sum", "col4":"avg", "col5":"min", "col6":"sum", "col7":"max", "col8":"avg", "col9":"sum"})
Group by and join:
temp1 = df.groupBy(col1, col2).agg({"col3":"sum"})
temp2 = df.groupBy(col1, col2).agg({"col4":"avg"})
temp3 = df.groupBy(col1, col2).agg({"col5":"min"})
.
.
.
temp9 = df.groupBy(col1, col2).agg({"col9":"sum"})
And then join all these 9 dataframes to get the final output.
Which one would be more efficient?
TL;DR Go with the first one.
It is not even a competition. Readability alone should be enough to reject the second solution, which is verbose and convoluted.
Not to mention that the execution plan is just a monstrosity (and this is with only three of the aggregations joined!):
== Physical Plan ==
*Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614, min(col5)#631L]
+- *SortMergeJoin [col1#512L, col2#513L], [col1#719L, col2#720L], Inner
:- *Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614]
: +- *SortMergeJoin [col1#512L, col2#513L], [col1#704L, col2#705L], Inner
: :- *Sort [col1#512L ASC NULLS FIRST, col2#513L ASC NULLS FIRST], false, 0
: : +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[sum(col3#514L)])
: : +- Exchange hashpartitioning(col1#512L, col2#513L, 200)
: : +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_sum(col3#514L)])
: : +- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L]
: : +- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
: : +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
: +- *Sort [col1#704L ASC NULLS FIRST, col2#705L ASC NULLS FIRST], false, 0
: +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[avg(col4#707L)])
: +- Exchange hashpartitioning(col1#704L, col2#705L, 200)
: +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[partial_avg(col4#707L)])
: +- *Project [_1#491L AS col1#704L, _2#492L AS col2#705L, _4#494L AS col4#707L]
: +- *Filter (isnotnull(_2#492L) && isnotnull(_1#491L))
: +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
+- *Sort [col1#719L ASC NULLS FIRST, col2#720L ASC NULLS FIRST], false, 0
+- *HashAggregate(keys=[col1#719L, col2#720L], functions=[min(col5#723L)])
+- Exchange hashpartitioning(col1#719L, col2#720L, 200)
+- *HashAggregate(keys=[col1#719L, col2#720L], functions=[partial_min(col5#723L)])
+- *Project [_1#491L AS col1#719L, _2#492L AS col2#720L, _5#495L AS col5#723L]
+- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
+- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
compared to plain aggregation (for all columns):
== Physical Plan ==
*HashAggregate(keys=[col1#512L, col2#513L], functions=[max(col7#518L), avg(col8#519L), sum(col3#514L), sum(col6#517L), sum(col9#520L), min(col5#516L), avg(col4#515L)])
+- Exchange hashpartitioning(col1#512L, col2#513L, 200)
+- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_max(col7#518L), partial_avg(col8#519L), partial_sum(col3#514L), partial_sum(col6#517L), partial_sum(col9#520L), partial_min(col5#516L), partial_avg(col4#515L)])
+- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L, _4#494L AS col4#515L, _5#495L AS col5#516L, _6#496L AS col6#517L, _7#497L AS col7#518L, _8#498L AS col8#519L, _9#499L AS col9#520L]
+- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
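For comparison, the single-pass version can also be written with explicit column functions rather than the dict form. A minimal sketch, assuming pyspark.sql.functions is imported as F and the column names from the question (the aliases are only illustrative):
from pyspark.sql import functions as F

result = df.groupBy("col1", "col2").agg(
    F.sum("col3").alias("sum_col3"),
    F.avg("col4").alias("avg_col4"),
    F.min("col5").alias("min_col5"),
    F.sum("col6").alias("sum_col6"),
    F.max("col7").alias("max_col7"),
    F.avg("col8").alias("avg_col8"),
    F.sum("col9").alias("sum_col9"),
)
This stays a single shuffle, exactly like the dict-based agg above, but gives you control over the output column names.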
Related
Here is my Table:
| id | type   | balance |
|----|--------|---------|
| 1  | credit | 2400    |
| 2  | credit | 4800    |
| 3  | debit  | 1200    |
The calculated amount should be 6000: (2400 + 4800 - 1200) = 6000.
How can I do this using Eloquent or a collection?
Using a Laravel collection and a single SQL query:
return Model::all()->reduce(function ($carry, $item) {
    return $item->type == 'credit'
        ? $carry + $item->balance
        : $carry - $item->balance;
}, 0);
You can do this using Eloquent:
Credits
$totalCredits = Model::where('type', 'credit')->sum('balance');
Debits
$totalDebits = Model::where('type', 'debit')->sum('balance');
Balances
$Total = $totalCredits - $totalDebits;
If you want only the SUM, then do this:
DB::table("table")->get()->sum("balance")
I am a newbie to Spark and need some help debugging very slow performance.
I am doing the transformations below, and it has been running for more than 2 hours.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext( sc )
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b33f7a0
scala> val t1_df = hiveContext.sql("select * from T1" )
scala> t1_df.registerTempTable( "T1" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates( Array("c1","c2","c3", "c4" ))
scala> t1_df1.registerTempTable( "ABC" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).count
[Stage 4:====================================================> (89 + 8) / 97]
I am using Spark 2.1.0 and reading data from Hive 2.1.1 on an Amazon VM cluster of 7 nodes, each with 250 GB RAM and 64 virtual cores. With these massive resources, I was expecting this simple query on 1.7 million records to fly, but it is painfully slow.
Any pointers would be of great help.
UPDATE:
Adding the explain plan:
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols
more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>
Although I think the count in your query will always be 0 (dropDuplicates keeps at least one row for every distinct (c1, c2, c3, c4), so every c1 in T1 still appears in ABC), you may try a left anti join instead, and don't forget to cache t1_df to avoid recomputing it multiple times:
val t1_df = hiveContext.sql("select * from T1" ).cache
t1_df
.join(
t1_df.dropDuplicates( Array("c1","c2","c3", "c4" )),
Seq("c1"),
"leftanti"
)
.count()
I created partitioned Parquet files on HDFS and created a Hive external table over them. When I query the table with a filter on the partitioning column, Spark scans all the partition files instead of just the matching partition. We are on Spark 1.6.0.
DataFrame:
df = hivecontext.createDataFrame([
("class1", "Economics", "name1", None),
("class2","Economics", "name2", 92),
("class2","CS", "name2", 92),
("class1","CS", "name1", 92)
], ["class","subject", "name", "marks"])
Creating the Parquet partitions:
hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
df.write.parquet("/transient/testing/students", mode="overwrite", partitionBy='subject')
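(The Hive external table mentioned above is not shown in the question. Purely as a hypothetical sketch, using the write path above and the table/column names from the query below, the DDL could look roughly like this, followed by a metastore repair so the partitions written by Spark are registered:)
hivecontext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS vatmatching_stage.students (
        class STRING,
        name STRING,
        marks BIGINT
    )
    PARTITIONED BY (subject STRING)
    STORED AS PARQUET
    LOCATION '/transient/testing/students'
""")
# Register the partition directories written by Spark with the metastore
hivecontext.sql("MSCK REPAIR TABLE vatmatching_stage.students")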
Query:
df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.show()
+------+-----+-----+---------+
| class| name|marks| subject|
+------+-----+-----+---------+
|class1|name1| 0|Economics|
|class2|name2| 92|Economics|
+------+-----+-----+---------+
df.explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('subject = Economics)
+- 'UnresolvedRelation `vatmatching_stage`.`students`, None
== Analyzed Logical Plan ==
class: string, name: string, marks: bigint, subject: string
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Subquery students
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Optimized Logical Plan ==
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Physical Plan ==
Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music
But if I run the same query in the Hive browser, I can see that Hive is doing partition pruning.
location                  hdfs://testing/students/subject=Economics
name                      vatmatching_stage.students
numFiles                  1
numRows                   -1
partition_columns         subject
partition_columns.types   string
Is this a limitation in Spark 1.6.0, or am I missing something here?
Found the root cause of this issue: the HiveContext used for querying the table doesn't have "spark.sql.hive.convertMetastoreParquet" set to "false". It is set to "true", the default value.
When I set it to "false", I can see that it uses partition pruning.
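A minimal sketch of the fix, assuming the same hivecontext object and table names used above:
# Let Hive's metastore handle the Parquet table so partition pruning applies (Spark 1.6.0)
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.explain(True)  # the physical plan should now only scan the subject=Economics path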
Is there a way to reshape the data in Pig?
The data looks like this:
| id | p1          | count |
|----|-------------|-------|
| 1  | "Accessory" | 3     |
| 1  | "clothing"  | 2     |
| 2  | "Books"     | 1     |
I want to reshape the data so that the output looks like this:
| id | Accessory | clothing | Books |
|----|-----------|----------|-------|
| 1  | 3         | 2        | 0     |
| 2  | 0         | 0        | 1     |
Can anyone please suggest a way to do this?
If it is a fixed set of product lines, the code below might help; otherwise you can go for a custom UDF to achieve the objective.
Input: a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig snippet:
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long, product_name:chararray, rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name == 'Accessory';
    clothing = FILTER test BY product_name == 'Clothing';
    books = FILTER test BY product_name == 'Books';
    GENERATE group AS product_id,
             (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
             (IsEmpty(clothing) ? '0' : BagToString(clothing.rec_cnt)) AS c_cnt,
             (IsEmpty(books) ? '0' : BagToString(books.rec_cnt)) AS b_cnt;
};
DUMP req_stats;
Output of DUMP req_stats:
(1,3,2,0)
(2,0,0,1)
I have a TableView which looks like the table below:
TableView<Transaction>
|----|------------------|---------|--------|--------------|---------------|
| id | Transaction date | Name    | type   | Debit Amount | Credit Amount |
|----|------------------|---------|--------|--------------|---------------|
| 1  | 21/02/2016       | Invoice | Credit |              | 12000         |
| 2  | 21/02/2016       | Payment | Debit  | 20000        |               |
|----|------------------|---------|--------|--------------|---------------|
                                           | Total Debit  | Total Credit  |
The data in the Debit Amount and Credit Amount columns comes from a single property of the Transaction object; the code snippet showing how those columns are populated is below:
tcCreditAmmout.setCellValueFactory(cellData -> {
    Transaction transaction = cellData.getValue();
    BigDecimal value = null;
    // Only CREDIT transactions contribute to the credit column; other rows stay empty
    if (transaction.getKindOfTransaction() == KindOfTransaction.CREDIT) {
        value = transaction.getAmountOfTransaction();
    }
    return new ReadOnlyObjectWrapper<BigDecimal>(value);
});
tcDebitAmmout.setCellValueFactory(cellData -> {
    Transaction transaction = cellData.getValue();
    BigDecimal value = null;
    // Only DEBIT transactions contribute to the debit column
    if (transaction.getKindOfTransaction() == KindOfTransaction.DEBIT) {
        value = transaction.getAmountOfTransaction();
    }
    return new ReadOnlyObjectWrapper<BigDecimal>(value);
});
I need to calculate Total Debit and Total Credit (see the table above) via JavaFX bindings every time the TableView items change, but I have no idea how to achieve this.
Note: Total Debit and Total Credit are Labels.
Assuming you have
TableView<Transaction> table = ... ;
Label totalDebit = ... ;
Label totalCredit = ... ;
then you just need:
totalDebit.textProperty().bind(Bindings.createObjectBinding(() ->
table.getItems().stream()
.filter(transaction -> transaction.getKindOfTransaction() == KindOfTransaction.DEBIT)
.map(Transaction::getAmountOfTransaction)
.reduce(BigDecimal.ZERO, BigDecimal::add),
table.getItems())
.asString("%.3f"));
and of course
totalCredit.textProperty().bind(Bindings.createObjectBinding(() ->
table.getItems().stream()
.filter(transaction -> transaction.getKindOfTransaction() == KindOfTransaction.CREDIT)
.map(Transaction::getAmountOfTransaction)
.reduce(BigDecimal.ZERO, BigDecimal::add),
table.getItems())
.asString("%.3f"));
If getAmountOfTransaction might change while the transaction is part of the table, then your table's items list must be constructed with an extractor.