Sql Window function on whole dataframe in spark - spark-streaming

I am working on a Spark Streaming project that consumes data from Kafka every 3 minutes. I want to calculate a moving sum of a value. Below is sample logic for an RDD, which works well. I want to know whether this logic will also work for Spark Streaming. I read in some docs that you have to assign a range of data, e.g. Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1), but I want to apply the logic to the whole DataFrame. Does the logic below work on all the values of the DataFrame, or will it take only a range of values?
val customers = spark.sparkContext.parallelize(List(("Alice", "2016-05-01", 50.00),
("Alice", "2016-05-03", 45.00),
("Alice", "2016-05-04", 55.00),
("Bob", "2016-05-01", 25.00),
("Bob", "2016-05-04", 29.00),
("Bob", "2016-05-06", 27.00))).
toDF("name", "date", "amountSpent")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec1 = Window.partitionBy("name").orderBy("date")
customers.withColumn( "movingSum",
sum(customers("amountSpent")).over(wSpec1) ).show()
output
+-----+----------+-----------+---------+
| name| date|amountSpent|movingSum|
+-----+----------+-----------+---------+
| Bob|2016-05-01| 25.0| 25.0|
| Bob|2016-05-04| 29.0| 54.0|
| Bob|2016-05-06| 27.0| 81.0|
|Alice|2016-05-01| 50.0| 50.0|
|Alice|2016-05-03| 45.0| 95.0|
|Alice|2016-05-04| 55.0| 150.0|
+-----+----------+-----------+---------+
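The movingSum column in the output above matches a per-name running total: within each name, rows are ordered by date and each row adds every amount up to and including itself. As a plain-Python sketch of that arithmetic only (no Spark involved; the function name running_sum_by_key is illustrative and not a Spark API):

```python
from collections import defaultdict


def running_sum_by_key(rows):
    """Return (name, date, amount, running_sum) tuples ordered by name, then date."""
    totals = defaultdict(float)
    result = []
    # ISO-formatted date strings sort chronologically, so a plain sort is enough.
    for name, date, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[name] += amount
        result.append((name, date, amount, totals[name]))
    return result


customers = [
    ("Alice", "2016-05-01", 50.00),
    ("Alice", "2016-05-03", 45.00),
    ("Alice", "2016-05-04", 55.00),
    ("Bob", "2016-05-01", 25.00),
    ("Bob", "2016-05-04", 29.00),
    ("Bob", "2016-05-06", 27.00),
]

running = running_sum_by_key(customers)
for row in running:
    print(row)
```

This only reproduces the numbers for a single batch of rows; it says nothing about how state is carried across streaming batches.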

Related

How to convert Iterable[String, String, String] to DataFrame?

I have a dataset of (String, String, String) which is about 6GB. After parsing the dataset I did a groupBy using (element => element._2) and got RDD[(String, Iterable[(String, String, String)])]. Then, for each element of the grouped result, I call toList in order to convert it to a DataFrame.
val dataFrame = groupbyElement._2.toList.toDF()
But it is taking a huge amount of time to save the data in Parquet format.
Is there a more efficient way to do this?
N.B. I have a five-node cluster. Each node has 28 GB RAM and 4 cores. I am using standalone mode and giving 16 GB RAM to each executor.
You can try using the dataframe/dataset methods instead of those for RDD. It can look something like this:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = Seq(
("ABC", "123", "a"),
("ABC", "321", "b"),
("BCA", "123", "c")).toDF("Col1", "Col2", "Col3")
scala> df.show
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
| ABC| 123| a|
| ABC| 321| b|
| BCA| 123| c|
+----+----+----+
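import org.apache.spark.sql.functions.collect_list
// Group by Col2 and collect the other two columns into lists
val df2 = df
  .groupBy($"Col2")
  .agg(
    collect_list($"Col1") as "Col1_list",
    collect_list($"Col3") as "Col3_list")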
scala> df2.show
+----+----------+---------+
|Col2| Col1_list|Col3_list|
+----+----------+---------+
| 123|[ABC, BCA]| [a, c]|
| 321| [ABC]| [b]|
+----+----------+---------+
Additionally, instead of reading the data into an RDD, you could use the DataFrame reader methods (e.g. spark.read) to load it as a DataFrame directly.

How to remove the parentheses around records when saveAsTextFile on RDD[(String, Int)]? [duplicate]

How do I remove the parentheses "(" and ")" from the output of the Spark job below?
When I try to read the Spark output using a Pig script, it causes a problem.
My code:
scala> val words = Array("HI","HOW","ARE")
words: Array[String] = Array(HI, HOW, ARE)
scala> val wordsRDD = sc.parallelize(words)
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25
scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27
scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
Output as per above code :
hadoop dfs -cat /user/cloudera/outputfiles/part*
(HOW,1)
(ARE,1)
(HI,1)
But I want the Spark output to be stored as below, without parentheses:
HOW,1
ARE,1
HI,1
Now I want to read the above output using a Pig script.
The LOAD statement in the Pig script treats "(HOW" as the first atom and "1)" as the second atom.
Is there any way to get rid of the parentheses in the Spark code itself? I don't want to apply the fix in the Pig script.
Pig script :
records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray);
dump records;
Pig output :
((HOW)
((ARE)
((HI)
Use map transformation before you save the records to outputfiles directory, e.g.
wordcountRDD.map { case (k, v) => s"$k,$v" }.saveAsTextFile("/user/cloudera/outputfiles")
See Spark's documentation about map.
I strongly recommend using Datasets instead.
scala> words.toSeq.toDS.groupBy("value").count().show
+-----+-----+
|value|count|
+-----+-----+
| HOW| 1|
| ARE| 1|
| HI| 1|
+-----+-----+
scala> words.toSeq.toDS.groupBy("value").count.write.csv("outputfiles")
$ cat outputfiles/part-00199-aa752576-2f65-481b-b4dd-813262abb6c2-c000.csv
HI,1
See Spark SQL, DataFrames and Datasets Guide.
The parentheses come from the default string format of a Tuple. You can define your own format manually:
val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
// here we set custom format
.map(x => x._1 + "," + x._2)
wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
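The same idea, formatting each pair as a string before saving, can be sketched in Python. The helper name to_csv_line is made up for illustration, and the list below stands in for the RDD contents:

```python
def to_csv_line(record):
    """Format a (word, count) pair as 'word,count' with no parentheses."""
    word, count = record
    return f"{word},{count}"


# Stand-in for the contents of the word-count RDD.
word_counts = [("HOW", 1), ("ARE", 1), ("HI", 1)]

lines = [to_csv_line(record) for record in word_counts]
print("\n".join(lines))
```

In PySpark, a function like this could be passed to map on the pair RDD before calling saveAsTextFile.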

spark sql distance to nearest holiday

In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which calculates the distance to the nearest holiday and stores that as a new column.
searchsorted (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html) was a great solution for pandas, as it gives me the index of the next holiday without high algorithmic complexity (see "Parallelize pandas apply"); e.g. this approach was a lot quicker than parallel looping.
How can I achieve this in spark or hive?
This can be done using aggregations, but that method would have higher complexity than the pandas method. You can achieve similar performance using UDFs, though. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))
And this DataFrame containing every date in 2016:
from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])
The UDF could use pandas searchsorted, but that would require installing pandas on the executors. Instead, you can use plain Python like this:
def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)
from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])
from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
And can be used with withColumn:
df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)
+----------+-----------------------+
|date |holiday |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03] |
|2016-01-02|[null,2016-01-03] |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows
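If the linear scan becomes a bottleneck with a long holiday list, the standard-library bisect module gives a binary search close to what searchsorted does in pandas, without installing anything on the executors. This is a sketch under assumptions: the name nearest_holiday_bisect is made up here, the holiday list is passed in directly rather than through a broadcast variable, and a date that is itself a holiday is returned as both its own previous and next holiday, which may differ from the loop above in that edge case.

```python
from bisect import bisect_left, bisect_right

# Same holidays as above; ISO date strings sort chronologically.
holidays = sorted(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])


def nearest_holiday_bisect(date, sorted_holidays=holidays):
    """Return (latest holiday <= date, earliest holiday >= date), None if absent."""
    i = bisect_right(sorted_holidays, date)  # number of holidays <= date
    j = bisect_left(sorted_holidays, date)   # index of first holiday >= date
    last_holiday = sorted_holidays[i - 1] if i > 0 else None
    next_holiday = sorted_holidays[j] if j < len(sorted_holidays) else None
    return (last_holiday, next_holiday)


for d in ['2016-01-01', '2016-01-04', '2016-12-31']:
    print(d, nearest_holiday_bisect(d))
```

The function body could be wrapped in a UDF the same way as nearest_holiday, reading the sorted list from the broadcast value instead of the module-level list.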

Spark withColumn performance

I wrote some code in spark as follows:
val df = sqlContext.read.json("s3n://blah/blah.gz").repartition(200)
val newdf = df.select("KUID", "XFF", "TS","UA").groupBy("KUID", "XFF","UA").agg(max(df("TS")) as "TS" ).filter(!(df("UA")===""))
val dfUdf = udf((z: String) => {
val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser();
val readableua = parser.parse(z)
Array(readableua.getName,readableua.getOperatingSystem.getName,readableua.getDeviceCategory.getName)
})
val df1 = newdf.withColumn("useragent", dfUdf(col("UA"))) // PROBLEM LINE 1
val df2= df1.map {
case org.apache.spark.sql.Row(col1:String,col2:String,col3:String,col4:String, col5: scala.collection.mutable.WrappedArray[String]) => (col1,col2,col3,col4, col5(0), col5(1), col5(2))
}.toDF("KUID", "XFF","UA","TS","browser", "os", "device")
val dataset =df2.dropDuplicates(Seq("KUID")).drop("UA")
val mobile = dataset.filter(dataset("device") === "Smartphone" || dataset("device") === "Tablet")
mobile.write.format("com.databricks.spark.csv").save("s3n://blah/blah.csv")
Here is a sample of the input data
{"TS":"1461762084","XFF":"85.255.235.31","IP":"10.75.137.217","KUID":"JilBNVgx","UA":"Flixster/1066 CFNetwork/758.3.15 Darwin/15.4.0" }
So in the above code snippet, I am reading a gzip file of 2.4 GB. The read takes 9 minutes. Then I group by ID and take the max timestamp. However, the line that adds a column with withColumn (PROBLEM LINE 1) takes 2 hours. This line takes a user agent string and tries to derive OS, device, and browser info. Is this the wrong way to do things here?
I am running this on a 4-node AWS cluster of r3.4xlarge instances (8 cores and 122 GB memory) with the following configuration:
--executor-memory 30G --num-executors 9 --executor-cores 5
The problem here is that gzip is not splittable and cannot be read in parallel. What happens in the background is that a single process downloads the file from the bucket and then repartitions it to distribute the data across the cluster. Re-encode the input data in a splittable format. If the input file does not change often, you could for example consider bzip2 (bzip2 encoding is quite expensive and might take some time, so it suits files that are written rarely).
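As a rough sketch of the re-encoding step, assuming the file has been copied to local disk first, the Python standard library can stream a gzip file into a bzip2 file without holding it all in memory. The function name and paths are illustrative:

```python
import bz2
import gzip
import shutil


def recompress_gzip_to_bzip2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file and write the same bytes out as .bz2."""
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # Copy in fixed-size chunks so memory use stays flat for large files.
        shutil.copyfileobj(src, dst, chunk_size)
```

For example, recompress_gzip_to_bzip2("blah.gz", "blah.bz2") would produce a file to upload in place of the gzip one.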
Update: Picking up answer from Roberto and sticking it here for the benefit of all
You are creating a new parser for every row within the UDF: val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser(); It is probably expensive to construct, so you should construct one outside the UDF and use it as a closure.
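The cost difference can be illustrated outside Spark. The Python sketch below uses a hypothetical ExpensiveParser class (a stand-in, not the UADetector API) and builds it once on first use instead of once per call. Note that this is lazy per-process construction rather than a captured closure, which is a related way of getting the same effect.

```python
from functools import lru_cache


class ExpensiveParser:
    """Hypothetical stand-in for a parser that is costly to construct."""

    instances_created = 0

    def __init__(self):
        ExpensiveParser.instances_created += 1

    def parse(self, user_agent):
        # Toy logic: treat the text before the first "/" as the product name.
        return user_agent.split("/", 1)[0]


@lru_cache(maxsize=None)
def get_parser():
    """Build the parser on first use and reuse the same instance afterwards."""
    return ExpensiveParser()


def parse_user_agent(user_agent):
    # Called once per row; the parser itself is constructed only once.
    return get_parser().parse(user_agent)


names = [
    parse_user_agent(ua)
    for ua in ["Flixster/1066 CFNetwork/758.3.15 Darwin/15.4.0", "Mozilla/5.0"]
]
print(names, ExpensiveParser.instances_created)
```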

How to limit number of concurrent jobs that are starting by Pig script?

I am trying to implement a simple data processing flow for a POC in Pig using the Hortonworks sandbox.
The idea is the following: there is a set of already processed data. A new data set should be added to the old data without duplicates.
For testing purposes I use very small data sets (less than 10 KB).
I've allocated 4 GB of RAM and 2 of 4 processor cores to the virtual machine.
Here is my Pig script:
-- CONFIGURABLE PROPERTIES
%DEFAULT atbInput '/user/hue/ATB_Details/in/1'
%DEFAULT atbOutputBase '/user/hue/ATB_Details/out/1'
%DEFAULT atbPrevOutputBase '/user/hue/ATB_Details/in/empty'
%DEFAULT validData 'valid'
%DEFAULT invalidData 'invalid'
%DEFAULT billDateDimensionName 'tmlBillingDate'
%DEFAULT admissionDateDimensionName 'tmlAdmissionDate'
%DEFAULT dischargeDateDimensionName 'tmlDischargeDate'
%DEFAULT arPostDateDimensionName 'tmlARPostDate'
%DEFAULT patientTypeDimensionName 'dicPatientType'
%DEFAULT patientTypeCodeDimensionName 'dicPatientTypeCode'
REGISTER bdw-all-deps-1.0.jar;
DEFINE toDateDimension com.epam.bigdata.etl.udf.ToDateDimension();
DEFINE toCodeDimension com.epam.bigdata.etl.udf.ToCodeDimension();
DEFINE isValid com.epam.bigdata.etl.udf.atbdetails.IsValidFunc();
DEFINE isGarbage com.epam.bigdata.etl.udf.atbdetails.IsGarbageFunc();
DEFINE toAccounntBalanceCategory com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryFunc();
DEFINE isEndOfMonth com.epam.bigdata.etl.udf.IsLastDayOfMonthFunc();
DEFINE toBalanceCategoryId com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryIdFunc();
rawData = LOAD '$atbInput';
--CLEANSING
SPLIT rawData INTO garbage IF isGarbage($0),
cleanLines OTHERWISE;
splitRecords = FOREACH cleanLines GENERATE FLATTEN(STRSPLIT($0, '\\|'));
cleanData = FOREACH splitRecords GENERATE
$0 AS Id:LONG,
$1 AS FacilityName:CHARARRAY,
$2 AS SubFacilityName:CHARARRAY,
$3 AS PeriodDate:CHARARRAY,
$4 AS AccountNumber:CHARARRAY,
$5 AS RAC:CHARARRAY,
$6 AS ServiceTypeCode:CHARARRAY,
$7 AS ServiceType:CHARARRAY,
$8 AS AdmissionDate:CHARARRAY,
$9 AS DischargeDate:CHARARRAY,
$10 AS BillDate:CHARARRAY,
$11 AS PatientTypeCode:CHARARRAY,
$12 AS PatientType:CHARARRAY,
$13 AS InOutType:CHARARRAY,
$14 AS FinancialClassCode:CHARARRAY,
$15 AS FinancialClass:CHARARRAY,
$16 AS SystemIPGroupCode:CHARARRAY,
$17 AS SystemIPGroup:CHARARRAY,
$18 AS CurrentInsuranceCode:CHARARRAY,
$19 AS CurrentInsurance:CHARARRAY,
$20 AS InsuranceCode1:CHARARRAY,
$21 AS InsuranceBalance1:DOUBLE,
$22 AS InsuranceCode2:CHARARRAY,
$23 AS InsuranceBalance2:DOUBLE,
$24 AS InsuranceCode3:CHARARRAY,
$25 AS InsuranceBalance3:DOUBLE,
$26 AS InsuranceCode4:CHARARRAY,
$27 AS InsuranceBalance4:DOUBLE,
$28 AS InsuranceCode5:CHARARRAY,
$29 AS InsuranceBalance5:DOUBLE,
$30 AS AgingBucket:CHARARRAY,
$31 AS AccountBalance:DOUBLE,
$32 AS TotalCharges:DOUBLE,
$33 AS TotalPayments:DOUBLE,
$34 AS EstimatedRevenue:DOUBLE,
$35 AS CreateDateTime:CHARARRAY,
$36 AS UniqueFileId:LONG,
$37 AS PatientBalance:LONG,
$38 AS VendorCode:CHARARRAY;
--VALIDATION
SPLIT cleanData INTO validData IF isValid(*),
invalidData OTHERWISE;
--Dimension update--
--MACROS
DEFINE mergeDateDimension(validDataSet, dimensionFieldName, previousDimensionFile) RETURNS merged {
dates = FOREACH $validDataSet GENERATE $dimensionFieldName;
oldDimensions = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
id:LONG,
monthName:CHARARRAY,
monthId:INT,
year:INT,
fiscalYear:INT,
originalDate:CHARARRAY);
oldOriginalDates = FOREACH oldDimensions GENERATE originalDate;
allDates = UNION dates, oldOriginalDates;
uniqueDates = DISTINCT allDates;
$merged = FOREACH uniqueDates GENERATE toDateDimension($0);
};
DEFINE mergeCodeDimension(validDataSet, dimensionFieldName, previousDimensionFile, outputIdField) RETURNS merged {
newCodes = FOREACH $validDataSet GENERATE $dimensionFieldName as newCode;
oldDim = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
id:LONG,
code:CHARARRAY);
allCodes = COGROUP oldDim BY code, newCodes BY newCode;
grouped = FOREACH allCodes GENERATE
(IsEmpty(oldDim) ? 0L : SUM(oldDim.id)) as id,
group AS code;
ranked = RANK grouped BY id DESC, code DESC DENSE;
$merged = FOREACH ranked GENERATE
((id == 0L) ? $0 : id) as $outputIdField,
code AS $dimensionFieldName;
};
--DATE DIMENSIONS
billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';
admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';
dischDateDim = mergeDateDimension(validData, DischargeDate, '$atbPrevOutputBase/dimensions/$dischargeDateDimensionName');
STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';
arPostDateDim = mergeDateDimension(validData, PeriodDate, '$atbPrevOutputBase/dimensions/$arPostDateDimensionName');
STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';
--CODE DIMENSION
patientTypeDim = mergeCodeDimension(validData, PatientType, '$atbPrevOutputBase/dimensions/$patientTypeDimensionName', PatientTypeId);
STORE patientTypeDim INTO '$atbOutputBase/dimensions/$patientTypeDimensionName' USING PigStorage('|');
patientTypeCodeDim = mergeCodeDimension(validData, PatientTypeCode, '$atbPrevOutputBase/dimensions/$patientTypeCodeDimensionName', PatientTypeCodeId);
STORE patientTypeCodeDim INTO '$atbOutputBase/dimensions/$patientTypeCodeDimensionName' USING PigStorage('|');
The problem is that when I run this script it never completes (gets stuck).
In Job Browser I can see one completed job and multiple jobs with 0% progress.
If I comment out processing of last three files - everything works fine (i.e. three parallel jobs succeed).
I've tried a few approaches to fix this issue:
-no_multiquery Pig parameter: this lets the script run to completion, using only one job at a time. The main disadvantages are the huge number of generated jobs (26) and the very long execution time (nearly 15 minutes for the described script and almost 40 minutes for a more complicated version).
Working only with the parts that I am developing and testing, by commenting out the other parts: this is not an option in the long term.
Changing the mapred.capacity-scheduler.maximum-system-jobs property in mapred-site.xml so that there should be fewer than three jobs at once, as described here.
Changing mapred.capacity-scheduler.queue.default.maximum-capacity in capacity-scheduler.xml to configure the default queue. Like the previous approach, this didn't work for me.
Allocating more memory to the sandbox virtual machine and to mappers and reducers: no effect.
So my question is: how can I limit the number of concurrent jobs started by a Pig script?
Or maybe there is another configuration fix that allows concurrent execution of multiple jobs?
[UPDATE]
If I run the same script with the same input data from shell console - everything works fine.
So I assume that there is some issue with HUE.
[UPDATE]
If I start a more complex script from the console, it also gets stuck, but in this case the number of parallel jobs is 8.
Last time we saw this it was because the cluster had only one map task.
You can use EXEC as described here:
http://pig.apache.org/docs/r0.11.1/perf.html#Implicit-Dependencies
