Spark finding gaps in timestamps - algorithm

I have a pair RDD that consists of (Key, (Timestamp, Value)) entries.
When reading the data, the entries are sorted by timestamp, so each partition of the RDD should be ordered by timestamp. What I want to do is find, for every key, the biggest gap between two sequential timestamps.
I've been thinking about this problem for a long time now, and I don't see how it could be realized with the functions Spark provides. The problems I see are: I lose the ordering information when I do a simple map, so that is not a possibility. It also seems that a groupByKey fails because there are too many entries for a specific key; trying it gives me a java.io.IOException: No space left on device.
Any hint about how to approach this would be immensely appreciated.
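For concreteness, here is a tiny hypothetical example (the keys, timestamps, and values are made up; only the shape of the data matters):
// Hypothetical input: (Key, (Timestamp, Value)) pairs, with timestamps as epoch seconds
val input = Seq(
  ("a", (100L, 1.0)), ("a", (130L, 2.0)), ("a", (135L, 3.0)),
  ("b", (200L, 4.0)), ("b", (260L, 5.0))
)
// Desired output: for each key, the largest gap between consecutive timestamps
// ("a", 30)  since 130 - 100 = 30 is bigger than 135 - 130 = 5
// ("b", 60)  since there is only one gap, 260 - 200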

As suggested by The Archetypal Paul, you can use a DataFrame and window functions. First the required imports:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// plus spark.implicits._ (or sqlContext.implicits._ on 1.x) for toDF and the $-column syntax
Next, the data has to be converted to a DataFrame:
val df = rdd.mapValues(_._1).toDF("key", "timestamp")
To be able to use the lag function we'll need a window definition:
val keyTimestampWindow = Window.partitionBy("key").orderBy("timestamp")
which can be used to select:
val withGap = df.withColumn(
  "gap", $"timestamp" - lag("timestamp", 1).over(keyTimestampWindow)
)
Finally groupBy with max:
withGap.groupBy("key").max("gap")
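Putting the pieces together, here is a minimal self-contained sketch (assuming Spark 2.x with a SparkSession named spark and timestamps stored as Longs; the toy data below stands in for rdd.mapValues(_._1)):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, max}
import spark.implicits._

val df = Seq(("a", 100L), ("a", 130L), ("a", 135L), ("b", 200L), ("b", 260L))
  .toDF("key", "timestamp")

val keyTimestampWindow = Window.partitionBy("key").orderBy("timestamp")

df.withColumn("gap", $"timestamp" - lag("timestamp", 1).over(keyTimestampWindow))
  .groupBy("key")
  .agg(max("gap").alias("max_gap"))
  .show()
// expected: ("a", 30) and ("b", 60); the first row of each partition has a null gap, which max ignores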
Following the second piece of advice from The Archetypal Paul, you can sort by key and timestamp (sortBy(identity) on the (key, timestamp) pairs orders by key first, then by timestamp):
val sorted = rdd.mapValues(_._1).sortBy(identity)
With the data arranged like this you can find the maximum gap for each key by sliding over it and reducing by key:
import org.apache.spark.mllib.rdd.RDDFunctions._
sorted.sliding(2).collect {
  case Array((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}.reduceByKey(Math.max(_, _))
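As a quick illustration, with hypothetical toy data, of why the same-key guard in the collect is needed:
// Toy (key, timestamp) pairs after sortBy(identity) -- ordered by key, then timestamp:
//   ("a", 100), ("a", 130), ("a", 135), ("b", 200), ("b", 260)
// sliding(2) then yields the windows:
//   Array(("a", 100), ("a", 130))  -> kept,    gap 30
//   Array(("a", 130), ("a", 135))  -> kept,    gap 5
//   Array(("a", 135), ("b", 200))  -> dropped, keys differ
//   Array(("b", 200), ("b", 260))  -> kept,    gap 60
// reduceByKey(Math.max(_, _)) then leaves ("a", 30) and ("b", 60).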
Another variant of the same idea is to repartition and sort first:
val partitionedAndSorted = rdd
  .mapValues(_._1)
  .repartitionAndSortWithinPartitions(
    new org.apache.spark.HashPartitioner(rdd.partitions.size)
  )
Data arranged like this can be transformed with a per-partition sliding window:
val lagged = partitionedAndSorted.mapPartitions(_.sliding(2).collect {
  case Seq((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}, preservesPartitioning = true)
and reduceByKey:
lagged.reduceByKey(Math.max(_, _))

Related

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (the differences between consecutive numbers) of the values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder()
  .appName("test")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized, however, that this was not a good idea. The data from the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId|              values|
+------------+--------------------+
|           1|[-36.0431, -35.91...|
+------------+--------------------+
All the values end up in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive; any Spark-related solution is OK.
The number of values can be ~10000. Later I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, calculates the differences between consecutive values:
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    // difference between the first two elements, then step forward by one
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
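If what you ultimately want is the distribution of spacings across all generations, a hedged alternative (a sketch assuming Spark 2.x with spark.implicits._ in scope and the inferred schema above, i.e. generationId: bigint and values: array<double>) is to flatten the differences out and let Spark summarise them:

import spark.implicits._

val spacings = jsonData
  .select("generationId", "values")
  .as[(Long, Seq[Double])]
  .flatMap { case (_, vs) =>
    // consecutive differences within one generation
    vs.sliding(2).collect { case Seq(a, b) => b - a }
  }
  .toDF("spacing")

spacings.describe("spacing").show() // count, mean, stddev, min, max of the spacings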

Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset (11M records). More specifically, I am trying to transform an RDD using filter based on some predicate, and then compute how many records conform to that filter by applying the count action.
I am running Apache Spark on my local machine, which has 16 GB of memory and an 8-core CPU. I have set --driver-memory to 10G in order to cache the RDD in memory.
However, because I have to redo this operation 10,000 times, it takes unusually long to finish. I am also attaching my code, hoping it will make things clearer.
Loading the queries and the DataFrame I am going to query against:
//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")
Parallelizing the execution of queries
Here, my queries are collected in a list and executed in parallel using par. I then extract the parameters that each query needs to filter the dataset. The isWithin function tests whether the vector contained in my dataset is within the bounds given by the query.
After filtering the dataset, I execute count to get the number of records in the filtered dataset and then build a string reporting how many there were.
val results = queries.par.map(q => {
  val volume = q(q.length - 1)
  val dimensions = q.slice(0, q.length - 1)
  val count = df.filter(row => {
    val v = row.getAs[DenseVector]("scaledOpen")
    isWithin(volume, v, dimensions)
  }).count
  q.mkString(",") + "," + count
})
I realize that this task is generally really hard, given the size of my dataset and the fact that I'm running it on a single machine. I know this could be much faster with something running on top of Spark or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.
Just because you parallelize access to a local collection doesn't mean that anything is executed in parallel. The number of jobs that can be executed concurrently is limited by the cluster resources, not by the driver code.
At the same time, Spark is designed for high-latency batch jobs. If the number of jobs goes into the tens of thousands, you just cannot expect things to be fast.
One thing you can try is to push all the filters down into a single job. Convert the DataFrame to an RDD:
import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
  _.getAs[MLlibVector]("scaledOpen").toDense
)
map vectors to {0, 1} indicators:
import breeze.linalg.DenseVector
// It is not clear what the type of queries is
type Q = ???
val queries: Seq[Q] = ???

val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
  // Create a {0, 1} indicator vector, one entry per query
  DenseVector(queries.map(q => {
    // Define as before
    val volume = ???
    val dimensions = ???
    // Output 0 or 1 for each q
    if (isWithin(volume, v, dimensions)) 1L else 0L
  }): _*)
})
aggregate partial results:
val counts: breeze.linalg.DenseVector[Long] = inds
  .aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
and prepare final output:
queries.zip(counts.toArray).map {
  case (q, c) => s"""${q.mkString(",")},$c"""
}

Filter Spark DataFrame if between any set of points in list

I have a Spark DataFrame (using the Scala interface) that has columns timestamp, asset (a string), tag (a string), and value (a double). Here is an excerpt of it:
+--------------------+-----+--------+-------------------+
|           timestamp|asset|     tag|              value|
+--------------------+-----+--------+-------------------+
|2013-01-03 23:36:...|   G4| BTGJ2_2|      116.985626221|
|2013-01-15 00:36:...|   G4| TTXD1_6|       66.887382507|
|2013-01-05 13:03:...|   G4|TTXD1_22|       40.913497925|
|2013-01-12 04:43:...|   G4|TTXD1_23|       60.834510803|
|2013-01-08 17:54:...|   G4|   LTB1D|      106.534744263|
|2013-01-02 04:15:...|   G4|    WEXH|      255.981292725|
|2013-01-07 10:54:...|   G4| BTTA1_7|      100.743843079|
|2013-01-05 11:29:...|   G4| CDFH_10|      388.560668945|
|2013-01-10 09:10:...|   G4|   LTB1D|      112.226242065|
|2013-01-13 15:09:...|   G4|TTXD1_15|       63.970848083|
|2013-01-15 01:23:...|   G4|    TTIB|       67.993904114|
+--------------------+-----+--------+-------------------+
I also have an Array[List[Timestamp]], where each List is of size two and holds starting and ending times for intervals of interest. For example:
event_times: Array[List[java.sql.Timestamp]] = Array(List(2013-01-02 00:00:00.0, 2013-01-02 12:00:00.0), List(2013-01-10 00:00:00.0, 2013-01-12 06:00:00.0))
holds two intervals of interest: one from midnight to 12:00 on 2013-01-02, and another from midnight on 2013-01-10 to 6:00 on 2013-01-12.
Here's my question: How can I filter the dataframe to return values such that the timestamp is in any of the intervals? For any one interval, I can do
df.filter(df("timestamp").between(start, end))
Since I don't know how many elements are in the Array (how many intervals I have), I can't just have a long series of filters.
For the example above, I would want to keep rows 4, 6, and 9.
What I have now is a loop over the Array, getting the appropriate subset for each interval. However, that is probably slower than having it all in one big filter, right?
You can convert your list of timestamp intervals into a DataFrame and join it with your initial DataFrame on the corresponding timestamps. I've created a simple example to illustrate this process:
// Dummy data
val data = List(
  ("2013-01-02 00:30:00.0", "116.985626221"),
  ("2013-01-03 00:30:00.0", "66.887382507"),
  ("2013-01-11 00:30:00.0", "12.3456")
)

// Convert data to DataFrame
val dfData = sc.parallelize(data).toDF("timestamp", "value")

// Timestamp intervals list
val filterList = Array(
  List("2013-01-02 00:00:00.0", "2013-01-02 12:00:00.0"),
  List("2013-01-10 00:00:00.0", "2013-01-12 06:00:00.0")
)

// Convert the intervals list to a DataFrame
val dfIntervals = sc.parallelize(
  filterList.map(l => (l(0), l(1)))
).toDF("start_ts", "end_ts")

// Join both DataFrames (inner join, since you only want matching rows)
val joined = dfData.as("data").join(
  dfIntervals.as("inter"),
  $"data.timestamp".between($"inter.start_ts", $"inter.end_ts")
)
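From there you can keep just the original columns, and deduplicate in case any of the intervals overlap; a small usage sketch:

joined
  .select($"data.timestamp", $"data.value")
  .distinct() // guards against duplicate rows if intervals ever overlap
  .show()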

Spark - How to count number of records by key

This is probably an easy problem, but basically I have a dataset where I need to count the number of females for each country. Ultimately I want to group each count by country, but I am unsure what to use for the value, since there is no count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair and I only want to count the key and make a counter the value. How do I go about this?
val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")
This is where I am stuck. The country is at index 13 in the dataset.
The output should look something like this:
(Australia, 201000)
(America, 420000)
etc
Any help would be great.
Thanks
You're nearly there! All you need is a countByValue:
val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Prints (Australia, 230), (America, 23242), etc.
(In your example, I assume you meant x(10) rather than x._10)
All together:
sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x(10) == "Female")
.map(_(13))
.countByValue()
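One caveat worth noting (not part of the original answer): countByValue returns a plain Map to the driver, which is fine for a modest number of countries. If the number of distinct keys were huge, a reduceByKey variant keeps the result distributed; a sketch:

val countsByCountry = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), 1L))
  .reduceByKey(_ + _) // still an RDD; collect() or save it as needed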
Have you considered manipulating your RDD using the DataFrames API?
It looks like you're loading a CSV file, which you can do with spark-csv.
Then it's a simple matter (if your CSV has a header with the obvious column names) of:
import com.databricks.spark.csv._

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits the fields
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()

countryGender.show()
If you want to go deeper in this kind of manipulation, here's the guide:
https://spark.apache.org/docs/latest/sql-programming-guide.html
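As a side note, in Spark 2.0+ the CSV reader is built in, so the spark-csv package is no longer needed; a rough equivalent (assuming the file has a header row and a SparkSession named spark) would be:

import spark.implicits._

val countryGender = spark.read
  .option("header", "true")
  .csv("/home/cloudera/desktop/file.txt")
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()

countryGender.show()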
You can easily create a key; it doesn't have to be in the file/database. For example:
val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), x(10))) // <<<< here you generate a new key
  .groupByKey()
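Note that groupByKey only gathers the values for each key; to turn the groups into counts you still need to size them, for example:

val counts = countryGender.mapValues(_.size) // RDD[(country, number of female records)]

For plain counting, the countByValue and reduceByKey approaches above are cheaper, since they avoid shuffling every individual value.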

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the text file. There are roughly 50,000 keys in each text file. The same data is then filtered by the extracted keys and saved to HDFS.
I want to create a directory structure in HDFS of the form hdfs://.../date/key that contains the filtered data. The problem is that writing to HDFS takes a very, very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) // keys is an array of approx. 50000 unique strings
val cleanedData = cleanData(inputData) // cleanedData is an RDD of strings
keys.map(key => {
  val filteredData = cleanedData.filter(line => line.contains(key))
  filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data by the number of keys extracted, but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey, but I can't save the values because they aren't RDDs.
Any help is appreciated :)
import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner

def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val item = iterator.next()
      val key = item._1
      val line = item._2
      val writer = writers.get(key) match {
        case Some(w) => w
        case None =>
          // open one file per key; args(1) is assumed to be the output directory prefix
          val path = args(1) + key
          val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
          val w = new BufferedWriter(new OutputStreamWriter(outputStream))
          writers.put(key, w)
          w
      }
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}

val inputData = sc.textFile("hdfs://...")
val keyValue = inputData.map(line => (extractKeyFromLine(line), line)) // extractKeyFromLine: your per-line key logic
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)

class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The number of partitions has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your customized version to save each key to a different directory.
Regarding scalability, it is not an issue with Spark but with HDFS. However, no matter how you implement it, as long as the requirement is unchanged, this is unavoidable. I think HDFS is probably OK with 50,000 file handles.
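For reference, a hedged sketch of what that override could look like with the old Hadoop API's MultipleTextOutputFormat, applied to an RDD of (key, line) pairs like keyValue above (the directory layout and type parameters are assumptions; adapt them to your date/key scheme):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // route every record into a subdirectory named after its key
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name

  // don't write the key itself into the output file
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

keyValue.saveAsHadoopFile(
  "hdfs://.../date", // base output directory
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutput]
)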
You are specifying just 2 partitions for the input and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what is causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read (key, value) results?
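An alternative that avoids the per-key loop altogether (not what the question used, and extractKeyFromLine below is a hypothetical helper) is to let the DataFrame writer lay out the directories: partitionBy creates one key=<value> subdirectory per distinct key. A sketch, assuming Spark 2.x:

import spark.implicits._

val lines = spark.read.textFile("hdfs://...") // Dataset[String]

lines
  .map(line => (extractKeyFromLine(line), line)) // extractKeyFromLine: your per-line key logic
  .toDF("key", "value")
  .write
  .partitionBy("key")
  .text("hdfs://.../date") // produces .../date/key=<key>/part-... files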
