Spark - How to count number of records by key - hadoop

This is probably an easy problem, but I have a dataset where I need to count the number of females for each country. Ultimately I want the counts grouped by country, but I am unsure what to use as the value, since there is no count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair and I only want to count the key, with a counter as the value. How do I go about this?
val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")
Here is where I am stuck. The country is at index 13 in the dataset.
The output should look something like this:
(Australia, 201000)
(America, 420000)
etc
Any help would be great.
Thanks

You're nearly there! All you need is a countByValue:
val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Result: Map(Australia -> 230, America -> 23242, ...)
(In your example, I assume you meant x(10) rather than x._10)
All together:
sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x(10) == "Female")
.map(_(13))
.countByValue()
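If you prefer to stick with reduceByKey, as mentioned in the question, a minimal sketch (assuming the same column layout as above) is to map each female row to a (country, 1) pair and sum the ones:
val femaleCountsByCountry = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), 1L))   // (country, 1) pairs
  .reduceByKey(_ + _)      // sum the ones per country
femaleCountsByCountry.collect().foreach(println)
Unlike countByValue, which returns a local Map to the driver, this keeps the counts as an RDD, which matters if the number of distinct countries were ever large.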

Have you considered manipulating your RDD using the DataFrames API?
It looks like you're loading a CSV file, which you can do with spark-csv.
Then it's a simple matter (if your CSV has a header with the obvious column names) of:
import com.databricks.spark.csv._
import sqlContext.implicits._ // for the $"..." column syntax

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits by field
  .filter($"gender" === "Female")
  .groupBy("country").count()
countryGender.show()
If you want to go deeper in this kind of manipulation, here's the guide:
https://spark.apache.org/docs/latest/sql-programming-guide.html
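In Spark 2.x and later, CSV support is built into the DataFrame reader, so the external spark-csv package is no longer needed. A minimal sketch, assuming the file has a header row and an existing SparkSession named spark:
import spark.implicits._ // for the $"..." column syntax

val countryGender = spark.read
  .option("header", "true")
  .csv("/home/cloudera/desktop/file.txt")
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()
countryGender.show()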

You can easily create a key; it doesn't have to be in the file/database. For example:
val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")   // arrays are indexed with x(n), not x._n
  .map(x => (x(13), x(10)))         // <<<< here you generate a new key
  .groupByKey()
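If the end goal is the per-country count, a short follow-up sketch on top of that grouped RDD:
val femaleCountByCountry = countryGender.mapValues(_.size)
femaleCountByCountry.collect().foreach(println)
For counting alone, though, groupByKey shuffles every grouped value across the cluster; the reduceByKey variant shown earlier is usually cheaper.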

Related

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacings (differences between consecutive numbers) of the values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized, however, that this was not a good idea. The data from the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000, possibly more later. I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this (mapping over a DataFrame needs import spark.implicits._ in scope for the Dataset encoder):
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
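If deep recursion on long value lists is a concern, a non-recursive sketch using Scala's sliding iterator, under the same assumptions about the column layout:
import spark.implicits._ // Dataset encoder for the resulting tuples

val spacings = jsonData.map { r =>
  val values = r.getSeq[Double](1)
  val diffs = values.sliding(2).collect {
    case Seq(a, b) => Math.abs(b - a)  // difference of each consecutive pair
  }.toSeq
  (r.getLong(0), diffs)
}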

Spark finding gaps in timestamps

I have a Pair RDD, that consists of (Key, (Timestamp,Value)) entries.
When reading the data, the entries are sorted by the timestamp, so each partition of the RDD should be ordered by the timestamp. What I want to do is find, for every key, the biggest gap between two sequential timestamps.
I've been thinking about this problem for a long time now, and I don't see how it could be realized with the functions Spark provides. The problems I see are: I lose the order information when I do a simple map, so that is not a possibility. It also seems that groupByKey fails because there are too many entries for a specific key; trying that gives me a java.io.IOException: No space left on device.
Any help about how to approach this would be immensely helpful.
As suggested by The Archetypal Paul, you can use a DataFrame and window functions. First, the required imports:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
Next data has to be converted to a DataFrame:
val df = rdd.mapValues(_._1).toDF("key", "timestamp")
To be able to use the lag function, we'll need a window definition:
val keyTimestampWindow = Window.partitionBy("key").orderBy("timestamp")
which can be used to select:
val withGap = df.withColumn(
  "gap", $"timestamp" - lag("timestamp", 1).over(keyTimestampWindow)
)
Finally groupBy with max:
withGap.groupBy("key").max("gap")
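Put together, a minimal end-to-end sketch (assuming the pair RDD is named rdd, the timestamps are numeric, and a SparkSession named spark is in scope, whose implicits are needed for toDF):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, max}
import spark.implicits._ // required for .toDF on the RDD

val df = rdd.mapValues(_._1).toDF("key", "timestamp")
val keyTimestampWindow = Window.partitionBy("key").orderBy("timestamp")

val maxGapPerKey = df
  .withColumn("gap", $"timestamp" - lag("timestamp", 1).over(keyTimestampWindow))
  .groupBy("key")
  .agg(max("gap").as("max_gap"))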
Following the second piece of advice from The Archetypal Paul, you can sort by key and timestamp.
val sorted = rdd.mapValues(_._1).sortBy(identity)
With the data arranged like this, you can find the maximum gap for each key by sliding and reducing by key:
import org.apache.spark.mllib.rdd.RDDFunctions._
sorted.sliding(2).collect {
  case Array((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}.reduceByKey(Math.max(_, _))
Another variant of the same idea is to repartition and sort first:
val partitionedAndSorted = rdd
  .mapValues(_._1)
  .repartitionAndSortWithinPartitions(
    new org.apache.spark.HashPartitioner(rdd.partitions.size)
  )
Data like this can be transformed
val lagged = partitionedAndSorted.mapPartitions(_.sliding(2).collect {
  case Seq((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}, preservesPartitioning = true)
and reduceByKey:
lagged.reduceByKey(Math.max(_, _))

Merging inputs in distributed application

INTRODUCTION
I have to write a distributed application which counts the maximum number of unique values for 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then the beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge each pair of records, and then merge each merged pair with every single record.
QUESTION
Can I perform an operation like this (merging more than one input row at the same time) in frameworks like Hadoop/Spark? Or maybe my approach is incorrect and I should do this a different way?
Any advice will be appreciated.
Can I perform an operation like this (merging more than one input row at the same time) in frameworks like Hadoop/Spark?
Yes, you can.
Or maybe my approach is incorrect and I should do this a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and do the analysis using MapReduce/Spark.
An example Spark application written in Scala:
import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
  val sparkConf = new SparkConf().setAppName("My Counter")
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val inputFile = sc.textFile("hdfs:///inputfile.txt")
    val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
    val triplets = keys.cartesian(keys).cartesian(keys)
      .map(z => (z._1._1, z._1._2, z._2))
      .filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // get "(u1,u2,u3)" triplets
    // If you have a small number of (u1,u2,u3) triplets, it's better to prepare them locally.
    val res = triplets.cartesian(inputFile).filter(z => {
      z._2.startsWith(z._1._1) || z._2.startsWith(z._1._2) || z._2.startsWith(z._1._3)
    }) // (u1,u2,u3) only matches lines starting with u1, u2 or u3, for example "u1: u2,u3,u4,u5,u6"
      .reduceByKey((a, b) => a + "," + b) // merge the matching lines, comma-separated
      .map(z => {
        // strip the "uX: " prefixes, then count unique values with a set;
        // the members of the triplet themselves are not counted
        val values = z._2.split(",").map(_.replaceAll("^u\\d+:\\s*", "")).toSet
        val others = values -- Set(z._1._1, z._1._2, z._1._3)
        "key=" + z._1 + ", count=" + others.size // the result from one mapper is a string
      }).collect()
    for (line <- res) {
      println(line)
    }
  }
}
The code is not tested and is not efficient; it could be optimized further (for example, by removing unnecessary map-reduce steps).
You can rewrite the same version using Python/Java.
You can also implement the same logic using Hadoop MapReduce.

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file from my HDFS every day and extracts unique keys from each line in the text file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted keys and saved to HDFS.
I want to create a directory in my HDFS with the structure hdfs://.../date/key that contains the filtered data. The problem is that writing to HDFS takes a very, very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
  val filteredData = cleanedData.filter(line => line.contains(key))
  filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
import java.io.{BufferedWriter, OutputStreamWriter}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner

import scala.collection.mutable

// outputDir is the output directory prefix, e.g. the question's "hdfs://.../date/"
def writeLines(outputDir: String)(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      val writer = writers.getOrElseUpdate(key, {
        val path = outputDir + key // one output file per key
        val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
        new BufferedWriter(new OutputStreamWriter(outputStream))
      })
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}

class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

val inputData = sc.textFile("hdfs://...") // input path, elided as in the question
val keyValue = inputData.map(line => (extractKeyFromLine(line), line)) // extractKeyFromLine: your own per-line key extraction
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(iter => writeLines("hdfs://.../date/")(iter))
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The partition number has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your own version in order to save to different directories.
Regarding scalability, it is not an issue with Spark but with HDFS. However you implement it, as long as the requirement does not change, it is unavoidable. That said, I think HDFS is probably OK with 50,000 file handles.
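A minimal sketch of that idea, assuming a pair RDD of (key, line) strings like the keyValue RDD in the answer above; it uses Hadoop's MultipleTextOutputFormat so that each key lands in its own subdirectory of the output path:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // write each record under <outputPath>/<key>/part-NNNNN
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  // drop the key from the written line; keep only the value
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

keyValue                                   // RDD[(String, String)] of (key, line)
  .partitionBy(new HashPartitioner(100))   // 100 partitions is an arbitrary choice
  .saveAsHadoopFile("hdfs://.../date",     // base output path, elided as in the question
    classOf[String], classOf[String], classOf[KeyBasedOutput])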
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what is causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read (key, value) results?

Minimum value in dictionary using LINQ

I have a dictionary of type
Dictionary<DateTime,double> dictionary
How can I retrieve the minimum value and the key corresponding to this value from this dictionary using LINQ?
var min = dictionary.OrderBy(kvp => kvp.Value).First();
var minKey = min.Key;
var minValue = min.Value;
This is not very efficient though; you might want to consider MoreLinq's MinBy extension method.
If you are performing this query very often, you might want to consider a different data-structure.
Aggregate
var minPair = dictionary.Aggregate((p1, p2) => (p1.Value < p2.Value) ? p1 : p2);
Using the mighty Aggregate method.
I know that MinBy is cleaner in this case, but with Aggregate you have more power and it's built in. ;)
Dictionary<DateTime, double> dictionary;
//...
double min = dictionary.Min(x => x.Value);
var minMatchingKVPs = dictionary.Where(x => x.Value == min);
You could combine it of course if you really felt like doing it on one line, but I think the above is easier to read.
var minMatchingKVPs = dictionary.Where(x => x.Value == dictionary.Min(y => y.Value));
You can't easily do this efficiently in normal LINQ - you can get the minimal value easily, but finding the key requires another scan through. If you can afford that, use Jess's answer.
However, you might want to have a look at MinBy in MoreLINQ which would let you write:
var pair = dictionary.MinBy(x => x.Value);
You'd then have the pair with both the key and the value in, after just a single scan.
EDIT: As Nappy says, MinBy is also in System.Interactive in Reactive Extensions.
