Fast Ways To Update Value in Map - algorithm

I have a map structured like this:
val map = mutable.Map.empty[String, Double]
Then I add a value to my map like this:
map("apple") = 10.34
But for the next value of apple I want to add it to 10.34, so I do this:
val oldVal = map("apple")
map("apple") = oldVal + 2.34
Is there a faster way to do this? I have to read a big file and I want fast updates on the map. Thank you for your advice.

val map = mutable.Map.empty[String, Double].withDefaultValue(0.0)
//put new
map("apple") = 10.34
//update existing
map("apple") += 2.34
//update not existing
map("orange") += 0.34

When using Scala it's generally better to avoid mutable objects; this sidesteps concurrency issues, and immutability is fairly simple to work with in Scala.
If I understood your question correctly, you have a map that you want to update with new values. When reading from the file you could build a second map containing the values to add:
val m = Map("a" -> 1, "b" -> 2)
val other = Map("a" -> 3, "c" -> 4) // created from a file
Now you can update the first map with values from the second map to get this:
val updated = m.map { case (k, v) => k -> (v + other.getOrElse(k, 0)) }
Now you can carry on working with updated (here it is Map("a" -> 4, "b" -> 2)).
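A fold over the second map is another option worth sketching; unlike the snippet above, it also keeps keys that appear only in other (such as "c"):
val merged = other.foldLeft(m) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}
// merged == Map("a" -> 4, "b" -> 2, "c" -> 4)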

Related

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacings (differences between consecutive numbers) between the values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just realized, however, that this was not a good idea. Loaded this way, the data from the JSON above looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values end up in a single field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive; any Spark-related solution is OK.
The number of values can be ~10000, possibly more later. I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, calculates the differences between consecutive values (returning an empty result for collections with fewer than two elements):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this (with import spark.implicits._ in scope for the tuple encoder):
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
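If the recursion depth worries you at ~10000 values, an equivalent iterative sketch using sliding computes the same consecutive absolute differences without recursive calls:
def differences(l: Seq[Double]): Seq[Double] =
  l.sliding(2).collect { case Seq(a, b) => Math.abs(b - a) }.toSeq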

Fast alternative to SortedMap.mapValues

The following code demonstrates the problem I have (in the real case the SortedMap keys are Joda DateTime values, and the maps contain several thousand elements).
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.immutable.SortedMap
object Main extends App {
  val s = SortedMap(1 -> "A", 2 -> "B", 3 -> "C")

  def f(s: String) = s

  val sMap = s.map(kv => kv._1 -> f(kv._2)) // slow: rebuilds the Map, as the keys could change
  val sMapValues = s.mapValues(f) // fast, but creates a view only

  val so = new ByteArrayOutputStream
  val oos = new ObjectOutputStream(so)
  oos.writeObject(s) // works
  oos.writeObject(sMap) // works
  oos.writeObject(sMapValues) // does not work - view only
  oos.close()
  so.close()
}
The problem is that while mapValues performs well for SortedMap, the result is not a real collection but a view, and as such it cannot be serialized. The simple solution of mapping both keys and values works but is slow, because the tree representation is rebuilt: map does not know that I am not changing the keys.
Is there any fast alternative to SortedMap.mapValues that produces a serializable result?
Try transform:
val sMapValues = s.transform((_,v) => f(v))
Although both the key and the value are passed to the transformation lambda, the result replaces only the value; the keys are left unchanged.
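A quick check against the snippet from the question (a sketch reusing the same s and f) shows that the transform result, being a strict SortedMap rather than a view, serializes without problems:
val out = new ObjectOutputStream(new ByteArrayOutputStream)
out.writeObject(s.transform((_, v) => f(v))) // works: a strict SortedMap, not a view
out.close()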

Merging inputs in distributed application

INTRODUCTION
I have to write a distributed application which counts the maximum number of unique values across combinations of 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge each pair of records, and then merge each merged pair with each single record.
QUESTION
Can I perform such an operation, i.e. working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark? Or maybe my approach is incorrect and I should do this a different way?
Any advice will be appreciated.
Can I perform such an operation, i.e. working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark?
Yes, you can.
Or maybe my approach is incorrect and I should do this a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and do the analysis using MapReduce/Spark.
An example Spark application written in Scala:
import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("My Counter")
    val sc = new SparkContext(sparkConf)

    val inputFile = sc.textFile("hdfs:///inputfile.txt")
    val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
    val triplets = keys.cartesian(keys).cartesian(keys)
      .map(z => (z._1._1, z._1._2, z._2))
      .filter(z => z._1 != z._2 && z._1 != z._3 && z._2 != z._3) // keep the "(u1,u2,u3)" triplets with three distinct keys
    // If you have a small number of (u1,u2,u3) triplets, it is better to prepare them locally.

    val res = triplets.cartesian(inputFile)
      .filter { case ((a, b, c), line) =>
        // (u1,u2,u3) only matches lines starting with u1, u2 or u3, for example "u1: u2,u3,u4,u5,u6"
        line.startsWith(a) || line.startsWith(b) || line.startsWith(c)
      }
      .mapValues(line => line.split(":")(1).split(",").map(_.trim).toSet) // parse each matching line into its set of values
      .reduceByKey(_ ++ _) // merge the three value sets; the set keeps the values unique
      .map { case (triplet, values) =>
        // exclude the triplet's own keys, as in the expected output; the result for one triplet is a string
        "key=" + triplet + ", count=" + (values -- Set(triplet._1, triplet._2, triplet._3)).size
      }
      .collect()

    for (line <- res) {
      println(line)
    }
  }
}
The code is not tested and not efficient; it leaves room for optimization (for example, removing unnecessary map-reduce steps).
You can rewrite the same version in Python/Java.
You can implement the same logic with Hadoop/MapReduce.
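For the small-data case mentioned above, a purely local sketch of the same counting (assuming the input lines are already in memory as a Seq[String]) could look like this:
// in-memory input, same format as the file
val lines = Seq(
  "u1: u2,u3,u4,u5,u6",
  "u2: u1,u4,u6,u7,u8",
  "u3: u1,u4,u5,u9",
  "u4: u1,u2,u3,u6")

val records = lines.map { line =>
  val Array(key, values) = line.split(":")
  key.trim -> values.split(",").map(_.trim).toSet
}.toMap

// every triplet of distinct keys, with the unique values of the three records merged
for (Seq(a, b, c) <- records.keys.toSeq.sorted.combinations(3)) {
  val merged = (records(a) ++ records(b) ++ records(c)) -- Set(a, b, c)
  println(s"($a,$b,$c), ${merged.toSeq.sorted.mkString(",")} => count=${merged.size}")
}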

How can I improve the performance when completing a table with statistical methods in Apache-Spark?

I have a dataset with 10 fields and 5000 rows. I want to complete this dataset with some statistical methods in Spark with Scala. I fill the empty cells in a field with the mean value of that field if it consists of continuous values, and with the most frequent value if it consists of discrete values. Here is my code:
for (col <- cols) {
  val datacount = table.select(col).rdd.map(r => r(0)).filter(_ == null).count()
  if (datacount > 0) {
    if (continuous_lst contains col) { // put the mean of the data into null values
      val avg = table.select(mean(col)).first()(0).asInstanceOf[Double]
      df = df.na.fill(avg, Seq(col))
    }
    else if (discrete_lst contains col) { // put the most frequent categorical value into null values
      val group_df = df.groupBy(col).count()
      val sorted = group_df.orderBy(desc("count")).take(1)
      val most_frequent = sorted.map(t => t(0))
      val most_frequent_ = most_frequent(0).toString.toDouble.toInt
      val type__ = ctype.filter(t => t._1 == col)
      val type_ = type__.map(t => t._2)
      df = df.na.fill(most_frequent_, Seq(col))
    }
  }
}
The problem is that this code runs very slowly on this data. I use spark-submit with 8G of executor memory, and I call repartition(4) before passing the data to this function.
I will have to work with bigger datasets, so how can I speed up this code?
Thanks for your help.
Here is a suggestion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Row}

def most_frequent(df: DataFrame, col: Column) = {
  df.select(col).rdd.map { case Row(colVal) => (colVal, 1) }
    .reduceByKey(_ + _)
    .reduce { case ((val1, cnt1), (val2, cnt2)) => if (cnt1 > cnt2) (val1, cnt1) else (val2, cnt2) }._1
}

val new_continuous_cols = continuous_lst.map {
  col => coalesce(col, mean(col).over()).as(col.toString) // empty window (Spark 2.0+): mean over the whole dataset
}.toArray

val new_discrete_cols = discrete_lst.map {
  col => coalesce(col, lit(most_frequent(table, col))).as(col.toString)
}.toArray
val all_new_cols = new_continuous_cols ++ new_discrete_cols
val newDF = table.select(all_new_cols: _*)
Considerations:
I assumed that continuous_lst and discrete_lst are lists of Column. If they are lists of String the idea is the same, but some adjustments are necessary;
Note that I used map and reduce to calculate the most frequent value of a column. That can be better than grouping and aggregating in some cases. (Maybe there is room for improvement here, by calculating the most frequent values for all discrete columns at once?);
Additionally, I used coalesce to replace null values instead of fill. This may result in some improvement as well. (More info about the coalesce function in the scaladoc API);
I cannot test at the moment, so there may be something missing that I didn't see.
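On the second consideration, here is a sketch of the single-pass idea for the means, assuming for this snippet that continuous_lst holds column-name strings (the String case mentioned above): compute every mean in one aggregation job and fill the nulls in one go, instead of launching one job per column.
import org.apache.spark.sql.functions.{avg, col}

val meanRow = table.select(continuous_lst.map(c => avg(col(c)).as(c)): _*).first()
val meansByCol = continuous_lst.map(c => c -> meanRow.getAs[Double](c)).toMap
val filledContinuous = table.na.fill(meansByCol) // fills the nulls in every continuous column at once
The discrete columns would still need their per-column most-frequent values, but the number of separate Spark jobs drops considerably.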

Spark - How to count number of records by key

This is probably an easy problem, but basically I have a dataset where I have to count the number of females for each country. Ultimately I want to group each count by country, but I am unsure what to use for the value, since there is no count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair and I only want to count the key and make a counter the value. How do I go about this?
val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")
Here is where I am stuck. The country is at index 13 in the dataset.
The output should look something like this:
(Australia, 201000)
(America, 420000)
etc
Any help would be great.
Thanks
You're nearly there! All you need is a countByValue:
val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Prints (Australia, 230), (America, 23242), etc.
(In your example, I assume you meant x(10) rather than x._10)
All together:
sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x(10) == "Female")
.map(_(13))
.countByValue()
Have you considered manipulating your RDD using the DataFrames API?
It looks like you're loading a CSV file, which you can do with spark-csv.
Then it's a simple matter (if your CSV has a header with the obvious column names) of:
import com.databricks.spark.csv._
import sqlContext.implicits._ // for the $"..." column syntax

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits by field
  .filter($"gender" === "Female")
  .groupBy("country").count()
countryGender.show()
If you want to go deeper in this kind of manipulation, here's the guide:
https://spark.apache.org/docs/latest/sql-programming-guide.html
You can easily create a key; it doesn't have to be in the file/database. For example:
val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), x(10))) // <<<< here you generate a new key
  .groupByKey()
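If you go this route, the per-country counts could then be read off the grouped RDD, for example (a minimal sketch):
countryGender.mapValues(_.size).collect().foreach(println) // (country, number of females)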
