Fast alternative to SortedMap.mapValues - performance

The following code demonstrates the problem I have (in the real case the SortedMap keys are Joda DateTime values, and the maps contain several thousand elements).
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.immutable.SortedMap

object Main extends App {
  val s = SortedMap(1 -> "A", 2 -> "B", 3 -> "C")
  def f(s: String) = s
  val sMap = s.map(kv => kv._1 -> f(kv._2)) // slow: rebuilds the map, as keys could change
  val sMapValues = s.mapValues(f) // fast, but creates a view only

  val so = new ByteArrayOutputStream
  val oos = new ObjectOutputStream(so)
  oos.writeObject(s) // works
  oos.writeObject(sMap) // works
  oos.writeObject(sMapValues) // does not work - view only
  oos.close()
  so.close()
}
The problem is that while mapValues performs well for SortedMap, the result is not a real collection but a view, and as such it cannot be serialized. The straightforward solution of mapping both keys and values works, but it is slow, because the tree representation is rebuilt: map does not know that the keys are unchanged.
Is there any fast alternative to SortedMap.mapValues, which outputs a serializable result?

Try transform:
val sMapValues = s.transform((_,v) => f(v))
Although both the key and the value are passed to the transformation lambda, the result is applied only to the value; the keys remain unchanged.
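Since transform builds a strict SortedMap rather than a view, its result should also serialize; a minimal check along the lines of the question's test:
val oos2 = new ObjectOutputStream(new ByteArrayOutputStream)
oos2.writeObject(s.transform((_, v) => f(v))) // works: a real SortedMap, not a view
oos2.close()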

Related

Fast Ways To Update Value in Map

I have a map structure like this:
val map = mutable.Map.empty[String, Double]
Then I add a value to my map like this:
map("apple") = 10.34
But for the next value of "apple" I want to add it to the existing 10.34, so I do this:
val oldVal = map("apple")
map("apple") = oldVal + 2.34
Is there a faster way to do this? I have to read a big file and I want fast updates on the map. Thank you for your advice.
val map = mutable.Map.empty[String, Double].withDefaultValue(0.0)
// put a new value
map("apple") = 10.34
// update an existing value
map("apple") += 2.34
// update a not-yet-existing value (falls back to the default 0.0)
map("orange") += 0.34
When using Scala it's generally better to avoid mutable objects; doing so sidesteps concurrency issues, and Scala makes working with immutable collections fairly simple.
If I understood your question correctly, you have a map that you want to update with new values. When reading from a file, you could build a second map with the values to add:
val m = Map("a" -> 1, "b" -> 2)
val other = Map("a" -> 3, "c" -> 4) // created from a file
Now you can update the first map with values from the second map to get this:
val updated = m.map { case (k, v) => k -> (v + other.getOrElse(k, 0)) }
Now you can use updated to perform other operations with.
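Note that updated only keeps the keys of m; if keys that appear only in other (like "c" here) should survive the merge too, one possible sketch:
val merged = (m.keySet ++ other.keySet)
  .map(k => k -> (m.getOrElse(k, 0) + other.getOrElse(k, 0)))
  .toMap
// merged == Map("a" -> 4, "b" -> 2, "c" -> 4)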

Fastest way to convert key-value pairs to a map of objects grouped by key using Java 8 streams

Model:
public class AgencyMapping {
private Integer agencyId;
private String scoreKey;
}
public class AgencyInfo {
private Integer agencyId;
private Set<String> scoreKeys;
}
My code:
List<AgencyMapping> agencyMappings;
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
    .collect(groupingBy(AgencyMapping::getAgencyId,
        collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet()))))
    .entrySet().stream()
    .map(e -> new AgencyInfo(e.getKey(), e.getValue()))
    .collect(Collectors.toMap(AgencyInfo::getAgencyId, identity()));
Is there a way to get the same result with simpler and faster code?
You can simplify the call collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet())) to mapping(AgencyMapping::getScoreKey, toSet()).
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
    .collect(groupingBy(AgencyMapping::getAgencyId,
        mapping(AgencyMapping::getScoreKey, toSet())))
    .entrySet()
    .stream()
    .map(e -> new AgencyInfo(e.getKey(), e.getValue()))
    .collect(toMap(AgencyInfo::getAgencyId, identity()));
A different way to see it using a toMap collector:
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
    .collect(toMap(AgencyMapping::getAgencyId, // key extractor
        e -> new HashSet<>(singleton(e.getScoreKey())), // value extractor
        (left, right) -> { // merge function, resolves collisions between values associated with the same key
            left.addAll(right);
            return left;
        }))
    .entrySet()
    .stream()
    .map(e -> new AgencyInfo(e.getKey(), e.getValue()))
    .collect(toMap(AgencyInfo::getAgencyId, identity()));
The latter example is arguably more complicated than the former. Nevertheless, your approach is pretty much the way to go, apart from using mapping instead of collectingAndThen as mentioned above.
Beyond that, I don't see anything else to simplify in the code shown.
As for faster code: if you find your current approach too slow, you may want to read the answers here that discuss when you should consider going parallel.
You are collecting to an intermediate map, then streaming the entries of this map to create AgencyInfo instances, which are finally collected to another map.
Instead of all this, you could use Collectors.toMap to collect directly to a map, mapping each AgencyMapping object to the desired AgencyInfo and merging the scoreKeys as needed:
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
    .collect(Collectors.toMap(
        AgencyMapping::getAgencyId,
        mapping -> new AgencyInfo(
            mapping.getAgencyId(),
            new HashSet<>(Set.of(mapping.getScoreKey()))),
        (left, right) -> {
            left.getScoreKeys().addAll(right.getScoreKeys());
            return left;
        }));
This works by grouping the AgencyMapping elements of the stream by AgencyMapping::getAgencyId, but storing AgencyInfo objects in the map instead. We get these AgencyInfo instances by manually mapping each original AgencyMapping object. Finally, AgencyInfo instances already in the map are merged by a merge function that folds the scoreKeys of one AgencyInfo into the other.
I'm using Java 9's Set.of to create a singleton set. If you don't have Java 9, you can replace it with Collections.singleton.

How history RDDs are preserved for further use in the given code

{
  var history: RDD[(String, List[String])] = sc.emptyRDD()

  val dstream1 = ...
  val dstream2 = ...

  val historyDStream = dstream1.transform(rdd => rdd.union(history))
  val joined = historyDStream.join(dstream2)

  ... do stuff with joined as above, obtain dstreamFiltered ...

  dstreamFiltered.foreachRDD { rdd =>
    val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) }
    history.unpersist(false) // unpersist the 'old' history RDD
    history = formatted // assign the new history
    history.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
    history.count() // action to materialize this transformation
  }
}
This code logic works fine for preserving all the previous RDDs which didn't join successfully, saving them for future batches, so that whenever we get a record with a corresponding joining key for this RDD we can perform the join. But I didn't understand how this history is built up.
We can understand how the history builds up in this case by observing how the RDD lineage evolves over time.
We need two pieces of previous knowledge:
RDDs are immutable structures
Operations on RDDs can be expressed in functional terms by the function to be applied plus references to the input RDDs.
Let's see a quick example, using the classical wordCount:
val txt = sparkContext.textFile(someFile)
val words = txt.flatMap(_.split(" "))
In simplified terms, txt is a HadoopRDD(someFile), and words is a MapPartitionsRDD(txt, flatMapFunction). We speak of the lineage of words as the DAG (Directed Acyclic Graph) formed by this chaining of operations: HadoopRDD <-- MapPartitionsRDD.
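A quick way to see that lineage is to print it (a sketch; the exact output depends on the Spark version, and textFile itself inserts an extra MapPartitionsRDD):
println(words.toDebugString)
// output resembles (illustrative only):
// (2) MapPartitionsRDD[2] at flatMap ...
//  |  MapPartitionsRDD[1] at textFile ...
//  |  someFile HadoopRDD[0] at textFile ...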
We can apply the same principles to our streaming operation:
At iteration 0, we have
var history: RDD[(String, List[String])] = sc.emptyRDD()
// -> history: EmptyRDD
...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd.union(EmptyRDD)
join, filter
// underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred)
map
// -> underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
history.unpersist(false)
// EmptyRDD.unpersist (does nothing, it was never persisted)
history = formatted
// history = rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
history.persist(...)
// history marked for persistence (at the next action)
history.count()
// rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f).count()
// cache the result of: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
At iteration 1, we have (using rdd0, rdd1, otherRDD0, otherRDD1, ... with the iteration number as index):
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f))
join, filter
// underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred)
map
// -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
history.unpersist(false)
// history0.unpersist (marks the previous result for removal; we already used it in the computation above)
history = formatted
// history1 = rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
history.persist(...)
// new history marked for persistence (at the next action)
history.count()
// rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f).count()
// cache the result so that we don't need to compute it next time
This process repeats at each iteration.
As we can see, the graph representing the RDD computation keeps on growing. cache reduces the cost of making all the calculations each time. checkpoint is needed every so often to write a concrete computed value of this growing graph, so that we can use it as a baseline instead of having to evaluate the whole chain.
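A minimal sketch of that periodic checkpointing (the checkpoint directory and the batchIndex counter are hypothetical, for illustration only):
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // once at startup; must point to reliable storage
// inside foreachRDD, after history.persist(...):
if (batchIndex % 10 == 0) // batchIndex: a hypothetical counter of processed batches
  history.checkpoint() // written out together with the history.count() action, cutting the lineage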
An interesting way to see this process in action is by adding a line within the foreachRDD to inspect the current lineage:
...
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
println(history.toDebugString)
...

Merging inputs in distributed application

INTRODUCTION
I have to write a distributed application which counts the maximum number of unique values over combinations of 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge each pair of records, and then merge each merged pair with each remaining single record.
QUESTION
Can I do such an operation, i.e. work on (merge) more than one input row at the same time, in frameworks like Hadoop/Spark? Or is my approach incorrect and should I do this a different way?
Any advice will be appreciated.
Can I do such an operation, i.e. work on (merge) more than one input row at the same time, in frameworks like Hadoop/Spark?
Yes, you can.
Or is my approach incorrect and should I do this a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and analyze it with MapReduce/Spark.
An example Spark application written in Scala:
import java.util
import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("My Counter")
    val sc = new SparkContext(sparkConf)

    val inputFile = sc.textFile("hdfs:///inputfile.txt")
    val keys = inputFile.map(line => line.substring(0, line.indexOf(':'))) // get "u1" from "u1: u2,u3,u4,u5,u6"
    val triplets = keys.cartesian(keys).cartesian(keys)
      .map(z => (z._1._1, z._1._2, z._2))
      .filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // keep proper "(u1,u2,u3)" triplets
    // If you have a small number of (u1,u2,u3) triplets, it's better to prepare them locally.

    val res = triplets.cartesian(inputFile)
      .filter(z => z._2.startsWith(z._1._1 + ":") || z._2.startsWith(z._1._2 + ":") || z._2.startsWith(z._1._3 + ":"))
      // a triplet (u1,u2,u3) only matches lines starting with u1, u2 or u3, e.g. "u1: u2,u3,u4,u5,u6"
      .map(z => (z._1, z._2.substring(z._2.indexOf(':') + 1).trim)) // keep only each matching line's value list
      .reduceByKey((a, b) => a + "," + b) // merge the three lines per triplet
      .map(z => {
        // count unique values using a set
        val set = new util.HashSet[String]()
        for (value <- z._2.split(",")) set.add(value.trim)
        "key=" + z._1 + ", count=" + set.size() // the result for one triplet is a string
      })
      .collect()

    for (line <- res) println(line)
  }
}
The code is not tested, and it is not efficient; it could be optimized (for example, by removing unnecessary map/reduce steps).
You can rewrite the same program in Python or Java.
You can implement the same logic with Hadoop MapReduce.
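One sketch of such an optimization (untested; assumes the parsed input fits in memory on each executor): parse the file once, broadcast the small key -> values map, and count per triplet without the second cartesian. This variant also excludes the triplet's own keys, matching the expected output above.
val parsed = inputFile.map { line =>
  val Array(k, vs) = line.split(":", 2)
  (k.trim, vs.split(",").map(_.trim).toSet) // "u1: u2,u3" -> ("u1", Set("u2", "u3"))
}
val lookup = sc.broadcast(parsed.collectAsMap()) // small lookup table: key -> set of values
val counts = triplets.map { case (a, b, c) =>
  val union = lookup.value(a) ++ lookup.value(b) ++ lookup.value(c)
  ((a, b, c), (union -- Set(a, b, c)).size) // drop the triplet's own keys before counting
}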

Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.
How can I do that?
You can add as many data sinks to a DataSet program as you need.
For example in a program like this:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long, Long>> data = env.readCsvFile(...);

// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");

// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");
You read a DataSet data from a CSV file. This data is fed into two subsequent transformations:
To a MapFunction, whose result is written to a text file.
To a FilterFunction, whose non-filtered tuples are written to a CSV file.
You can also have multiple data sources and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.
You can use the HadoopOutputFormat API in Flink like this:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
  override def generateActualKey(key: K, value: V): K =
    NullWritable.get().asInstanceOf[K]

  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.asInstanceOf[String]
}
and we can use IteblogMultipleTextOutputFormat as follows:
val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
val jc = new JobConf()
FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)

val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
  ("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
batch.output(format)
env.execute() // sinks created via output() only run once execute() is called
For more information see: http://www.iteblog.com/archives/1667
