Better way to scan data using scala and spark - algorithm

Problem
The input data has 2 types of records, lets call them R and W.
I need to traverse this data in Sequence from top to bottom in such a way that if the current record is of type W, it has to be merged with a map(lets call it workMap). If the key of that W-type record is already present in the map, the value of this record is added to it, otherwise a new entry is made into workMap.
If the current record is of type R, the workMap calculated until this record, is attached to the current record.
For example, if this is the order of records -
W1- a -> 2
W2- b -> 3
W3- a -> 4
R1
W4- c -> 1
R2
W5- c -> 4
Where W1, W2, W3, W4 and W5 are of type W; And R1 and R2 are of type R
At the end of this function, I should have the following -
R1 - { a -> 6,
b -> 3 } //merged(W1, W2, W3)
R2 - { a -> 6,
b -> 3,
c -> 1 } //merged(W1, W2, W3, W4)
{ a -> 6,
b -> 3,
c -> 5 } //merged(W1, W2, W3, W4, W5)
I want all the R-type records attached to the intermediate workMaps calculated until that point; And the final workMap after the last record is processed.
Here is the code that I have written -
def calcPerPartition(itr: Iterator[(InputKey, InputVal)]):
Iterator[(ReportKey, ReportVal)] = {
val workMap = mutable.HashMap.empty[WorkKey, WorkVal]
val reportList = mutable.ArrayBuffer.empty[(ReportKey, Reportval)]
while (itr.hasNext) {
val temp = itr.next()
val (iKey, iVal) = (temp._1, temp._2)
if (iKey.recordType == reportType) {
//creates a new (ReportKey, Reportval)
reportList += getNewReportRecord(workMap, iKey, iVal)
}
else {
//if iKey is already present, merge the values
//other wise adds a new entry
updateWorkMap(workMap, iKey, iVal)
}
}
val workList: Seq[(ReportKey, ReportVal)] = workMap.toList.map(convertToReport)
reportList.iterator ++ workList.iterator
}
ReportKey class is like this -
case class ReportKey (
// the type of record - report or work
rType: Int,
date: String,
.....
)
There are two problems with this approach that I am asking help for -
I have to keep track of a reportList - a list of R type records attached with intermediate workMaps. As the data grows, the reportList also grows and I am running into OutOfMemoryExceptions.
I have to combine reportList and workMap records in the same data structure and then return them. If there is any other elegant way, I would definitely consider changing this design.
For the sake of completeness - I am using spark. The function calcPerPartition is passed as argument for mapPartitions on an RDD. I need the workMaps from each partition to do some additional calculations later.
I know that if I don't have to return workMaps from each partition, the problem becomes much easier, like this -
...
val workMap = mutable.HashMap.empty[WorkKey, WorkVal]
itr.scanLeft[Option[(ReportKey, Reportval)]](
None)((acc: Option[(ReportKey, Reportval)],
curr: (InputKey, InputVal)) => {
if (curr._1.recordType == reportType) {
val rec = getNewReportRecord(workMap, curr._1, curr._2)
Some(rec)
}
else {
updateWorkMap(workMap, curr._1, curr._2)
None
}
})
val reportList = scan.filter(_.isDefined).map(_.get)
//workMap is still empty after the scanLeft.
...
Sure, I can do a reduce operation on the input data to derive the final workMap but I would need to look at the data twice. Considering that the input data set is huge, I want to avoid that too.
But unfortunately I need the workMaps at a latter step.
So, is there a better way to solve the above problem? If I can't solve problem 2 at all(according to this), is there any other way I can avoid storing R records(reportList) in a list or scan the data more than once?

I don't yet have a better design for the second question - if you can avoid combining reportList and workMap into a single data structure but we can certainly avoid storing R type records in a list.
Here is how we can re-write the calcPerPartition from the above question -
def calcPerPartition(itr: Iterator[(InputKey, InputVal)]):
Iterator[Option[(ReportKey, ReportVal)]] = {
val workMap = mutable.HashMap.empty[WorkKey, WorkVal]
var finalWorkMap = true
new Iterator[Option[(ReportKey, ReportVal)]](){
override def hasNext: Boolean = itr.hasNext
override def next(): Option[(ReportKey, ReportVal)] = {
val curr = itr.next()
val iKey = curr._1
val iVal = curr._2
val eventKey = EventKey(openKey.date, openKey.symbol)
if (iKey.recordType == reportType) {
Some(getNewReportRecord(workMap, iKey, iVal))
}
else {
//otherwise update the generic interest map but don't accumulate anything
updateWorkMap(workMap, iKey, iVal)
if (itr.hasNext) {
next()
}
else {
if(finalWorkMap){
finalWorkMap = false //because we want a final only once
Some(workMap.map(convertToReport))
}
else {
None
}
}
}
}
}
}
Instead of storing results in a list, we defined an iterator. That solved most of the memory issues we had around this issue.

Related

Algorithm / data structure for resolving nested interpolated values in this example?

I am working on a compiler and one aspect currently is how to wait for interpolated variable names to be resolved. So I am wondering how to take a nested interpolated variable string and build some sort of simple data model/schema for unwrapping the evaluated string so to speak. Let me demonstrate.
Say we have a string like this:
foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}
That has 1, 2, and 3 levels of nested interpolations in it. So essentially it should resolve something like this:
wait for x, y, one, two, and c to resolve.
when both x and y resolve, then resolve a{x}-{y} immediately.
when both one and two resolve, resolve baz{one}-{two}.
when a{x}-{y}, baz{one}-{two}, and c all resolve, then finally resolve the whole expression.
I am shaky on my understanding of the logic flow for handling something like this, wondering if you could help solidify/clarify the general algorithm (high level pseudocode or something like that). Mainly just looking for how I would structure the data model and algorithm so as to progressively evaluate when the pieces are ready.
I'm starting out trying and it's not clear what to do next:
{
dependencies: [
{
path: [x]
},
{
path: [y]
}
],
parent: {
dependency: a{x}-{y} // interpolated term
parent: {
dependencies: [
{
}
]
}
}
}
Some sort of tree is probably necessary, but I am having trouble figuring out what it might look like, wondering if you could shed some light on that with some pseudocode (or JavaScript even).
watch the leaf nodes at first
then, when the children of a node are completed, propagate upward to resolving the next parent node. This would mean once x and y are done, it could resolve a{x}-{y}, but then wait until the other nodes are ready before doing the final top-level evaluation.
You can just simulate it by sending "events" to the system theoretically, like:
ready('y')
ready('c')
ready('x')
ready('a{x}-{y}')
function ready(variable) {
if ()
}
...actually that may not work, not sure how to handle the interpolated nodes in a hacky way like that. But even a high level description of how to solve this would be helpful.
export type SiteDependencyObserverParentType = {
observer: SiteDependencyObserverType
remaining: number
}
export type SiteDependencyObserverType = {
children: Array<SiteDependencyObserverType>
node: LinkNodeType
parent?: SiteDependencyObserverParentType
path: Array<string>
}
(What I'm currently thinking, some TypeScript)
Here is an approach in JavaScript:
Parse the input string to create a Node instance for each {} term, and create parent-child dependencies between the nodes.
Collect the leaf Nodes of this tree as the tree is being constructed: group these leaf nodes by their identifier. Note that the same identifier could occur multiple times in the input string, leading to multiple Nodes. If a variable x is resolved, then all Nodes with that name (the group) will be resolved.
Each node has a resolve method to set its final value
Each node has a notify method that any of its child nodes can call in order to notify it that the child has been resolved with a value. This may (or may not yet) lead to a cascading call of resolve.
In a demo, a timer is set up that at every tick will resolve a randomly picked variable to some number
I think that in your example, foo, and a might be functions that need to be called, but I didn't elaborate on that, and just considered them as literal text that does not need further treatment. It should not be difficult to extend the algorithm with such function-calling features.
class Node {
constructor(parent) {
this.source = ""; // The slice of the input string that maps to this node
this.texts = []; // Literal text that's not part of interpolation
this.children = []; // Node instances corresponding to interpolation
this.parent = parent; // Link to parent that should get notified when this node resolves
this.value = undefined; // Not yet resolved
}
isResolved() {
return this.value !== undefined;
}
resolve(value) {
if (this.isResolved()) return; // A node is not allowed to resolve twice: ignore
console.log(`Resolving "${this.source}" to "${value}"`);
this.value = value;
if (this.parent) this.parent.notify();
}
notify() {
// Check if all dependencies have been resolved
let value = "";
for (let i = 0; i < this.children.length; i++) {
const child = this.children[i];
if (!child.isResolved()) { // Not ready yet
console.log(`"${this.source}" is getting notified, but not all dependecies are ready yet`);
return;
}
value += this.texts[i] + child.value;
}
console.log(`"${this.source}" is getting notified, and all dependecies are ready:`);
this.resolve(value + this.texts.at(-1));
}
}
function makeTree(s) {
const leaves = {}; // nodes keyed by atomic names (like "x" "y" in the example)
const tokens = s.split(/([{}])/);
let i = 0; // Index in s
function dfs(parent=null) {
const node = new Node(parent);
const start = i;
while (tokens.length) {
const token = tokens.shift();
i += token.length;
if (token == "}") break;
if (token == "{") {
node.children.push(dfs(node));
} else {
node.texts.push(token);
}
}
node.source = s.slice(start, i - (tokens.length ? 1 : 0));
if (node.children.length == 0) { // It's a leaf
const label = node.texts[0];
leaves[label] ??= []; // Define as empty array if not yet defined
leaves[label].push(node);
}
return node;
}
dfs();
return leaves;
}
// ------------------- DEMO --------------------
let s = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
const leaves = makeTree(s);
// Create a random order in which to resolve the atomic variables:
function shuffle(array) {
for (var i = array.length - 1; i > 0; i--) {
var j = Math.floor(Math.random() * (i + 1));
[array[j], array[i]] = [array[i], array[j]];
}
return array;
}
const names = shuffle(Object.keys(leaves));
// Use a timer to resolve the variables one by one in the given random order
let index = 0;
function resolveRandomVariable() {
if (index >= names.length) return; // all done
console.log("\n---------------- timer tick --------------");
const name = names[index++];
console.log(`Variable ${name} gets a value: "${index}". Calling resolve() on the connected node instance(s):`);
for (const node of leaves[name]) node.resolve(index);
setTimeout(resolveRandomVariable, 1000);
}
setTimeout(resolveRandomVariable, 1000);
your idea of building a dependency tree it's really likeable.
Anyway I tryed to find a solution as simplest possible.
Even if it already works, there are many optimizations possible, take this just as proof of concept.
The background idea it's produce a List of Strings which you can read in order where each element it's what you need to solve progressively. Each element might be mandatory to solve something that come next in the List, hence for the overall expression. Once you solved all the chunks you have all pieces to solve your original expression.
It's written in Java, I hope it's understandable.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
public class StackOverflow {
public static void main(String[] args) {
String exp = "foo{a{x}-{y}}-{baz{one}-{two}}-foo{c}";
List<String> chunks = expToChunks(exp);
//it just reverse the order of the list
Collections.reverse(chunks);
System.out.println(chunks);
//output -> [c, two, one, baz{one}-{two}, y, x, a{x}-{y}]
}
public static List<String> expToChunks(String exp) {
List<String> chunks = new ArrayList<>();
//this first piece just find the first inner open parenthesys and its relative close parenthesys
int begin = exp.indexOf("{") + 1;
int numberOfParenthesys = 1;
int end = -1;
for(int i = begin; i < exp.length(); i++) {
char c = exp.charAt(i);
if (c == '{') numberOfParenthesys ++;
if (c == '}') numberOfParenthesys --;
if (numberOfParenthesys == 0) {
end = i;
break;
}
}
//this if put an end to recursive calls
if(begin > 0 && begin < exp.length() && end > 0) {
//add the chunk to the final list
String substring = exp.substring(begin, end);
chunks.add(substring);
//remove from the starting expression the already considered chunk
String newExp = exp.replace("{" + substring + "}", "");
//recursive call for inner element on the chunk found
chunks.addAll(Objects.requireNonNull(expToChunks(substring)));
//calculate other chunks on the remained expression
chunks.addAll(Objects.requireNonNull(expToChunks(newExp)));
}
return chunks;
}
}
Some details on the code:
The following piece find the begin and the end index of the first outer chunk of expression. The background idea is: in a valid expression the number of open parenthesys must be equal to the number of closing parenthesys. The count of open(+1) and close(-1) parenthesys can't ever be negative.
So using that simple loop once I find the count of parenthesys to be 0, I also found the first chunk of the expression.
int begin = exp.indexOf("{") + 1;
int numberOfParenthesys = 1;
int end = -1;
for(int i = begin; i < exp.length(); i++) {
char c = exp.charAt(i);
if (c == '{') numberOfParenthesys ++;
if (c == '}') numberOfParenthesys --;
if (numberOfParenthesys == 0) {
end = i;
break;
}
}
The if condition provide validation on the begin and end indexes and stop the recursive call when no more chunks can be found on the remained expression.
if(begin > 0 && begin < exp.length() && end > 0) {
...
}

Java 8 Collections max size with count

I have a class like below.
class Student {
public Student(Set<String> seminars) {
this.seminar = seminars;
}
Set<String> seminar;
public Set<String> getSeminar()
{
return seminar;
}
}
And created a set of students like below.
List<Student> students = new ArrayList<Student>();
Set<String> seminars = new HashSet<String>();
seminars.add("SeminarA");
seminars.add("SeminarB");
seminars.add("SeminarC");
students.add(new Student(seminars)); //Student 1 - 3 seminars
students.add(new Student(seminars)); //Student 2 - 3 seminars
seminars = new HashSet<String>();
seminars.add("SeminarA");
seminars.add("SeminarB");
students.add(new Student(seminars)); //Student 3 - 2 seminars
seminars = new HashSet<String>();
students.add(new Student(seminars)); //Student 4 - 0 seminars
Now the question is "I'm trying to get the count of students who has atteneded maximus seminars" As you can see there are 2 students who has attended to 3(maximum) seminars, So I needed to get that count.
I achieved the same using 2 different statements using stream
OptionalInt max = students.stream()
.map(Student::getSeminar)
.mapToInt(Set::size)
.max();
long count = students.stream()
.map(Student::getSeminar)
.filter(size -> size.size() == max.getAsInt())
.count();
is there a way to achieve the same using one statement?
To solve this, please use the follwing code:
students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, Collectors.counting()));
Your output will be:
{0=1, 2=1, 3=2}
As you can see, there are two students that are going to the maximum number of seminars which is three.
First we changed all the students objects from the stream with the actual sets and then we used Collectors groupingBy() method to group the sets by size.
If you want to get only the number of the students, please use following code:
students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, Collectors.counting()))
.values().stream()
.max(Comparator.comparing(a->a))
.get();
Your output will be: 2.
In most practical situations, your approach of traversing the list twice is the best one. It is simple and the code is easy to understand. Fight the urge to conserve statements -- you don't get charged per semicolon!
However, there are some conceivable scenarios when the source data cannot be traversed twice. Perhaps it comes from some slow or once-only source like a network stream.
If you have such a situation, you might want to define a custom collector allMaxBy that collects all max elements into a downstream collector.
Then you would be able to write
long maxCount = students.stream()
.collect(allMaxBy(
comparingInt(s -> s.getSeminar().size()),
counting()
));
Here's an implementation for allMaxBy:
public static <T, A, R> Collector<T, ?, R> allMaxBy(Comparator<? super T> cmp, Collector<? super T, A, R> downstream) {
class AllMax {
T val;
A acc = null; // null means empty
void add(T t) {
int c = acc == null ? 1 : cmp.compare(t, val);
if (c > 0) {
val = t;
acc = downstream.supplier().get();
downstream.accumulator().accept(acc, t);
} else if (c == 0) {
downstream.accumulator().accept(acc, t);
}
}
AllMax merge(AllMax other) {
if (other.acc == null) {
return this;
} else if (this.acc == null) {
return other;
}
int c = cmp.compare(this.val, other.val);
if (c == 0) {
this.acc = downstream.combiner().apply(this.acc, other.acc);
}
return c >= 0 ? this : other;
}
R finish() {
return downstream.finisher().apply(acc);
}
}
return Collector.of(AllMax::new, AllMax::add, AllMax::merge, AllMax::finish);
}
This is ugly, but the following would work:
long count = students.stream()
.filter(s -> s.getSeminar().size() == students
.stream().mapToInt(a -> a.getSeminar().size())
.max().orElse(0))
.count();
Explanation: This streams the students list and filters the students so that only students with the maximum number of seminars (which is what the nested lambda is) remain and then takes the count of that stream.
Little modified #Alex's solution, to sort the keys while doing the grouping
TreeMap<Integer, Long> agg = students.stream()
.map(Student::getSeminar)
.collect(Collectors.groupingBy(Set::size, TreeMap::new, Collectors.counting()));
System.out.println(agg);
System.out.println(agg.lastEntry().getKey() + " - " + agg.lastEntry().getValue());
output
3 - 2

Using par map to increase performance

Below code runs a comparison of users and writes to file. I've removed some code to make it as concise as possible but speed is an issue also in this code :
import scala.collection.JavaConversions._
object writedata {
def getDistance(str1: String, str2: String) = {
val zipped = str1.zip(str2)
val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
val p = zipped.count(_ == ('1', '1')).toFloat * 2
val q = zipped.count(_ == ('1', '0')).toFloat * 2
val r = zipped.count(_ == ('0', '1')).toFloat * 2
val s = zipped.count(_ == ('0', '0')).toFloat * 2
(q + r) / (p + q + r)
} //> getDistance: (str1: String, str2: String)Float
case class UserObj(id: String, nCoordinate: String)
val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
for (a <- 1 to 100) {
userList.add(new UserObj("2", "101010"))
}
def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
def appendToFile(fileName: String, textData: String) =
using(new java.io.FileWriter(fileName, true)) {
fileWriter =>
using(new java.io.PrintWriter(fileWriter)) {
printWriter => printWriter.println(textData)
}
} //> appendToFile: (fileName: String, textData: String)Unit
var counter = 0; //> counter : Int = 0
for (xUser <- userList.par) {
userList.par.map(yUser => {
if (!xUser.id.isEmpty && !yUser.id.isEmpty)
synchronized {
appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
}
})
}
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 10 but in the code im working on
the size is 8000 which translates to 64'000'000 comparisons. I'm
using a synchronized block so that multiple threads are not writing
to same file at same time. A performance improvment im considering
is populating a separate collection within the inner loop ( userList.par.map(yUser => {)
and then writing that collection out to seperate file.
Are there other methods I can use to improve performance. So that I can
handle a List that contains 8000 items instead of above example of 100 ?
I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.
EDIT:
One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.

Accessing position information in a scala combinatorparser kills performance

I wrote a new combinator for my parser in scala.
Its a variation of the ^^ combinator, which passes position information on.
But accessing the position information of the input element really cost performance.
In my case parsing a big example need around 3 seconds without position information, with it needs over 30 seconds.
I wrote a runnable example where the runtime is about 50% more when accessing the position.
Why is that? How can I get a better runtime?
Example:
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.Parsers
import scala.util.matching.Regex
import scala.language.implicitConversions
object FooParser extends RegexParsers with Parsers {
var withPosInfo = false
def b: Parser[String] = regexB("""[a-z]+""".r) ^^# { case (b, x) => b + " ::" + x.toString }
def regexB(p: Regex): BParser[String] = new BParser(regex(p))
class BParser[T](p: Parser[T]) {
def ^^#[U](f: ((Int, Int), T) => U): Parser[U] = Parser { in =>
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
val inwo = in.drop(start - offset)
p(inwo) match {
case Success(t, in1) =>
{
var a = 3
var b = 4
if(withPosInfo)
{ // takes a lot of time
a = inwo.pos.line
b = inwo.pos.column
}
Success(f((a, b), t), in1)
}
case ns: NoSuccess => ns
}
}
}
def main(args: Array[String]) = {
val r = "foo"*50000000
var now = System.nanoTime
parseAll(b, r)
var us = (System.nanoTime - now) / 1000
println("without: %d us".format(us))
withPosInfo = true
now = System.nanoTime
parseAll(b, r)
us = (System.nanoTime - now) / 1000
println("with : %d us".format(us))
}
}
Output:
without: 2952496 us
with : 4591070 us
Unfortunately, I don't think you can use the same approach. The problem is that line numbers end up implemented by scala.util.parsing.input.OffsetPosition which builds a list of every line break every time it is created. So if it ends up with string input it will parse the entire thing on every call to pos (twice in your example). See the code for CharSequenceReader and OffsetPosition for more details.
There is one quick thing you can do to speed this up:
val ip = inwo.pos
a = ip.line
b = ip.column
to at least avoid creating pos twice. But that still leaves you with a lot of redundant work. I'm afraid to really solve the problem you'll have to build the index as in OffsetPosition yourself, just once, and then keep referring to it.
You could also file a bug report / make an enhancement request. This is not a very good way to implement the feature.

how to read immutable data structures from file in scala

I have a data structure made of Jobs each containing a set of Tasks. Both Job and Task data are defined in files like these:
jobs.txt:
JA
JB
JC
tasks.txt:
JB T2
JA T1
JC T1
JA T3
JA T2
JB T1
The process of creating objects is the following:
- read each job, create it and store it by id
- read task, retrieve job by id, create task, store task in the job
Once the files are read this data structure is never modified. So I would like that tasks within jobs would be stored in an immutable set. But I don't know how to do it in an efficient way. (Note: the immutable map storing jobs may be left immutable)
Here is a simplified version of the code:
class Task(val id: String)
class Job(val id: String) {
val tasks = collection.mutable.Set[Task]() // This sholud be immutable
}
val jobs = collection.mutable.Map[String, Job]() // This is ok to be mutable
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = new Job(line.trim)
jobs += (job.id -> job)
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = jobs(tokens(0).trim)
val task = new Task(job.id + "." + tokens(1).trim)
job.tasks += task
}
Thanks in advance for every suggestion!
The most efficient way to do this would be to read everything into mutable structures and then convert to immutable ones at the end, but this might require a lot of redundant coding for classes with a lot of fields. So instead, consider using the same pattern that the underlying collection uses: a job with a new task is a new job.
Here's an example that doesn't even bother reading the jobs list--it infers it from the task list. (This is an example that works under 2.7.x; recent versions of 2.8 use "Source.fromPath" instead of "Source.fromFile".)
object Example {
class Task(val id: String) {
override def toString = id
}
class Job(val id: String, val tasks: Set[Task]) {
def this(id0: String, old: Option[Job], taskID: String) = {
this(id0 , old.getOrElse(EmptyJob).tasks + new Task(taskID))
}
override def toString = id+" does "+tasks.toString
}
object EmptyJob extends Job("",Set.empty[Task]) { }
def read(fname: String):Map[String,Job] = {
val map = new scala.collection.mutable.HashMap[String,Job]()
scala.io.Source.fromFile(fname).getLines.foreach(line => {
line.split("\t") match {
case Array(j,t) => {
val jobID = j.trim
val taskID = t.trim
map += (jobID -> new Job(jobID,map.get(jobID),taskID))
}
case _ => /* Handle error? */
}
})
new scala.collection.immutable.HashMap() ++ map
}
}
scala> Example.read("tasks.txt")
res0: Map[String,Example.Job] = Map(JA -> JA does Set(T1, T3, T2), JB -> JB does Set(T2, T1), JC -> JC does Set(T1))
An alternate approach would read the job list (creating jobs as new Job(jobID,Set.empty[Task])), and then handle the error condition of when the task list contained an entry that wasn't in the job list. (You would still need to update the job list map every time you read in a new task.)
I did a feel changes for it to run on Scala 2.8 (mostly, fromPath instead of fromFile, and () after getLines). It may be using a few Scala 2.8 features, most notably groupBy. Probably toSet as well, but that one is easy to adapt on 2.7.
I don't have the files to test it, but I changed this stuff from val to def, and the type signatures, at least, match.
class Task(val id: String)
class Job(val id: String, val tasks: Set[Task])
// read tasks
val tasks = (
for {
line <- io.Source.fromPath("tasks.txt").getLines().toStream
tokens = line.split("\t")
jobId = tokens(0).trim
task = new Task(jobId + "." + tokens(1).trim)
} yield jobId -> task
).groupBy(_._1).map { case (key, value) => key -> value.map(_._2).toSet }
// read jobs
val jobs = Map() ++ (
for {
line <- io.Source.fromPath("jobs.txt").getLines()
job = new Job(line.trim, tasks(line.trim))
} yield job.id -> job
)
You could always delay the object creation until you have all the data read in from the file, like:
case class Task(id: String)
case class Job(id: String, tasks: Set[Task])
import scala.collection.mutable.{Map,ListBuffer}
val jobIds = Map[String, ListBuffer[String]]()
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = line.trim
jobIds += (job.id -> new ListBuffer[String]())
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = tokens(0).trim
val task = job.id + "." + tokens(1).trim
jobIds(job) += task
}
// create objects
val jobs = jobIds.map { j =>
Job(j._1, Set() ++ j._2.map { Task(_) })
}
To deal with more fields, you could (with some effort) make a mutable version of your immutable classes, used for building. Then, convert as needed:
case class Task(id: String)
case class Job(val id: String, val tasks: Set[Task])
object Job {
class MutableJob {
var id: String = ""
var tasks = collection.mutable.Set[Task]()
def immutable = Job(id, Set() ++ tasks)
}
def mutable(id: String) = {
val ret = new MutableJob
ret.id = id
ret
}
}
val mutableJobs = collection.mutable.Map[String, Job.MutableJob]()
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = Job.mutable(line.trim)
jobs += (job.id -> job)
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = jobs(tokens(0).trim)
val task = Task(job.id + "." + tokens(1).trim)
job.tasks += task
}
val jobs = for ((k,v) <- mutableJobs) yield (k, v.immutable)
One option here is to have some mutable but transient configurer class along the lines of the MutableMap above but then pass this through in some immutable form to your actual class:
val jobs: immutable.Map[String, Job] = {
val mJobs = readMutableJobs
immutable.Map(mJobs.toSeq: _*)
}
Then of course you can implement readMutableJobs along the lines you have already coded

Resources