I am trying to solve a problem where I have 3 columns in a CSV file, like below:
connection Distance Duration
Prague<>Berlin 400 4
Warsaw<>Berlin 600 6
Berlin<>Munich 800 8
Munich<>Vienna 400 3.5
Munich<>Stuttgart 800 8
Stuttgart<>Freiburg 150 2
I need to find out which cities I can reach within a given time from the origin city.
For example, if I give the following input:
Input: Berlin, 10
Output: ["Prague","Munich","Warsaw"]
Input : Berlin, 30
Output : ["Prague","Munich","Warsaw", "Vienna", "Stuttgart",
This looks like a graph problem.
I am trying this with Scala; can someone help, please?
Below is what I tried.
I got it working partially.
import scalax.collection.Graph // or scalax.collection.mutable.Graph
import scalax.collection.GraphPredef._, scalax.collection.GraphEdge._
import scalax.collection.edge.WDiEdge
import scalax.collection.edge.Implicits._
val rows = """Prague<>Berlin,400,4
I am preparing the input for my application.
Below I have a list of the cities which are present in the given file.
NOTE: This list could be built from the file itself while reading it. Here I kept all names in lowercase.
val cityList = List("warsaw","berlin","prague","munich","vienna","stuttgart","freiburg")
Now creating a case class:
case class Bus(connection: String, distance: Int, duration: Float)
val buses: List[Bus] = rows.map(row => {
  val r = row.split(",") // each row is "connection,distance,duration"
  Bus(r(0).toLowerCase, r(1).toInt, r(2).toFloat)
})
case class City(name: String)
// case class BusMeta(distance: Int, duration: Float)
val t = buses.map(bus => {
  val s = bus.connection.split("<>")
  City(s.head) ~ City(s.last) % bus.duration
})
val busGraph = Graph(t:_*)
From the above we create the graph as required from the input file ("busGraph" in my case).
import scala.collection.mutable
val travelFrom = ("BERLIN").toLowerCase
val travelDuration = 16F
val possibleCities: mutable.Set[String] = mutable.Set()
if (cityList.contains(travelFrom)) {
  busGraph.nodes.get(City(travelFrom)).edges.filter(_.weight <= travelDuration).map(edge => edge.map(_.name)).flatten.toSet.filterNot(_ == travelFrom).foreach(possibleCities.add)
  println("City PRESENT in File")
} else {
  println("City Not Present in File")
}
The output I am getting here:
possibleCities: scala.collection.mutable.Set[String] = Set(munich, warsaw, prague)
Expected Output : Set(munich, warsaw, prague, stuttgart, Vienna)
Your solution only finds direct routes (that's why your output is shorter than expected). To get the complete answer, you also need to consider connections, by recursively traversing the graph from each of the direct destinations.
Also, do not use mutable collections, they are evil.
Here is a possible solution for you:
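import java.util.Scanner
import scala.io.Source
// First, create the graph structure
def routes: Map[String, Seq[(String, Double)]] = Source
  .fromFile("routes.csv") // assumed file name; the original path is not shown
  .getLines().map(_.split(","))
  .flatMap { case Array(from_to, dist, time) =>
    val Array(from, to) = from_to.split("<>")
    Seq(from -> (to, time.toDouble), to -> (from, time.toDouble))
  }.toSeq.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }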
// Now search for suitable routes
def reachable(
routes: Map[String, Seq[(String, Double)]],
from: String,
limit: Double,
cut: Set[String] = Set.empty
): Set[String] = routes
.getOrElse(from, Nil)
.filter(_._2 <= limit)
.filterNot { case (n, _) => cut(n) }
.flatMap { case (name, time) =>
  reachable(routes, name, limit - time, cut + from) + name
}.toSet
// And here is how you use it
def main(argv: Array[String]): Unit = {
  val Array(from, limit) = new Scanner(System.in).nextLine().split("\\s")
  val reach = reachable(routes, from, limit.toDouble)
  println(reach.mkString(", "))
}
Do a breadth first search from the origin city, stopping going deeper when you reach the time limit. Output the stops reached by the search.
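A minimal sketch of that idea in Scala, assuming the connections are undirected and already parsed into memory. Because the legs have different durations, this version keeps the best known arrival time per city and re-expands a city whenever a faster route to it is found, so it is a breadth-first search with relaxation rather than a plain hop-counting BFS. The names ReachableCities, connections, graph and reachableWithin are made up for illustration.

```scala
object ReachableCities {
  // Sample data from the question: (cityA, cityB, durationInHours)
  val connections: Seq[(String, String, Double)] = Seq(
    ("Prague", "Berlin", 4.0),
    ("Warsaw", "Berlin", 6.0),
    ("Berlin", "Munich", 8.0),
    ("Munich", "Vienna", 3.5),
    ("Munich", "Stuttgart", 8.0),
    ("Stuttgart", "Freiburg", 2.0)
  )

  // Undirected adjacency list: city -> (neighbour, duration)
  val graph: Map[String, Seq[(String, Double)]] =
    connections
      .flatMap { case (a, b, t) => Seq((a, (b, t)), (b, (a, t))) }
      .groupBy(_._1)
      .map { case (city, edges) => (city, edges.map(_._2)) }

  def reachableWithin(origin: String, limit: Double): Set[String] = {
    // frontier: cities whose arrival time improved in the previous round
    // best: fastest known arrival time for every city seen so far
    @scala.annotation.tailrec
    def loop(frontier: Map[String, Double], best: Map[String, Double]): Map[String, Double] =
      if (frontier.isEmpty) best
      else {
        val candidates = for {
          (city, spent) <- frontier.toSeq
          (next, t) <- graph.getOrElse(city, Nil)
          total = spent + t
          // stop going deeper once the limit is exceeded or no improvement is made
          if total <= limit && best.get(next).forall(total < _)
        } yield (next, total)
        // keep only the fastest arrival per city found in this round
        val improved = candidates.groupBy(_._1).map { case (c, xs) => (c, xs.map(_._2).min) }
        loop(improved, best ++ improved)
      }

    loop(Map(origin -> 0.0), Map(origin -> 0.0)).keySet - origin
  }
}
```

With the sample data, ReachableCities.reachableWithin("Berlin", 16.0) should give the five cities from the expected output in the question. City names are matched case-sensitively here.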
To the best of my knowledge, this task should be solved with a graph adjacency matrix, which first needs to be built from the input data.
In your particular case the adjacency matrix would be 2D, with the cities on the rows and columns and the weight of each connection as the value.
See screenshot from Excel with example below,
At the first iteration you search for possible routes from the starting city and store the city name (row/column id) and the weight.
At each following iteration you try to extend a route and compare the total with the limit (can you add the next leg or not; also make sure you are not adding the same city twice).
To store the results you will again need a 2D array, where the first element is a possible route and each following element is a tuple of the visited city and the value used.
After a few iterations you should have all the possible options and can just provide a summary of what was found.
TL;DR: Most graph algorithms use or depend (to varying extents) on a graph adjacency matrix.
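As a rough sketch of the first two steps (building the matrix and doing the first iteration), assuming undirected connections and using Double.PositiveInfinity to mark "no direct route". The names AdjacencyMatrixSketch, NoRoute and directNeighbours are made up for illustration; the later iterations are not shown.

```scala
object AdjacencyMatrixSketch {
  // Sample data from the question: (cityA, cityB, durationInHours)
  val connections: Seq[(String, String, Double)] = Seq(
    ("Prague", "Berlin", 4.0),
    ("Warsaw", "Berlin", 6.0),
    ("Berlin", "Munich", 8.0),
    ("Munich", "Vienna", 3.5),
    ("Munich", "Stuttgart", 8.0),
    ("Stuttgart", "Freiburg", 2.0)
  )

  // Row/column ids: every city gets a fixed position in the matrix
  val cities: Vector[String] =
    connections.flatMap { case (a, b, _) => Seq(a, b) }.distinct.sorted.toVector
  val index: Map[String, Int] = cities.zipWithIndex.toMap

  // Assumption: infinity means there is no direct connection
  val NoRoute: Double = Double.PositiveInfinity

  // Symmetric matrix, because every connection works in both directions
  val matrix: Array[Array[Double]] = {
    val m = Array.fill(cities.size, cities.size)(NoRoute)
    connections.foreach { case (a, b, t) =>
      m(index(a))(index(b)) = t
      m(index(b))(index(a)) = t
    }
    m
  }

  // First iteration: scan the row of the starting city for entries within the limit
  def directNeighbours(city: String, limit: Double): Seq[String] = {
    val row = matrix(index(city))
    cities.indices.filter(j => row(j) <= limit).map(j => cities(j))
  }
}
```

For a sparse network like this one most cells stay at NoRoute, so an adjacency list may be a lighter alternative; the matrix mainly makes the row-by-row lookup described above easy to follow.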
I have my data stored in a JSON format using the following structure:
I want to get the distribution of the spacing (differences between the consecutive numbers) between the values averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
I just now realized however, that this was not a good idea. The data in the above JSON now looks like this:
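val gen_1 = spark.sql("SELECT * FROM eig where generationId = 1")
+------------+--------------------+
|generationId|              values|
+------------+--------------------+
|           1|[-36.0431, -35.91...|
+------------+--------------------+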
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000. Later, I would like to plot this distribution together with an already known function (simulation vs theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double] // fewer than two values: no spacing left to compute
  } else {
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
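For long arrays (the question mentions ~10000 values) a non-recursive variant may be easier on the stack. The sketch below uses sliding(2) from the standard collections; the names Spacing and spacings are made up for illustration, and it should work for collections of any size, including empty ones.

```scala
object Spacing {
  // Absolute difference between each pair of consecutive values.
  // sliding(2) yields windows of two neighbours; a window shorter than
  // two elements (input with fewer than two values) is dropped by collect.
  def spacings(values: Seq[Double]): Seq[Double] =
    values.sliding(2).collect { case Seq(a, b) => math.abs(b - a) }.toList
}
```

It could presumably be swapped in for differences in the Spark map above without other changes, since it has the same signature.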
The following code demonstrates the problem I have (in the real case SortedMap keys are Joda DateTime, and maps contain several thousands of elements).
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.immutable.SortedMap
object Main extends App {
val s = SortedMap(1 -> "A", 2 -> "B", 3 -> "C")
def f(s: String) = s
val sMap = s.map(kv => kv._1 -> f(kv._2)) // slow: rebuilds Map, as keys could change
val sMapValues = s.mapValues(f) // fast, but creates a view only
val so = new ByteArrayOutputStream
val oos = new ObjectOutputStream(so)
oos.writeObject(s) // works
oos.writeObject(sMap) // works
oos.writeObject(sMapValues) // does not work - view only
}
The problem is that while mapValues performs well for SortedMap, the result is not a real collection but a view, and as such it cannot be serialized. The simple solution of mapping both keys and values works, but it is slow: the tree representation is rebuilt, because map does not know that I am not changing the keys.
Is there any fast alternative to SortedMap.mapValues, which outputs a serializable result?
Try transform:
val sMapValues = s.transform((_,v) => f(v))
Although the key and the value are provided to the transformation lambda, the result is applied only to the value, the key is unchanged.
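A small sketch to check this, assuming Int keys instead of the Joda DateTime keys from the question. The function f here (lowercasing) and the helper serializedSize are made up for illustration.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.immutable.SortedMap

object TransformSketch {
  val s: SortedMap[Int, String] = SortedMap(1 -> "A", 2 -> "B", 3 -> "C")

  // Hypothetical value transformation, standing in for the real f
  def f(v: String): String = v.toLowerCase

  // transform is strict: the result is a real SortedMap, not a view
  val transformed: SortedMap[Int, String] = s.transform((_, v) => f(v))

  // Writes the object with Java serialization and returns the number of bytes;
  // throws NotSerializableException if the object cannot be serialized
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size
  }
}
```

Calling TransformSketch.serializedSize(TransformSketch.transformed) should complete without an exception. Whether transform is actually faster than map for your maps is worth measuring, as that can depend on the Scala version.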
I have to write a distributed application which counts the maximum number of unique values for 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
Then beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
So my approach is to first merge each pair of records, and then merge each merged pair with each single record.
Can I do this kind of operation, working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark? Or maybe my approach is incorrect and I should do this a different way?
Any advice will be appreciated.
Can I do this kind of operation, working on (merging) more than one input row at the same time, in frameworks like Hadoop/Spark?
Yes, you can.
Or maybe my approach is incorrect and I should do this a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and do the analysis using MapReduce/Spark.
An example Spark application written in Scala:
import java.util
import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
val sparkConf = new SparkConf().setAppName("My Counter")
val sc = new SparkContext(sparkConf)
def main(args: Array[String]) {
val inputFile = sc.textFile("hdfs:///inputfile.txt")
val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
val triplets = keys.cartesian(keys).cartesian(keys)
.map(z => (z._1._1, z._1._2, z._2))
.filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // get "(u1,u2,u3)" triplets
// If you have small numbers of (u1,u2,u3) triplets, it's better prepare them locally.
val res = triplets.cartesian(inputFile).filter(z => {
z._2.startsWith(z._1._1) || z._2.startsWith(z._1._2) || z._2.startsWith(z._1._3)
}) // (u1,u2,u3) only matches line starts with u1,u2,u3, for example "u1: u2,u3,u4,u5,u6"
.reduceByKey((a, b) => a + b) // merge three lines
.map(z => {
val line = z._2
val values = line.split(",")
//count unique values using set
val set = new util.HashSet[String]()
for (value <- values) {
  set.add(value)
}
"key=" + z._1 + ", count=" + set.size() // the result from one mapper is a string
})
for (line <- res) {
  println(line)
}
}
}
The code is not tested and is not efficient. It could be optimized (for example, by removing unnecessary map-reduce steps).
You can rewrite the same version using Python/Java.
You can implement the same logic using Hadoop/MapReduce.
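For the small-data case mentioned above, a local version could look roughly like the sketch below. It assumes every line has the form "key: v1,v2,..." and that the record keys themselves are excluded from the merged values, which is what the expected results in the question suggest. The names TripleCounter, records and tripleCounts are made up for illustration.

```scala
object TripleCounter {
  // Sample input from the question
  val input: Seq[String] = Seq(
    "u1: u2,u3,u4,u5,u6",
    "u2: u1,u4,u6,u7,u8",
    "u3: u1,u4,u5,u9",
    "u4: u1,u2,u3,u6"
  )

  // Parse each line into key -> set of values
  val records: Map[String, Set[String]] = input.map { line =>
    val Array(key, rest) = line.split(":", 2)
    key.trim -> rest.split(",").map(_.trim).filter(_.nonEmpty).toSet
  }.toMap

  // For every combination of three records, merge their values,
  // drop the three keys themselves and count what is left
  def tripleCounts: Seq[(Seq[String], Int)] =
    records.keys.toSeq.sorted.combinations(3).map { triple =>
      val merged = triple.flatMap(k => records(k)).toSet -- triple
      triple -> merged.size
    }.toList
}
```

The number of triples grows roughly with the cube of the number of records, so this is only reasonable while the whole input fits comfortably on one machine.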
I want a simple graph like:
The data I have is a simple list of transactions with two properties:
I tried d3.layout.histogram().bins() but it seems it only supports counting the transactions.
I can't be the only one looking for this, can I?
Ok, so the IRC folks helped me out and pointed to nest, which works great (this is CoffeeScript):
nested_data = d3.nest()
  .key((d) -> d3.time.day(d.timestamp))
  .rollup((a) -> d3.sum(a, (d) -> d.amount))
  .entries(incoming_data) # An array of {timestamp: ..., amount: ...} objects
# Optional
nested_data.map (d) ->
  d.date = new Date(d.key)
The trick here is d3.time.day, which takes a timestamp and tells you which day (midnight at the start of that day) the timestamp belongs to. This function and the other ones like d3.time.week, etc. can bin time series very well.
The other trick is the nest().rollup() function, which, after the events have been grouped by key(), sums all of the events on a given day.
The last thing I wanted was to interpolate empty values on the days where I had no transactions. This is the last part of the code:
# Interpolate empty vals
nested_data.sort((a, b) -> d3.descending(a.date, b.date))
ex = d3.extent(nested_data, (d) -> d.date)
each_day = d3.time.days(ex[0], ex[1])
# Build a hashmap with the days we have
data_hash = {}
angular.forEach(data, (d) ->
  data_hash[d.date] = d.values
)
# Build a new array for each day, including those where we didn't have transactions
new_data = []
angular.forEach(each_day, (d) ->
  val = 0
  if data_hash[d]
    val = data_hash[d]
  new_data.push({date: d, values: val})
)
final_data = new_data
Hope this helps somebody!
The histogram code doesn't support this, but you can easily do the binning yourself. Assuming that you have a date and a count for each transaction, you can bin by day like this.
var bins = {};
transactions.forEach(function(t) {
  var key = t.date.toDateString();
  bins[key] = bins[key] || 0;
  bins[key] += t.amount;
});
You can obviously parse the date string back into a date if you need it; the point of using .toDateString() here is that the time part is chopped off and everything binned by day. If you want to bin by another time interval, you can use the same technique and extract a different part of the date.
It works when I run example-6-llda-learn.scala as follows:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
CaseFolder() ~> // lowercase everything
WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
MinimumLengthFilter(3) // take terms with >=3 characters
}
val text = {
source ~> // read from the source file
Column(4) ~> // select column containing text
TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
TermCounter() ~> // collect counts (needed below)
TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
// define fields from the dataset we are going to slice against
val labels = {
source ~> // read from the source file
Column(2) ~> // take column two, the year
TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
TermCounter() ~> // collect label counts
TermMinimumDocumentCountFilter(10) // filter labels in < 10 docs
}
val dataset = LabeledLDADataset(text, labels);
// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);
// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But it does not work when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
Also, the CVB0 method uses a lot of memory. When I train on a corpus of 10,000 documents with about 10 labels per document, it takes 30G of memory.
I've encountered the same situation and indeed I believe it's a bug. Check GIbbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, from line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
(countTopicTerm(z)(term)+termSmooth) /
doc.labels is self-explanatory, and doc.theta records the distribution (counts, actually) of its labels, which has the same size as doc.labels.
zI is the index variable iterating over doc.labels, while the value z gets the actual label number. Here comes the problem: it's possible that a document has only one label, say 1000, so zI is 0 and z is 1000, and then doc.theta(z) goes out of range.
I suppose the solution would be to modify doc.theta(z) to doc.theta(zI).
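To illustrate the difference between the two lookups, here is a tiny stand-alone sketch. It does not use the toolbox itself: labels and theta are plain arrays standing in for doc.labels and doc.theta, and all names and values are made up.

```scala
object LabelIndexSketch {
  // Hypothetical document with a single label whose id is 1000
  val labels: Array[Int] = Array(1000)
  // One count per entry of labels, so it has the same length (1)
  val theta: Array[Double] = Array(3.0)

  // Lookup as in the quoted code: the label id is used as the index into theta
  def thetaByLabelId(zI: Int): Double = theta(labels(zI))

  // Proposed lookup: the position within labels is used as the index
  def thetaByPosition(zI: Int): Double = theta(zI)
}
```

Here thetaByPosition(0) returns the stored count, while thetaByLabelId(0) tries to read theta(1000) and throws an ArrayIndexOutOfBoundsException.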
(I'm trying to check whether the results would be meaningful, anyway this bug has made me not so confident in this toolbox.)