Spark Streaming mapWithState timeout delayed?

As the code shows:
val mappedData = new ArrayBuffer[E]
val wrappedState = new StateImpl[S]()
// Call the mapping function on each record in the data iterator, and accordingly
// update the states touched, and collect the data returned by the mapping function
dataIterator.foreach { case (key, value) =>
  wrappedState.wrap(newStateMap.get(key))
  val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
  if (wrappedState.isRemoved) {
    newStateMap.remove(key)
  } else if (wrappedState.isUpdated
      || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
    newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
  }
  mappedData ++= returned
}

// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState)
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
When removeTimedoutData=true, newStateMap.remove(key) only marks the key as deleted in deltaMap:
override def remove(key: K): Unit = {
  val stateInfo = deltaMap(key)
  if (stateInfo != null) {
    stateInfo.markDeleted()
  } else {
    val newInfo = new StateInfo[S](deleted = true)
    deltaMap.update(key, newInfo)
  }
}
The key is only physically removed from the OpenHashMap when the delta chain is consolidated, which happens once its length exceeds DELTA_CHAIN_LENGTH_THRESHOLD (20).
My questions are:
1: Will a key that times out in the current batch be processed, since "wrappedState.exists && timeoutThresholdTime.isDefined" and "removeTimedoutData && timeoutThresholdTime.isDefined" are both true when checkpoint is invoked:
override def checkpoint(): Unit = {
  super.checkpoint()
  doFullScan = true
}
But what does it mean that "mappedData ++= returned" is executed twice for a timed-out key:
val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
mappedData ++= returned
and
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
2: When a key is marked for deletion but not yet removed from the OpenHashMap, and the next batch contains data for this key, are "wrappedState.exists && timeoutThresholdTime.isDefined" and "removeTimedoutData && timeoutThresholdTime.isDefined" still both true, so that the key is processed another time?
I reviewed the mapWithState code again and found the answer.
Question 1: will a key that times out in the current batch be processed twice, given that "wrappedState.exists && timeoutThresholdTime.isDefined" and "removeTimedoutData && timeoutThresholdTime.isDefined" are both true when checkpoint is invoked?
A key that times out in the current batch has to wait for "doFullScan = true", which only happens every batchTime * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER (default: 10). So by the time "doFullScan = true", the key has already been timed out for a while, and its state is updated first:
if (wrappedState.isUpdated
    || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
  newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
}
Then the key meets the conditions:
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState)
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
In this code, first all keys and states whose update time is older than the given threshold are fetched, and then each key's flags are set to "defined = true, timingOut = true, removed = false, updated = false".
Then "mappingFunction(batchTime, key, None, wrappedState)" is executed with None as the value, and in the mapWithState function you can react to the timeout with "if (state.isTimingOut) { doSomething }".
Question 2:
A key marked as deleted in the StateMap will only be removed from the underlying OpenHashMap when the delta chain length exceeds DELTA_CHAIN_LENGTH_THRESHOLD (20). But even if the key has not yet been removed from the OpenHashMap, you cannot get it from the StateMap, because the StateMap never returns a key that has the delete flag set.
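For reference, here is a minimal sketch (mine, not from the Spark source above) of how these two code paths look from the user side; trackCount and its counting semantics are purely illustrative:

import org.apache.spark.streaming.{Seconds, State, StateSpec}

// Illustrative mapping function: keeps a running count per key.
def trackCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  if (state.isTimingOut()) {
    // Reached via the getByTime scan above: value is None and the state is
    // wrapped with wrapTimingOutState, so update()/remove() are not allowed.
    (key, state.get())                  // emit the final count one last time
  } else {
    val newCount = state.getOption().getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)              // sets wrappedState.isUpdated
    (key, newCount)
  }
}

// Without .timeout(...), timeoutThresholdTime is never defined and the
// timeout branch above is never taken.
val spec = StateSpec.function(trackCount _).timeout(Seconds(30))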
I use Tarantool with the vshard module. When bucket_id is set, the data is distributed across the cluster, and each node has its own set of bucket_ids. How can I make a space that is fully accessible on every node (a dictionary)?
I use the following snippet to run a function on all storages (maybe there's a better way).
Suppose we have a function on storages:
local function putMyData(rows)
    box.atomic(function()
        for _, row in ipairs(rows) do
            box.space.mySpace:put(row)
        end
    end)
end

box.schema.func.create("putMyData", { if_not_exists = true })
box.schema.role.grant("public", "execute", "function", "putMyData", { if_not_exists = true })
rawset(_G, "putMyData", putMyData)
Then the following helper may be used to call the function:
local function callAll(mode, fnName, args, resHandler, timeoutSec)
    local replicaSets, err = vshard.router.routeall()
    if err ~= nil then
        error(err)
    end

    local count = 0
    for _, _ in pairs(replicaSets) do
        count = count + 1
    end

    local channel = fiber.channel(count)
    local method
    if mode == "read" then
        method = "callbro"
    else
        method = "callrw"
    end

    for _, replicaSet in pairs(replicaSets) do
        fiber.create(
            function()
                local res, fErr = replicaSet[method](replicaSet,
                    fnName, args, { timeout = timeoutSec or opts.timeout })
                channel:put({ res = res, err = fErr })
            end)
    end

    local results = { }
    for i = 1, count do
        local val = channel:get()
        if val.err ~= nil then
            error(val.err)
        end

        if resHandler == nil then
            results[i] = val.res
        else
            resHandler(val.res, results)
        end
    end

    return results
end
Take the following code:
JProperty toke = new JProperty("value", new JValue(50)); //toke.Value is 50
toke.Value.Replace(new JValue(20)); //toke.Value is 20
This works as expected. Now examine the following code:
JValue val0 = new JValue(50);
JProperty toke = new JProperty("value", val0); //toke.Value is 50
JValue val1 = new JValue(20);
toke.Value.Replace(val1); //toke.Value is 20
This also works as expected, but there is an important detail. val0 is no longer part of toke's JSON tree, and val1 is; this means that val0 has no valid parent, while val1 does.
Now take this code.
JValue val0 = new JValue(50);
JProperty toke = new JProperty("value", val0); //toke.Value is 50
JValue val1 = new JValue(50);
toke.Value.Replace(val1); //toke.Value is 50
The behavior is different; val0 is still part of toke's JSON tree, and val1 is not. Now val0 has a valid parent, while val1 does not.
This is a critical distinction: if you are using Newtonsoft JSON trees to represent a structure and storing JTokens as references into the tree, the way the references are structured can change based on the value being replaced, which seems incorrect.
Is there any flaw with my reasoning? Or is behavior incorrect, as I believe it is?
I think you have a valid point: Replace should replace the token instance and set the parent properly even if the tokens have the same values.
This works as you would expect if the property value is a JObject and you replace it with an identical JObject:
JObject obj1 = JObject.Parse(@"{ ""foo"" : 1 }");
JProperty prop = new JProperty("bar", obj1);
JObject obj2 = JObject.Parse(@"{ ""foo"" : 1 }");
prop.Value.Replace(obj2);
Console.WriteLine("obj1 parent is " +
(ReferenceEquals(obj1.Parent, prop) ? "prop" : "not prop")); // "not prop"
Console.WriteLine("obj2 parent is " +
(ReferenceEquals(obj2.Parent, prop) ? "prop" : "not prop")); // "prop"
However, the code seems to have been deliberately written to work differently for JValues. In the source code we see that JToken.Replace() calls JContainer.ReplaceItem(), which in turn calls SetItem(). In the JProperty class, SetItem() is implemented like this:
internal override void SetItem(int index, JToken item)
{
    if (index != 0)
    {
        throw new ArgumentOutOfRangeException();
    }

    if (IsTokenUnchanged(Value, item))
    {
        return;
    }

    if (Parent != null)
    {
        ((JObject)Parent).InternalPropertyChanging(this);
    }

    base.SetItem(0, item);

    if (Parent != null)
    {
        ((JObject)Parent).InternalPropertyChanged(this);
    }
}
You can see that it checks whether the value is "unchanged", and if so, it returns without doing anything. If we look at the implementation of IsTokenUnchanged() we see this:
internal static bool IsTokenUnchanged(JToken currentValue, JToken newValue)
{
    JValue v1 = currentValue as JValue;
    if (v1 != null)
    {
        // null will get turned into a JValue of type null
        if (v1.Type == JTokenType.Null && newValue == null)
        {
            return true;
        }

        return v1.Equals(newValue);
    }

    return false;
}
So, if the current token is a JValue, it checks whether it Equals the other token, otherwise the token is automatically considered to have changed. And Equals for a JValue is of course based on whether the underlying primitives themselves are equal.
I cannot speak to the reasoning behind this implementation decision, but it seems to be worth reporting an issue to the author. The "correct" fix, I think, would be to make SetItem use ReferenceEquals(Value, item) instead of IsTokenUnchanged(Value, item).
How could you check whether one string is a permutation of another using Scala/functional programming, without complex pre-built functions like sorted()?
I'm a Python dev, and what I think trips me up the most is that you can't just iterate through a dictionary of character counts, comparing it to another dictionary of character counts, and then exit when there isn't a match; you can't just call break.
Assume this is the starting point, based on your description:
val a = "aaacddba"
val b = "aabaacdd"
def counts(s: String) = s.groupBy(identity).mapValues(_.size)
val aCounts = counts(a)
val bCounts = counts(b)
This is the simplest way:
aCounts == bCounts // true
This is precisely what you described:
def isPerm(aCounts: Map[Char,Int], bCounts: Map[Char,Int]): Boolean = {
  if (aCounts.size != bCounts.size)
    return false
  for ((k,v) <- aCounts) {
    if (bCounts.getOrElse(k, 0) != v)
      return false
  }
  return true
}
This is your method, but more scala-ish. (It also breaks as soon as a mismatch is found, because of how forall is implemented):
(aCounts.size == bCounts.size) &&
aCounts.forall { case (k,v) => bCounts.getOrElse(k, 0) == v }
(Also, Scala does have break.)
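For completeness, a small sketch of what that looks like; break in Scala is a library construct from scala.util.control.Breaks, not a keyword, and isPermBreak is just an illustrative name:

import scala.util.control.Breaks._

// Early-exit loop over the counts map using breakable/break.
def isPermBreak(aCounts: Map[Char, Int], bCounts: Map[Char, Int]): Boolean = {
  if (aCounts.size != bCounts.size) return false
  var ok = true
  breakable {
    for ((k, v) <- aCounts) {
      if (bCounts.getOrElse(k, 0) != v) {
        ok = false
        break()   // implemented as an exception caught by breakable
      }
    }
  }
  ok
}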
Also, also: you should read the answer to this question.
Another option is a recursive function, which will also 'break' immediately once a mismatch is detected:
import scala.annotation.tailrec

@tailrec
def isPerm1(a: String, b: String): Boolean = {
  if (a.length == b.length) {
    a.headOption match {
      case Some(c) =>
        val i = b.indexOf(c)
        if (i >= 0) {
          isPerm1(a.tail, b.substring(0, i) + b.substring(i + 1))
        } else {
          false
        }
      case None => true
    }
  } else {
    false
  }
}
Out of my own curiosity I also created two more versions which use a char counts map for matching:
def isPerm2(a: String, b: String): Boolean = {
  val cntsA = a.groupBy(identity).mapValues(_.size)
  val cntsB = b.groupBy(identity).mapValues(_.size)
  cntsA == cntsB
}
and
def isPerm3(a: String, b: String): Boolean = {
  val cntsA = a.groupBy(identity).mapValues(_.size)
  val cntsB = b.groupBy(identity).mapValues(_.size)
  (cntsA.size == cntsB.size) && cntsA.forall { case (k, v) => cntsB.getOrElse(k, 0) == v }
}
and roughly compared their performance with:
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
// Match
time((1 to 10000).foreach(_ => isPerm1("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("apple"*100,"elppa"*100)))
// Mismatch
time((1 to 10000).foreach(_ => isPerm1("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("xpple"*100,"elppa"*100)))
and the result is:
Match cases
isPerm1 = 2337999406ns
isPerm2 = 383375133ns
isPerm3 = 382514833ns
Mismatch cases
isPerm1 = 29573489ns
isPerm2 = 381622225ns
isPerm3 = 417863227ns
As can be expected, the char counts map speeds up the positive cases but can slow down the negative ones (there is overhead in building the char counts maps).
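If you want the early exit without giving up counting, one possible sketch (my own, not from the answer above) checks the lengths first and then decrements a single mutable count table while scanning the second string, stopping at the first impossible character:

def isPerm4(a: String, b: String): Boolean = {
  if (a.length != b.length) return false
  val counts = scala.collection.mutable.Map[Char, Int]().withDefaultValue(0)
  a.foreach(c => counts(c) += 1)
  // forall short-circuits at the first character of b that exhausts its budget in a
  b.forall { c =>
    val left = counts(c)
    counts(c) = left - 1
    left > 0
  }
}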
var m_root : Node = root

private def insert(key: Int, value: Int): Node = {
  if (m_root == null) {
    m_root = Node(key, value, null, null)
  }
  var t : Node = m_root
  var flag : Int = 1
  while (t != null && flag == 1) {
    if (key == t.key) {
      t
    }
    else if (key < t.key) {
      if (t.left == null) {
        t.left = Node(key, value, null, null)
        flag = 0
      } else {
        t = t.left
      }
    } else {
      if (t.right == null) {
        t.right = Node(key, value, null, null)
        flag = 0
      } else {
        t = t.right
      }
    }
  }
  t
}
I wrote an iterative version of inserting a node into a binary search tree. I want it to terminate when the node is created, but it doesn't stop, because I think I didn't set a terminating condition. How do I edit my code so it terminates when a node is inserted?
I'm not sure exactly what behaviour you want, but the cause is quite clear.
Your loop is a while condition, which will loop until t is null. So while t is non-null the loop will continue.
You only ever assign t to non-null values - in fact you're specifically checking for the null case and stopping it happening by creating a new node.
So either you need to reconsider your loop condition, or ensure t does in fact become null in some cases, depending on what your actual algorithm requirements are.
And since you're returning t at the bottom, I suggest the while condition is wrong; the only possible way this could terminate is if t is null at this point, so it would be pointless to return this anyway.
The first clause of your "if" statement in the loop
if(key == t.key) {
  t
}
... does nothing if the comparison is true. It doesn't terminate the loop. The statement t is not synonymous with return t here. You can set flag = 0 at that point to terminate the loop.
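Putting that together, a corrected sketch (the Node definition is assumed, since the question doesn't show it):

// Assumed mutable node; the original Node definition is not shown.
case class Node(key: Int, var value: Int, var left: Node, var right: Node)

var m_root: Node = null

def insert(key: Int, value: Int): Node = {
  if (m_root == null) {
    m_root = Node(key, value, null, null)
    return m_root
  }
  var t: Node = m_root
  var done = false
  while (!done) {
    if (key == t.key) {
      t.value = value                         // key already present: update and stop
      done = true
    } else if (key < t.key) {
      if (t.left == null) {
        t.left = Node(key, value, null, null)
        t = t.left                            // return the freshly inserted node
        done = true
      } else t = t.left
    } else {
      if (t.right == null) {
        t.right = Node(key, value, null, null)
        t = t.right
        done = true
      } else t = t.right
    }
  }
  t
}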
I have a data structure made of Jobs each containing a set of Tasks. Both Job and Task data are defined in files like these:
jobs.txt:
JA
JB
JC
tasks.txt:
JB T2
JA T1
JC T1
JA T3
JA T2
JB T1
The process of creating objects is the following:
- read each job, create it and store it by id
- read task, retrieve job by id, create task, store task in the job
Once the files are read, this data structure is never modified. So I would like the tasks within jobs to be stored in an immutable set, but I don't know how to do that efficiently. (Note: the map storing jobs may be left mutable.)
Here is a simplified version of the code:
class Task(val id: String)

class Job(val id: String) {
  val tasks = collection.mutable.Set[Task]() // This should be immutable
}

val jobs = collection.mutable.Map[String, Job]() // This is ok to be mutable

// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
  val job = new Job(line.trim)
  jobs += (job.id -> job)
}

// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
  val tokens = line.split("\t")
  val job = jobs(tokens(0).trim)
  val task = new Task(job.id + "." + tokens(1).trim)
  job.tasks += task
}
Thanks in advance for every suggestion!
The most efficient way to do this would be to read everything into mutable structures and then convert to immutable ones at the end, but this might require a lot of redundant coding for classes with a lot of fields. So instead, consider using the same pattern that the underlying collection uses: a job with a new task is a new job.
Here's an example that doesn't even bother reading the jobs list--it infers it from the task list. (This is an example that works under 2.7.x; recent versions of 2.8 use "Source.fromPath" instead of "Source.fromFile".)
object Example {
  class Task(val id: String) {
    override def toString = id
  }
  class Job(val id: String, val tasks: Set[Task]) {
    def this(id0: String, old: Option[Job], taskID: String) = {
      this(id0, old.getOrElse(EmptyJob).tasks + new Task(taskID))
    }
    override def toString = id + " does " + tasks.toString
  }
  object EmptyJob extends Job("", Set.empty[Task]) { }
  def read(fname: String): Map[String, Job] = {
    val map = new scala.collection.mutable.HashMap[String, Job]()
    scala.io.Source.fromFile(fname).getLines.foreach(line => {
      line.split("\t") match {
        case Array(j, t) => {
          val jobID = j.trim
          val taskID = t.trim
          map += (jobID -> new Job(jobID, map.get(jobID), taskID))
        }
        case _ => /* Handle error? */
      }
    })
    new scala.collection.immutable.HashMap() ++ map
  }
}
scala> Example.read("tasks.txt")
res0: Map[String,Example.Job] = Map(JA -> JA does Set(T1, T3, T2), JB -> JB does Set(T2, T1), JC -> JC does Set(T1))
An alternate approach would read the job list (creating jobs as new Job(jobID,Set.empty[Task])), and then handle the error condition of when the task list contained an entry that wasn't in the job list. (You would still need to update the job list map every time you read in a new task.)
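A sketch of that alternate approach, reusing the class definitions above (the error handling is mine, just to show where it would go):

def readWithJobList(jobsFile: String, tasksFile: String): Map[String, Job] = {
  val map = new scala.collection.mutable.HashMap[String, Job]()
  // Create every job up front with an empty task set.
  scala.io.Source.fromFile(jobsFile).getLines.foreach(line =>
    map += (line.trim -> new Job(line.trim, Set.empty[Task])))
  scala.io.Source.fromFile(tasksFile).getLines.foreach(line => {
    line.split("\t") match {
      case Array(j, t) =>
        map.get(j.trim) match {
          // A job with a new task is a new job; re-insert the replacement.
          case Some(job) => map += (j.trim -> new Job(j.trim, job.tasks + new Task(t.trim)))
          case None      => Console.err.println("Task for unknown job: " + j.trim)
        }
      case _ => /* Handle error? */
    }
  })
  new scala.collection.immutable.HashMap() ++ map
}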
I did a few changes for it to run on Scala 2.8 (mostly, fromPath instead of fromFile, and () after getLines). It may be using a few Scala 2.8 features, most notably groupBy. Probably toSet as well, but that one is easy to adapt on 2.7.
I don't have the files to test it, but I changed this stuff from val to def, and the type signatures, at least, match.
class Task(val id: String)
class Job(val id: String, val tasks: Set[Task])

// read tasks
val tasks = (
  for {
    line <- io.Source.fromPath("tasks.txt").getLines().toStream
    tokens = line.split("\t")
    jobId = tokens(0).trim
    task = new Task(jobId + "." + tokens(1).trim)
  } yield jobId -> task
).groupBy(_._1).map { case (key, value) => key -> value.map(_._2).toSet }

// read jobs
val jobs = Map() ++ (
  for {
    line <- io.Source.fromPath("jobs.txt").getLines()
    job = new Job(line.trim, tasks(line.trim))
  } yield job.id -> job
)
You could always delay the object creation until you have all the data read in from the file, like:
case class Task(id: String)
case class Job(id: String, tasks: Set[Task])

import scala.collection.mutable.{Map,ListBuffer}
val jobIds = Map[String, ListBuffer[String]]()

// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
  val job = line.trim
  jobIds += (job -> new ListBuffer[String]())
}

// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
  val tokens = line.split("\t")
  val job = tokens(0).trim
  val task = job + "." + tokens(1).trim
  jobIds(job) += task
}

// create objects
val jobs = jobIds.map { j =>
  Job(j._1, Set() ++ j._2.map { Task(_) })
}
To deal with more fields, you could (with some effort) make a mutable version of your immutable classes, used for building. Then, convert as needed:
case class Task(id: String)
case class Job(val id: String, val tasks: Set[Task])

object Job {
  class MutableJob {
    var id: String = ""
    var tasks = collection.mutable.Set[Task]()
    def immutable = Job(id, Set() ++ tasks)
  }
  def mutable(id: String) = {
    val ret = new MutableJob
    ret.id = id
    ret
  }
}
val mutableJobs = collection.mutable.Map[String, Job.MutableJob]()

// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
  val job = Job.mutable(line.trim)
  mutableJobs += (job.id -> job)
}

// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
  val tokens = line.split("\t")
  val job = mutableJobs(tokens(0).trim)
  val task = Task(job.id + "." + tokens(1).trim)
  job.tasks += task
}
val jobs = for ((k,v) <- mutableJobs) yield (k, v.immutable)
One option here is to have some mutable but transient configurer class along the lines of the MutableJob above, but then pass this through in some immutable form to your actual class:
val jobs: immutable.Map[String, Job] = {
  val mJobs = readMutableJobs
  immutable.Map(mJobs.toSeq: _*)
}
Then of course you can implement readMutableJobs along the lines you have already coded.
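For instance, one possible readMutableJobs, assuming the Job.MutableJob builder from the previous answer and the same two input files:

// Hypothetical implementation: build with MutableJob, convert on the way out.
def readMutableJobs: collection.mutable.Map[String, Job] = {
  val building = collection.mutable.Map[String, Job.MutableJob]()
  for (line <- io.Source.fromFile("jobs.txt").getLines) {
    val job = Job.mutable(line.trim)
    building += (job.id -> job)
  }
  for (line <- io.Source.fromFile("tasks.txt").getLines) {
    val tokens = line.split("\t")
    val job = building(tokens(0).trim)
    job.tasks += Task(job.id + "." + tokens(1).trim)
  }
  building.map { case (k, v) => (k, v.immutable) }
}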