Global class variable used in a Spark Streaming process: is it a broadcast variable? - spark-streaming

I just need to know whether a global public class variable, used in a Spark Streaming process, will be treated as a broadcast variable.
For now, I have succeeded in using a preset variable "inventory" inside a JavaDStream transformation.
class Foo {
    public static Map<String, String> inventory;

    public static void main(String[] args) {
        inventory = Inventory.load(); // here I set the variable
        SparkSession sparkSession = ...
        JavaStreamingContext ssc = ... // here I initialize the Spark Streaming Context
        JavaInputDStream<ConsumerRecord<String, String>> records = ...
        JavaDStream<Map<String, Object>> processedRecords = records.flatMap(rawRecord -> {
            return f(rawRecord, inventory); // just an example...
        });
    }
}
What I understand is that the code inside the lambda expression (operating on rawRecord) runs distributed, so I presume that "inventory" is broadcast to each executor that performs the processing. Is that so?

A global class variable is different from a broadcast variable.
Using a class variable works, but it can be inefficient, especially
for large variables such as a lookup table or a machine learning
model. The reason is that when you use a variable in a
closure (or a class variable, in your case), it must be deserialized on
the worker nodes many times (once per task). Moreover, if you use the
same variable in multiple Spark actions and jobs, it will be re-sent
to the workers with every job instead of once.
Broadcast variables are shared, immutable variables that
are cached on every machine in the cluster instead of being serialized with every single task.
All you need to do is
Broadcast<Map<String,String>> broadcast = ssc.sparkContext().broadcast(inventory);
and access it
broadcast.value().get(key)
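Put together with the question's code, a minimal sketch could look like this (Inventory.load() and f(...) come from the question, Broadcast is org.apache.spark.broadcast.Broadcast, and the rest of the setup is elided as before):
// broadcast the inventory once from the driver...
Broadcast<Map<String, String>> broadcastInventory =
        ssc.sparkContext().broadcast(Inventory.load());

// ...and read it inside the DStream transformation via value()
JavaDStream<Map<String, Object>> processedRecords = records.flatMap(rawRecord -> {
    return f(rawRecord, broadcastInventory.value()); // just an example, as in the question
});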

Yes, you have to broadcast that variable to keep it available to all the executors in the distributed environment.

Related

JShell access to variables defined outside of jshell instance

From inside a JShell script, is it possible to access or register variables that are defined in the code that is also creating the JShell instance?
Currently there seems to be no mechanism to either access or register a variable with the JShell instance, or to return non-string types (like objects or lambdas) from inside JShell.
ex:
import jdk.jshell.JShell;
import jdk.jshell.JShellException;
import jdk.jshell.SnippetEvent;
import java.util.List;

public class Main {
    public static void main(String[] args) throws JShellException {
        var localVar = 1;
        JShell shell = JShell.create();
        // How to register the localVar variable with the shell instance, or access variables from the enclosing scope?
        List<SnippetEvent> events = shell.eval("var x = localVar;");
        SnippetEvent event = events.get(0);
        System.out.println("Kind: " + event.snippet().kind() + ", Value: " + event.value());
    }
}
While you can't access local names like in your example, you can create a JShell instance that executes in the same JVM that created it. For this you would use the LocalExecutionControl. Using this execution control you could move localVar to a static field in your Main class and then access it from "inside" the JShell code with Main.localVar.
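A minimal sketch of that approach, assuming Main is visible to the snippets and using LocalExecutionControlProvider from jdk.jshell.execution to run them in the same JVM:
import jdk.jshell.JShell;
import jdk.jshell.SnippetEvent;
import jdk.jshell.execution.LocalExecutionControlProvider;
import java.util.List;

public class Main {
    // a static field instead of a local variable, so the snippet can reach it
    public static int localVar = 1;

    public static void main(String[] args) {
        // run snippets in the creating JVM instead of a spawned remote agent
        JShell shell = JShell.builder()
                .executionEngine(new LocalExecutionControlProvider(), null)
                .build();

        List<SnippetEvent> events = shell.eval("var x = Main.localVar;");
        SnippetEvent event = events.get(0);
        System.out.println("Kind: " + event.snippet().kind() + ", Value: " + event.value());
    }
}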
Unfortunately, as the API is designed to support execution providers that could live in a different process or even on a different machine, the return type is a string. If you are interested in a hack, the IJava Jupyter kernel needed an implementation of eval that returned an Object; it ended up using an ExecutionControl implementation based on DirectExecutionControl that stores the result of an eval call in a map and returns a unique id referencing that result. Using the shell, you then look up the result from the id returned by eval (think of something like results.get(eval(sourceCode))). That implementation is on GitHub in IJavaExecutionControl.java and IJavaExecutionControlProvider.java, with a sample usage in CodeEvaluator.java#L72, if you are interested in taking any of it (MIT license).

How can I tell that the limit was exceeded when I use limit() on a range of items from a stream using Java 8 lambdas?

How can I know, without using another condition to compare map.size() with limitValue, that the limit was exceeded when my stream iterated?
Here,
for limitValue = 3, it should return false.
for limitValue = 4, it should return true.
I cannot use an outside int field, as it must be final to be used inside the lambda.
import java.util.*;
import java.util.stream.*;

public class Test {
    public static void main(String[] args) throws Exception {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "foo");
        map.put(2, "bar");
        map.put(3, "baz");
        int limitValue = 3;
        String result = map.entrySet()
                .stream()
                .limit(limitValue)
                .map(entry -> entry.getKey() + " - " + entry.getValue())
                .collect(Collectors.joining(", "));
        System.out.println(result);
    }
}
I cannot use an outside int field, as it must be final to be used inside the lambda.
Yes, this is because, within a lambda expression, you can only reference local variables whose values don't change (in Java).
This is a good thing in a way, since mutating variables inside a lambda is not thread-safe when executing in parallel.
So the compiler helps you prevent such scenarios by allowing only final or effectively final local variables to be used in lambdas.
Note that this restriction only holds for local variables.
Anyhow, my advice is not to mutate variables that are not solely contained within a given function, as that introduces a side effect, and side effects in behavioral parameters to stream operations are, in general, discouraged.
Keep things simple and proceed with the approach below.
boolean exceeded = limitValue > map.size();
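For instance, with the map and limitValue from the question, a minimal illustration would be:
boolean exceeded = limitValue > map.size(); // true when limit() could not be filled

String result = map.entrySet()
        .stream()
        .limit(limitValue)
        .map(entry -> entry.getKey() + " - " + entry.getValue())
        .collect(Collectors.joining(", "));

System.out.println(result + " | limit exceeded: " + exceeded);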

When I use Storm, how can I ensure that a bolt with multiple inputs processes only when all the inputs arrive?

The topology looks like this.
how can I ensure that a bolt with multiple inputs processes only when all the inputs arrive?
Bolt.execute() is called for each incoming tuple, regardless of what the producer was (and you cannot change this). If you want to process multiple tuples from different producers at once, you need to write custom UDF code (see the sketch after these steps):
You need an input buffer for each producer that can hold incoming tuples (maybe a LinkedList<Tuple> as a bolt member).
For each incoming tuple, you add the tuple to the corresponding buffer (you can access the producer information in the tuple's metadata via input.getSourceComponent()).
After adding the tuple to the buffer, you check whether each buffer contains at least one tuple: if yes, take one tuple from each buffer and process them (after processing, check the buffers again until at least one buffer is empty); if no, just return and do not process anything.
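A minimal sketch of that idea, assuming the org.apache.storm API (Storm 1.x-style signatures) and two illustrative upstream components "spoutA" and "spoutB"; the actual join logic is left as a placeholder:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class BufferingJoinBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, LinkedList<Tuple>> buffers; // one buffer per producer

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffers = new HashMap<>();
        for (String producer : Arrays.asList("spoutA", "spoutB")) {
            buffers.put(producer, new LinkedList<>());
        }
    }

    @Override
    public void execute(Tuple input) {
        // buffer the tuple under the component that produced it
        buffers.get(input.getSourceComponent()).add(input);

        // process only while every buffer holds at least one tuple
        while (buffers.values().stream().noneMatch(LinkedList::isEmpty)) {
            List<Tuple> batch = new ArrayList<>();
            for (LinkedList<Tuple> buffer : buffers.values()) {
                batch.add(buffer.poll());
            }
            // ... combine the buffered tuples here; this sketch just emits the batch size ...
            collector.emit(batch, new Values(batch.size()));
            batch.forEach(collector::ack);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("joined"));
    }
}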
You might want to take a look here (refer to Batching). For bolts that perform more complex operations, such as aggregation over multiple input tuples, you will need to extend BaseRichBolt and manage the anchoring mechanism yourself.
For this you need to declare your own output collector like this:
private OutputCollector outputCollector;
And then initialise it through your override of the prepare method:
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
}
Your execute method for BaseRichBolt only receives a tuple as an argument; you need to implement the logic that maintains the anchors and uses them when emitting.
private final List<Tuple> anchors = new ArrayList<Tuple>();

@Override
public void execute(Tuple tuple) {
    if (!isTupleAggregationComplete(anchors, tuple)) {
        anchors.add(tuple);
        return;
    }
    // do your computations here!
    outputCollector.emit(anchors, new Values(foo, bar, xpto));
    anchors.clear();
}
You should implement isTupleAggregationComplete with the logic that checks whether the bolt has all the information necessary to proceed with the processing.
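For example, a sketch of such a check, assuming the bolt expects one tuple from each of two illustrative upstream components "boltA" and "boltB" (uses java.util.Set, HashSet, Arrays and List):
private boolean isTupleAggregationComplete(List<Tuple> anchors, Tuple current) {
    // collect the producers seen so far, including the tuple that just arrived
    Set<String> seen = new HashSet<>();
    for (Tuple anchored : anchors) {
        seen.add(anchored.getSourceComponent());
    }
    seen.add(current.getSourceComponent());
    // complete once every expected upstream component has contributed a tuple
    return seen.containsAll(Arrays.asList("boltA", "boltB"));
}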

How can I prove fields-grouping functionality (tuples with the same field value go to the same task)?

I'm new to Storm and am getting started with it using the storm-starter project. In this project there is a topology called WordCountTopology; the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of the WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My question is:
The functionality of fields grouping is that tuples with the same field word will go to the same task for further processing. Here "task" means thread; how can I prove this functionality? In addition, in my opinion the logic in the execute method is a little awkward. Within a single task, the parameter tuple is always the same, but the execute method does not reflect this; in other words, the logic does not use this convenience.
Am I clear? My point is that the code here in execute does not take the fields-grouping feature into account; the same code could also be applied in the shuffle-grouping situation.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads; they are responsible for executing the actual processing logic. Each spout or bolt that you implement in your code runs as a number of tasks across the cluster, so you can think of a task as a running instance of a component, i.e. a spout or a bolt.
There is another entity, called an executor, which is the thread responsible for running these tasks. It can run one or multiple tasks of the same component; an executor having multiple tasks means that the same component is executed multiple times by that executor.
Now coming back to your question
the code here in execute does not take the fields-grouping feature into account; the same code could also be applied in the shuffle-grouping situation
Very briefly: a fields grouping lets you partition a stream by a subset of its fields. For a word count, if we partition the stream using fieldsGrouping on a field named 'first_name', then every tuple whose first_name field has a given value, say (Foo), is expected to go to the same task, while the same field with a different value (Bar) goes to another task.
So here the execute method is expected to receive the same field value and can therefore easily update its counter; it does not need to do anything special for that. The whole logic is written on the assumption that the bolt will be fed properly grouped data, and that is why using the proper grouping is so important. If you use shuffleGrouping, the same code will run, but it will produce incorrect data.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
private int boltId;

@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID: " + boltId);
}

Hadoop variable set in reducer and read in driver

How can I set a variable in a reducer that, after its execution, can be read by the driver once all tasks finish? Something like:
class Driver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        ...
        JobClient.runJob(conf); // reducer sets some variable
        String varValue = ...; // variable value is read by driver
    }
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is that you create a group of counters in which you hold only one counter, whose name is the value you wish to return (you ignore the actual counter value). The code looks like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);
// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
You can't amend the configuration in a map/reduce task and expect that change to be persisted to the configurations in other tasks and/or the job client that submitted the job (say you write different values in the reducers - which one 'wins' and is persisted back?).
You can, however, write files to HDFS yourself, which can then be read back when your job returns - no less ugly really, but there isn't a way that doesn't involve another technology (ZooKeeper, HBase, or any other NoSQL / relational store) holding the value between your task ending and you retrieving it after job success. A sketch of the HDFS side-file approach follows.
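A minimal sketch of that alternative, using the old mapred API from the question; the path is illustrative, and only a single reducer (or distinct per-task paths) should write it:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFile {
    // illustrative location for the value handed from the reducer to the driver
    private static final Path RESULT = new Path("/tmp/myjob/result.txt");

    // called from the reducer (it can obtain the JobConf via configure(JobConf))
    public static void write(Configuration conf, String value) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(RESULT, true)) {
            out.writeUTF(value);
        }
    }

    // called from the driver after JobClient.runJob(conf) returns
    public static String read(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(RESULT)) {
            return in.readUTF();
        }
    }
}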
