Kafka Streams app can't get its own local state store value - apache-kafka-streams

I'm a newcomer to Kafka and am playing with WordCount.java, which is materialized to a state store named "counts-store".
I created a topic streams-plaintext-input with 2 partitions and am running 2 WordCount apps on the same machine, say App1 and App2.
The issue: App1 processed the word but its local state store returns null, while App2 is able to get the count value from its local state store. In other words, one app's data ends up in its peer's state store.
Here is the producer:
kafka-console-producer.bat --bootstrap-server localhost:9092 --topic streams-plaintext-input
A log line is added so I can tell whether App1 or App2 processed the word:
KStream<String, String> words = source.flatMapValues(value -> {
    System.out.println(Thread.currentThread().getName() + " text line splitted [" + value + "]");
    return Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+"));
});
So far so good: App1 and App2 process incoming words alternately.
Then I add state store query code to the same class:
for (int i = 0; i < 1000; i++) {
    TimeUnit.SECONDS.sleep(10);
    String qstr = "new3";
    ReadOnlyKeyValueStore<String, Long> keyValueStore =
        streamsClient.store(StoreQueryParameters.fromNameAndType(StateStoreName, QueryableStoreTypes.keyValueStore()));
    System.out.println(new Date().getTime() + " count for " + qstr + ":" + keyValueStore.get(qstr));
}
Now I enter new3 in the producer console, and here is the interesting log I got:
App1:
streams-wordcount-2p-209b88a2-7f7e-4533-8478-709a01f9ae1e-StreamThread-1 text line splitted [new3]
1666998201324 count for new3:null
1666998211335 count for new3:null
App2, which did NOT process the word:
1666998204348 count for new3:1
1666998204348 count for new3:1
App1 processed new3, so its state store should have the count value; why does the count appear in App2 instead?
In other words, one app may be unable to get a value it just processed from its own local state store. What is the idea behind this? Does it mean we have to make every state store queryable (Interactive Queries)? That would affect the programming model.
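For reference, Interactive Queries also expose per-key metadata that tells an instance whether it even owns a key; below is a minimal sketch, assuming Kafka Streams 2.7+ and that each instance sets the application.server config (neither is part of the setup above):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;

class OwnerLookup {
    // Sketch: report which instance owns the given key for the given store.
    // Keys are hashed to partitions, so "new3" lives in exactly one instance's
    // local state store; the other instance will always read null locally.
    static void printOwner(KafkaStreams streamsClient, String storeName, String key) {
        KeyQueryMetadata metadata =
                streamsClient.queryMetadataForKey(storeName, key, Serdes.String().serializer());
        // activeHost() is only meaningful if each instance sets "application.server",
        // e.g. "localhost:7001" vs "localhost:7002" -- an assumption in this sketch.
        System.out.println("key " + key + " is owned by " + metadata.activeHost());
        // Query the local store only if this instance is the active host; otherwise
        // forward the request (typically over HTTP/RPC) to metadata.activeHost().
    }
}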
please advise, thank you very much!

Related

Why is my Storm topology not acking when I send tuples to ElasticSearch

I'm new to using Storm; I've just started a Data Architect training course, and it's in this context that I'm facing the problem that brings me here today.
I'm receiving messages from Kafka via a KafkaSpout named CurrentPriceSpout. So far, everything is working. Then, in my CurrentPriceBolt, I re-emit a tuple so that my data is written to ElasticSearch using the EsCurrentPriceBolt. The problem is here: I can't write my data into ElasticSearch directly; it is only written when I delete my topology.
Is there a Storm parameter that can force the writing of tuples by retrieving acknowledgments?
I tried adding the option ".addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5)": the tuples are then written to ElasticSearch, but they are not acknowledged, so Storm rewrites them indefinitely.
Thanks for your help
Thierry
I managed to find the answer to my problem.
The main problem was that ES is not designed to ingest as little data as a study project generates. By default, ES writes data in batches of 1000 entries. In this project I generate one record every 30 seconds, so a batch of 1000 only fills up every 500 minutes (about 8h20).
So I reviewed the configuration of my topology in detail and played with the following options:
es.batch.size.entries: 1
es.storm.bolt.flush.entries.size: 1
topology.producer.batch.size: 1
topology.transfer.batch.size: 1
And now it goes like this:
...
...
public class App
{
    ...
    ...
    public static void main( String[] args ) throws AlreadyAliveException, InvalidTopologyException, AuthorizationException
    {
        ...
        ...
        StormTopology topology = topologyBuilder.createTopology();           // I create my Storm topology
        String topologyName = properties.getProperty("storm.topology.name"); // I name my topology
        StormSubmitter.submitTopology(topologyName, getTopologyConfig(properties), topology); // I start my topology on my Storm cluster
        System.out.println("Topology on remote cluster : Started!");
    }

    private static Config getTopologyConfig(Properties properties)
    {
        Config stormConfig = new Config();
        stormConfig.put("topology.workers", Integer.parseInt(properties.getProperty("topology.workers")));
        stormConfig.put("topology.enable.message.timeouts", Boolean.parseBoolean(properties.getProperty("topology.enable.message.timeouts")));
        stormConfig.put("topology.message.timeout.secs", Integer.parseInt(properties.getProperty("topology.message.timeout.secs")));
        stormConfig.put("topology.transfer.batch.size", Integer.parseInt(properties.getProperty("topology.transfer.batch.size")));
        stormConfig.put("topology.producer.batch.size", Integer.parseInt(properties.getProperty("topology.producer.batch.size")));
        return stormConfig;
    }
    ...
    ...
    ...
}
And now it works!!!
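For completeness, the es.* entries above are per-bolt settings; here is a minimal sketch of how they might be passed to the ElasticSearch bolt, assuming the elasticsearch-hadoop EsBolt and Storm 1.x+ packages (the index name and component names are placeholders, not taken from the project):

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.TopologyBuilder;
import org.elasticsearch.storm.EsBolt;

public class EsBoltWiringSketch {
    // Sketch only: pass the per-bolt elasticsearch-hadoop settings when wiring the topology.
    static TopologyBuilder wireEsBolt() {
        Map<String, String> esConf = new HashMap<>();
        esConf.put("es.batch.size.entries", "1");            // flush every single document
        esConf.put("es.storm.bolt.flush.entries.size", "1"); // bolt-level flush size

        TopologyBuilder builder = new TopologyBuilder();
        // Spout and upstream bolt declarations elided; "es-price/current" is a made-up index.
        builder.setBolt("EsCurrentPriceBolt", new EsBolt("es-price/current", esConf), 1)
               .shuffleGrouping("CurrentPriceBolt");
        return builder;
    }
}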

How to stabilize a Spark Streaming application with a handful of super-big sessions?

I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID. I then perform some analytics at the session level to find an anomaly score.
I couldn't stabilize my application because a handful of sessions keep getting bigger at each batch time for an extended period (more than 1h). My understanding is that a single session (key-value pair) is always processed by a single core in Spark. I want to know if I am mistaken, and if there is a solution to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on YARN. Changing the batch time, blocking interval, number of partitions, number of executors, and executor resources didn't solve the issue, as one single task always makes the application choke. However, filtering out those super-long sessions did solve the issue.
Below is the updateState function I am using:
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String, Any]], state: State[Seq[scala.collection.Map[String, Any]]]) => {
  val session = Seq(value.getOrElse(scala.collection.Map[String, Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String, Any]]())
  if (state.isTimingOut()) {
    Option(null)
  } else {
    state.update(session)
    Some((key, value, session))
  }
}
and the mapWithState call:
def updateStreamingState(inputDstream: DStream[scala.collection.Map[String, Any]]): DStream[(String, Option[scala.collection.Map[String, Any]], Seq[scala.collection.Map[String, Any]])] = {
  val spec = StateSpec.function(updateState)
  spec.timeout(Duration(sessionTimeout))
  spec.numPartitions(192)
  inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally, I compute features for each session of the DStream, as defined below:
def computeSessionFeatures(sessionId: String, sessionRecords: Seq[scala.collection.Map[String, Any]]): Session = {
  val features = Functions.getSessionFeatures(sessionizationFeatures, recordFeatures, sessionRecords)
  val resultSession = new Session(sessionId, sessionizationFieldName, sessionRecords)
  resultSession.features = features
  resultSession
}

Spark Streaming with multiple Kafka streams

I create Kafka streams with the following code:
val streams = (1 to 5) map { i =>
  KafkaUtils.createStream[....](
    streamingContext,
    Map( .... ),
    Map(topic -> numOfPartitions),
    StorageLevel.MEMORY_AND_DISK_SER
  ).filter(...)
   .mapPartitions(...)
   .reduceByKey(....)
}
val unifiedStream = streamingContext.union(streams)
unifiedStream.foreachRDD(...)
streamingContext.start()
I give each stream a different group id. When I run the application, only part of the Kafka messages are received and the executor is stuck at the foreachRDD call. If I only create one stream, everything works well. There aren't any exceptions in the logs.
I don't know why the application is stuck there. Does it mean there are not enough resources?
You may want to try setting the parameter
new SparkConf().set("spark.streaming.concurrentJobs", "5")

Akka actors and clustering: I'm having trouble with ClusterSingletonManager - unhandled event in state Start

I've got a system that uses Akka 2.2.4 which creates a bunch of local actors and sets them as the routees of a Broadcast Router. Each worker handles some segment of the total work, according to some hash range we pass it. It works great.
Now, I've got to cluster this application for failover. Based on the requirement that only one worker per hash range exist/be triggered on the cluster, it seems to me that setting up each one as a ClusterSingletonManager would make sense; however, I'm having trouble getting it working. The actor system starts up, it creates the ClusterSingletonManager, it adds the path in the code cited below to a Broadcast Router, but it never instantiates my actual worker actor to handle my messages for some reason. All I get is a log message: "unhandled event ${my message} in state Start". What am I doing wrong? Is there something else I need to do to start up this single-instance cluster? Am I sending the wrong actor a message?
Here's my Akka config (I use the default config as a fallback):
akka {
  cluster {
    roles = ["workerSystem"]
    min-nr-of-members = 1
    role {
      workerSystem.min-nr-of-members = 1
    }
  }
  daemonic = true
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "127.0.0.1"
      port = ${akkaPort}
    }
  }
  actor {
    provider = akka.cluster.ClusterActorRefProvider
    single-message-bound-mailbox {
      # FQCN of the MailboxType. The Class of the FQCN must have a public
      # constructor with
      # (akka.actor.ActorSystem.Settings, com.typesafe.config.Config) parameters.
      mailbox-type = "akka.dispatch.BoundedMailbox"
      # If the mailbox is bounded then it uses this setting to determine its
      # capacity. The provided value must be positive.
      # NOTICE:
      # Up to version 2.1 the mailbox type was determined based on this setting;
      # this is no longer the case, the type must explicitly be a bounded mailbox.
      mailbox-capacity = 1
      # If the mailbox is bounded then this is the timeout for enqueueing
      # in case the mailbox is full. Negative values signify infinite
      # timeout, which should be avoided as it bears the risk of dead-lock.
      mailbox-push-timeout-time = 1
    }
    worker-dispatcher {
      type = PinnedDispatcher
      executor = "thread-pool-executor"
      # Throughput defines the number of messages that are processed in a batch
      # before the thread is returned to the pool. Set to 1 for as fair as possible.
      throughput = 500
      thread-pool-executor {
        # Keep alive time for threads
        keep-alive-time = 60s
        # Min number of threads to cap factor-based core number to
        core-pool-size-min = ${workerCount}
        # The core pool size factor is used to determine thread pool core size
        # using the following formula: ceil(available processors * factor).
        # Resulting size is then bounded by the core-pool-size-min and
        # core-pool-size-max values.
        core-pool-size-factor = 3.0
        # Max number of threads to cap factor-based number to
        core-pool-size-max = 64
        # Minimum number of threads to cap factor-based max number to
        # (if using a bounded task queue)
        max-pool-size-min = ${workerCount}
        # Max no of threads (if using a bounded task queue) is determined by
        # calculating: ceil(available processors * factor)
        max-pool-size-factor = 3.0
        # Max number of threads to cap factor-based max number to
        # (if using a bounded task queue)
        max-pool-size-max = 64
        # Specifies the bounded capacity of the task queue (< 1 == unbounded)
        task-queue-size = -1
        # Specifies which type of task queue will be used, can be "array" or
        # "linked" (default)
        task-queue-type = "linked"
        # Allow core threads to time out
        allow-core-timeout = on
      }
      fork-join-executor {
        # Min number of threads to cap factor-based parallelism number to
        parallelism-min = 1
        # The parallelism factor is used to determine thread pool size using the
        # following formula: ceil(available processors * factor). Resulting size
        # is then bounded by the parallelism-min and parallelism-max values.
        parallelism-factor = 3.0
        # Max number of threads to cap factor-based parallelism number to
        parallelism-max = 1
      }
    }
  }
}
Here's where I create my actors (it's written in Groovy):
Props clusteredProps = ClusterSingletonManager.defaultProps("worker".toString(), PoisonPill.getInstance(), "workerSystem",
    new ClusterSingletonPropsFactory() {
        @Override
        Props create(Object handOverData) {
            log.info("called in ClusterSingletonManager")
            Props.create(WorkerActorCreator.create(applicationContext, it.start, it.end)).withDispatcher("akka.actor.worker-dispatcher").withMailbox("akka.actor.single-message-bound-mailbox")
        }
    })
ActorRef manager = system.actorOf(clusteredProps, "worker-${it.start}-${it.end}".toString())
String path = manager.path().child("worker").toString()
path
When I try to send a message to the actual worker actor, should the path above resolve? Currently it does not.
What am I doing wrong? Also, these actors live within a Spring application, and the worker actors are set up with some @Autowired dependencies. While this Spring integration worked well in a non-clustered environment, are there any gotchas in a clustered environment I should be looking out for?
Thank you.
FYI: I've also posted this in the akka-user Google group. Here's the link.
The path in your code is to the ClusterSingletonManager actor that you start on each node with role "workerSystem". It will create a child actor (WorkerActor) with the name "worker-${it.start}-${it.end}" on the oldest node in the cluster, i.e. a singleton within the cluster.
You should also define the name of the ClusterSingletonManager, e.g. system.actorOf(clusteredProps, "workerSingletonManager").
You can't send the messages to the ClusterSingletonManager. You must send them to the path of the active worker, i.e. including the address of the oldest node. That is illustrated by the ConsumerProxy in the documentation.
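For illustration, a minimal sketch of such a send using the Java API, with a made-up system name, address, and hash range; in a real application the oldest node's address would be tracked via cluster membership events rather than hard-coded:

import akka.actor.ActorRef;
import akka.actor.ActorSelection;
import akka.actor.ActorSystem;

public class SingletonClientSketch {
    // Sketch: address the active singleton child ("worker") under the manager
    // that was started as "worker-0-100" (an example hash range, not from the project).
    static void sendToActiveWorker(ActorSystem system, Object message) {
        String oldestNode = "akka.tcp://MyClusterSystem@127.0.0.1:2551"; // assumed address of the oldest node
        ActorSelection activeWorker =
                system.actorSelection(oldestNode + "/user/worker-0-100/worker");
        activeWorker.tell(message, ActorRef.noSender());
    }
}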
I'm not sure you should use a singleton at all for this. All workers will be running on the same node, the oldest. I would prefer to discuss alternative solutions to your problem at the akka-user google group.

JMeter, threads using a dynamically incremented value

I have created a JMeter functional test that essentially:
creates a user;
logs in with the user;
deletes the user.
Now, I need to be able to thread this, and dynamically generate a username with a default prefix and a numerically incremented suffix (i.e. TestUser_1, TestUser_2, etc.).
I used the counter, and things were working fine until I really punched up the number of threads/loops. When I did this, I got a conflict with the counter: some threads were trying to read the counter, but it had already been incremented by another thread. This resulted in trying to delete a user that had just been created, and then trying to log in with a user that had just been deleted.
The project is set up like this:
Test Plan
    Thread group
        Counter
        User Defined Variables
        Samplers
I was hoping to solve this problem by using the counter to append a number to the user defined variables upon thread execution, but the counter cannot be accessed in the user defined variables.
Any ideas on how I can solve this problem?
Thank you in advance.
I've used the following scheme successfully with any number of test users:
1. Using a BeanShell script (e.g. in a BeanShell Sampler), generate a CSV file with test-user details, for example:
testUserName001,testPwd001
testUserName002,testPwd002
. . .
testUserName00N,testPwd00N
with the number of entries you need for the test run.
This is done once per "N users test run", in a separate Thread Group, in a setUp Thread Group, or maybe even in a separate jmx script; it makes no difference.
You can find working BeanShell code below.
2. Create your test users IN THE TEST APPLICATION using the previously created users list.
If you don't need to create them in the application, you may skip this step.
Thread Group
    Number of Threads = 1
    Loop Count = 1
    . . .
    While Controller
        Condition = ${__javaScript("${newUserName}"!="",)}    // this will repeat until EOF
        CSV Data Set Config
            Filename = ${__property(user.dir)}${__BeanShell(File.separator,)}${__P(users-list,)}    // path to the generated users list
            Variable Names = newUserName,newUserPwd    // test-user details are read from the file into these variables
            Delimiter = '
            Recycle on EOF? = False
            Stop thread on EOF? = True
            Sharing Mode = Current thread group
        [CREATE TEST USERS LOGIC HERE]    // here are the actions that create a separate user in the application
    . . .
3. Perform multi-user logic.
The same schema as above, but the Thread Group is executed not with 1 but with N threads.
Thread Group
    Number of Threads = ${__P(usersCount,)}    // set the number of users you need to test
    Ramp-Up Period = ${__P(rampUpPeriod,)}
    Loop Count = X
    . . .
    While Controller
        Condition = ${__javaScript("${newUserName}"!="",)}    // this will repeat until EOF
        CSV Data Set Config
            Filename = ${__property(user.dir)}${__BeanShell(File.separator,)}${__P(users-list,)}    // path to the generated users list
            Variable Names = newUserName,newUserPwd    // test-user details are read from the file into these variables
            Delimiter = '
            Recycle on EOF? = False
            Stop thread on EOF? = True
            Sharing Mode = Current thread group
        [TEST LOGIC HERE]    // here are the test actions
    . . .
The key idea is in using Thread Group + While Controller + CSV Data Set Config combination:
3.1. CSV Data Set Config reads the details of each test user from the generated file:
. . . a. only once - because of "Stop thread on EOF? = True";
. . . b. without locking the file against further access (e.g. from other thread groups, if there are any) - because of "Sharing Mode = Current thread group";
. . . c. into the pointed variables - "Variable Names = newUserName,newUserPwd" - which you will use in further test actions;
3.2. The While Controller forces the CSV Data Set Config to read all the entries from the generated file - because of the defined condition (until EOF).
3.3. The Thread Group will start all the threads with the defined ramp-up - or simultaneously if the ramp-up is 0.
You can get a template script for the described schema here: multiuser.jmx.
The BeanShell script that generates the test-user details looks like the code below and takes the following args:
- test-user count
- test-user name template ("TestUser_" in your case)
- test-user name format (e.g. 0 to get TestUser_1, 00 to get TestUser_01, 000 for TestUser_001, etc.; you can simply hardcode this or leave it out entirely)
- name of the generated file.
import java.text.*;
import java.io.*;
import java.util.*;

String[] params = Parameters.split(",");
int count = Integer.valueOf(params[0]);
String testName = params[1];
String nameFormat = params[2];
String usersList = params[3];

StringBuilder contents = new StringBuilder();
try {
    DecimalFormat formatter = new DecimalFormat(nameFormat);
    FileOutputStream fos = new FileOutputStream(System.getProperty("user.dir") + File.separator + usersList);
    for (int i = 1; i <= count; i++) {
        String s = formatter.format(i);
        String testUser = testName + s;
        contents.append(testUser).append(",").append(testUser);
        if (i < count) {
            contents.append("\n");
        }
    }
    byte[] buffer = contents.toString().getBytes();
    fos.write(buffer);
    fos.close();
}
catch (Exception ex) {
    IsSuccess = false;
    log.error(ex.getMessage());
    System.err.println(ex.getMessage());
}
catch (Throwable thex) {
    System.err.println(thex.getMessage());
}
All together it will look like:
Sorry if the answer is overloaded.
Hope this helps.
The "User Defined Variables" config element does not pick up the reference variable from the "Counter" config element. I think this is a bug in JMeter. I have verified this behavior in version 3.2.
I added a "BeanShell Sampler" element to work around the issue.
Notice that the reference name of the "Counter" element is INDEX
The RUN_FOLDER variable gets set to a combination of the TESTS_FOLDER variable and the INDEX variable in the "BeanShell Sampler"
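A minimal sketch of what that BeanShell Sampler might contain (the variable names come from this example; the exact concatenation is an assumption):

// vars is the JMeterVariables object the BeanShell Sampler provides.
String testsFolder = vars.get("TESTS_FOLDER");
String index = vars.get("INDEX");   // reference name of the Counter element

// Expose the combined value as RUN_FOLDER for later samplers.
vars.put("RUN_FOLDER", testsFolder + index);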
The "Debug Sampler" simply gathers a snapshot of the variables so I can see them in the "View Results Tree" listener element. Notice how the RUN_FOLDER variable has the INDEX variable value (5 in this case) appended.
