I have a filtered query that returns about 220 items with 8 fields each and no child nodes. That's not a tiny amount of data, but definitely not a large one either.
In the screenshot below, the following happens:
Second 25 - the user clicks to load data and the event listener is created
Second 31 - the data is completely loaded from the Firebase DB
1 min 47 - onDataChange is called and the data is presented
I've seen a top process in the debugger: FirebaseDatabaseWorker. It looks like it does some processing of the loaded data, but a whole minute at 30% CPU? Is it possible to optimize this?
Thanks!
final DatabaseReference artRef =
        database.getReference("sales/" + store + "/articles");
Query query = artRef.orderByChild("familyId").equalTo(familyId);
query.addListenerForSingleValueEvent(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot snapshot) {
        for (DataSnapshot child : snapshot.getChildren()) {
            Article article = child.getValue(Article.class);
            article.id = Integer.parseInt(child.getKey());
            articles.add(article);
        }
        adapter.notifyDataSetChanged();
    }

    @Override
    public void onCancelled(DatabaseError error) {
        // required by the ValueEventListener interface; omitted in the original snippet
    }
});
Disabling Instant Run in Android Studio brought the time until onDataChange is called down from about 1 minute to roughly 5 seconds.
I'm a newbie Ceph developer and have recently been reading the snapshot code. From pg_pool_t::add_unmanaged_snap, it looks obvious that the first RBD snapshot id should start from 2, but in reality it starts from 4. I wonder whether there is some mechanism in RBD snapshots that increments snap_seq further. Could anyone help me?
Thanks in advance!
Below is the code of pg_pool_t::add_unmanaged_snap.
void pg_pool_t::add_unmanaged_snap(uint64_t& snapid)
{
  ceph_assert(!is_pool_snaps_mode());
  if (snap_seq == 0) {
    // kludge for pre-mimic tracking of pool vs selfmanaged snaps. after
    // mimic this field is not decoded but our flag is set; pre-mimic, we
    // have a non-empty removed_snaps to signify a non-pool-snaps pool.
    removed_snaps.insert(snapid_t(1));
    snap_seq = 1;
  }
  flags |= FLAG_SELFMANAGED_SNAPS;
  snapid = snap_seq = snap_seq + 1;
}
The following screenshot shows rbd snapshot creation on a completely new rbd pool. The snapshot id clearly starts from 4, whereas from the code above a fresh pool (snap_seq == 0) should hand out 2 as the first self-managed snapshot id.
(screenshot: rbd snapshot creation on a totally new rbd pool)
I'm new to Storm. I've just started a Data Architect training course, and it's in this context that I'm facing the problem that brings me here today.
I'm receiving messages from Kafka via a KafkaSpout named CurrentPriceSpout. So far, everything is working. Then, in my CurrentPriceBolt, I re-emit a tuple so that my data is written to Elasticsearch by the EsCurrentPriceBolt. The problem is here: the data is not written to Elasticsearch right away, it is only written when I delete my topology.
Is there a Storm parameter that can force the writing of tuples by retrieving acknowledgments?
I tried adding the option .addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5); the tuples are then written to Elasticsearch but never acknowledged, so Storm replays them indefinitely.
Thanks for your help
Thierry
I managed to find the answer to my problem.
The main problem was that ES is not designed to ingest as little data as this study project generates. By default, ES writes data in batches of 1000 entries. With this project, I generate one record every 30 seconds, i.e. a batch of 1000 every 500 minutes (about 8 h 20 min).
So I reviewed the configuration of my topology in detail and played with the following options:
es.batch.size.entries: 1
es.storm.bolt.flush.entries.size: 1
topology.producer.batch.size: 1
topology.transfer.batch.size: 1
And now it goes like this:
...
...
public class App
{
    ...
    ...
    public static void main( String[] args ) throws AlreadyAliveException, InvalidTopologyException, AuthorizationException
    {
        ...
        ...
        StormTopology topology = topologyBuilder.createTopology(); // I create my Storm topology
        String topologyName = properties.getProperty("storm.topology.name"); // I name my topology
        StormSubmitter.submitTopology(topologyName, getTopologyConfig(properties), topology); // I start my topology on my Storm cluster
        System.out.println( "Topology on remote cluster : Started!" );
    }

    private static Config getTopologyConfig(Properties properties)
    {
        Config stormConfig = new Config();
        stormConfig.put("topology.workers", Integer.parseInt(properties.getProperty("topology.workers")));
        stormConfig.put("topology.enable.message.timeouts", Boolean.parseBoolean(properties.getProperty("topology.enable.message.timeouts")));
        stormConfig.put("topology.message.timeout.secs", Integer.parseInt(properties.getProperty("topology.message.timeout.secs")));
        stormConfig.put("topology.transfer.batch.size", Integer.parseInt(properties.getProperty("topology.transfer.batch.size")));
        stormConfig.put("topology.producer.batch.size", Integer.parseInt(properties.getProperty("topology.producer.batch.size")));
        return stormConfig;
    }
    ...
    ...
    ...
}
And now it works!!!
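For reference, here is a minimal sketch of how the es.* options above can be wired into elasticsearch-hadoop's EsBolt. This assumes the EsBolt constructor that takes a per-bolt configuration map and a recent Storm / elasticsearch-hadoop pairing; the "current-price/price" index target is illustrative, not my real index, and the component names reuse the ones from the question:
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.TopologyBuilder;
import org.elasticsearch.storm.EsBolt;

public class EsBoltWiringSketch {
    public static void attachEsBolt(TopologyBuilder topologyBuilder) {
        Map<String, String> esConf = new HashMap<>();
        esConf.put("es.batch.size.entries", "1");            // flush to Elasticsearch after every entry
        esConf.put("es.storm.bolt.flush.entries.size", "1"); // flush the bolt's local buffer after every tuple

        topologyBuilder.setBolt("EsCurrentPriceBolt",
                new EsBolt("current-price/price", esConf))   // "index/type" target is illustrative
                .shuffleGrouping("CurrentPriceBolt");
    }
}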
We are receiving data in Spark Streaming from Kafka. Once execution starts, Spark Streaming executes only one batch and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with different numbers of executors, cores, back pressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro-batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism, so that no micro-batch is queued, as we have enough resources available. How can we reduce the processing time by fully utilizing the resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe(Collections.singletonList("TOPIC_NAME"),
                sparkServiceConf.getKafkaConsumeParams()));

ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());

JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;
    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});
// Decode each gzip-compressed binary message and generate a JSON string
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;
    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                 GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                 ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[1024];
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});
// publish generated json gzip to kafka
// ('cache' is presumably the cached decodedStream, e.g. decodedStream.cache(); its definition is not shown in this snippet)
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;
    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technologies we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize processing so that no micro-batches remain queued.
I was facing the same issue and tried a few things to resolve it; these are my findings:
First of all, intuition says that one batch should be processed per executor, but in fact only one batch is processed at a time, while its jobs and tasks are processed in parallel.
Processing multiple batches concurrently can be achieved with spark.streaming.concurrentJobs, but it's not documented and still needs a few fixes. One of the problems is saving Kafka offsets: suppose we set this parameter to 4 and 4 batches are processed in parallel; if the 3rd batch finishes before the 4th, which Kafka offsets get committed? This parameter is quite useful if the batches are independent.
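A minimal sketch of how this undocumented property can be set (the application name and batch interval are illustrative):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ConcurrentJobsSketch {
    public static JavaStreamingContext build() {
        SparkConf conf = new SparkConf()
                .setAppName("kafka-streaming-app")           // illustrative name
                .set("spark.streaming.concurrentJobs", "4"); // undocumented: number of output jobs run concurrently
        return new JavaStreamingContext(conf, Durations.seconds(10)); // 10 s batch interval, illustrative
    }
}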
spark.default.parallelism, because of its name, is sometimes assumed to make everything parallel, but its real benefit is in distributed shuffle operations. Try different numbers and find an optimum for your jobs; you can get a considerable difference in processing time, depending on the shuffle operations involved. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach/foreachPartition on the RDD. But I think foreachPartition is better, because foreachPartitionAsync queues up the jobs, so batches appear to be processed while their jobs are still queued or in progress. Maybe I didn't get its usage right, but it behaved the same in my 3 services.
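To make the difference concrete, here is a minimal sketch of both variants inside foreachRDD; processRecord is a hypothetical per-record handler, not code from my services:
import java.util.Iterator;

import org.apache.spark.api.java.JavaFutureAction;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachVariantsSketch {

    // Synchronous: the batch's output job finishes only after every partition is processed.
    static void processSync(JavaDStream<String> stream) {
        stream.foreachRDD((JavaRDD<String> rdd) ->
                rdd.foreachPartition((Iterator<String> partition) -> {
                    while (partition.hasNext()) {
                        processRecord(partition.next());
                    }
                }));
    }

    // Asynchronous: foreachPartitionAsync returns a future immediately,
    // so output jobs can pile up while batches appear to complete.
    static void processAsync(JavaDStream<String> stream) {
        stream.foreachRDD((JavaRDD<String> rdd) -> {
            JavaFutureAction<Void> pending = rdd.foreachPartitionAsync((Iterator<String> partition) -> {
                while (partition.hasNext()) {
                    processRecord(partition.next());
                }
            });
            // not awaited on purpose: this is what lets work queue up behind the scenes
        });
    }

    // Hypothetical per-record handler, only here to make the sketch compile.
    static void processRecord(String record) {
        System.out.println(record);
    }
}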
The FAIR spark.scheduler.mode should be used for jobs with many tasks, as the round-robin assignment of tasks to jobs gives smaller tasks the opportunity to start receiving resources while bigger tasks are still running.
Try to tune your batch duration and input size, and always keep the processing time below the batch duration; otherwise you will build up a long backlog of batches.
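A minimal configuration sketch covering the scheduler mode and the rate/batch-duration tuning above (the property names are standard Spark settings, but every value here is illustrative and needs tuning for your workload):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RateAndSchedulerSketch {
    public static JavaStreamingContext build() {
        SparkConf conf = new SparkConf()
                .setAppName("kafka-streaming-app")                        // illustrative name
                .set("spark.scheduler.mode", "FAIR")                      // round-robin task scheduling across jobs
                .set("spark.streaming.backpressure.enabled", "true")      // let Spark adapt the ingestion rate
                .set("spark.streaming.kafka.maxRatePerPartition", "500"); // cap records/sec/partition (illustrative)

        // Keep the batch interval above the observed processing time, otherwise batches queue up.
        return new JavaStreamingContext(conf, Durations.seconds(10));
    }
}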
These are my findings and suggestions. However, there are many configurations and ways to do streaming, and often one set of settings doesn't work for another workload. Spark Streaming is all about learning, and putting your experience and anticipation together to arrive at an optimum configuration.
Hope it helps. It would be a great relief if someone could tell us specifically how to legitimately process batches in parallel.
We want to achieve maximum parallelism, so that no micro-batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data more slowly than it arrives, it will be queued. Also, don't expect processing of one record to suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36 s to process 2 records and 70 s to process 17 records, so clearly there is some per-batch overhead. If the relationship is linear (solving 36 = a + 2b and 70 = a + 17b gives roughly 31.5 s of fixed overhead plus about 2.3 s per record), it would take only about 4:18 to process all 100 records in a single mini-batch, thus beating your record holder.
Since your code is not complete, it's hard to tell what exactly takes so much time. The transformations in the code look fine, but the action (or subsequent transformations) is probably the real bottleneck. Also, what's with producer.flush(), which isn't mentioned anywhere else in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
// imports needed for Future and its callbacks
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

messages.foreachRDD { rdd =>
  val f = Future {
    // sleep(100)
    val newRDD = rdd.map { message =>
      message.value()
    }
    println("Request messages: " + newRDD.count())
    val resultrows = newRDD.collect() //.collectAsList()
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1
  }
  f.onComplete {
    case Success(result)    => println("yay!")
    case Failure(exception) => println("Oh no!")
  }
}
It's hard to tell without all the details, but general advice for tackling issues like this: start with a very simple, "Hello world"-style application. Just read from the input stream and print the data to a log file. Once this works, you've proved that the problem is in your application, and you gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know the problem is in the configuration or the Spark cluster itself. Hope this helps.
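For illustration, a minimal sketch of such a stripped-down job, reusing the same Kafka direct-stream setup as in the question; the application name, topic and (empty) Kafka parameters are placeholders you would fill in:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class HelloStreamingApp {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("hello-streaming"); // placeholder name
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>(); // fill in bootstrap.servers, group.id, deserializers...

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(Collections.singletonList("TOPIC_NAME"), kafkaParams));

        // No decoding, no Datasets: just prove that records arrive.
        stream.map(record -> record.topic() + " @ offset " + record.offset()).print();

        jssc.start();
        jssc.awaitTermination();
    }
}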
I am running a database download in a background thread. The threads work fine, and I wait on the dispatch group before continuing.
The problem I have is that I need to start an activity indicator, and it seems that because of the group wait it gets blocked.
Is there a way to run such a heavy process, ensure that all threads complete, and still allow the activity indicator to spin?
I start the activity indicator with (I also tried starting the indicator without the dispatch_async):
dispatch_async(dispatch_get_main_queue(), {
    activityIndicator.startAnimating()
})
After which, I start the thread group:
let group: dispatch_group_t = dispatch_group_create()
let queue: dispatch_queue_t = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0) // also tried QOS_CLASS_BACKGROUND

while iter > 0 {
    iter--
    dispatch_group_enter(group)
    dispatch_group_async(group, queue, {
        do {
            print("in queue \(iter)")
            temp += try query.findObjects()
            query.skip += query.limit
        } catch let error as NSError {
            print("Fetch failed: \(error.localizedDescription)")
        }
        dispatch_group_leave(group)
    })
}
// Wait for all threads to finish and proceed
As I am using Parse, I have modified the code as follows (pseudo code for simplicity):
trigger the activity indicator with startAnimating()
call the function that hits Parse
set an observer in the Parse class on an int to trigger an action when the value reaches 0
get the count of new objects in Parse
calculate how many loop iterations I need to pull all the data (using max objects per query = 1000, which is Parse's maximum)
while iterations > 0 {
    create a Parse query object
    set the query skip value
    use query.findObjectsInBackgroundWithBlock({
        pull objects and add them to a temp array
        observer--
    })
    iterations--
}
When the observer hits 0, trigger a delegate to return to the caller.
Works like a charm.
I found a weird issue with MongoItemReader.
This is the configuration I have for a step:
private Step addMnpInformation() throws Exception {
    return stepBuilders.get("addMnpInformation")
            .<StagedCdr, StagedCdr>chunk(50)
            .reader(stagedCdrReader())
            .processor(addMnpProcessor())
            .writer(stagedCdrWriter())
            .build();
}
And this is the reader:
@Bean
public MongoItemReader<StagedCdr> stagedCdrReader() throws Exception {
    MongoItemReader<StagedCdr> reader = new MongoItemReader<StagedCdr>();
    reader.setTemplate(stagingCdrTransactionManager);
    reader.setQuery("{cdrStatus:'PENDING'}");
    reader.setTargetType(StagedCdr.class);
    HashMap<String, Direction> sort = new HashMap<String, Direction>();
    sort.put("sequence", Direction.ASC);
    reader.setSort(sort);
    reader.setName("stagedCdrReader");
    reader.afterPropertiesSet();
    return reader;
}
It seems the reader is not reading every other page. The records I'm processing have a sequential id, and I found that the reader reads records in pages of 50 (probably my chunk size), but like this:
from 0 to 49
from 100 to 149
from 200 to 249
.
.
.
so it's not reading
from 50 to 99
from 150 to 199
How many items are skipped depends on the chunk size; for example with chunk(50), 50 items get skipped. I don't know why!
Robin? Is that you?
I posted the same question on the Spring Forums some time ago and forgot about it.
This is why it's happening:
I discovered that the processor was changing the cdrStatus field to COMPLETED. Because the reader pages with limit and skip over the live query, the records changed on the first page were no longer included in the result set when the next page was requested, so what would have been the 'new' first page was skipped and those records were never retrieved.
I don't know whether this is the expected behavior. You can find the thread here: http://forum.spring.io/forum/spring-projects/batch/129604-problems-with-mongoitemreader
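To illustrate the mechanism with a hedged sketch (this mimics what the reader effectively does with limit/skip via the Spring Data MongoDB API, assuming a recent Spring Data version for PageRequest.of and Sort.by; it is not the reader's actual internals): after page 0 (sequence 0-49) has been flipped to COMPLETED, the PENDING result set already starts at sequence 50, so applying skip(50) for page 1 lands on sequence 100-149 and documents 50-99 are never read.
import static org.springframework.data.mongodb.core.query.Criteria.where;

import java.util.List;

import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;

public class PageSkipIllustration {

    // page 0 -> PENDING docs with sequence 0-49 (the chunk then marks them COMPLETED)
    // page 1 -> skip(50) is applied to the *shrunken* PENDING set (now starting at 50),
    //           so the reader gets sequence 100-149 and never sees 50-99
    static List<StagedCdr> readPage(MongoTemplate template, int page) {
        Query query = new Query(where("cdrStatus").is("PENDING"))
                .with(PageRequest.of(page, 50, Sort.by(Sort.Direction.ASC, "sequence")));
        return template.find(query, StagedCdr.class);
    }
}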