I have a Spark Streaming application running with 4 executors, and I am looking for cache logic that is local to each executor. This cache will be used as a lookup and will refresh every 20 minutes. Currently I have tried the Python cachetools decorator, but I am looking for other solutions.
I have a Spring worker application that gets messages from RabbitMQ with concurrency 50. For each message, the application checks variables in the DB against the message and inserts the resulting differences into the DB (to compute the difference for one message we send about 5-20 SELECT requests, 1-5 INSERTs, and 1-5 UPDATEs).
The problem is that the Spring worker application writes to the DB very slowly when the concurrency is set to 200 (about 200k messages inserted over roughly two days).
Besides this, I have another Spring application for monitoring, and everything is working very slowly: the DB, the worker app, and the monitoring app.
How can I speed this up and optimize it? Should I use a Postgres cluster, or can I implement it in another way?
My Postgres server: Intel Xeon 10 cores, 60 GB RAM, 1.6 TB SSD.
You can use a shared cache: load your data on application startup and update the cache on every update/create operation. Read whatever you need from the cache; there is no need to go to the DB.
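A minimal sketch of that idea (class and method names here are illustrative, not from your application): load the lookup data once at startup into an in-memory map, keep it in sync on every insert/update, and read from it instead of issuing SELECTs.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative application-level cache: preloaded once at startup,
    // updated on every create/update, read without touching the DB.
    public class VariableCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        // Call once on application startup with everything read from the DB.
        public void preload(Map<String, String> rowsFromDb) {
            cache.putAll(rowsFromDb);
        }

        // Call whenever the worker inserts or updates a row in the DB.
        public void put(String key, String value) {
            cache.put(key, value);
        }

        // Lookup used while processing a message; replaces the SELECTs.
        public String get(String key) {
            return cache.get(key);
        }
    }

With 50-200 concurrent consumers, a ConcurrentHashMap keeps the reads lock-free, so the remaining database load is essentially just the inserts and updates.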
I am developing a microservice. This MS will be deployed in Docker containers and orchestrated by Kubernetes. I have to implement a caching solution using the Hazelcast distributed cache. My requirements are:
Preload the cache on startup of this microservice. For around 3000 stores I have to fetch two specific attributes and cache them.
Refresh the cache every 24 hours.
I implemented a Spring @EventListener that, on startup, makes a database call for the 2 attributes and does a @CachePut to store them in the cache.
I also have a Spring scheduler with a cron expression to refresh the cache every day at 6 AM.
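Roughly, the setup looks like the sketch below (simplified here to write through Spring's CacheManager directly; Store, StoreRepository, and the cache name storeAttributes are placeholders):

    import org.springframework.boot.context.event.ApplicationReadyEvent;
    import org.springframework.cache.Cache;
    import org.springframework.cache.CacheManager;
    import org.springframework.context.event.EventListener;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class StoreCacheLoader {

        private final StoreRepository storeRepository;   // placeholder repository
        private final CacheManager cacheManager;         // backed by Hazelcast

        public StoreCacheLoader(StoreRepository storeRepository, CacheManager cacheManager) {
            this.storeRepository = storeRepository;
            this.cacheManager = cacheManager;
        }

        // Preload when this instance has started.
        @EventListener(ApplicationReadyEvent.class)
        public void preloadOnStartup() {
            refreshCache();
        }

        // Refresh every day at 6 AM.
        @Scheduled(cron = "0 0 6 * * *")
        public void refreshCache() {
            Cache cache = cacheManager.getCache("storeAttributes");
            for (Store store : storeRepository.findAll()) {          // ~3000 stores
                cache.put(store.getId(), store.getAttributes());     // the two attributes
            }
        }
    }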
So far so good.
But what I did not realize is that in a clustered environment, 10-15 instances of my microservice will be running and will try to do the above 2 steps almost simultaneously, creating a stampede effect on my database and cache. Does anyone know what to do in this scenario? Is there any good design, or even an average one, that I can follow?
Thanks.
You should look at Hazelcast's "Loading and Storing Persistent Data" mechanism, which offers two options for writing (write-through and write-behind) and read-through for loading data into the cache.
Look at MapLoader and its methods; it lets you warm up/preload your cluster, and you are free to do that with your own implementation.
Check for more details: https://docs.hazelcast.org/docs/3.11/manual/html-single/index.html#loading-and-storing-persistent-data
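A rough sketch of such a MapLoader (for Hazelcast 3.x; the key/value types and DB-access methods are placeholders). Hazelcast itself calls these methods, so the warm-up is driven by the cluster rather than by each of your 10-15 microservice instances hitting the database:

    import com.hazelcast.core.MapLoader;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    public class StoreAttributeLoader implements MapLoader<String, String> {

        @Override
        public String load(String storeId) {
            // Read-through: called on a cache miss for a single key.
            return fetchAttributesFromDb(storeId);
        }

        @Override
        public Map<String, String> loadAll(Collection<String> storeIds) {
            Map<String, String> result = new HashMap<>();
            for (String id : storeIds) {
                result.put(id, fetchAttributesFromDb(id));
            }
            return result;
        }

        @Override
        public Iterable<String> loadAllKeys() {
            // Returning all keys here is what triggers the warm-up/preload.
            return fetchAllStoreIdsFromDb();
        }

        // Placeholder DB-access methods.
        private String fetchAttributesFromDb(String storeId) { return null; }
        private Iterable<String> fetchAllStoreIdsFromDb() { return null; }
    }

The loader is then registered in the map's map-store configuration (with write-through or write-behind as needed) instead of being called from your application code.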
I'm using Quartz 1.8.6 in clustered mode with 4 instances. Now, I observe high contention on the QRTZ_LOCKS table. My application also provides web services for online clients, and these web services also schedule new jobs. Now I see timeout exceptions on those web services, because when they want to schedule a new job they wait too long to obtain a lock on the QRTZ_LOCKS table. It's important for me to establish 100% reliable operation for the web services (more important than the Quartz job operations). Is it possible to run the Quartz job runner on 1 instance only and configure the other 3 instances with org.quartz.jobStore.isClustered=false, so that they can perform scheduling WITHOUT acquiring a lock on QRTZ_LOCKS?
Update: actually, if I plan to run the job runner on only one instance and the others are only allowed to add new jobs, this won't be a cluster anymore. So the actual question would be: is it possible to set org.quartz.jobStore.isClustered=false on all 4 instances, have only 1 instance run jobs, but allow all 4 to schedule new jobs in the same JDBC store?
Try turning batch mode on and setting the maximum batch count to the number of threads available to the Quartz scheduler.
http://www.ebaytechblog.com/2016/01/14/performance-tuning-on-quartz-scheduler/
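For reference, a minimal sketch of those settings (note: batch trigger acquisition was introduced in Quartz 2.x, so this assumes an upgrade from 1.8.6; the JDBC job-store and clustering settings from your existing configuration are omitted here):

    import java.util.Properties;
    import org.quartz.Scheduler;
    import org.quartz.SchedulerException;
    import org.quartz.impl.StdSchedulerFactory;

    public class BatchSchedulerFactory {
        public static Scheduler create() throws SchedulerException {
            Properties props = new Properties();
            props.setProperty("org.quartz.scheduler.instanceName", "BatchScheduler");
            props.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
            props.setProperty("org.quartz.threadPool.threadCount", "10");
            // Acquire up to <threadCount> triggers per trip to the lock table
            // instead of one at a time, reducing contention on QRTZ_LOCKS.
            props.setProperty("org.quartz.scheduler.batchTriggerAcquisitionMaxCount", "10");
            // Optionally allow triggers to fire slightly early so batches fill up.
            props.setProperty("org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow", "1000");
            return new StdSchedulerFactory(props).getScheduler();
        }
    }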
Since Spark runs in memory, what does resource allocation mean in Spark when running on YARN, and how does it contrast with Hadoop's container allocation?
Just curious, since Hadoop's data and computations are on disk whereas Spark is in memory.
Hadoop is a framework capable of processing large volumes of data. It has two layers: a distributed file system layer called HDFS and a distributed processing layer. In Hadoop 2.x, the processing layer is architected generically so that it can also be used for non-MapReduce applications.
To run any process we need system resources such as memory, network, disk, and CPU. The term container was introduced in Hadoop 2.x; in Hadoop 1.x the equivalent term was slot. A container is an allocation, or share, of memory and CPU. YARN is a general resource management framework that enables efficient utilization of the resources in the cluster nodes through proper allocation and sharing.
In-memory processing means the data is loaded entirely into memory and processed without writing intermediate data to disk. This is faster because the computation happens in memory without many disk I/O operations, but it needs more memory because the entire dataset is loaded into memory.
Batch processing means the data is taken and processed in batches; intermediate results are stored on disk and supplied to the next stage. This also needs memory and CPU for processing, but less than a fully in-memory processing system.
YARN's resource manager acts as the central resource allocator for applications such as MapReduce, Impala (with Llama), Spark (in YARN mode), etc. When we trigger a job, it asks the resource manager for the resources required for execution, and the resource manager allocates them based on availability. The resources are allocated in the form of containers; a container is just an allocation of memory and CPU. One job may need multiple containers, which are allocated across the cluster depending on availability, and the tasks are executed inside the containers.
For example, when we submit a MapReduce job, an MR application master is launched and negotiates with the resource manager for additional resources. Map and reduce tasks are spawned in the allocated containers.
Similarly, when we submit a Spark job (YARN mode), a Spark application master is launched and negotiates with the resource manager for additional resources. The executors are launched in the allocated containers, and the RDD partitions are processed inside them.
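To make the Spark case concrete, here is a minimal sketch of the per-executor resources a job asks YARN for (the values are illustrative; the master would typically be set to yarn via spark-submit):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class YarnResourceDemo {
        public static void main(String[] args) {
            // Each executor runs inside a YARN container with the memory and
            // vcores requested below; the application master negotiates these
            // with the resource manager.
            SparkConf conf = new SparkConf()
                    .setAppName("yarn-container-demo")
                    .set("spark.executor.instances", "4")  // number of executor containers
                    .set("spark.executor.memory", "2g")    // memory per container
                    .set("spark.executor.cores", "2");     // vcores per container
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... define and run RDD operations here ...
            sc.stop();
        }
    }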
We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (why is this?). Do we therefore need multiple Spark clusters, or allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!
In my experience, different streams have different tuning requirements: throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that multiplicity, we need to configure each Spark Streaming job to address its specific needs: not only the batch interval but also resources like memory and CPU, data partitioning, and the number of executing nodes (when the loads are network bound).
It follows that each Spark Streaming job becomes a separate deployment on the Spark cluster. That also allows the separate pipelines to be monitored and managed independently of each other and helps in further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.
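Concretely, since the batch interval is fixed per StreamingContext, each separately deployed job creates its own context with an interval suited to its SLA. A minimal sketch (the pipeline name and interval are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class PipelineA {
        public static void main(String[] args) throws InterruptedException {
            // Deployed as its own application, this pipeline picks its own
            // batch interval (and memory/cores) independently of the others.
            SparkConf conf = new SparkConf().setAppName("etl-pipeline-a");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
            // ... attach this pipeline's source, transformations and sink here ...
            ssc.start();
            ssc.awaitTermination();
        }
    }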