Performance of Resque jobs - heroku

My Resque job basically takes a params hash and stores it in the DB. In the process it does several reads and writes.
These reads and writes take approx. 5ms in total on my local machine and a little more on Heroku (I guess because of the shared DB).
However, the rate at which the queue is processed is very low: about 2-3 jobs per second. What could be causing this?
Thank you.

For every job, a worker has to check for a new job, lock it, do the job, mark it as completed, and then look for the next one.
You might find that the negotiation to get a new job, accessing Redis, etc. is causing a lot of overhead. If your task is only 5ms long, it can probably live inside the request-response cycle. Background jobs are great when running a task would extend the response time considerably; very small jobs generally aren't worth the effort involved.
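To put a rough number on that overhead, here is a minimal Python sketch of the arithmetic. It assumes a single worker that handles jobs strictly one after another, which the question does not confirm, and the function names are made up for illustration.

```python
def jobs_per_second(work_ms: float, overhead_ms: float) -> float:
    """Throughput of one worker that handles jobs strictly one at a time."""
    return 1000.0 / (work_ms + overhead_ms)


def implied_overhead_ms(observed_jobs_per_second: float, work_ms: float) -> float:
    """Per-job time that is NOT spent in the job body (polling, locking, bookkeeping)."""
    return 1000.0 / observed_jobs_per_second - work_ms


# Numbers from the question: about 5 ms of work, 2-3 jobs per second observed.
# Assumption: one worker, no parallelism.
for rate in (2.0, 3.0):
    overhead = implied_overhead_ms(rate, 5.0)
    print(f"{rate:.0f} jobs/s -> about {overhead:.0f} ms of overhead per job")
```

Under that assumption, hundreds of milliseconds per job go to something other than the 5ms of actual work, so the time around the job is the first thing to measure.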

Related

Can I use the same Redis instance for task queue and cache?

I've read responses to a couple of similar questions on Stack Overflow, and although it seems like sharing a single instance for two purposes is fine, I would like to know the potential downsides.
My main concern is the cache filling up the memory and slowing down or breaking the task queue. Is this possible? I use caching heavily, so should I be worried about this scenario?
Theoretically, you can use the same Redis instance for task queue and caching.
There are some downsides:
Longer query time
High memory usage
High CPU usage
Backup
Any fail-safe task queue makes a lot of Redis calls to move a task from one data structure to another and to perform other actions. Check how many Redis calls per second your task queue makes for 1 queue and for N queues. If the number of Redis queries is proportional to the number of queues, check whether your Redis server can handle that many requests.
Since you're using the same Redis instance for the task queue and the cache, the number of entries in your cache could become very large; make sure it doesn't grow beyond the instance's memory limit. Losing cache data is fine, but you should not lose task queue data.
The large number of queries will increase CPU utilization. Hopefully it won't reach 90% or so, but watch for CPU spikes.
Because the same Redis server holds the task queue, you should enable backups for it so that you can restore tasks from a backup. Note that a backup will likely cover the whole dataset, not only the task queues.
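For the memory point, a small check along these lines could warn you before the cache crowds out the queue. This is only a sketch in Python: it works on a plain dict shaped like the memory section of Redis `INFO` (which reports `used_memory` and `maxmemory` in bytes), and the helper names, the 0.8 threshold, and the sample numbers are all assumptions for illustration.

```python
from typing import Mapping, Optional


def memory_used_fraction(info: Mapping[str, int]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no memory limit is set.

    `info` is expected to look like the memory section of Redis INFO;
    Redis reports maxmemory as 0 when no limit is configured.
    """
    limit = info.get("maxmemory", 0)
    if not limit:
        return None
    return info["used_memory"] / limit


def should_alert(info: Mapping[str, int], threshold: float = 0.8) -> bool:
    """True when memory use has reached the (arbitrary) alert threshold."""
    fraction = memory_used_fraction(info)
    return fraction is not None and fraction >= threshold


# Hypothetical snapshot. With the redis-py client, a real one could be
# obtained via Redis().info("memory").
snapshot = {"used_memory": 900 * 1024 * 1024, "maxmemory": 1024 * 1024 * 1024}
print(should_alert(snapshot))
```

The same idea applies to CPU and request rate: sample them regularly and alert well before the limit, since the queue data is the part you cannot afford to lose.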

Flattening DynamoDB write bursts

I'm looking for a creative and efficient way to flatten write bursts to DynamoDB.
I have 4 cron jobs that run every 3 minutes, each on its own thread. For reasons I can't control, they start at the same time.
Part of each job is to write a few thousand rows to DynamoDB. This normally takes 10 to 30 seconds using batch writes.
Because of the timing, the 4 jobs do their writing in parallel.
I'm looking for the most efficient way to distribute the writes over time.
I don't want to add resources if not necessary. The solution probably includes some kind of cache and an additional cron job.
I have memcache available. However, there is probably something more efficient than writing to memcache and reading it back.
Maybe a log file on the server?
What would you do?
It's PHP with Apache on Ubuntu.
An established pattern, especially if you just need the writes to get there eventually, is to put your records into an SQS queue first and have a background task read messages from SQS and write them to DynamoDB at a maximum prescribed rate. This is useful when you don't want to pay for write throughput high enough to match your peak periods of writes to the database.
SQS has the benefit of being able to accept messages at almost any scale, so you can reduce your DynamoDB costs by writing rows at a low, predictable pace.
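The background task could be shaped roughly like the sketch below. It is written in Python rather than PHP and deliberately avoids any AWS SDK calls: a standard-library `queue.Queue` stands in for SQS and a plain callable stands in for the DynamoDB batch write, so every name here is an assumption for illustration. The batch size of 25 matches DynamoDB's limit for a single BatchWriteItem call.

```python
import queue
import time
from typing import Callable, List

BATCH_SIZE = 25  # DynamoDB's BatchWriteItem accepts at most 25 items per call


def drain_at_fixed_rate(
    source: "queue.Queue[dict]",
    write_batch: Callable[[List[dict]], None],
    max_items_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Move everything currently in `source` to `write_batch`.

    After each batch the function pauses long enough that the average
    write rate stays at or below `max_items_per_second`.
    """
    written = 0
    while True:
        batch: List[dict] = []
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(source.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return written
        write_batch(batch)
        written += len(batch)
        sleep(len(batch) / max_items_per_second)


# Demo with stand-ins: the pauses are recorded instead of actually slept.
pending: "queue.Queue[dict]" = queue.Queue()
for i in range(60):
    pending.put({"id": i})
batches: List[List[dict]] = []
pauses: List[float] = []
total = drain_at_fixed_rate(pending, batches.append, 50, sleep=pauses.append)
print(total, [len(b) for b in batches], pauses)
```

The four cron jobs then only enqueue their rows, which is fast, and the drain task decides how quickly the table sees them, so the table's write capacity can be sized for the average rate rather than the burst.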

How are jobs assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application.
I've set a batch time of 10 seconds, so a job is started every 10 seconds reading input from my HDFS.
If every job lasts for more than 10 seconds, the new job that is started is assigned to a free executor, right?
Even if the previous one didn't finish?
I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.
If you know of any links where all of those things are explained, I would really appreciate seeing them.
Thank you.
Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.
This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
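The stability condition can be illustrated with a toy simulation. This is plain Python, not Spark code, and it assumes a constant processing time per batch and exactly one job running at a time; the function name is made up for illustration.

```python
from typing import List


def scheduling_delays(batch_interval: float, processing_time: float, batches: int) -> List[float]:
    """Seconds each batch waits in the queue when only one job runs at a time."""
    delays = []
    previous_finish = 0.0
    for k in range(batches):
        ready = k * batch_interval            # when batch k becomes available
        start = max(ready, previous_finish)   # it must wait for the previous job
        delays.append(start - ready)
        previous_finish = start + processing_time
    return delays


print(scheduling_delays(10.0, 8.0, 5))   # [0.0, 0.0, 0.0, 0.0, 0.0]
print(scheduling_delays(10.0, 12.0, 5))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

With 8 seconds of processing per 10-second batch nothing ever waits, while with 12 seconds each batch waits 2 seconds longer than the one before it, so the backlog grows without bound.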
There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.

Finish sidekiq queues much quicker

I've reached a point where it is taking too long for a queue to finish, because new jobs are added to that queue.
What are the best options to overcome this problem?
I already use 50 processors, but I noticed that if I open more, it will take longer for jobs to finish.
My setup:
nginx,
unicorn,
ruby-on-rails 4,
postgresql
Thank you
You need to measure where you are constrained by resources.
If you're seeing things slow down as you add more workers you're likely blocked by your database server. Have you upgraded your Redis server to handle this amount of load? Where are you storing the scraped data to? Can that system handle the increased write load?
If you were blocked on CPU or I/O, you should see the amount of work through the system scale linearly as you add more workers. Since you're seeing things slow down when you scale out, you should measure where your problem is. I'd recommend instrumenting NewRelic for your worker processes and measuring where the time is being spent.
My guess would be that your Redis instance can't handle the load to manage the work queue with 50 worker processes.
EDIT
Based on your comment, it sounds like you're entirely I/O-bound doing web scraping. In that case, you should increase the concurrency of each Sidekiq worker, using the -c option to spawn more threads. Having more threads will allow you to continue processing scraping jobs even when other scrapers are blocked on network I/O.
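To see why threads help with I/O-bound work, here is a small illustration. It uses Python threads rather than Sidekiq, and a short `time.sleep` stands in for waiting on the network, so the function names, URLs, and timings are assumptions and not a measurement of your system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_scrape(url: str) -> str:
    """Stand-in for a scraping job: the sleep plays the role of network wait."""
    time.sleep(0.05)
    return url


def run_jobs(urls: List[str], threads: int) -> float:
    """Run all jobs on a pool of `threads` threads and return elapsed seconds."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(fake_scrape, urls))
    return time.perf_counter() - started


urls = [f"https://example.com/page/{i}" for i in range(20)]
print(f"1 thread:   {run_jobs(urls, 1):.2f}s")
print(f"10 threads: {run_jobs(urls, 10):.2f}s")
```

While one thread waits on the network the others keep working, so the total time drops roughly in proportion to the thread count until something else (CPU, the database, Redis) becomes the bottleneck.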

Is there a way to configure timeout for speculative execution in Hadoop?

I have a Hadoop job with tasks that are expected to run for a significant length of time (a few minutes). However, Hadoop starts speculative execution too soon. I do not want to turn speculative execution off completely, but I want to increase the amount of time Hadoop waits before considering a job for speculative execution. Is there a config option to control this timeout?
Thanks
I don't believe the speculative execution time is currently configurable. On the other hand, there's probably no need to adjust it. Speculative execution is meant to bail you out of slow-running tasks (usually due to degraded hardware performance). If you have available cluster resources such that spec exec is kicking in, what's the harm in letting it do so? Note that a few minutes is not considered "significant"; it is perfectly normal for medium or larger size jobs.
It's also worth noting that while mapper spec exec is almost always fine and low overhead to the system, reducer spec exec can hurt and probably should be disabled. The rationale is that if a mapper is progressing slowly and there are available resources where the data is local (normal), there's no shared overhead. If a reducer is performing slowly, starting another attempt of the same task will simply double the network load - normally the most painful part of reducer execution. If the network is what is causing the reducer to be "slow," starting a second attempt only hurts both attempts.
If you truly have a use case for adjusting the spec exec time, it might be worth filing a jira at http://issues.apache.org.
Hope this helps.

Resources