How does gevent interact with threading.local data?

Will data stored in a threading.local be unique to a specific coroutine, or will it continue to be unique to a Python thread?

According to the documentation of gevent.monkey at http://www.gevent.org/gevent.monkey.html, the thread and threading modules become greenlet-based. "Thread-local storage becomes greenlet-local storage."
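A minimal sketch of what that means in practice, assuming gevent is installed (the worker function and names are illustrative): after monkey.patch_all(), each greenlet sees its own copy of any threading.local attribute.

from gevent import monkey
monkey.patch_all()  # must run before threading is imported

import threading
import gevent

local_data = threading.local()

def worker(name):
    local_data.name = name   # each greenlet sets its own value
    gevent.sleep(0)          # yield to the other greenlet
    print(local_data.name)   # still prints this greenlet's own value

gevent.joinall([gevent.spawn(worker, "a"), gevent.spawn(worker, "b")])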

Related

Duplicate job name in the Beanstalkd queue

Our system processes many jobs from the queue, and sometimes those jobs have not yet finished processing. There is a chance that our system will put jobs into the queue with the same names as jobs that are currently being processed.
Is there a checker that will tell us that a job with the same name is already in the queue before we add it?
Thanks guys!
Beanstalkd does not have the facility to look things up within it - it's a job-queue, not a giant array. Other things can be used in conjunction with it though, that do allow for random access to data to record if something has already been done.
If you know all the jobs have a specific identifier, you could put them into Redis or Memcached, probably with some form of prefix, and possibly an expiry beyond which they won't be stored.
Redis also offers other data structures that could help, such as Bloom filters and the Redis-native HyperLogLog.
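A minimal sketch of that approach with redis-py, assuming each job has a unique name; the seen: prefix, the one-hour expiry, and enqueue_if_new are all illustrative, and queue stands in for whatever Beanstalkd tube/client you use.

import redis

r = redis.Redis()

def enqueue_if_new(queue, job_name, body, ttl=3600):
    # SET ... NX EX: succeeds only if the key does not already exist.
    if r.set("seen:" + job_name, 1, nx=True, ex=ttl):
        queue.put(body)  # first time we've seen this name: safe to enqueue
        return True
    return False         # a same-named job was enqueued within the TTL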

Connecting Redis events to Lua script execution and concurrency issues

I have groups of key-value pairs, stored in data structures built with the Redisson library. The design is that a change to any value in a group should be sent as an event to subscribing Lua scripts. These scripts then do computations and update another group's key-value pair. This is implemented as a chain: once a Lua script updates a key-value pair, that in turn generates an event, and another Lua script does work similar to the first, based on certain parameters.
Question 1: How do I connect the Lua script to the event?
Question 2: Events are pipelined, but my Lua scripts may have to wait for network IO. In that case, I assume the next event is processed and its subscribing script executed. This is a problem for me, because the first script hasn't finished updating the key-value pair it needs to while the second script goes ahead with its work. This will cause errors for me. Is there a way to get around this?
Question 3: How do I emit events from Redisson data structures, given that the Lua script needs to understand that data structure's layout?
At the time of writing, Redis (3.2.9) does not allow blocking commands inside Lua scripts, including the subscribe command. So it is impossible to achieve what you have described via a Lua script.
However you can do it using Redisson Topic and/or Redisson distributed services:
Modify a value, then send a message to a channel. Another process receives the message and does the computation and updating (a sketch of this pattern follows the alternatives below).
Or ...
If there's only one particular process that does the computation and updating, you can use the Redisson remote service to tell this process to do the work; it works like RPC. Maybe it is able to modify the first value too.
Or ...
Create the whole lot as one runnable job and send it to be processed by a Redisson remote executor. You can also choose to schedule the job if it is not immediately required.
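For illustration, here is the first pattern (modify, publish, receive, compute) sketched in Python with redis-py; Redisson's RTopic gives you the same publish/subscribe mechanism in Java. The channel and key names are made up.

import redis

r = redis.Redis()

def update_and_notify(key, value):
    # Write the value and publish a change event in one pipeline.
    with r.pipeline() as pipe:
        pipe.set(key, value)
        pipe.publish("changes", key)
        pipe.execute()

def listen_and_process():
    pubsub = r.pubsub()
    pubsub.subscribe("changes")
    for message in pubsub.listen():
        if message["type"] == "message":
            changed_key = message["data"].decode()
            compute_and_update(changed_key)  # your computation goes here

def compute_and_update(key):
    ...  # read the changed value, compute, update the next group's key

Because listen_and_process handles events one at a time in a single loop, it also sidesteps the overlap described in Question 2.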

Are Redis operations on data structures thread-safe?

How does Redis handle multiple threads (from different clients) updating the same data structure in Redis? What is the recommended best practice for such a use case?
If you read The Little Redis Book, at some point you come across this sentence:
"You might not know it, but Redis is actually single-threaded, which is how every command is guaranteed to be atomic.
While one command is executing, no other command will run."
Have a look at http://openmymind.net/2012/1/23/The-Little-Redis-Book/ for more information.
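As a quick illustration of that guarantee, a hedged sketch with redis-py: several client threads hammer INCR on the same key, and no update is lost, because each INCR executes as one atomic command on the server.

import threading
import redis

redis.Redis().set("counter", 0)

def bump(n=1000):
    client = redis.Redis()  # a client per thread; redis-py clients are thread-safe anyway
    for _ in range(n):
        client.incr("counter")  # atomic on the server: no lost updates

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(redis.Redis().get("counter"))  # b'4000': every increment counted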
Regards

Read issue in MongoDB asynchronous replication

I'm new to MongoDB. I created a Java app using MongoDB as the database.
I configured 3 servers in a replica set.
my pseudo code:
{
createUser
getUser
updateUser
}
Here createUser creates the user successfully, but getUser sometimes fails to return that user.
When I analysed it, I found it is due to data replication latency.
How can I overcome this issue?
Is there any way to replicate data immediately when it is created?
Is there any other way to get the user without fail?
Thanks in advance!
If you are certain that the issue is due to replication latency, one thing you can do is make your writes safe using the w flag. That way, MongoDB will wait until data is replicated to at least n nodes before returning. You can do this from the client driver as well.
MongoDB getLastError
Are you reading with slaveOk=True? If you read from the replica set primary, this shouldn't be an issue either.
The slaveOk property is now known as ReadPreference (.SECONDARY in this case) in newer Mongo Java driver versions. This can be set at the Mongo/DB/Collection level. Note that when you set ReadPreference at these levels, it applies for all callers (i.e. these objects are shared across threads).
Another approach is to try the ReadPreference.SECONDARY and if it fails, try without it and go to the master. This logic can be isolated to your repository layer, so the service layer doesn't have to deal with it. If you are doing this, you may want to set the ReadPreference at the DBQuery object, which is on a per-use basis.
I am not familiar with the Java driver, but there are w and j options.
The w option confirms that write operations have replicated to the specified number of replica set members, including the primary.
The j option confirms the write operation only after it has been written to the journal.
It looks like you need to use WriteConcern.
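A sketch of the same idea in PyMongo (the Java driver exposes the equivalent WriteConcern and ReadPreference classes); the host names, database, and collection here are illustrative.

from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# w="majority" blocks until a majority of members acknowledge the write;
# j=True additionally waits for the journal.
users = client.mydb.get_collection(
    "users", write_concern=WriteConcern(w="majority", j=True)
)
users.insert_one({"_id": "user1", "name": "Alice"})

# Reading from the primary avoids stale secondaries entirely.
primary_users = client.mydb.get_collection(
    "users", read_preference=ReadPreference.PRIMARY
)
print(primary_users.find_one({"_id": "user1"}))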

Web crawler in Ruby: How to achieve the best performance?

I'm writing a web crawler that should be able to parse multiple pages at the same time. I use Nokogiri for parsing, which is quite good and solves all my tasks, but I don't know how to achieve better performance.
I use threads to make many open-uri requests at the same time, and it makes the process quicker, but it seems that it's still far from the potential I can achieve from a single server. Should I use multiple processes? What are the limits on the threads and processes that can be launched for a single Ruby application?
In other words: how do I achieve the best performance in this case?
I really like Typhoeus and Hydra for handling multiple requests at once.
Typhoeus is the http client side, and Hydra is the part that handles multiple requests. The examples are good so go through them and see.
While it sounds like you're not looking for something quite so complex I found this thesis an interesting read awhile ago: Building blocks of a scalable webcrawler - Marc Seeger.
In terms of threading/process limits, Ruby has very low threading potential. Standard Ruby (MRI/YARV) and Rubinius don't support simultaneous thread execution, unless you use an extension specifically built to support it. Depending on how much of your performance trouble is in the IO and how much is in the processing, I could suggest using EventMachine.
Multi-process, however, Ruby handles very well: as long as you've got a good manager/database for all the processes to communicate with, running multiple processes should scale as well as your processing power allows.
Another way is to use a combination of Nokogiri and IronWorker (IronMQ and IronCache).
See a full blog entry on the topic here.
We use a combination of ActiveMQ/ActiveMessaging, EventMachine, and multi-threading for this problem. We start off with a big list of URLs to fetch. We then break them down into batches of 100 URLs per batch. Each batch is then pushed into ActiveMQ. Then we have an array of poller/consumer processes listening to the queue. These consumers can all be on one computer, or they can be spread across multiple computers. The array of consumers can grow arbitrarily large to support as much parallelism as we want. The consumers use ActiveMessaging, which is a nice Ruby integration with ActiveMQ.
When a consumer receives a message to process a batch of 100 URLs, it kicks off EventMachine to create a thread pool that can process multiple messages in multiple threads. Like you, we use Nokogiri to process each URL (a sketch of this first level follows the list below).
So, there are three levels of parallelism:
1) Multiple concurrent requests per consumer process, supported by Event Machine and threads.
2) Multiple consumer processes per computer.
3) Multiple computers.
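A sketch of level 1 alone, written in Python to match the other examples in this document (the answer itself uses Ruby with EventMachine); fetch, process_batch, and the worker count are illustrative.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

def parse(body):
    ...  # extract links/data (Nokogiri in Ruby, lxml/BeautifulSoup in Python)

def process_batch(urls, workers=20):
    # Threads work well here because the workload is IO-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, body in pool.map(fetch, urls):
            parse(body)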
If you want something easy, go for http://anemone.rubyforge.org/
If you want something fast, code something with eventmachine/em-http-request
I found Redis to be a great multi-purpose tool for queue management, caching and so on. You could also use specialized things like Beanstalkd/ActiveMQ/... but at least in my use case, I didn't really find them to be a big advantage compared to Redis.
Especially the load on the backend system could be a bottleneck, so choose your database carefully and pay attention to what you save.
