I am looking for a distributed lock framework with a feature that lets only a limited number of processes hold the lock at once.
For example:
Lock lock = new Lock("key", 5);
Only 5 processes can hold this lock at the same time; the others have to wait until one of them finishes its job.
Do you know of such a framework, or how I could implement one on top of existing ones?
I think you are looking for a semaphore. Most major Redis and Hazelcast clients already implement one. For example, in Java, you have:
https://redisson.org/glossary/java-semaphore.html
https://docs.hazelcast.com/imdg/4.2/data-structures/isemaphore
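To make the semantics concrete, here is a minimal sketch of a counting semaphore in Go. It only coordinates goroutines inside a single process, so it is not a replacement for the distributed implementations linked above; the names Semaphore, NewSemaphore and maxConcurrent are made up for this illustration. A distributed version would keep the permit count in a shared store such as Redis or Hazelcast instead of in a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Semaphore is a counting semaphore backed by a buffered channel.
// It is in-process only; it is NOT a distributed lock.
type Semaphore chan struct{}

// NewSemaphore returns a semaphore with the given number of permits.
func NewSemaphore(permits int) Semaphore {
	return make(Semaphore, permits)
}

// Acquire blocks until a permit is available.
func (s Semaphore) Acquire() { s <- struct{}{} }

// Release hands a permit back.
func (s Semaphore) Release() { <-s }

// maxConcurrent starts `workers` goroutines that each hold one permit while
// working, and reports the highest number of simultaneous holders observed.
func maxConcurrent(permits, workers int) int {
	sem := NewSemaphore(permits)
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		active  int
		highest int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem.Acquire()
			defer sem.Release()

			mu.Lock()
			active++
			if active > highest {
				highest = active
			}
			mu.Unlock()

			// The protected job would run here.

			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return highest
}

func main() {
	// 50 workers compete for 5 permits; never more than 5 hold one at once.
	fmt.Println(maxConcurrent(5, 50) <= 5) // prints "true"
}
```

The sixth worker blocks in Acquire until one of the first five calls Release, which is the behaviour asked for with new Lock("key", 5).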
Suppose we have a cache with a CacheLoaderWriter, so we are registered to the events: write and writeAll.
What is the status of the affected keys while those operations are running?
That is, if another thread calls cache.get(keyThatBeingWritten), will it be blocked until the write()/writeAll() operation exits?
writeAll() logically functions like a succession of write() calls, so it is entirely possible for one thread to observe some already written data while another thread is still busy executing writeAll().
Regarding write(), it will block concurrent reader and writer threads working on the same key if needed for as long as needed to fulfill the Ehcache visibility guarantees.
I'm creating a new service, and for that I have database entries (Mongo) with a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED -> STARTED in the database when that time arrives, and there can be multiple such states.
Approaches I've thought of:
Keep querying for database entries whose time is <= the current time and change their states accordingly. This causes extra reads for no reason, about half of them empty, and it will get complicated fast as more states are added.
Write a job scheduler myself (I am using Go, so that wouldn't be too hard) and schedule all the jobs, but I might lose queue data in case of a panic/crash.
Use a product like Celery; I have found a Go implementation of it: https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine, but I don't want to get stuck with proprietary technologies.
I wanted to use some PubSub service for this, but I couldn't find one that has delayed messages (if that's a thing). My main problem is that I can't find an actual name for this problem, so I can't search for it properly; I've even tried searching the Microsoft docs. If someone can point me in the right direction, or tell me whether any of the approaches I've written is the one I should use, please let me know. That would be a great help!
UPDATE:
Found one more solution by Netflix, for the same problem
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at and then that datastore can be polled for jobs to be run. There are optimizations that prevent extra reads like polling the database at a regular interval and using exponential back-off. The advantage of this system is that it is tolerant to node failure and the disadvantage is added complexity to the system.
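As a rough sketch of that polling model in Go: the Job struct, the state names and the interval bounds below are assumptions made for illustration, and an in-memory slice stands in for the datastore. The point is only the shape of the loop: run the "due" query, then lengthen the interval after an empty poll and reset it once work is found.

```go
package main

import (
	"fmt"
	"time"
)

// Interval bounds are arbitrary values chosen for this sketch.
const (
	minInterval = 1 * time.Second
	maxInterval = 1 * time.Minute
)

// Job is a hypothetical record; the real one would live in the datastore.
type Job struct {
	ID    string
	RunAt time.Time
	State string // "CREATED" or "STARTED"
}

// startDue moves every CREATED job whose RunAt has passed to STARTED and
// returns how many jobs changed. It stands in for a query plus an update.
func startDue(jobs []Job, now time.Time) int {
	changed := 0
	for i := range jobs {
		if jobs[i].State == "CREATED" && !jobs[i].RunAt.After(now) {
			jobs[i].State = "STARTED"
			changed++
		}
	}
	return changed
}

// nextInterval doubles the polling interval after an empty poll (capped at
// maxInterval) and resets it to minInterval once a poll finds work.
func nextInterval(current time.Duration, foundWork bool) time.Duration {
	if foundWork {
		return minInterval
	}
	next := current * 2
	if next > maxInterval {
		return maxInterval
	}
	return next
}

func main() {
	now := time.Now()
	jobs := []Job{
		{ID: "a", RunAt: now.Add(-time.Minute), State: "CREATED"},
		{ID: "b", RunAt: now.Add(2 * time.Hour), State: "CREATED"},
	}
	interval := minInterval
	for poll := 0; poll < 3; poll++ {
		n := startDue(jobs, now)
		interval = nextInterval(interval, n > 0)
		// A real poller would sleep for `interval` before the next query.
		fmt.Println(n, interval)
	}
}
```

Because the jobs live in the datastore rather than in the poller's memory, a crashed poller can simply be restarted and will pick up whatever is due.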
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
The other approach you described, "schedule jobs in process", is very simple in Go because parked goroutines are extremely cheap: you can just spawn one goroutine per job and let it wait until the job is due. The downside is that if the process dies, the job is lost.
go func() {
    // Park until the job is due; time.Until(t) is shorthand for t.Sub(time.Now()).
    <-time.After(time.Until(expirationTime))
    // do work here.
}()
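A variation on the same in-process idea, in case a scheduled job must be cancellable: time.AfterFunc from the standard library runs a function in its own goroutine when the timer fires and returns a *time.Timer whose Stop method cancels it. The scheduleAt helper and the now variable are names invented for this sketch, and the same caveat applies: the timer lives only in memory, so a crash loses the job.

```go
package main

import (
	"fmt"
	"time"
)

// now is the clock used by scheduleAt. It is a variable so that it can be
// swapped for a fake clock in tests (an assumption of this sketch).
var now = time.Now

// scheduleAt runs job in its own goroutine once runAt is reached and returns
// the underlying timer so that the caller can cancel it with Stop.
func scheduleAt(runAt time.Time, job func()) *time.Timer {
	return time.AfterFunc(runAt.Sub(now()), job)
}

func main() {
	done := make(chan string, 1)

	// A job scheduled far in the future can still be cancelled.
	later := scheduleAt(now().Add(time.Hour), func() { done <- "cancelled job ran" })
	fmt.Println("cancelled:", later.Stop()) // prints "cancelled: true"

	// A job that is due shortly runs when its timer fires.
	scheduleAt(now().Add(50*time.Millisecond), func() { done <- "job ran" })
	fmt.Println(<-done) // prints "job ran"
}
```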
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls another service (over HTTP, gRPC, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.
This is more of a theoretical question.
Imagine that I have two programs that run simultaneously. The main one only does something when it receives a flag set to true from a secondary program. So the main program has a function that keeps asking the secondary one for the value of the flag, and when it gets true, it does something.
What I learned at college is that polling is the simplest way of doing that. But when I started working as a developer, coworkers told me that this method generates some overhead, or that it's a waste of computation, because it asks for a value at fixed intervals.
I tried to come up with ideas for doing this differently and searched the internet for something like it, but didn't find a useful way to do it.
I read about interrupts and passive approaches in which the main program gets the data only when the secondary program informs it. But how does this happen? The main program would need a function to check for the interrupt, right? So wouldn't it end up the same way as before?
What could I do differently?
There is no magic...
No program can guess when there is new information to read. What you can do is choose between two approaches:
A -> asks -> B
A <- is informed <- B
When to use each? It depends on many other factors, such as:
1- How fast does the data need to be delivered from the moment it is generated? As fast as possible, or can it wait a while and accumulate?
2- How fast is the data generated?
3- How many simultaneous clients are requesting data from the same server?
4- What type of data do you deal with? Persistent? Fast-changing?
If you are building something like a stock analyzer, where you need to ask for stock prices every second (and they also change every second), the approach you mentioned may be the best.
If you are writing a chat app like WhatsApp, where you need to check whether there are new messages for the client and most of the time there won't be any, publish/subscribe may be the best.
But all of this is a very superficial look at a high-impact architecture decision; it is not possible to pick the best option by looking at just one factor.
What I want to show is that
"coworkers told me that this method generate some overhead or it's waste of computation"
is not an accurate statement. It may be true in some particular scenarios, but overhead will always exist in distributed systems.
The typical way to prevent polling is by using the Publish/Subscribe pattern.
Your client program will subscribe to the server program and when an event occurs, the server program will publish to all its subscribers for them to handle however they need to.
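A minimal in-process sketch of that pattern in Go, with channels standing in for whatever transport the two programs would really use. The Broker type, its method names and the buffer size are invented for this illustration and do not come from any particular library.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker is a minimal in-process publish/subscribe hub.
type Broker struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers a new subscriber and returns the channel on which it
// will receive events. The buffer size of 8 is an arbitrary choice.
func (b *Broker) Subscribe() <-chan string {
	ch := make(chan string, 8)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers event to every subscriber and returns how many received
// it. A subscriber whose buffer is full is skipped rather than blocking the
// publisher.
func (b *Broker) Publish(event string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for _, ch := range b.subs {
		select {
		case ch <- event:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	var b Broker
	inbox := b.Subscribe()

	done := make(chan struct{})
	go func() {
		// The subscriber blocks here, without polling, until an event arrives.
		fmt.Println("got:", <-inbox)
		close(done)
	}()

	b.Publish("flag=true")
	<-done // prints "got: flag=true"
}
```

The subscriber never asks for the flag; it is parked on the channel receive and only wakes up when the publisher has something to say.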
If you flip the order of the requests you end up with something more similar to a standard web API. Your main program (left in your example) would be a server listening for requests. The secondary program would be a client hitting an endpoint on the server to trigger an event.
There are many ways to accomplish this in every language, and it doesn't have to be tied to TCP/IP requests.
I'll add a few links for you shortly.
Well, in most languages you won't implement something this low-level yourself. But theoretically speaking, there are different waiting strategies, and the one you are describing is active waiting. Done carelessly, it can easily eat all your CPU time.
Most languages provide libraries that let you start a process as a service that waits passively and is triggered when a request comes in.
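The two waiting strategies can be contrasted with a small Go sketch. The function names are made up for this illustration, and a shared variable plus a channel stand in for the communication between the two programs: the active version keeps re-checking the flag and spends CPU time on every check, while the passive version is parked by the runtime and costs nothing until it is woken.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// activeWait spins until flag becomes non-zero and returns how many times it
// had to check. Every check costs CPU time even though nothing has changed.
func activeWait(flag *int32) int {
	checks := 0
	for atomic.LoadInt32(flag) == 0 {
		checks++
		runtime.Gosched() // let other goroutines run between checks
	}
	return checks
}

// passiveWait parks the calling goroutine until ready is closed. The
// goroutine uses no CPU while it is parked.
func passiveWait(ready <-chan struct{}) {
	<-ready
}

func main() {
	var flag int32
	ready := make(chan struct{})

	go func() {
		time.Sleep(10 * time.Millisecond) // the secondary program "working"
		atomic.StoreInt32(&flag, 1)       // seen by the active waiter
		close(ready)                      // wakes the passive waiter
	}()

	fmt.Println("active checks:", activeWait(&flag))
	passiveWait(ready)
	fmt.Println("passive wait returned")
}
```

The number of checks printed by the active version varies from run to run; it shows how much repeated work was done just to learn that nothing had happened yet.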
What I'd like to achieve is as follows (pseudocode):
f, t = select(files, threads)
if f
  <read from files>
elsif t
  <do something else>
end
Where select is a method similar to IO.select. But it seems unlikely to be possible.
The big picture is that I'm trying to write a program that has to perform several types of jobs. The idea was to pass job data through a database, but also to inform the program about new jobs through pipes (by sending just the type of the job), so that it wouldn't need to poll for jobs. So I was planning to create a loop that waits either for new notifications from the pipes or for worker threads to finish. After a thread finishes, I check whether there was at least one notification about that particular type of job and run the worker thread again if needed. I'm not really sure what the best route to take is here, so if you've got suggestions I'd like to hear them.
Don't reinvent the wheel mate :) check out https://github.com/eventmachine/eventmachine (IO lib based on reactor pattern like node.js etc) or (perhaps preferably) https://github.com/celluloid/celluloid-io (IO lib based on actor pattern, better docs and active maintainers)
OPTION 1 - use EM or Celluloid to handle non-blocking sockets
EM and Celluloid are quite different: EM follows the reactor pattern ("same thing" as node.js, with a threadpool as a workaround for blocking calls) and Celluloid follows the actor pattern (an actor thread pool).
Both can do non-blocking IO to/from a lot of sockets and delegate work to a lot of threads, depending on how you go about it. Both libs are very robust, efficient and battle tested. EM has more history but seems to have fallen slightly out of maintenance (https://www.youtube.com/watch?v=mPDs-xQhPb0), while celluloid has a nicer API and a more active community (http://www.youtube.com/watch?v=KilbFPvLBaI).
Best advice I can give is to play with code samples that both projects provide and see what feels best. I'd go with celluloid for a new project, but that's a personal opinion - you may find that EM has more IO-related features (such as handling files, keyboard, unix sockets, ...)
OPTION 2 - use background job queues
I may have been misguided by the low level of your question :) Have you considered using some of the job queues available under ruby? There's a TON of decent and different options available, see https://www.ruby-toolbox.com/categories/Background_Jobs
OPTION 3 - DIY (not recommended)
There is a pure Ruby implementation of EM. It uses IO selectables to handle sockets, so it offers a pattern for what you're trying to do. Check it out: https://github.com/eventmachine/eventmachine/blob/master/lib/em/pure_ruby.rb#L311 (see the selectables handling).
However, given the amount of other options, hopefully you shouldn't need to resort to such low level coding.
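For what it's worth, the logic of the loop described in the question (wait for either a notification or a finished worker, and rerun a worker if a notification for its job type arrived in the meantime) is independent of the library used. Below is a sketch of it in Go rather than Ruby, where select over channels plays the role of the hypothetical select(files, threads); the dispatch function and its parameters are invented for this illustration.

```go
package main

import "fmt"

// dispatch runs at most one worker per job type at a time. A notification
// that arrives while a worker of that type is busy is remembered, and the
// worker is started again once it finishes. It returns how many times each
// type ran, after notifications is closed and all workers are done.
func dispatch(notifications <-chan string, work func(jobType string)) map[string]int {
	finished := make(chan string)
	running := map[string]bool{}
	pending := map[string]bool{}
	runs := map[string]int{}

	start := func(t string) {
		running[t] = true
		runs[t]++
		go func() {
			work(t)
			finished <- t
		}()
	}

	for notifications != nil || len(running) > 0 {
		select {
		case t, ok := <-notifications:
			if !ok {
				notifications = nil // a nil channel is never selected again
				continue
			}
			if running[t] {
				pending[t] = true
			} else {
				start(t)
			}
		case t := <-finished:
			delete(running, t)
			if pending[t] {
				delete(pending, t)
				start(t)
			}
		}
	}
	return runs
}

func main() {
	notifications := make(chan string)
	go func() {
		for _, t := range []string{"email", "email", "report"} {
			notifications <- t
		}
		close(notifications)
	}()

	runs := dispatch(notifications, func(string) {})
	fmt.Println(runs["email"], runs["report"]) // prints "2 1"
}
```

Several notifications that arrive while a worker is busy are coalesced into a single rerun, which matches the "at least one notification" check from the question.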
Background: I'm writing a network traffic processing kernel module.
I'm getting packets using netfilter hooks. All filtering is done inside the hook function, but I don't want to do the packet processing there. So the solution is tasklets or workqueues. I know the difference between them and I can use either, but I have some problems and I need some advice.
Tasklets solution. Preferable. I can create and start a tasklet for each packet, but who will delete this tasklet? The tasklet function? I don't think it's a good idea to deallocate a tasklet while it is executing. Create a global pool of tasklets? Well, since there can't be two tasklets executing on one processor, the pool size will be the number of processors. But how do I find out when a tasklet is available for reuse? There are only two states, sched and run, and there is no "done" state. OK, I could probably wrap the tasklet in a struct with a flag. But wouldn't all that be overkill?
Workqueue solution. Same problem: who will delete the work? Same "solution" as for tasklets?
Workqueue solution 2. Just create a permanent work during module loading, save packets to some queue and process them inside the work. Maybe two works and two queues: incoming and outgoing. But I'm afraid that with this solution I will use only one (or two) processors, since it looks like a work can't be performed on several processors simultaneously.
Any other solutions?
One can use a high-priority (WQ_HIGHPRI), unbound (WQ_UNBOUND) workqueue and stick with the third option listed in the question ("Workqueue solution 2").
WQ_HIGHPRI queues the work items to a high-priority worker pool, so processing is initiated as soon as possible. WQ_UNBOUND eliminates the single-CPU bottleneck, since the work items are no longer bound to the CPU that queued them and the scheduler can run them on any available CPU.