Optimal language for asynchronous processing of information - thread-safety

Before delving into the heart of the matter, first I will have to outline the current scenario. I currently have a php script that executes through CLI to process some data. It goes something like this:
The user submits some data through the website and it is stored in a database
A PHP script executing through CLI cycles through all of the data in the database every 5 minutes or so. It reads the information submitted by the user in the database, processes it, then creates multiple other entries in other databases. Often it might have to post something through HTTP using file_get_contents.
I cannot always have the information processed simply when the user submits it for logistical reasons (this is non-negotiable)
The code for it would look something like this:
$q = mysql_query("SELECT username, infoA, infoB FROM data");
while ($r = mysql_fetch_array($q))
{
    some_function($r['username'], $r['infoA']);
    another_function($r['infoB']);
}
The functions "some_function" and "another_function" are where all the actual processing of the information occurs. Here is the issue: Often, there are a lot of entries to cycle through and there is far too large of a delay between the time the first entry is processed and the last one. I need all of the data processed with minimal delay between the first and last entry. The functions themselves are optimized well and run pretty fast so that is not the issue. Since future function calls do not need to reference data from previous function calls, I am thinking that I need the functions to be executed asynchronously. This way, the script can cycle to the next entry without waiting for the first entry to be done processing.
The PHP CLI script I created is primarily for testing purposes. It works well for preliminary testing, but once I launch, the quantity of data will be significantly greater. What is the ideal language for handling a task such as this? I certainly need the functions to be executed asynchronously. However, if there are too many asynchronous calls at the same time, the system might be overloaded or the information might not be processed properly. Hence, there must also be an efficient way to handle this. Can I still do this in PHP, or should I move to something else, and why?
The requirements are that I can make http requests with GET data (I do not need to wait for the results), be able to use mysql, and memcached.
Realistically speaking, I will hire programmers to work on this. So, I am really looking for as much information as possible to determine exactly what skill sets I should look for in the programmers.
Also, please do not recommend getting a faster server. I am focused on optimizing the software end of this. Improvements to the physical server that are required for an improved software approach might be taken into consideration. However, I am trying to avoid simply pumping money into the hardware infrastructure to compensate for software inefficiency.

I recommend that you use Gearman (gearmand) for this.
It is very easy to use from PHP through this extension: http://php.net/manual/fr/book.gearman.php
Just set up a Gearman server, and refactor your code to delegate all the processing to this server.
Your previous code can be refactored like this:
<?php
# Client code: runs in the script that reads the rows from the database
$client = new GearmanClient();
$client->addServer();
// doBackground() queues the job and returns a job handle without waiting for the result
print $client->doBackground("action1", json_encode(array($username, $infoA)));
print $client->doBackground("action2", $infoB);

# Worker code: a separate long-running script; start several copies to process jobs in parallel
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction("action1", "some_function");
$worker->addFunction("action2", "another_function");
while ($worker->work());

function some_function($job)
{
    // The workload is the JSON string sent by the client
    list($username, $infoA) = json_decode($job->workload(), true);
    // do the stuff ...
}

function another_function($job)
{
    $infoB = $job->workload();
    // do the stuff ...
}
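The question also asks how to keep too many simultaneous calls from overloading the system. With a job queue the cap is simply the number of workers that are running. The same idea can be sketched in a single process with a bounded pool. This is only an illustration in Python rather than PHP; `process_row`, `process_all` and the sample rows are hypothetical stand-ins for `some_function` / `another_function` and the database rows.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(row):
    # Hypothetical stand-in for some_function / another_function.
    username, info_a, info_b = row
    return f"{username}:{info_a}:{info_b}"


def process_all(rows, max_workers=4):
    # At most max_workers rows are processed at the same time;
    # the remaining rows wait in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input rows.
        return list(pool.map(process_row, rows))


rows = [("alice", "a1", "b1"), ("bob", "a2", "b2"), ("carol", "a3", "b3")]
results = process_all(rows, max_workers=2)
```

Raising or lowering `max_workers` plays the same role as starting more or fewer Gearman worker processes.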

Related

Scheduling tasks/messages for later processing/delivery

I'm creating a new service whose database entries (in Mongo) have a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED to STARTED in the database when that time arrives, and there can be multiple such states.
Approaches I've thought of:
Keep querying database entries that are <= current time and then change their states accordingly. This causes extra reads for no reason and half the time empty reads, and it will get complicated fast with more states coming in.
I write a job scheduler (I am using Go, so that would not be hard) and schedule all the jobs, but I might lose queue data in case of a panic or crash.
I use a product like Celery; I have found a Go implementation of it: https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine, but I don't want to get stuck in proprietary technologies.
I wanted to use some PubSub service for this, but I couldn't find one that has delayed messages (if that's a thing). My problem is mainly not being able to find an actual name for this problem, to be able to search for it properly, I've even tried searching Microsoft docs. If someone can point me in the right direction or if any of the approaches I've written are the ones I should use, please let me know, that would be a great help!
UPDATE:
Found one more solution by Netflix, for the same problem
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at and then that datastore can be polled for jobs to be run. There are optimizations that prevent extra reads like polling the database at a regular interval and using exponential back-off. The advantage of this system is that it is tolerant to node failure and the disadvantage is added complexity to the system.
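A minimal sketch of this datastore-polling model, written in Python with an in-memory SQLite table standing in for the real datastore (the question uses Mongo). The table layout and the function names `create_store`, `schedule` and `start_due_jobs` are assumptions made for the illustration.

```python
import sqlite3


def create_store():
    # In-memory SQLite stands in for the real datastore.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, run_at REAL, state TEXT)"
    )
    return db


def schedule(db, run_at):
    # Insert a job together with the time at which it should start.
    cur = db.execute(
        "INSERT INTO jobs (run_at, state) VALUES (?, 'CREATED')", (run_at,)
    )
    return cur.lastrowid


def start_due_jobs(db, now):
    # One polling pass: find jobs whose time has come and move them
    # from CREATED to STARTED. Returns the ids that were transitioned.
    due = [
        row[0]
        for row in db.execute(
            "SELECT id FROM jobs WHERE state = 'CREATED' AND run_at <= ? "
            "ORDER BY run_at",
            (now,),
        )
    ]
    for job_id in due:
        db.execute("UPDATE jobs SET state = 'STARTED' WHERE id = ?", (job_id,))
    db.commit()
    return due
```

A poller would call `start_due_jobs(db, time.time())` at a regular interval. Because the jobs live in the datastore, a crashed poller can be restarted without losing them.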
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
The other approach you described, "schedule jobs in process", is very simple in Go because parked goroutines are extremely cheap, so you can just spawn one goroutine per job. The downside is that if the process dies, the job is lost.
go func() {
	// Park this goroutine until the scheduled time, then run the job.
	time.Sleep(time.Until(expirationTime))
	// do work here.
}()
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls to another service (over http, grpc, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.

How to get data without polling?

This is more of a theoretical question.
Imagine that I have two programs that run simultaneously. The main one only does something when it receives a flag set to true from a secondary program. So the main program has a function that keeps asking the secondary one for the value of the flag, and when it gets true, it does something.
What I learned at college is that polling is the simplest way of doing that. But when I started working as a developer, coworkers told me that this method generates some overhead or is a waste of computation, because it asks for a value every so often.
I tried to come up with ideas for doing this in a different way and searched the internet for something like this, but didn't find a useful way to do it.
I read about interrupts and passive approaches in which the main program only gets the data when it is informed by the secondary program. But how does this happen? The main program will need a function to check for the interrupt, right? So won't it end up the same way as before?
What could I do differently?
There is no magic: no program can guess when there is new information to read. What you can do is choose between two approaches:
A -> asks -> B
A <- is informed <- B
When to use each depends on many other factors, such as:
1- How quickly must the data be delivered once it is generated? As fast as possible, or can it be held for a while and accumulated?
2- How fast is the data generated?
3- How many simultaneous clients are requesting data from the same server?
4- What type of data are you dealing with? Persistent? Fast-changing?
If you are building something like a stock analyzer, where you need to ask for stock prices every second (and they also change every second), the polling approach you mentioned may be the best.
If you are writing a chat app like WhatsApp, where you need to check whether there is a new message for the client and most of the time there is not, publish/subscribe may be the best.
All of this is a very superficial look at a high-impact architecture decision; it is not possible to find the best option by looking at just one factor.
What I want to show is that the statement
"coworkers told me that this method generates some overhead or is a waste of computation"
is not right in general. It may hold in some particular scenarios, but some overhead will always exist in distributed systems.
The typical way to prevent polling is by using the Publish/Subscribe pattern.
Your client program will subscribe to the server program and when an event occurs, the server program will publish to all its subscribers for them to handle however they need to.
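A minimal in-process sketch of the pattern, in Python. The class name `Publisher` and the event string are hypothetical. In a real system the subscribers would usually be other processes reached through a message broker or a socket rather than plain callbacks.

```python
class Publisher:
    """Keeps a list of subscriber callbacks and notifies them of events."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # Register a callable to be invoked for every published event.
        self._subscribers.append(callback)

    def publish(self, event):
        # Push the event to every subscriber; nobody has to ask for it.
        for callback in self._subscribers:
            callback(event)


received = []
flag_source = Publisher()
flag_source.subscribe(received.append)
flag_source.subscribe(lambda event: received.append(f"handled {event}"))
flag_source.publish("flag=true")
```

The subscribers do no work at all until `publish` is called, which is the difference from asking for the flag in a loop.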
If you flip the order of the requests you end up with something more similar to a standard web API. Your main program (left in your example) would be a server listening for requests. The secondary program would be a client hitting an endpoint on the server to trigger an event.
There are many ways to accomplish this in every language, and it doesn't have to be tied to TCP/IP requests.
I'll add a few links for you shortly.
Well, in most languages you won't implement this at such a low level. But theoretically speaking, there are different waiting strategies, and what you are describing is active waiting. Busy-waiting like this can easily eat all of your CPU time.
Most languages provide libraries that let you start a process as a service that waits passively and is triggered when a request comes in.
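As an illustration of passive waiting, here is a small Python sketch using `threading.Event` from the standard library. The two programs are modelled as two threads in one process, which is an assumption made to keep the example self-contained; between separate processes the same role is played by a blocking read on a socket, pipe or queue.

```python
import threading

flag = threading.Event()
log = []


def main_program():
    # wait() blocks inside the runtime until set() is called (or the
    # timeout expires); the thread does not spin while it waits.
    if flag.wait(timeout=5):
        log.append("flag received, doing work")


def secondary_program():
    log.append("setting flag")
    flag.set()  # wakes up every thread blocked in flag.wait()


waiter = threading.Thread(target=main_program)
waiter.start()
secondary_program()
waiter.join()
```

The check the question worries about still exists, but it is done by the operating system's scheduler, which only wakes the waiting thread when the flag changes.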

Java EE servlet to create a file and show progress while creating it

I need to write a servlet that will return to the user a csv that holds some statistics.
I know how to return just the file, but how can I do it while showing a progress bar of the file creation process?
I am having trouble understanding how can I do something ajaxy to show the progress of the file creation, while creating the file at the same time - if I create a servlet that will return the completion percentage, how can it keep the same file it is creating while returning a response every x seconds to the browser to show the progress.
There are two fundamentally different approaches. One is true asynchronous delivery using an approach such as Comet. You can see some descriptions in articles such as this. I would use this approach where the data you are delivering is naturally incremental, for example live measurements from instrumentation. Some Java app servers have nice integration between their JMS message systems and Comet to the browser.
The other approach is that you have a polling mechanism. The JavaScript in the browser makes periodic calls to the server to get status (and maybe the next chunk of data). The advantage of this approach is that you are using a very standard programming model, less new stuff to learn. For many cases, such as "are there new answers for the Stack Overflow question I'm working on?" this is quite sufficient.
Your challenge may be to determine any useful progress information. How would you know how far through the generation of the CSV file you are?
If you are firing off a long-running request from a servlet, it's quite likely that you will effectively spin off a worker thread to do that work (maybe using JMS, maybe using asynchronous workers) and immediately return a response to the browser saying "Understood, I'm thinking". This ensures that you are not vulnerable to any HTTP response timeouts. The problem then is how to determine the current progress. Unless the "worker" doing the work has some way to communicate its partial progress, you have nothing useful to say. This kind of thing tends to be very application-specific. Some tasks very naturally have progress points (consider printing: we know how many pages there are to do and how many have been printed); others don't (consider determining whether a number is prime: yes or no, with perhaps no useful intermediate stages).
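The worker-plus-progress idea can be sketched as follows. This is Python rather than Java, and the `CsvJob` class and its fields are hypothetical; in a servlet the job object would typically be stored in the session or in a shared map, and a second endpoint would return `progress()` to the polling JavaScript.

```python
import threading


class CsvJob:
    """Builds CSV lines in a worker thread and exposes its progress."""

    def __init__(self, rows):
        self.rows = rows
        self.done = 0
        self.lines = []
        self._lock = threading.Lock()

    def run(self):
        for row in self.rows:
            line = ",".join(str(value) for value in row)
            with self._lock:
                self.lines.append(line)
                self.done += 1  # one progress point per row written

    def progress(self):
        # Percentage of rows written so far; safe to call from another thread.
        with self._lock:
            return 100 * self.done // len(self.rows) if self.rows else 100


job = CsvJob([(1, "a"), (2, "b"), (3, "c"), (4, "d")])
before = job.progress()
worker = threading.Thread(target=job.run)
worker.start()
worker.join()
after = job.progress()
```

Here the number of rows is known in advance, so it is a natural progress point. When the total is unknown, the job can only report "still working".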

Ajax use on website design

I just want to ask about your experience. I'm designing a public website that uses jQuery Ajax for most operations. I'm getting some timeouts, and I think the hosting provider may be the cause. Does anyone have experience with this and can give me some hints (especially on handling timeouts)?
Thanks in advance to all.
Esteve
If you have a half-decent host, chances are these aren't network timeouts but are rather due to insufficient hardware which causes your server-side scripts to take too long to answer. For example if you have an autocomplete field and the script goes through a database of 100,000 entries, this is a breeze for newer servers but older "budget" servers or overcrowded shared hosting servers might croak on it.
Depending on what your Ajax operations are, you may be able to break them down in shorter chunks. If you're doing database queries for example, use LIMIT and OFFSET and only return say, 5 entries at a time. When those 5 entries arrive on the client, make another Ajax call for 5 more, so from the user's point of view the entries will keep coming in and it will look fluid (instead of waiting 30s and possibly timing out before they see all entries at once). If you do this make sure you display a spiffy web 2.0 turning wheel to let the user know if they should be waiting some more or if it's done.
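A sketch of the chunking idea on the server side, using Python with an in-memory SQLite table as a stand-in for the real database. The `entries` table, its contents and the function names are assumptions for the illustration. Each Ajax call would map to one `fetch_chunk` call with the offset sent by the client.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany(
    "INSERT INTO entries (name) VALUES (?)",
    [(f"entry{i}",) for i in range(1, 13)],
)


def fetch_chunk(offset, limit=5):
    # One short query per request instead of one long query for everything.
    rows = db.execute(
        "SELECT name FROM entries ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return [name for (name,) in rows]


def fetch_all_in_chunks(limit=5):
    # What the client effectively does: keep asking until a chunk is empty.
    chunks, offset = [], 0
    while True:
        chunk = fetch_chunk(offset, limit)
        if not chunk:
            break
        chunks.append(chunk)
        offset += len(chunk)
    return chunks
```

An empty chunk tells the client that it can stop requesting and hide the progress indicator.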

Send data to browser

An example:
Say, I have an AJAX chat on a page where people can talk to each other.
How is it possible to display (send) the message sent by person A to persons B, C and D while they have the chat opened?
I understand that technically it works a bit different: the chat(ajax) is reading from DB (or other source), say every second, to find out if there are new messages to display.
But I wonder if there is a method to send the new message to the rest of the people just when it is sent, and not to load the DB with 1000s of reads every second.
Please note that the AJAX chat is just an example to explain what I want, and is not something I want to build. I just need to know if there is a method to let all the browsers that have a specific (Ajax) page open know that there is new content on the server to be fetched.
{sorry for my English}
Since the server cannot respond to a client without a corresponding request, you need to keep state for each user's queued message. However, this is exactly what the database accomplishes. You cannot get around this by replacing the database with something that doesn't just accomplish the same thing in a different way. That said, there are surely optimizations you could do. Keep in mind, however, that you shouldn't prematurely optimize situations like this; databases are designed to handle extremely high traffic, and it's very possible (and in fact, likely), that the scenario described will be handled just fine by the database out of the box.
What you're describing is generally referred to as the 'Comet' concept. See the Wikipedia article for details, especially implementation options (long polling, etc.).
Another answer is to have the server push changes to connected clients, that way there is just one call to the database and then the server pushes the change to all the clients. This article indicates it is possible, however I have never tried this myself.
It's very basic, but if you want to stick with a standard AJAX solution, a simple means of reducing load on the server when polling would be to get the AJAX call to forward the last collected comment ID for that client - you then use that (with the appropriate escaping) in the lookup query on the server side to ensure you only return new comments.
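The last-seen-ID technique can be sketched like this in Python, with an in-memory SQLite table standing in for the real database. The `messages` table and the `post` / `poll` helpers are hypothetical names. The parameterized query takes care of the escaping mentioned above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")


def post(body):
    # Store a new message and return its auto-incremented id.
    return db.execute(
        "INSERT INTO messages (body) VALUES (?)", (body,)
    ).lastrowid


def poll(last_seen_id):
    # Return only the messages newer than the id the client already has,
    # together with the id the client should send on its next poll.
    rows = db.execute(
        "SELECT id, body FROM messages WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return new_last, [body for _, body in rows]
```

Each client starts with `last_seen_id = 0` and replaces it with the returned id after every poll, so a poll with nothing new is a cheap indexed lookup that returns no rows.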
