I have a site that has to crawl different sites to aggregate information. When the crawling scripts are running, the site slows down. I have done as much as possible to optimize the crawling, but it's really CPU- and RAM-intensive. These crawls have to occur based on some user action (e.g. a search). It is not an option to "pre-crawl" the information, as the information is time-sensitive.
What are the general strategies I can use to solve this? Here are 2 of my ideas:
Get more CPU and RAM on current server
Offload these processing intensive scripts on a separate physical server
I'm wondering about cloud computing, but don't have any experience in it. Suggestions?
You've already identified the options. "Cloud computing" here means little more than being able to quickly allocate a VPS with hourly pricing. It's the same as buying another physical server, except without waiting for the host to put it online and e-mail you access info, and without a monthly commitment. You still have to write your application to make use of multiple servers. You have to write code to "scale up" or "scale down" as needed (purchase or terminate virtual servers, and automatically start whatever programs you need on them). And you still have to properly manage the servers (install and maintain an OS, keep packages updated with security fixes), etc.
You could make the action asynchronous:
User submits a search.
System displays "The system is currently searching the information based on your criteria and you will be notified shortly". System handles the user request in the meantime.
Since the user isn't waiting for the result page, the user is free to browse around or do other things on your website instead of staring at a locked-up screen.
When the result is generated, the system notifies the user that the search is done and provides a link to view the result. This can be done either by sending an email notification to the user, or by popping up a dialog box or sliding down a notification message on the menu bar (basically something to catch the user's attention).
It is wise to have a separate machine run these processing-intensive scripts so that they do not slow down the entire application server, especially when you have tons of users submitting searches.
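The submit, notify and fetch flow above can be sketched roughly as follows. This is only a minimal in-process illustration in Python: `submit_search`, `job_status`, `crawl` and the in-memory `jobs` table are all hypothetical names, and in a real deployment the worker would live on the separate machine and the job table in a shared store.

```python
import queue
import threading
import uuid

# Hypothetical in-memory job table: job_id -> {"status": ..., "result": ...}
jobs = {}
jobs_lock = threading.Lock()
pending = queue.Queue()


def crawl(criteria):
    """Placeholder for the CPU- and RAM-heavy crawl (assumption)."""
    return [f"result for {criteria}"]


def submit_search(criteria):
    """Called from the web request: enqueue the work and return at once."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"status": "pending", "result": None}
    pending.put((job_id, criteria))
    return job_id


def job_status(job_id):
    """Called when the page polls to see whether the search has finished."""
    with jobs_lock:
        return dict(jobs[job_id])


def worker():
    """Background loop that performs the crawls one at a time."""
    while True:
        job_id, criteria = pending.get()
        result = crawl(criteria)
        with jobs_lock:
            jobs[job_id] = {"status": "done", "result": result}
        pending.task_done()


def start_worker():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The web request only ever calls `submit_search` and `job_status`, so it returns quickly no matter how long the crawl takes.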
Related
Use a StackOverflow Q&A thread as an example - when you vote up, vote down, or favorite a question, you can see the UI quickly respond to that action with changes in the # of up-votes on the side.
How can we achieve that effect? If you send every such action to the back-end for processing and use the returned response to update the UI, you will see a slow update and feel the glitches. But if you put some of the logic on the front-end, you will also need to take care of fraud/abuse etc. before reflecting the action in the UI, i.e. before changing the # of up-votes, don't you need to make sure that it's a valid click by a valid user first?
You make sure that a valid user is using the app before a user clicks on anything. This is done through authentication, and it must include various protection mechanisms against malicious users.
When a user clicks, a call is made to a server. In a properly architected app this call is lightweight, and the server responds very quickly. I don't know why you believe that "you will see a slow update and feel the glitches". Adding an upvote to the database should take a few hundred milliseconds at most (including the roundtrip from the client), especially if the commit is asynchronous or a memcache is used.
If a database update results in a need to do some complex operations, typically these operations are not done right away. For example, a cron job may run periodically to compute new rankings, etc., precisely because you do not want every user to wait. Alternatively, a task is created and put in a task queue to be executed when resources are available - again to make sure that a user does not wait.
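As a rough illustration of splitting the cheap write from the expensive follow-up work, here is a small Python sketch. The names `record_vote` and `recompute_rankings` and the in-memory structures are hypothetical stand-ins for a real database and a cron job or queue worker.

```python
from collections import Counter

votes = Counter()   # question_id -> net score; cheap to update per request
dirty = set()       # questions voted on since the last recompute
rankings = []       # expensive derived data, rebuilt in batches


def record_vote(question_id, delta):
    """Fast path, executed while the user waits: just store the vote."""
    votes[question_id] += delta
    dirty.add(question_id)


def recompute_rankings():
    """Slow path, run periodically (cron or a queued task), never per request.

    Returns True if the rankings were rebuilt, False if nothing had changed.
    """
    if not dirty:
        return False
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    rankings[:] = [question_id for question_id, _ in ordered]
    dirty.clear()
    return True
```

The user only ever pays for `record_vote`; the ranking is allowed to lag behind until the next batch run.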
In some apps a UI is updated immediately after the call to the server is made, before any response from a server arrives. You can do it when the consequences of a failed call are negligible. For example, if an upvote fails to be saved in the database, it's not a disaster, especially if it happens once in a million tries. Again, in a properly architected app calls fail extremely rarely.
This is a decision that an app developer needs to make. I would not update a UI before a server response if such an update may lead a user to believe that some other action is now possible. For example, if a user uploads a new photo, I would not show icons to edit or share this photo until I know that the photo is safely saved.
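A minimal sketch of the "update first, confirm later" idea, written in Python for brevity even though in practice this logic would live in the browser. `VoteWidget` and the `send_vote` callable are hypothetical; `send_vote` stands in for the call to the server and is assumed to return True on success.

```python
class VoteWidget:
    """Models a vote counter that is updated before the server answers."""

    def __init__(self, count, send_vote):
        self.count = count
        self._send_vote = send_vote  # hypothetical server call

    def upvote(self):
        previous = self.count
        self.count += 1  # show the new count straight away
        try:
            ok = bool(self._send_vote())
        except Exception:
            ok = False
        if not ok:
            self.count = previous  # the save failed, so undo the change
        return ok
```

In a real UI the server call would be asynchronous, and the rollback would happen in the failure callback rather than inline.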
We have an image processing workflow product. Typically 10,000 to 100,000 images can be run through our processing in a job. More than one job may be pending.
Currently, all the image processing is performed in our home-grown imaging library, a managed C++ library that is .NET compatible. It is run in the user’s application space. What I mean by that is that if you log on as “PeteSmith”, the images will be processed under Pete Smith’s account.
Currently, we only allow one instance of this image processing at a time. Customers are asking us for a new version, one that allows more than one instance to run at the same time, so the question of how we do this is now something we are examining.
The idea of getting processing off the “user's account” and using a “system account” to do the processing in the background is appealing. It is appealing because of the way Windows services are naturally managed by OS events such as logging in and logging out, and by other system resource utilization events and alarms.
It appears to me that all we would need to do is manage a small number of well-defined events, well documented by Microsoft.
That’s all nice and wonderful. But what I need to understand is what moving to a service implementation of our image processing code means for performance, from our customers’ point of view.
In their view, they need more processed, faster.
QUESTION: How should I think about the tradeoffs of:
1) Using a service to run a job vs. running N different “instances” of the software only on Pete Smith's (the user’s) account?
2) Allowing N services to run N different jobs (no cross talk needed) vs. running N different “instances” of the software only on Pete Smith's (the user’s) account?
Well, the image processing requires a certain amount of CPU and IO resources for the processing. That amount does not change by fiddling with how and where you start your process.
The choice between a service and a regular application should be governed by the required usage pattern. If you want the application to go on processing images automatically regardless of whether anyone is logged on, you should run it as a service. If the usage is more of a "choose an image, start processing and wait for the result" style, you should go for a client app.
It is not entirely clear why your customer wants to run multiple instances. Is it because they want to have one instance do the heavy processing work while they configure the processing for the other? Or do they want to run multiple instances because the processing is heavy and they want to run multiple in parallel?
In both those cases I would consider running the calculation(s) on a background thread in the application. If it is not possible to use threads (maybe due to some global shared state in the library) my second best bet would be to start each processing in a new process and wait for the result on the main process.
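To illustrate the background-worker idea, here is a minimal Python sketch using a thread pool. The product in question is a .NET library, so this only shows the shape of the approach; `process_image` is a hypothetical placeholder for the call into the imaging library.

```python
from concurrent.futures import ThreadPoolExecutor


def process_image(path):
    """Placeholder for the real per-image processing (assumption)."""
    return path.upper()


def run_job(paths, max_workers=4):
    """Process all images of one job on background worker threads.

    Results come back in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_image, paths))
```

If the library cannot be used from several threads, `ProcessPoolExecutor` from the same module offers the same interface with one process per worker, which corresponds to the second option above.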
I am developing a social network in ASP.NET MVC 3. Every user must have the ability to see connected people.
What is the best way to do this?
I added a flag in the table Contact in my database, and I set it to true when the user logs in and set it to false when he logs out.
But the problem with this solution is when the user closes the browser without logging out, he will still remain connected.
The only way to truly know that a user is currently connected is to maintain some sort of connection between the user and the server. Two options immediately come to mind:
1) Use javascript to periodically call your server using ajax. You would have a special endpoint on your server that would be used to update a "last connected time" status, and you would have a second endpoint for users to poll to see who is online.
2) Use a websocket to maintain a persistent connection with your server
Option 1 should be fairly easy to implement. The main thing to keep in mind is that this will increase the number of requests coming into your server, and you will have to plan accordingly in order to handle the traffic this could generate. You will have some control over the load on your server by configuring how often the JavaScript timer calls back to your server.
Option 2 could be a little more involved if you did this without library support. Of course there are libraries out there such as SignalR that make this really easy to do. This also has an impact on the performance of your site since each user will be maintaining a persistent connection. The advantage with this approach is that it reduces the need for polling like option 1 does. If you use this approach it would also be very easy to push a message to user A that user B has gone offline.
I guess I should also mention a really easy 3rd option as well. If you feel like your site is pretty interactive, you could just track the last time they made a request to your site. This of course may not give you enough accuracy to determine whether a user is "connected".
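Both the polling option and the "last request" option come down to storing a timestamp per user and treating anyone seen recently as online. A minimal Python sketch of that bookkeeping follows; the `PresenceTracker` name and the 60-second default timeout are assumptions, not anything from ASP.NET.

```python
import time


class PresenceTracker:
    """Tracks when each user was last seen and who counts as online."""

    def __init__(self, timeout_seconds=60, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so it can be tested
        self._last_seen = {}         # user_id -> timestamp of last contact

    def heartbeat(self, user_id):
        """Call this from the ping endpoint, or on every request."""
        self._last_seen[user_id] = self._clock()

    def online_users(self):
        """Users whose last contact is within the timeout window."""
        cutoff = self._clock() - self._timeout
        return sorted(u for u, seen in self._last_seen.items() if seen >= cutoff)
```

A user who closes the browser simply stops sending heartbeats and drops out of `online_users()` once the timeout passes, which addresses the problem with the login/logout flag.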
My goal is to track all logged-in users in my web portal in order to develop some kind of administration app that provides stats to admin users. I have some idea of how to develop it, but I'm not sure if it is the right thing to do. Basically, a listener will put a custom object inside the servlet context, and the login servlet will fill it with user information (and other information) every time a user logs in or out.
Thank you even if you only read it!
In fact, you always keep session data somewhere inside your application context. It's up to you where to keep it, depending on the workload - you may keep it either in the servlet itself (meaning its own memory) or somewhere else (for example, in a dedicated database).
Choosing the second option means additional interfaces and data transfers (between your servlet and the DB), but it's much more scalable and is the best option for huge workloads.
Put simply, if you have 10 active sessions and high activity, you are better off using local memory. If you have 100k+ active sessions and low activity, some shared resource is the better choice.
It is optimal for you to start with local memory and then perform some load testing to determine if you need a separate data domain for the sessions.
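One way to keep that later switch cheap is to hide the storage behind a small interface from the start. The sketch below is in Python rather than Java purely for brevity; `SessionStore` and `InMemorySessionStore` are hypothetical names, and a database-backed class would implement the same three methods.

```python
import threading
from abc import ABC, abstractmethod


class SessionStore(ABC):
    """What the login code and the admin app rely on."""

    @abstractmethod
    def login(self, user_id, info): ...

    @abstractmethod
    def logout(self, user_id): ...

    @abstractmethod
    def active_users(self): ...


class InMemorySessionStore(SessionStore):
    """Keeps sessions in local memory; suitable as a starting point."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def login(self, user_id, info):
        with self._lock:
            self._sessions[user_id] = info

    def logout(self, user_id):
        with self._lock:
            self._sessions.pop(user_id, None)

    def active_users(self):
        with self._lock:
            return dict(self._sessions)
```

If load testing later shows that local memory is not enough, only the concrete class needs to be replaced.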
I'm working on a consumer web app that needs to do a long running background process that is tied to each customer request. By long running, I mean anywhere between 1 and 3 minutes.
Here is an example flow. The object/widget doesn't really matter.
1) Customer comes to the site and specifies object/widget they are looking for.
2) We search/clean/filter for widgets matching some initial criteria. <-- long running process
3) Customer further configures more detail about the widget they are looking for.
4) When the long running process is complete the customer is able to complete the last few steps before conversion.
Steps 3 and 4 aren't really important. I just mention them because we can buy some time while we are doing the long running process.
The environment we are working in is a LAMP stack-- currently using PHP. It doesn't seem like a good design to have the long running process take up an apache thread in mod_php (or fastcgi process). The apache layer of our app should be focused on serving up content and not data processing IMO.
A few questions:
Is our thinking right in that we should separate this "long running" part out of the apache/web app layer?
Is there a standard/typical way to break this out under Linux/Apache/MySQL/PHP (we're open to using a different language for the processing if appropriate)?
Any suggestions on how to go about breaking it out? E.g. do we create a daemon that churns through a FIFO queue?
Edit: Just to clarify, only about 1/4 of the long running process is database centric. We're working on optimizing that part. There is some work that we could potentially do, but we are limited in the amount we can do right now.
Thanks!
Consider providing the search results via AJAX from a web service instead of your application. Presumably you could offload this to another server and let your web application deal with the content as you desire.
Just curious: 1-3 minutes seems like a long time for a lookup query. Have you looked at indexes on the columns you are querying to improve the speed? Or do you need to do some algorithmic process -- perhaps you could perform some of this offline and prepopulate some common searches with hints?
As Jonnii suggested, you can start a child process to carry out background processing. However, this needs to be done with some care:
Make sure that any parameters passed through are escaped correctly
Ensure that more than one copy of the process does not run at once
If several copies of the process run, there's nothing stopping a (not even malicious, just impatient) user from hitting reload on the page which kicks it off, eventually starting so many copies that the machine runs out of ram and grinds to a halt.
So you can use a subprocess, but do it carefully, in a controlled manner, and test it properly.
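One common way to guard against several copies running at once is an exclusive lock on a lock file. The sketch below uses Python's `fcntl.flock` (Unix only) purely as an illustration; the helper names are hypothetical, and PHP has its own `flock()` function that serves the same purpose.

```python
import fcntl


def try_acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def release_lock(handle):
    """Release the lock so the next run can start."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

The background script would call `try_acquire_lock` at startup and exit straight away if it gets `None`; using one lock file per customer would allow one copy per customer instead of one copy overall. The lock is also released automatically if the process dies. For the parameter-escaping concern, passing arguments as a list (for example to `subprocess.run`) rather than building a shell string avoids the shell altogether.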
Another option is to have a daemon permanently running waiting for requests, which processes them and then records the results somewhere (perhaps in a database)
This is the poor man's solution:
exec ("/usr/bin/php long_running_process.php > /dev/null &");
Alternatively you could:
Insert a row into your database with details of the background request, which a daemon can then read and process.
Write a message to a message queue, which a daemon then reads and processes.
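The database-row variant could look roughly like this. It is a sketch in Python with SQLite rather than PHP with MySQL; the `jobs` table layout and the function names are assumptions, and the claim step as written is only safe with a single daemon (several workers would need row locking or an atomic update).

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the database and make sure the hypothetical jobs table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )
    conn.commit()
    return conn


def enqueue(conn, payload):
    """Web tier: record the request and return its id immediately."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn):
    """Daemon: take the oldest pending job, or return None if there is none."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(conn, job_id, result):
    """Daemon: store the result where the web tier can read it."""
    with conn:
        conn.execute(
            "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )


def run_pending(conn, handler):
    """Process every pending job in FIFO order; returns how many were run."""
    count = 0
    while (job := claim_next(conn)) is not None:
        finish(conn, job[0], handler(job[1]))
        count += 1
    return count
```

The web page then only needs to poll the row's `status` column (for example via AJAX) until it reads `done`.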
Here's some discussion on the Java version of this problem.
See java: what are the best techniques for communicating with a batch server
Two important things you might do:
Switch to Java and use JMS.
Read up on JMS but use another queue manager. Unix named pipes, for instance, might be an acceptable implementation.
Java servlets can do background processing. You could do something similar in any web technology with threading support. I don't know about PHP though.
Not a complete answer, but I would think about using AJAX and passing the 2nd step to something that's faster than PHP (C, C++, C#), then having a PHP function pick the results off some stack, most likely just a database.