I am using Parse.com's httpRequest to retrieve the source code of a website.
Code:
Parse.Cloud.define("extract_website_simple", function(request, response) {
    return Parse.Cloud.httpRequest({
        url: 'http://bet.hkjc.com/marksix/index.aspx?lang=en'
    }).then(function(httpResponse) {
        response.success("code=" + httpResponse.text);
    }, function(error) {
        response.error("Error: " + error.code + " " + error.message);
    });
});
Question:
The HTML code cannot be retrieved. Instead, after about 10 seconds of loading, a ParseException appears, as follows:
com.parse.ParseException: i/o failure: java.net.SocketTimeoutException: Read timed out
How can I retrieve it properly without a timeout? It seems there is no way to increase the timeout length.
Thanks!
As is underlined by Parse support in many places, like the official Q&A, timeouts are low and they are not going to be changed, in order to keep performance good. Quote:
Héctor Ramos: I think that only two operations can run at any time in Cloud Code, so when you send three queries in parallel, the third one won't start until at least one of the first two has finished. Cloud Functions are not the best tool for long-running operations, and so they are limited to 15 seconds to keep Cloud Code performant for everybody. A better solution for long-running operations should be available shortly.
Official documentation says:
Resource Limits -> Timeouts
Cloud functions will be killed after 15 seconds of wall clock time. beforeSave, afterSave, beforeDelete, and afterDelete functions will be killed after 3 seconds of run time. If a Cloud function or a beforeSave/afterSave/beforeDelete/afterDelete function is called from another Cloud Code call, it will be further limited by the time left in the calling function. For example, if a beforeSave function is triggered by a cloud function after it has run for 13 seconds, the beforeSave function will only have 2 seconds to run, rather than the normal 3 seconds.
So even if you pay them thousands of dollars every month, they won't allow your function to run for more than 10-15 seconds. Parse is not a tool for everything; it is very specific. I run into its limitations all the time, like the lack of support for multipart forms with many attachments.
Parse.Cloud.job
To get up to 15 minutes per request with Parse, you can choose to work with Background Jobs. They support Promises with .then, which conserves server resources far better than typical nested anonymous callbacks.
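For concreteness, here is a minimal sketch (not production code) of the original httpRequest moved into a Background Job, using the standard Parse.Cloud.job signature; the "ScrapeResult" class used to store the page is a hypothetical name:

Parse.Cloud.job("extract_website_simple", function(request, status) {
    Parse.Cloud.httpRequest({
        url: 'http://bet.hkjc.com/marksix/index.aspx?lang=en'
    }).then(function(httpResponse) {
        // No client is waiting on a job, so persist the page source
        // somewhere queryable ("ScrapeResult" is a hypothetical class).
        var result = new Parse.Object("ScrapeResult");
        result.set("html", httpResponse.text);
        return result.save();
    }).then(function() {
        status.success("Scrape complete.");
    }, function(error) {
        status.error("Error: " + error.code + " " + error.message);
    });
});

You would then schedule or launch the job from the Parse dashboard instead of calling a Cloud Function from the client.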
If you use the free edition you won't love another limit: apps may have one job running concurrently per 20 req/s in their request limit. So you can run only a single Background Job in your app, and if you try to start another one: "Jobs that are initiated after the maximum concurrent limit has been reached will be terminated immediately." To get 4 background jobs running you would have to pay $700/month at current pricing.
If you need more time, or have less money, to parse tens of pages at once, you can choose a different technology for web scraping. There are many options; personally, my favorites are:
Node.js
If you like JavaScript on the server side, you could try Node.js. To start from the basics, you could follow the scotch.io tutorial.
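As a taste of what that looks like, here is a minimal sketch of the same page fetch in Node.js using the popular request package (my choice for the example; any HTTP client would do), with no platform-imposed 15-second limit:

// npm install request
var request = require('request');

request('http://bet.hkjc.com/marksix/index.aspx?lang=en', function(err, res, body) {
    if (err) {
        return console.error('Error: ' + err.message);
    }
    // body contains the raw HTML source of the page.
    console.log('code=' + body);
});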
PHP
Another alternative, with thousands of examples, is PHP. You could start with a tutorial-like answer on Stack Overflow itself.
Related
I am working on a web application that lets its users optionally execute long-running processes 'in the background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as a FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
new ThreadFactoryBuilder()
        .setNameFormat(clientId + "-BackgroundTask-%d")
        .setDaemon(true)
        .setPriority(Thread.MIN_PRIORITY)
        .build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
        backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I make my web server always prioritize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: it should never happen that a user has to wait a long time while browsing the site just because a lot of background tasks are executing. As you can see above, I tried to achieve this by setting .setPriority(Thread.MIN_PRIORITY). However, that does not seem to be sufficient.
Furthermore, for now I've set an arbitrary FixedThreadPool size (10) and use it globally for all the background handling of the application (and all its customers).
Instead, I would like to define a thread pool per customer, to make sure each customer has the same privilege to run a certain number of tasks in the background. Say each customer has a FixedThreadPool of size 5, and the server has a maximum of 50 different customers. That would add up to 250 background tasks running at the same time.
The most important requirement here is: it does not matter how long these background tasks take to execute (say 2 minutes, or 20 minutes). What is important is that each customer has the ability to send 5 tasks to be executed in the background, and that each of those is worked on equally.
I've tested running 30 CPU-intensive background tasks, and it turns out that while these are running and CPU is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices, and while the idea sounds great, I see a big challenge in splitting the necessary parts out of our monolithic application, mostly because nearly every operation might turn into a long-running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation? The only good thing would be that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1), or Thread.sleep in general, into CPU-heavy operations to reduce the amount of CPU they use. I've also read about someone who introduced this as an aspect, so that the sleep duration can even be changed dynamically in order to have some control over how much CPU is used.
However, my gut tells me that isn't right either. What do you think about introducing Thread.sleep to lower a task's CPU usage? Is this common practice? If not, what would be the right approach?
I would highly consider changing your system architecture to offload these long-running requests to a separate instance, instead of running them in-process with the general request-serving application. In general, I think it is an anti-pattern to handle both batch and online (long- and short-running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply deploy X instances of your existing application and configure your load balancer to route requests for the long-running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.
We have a Parse application with about 100K users.
Our queries on the user table time out.
For example, I'm doing the following query:
var query = new Parse.Query(Parse.User);
query.exists("email");
query.find(...);
This query will time out. If I limit the results to a low number, e.g. 10, I can get the first 10 results, but the next pages will time out. I.e., this will time out:
query.limit(10);
query.skip(500);
query.find(...);
We are currently in a situation where we are not able to manage our users. Whenever we try to get a list of users by some attribute, or change something for a batch of users, we get a timeout.
We tried running the queries in Cloud Code and with the JavaScript SDK. Both methods eventually fail with timeouts.
Am I doing something wrong or is it a Parse limitation?
Parse cloud functions have a timeout of 15 seconds, and before/after save triggers have a timeout of 3 seconds.
If you need more time, you should find a way to do what you need in a background job rather than a cloud function. Those have 15-minute timers, which is more than enough to do anything reasonable; for anything that requires more time, you'll have to find a way to save where you left off and have the job run multiple times until everything you wanted to do is complete.
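As an illustration (a rough sketch, not the answerer's code), such a resumable job could page through users ordered by createdAt and checkpoint its position in a hypothetical "JobState" class, so each run continues where the previous one stopped instead of relying on an ever-growing skip():

Parse.Cloud.job("processUsers", function(request, status) {
    var stateQuery = new Parse.Query("JobState");
    stateQuery.equalTo("name", "processUsers");

    stateQuery.first().then(function(state) {
        var query = new Parse.Query(Parse.User);
        query.exists("email");
        query.ascending("createdAt");
        query.limit(1000);
        if (state && state.get("lastCreatedAt")) {
            // Resume after the last processed user instead of using skip().
            query.greaterThan("createdAt", state.get("lastCreatedAt"));
        }
        return query.find().then(function(users) {
            if (users.length === 0) {
                return; // nothing left to process
            }
            // ... process this page of users here ...
            if (!state) {
                state = new Parse.Object("JobState");
                state.set("name", "processUsers");
            }
            state.set("lastCreatedAt", users[users.length - 1].createdAt);
            return state.save();
        });
    }).then(function() {
        status.success("Processed one page; run again to continue.");
    }, function(error) {
        status.error("Error: " + error.message);
    });
});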
I have a specific use case which I'm trying to solve using Node. The response time that I receive from Node.js is not what I expect.
The application is an Express.js web application. The flow is as follows:
a. The request reaches the server.
b. Based on the parameters, a backend REST service is invoked.
c. The response of the REST service has links to multiple other objects.
d. Navigate each of the links and aggregate the data.
e. This data is formatted (not much) and sent to the client.
The actual test data:
The response from (c) has 100 links, and hence I make 100 parallel calls (I'm using async.map). Each backend service responds in less than 30 ms, but the overall response time for the 100 requests is 4 seconds. This is considerably high.
What I have observed is:
The time difference between the first backend request and the last backend request is around 3 seconds. I believe this is due to the fact that Node is single-threaded and it takes 3 seconds to place all 100 HTTP requests.
The code that I use to make the parallel calls is given below:
var getIndividualRecord = function(entity, callback1) {
    httpExecutor.executeRequest(entity.link.url, callback1);
};

var aggregateData = function(err, results) {
    callback(null, results);
};

async.map(childObjects, getIndividualRecord, aggregateData);
childObjects is an array with 100 records. httpExecutor makes a REST invocation using the request module.
Is there something wrong I'm doing, or is this a wrong use case for Node?
Your assumption is correct: Node is single-threaded, so while your HTTP requests happen in a non-blocking manner (each request is made right after the other, without even waiting for the response from the server), they don't truly happen simultaneously.
So, yes, it's probable it takes Node 3 seconds to get through all these requests and process them.
There are a few ways "around" this, which might work depending on your situation:
Could you use Node's cluster module to spawn multiple Node apps, each doing a portion of the work? Then you would be doing things simultaneously, since you would have N Node processes going (see the sketch after this list).
Use a background queue mechanism (e.g. Resque, Beanstalk) and have a background worker (or a process spawned with cluster) distribute the work to Node worker processes waiting to pick things off the queue.
Refactor your web app a little to deal with the fact that some parts will take a while: perhaps render most of the page, then on load make an AJAX request that fires off the 3-second route and puts the results in some DOM element when the AJAX request comes back.
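For the first option, a minimal sketch of Node's built-in cluster module (forking one worker per CPU; the HTTP handler is just a placeholder) looks like this:

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    // The master only forks; each worker runs the server code below.
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }
} else {
    // Incoming connections are distributed across the workers, so a slow
    // batch of outbound requests in one worker doesn't stall the others.
    http.createServer(function(req, res) {
        res.end('handled by worker ' + process.pid + '\n');
    }).listen(8000);
}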
I have a similar scenario and a similar observation.
In my case I run a Node app using pm2. In the app there are 2 sub-servers (let's call them A and B), and pm2 spawns 2 processes per server. From a client I call server A; it calculates a simple thing and calls server B asynchronously. When server B responds, server A sends the data back to the client.
A very simple scenario, but when I used JMeter to create 1000 threads (each thread making 50 calls) to call server A, I got an average response time of around 4 seconds (for 50,000 calls).
Server B responds after 50 ms, and I think this is the problem: during the first 50 ms Node.js accepts lots of incoming requests, and then it cannot quickly process both the responses from server B and the new incoming calls.
I would expect application code to be executed in a single thread, but with background threads dealing with all the rest. It seems this is not the case.
We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck; it spent only 25 ms on DB queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each; we get only 1000 requests per minute with an average response time of 150 ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute). In other words, we had 40x the needed capacity on dynos when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only things I can think of are a noisy-neighbor problem on the boxes, or inconsistent performance in the routing layer. If anyone has additional info or ideas I would be curious. Thank you.
I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would be to recommend New Relic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at New Relic for this, or do it yourself using the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time).
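For illustration only (the exact header semantics are an assumption here; check Heroku's docs for your stack, which is presumably Ruby/unicorn), the calculation might look like this as Express middleware, treating the X-Request-Start header as a millisecond Unix timestamp:

var express = require('express');
var app = express();

// Log how long each request sat in the router queue before reaching the app.
app.use(function(req, res, next) {
    var start = parseInt(req.headers['x-request-start'], 10);
    if (!isNaN(start)) {
        console.log('queue time: ' + (Date.now() - start) + ' ms');
    }
    next();
});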
When those failed me in my case, I tried coming up with things that could go wrong, and here's an (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using its "Logplex", which it then drains to your own defined log drains, services such as Papertrail, etc. There is no documentation on the performance of this, and writes to stdout from your process could theoretically block while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and looked for correlation between the metrics and the times when I saw increased instances of "sporadic slow requests". It didn't help me, but it might help you :)
Do let us know what you come up with.
This is kind of a 2-part question.
1) Is there a max number of HttpWebRequests that can be run at the same time in WP7?
I'm going to create a ScheduledTaskAgent to run a PeriodicTask. There will be 2 different REST service calls: the first one will get a list of IDs for records that need to be downloaded, and the second service will be used to download those records one at a time. I don't know how many records there will be; my guesstimate would be ±50.
2) Would making all the individual record requests at once be a bad idea (assuming that it's possible), or should I wait for a request to finish before starting another?
Having just spent a week and a half getting a BackgroundAgent to stay within its memory limits, I would suggest doing them one at a time.
You lose about half your memory to system libraries and the like, and your first web request will take nearly another 20%, but it seems to reuse that memory on subsequent requests.
If you need to store the results in a local database, that is going to take a good chunk more. I have found that a CompiledQuery uses less memory, which means holding a single instance of your context.
Between each call I would suggest doing a GC.Collect(); I even add a short Thread.Sleep() just to be sure the process has some time to tidy things up.
Another thing I do is track how much memory I am using and attempt to exit gracefully when I get to around 97 or 98%.
You cannot use the debugger to test memory limits, as the debug memory is much higher and the limits are not enforced. However, for comparative testing between versions of your code, the debugger does produce very similar results on subsequent runs over the same code.
You can track your memory usage with Microsoft.Phone.Info.DeviceStatus.ApplicationCurrentMemoryUsage and Microsoft.Phone.Info.DeviceStatus.ApplicationMemoryUsageLimit
I write a status log into IsolatedStorage so I can see the results of runs on the phone, and I use ScheduledActionService.LaunchForTest() to kick them off. I then use ShellToast notifications to let me know when the task runs and when it completes, so I can launch my app to read the status log without interrupting it.
Tyler,
My 2 cents here.
I don't believe there is any restriction on how many HttpWebRequests you can spin up. These, however, have to be async, of course, and may be served from the browser stack. Most modern browsers, including IE9, handle over 5 concurrent requests to the same domain, but you are not guaranteed a request handle immediately. However, it should not matter if you are willing to wait on a separate thread, dump your content onto the request pipe, and wait for the response on yet another thread. This post (here) has a nice walkthrough of why we need to do this.
Nothing wrong with this approach either, IMO. You're just going to have to wait until all the requests have their respective pipelines and then wait for the responses.
Thanks!
1) Your memory limit in a PeriodicTask or ResourceIntensiveTask is 5 MB, so you definitely should control your requests really carefully. I don't think there is a limit in the code.
2) You have only 5 MB, so if you start all your requests at the same time, the task will be terminated immediately.
3) I think you would be better off using a ResourceIntensiveTask, because a PeriodicTask should only run for 15 seconds.
A good guide to the multitasking features in Mango: http://blogs.infosupport.com/blogs/alexb/archive/2011/05/26/multi-tasking-in-windows-phone-7-1.aspx
I seem to remember (but can't find the reference right now) that the maximum number of requests the OS can make at once is 7. You should avoid making that many at once, though, as it will stop other/system apps from being able to make requests.