JMeter and page views - jmeter

I'm trying to use data from google analytics for an existing website to load test a new website. In our busiest month over an hour we had 8361 page requests. So should I get a list of all the urls for these page requests and feed these to jMeter, would that be a sensible approach? I'm hoping to compare the page response times against the existing website.

If you need to do this very quickly, say you have less than an hour for scripting, in that case you can do this way to compare that there are no major differences between 2 instances.
If you would like to go deeper:
8361 requests per hour == 2.3 requests per second so it doesn't make any sense to replicate this load pattern as I'm more than sure that your application will survive such an enormous load.
Performance testing is not only about hitting URLs from list and measuring response times, normally the main questions which need to be answered are:
how many concurrent users my application can support providing acceptable response times (at this point you may be also interested in requests/second)
what happens when the load exceeds the threshold, what types of errors start occurring and what is the impact.
does application recover when the load gets back to normal
what is the bottleneck (i.e. lack of RAM, slow DB queries, low network bandwidth on server/router, whatever)
So the options are in:
If you need "quick and dirty" solution you can use the list of URLs from Google Analytics with i.e. CSV Data Set Config or Access Log Sampler or parse your application logs to replay production traffic with JMeter
Better approach would be checking Google Analytics to identify which groups of users you have and their behavioral patterns, i.e. X % of not authenticated users are browsing the site, Y % of authenticated users are searching, Z % of users are doing checkout, etc. After it you need to properly simulate all these groups using separate JMeter Thread Groups and keep in mind cookies, headers, cache, think times, etc. Once you have this form of test gradually and proportionally increase the number of virtual users and monitor the correlation of increasing response time with the number of virtual users until you hit any form of bottleneck.

The "sensible approach" would be to know the profile, the pattern of your load.
For that, it's excellent you're already have these data.
Yes, you can feed it as is, but that would be the quick & dirty approach - while get the data analysed, patterns distilled out of it and applied to your test plan seems smarter.

Related

Fetch third party data in a periodic interval

I've an application with 10M users. The application has access to the user's Google Health data. I want to periodically read/refresh users' data using Google APIs.
The challenge that I'm facing is the memory-intensive task. Since Google does not provide any callback for new data, I'll be doing background sync (every 30 mins). All users would be picked and added to a queue, which would then be picked sequentially (depending upon the number of worker nodes).
Now for 10M users being refreshed every 30 mins, I need a lot of worker nodes.
Each user request takes around 1 sec including network calls.
In 30 mins, I can process = 1800 users
To process 10M users, I need 10M/1800 nodes = 5.5K nodes
Quite expensive. Both monetary and operationally.
Then thought of using lambdas. However, lambda requires a NAT with an internet gateway to access the public internet. Relatively, it very cheap.
Want to understand if there's any other possible solution wrt the scale?
Without knowing more about your architecture and the google APIs it is difficult to make a recommendation.
Firstly I would see if google offer a bulk export functionality, then batch up the user requests. So instead of making 1 request per user you can make say 1 request for 100k users. This would reduce the overhead associated with connecting and processing/parsing of the message metadata.
Secondly i'd look to see if i could reduce the processing time, for example an interpreted language like python is in a lot of cases much slower than a compiled language like C# or GO. Or maybe a library or algorithm can be replaced with something more optimal.
Without more details of your specific setup its hard to offer more specific advice.

How to determine the number of users to use in Jmeter performance testing?

This can actually be a complex question for me, if the number is not given directly from my product owner as a direct requirement.
Jmeter is basically an API performance testing tool. I've seen so many Jmeter scripts that include only the important APIs needed for the flow to be tested. It does not consider any pure front end (UI) related user actions.
A common reply to my question in article/tutorial is: estimate how many concurrent users you normally have on your website?
The problem with this approach is that user purely browsing your website is not causing any 'load' that Jmeter try to simulate.
If user is using a form submission webpage for example, every second user uses to browse the page content, or filling the data in the form are pure front end (UI) activity and does not lead to any 'load'. There may be 10 concurrent user visiting my webpage, but only 2 are 'submitting'. Should i use 10 or 2 in this scenario? The Jmeter script is intended to only measure the performance of 'submit form' API.
Another most sophisticated reply to my question is 'load testing calculator' mentioned in https://www.webperformance.com/library/tutorials/CalculateNumberOfLoadtestUsers.
It calculate concurrent number of user from 'visit rate (visits/hour)' and 'Average visit length (minutes/visit)'. This is more precise that the 1st reply of just 'estimate how many concurrent users are using your system?'.
However it has the same issue as the 1st reply in that it does not define 'Average visit length (minutes/visit)' as 'Average visit length' from API perspective. The same argument i present for the form submission website applies here too. The 'visit' time a user spend on browsing the page, filling the form does not count, only the time he spend on 'submit form' API does.
So what's your way of determining the number of users to use in Jmeter test?
Jmeter is basically an API performance testing tool. - this is wrong
Think of each JMeter thread (virtual user) as of the real user with all its attributes like:
using a real browser
needing some time to "think" between operations
Once you implement your JMeter test so each JMeter virtual user represents a real user with 100% accuracy - you will be able to tell how many users your website can handle without issues by looking at i.e. Active Threads Over Time chart.
If you need to know how many requests per second are X virtual users making - check out Server Hits Per Second chart.

Distributed crawling and rate limiting / flow control

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the urls and putting the results into an Elastic Search engine. The system continuously keeps re-crawling the found url's with a interval of X milliseconds.
This has served me well but with some new large clients coming up the crawler is going to hit it's limits. I need to redesign the system to a distributed crawler to speed up the crawling. The problem is the combination of specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
variable rate-limit per client. I need to be very sure the system doesn't crawl client X more then once every X milliseconds.
What i have tried:
I tried putting the url's in a MySQL table and let the workers query for a url to crawl based on last_crawled_at timestamps in the clients and urls table. But MySQL doesn't like multiple concurrent workers and i receive all sorts of deadlocks.
I tried putting the url's into a Redis engine. I got this kinda working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way to hackish.
I thought about filling a regular queue but this will violate rule number 2 as i can't be 100% sure the workers can process the queue 'real-time'.
Can anybody explain me how the big boys do this? How can we have multiple processes query a big/massive list of url's based on a few criteria (like rate limiting the client) and make sure we hand out the the url to only 1 worker?
Ideally we won't need another database besides Elastic with all the available / found urls's but i don't think that's possible?
Have a look at StormCrawler, it is a distributed web crawler with has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and having by default a single thread per host or domain.

Should requests contain unnecessary parameters which are sent if manually browsing the application

I'm currently testing a asp.net application. I have recorded all the steps i need and i have noticed that if i remove some of the parameters that i'm sending with the request the scripts still work and the desired outcome still happens. Anyway i couldn't find difference in the response time with them or without them, and i was wondering can i remove those parameters which are not needed and is this going to impact the performance in any way? I understand that the most realistic way of executing the scripts should be to do it like a normal user does (send all which is sent with normal usage) but this would really improve the readability of my scripts, any idea?
Thank you in advance and here is a picture which shows for example some parameters which i can remove and the scripts still work this is from a document management system and i'm performing step which doesn't direct the document as the parameters say but the normal usage records those :
Although it may be something very trivial like pre-populating date and time in calendar in user's time zone I believe you shouldn't be omitting any request parameters.
I strongly believe that load testing should mimic real user as close as possible so if it is not a big deal to send these extra parameters and perform their correlation - I would leave them.
Few other tips:
Embedded Resources (scripts, styles, images). Real-browsers download these entities so
Make sure you have "Retrieve All Embedded Resources" box checked
Make sure you "Use concurrent pool" size 3-5 threads
Filter out any "external" stuff via "URLs must match" input
Well-behaved browsers download embedded resources but do it only once. On subsequent requests they're being returned from browser's cache. Add HTTP Cache Manager to your Test Plan to simulate browser cache.
Add HTTP Cookie Manager to represent browser cookies and deal with cookie-based authentication.
See How To Make JMeter Behave More Like A Real Browser article for above tips explained just in case you want to dive into details
Less data to send, faster response time (normally).
Like you said, it's more realistic to test with all data from the recorded case, but if these parameters really doesn't impact your result and measured time, you can remove them for a better readability.
Sometimes jmeter records not necessary parameters because they are only needed for brower compability.

Does the Google Analytics API throttle requests?

Does the Google Analytics API throttle requests?
We have a batch script that I have just moved from v2 to v3 of the API and the requests go through quite well for the first bit (50 queries or so) and then they start taking 4s or so each. Is this Google throttling us?
While Matthew is correct, I have another possibility for you. Google analytics API cashes your requests to some extent. Let me try and explain.
I have a customer / site that I request data from. While testing I noticed some strange things.
the first million rows results would come back with in an acceptable amount of time.
after a million rows things started to slow down we where seeing results returning in 5 times as much time instead of 5 minutes we where waiting 20 minutes or more for the results to return.
Example:
Request URL :
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:34896748&dimensions=ga:date,ga:sourceMedium,ga:country,ga:networkDomain,ga:pagePath,ga:exitPagePath,ga:landingPagePath&metrics=ga:entrances,ga:pageviews,ga:exits,ga:bounces,ga:timeOnPage,ga:uniquePageviews&filters=ga:userType%3D%3DReturning+Visitor;ga:deviceCategory%3D%3Ddesktop&start-date=2014-05-12&end-date=2014-05-22&start-index=236001&max-results=2000&oauth_token={OauthToken}
Request Time (seconds:milliseconds): :0:484
Request URL :
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:34896748&dimensions=ga:date,ga:sourceMedium,ga:country,ga:networkDomain,ga:pagePath,ga:exitPagePath,ga:landingPagePath&metrics=ga:entrances,ga:pageviews,ga:exits,ga:bounces,ga:timeOnPage,ga:uniquePageviews&filters=ga:userType%3D%3DReturning+Visitor;ga:deviceCategory%3D%3Ddesktop&start-date=2014-05-12&end-date=2014-05-22&start-index=238001&max-results=2000&oauth_token={OauthToken}
Request Time (seconds:milliseconds): :7:968
I did a lot of testing stopping and starting my application. I couldn't figure out why the data was so fast in the beginning then slow later.
Now I have some contacts on the Google Analytics Development team the guys in charge of the API. So I made a nice test app, logged some results showing my issue and sent it off to them. With the question Are you throttling me?
They where also perplexed, and told me there is no throttle on the API. There is a flood protection limit that Matthew speaks of. My Developer contact forwarded it to the guys in charge of the traffic.
Fast forward a few weeks. It seams that when we make a request for a bunch of data Google cashes the data for us. Its saved on the server incase we request it again. By restarting my application I was accessing the cashed data and it would return fast. When I let the application run longer I would suddenly reach non cashed data and it would take longer for them to return the request.
I asked how long is data cashed for, answer there was no set time. So I don't think you are being throttled. I think your initial speedy requests are cashed data and your slower requests are non cashed data.
Email back from google:
Hi Linda,
I talked to the engineers and they had a look. The response was
basically that they thinks it's because of caching. The response is
below. If you could do some additional queries to confirm the behavior
it might be helpful. However, what they need to determine is if it's
because you are querying and hitting cached results (because you've
already asked for that data). Anyway, take a look at the comments
below and let me know if you have additional questions or results that
you can share.
Summary from talking to engineer: "Items not already in our cache will
exhibit a slower retrieval processing time than items already present
in the cache. The first query loads the response into our cache and
typical query times without using the cache is about 7 seconds and
with using the cache is a few milliseconds. We can also confirm that
you are not hitting any rate limits on our end, as far as we can tell.
To confirm if this is indeed what's happening in your case, you might
want to rerun verified slow queries a second time to see if the next
query speeds up considerably (this could be what you're seeing when
you say you paste the request URL into a browser and results return
instantly)."
-- IMBA Google Analytics API Developer --
Google's Analytics API does have a rate limit per their docs: https://developers.google.com/analytics/devguides/reporting/core/v3/coreErrors
However they should not caused delayed requests, rather the request should be returned with a response of: 403 userRateLimitExceeded
Description of that error:
Indicates that the user rate limit has been exceeded. The maximum rate limit is 10 qps per IP address. The default value set in Google Developers Console is 1 qps per IP address. You can increase this limit in the Google Developers Console to a maximum of 10 qps.
Google's recommended course of action:
Retry using exponential back-off. You need to slow down the rate at which you are sending the requests.

Resources