How to properly match your own domain in "URL Patterns to Include" in the JMeter Script Recorder

Our site makes 200+ requests. About 100 are our own URLs (images, CSS, JS, fonts etc.); the other 100 are Google Analytics, New Relic, Tealium, and lots of dross.
I want to match all, and only, requests to our site, which is www.mysite.com.
In "URL Patterns to Include" I tried:
.*mysite.com.*
But this also includes many of the marketing requests, because they carry the site name in their URL parameters.
Next I tried this:
https:\/\/mysite.com.*
https:\/\/www.mysite.com.*
but got no results back.
What is the proper way to include only, and all, resources loaded from your own domain?
I think this could be the way:
^www.mysite.com.*
It seems to return the right number of requests (when I clear the cache before recording, of course).
Is this the best solution?

If you look at the ProxyControl.generateMatchUrl() function source code you will see the following:
private String generateMatchUrl(HTTPSamplerBase sampler) {
    StringBuilder buf = new StringBuilder(sampler.getDomain());
    buf.append(':'); // $NON-NLS-1$
    buf.append(sampler.getPort());
    buf.append(sampler.getPath());
    if (sampler.getQueryString().length() > 0) {
        buf.append('?'); // $NON-NLS-1$
        buf.append(sampler.getQueryString());
    }
    return buf.toString();
}
Pay attention to the sampler.getDomain() bit, which returns the DNS hostname or IP address of the URL. The match URL never contains a protocol, so if you add one (http or https) to your pattern, the function will not match anything.
So you will have to provide patterns without the protocol section, like the ones in the "Suggested Excludes".
If you have to include the protocol, I think you will need to reconsider your approach to recording and switch to e.g. the JMeter Chrome Extension, which lets you filter the requests including the protocol.
Moreover, you won't have to worry about proxies, certificates, etc.
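To see the effect, here is a minimal sketch using java.util.regex (JMeter itself uses the Perl5-compatible Jakarta ORO engine, which behaves the same for these simple patterns) against a hypothetical asset request to www.mysite.com:

import java.util.regex.Pattern;

public class IncludePatternCheck {
    public static void main(String[] args) {
        // What generateMatchUrl() would produce for a recorded request to
        // https://www.mysite.com/assets/app.css - note there is no protocol prefix
        String matchUrl = "www.mysite.com:443/assets/app.css";

        // A pattern without the protocol matches the generated match URL
        System.out.println(Pattern.matches("^www\\.mysite\\.com.*", matchUrl));        // true
        // A pattern with the protocol can never match, so nothing gets recorded
        System.out.println(Pattern.matches("https://www\\.mysite\\.com.*", matchUrl)); // false
    }
}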

Related

JMeter: How to record a script for specific domain URLs?

I am trying to capture a script using the HTTP(S) Test Script Recorder in JMeter 3.0. When I start capturing, URLs from other domains (e.g. download.cdn.mozilla.net) also get captured. I don't want these URLs to be recorded; I want to record URLs for a specific domain only.
So, how can I achieve this in JMeter 3.0?
Note: I tried using "URL Patterns to Exclude", but as I cannot predict the other domain URLs, I don't want to use this option.
I also tried "URL Patterns to Include" by specifying a specific domain, i.e. ^((?!DOMAINNAME).)*$, but it is still recording the other domain URLs.
I would recommend breaking down your requirement into 2 parts:
Include your domain only
Exclude everything else
So, given I want to record the JMeter Home Page and filter out any external resources, the relevant configuration would be:
URL Patterns to Include: .*jmeter.apache.org.*
URL Patterns to Exclude: .*
Both inputs accept Perl5-compatible regular expressions, so double check that the values you're providing match (or don't match) the URL patterns captured by JMeter.
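As a quick sanity check, here is a sketch (plain java.util.regex standing in for the Perl5 engine, with illustrative match URLs) of how the include pattern separates own-domain requests from third-party ones:

import java.util.regex.Pattern;

public class IncludeFilterCheck {
    public static void main(String[] args) {
        // Illustrative match URLs as the recorder builds them (domain:port/path)
        String own   = "jmeter.apache.org:443/index.html";
        String third = "www.google-analytics.com:443/analytics.js";

        Pattern include = Pattern.compile(".*jmeter\\.apache\\.org.*");

        System.out.println(include.matcher(own).matches());   // true  - kept by the include pattern
        System.out.println(include.matcher(third).matches()); // false - filtered out
    }
}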
References:
JMeter Regular Expressions
Excluding Domains From The Load Test

Check if two URLs are for the same website

I'm looking for a way to compare two URLs. I can do:
URI('http://www.test.com/blabla').host
to get the host, but this is not reliable. For example:
URI('http://www.test.com/blabla').host == URI('http://test.com/blabla').host
returns false, but they can be the same site. Comparing the IP address is not reliable either, because if I do:
IPSocket.getaddress(URI('http://hello.herokuapp.com').host) ==
IPSocket.getaddress(URI('http://test.herokuapp.com').host)
It returns true, but they are not the same site. Is there a more reliable way?
The site under http://foo.com can be the same as the one under http://www.foo.com, but it can also be a totally different site, depending on the web server configuration. It also depends on the DNS configuration: which IP the www name points to and which the bare domain points to.
If you want to compare two sites, you need to fetch the content and compare key parts (using Nokogiri, for example) for similarities.
Nowadays, due to sidebars and news feeds, two consecutive requests to the same URL give slightly different HTML responses.
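To make "compare key parts" concrete, here is a rough sketch in Java using jsoup (standing in for Nokogiri); comparing only the <title> is an arbitrary choice of "key part", and the URLs are the ones from the question:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SameSiteCheck {
    // Very rough heuristic: two URLs probably point at the same site
    // if their pages share the same <title>.
    static boolean probablySameSite(String urlA, String urlB) throws Exception {
        Document a = Jsoup.connect(urlA).get();
        Document b = Jsoup.connect(urlB).get();
        return a.title().equalsIgnoreCase(b.title());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(probablySameSite("http://test.com/blabla",
                                            "http://www.test.com/blabla"));
    }
}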

How to force the JMeter proxy server to listen only for a specified page

I am using the JMeter proxy to record the steps I take in a browser. Is there a possibility to set up the proxy server to listen for just one specific page?
I want to record only the steps taken on www.test123.com
Thanks
Use an include pattern to restrict the requests that are recorded.
As per http://jmeter.apache.org/usermanual/component_reference.html#HTTP_Proxy_Server
The include and exclude patterns are treated as regular expressions (using Jakarta ORO). They will be matched against the host name, port (actual or implied) path and query (if any) of each browser request. If the URL you are browsing is
"http://jmeter.apache.org/jmeter/index.html?username=xxxx" ,
then the regular expression will be tested against the string:
"jmeter.apache.org:80/jmeter/index.html?username=xxxx" .
Thus, if you want to include all .html files, your regular expression might look like:
".*\.html(\?.*)?" - or ".*\.html" if you know that there is no query string or you only want html pages without query strings.
"www.test123.com.*" should record requests only from the given URL.

Detecting URL rewrites (SEO URLs)

How could a client detect if a server is using search engine optimization techniques such as mod_rewrite to implement "SEO friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the generated HTML dynamic, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As said before: are there <link rel="canonical"> tags? Then the website is telling the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1".
If there is no rewrite rule, the server will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases will return a 200 page printing an error.
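A rough sketch of that 404 probe, using Java's built-in HttpClient; the somedomain.com URL is the example from the question with the "1" swapped for an arbitrary value:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RewriteProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Take the rewritten URL from the question and modify the "1" part
        HttpRequest probe = HttpRequest.newBuilder()
                .uri(URI.create("http://somedomain.com/pic/999999"))
                .GET()
                .build();

        int status = client.send(probe, HttpResponse.BodyHandlers.ofString()).statusCode();

        // A bare 404 hints that nothing handles the path; a 200 "error" page hints
        // that a script (and therefore a rewrite rule) is answering the URL.
        System.out.println("Status for modified URL: " + status);
    }
}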
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually in the words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with fully URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
The SEO aspect of the URL is thus the suffix. The algorithm to apply is to typify each "folder" after the common base, assigning it a "datatype" (numeric, text, alphanumeric), and then score as follows:
HTTP response code is 200: should be obvious, but you can get a 404 at www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non-numeric, with separators, spell-checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check, scoring if the words are valid, including proper names.
Spell-checked URL text on page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell-checked URL text on page inside a tag: if the prior check is true, score again if the text in its entirety is inside an HTML tag.
Tag is important: if the prior check is true and the tag is a <title> or <h#> tag.
Usually with this approach you'll have a maximum of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. You can probably improve this by using a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus come up with some other clever featurizations. But then you've got to train the algorithm, which may not be worth it.
Now, based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because query parameters are now part of the URL instead. In that case you can still typify the suffix's folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as being SEO friendly as well.
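A toy sketch of that scoring idea (the page title, and the letters-and-dashes test standing in for a real spell check, are just illustrative; the real thing would fetch the page and use a dictionary):

public class SeoUrlScore {
    // Toy scoring sketch of the approach above. The page title is assumed to
    // have been fetched already, and the "spell check" is reduced to a simple
    // lower-case-letters-and-dashes test.
    static int score(String urlSuffix, String pageTitle) {
        int score = 0;
        for (String folder : urlSuffix.split("/")) {
            if (folder.isEmpty() || folder.matches("\\d+")) {
                continue; // numeric folders (years, ids) carry no SEO signal
            }
            if (folder.matches("[a-z]+(-[a-z]+)+")) {
                score++; // non-numeric folder with separators
                String phrase = folder.replace('-', ' ');
                if (pageTitle.toLowerCase().contains(phrase)) {
                    score += 2; // the URL words also appear in an important tag
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("article/2011/06/15/man-bites-dog",
                                 "Man bites dog - Daily Example")); // prints 3
    }
}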
I presume you would be using one of the curl variants.
You could try sending the same request but with different "user agent" values.
e.g. send the request once using the user agent "Mozilla/5.0" and a second time using the user agent "Googlebot"; if the server is doing something special for web crawlers, there should be a different response.
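A sketch of that probe (the URL is the example from the question; a difference between the two bodies is only a hint, since sites can vary responses for other reasons too):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserAgentProbe {
    static String fetch(HttpClient client, String url, String userAgent) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = "http://somedomain.com/pic/1"; // example URL from the question

        String asBrowser = fetch(client, url, "Mozilla/5.0");
        String asCrawler = fetch(client, url, "Googlebot/2.1 (+http://www.google.com/bot.html)");

        // A different body for the crawler suggests the server special-cases bots
        System.out.println(asBrowser.equals(asCrawler) ? "same response" : "different response");
    }
}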
With today's frameworks and the URL routing they provide, I don't even need to use mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change. Dunno how legal that is, tbh.
For the dynamic URL pattern, it's better to use a <link rel="canonical" href="..." /> tag for the other duplicates.

Using the HTTP Range Header with a range specifier other than bytes?

The core question is about the use of the HTTP headers, including Range, If-Range, Accept-Ranges and a user-defined range specifier.
Here is a manufactured example to help illustrate my question. Assume I have a Web 2.0 style application that displays some sort of human readable documents. These documents are editorially broken up into pages (similar to articles you see on news websites). For this example, assume:
There is a document titled "HTTP Range Question" that is broken up into three pages.
The shell page (/document/shell/http-range-question) knows the meta information about the document, including the number of pages.
The first readable page of the document is loaded during the page onload event via an ajax GET and inserted onto the page.
A UI control that looks like [ 1 2 3 All ] is at the bottom of the page, and clicking on a number will display that readable page (also loaded via ajax), and clicking "All" will display the entire document. Assume these URLs for the 1, 2, 3 and All use cases:
/document/content/http-range-question?page=1
/document/content/http-range-question?page=2
/document/content/http-range-question?page=3
/document/content/http-range-question
Now to the question. Can I use the HTTP Range header instead of part of the URL (e.g. a querystring parameter)? Maybe something like this on the GET /document/content/http-range-question request:
Range: page=1
It looks like the spec only defines byte ranges as allowable, so even if I made my ajax calls work with my browser and server code, anything in the middle could break the contract (e.g. a caching proxy server).
Range: bytes=0-499
Any opinions or real world examples of custom range specifiers?
Update: I did find a similar question about the Range header (Paging in a Rest Collection) where they mention that Dojo's JsonRestStore uses a custom Range header value.
Range: items=0-24
Absolutely - you are free to specify any range units you like.
From RFC 2616:
3.12 Range Units
HTTP/1.1 allows a client to request that only part (a range of) the response entity be included within the response. HTTP/1.1 uses range units in the Range (section 14.35) and Content-Range (section 14.16) header fields. An entity can be broken down into subranges according to various structural units.

range-unit       = bytes-unit | other-range-unit
bytes-unit       = "bytes"
other-range-unit = token

The only range unit defined by HTTP/1.1 is "bytes". HTTP/1.1 implementations MAY ignore ranges specified using other units.
The key piece is the last paragraph. Really what it's saying is that when they wrote the spec for HTTP/1.1, they only outlined the "bytes" token. But, as you can see from the 'other-range-unit' bit, you are free to come up with your own token specifiers.
Coming up with your own Range specifiers does mean that you have to have control over the client and server code that uses that specifier. So, if you own the backend piece that exposes the "/document/content/http-range-question" URI, you are good to go; presumably you're using a modern web framework that lets you inspect the request headers coming in. You could then look at the Range values to perform the backing query correctly.
Furthermore, if you control the AJAX code that makes requests to the backend, you should be able to set the Range header yourself.
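For instance, the server-side part could be as small as pulling your custom unit out of the header value your framework hands you. A sketch, assuming the "page" unit proposed in the question:

public class PageRangeHeader {
    // Parses a custom "Range: page=N" header value into a page number, or -1 if
    // the header is absent or uses a different unit. The "page" unit is the
    // custom specifier from the question, not anything defined by HTTP itself.
    static int parsePageRange(String rangeHeader) {
        if (rangeHeader == null || !rangeHeader.startsWith("page=")) {
            return -1; // fall back to returning the whole document
        }
        try {
            return Integer.parseInt(rangeHeader.substring("page=".length()).trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePageRange("page=2"));      // 2
        System.out.println(parsePageRange("bytes=0-499")); // -1, not our unit
        System.out.println(parsePageRange(null));          // -1, no Range header
    }
}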
However, there is a potential downside which you anticipate in your question: the potential to break caching. If you are using a custom Range unit, any caches between your client and the origin servers "MAY ignore ranges specified using [units other than 'bytes']". So for example, if you had a Squid/Varnish cache between the front and backend, there's no guarantee that the results you're hoping for will be served from the cache!
You might also consider an alternative implementation where, rather than using a query string, you make the page a "parameter" of the URI; e.g.: /document/content/http-range-question/page/1. This would likely be a little more work for you server-side, but it's HTTP/1.1 compliant and caches should treat it properly.
Hope this helps.
bytes is the only unit supported by the HTTP/1.1 specification.
HTTP Range is typically used for recovering interrupted downloads without starting from the beginning.
What you're trying to do would be better handled by OAI-ORE, which allows you to define relationships between multiple documents (alternative formats, components of the whole, etc.).
Unfortunately, it's a relatively new metadata format, and I don't know of any web browsers that ship with native support.
It sounds like you want to change the HTTP spec just to remove a querystring parameter. In order to do this you'd have to modify code on both the client to send the modified header and the server to read from the "Range" header instead of the querystring.
The end result is that this will probably work, but you're breaking all of the standards and existing tools to do so.
