Robots.txt to allow a few specific crawlers and deny all others

I have been getting a lot of CPU spikes on my server recently, and I suspect that some of the traffic isn't real. So for now I want to allow only the Google bots, MSN, and Yahoo. Please tell me whether the following robots.txt file is correct for this requirement.
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Mediapartners-Google*
User-agent: Googlebot-Image
User-agent: Yahoo-MMCrawler
Disallow:
User-agent: *
Disallow: /
Thanks.

Your robots.txt seems to be valid.
It is allowed to have several User-agent lines in a record.
An empty Disallow: allows crawling everything.
The record starting with User-agent: * only applies to bots not matched by the previous record.
Disallow: / forbids crawling anything.
But note: only well-behaved bots follow the rules in robots.txt, and well-behaved bots are unlikely to exceed common crawling frequencies. So either you need to work on your server's performance, or not-so-nice bots are to blame.
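If you want to sanity-check the file yourself, Python's urllib.robotparser applies the same record-matching rules. This is just a quick sketch, and the URL is a placeholder rather than one of your pages:

from urllib import robotparser

# The rules from the question, pasted verbatim.
rules = """
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Mediapartners-Google*
User-agent: Googlebot-Image
User-agent: Yahoo-MMCrawler
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The named crawlers should be allowed; anything else falls through
# to the "User-agent: *" record and is denied.
for agent in ("Googlebot", "Slurp", "msnbot", "SomeOtherBot"):
    print(agent, rp.can_fetch(agent, "http://example.com/some/page"))

It should print True for the listed crawlers and False for SomeOtherBot.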

That first Disallow: should probably be:
Allow: /
if you want to, in fact, allow all those user agents to index your site.

Related

Magento 2 Cross site scripting vulnerability PCI scan

A PCI scan on our servers recently failed due to a cross-site scripting vulnerability on Magento 2 pages. They provided the request below to reproduce the issue.
REQUEST:
GET /2018-2019-random-url.html?<script>alert('TK00000031')</script> HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Host: example.com
Content-Type: text/html
Content-Length: 0
Evidence: <script>alert('TK00000031')</script>
But we couldn't reproduce it in any browser. Only IE detects it as cross-site scripting and stops executing the page. Magento support also didn't help us, and we're not sure how to fix it.
Has anyone faced the same issue and found a resolution?

Redirect domain name in HAProxy while preserving the URL path

I have HAProxy as a front layer to our Node.js app and I'm looking for a way to rewrite the URL. For example, if customers go to https://aaa.example/product/123, HAProxy will rewrite it to https://bbb.example/product/123. Is this possible to do in HAProxy? The important part is that we want to preserve the URL path (product/123) and just change the host name.
You say you want to "rewrite" but that is a term that is often misused. What is your intention?
Do you want to rewrite the incoming URL and then change the address in the browser's address bar with an HTTP redirect?
If so, in proxy configuration:
http-request redirect prefix https://example.org if { hdr(host) -i example.com }
This changes example.com to example.org and tells the browser to ask again.
Test:
curl -v https://example.com/foo/bar/1234?query=yes
...
< HTTP/1.1 302 Found
< Cache-Control: no-cache
< Content-length: 0
< Location: https://example.org/foo/bar/1234?query=yes
< Connection: close
This is the simplest solution if it fits your need, because the net result is that the browser is actually making the correct request itself, reducing the potential for unexpected behavior.
Or do you want to change the Host: header that the backend server sees, but not send a redirect, and leave the browser's address bar as it was?
This changes example.com to example.org in the Host: header that the back-end server sees in the request from HAProxy.
http-request set-header Host example.org if { hdr(host) -i example.com }
This will do exactly what is intended, but it may not have the desired result, particularly if the application is aware of other inconsistencies, such as the incoming Referer: or Origin: being inconsistent with the Host:, or if it's doing non-portable things with cookies, in which case further header rewriting (possibly in both directions) or application changes may be necessary.
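For example, if the back-end server issues redirects that point at example.org, a response-side rewrite along these lines sends the browser back to example.com. This is only a sketch under the same example.com/example.org assumption, not something tested against your application:

# Same request-side rewrite as above.
http-request set-header Host example.org if { hdr(host) -i example.com }
# Rewrite Location headers in back-end responses so redirects keep
# the host name the browser originally asked for.
http-response replace-header Location ^(https?://)example\.org(.*)$ \1example.com\2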

Does the Google Maps API block bots?

When I checked a URL in Google Webmaster Tools, I found the following issue: I have added the Google Places JavaScript API to my web page, but the calls were blocked when the page was fetched by Googlebot. How can I handle this?
Click through the "robots.txt" link, and see what it says.
I think you'll see:
User-agent: *
Allow: /maps/api/js?
Allow: /maps/api/js/DirectionsService.Route
Allow: /maps/api/js/DistanceMatrixService.GetDistanceMatrix
Allow: /maps/api/js/ElevationService.GetElevationForLine
Allow: /maps/api/js/GeocodeService.Search
Allow: /maps/api/js/KmlOverlayService.GetFeature
Allow: /maps/api/js/KmlOverlayService.GetOverlays
Allow: /maps/api/js/LayersService.GetFeature
Disallow: /
... which means that the /maps-api-v3/... paths you're trying are indeed disallowed.

Do we need the "Expect: 100-continue" header in the XFire request?

I found that Apache XFire adds one header parameter to its POST request header:
POST /testservice/services/TestService1.1 HTTP/1.1
SOAPAction: "testAPI"
Content-Type: text/xml; charset=UTF-8
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; XFire Client +http://xfire.codehaus.org)
Host: 192.168.10.111:9082
Expect: 100-continue
Will this Expect: 100-continue make the round trip between the XFire client and its endpoint server a bit wasteful, because it uses one more exchange for the origin server to return its "willing to accept the request" reply?
This is just my guess.
Vance
I know this is an old question, but since I was just researching the subject, here is my answer. You don't really need to use "Expect: 100-continue", and it does indeed introduce an extra round trip. The purpose of this header is to indicate to the server that you want your request to be validated before you post the data. It also means that if the header is set, you are committed to waiting (within your own timeout period, not indefinitely!) for the server's response (either a 100 or an HTTP failure) before sending your form or data. Although it seems like extra expense, it is meant to improve performance in failure cases, by letting the server tell you not to send the data (since the request has already failed).
If the header is not set by the client, it means you are not waiting for a 100 code from the server and should send your data in the request body straight away. Here is the relevant standard: http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html (jump to section 8.2.3).
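Schematically (this is an illustration, not a capture from XFire), the exchange with the header adds one pause-and-reply before the body is sent:

POST /testservice/services/TestService1.1 HTTP/1.1
Host: 192.168.10.111:9082
Content-Type: text/xml; charset=UTF-8
Content-Length: ...
Expect: 100-continue
                                <-- client stops here and waits
HTTP/1.1 100 Continue           <-- server agrees to read the body
<soap:Envelope>...</soap:Envelope>
HTTP/1.1 200 OK                 <-- final response

Without the header, the client sends the headers and the body in one go and simply waits for the final response.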
Hint for .NET 4 users: this header can be disabled using the static property ServicePointManager.Expect100Continue.
Hint for libcurl users: there was a bug in the old version 7.15 where disabling this header didn't work; it is fixed in newer versions (more here: http://curl.haxx.se/mail/lib-2006-08/0061.html).

Is the anchor part of a URL being sent to a web server?

Say, there's a URL, http://www.example.com/#hello.
Will the #hello thing be sent to the web server or not, according to standards?
How do modern browsers act?
The answer to this question is similar to the answers for Retrieving anchor link in URL for ASP.NET.
Basically, according to the standard at RFC 1808 - Relative Uniform Resource Locators (see Section 2.4.1), it says:
"Note that the fragment identifier is not considered part of the URL."
As stephbu pointed out, "the anchor tag is never sent as part of the HTTP request by any browser. It is only interpreted locally within the browser".
The hash variables aren't sent to the web server at all.
For instance, a request to http://www.whatismyip.org/#test from Firefox sends the following HTTP request:
GET / HTTP/1.1
Host: www.whatismyip.org
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cache-Control: max-age=0
You'll notice the # is nowhere to be found.
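You can reproduce that split in code. This Python sketch (the URL is just the example from the question) builds the request target the way a client would, from the path and query only:

from urllib.parse import urlsplit

url = "http://www.example.com/#hello"
parts = urlsplit(url)

# The request target is path + query only; the fragment never
# leaves the client.
target = parts.path or "/"
if parts.query:
    target += "?" + parts.query

print("GET %s HTTP/1.1" % target)       # GET / HTTP/1.1
print("Host: %s" % parts.netloc)        # Host: www.example.com
print("Kept locally:", parts.fragment)  # hello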
Pages you see using # as a form of navigation are doing so through JavaScript.
This value is accessible through the window.location.hash variable.
The anchor part (after the #) does not appear in any $_SERVER variable in PHP. I don't know of a way to retrieve that piece of the URL on the server (as far as I know, it's not possible). It's meant to be used by the browser only, to find a location in the page, which is why the page does not reload if you click on an anchor such as <a href="#hello">hello</a>.
