Ignore urls that have no parameters in robots.txt - mod-rewrite

I have a pretty URL like this:
http://example.com/category/shoes
I want to disallow the URL when the shoes part is missing, because without it the page redirects to the homepage.
So I tried something like this in my robots.txt file:
Disallow: /category/
Allow: /category/*?*
Is this attempt correct? If not, what is the right way to achieve it?

To block:
http://example.com/category/
without blocking:
http://example.com/category/whatever
You can use the $ operator:
User-agent: *
Disallow: /category/$
The $ means "end of url". Note that the $ operator is not supported by all web robots. It is a common extension that works for all of the major search engines, but it was not part of the original robots.txt standard.
The following will not work:
Disallow: /category/
Allow: /category/*?*
This will block any URL path that starts with /category/, except when the URL also contains a question mark. This is probably not what you want.
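The matching behaviour described above can be checked with a small sketch. It assumes the wildcard semantics used by the major search engines (`*` matches any run of characters, a trailing `$` anchors the end of the URL); the helper names `robots_pattern_to_regex` and `rule_matches` are hypothetical, and the sketch only tests whether a single rule matches a path. It does not model Allow/Disallow precedence.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate one robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' means "end of URL". Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if the rule pattern matches the path from its start."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# The rule from the answer: matches only the bare category URL.
print(rule_matches("/category/$", "/category/"))
print(rule_matches("/category/$", "/category/shoes"))

# The rules from the question: the Disallow matches every category URL,
# and the Allow only matches URLs that contain a question mark.
print(rule_matches("/category/", "/category/shoes"))
print(rule_matches("/category/*?*", "/category/shoes"))
print(rule_matches("/category/*?*", "/category/shoes?color=red"))
```

Because a pretty URL such as /category/shoes contains no question mark, the Allow rule from the question never applies to it, which is why that attempt blocks the pages it was meant to keep.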

Twitter meta image is not rendering on Twitter because it "may be restricted by the site's robots.txt file"

When I share my link on Twitter, the image does not render, although it works on Facebook.
Twitter reports this warning:
WARN: The image URL https://scontent.xx.fbcdn.net/v/t31.0-8/19388529_1922333018037676_3741053750453855177_o.jpg?_nc_cat=0&oh=ba7394f2a6af68cb4b78961759a154f1&oe=5B6BC349 specified by the 'twitter:image' metatag may be restricted by the site's robots.txt file, which will prevent Twitter from fetching it.
I don't know what is causing this. Here is my robots.txt:
User-agent: *
Disallow: /translations
Disallow: /manage
Disallow: /ecommerce
Here is the link to replicate the issue: https://invoker.pvdemo.com/album?album=1422199821384334&name=gallery
Your robots.txt is only relevant for your URLs. For an image hosted at https://scontent.xx.fbcdn.net/, the relevant robots.txt is https://scontent.xx.fbcdn.net/robots.txt.
Currently, this robots.txt blocks everything:
User-agent: *
Disallow: /
As documented under URL Crawling & Caching, Twitter’s crawler (Twitterbot) respects the robots.txt:
If an image URL is blocked, no thumbnail or photo will be shown.
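The per-host lookup can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the two rule sets are hard-coded copies of the robots.txt contents quoted in this thread (an assumption, since the live files may change), and `ROBOTS_BY_HOST` and `may_fetch` are hypothetical names.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Rule sets as quoted in this thread; assumed, not fetched live.
ROBOTS_BY_HOST = {
    "invoker.pvdemo.com": (
        "User-agent: *\n"
        "Disallow: /translations\n"
        "Disallow: /manage\n"
        "Disallow: /ecommerce\n"
    ),
    "scontent.xx.fbcdn.net": (
        "User-agent: *\n"
        "Disallow: /\n"
    ),
}


def may_fetch(user_agent, url):
    """Check a URL against the robots.txt of the host that serves it."""
    host = urlsplit(url).netloc
    parser = RobotFileParser()
    parser.parse(ROBOTS_BY_HOST[host].splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "https://invoker.pvdemo.com/album?album=1422199821384334&name=gallery"
IMAGE = (
    "https://scontent.xx.fbcdn.net/v/t31.0-8/"
    "19388529_1922333018037676_3741053750453855177_o.jpg"
)

print(may_fetch("Twitterbot", PAGE))   # governed by the asker's robots.txt
print(may_fetch("Twitterbot", IMAGE))  # governed by the CDN's robots.txt
```

The page passes its own host's rules, while the image is refused by the rules of the host it is served from, so changing the site's own robots.txt cannot fix the warning.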
You can also configure your robots.txt to have explicit privileges for different crawlers:
User-agent: facebookexternalhit
Disallow:
User-agent: Twitterbot
Disallow:
Google has great docs about it here:
https://developers.google.com/search/docs/advanced/robots/create-robots-txt
https://gist.github.com/peterdalle/302303fb67c2bb73a9a09df78c59ba1d

Google Webmaster Tools is not accepting my sitemap

A few days ago I set up the following robots.txt for my website:
user-agent: *
Allow: /$
Disallow: /
Now I have a problem submitting my sitemap in Google Webmaster Tools:
We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.
Besides this message, I also get:
URL restricted by robots.txt
What can I do to submit my sitemap in Webmaster Tools?
Your robots.txt disallows accessing any URL except the root (/, e.g.: http://example.com/). So bots are (per your robots.txt) not allowed to crawl your sitemap.
You could add another Allow line (like below), but in general there is probably no point in having a sitemap if you disallow crawling all pages except the homepage.
User-agent: *
Allow: /$
Allow: /sitemap.xml$
Disallow: /
(I removed the empty lines, because they are not allowed inside a record.)
Note that you could also link your sitemap from the robots.txt.
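As a sketch of what linking the sitemap from robots.txt looks like, the snippet below parses a robots.txt that carries a `Sitemap:` line and reads it back with `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The sitemap URL on example.com is an assumption. Note that this parser does plain prefix matching and does not interpret `$`, so the Allow rule is written here without it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the Sitemap line takes an absolute URL.
ROBOTS_TXT = """\
User-agent: *
Allow: /sitemap.xml
Disallow: /

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs declared in the file, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The sitemap is crawlable, while other pages stay blocked.
print(parser.can_fetch("*", "http://example.com/sitemap.xml"))
print(parser.can_fetch("*", "http://example.com/some-page"))
```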

#! url showing up at the top of my search results

We are just getting started with SEO/Ajax, so we are hoping someone can help us figure this out. One of the #! URLs is showing up as the first organic result for our startup, nurturelist.com. Although this link technically works, 1) we would rather not have any #! URLs in search results, because they look odd and we have non-#! versions, and 2) the second organic result in the image is the one we would actually like to appear at the top.
Thanks very much for any thoughts on how we can make this happen.
Do you just simply not want the #! to show up in search results? Simply make a robots.txt in your root directory (in most cases the public_html directory) and add these lines to it:
User-agent: *
Disallow: /\#!/
This is meant to stop Google from crawling all pages under the /#!/ path.
However:
If the page has already been indexed by Googlebot, using a robots.txt file won't remove it from the index. You'll either have to use the Google Webmaster Tools URL removal tool after you apply the robots.txt, or instead you can add a noindex command to the page via a meta tag or X-Robots-Tag in the HTTP Headers.
(Source)
Here is a link to the Google Webmaster Tools URL Removal Tool
So add this to pages you don't want indexed:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
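The quoted text also mentions the X-Robots-Tag HTTP header as an alternative to the meta tag. A minimal sketch of that variant, written as a plain WSGI application, is shown below; the application name `noindex_app` and the page body are hypothetical, and a real site would set the header in its own framework or web server configuration.

```python
def noindex_app(environ, start_response):
    """Tiny WSGI app that serves a page with a noindex header."""
    body = b"<html><body>Not for search results</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header equivalent of <meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the app could be served with `wsgiref.simple_server.make_server("", 8000, noindex_app).serve_forever()` from the standard library. The header is useful for responses that cannot carry a meta tag, such as images or PDFs.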

How to create a robots.txt file to hide a view page from search engines in codeigniter

How do I create a robots.txt file in a CodeIgniter project to hide a view page, and where should I place the file?
Currently I have created a file like this:
User-agent: *
Disallow: /application/views/myviewpage.php
inside /public_html/folder/robots.txt (where my .htaccess file is). Is there any way to test this?
The robots.txt file MUST be placed in the document root of the host. It won’t work in other locations.
If your host is example.com, it needs to be accessible at http://example.com/robots.txt.
If your host is www.example.com, it needs to be accessible at http://www.example.com/robots.txt.
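The location rule can be expressed as a short sketch: whatever page a crawler wants to fetch, the robots.txt it consults is derived from the scheme and host only. The function name `robots_txt_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; path, query and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://example.com/folder/page.php?id=1"))
print(robots_txt_url("http://www.example.com/folder/"))
```

A file at /folder/robots.txt is never derived this way, which is why it has no effect when placed there.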

iirf rewrite url

I'm trying to use IIRF for what looks like a simple rewrite, but it's not working.
What I need:
Rewrite http://www.mydomain.com/ru to http://www.mydomain.com?page=russian
The objective is that a GET parameter is sent, but the user still sees the first URL in the browser's address bar.
I'm using the following code:
RewriteEngine ON
StatusUrl /iirfStatus
RewriteRule http://www.mydomain.com/ru http://www.mydomain.com?page=russian
Does the IIRF file go in the site's root or in a 'ru' subfolder? (I tried both.)
What am I doing wrong or missing?
Thanks, and have a nice day :-)
The following should work:
RewriteRule ^/ru$ /?page=russian [I,L]
You should put this in the iirf.ini file in the web site's root folder.
Check http://www.mydomain.com/iirfStatus to see whether iirf was able to read your configuration file.
Also, you may use RewriteLogLevel with a value of 2 or 3 and RewriteLog to see whether the incoming url was rewritten, and how (or why not).
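To see why the rule in the answer differs from the one in the question, here is a rough Python imitation of the match. It assumes, as the answer's pattern suggests, that the rule is tested against the path portion of the request rather than the full URL, and that the [I] flag means a case-insensitive match. This is not IIRF's own engine, and `rewrite` is a hypothetical helper.

```python
import re

# Imitation of: RewriteRule ^/ru$ /?page=russian [I,L]
RULE = re.compile(r"^/ru$", re.IGNORECASE)
TARGET = "/?page=russian"


def rewrite(url_path):
    """Return the rewritten path, or the input unchanged if no match."""
    return TARGET if RULE.match(url_path) else url_path


print(rewrite("/ru"))      # matches the rule
print(rewrite("/RU"))      # matches because of the case-insensitive flag
print(rewrite("/russia"))  # the $ anchor prevents a prefix match
# A pattern or input containing the scheme and host never matches the path.
print(rewrite("http://www.mydomain.com/ru"))
```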
