Google Webmaster Tools is not accepting my sitemap

A few days ago I set up robots.txt for my website, which you can find below:
user-agent: *
Allow: /$
Disallow: /
Now I am running into a problem when submitting my sitemap in Google Webmaster Tools:
We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.
Along with that message I also get:
URL restricted by robots.txt
What can I do to get my sitemap accepted in Webmaster Tools?

Your robots.txt disallows accessing any URL except the root (/, e.g.: http://example.com/). So bots are (per your robots.txt) not allowed to crawl your sitemap.
You could add another Allow line (like below), but in general there is probably no point in having a sitemap if you disallow crawling all pages except the homepage.
User-agent: *
Allow: /$
Allow: /sitemap.xml$
Disallow: /
(I removed the empty lines, because they are not allowed inside a record.)
Note that you could also reference your sitemap from the robots.txt with a Sitemap: line.
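If you want to sanity-check rules like these before resubmitting, the longest-match behavior (with the $ end-anchor) that the major search engines document can be sketched in a few lines of Python. This is an illustrative approximation, not Google's actual matcher:

```python
import re

def rule_regex(pattern):
    # Translate a robots.txt rule into a regex: * matches any run of
    # characters, and a trailing $ anchors the rule to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # The most specific (longest) matching rule wins; Allow wins a tie.
    verdict, best_len = "allow", -1  # no matching rule means allowed
    for directive, pattern in rules:
        if rule_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and directive == "allow"):
                verdict, best_len = directive, len(pattern)
    return verdict == "allow"

# The rules from the question, plus the suggested sitemap exception
rules = [("allow", "/$"), ("allow", "/sitemap.xml$"), ("disallow", "/")]
print(is_allowed(rules, "/"))             # True: homepage crawlable
print(is_allowed(rules, "/sitemap.xml"))  # True: sitemap now fetchable
print(is_allowed(rules, "/page1.html"))   # False: everything else blocked
```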

Related

Twitter meta image is not rendering on Twitter because it "may be restricted by the site's robots.txt file"

When I share the link on Twitter, the image doesn't render, although the same link works fine on Facebook. For Twitter I am getting this warning:
WARN: The image URL https://scontent.xx.fbcdn.net/v/t31.0-8/19388529_1922333018037676_3741053750453855177_o.jpg?_nc_cat=0&oh=ba7394f2a6af68cb4b78961759a154f1&oe=5B6BC349 specified by the 'twitter:image' metatag may be restricted by the site's robots.txt file, which will prevent Twitter from fetching it.
I don't know what is causing this. Here is my robots.txt:
User-agent: *
Disallow: /translations
Disallow: /manage
Disallow: /ecommerce
Here is the link to replicate the issue: https://invoker.pvdemo.com/album?album=1422199821384334&name=gallery
Your robots.txt is only relevant for your URLs. For an image hosted at https://scontent.xx.fbcdn.net/, the relevant robots.txt is https://scontent.xx.fbcdn.net/robots.txt.
Currently, this robots.txt blocks everything:
User-agent: *
Disallow: /
As documented under URL Crawling & Caching, Twitter’s crawler (Twitterbot) respects the robots.txt:
If an image URL is blocked, no thumbnail or photo will be shown.
You can also configure your robots.txt to have explicit privileges for different crawlers:
User-agent: facebookexternalhit
Disallow:
User-agent: Twitterbot
Disallow:
Google has great docs about it here:
https://developers.google.com/search/docs/advanced/robots/create-robots-txt
https://gist.github.com/peterdalle/302303fb67c2bb73a9a09df78c59ba1d
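A quick way to confirm how a given crawler is treated by such per-agent groups is Python's standard urllib.robotparser (it understands User-agent groups, though not the * and $ wildcard extensions). The host and image path below are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: facebookexternalhit
Disallow:

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
""".splitlines())

# Named crawlers get their own (empty) Disallow group; everyone else is blocked
print(rp.can_fetch("Twitterbot", "https://example.com/photo.jpg"))    # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/photo.jpg"))  # False
```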

Google does not index startpage (index.html) of AJAX application correctly but all subpages containing a hashbang (#!)

I followed the Google guideline Making AJAX applications crawlable to make my AngularJS application crawlable for SEO purposes. So I am using #! (hashbang) in my routes config:
$locationProvider.hashPrefix('!');
So my URLs look like this:
http://www.example.com/#!/page1.html
http://www.example.com/#!/page2.html
...
As Google replaces the hashbangs (#!) with ?_escaped_fragment_=, I redirect the Google bots via my .htaccess file to a snapshot of the page:
DirectoryIndex index.html
RewriteEngine On
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=/?(.*)$
RewriteRule ^(.*)$ /snapshot/%1? [NC,L]
So far everything works like a charm. When a bot requests the URL http://www.example.com/#!/page1.html, it replaces the hashbang and actually requests http://www.example.com/?_escaped_fragment_=/page1.html, which I redirect to the static/prerendered version of the requested page.
So I submitted my sitemap.xml via the Search Console in Google Webmaster Tools. All URLs in my sitemap are indexed correctly by Google, but not the domain itself. This means that a page like:
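For reference, the URL mapping the crawler applies can be mimicked like this. It is a simplified sketch of the (now deprecated) AJAX crawling scheme, ignoring the extra percent-encoding of special characters inside the fragment:

```python
def escaped_fragment_url(url):
    # Rewrite http://host/#!/page into http://host/?_escaped_fragment_=/page,
    # as the old AJAX crawling scheme specified.
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    separator = "&" if "?" in base else "?"
    return base + separator + "_escaped_fragment_=" + fragment

print(escaped_fragment_url("http://www.example.com/#!/page1.html"))
# http://www.example.com/?_escaped_fragment_=/page1.html
```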
http://www.example.com/#!/page1.html
is indexed correctly, and by googling specific content of any of my subpages, Google finds the correct page. The problem is the start/home page itself, which "naturally" has no hashbang:
http://www.example.com/
The hashbang here is appended (via JavaScript in my router configuration) when a user visits the site. But it looks like this is not the case for the Google bot.
So the crawler does not "see" the hashbang and hence does not use the static version here, which is a big issue because this page carries the most important content.
I already tried to rewrite and redirect / via .htaccess to /#!/ but this ends up in too many redirects and crashes everything. I also tried to use
<meta name="fragment" content="!">
in the header of the index.html. But this did not help at all.
Has anybody else faced this problem before?

Ignore URLs that have no parameters in robots.txt

I have a pretty URL like this:
http://example.com/category/shoes
I want to disallow the URL if there is no shoes parameter, because without the parameter the page is redirected to the homepage.
So I tried something like this in my robots.txt file:
Disallow: /category/
Allow: /category/*?*
I am not sure whether the above attempt is correct. What is the correct way of achieving this?
To block:
http://example.com/category/
without blocking:
http://example.com/category/whatever
You can use the $ operator:
User-agent: *
Disallow: /category/$
The $ means "end of URL". Note that the $ operator is not supported by all web robots. It is a common extension that works for all of the major search engines, but it was not part of the original robots.txt standard.
The following will not work:
Disallow: /category/
Allow: /category/*?*
This will block any URL path that starts with /category/, except when the URL also contains a question mark. This is probably not what you want.
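To see why the $ variant behaves differently from the plain prefix rule, the two patterns can be compared with a rough regex translation (illustrative only; real crawlers implement their own matching):

```python
import re

def robots_rule_matches(pattern, path):
    # Rough translation of a single robots.txt rule: * is a wildcard,
    # a trailing $ means the rule must match the whole path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(body + ("" if anchored else ".*"), path) is not None

print(robots_rule_matches("/category/$", "/category/"))       # True: blocked
print(robots_rule_matches("/category/$", "/category/shoes"))  # False: crawlable
print(robots_rule_matches("/category/", "/category/shoes"))   # True: prefix rule blocks it
```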

#! URL showing up at the top of my search results

We are just getting started with SEO/AJAX, so I'm hoping someone can help us figure this out. One of the #! URLs is showing up as the first organic result for our startup nurturelist.com. Although this link technically works, we would 1) prefer that no #! URLs show up in search results, because they look weird and we have non-#! versions, and 2) like the second organic result in the image to be the one that appears at the top.
Thanks very much for any thoughts on how we can make this happen.
Do you simply not want the #! URLs to show up in search results? Make a robots.txt in your root directory (in most cases the public_html directory) and add these lines to it:
User-agent: *
Disallow: /#!/
This prevents Google from crawling any pages under the /#!/ path.
However:
If the page has already been indexed by Googlebot, using a robots.txt file won't remove it from the index. You'll either have to use the Google Webmaster Tools URL removal tool after you apply the robots.txt, or instead you can add a noindex command to the page via a tag or X-Robots-Tag in the HTTP headers.
(Source)
Here is a link to the Google Webmaster Tools URL Removal Tool
So add this to pages you don't want indexed:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
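If it's easier to send the directive as an HTTP header (useful for non-HTML resources like PDFs or images, which can't carry a meta tag), the equivalent X-Robots-Tag can be set from .htaccess with Apache's mod_headers; the FilesMatch pattern below is just an example:

```
<IfModule mod_headers.c>
  <FilesMatch "\.(pdf|jpg|png)$">
    Header set X-Robots-Tag "noindex, nofollow"
  </FilesMatch>
</IfModule>
```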

How to create a robots.txt file to hide a view page from search engines in CodeIgniter

How do I create a robots.txt file in a CodeIgniter project to hide a view page, and where should I place the robots.txt file?
Currently I have created the file like this:
User-agent: *
Disallow: /application/views/myviewpage.php
and placed it at /public_html/folder/robots.txt (where my .htaccess file is). Is there any way to test this?
The robots.txt file MUST be placed in the document root of the host. It won’t work in other locations.
If your host is example.com, it needs to be accessible at http://example.com/robots.txt.
If your host is www.example.com, it needs to be accessible at http://www.example.com/robots.txt.
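As for the "is there any way to test this?" part: once the file is reachable at the document root, Python's standard urllib.robotparser can check it. Here the rules are pasted inline; to test the live file instead, construct the parser with your robots.txt URL and call read():

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Paste the deployed rules; or: rp.set_url("http://example.com/robots.txt"); rp.read()
rp.parse("""
User-agent: *
Disallow: /application/views/myviewpage.php
""".splitlines())

print(rp.can_fetch("*", "http://example.com/application/views/myviewpage.php"))  # False
print(rp.can_fetch("*", "http://example.com/index.php"))                         # True
```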
