Sitemaps on multiple webservers

If I have two webservers serving the same domain, is it possible to have sitemap1.xml on server1 and sitemap2.xml on server2? sitemap1 and sitemap2 will have different dynamic content. Will search engines be able to find and index them?

To make sure search engines can find both sitemaps, declare them in your robots.txt file:
Sitemap: http://www.yoursite.com/sitemap1.xml
Sitemap: http://www.yoursite.com/sitemap2.xml
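Since each declaration is just a plain-text line, you can sanity-check that crawlers will see both entries by fetching your robots.txt and scanning for Sitemap: lines. A minimal sketch in Python (the URLs are the placeholders from above, and the field name is matched case-insensitively, as crawlers do):

```python
def parse_sitemap_lines(robots_txt):
    """Return every sitemap URL declared in a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # The "Sitemap:" field name is case-insensitive.
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

robots = """Sitemap: http://www.yoursite.com/sitemap1.xml
Sitemap: http://www.yoursite.com/sitemap2.xml
User-agent: *
Disallow:
"""
print(parse_sitemap_lines(robots))
```

Which server actually answers the request for each sitemap file does not matter to the crawler, as long as both URLs resolve on the shared domain.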

Related

Domain name alias, sitemaps.xml and robots.txt

I'm looking to add references to the sitemaps for multiple domain-name aliases (handled by logic within a Laravel application) in my robots.txt file, but I'm not quite sure what the correct way to do this is. The sitemaps exist and are present and correct; I'm just unsure of the format Google expects... so I'm really looking for SEO-based answers rather than ways to achieve this.
I'm thinking I can do this in robots.txt, i.e.:
Sitemap: https://www.main-domain.com/sitemap.xml
Sitemap: https://www.domain-alias1.com/sitemap.xml
Sitemap: https://www.domain-alias2.com/sitemap.xml
Any pro-seo tips would be much appreciated!
Assuming your code is about generating a proper robots.txt with multiple sitemaps, then listing them one per line with the Sitemap: directive is the right way. These must be located at the same directory level as the robots.txt file (i.e. root level).
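If the robots.txt is generated dynamically (as the Laravel setup suggests), the output only needs one Sitemap: line per alias followed by the crawl rules. A language-agnostic sketch in Python, using the alias URLs from the question; the helper name is my own:

```python
def build_robots_txt(sitemap_urls):
    """Build a robots.txt body with one Sitemap: line per URL,
    followed by a permissive crawl rule."""
    lines = [f"Sitemap: {url}" for url in sitemap_urls]
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"

aliases = [
    "https://www.main-domain.com/sitemap.xml",
    "https://www.domain-alias1.com/sitemap.xml",
    "https://www.domain-alias2.com/sitemap.xml",
]
print(build_robots_txt(aliases))
```

In a real deployment you would typically serve each alias its own robots.txt listing only that host's sitemap, since crawlers fetch robots.txt per host.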

Sitemap reference in robots.txt for each TLD

We are using the robots.txt to reference our sitemap index file.
Now we are launching in new countries. Our website under the TLD .de provides a robots.txt containing a reference to our sitemap index file. The index file refers to different sitemaps containing our .de links in the loc XML node. Other locales (e.g. for .fr) are listed with xhtml:link below.
Example:
<url>
  <loc>https://xy.de/hallo</loc>
  <xhtml:link rel="alternate" hreflang="fr" href="https://xy.fr/hello"/>
</url>
The question is now, should we add a robots.txt with a reference to our sitemap index to our .fr index too?
Or might it be enough to place the reference only in the German .de robots.txt, because the locations are described as alternative locations for each other locale? Or should we swap the loc XML node to the "current" locale? E.g., under https://xy.fr/robots.txt, should the referenced sitemap have .fr links in the loc XML node?
The Sitemaps protocol doesn’t mention an xhtml:link element, so consumers following the protocol might ignore it.
As a sitemap can only contain URLs from the same host, and a robots.txt file also only works for its host, the typical way is to give each host its own robots.txt file which points to this host’s sitemap (with an absolute URL).
# robots.txt from http://fr.example/robots.txt
Sitemap: http://fr.example/sitemap.xml
# robots.txt from http://de.example/robots.txt
Sitemap: http://de.example/sitemap.xml
The sitemap can be hosted on a different host, but you still need to prove ownership via the robots.txt file (see Sitemaps & Cross Submits).

Someone using our site on robots.txt

Some weeks ago, we discovered someone going on our site with the robots.txt directory:
http://www.ourdomain.com/robots.txt
I've been doing some research, and it says that robots.txt sets the permissions for search engines?
I'm not certain of that...
The reason why I'm asking is that they tried to access that file once again today...
The thing is that we do not have this file on our website... So why is someone trying to access it? Is it dangerous? Should we be worried?
We have tracked the IP address, and it says the location is in Texas; some weeks ago, it was in Venezuela... Are they using a VPN? Is this a bot?
Can someone explain what this file does and why he is trying to access it?
In a robots.txt (a simple text file) you can specify which URLs of your site should not be crawled by bots (like search engine crawlers).
The location of this file is fixed so that bots always know where to find the rules: the file named robots.txt has to be placed in the document root of your host. For example, when your site is http://example.com/blog, the robots.txt must be accessible from http://example.com/robots.txt.
Polite bots will always check this file before trying to access your pages; impolite bots will ignore it.
If you don’t provide a robots.txt, polite bots assume that they are allowed to crawl everything. To get rid of the 404s, use this robots.txt (which says the same: all bots are allowed to crawl everything):
User-agent: *
Disallow:
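The "polite bot" behaviour described above is what Python's standard library implements in urllib.robotparser; a minimal sketch using that robots.txt (the bot name and URL are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body the way a polite crawler would.
rp = RobotFileParser()
rp.parse("User-agent: *\nDisallow:".splitlines())

# An empty Disallow value means every URL may be fetched.
print(rp.can_fetch("MyBot", "http://example.com/blog"))  # True
```

A rule like `Disallow: /private/` would instead make `can_fetch` return False for URLs under that path; an impolite bot simply never performs this check.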

Should Magento base url include www?

In the Magento installation wizard, should the base url include www or not?
Ex: www.site.com or site.com
If you plan on using a CDN to distribute your images etc., then yes, it would be a really, really good idea to have your web server use a www. hostname instead of only the bare domain name.
It's not cool when your customers start having miserable cookie problems because you didn't use proper hostnames to sort out the different CNAME entries in your DNS: cookies set on the bare domain are sent to every subdomain, including your CDN host.
Changing after the fact results in decidedly non-SEO-friendly reindexing by Google, Bing, et al.

Is there a way to detect sitemap, if it is not in robots.txt?

I'm working on a simple bot for a project, and I noticed that a lot of sites do not list sitemaps in their robots.txt files. There is of course the option to simply crawl the sites in question and visit all possible pages, but that often takes much more time than simply downloading a sitemap.
What is the best way to detect sitemap if it is not mentioned in robots.txt?
Normally it should be placed in the root directory of the domain, e.g. xydomain.xyz/sitemap.xml.
I would only add the sitemap to the robots.txt file if it is placed elsewhere. If a site uses more than one sitemap located elsewhere, they should be listed in a sitemap index file.
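Putting that together, a bot can first look for Sitemap: lines in robots.txt and only then fall back to the conventional root-level locations. A rough sketch, assuming helper and constant names of my own choosing (the fallback paths are common conventions, not a standard):

```python
from urllib.parse import urljoin

# Common fallback locations, tried when robots.txt declares no
# sitemap (exact names vary between sites and CMSs).
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]

def candidate_sitemap_urls(base_url, robots_txt):
    """Return sitemap URLs to try: robots.txt declarations first,
    then conventional root-level locations."""
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    return declared or [urljoin(base_url, p) for p in FALLBACK_PATHS]
```

Each candidate URL still has to be fetched and checked (an HTTP 200 with XML content) before the bot treats it as a real sitemap.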
You can use an online sitemap generator to scan your site and create a sitemap.xml file for it.
To help your sitemap be discovered through robots.txt, add the URL of your sitemap at the very top of your robots.txt file (see the example below).
So, the robots.txt file looks like this:
Sitemap: http://www.example.com/sitemap.xml
User-agent: *
Disallow:
