Is there a way to force a spider to slow down its spidering of a website? Anything that can be put in headers or robots.txt?
I thought I remembered reading something about this being possible, but I cannot find anything now.
If you're referring to Google, you can throttle the speed at which Google spiders your site by using your Google Webmaster account (Google Webmaster Tools).
There is also this, which you can put in robots.txt
User-agent: *
Crawl-delay: 10
Where the crawl delay is specified as the number of seconds the crawler should wait between page requests. Of course, like everything else in robots.txt, the crawler has to choose to respect it, so YMMV.
Beyond using the Google Webmaster tools for the Googlebot (see Robert Harvey's answer), Yahoo! and Bing support the nonstandard Crawl-delay directive in robots.txt:
http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions
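For illustration, a per-crawler version might look like this (the delays are arbitrary; Googlebot ignores Crawl-delay, and its rate is set through Webmaster Tools as noted above):

User-agent: Bingbot
Crawl-delay: 5

User-agent: Slurp
Crawl-delay: 10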
When push comes to shove, however, a misbehaving bot that's slamming your site will just have to be blocked at a higher level (e.g. load balancer, router, caching proxy, whatever is appropriate for your architecture).
See Throttling your web server for a solution using Perl. Randal Schwartz said that he survived a Slashdot attack using this solution.
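If you want to experiment with the same idea at the application layer, here is a minimal per-IP throttling sketch as WSGI middleware in Python (the window and limit are arbitrary placeholders; in production this job usually belongs on the proxy or load balancer, as noted above):

import time
from collections import defaultdict, deque
from wsgiref.simple_server import make_server

WINDOW_SECONDS = 10            # arbitrary sliding window
MAX_REQUESTS_PER_WINDOW = 20   # arbitrary per-client limit within the window
recent_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def throttle(app):
    """Wrap a WSGI app and reject clients that exceed the request limit."""
    def middleware(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        hits = recent_hits[ip]
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()  # drop timestamps that fell out of the window
        hits.append(now)
        if len(hits) > MAX_REQUESTS_PER_WINDOW:
            start_response("429 Too Many Requests",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", str(WINDOW_SECONDS))])
            return [b"Slow down.\n"]
        return app(environ, start_response)
    return middleware

def hello(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

if __name__ == "__main__":
    make_server("", 8000, throttle(hello)).serve_forever()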
I don't think robots.txt will do anything except allow or disallow. Most of the search engines will let you customize how they crawl and index your site.
For example: Bing and Google
If you have a specific agent that is causing issues, you might either block it specifically, or see if you can configure it.
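For example, to shut out one troublesome crawler by name in robots.txt (the user-agent string below is just a placeholder):

User-agent: BadBot
Disallow: /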
Well, I'm trying to build a web page with WordPress on GoDaddy hosting. I want to make a fast page, because people say fast pages rank on the first page of Google (and mobile page speed in particular is said to be very important). So I want a very fast page, but my level of knowledge is not very advanced; I'm learning as I go.
When I test my page with PageSpeed Insights, my mobile score is about 60-70, and the report lists lots of suggested improvements below it. I want to learn how to fix them. If you help me with one example, I will do the others myself.
Let's start with the first problem, /css?family=… (fonts.googleapis.com), which is shown under the "Eliminate render-blocking resources" heading. How do I fix it? What should I do?
Also, in the "Coverage" tab, some source code is shown that isn't actually being used. For example, I am not using the easy-sheare plugin (second row in the image) on the homepage.
How can I safely remove that code from the home page? If I can learn how one fix is done, I can correct the others myself.
The issue you are running into is something I have seen over and over again: GoDaddy and WordPress sites are generally bloated and perform poorly.
Here are some tips to improve your speed and get a better PageSpeed score.
Hosting: Do you need to be on GoDaddy? I have seen this time and time again: most websites on GoDaddy are slow. GoDaddy is good for domain registration, not for hosting, and most non-technical folks don't know any better. Try Amazon Lightsail, AWS S3, Google Firebase, or Netlify. They all offer much faster page loads by reducing initial server response time, and they are surprisingly simple to learn and deploy.
CDN: You must use a content delivery network (CDN). Check out CloudFront; it offers a free tier that works quite well.
WordPress: This is your real issue. WordPress sites are neither easy to build well nor easy to maintain, and you need multiple plugins to make them perform. You would be better off building your own site. If you have to stay on WordPress, check out image optimizers, minifiers, and caching plugins; Gumlet, WP Rocket, and ShortPixel are quite popular for improving speed.
I have recently started working with a company that has a Magento eCommerce website.
We spotted that traffic dipped considerably in May, and so did the Google rankings.
When I started investigating, I saw that the pages of the ES website were not appearing in Screaming Frog.
Only the homepage showed, and its status said "blocked by robots.txt".
I mentioned this to my developer, and they said they would move the robots.txt file to the /pub folder.
But would that not mean the file is in two places? Would that be an issue?
The developer has gone ahead and done this. How long should it take to see whether Screaming Frog can crawl the pages?
Are there any Magento developers who could help with advice on this?
Thanks
Neo
There is a documentation page for how to manage robots.txt with Magento 2.x.
And you can use this to allow all traffic to your site:
User-agent: *
Disallow:
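To sanity-check how a compliant crawler will read the deployed file, Python's standard library has a robots.txt parser (the domain below is a placeholder for your store's URL):

from urllib.robotparser import RobotFileParser

# Fetch the live robots.txt and ask whether a given URL may be crawled.
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder domain
rp.read()
print(rp.can_fetch("*", "https://www.example.com/some-category/some-product"))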
Regarding the Googlebot crawl rate, here is some explanation of it.
According to Google, “crawling and indexing are processes which can take some time and which rely on many factors.” It's estimated that it takes anywhere from a few days to four weeks before Googlebot indexes a new site. If you have an older website that isn't being crawled, the design of the site may be the problem. Sometimes sites are temporarily unavailable when Google attempts to crawl them, so checking the crawl stats and looking at errors can help you make changes to get your site crawled. Google also has two different crawlers: a desktop crawler that simulates a user on a desktop, and a mobile crawler that simulates a user on a mobile device.
I've encountered several websites built entirely with Ajax, and it seems like their SEO is pretty bad. Does Google really crawl websites like that?
Optimization guides for different search engines say that bots are unable to crawl such sites. Google's bots might use Chrome's rendering engine for some purposes (I remember they generated site screenshots at one point), but nevertheless it's the static HTML that matters. Therefore, the usual practice is to generate valid HTML on the server to provide a functional site for user agents like, for example, Lynx, and then enhance it with AJAX, the history API, and all the other imaginable bells and whistles.
I've got a web app which heavily uses AngularJS / AJAX and I'd like it to be crawlable by Google and other search engines. My understanding is that I need to do something special to make it work, as described here: https://developers.google.com/webmasters/ajax-crawling
Unfortunately, that looks quite nasty and I'd rather not introduce the hash tags. What I'd like to do is to serve a static page to Googlebot (based on the User-Agent), either directly or by sending it a 302 redirect. That way, the web app can be the same, and the whole Googlebot workaround is nicely isolated until it is no longer necessary.
My worry is that Google may mistakenly assume that I'm trying to trick Googlebot, while my goal is to help it. What do you guys think about this approach, and what would you recommend?
Recently I came upon this excellent post from yearofmoo, explaining in detail how to make your Angular app SEO friendly. In essence, when bots see a URI with a hashbang ('#!') they know it's an AJAX page and will try to reach the same content by replacing the '#!' in the URI with '?_escaped_fragment_='. At this alternative URI the bots expect to find a definitive static version of the page they were accessing.
Of course, to achieve this you'd have to introduce hashbangs into your URIs. I don't see why you're trying to avoid them; isn't Gmail using them?
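For illustration, the URL mapping the scheme expects can be sketched in a few lines of Python (the example URL is made up):

from urllib.parse import quote

def escaped_fragment_url(url):
    """Map a '#!' URL to the '_escaped_fragment_' form that crawlers request."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    separator = "&" if "?" in base else "?"
    return base + separator + "_escaped_fragment_=" + quote(fragment, safe="")

print(escaped_fragment_url("https://example.com/app#!/products/42"))
# -> https://example.com/app?_escaped_fragment_=%2Fproducts%2F42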
Yeah, unfortunately, if you want to be indexed you have to adhere to the scheme :( If you're running a Ruby app, there's a gem that implements the crawling scheme for any Rack app:
gem install google_ajax_crawler
A writeup of how to use it is at http://thecodeabode.blogspot.com.au/2013/03/backbonejs-and-seo-google-ajax-crawling.html, and the source code is at https://github.com/benkitzelman/google-ajax-crawler.
Have a look at these links; they will give you a good sense of direction:
Set up your own Prerender service using Prerender.io's open-source code:
https://prerender.io/
Use a different existing service such as BromBone, Seo.js or SEO4AJAX:
http://www.brombone.com/
http://getseojs.com/
http://www.seo4ajax.com/
Create your own service for rendering and serving snapshots to search engines. Read this article. It will give you the big picture:
http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io
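If you do roll your own, the core of it is just content negotiation on the _escaped_fragment_ query parameter. A minimal sketch in Python (the snapshots/ directory layout, file naming, and port are assumptions; a real service would add caching and input sanitization):

from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # parse_qs already percent-decodes the fragment value.
    query = parse_qs(environ.get("QUERY_STRING", ""), keep_blank_values=True)
    fragment = query.get("_escaped_fragment_", [None])[0]
    if fragment is not None:
        # A crawler following the AJAX crawling scheme: serve a pre-rendered snapshot.
        name = fragment.strip("/").replace("/", "_") or "index"
        with open("snapshots/" + name + ".html", "rb") as f:  # assumed layout
            body = f.read()
    else:
        # A normal browser: serve the JavaScript application shell.
        with open("index.html", "rb") as f:
            body = f.read()
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()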
As of May 2014, Googlebot executes JavaScript. Check Webmaster Tools to see how Google sees your site.
http://googlewebmastercentral.blogspot.no/2014/05/understanding-web-pages-better.html
Edit: Note that this does not mean other crawlers (Bing, Facebook, etc.) will execute JavaScript. You may still need to take additional steps to ensure that these crawlers can see your site.
I've played around with Google Sitemaps on a couple of sites. The lastmod, changefreq, and priority parameters are pretty cool in theory, but in practice I haven't seen them affect much.
Most of my sites don't have a Google Sitemap, and that has worked out fine. Google still crawls each site and finds all of my pages. The old meta robots and robots.txt mechanisms still work when you don't want a page (or directory) to be indexed, and I just leave every other page alone; as long as there's a link to it, Google will find it.
So what reasons have you found to write a Google Sitemap? Is it worth it?
From the FAQ:
Sitemaps are particularly helpful if:
- Your site has dynamic content.
- Your site has pages that aren't easily discovered by Googlebot during the crawl process, for example pages featuring rich AJAX or images.
- Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
- Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
It also allows you to give Google more granular information about the relative importance of pages on your site and how often the spider should come back. And, as mentioned above, if Google deems your site important enough to show sitelinks beneath it in the search results, you can control what appears there via the sitemap.
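For reference, a single sitemap entry carrying those hints looks roughly like this (the URL and values are placeholders; see sitemaps.org for the full protocol):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-page/</loc>
    <lastmod>2014-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>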
I believe the "special links" in search results are generated from the google sitemap.
What do I mean by "special link"? Search for "apache", below the first result (Apache software foundation) there are two columns of links ("Apache Server", "Tomcat", "FAQ").
I guess it helps Google prioritize its crawl? In practice, though, I was involved in a project where we used the large, gzipped version of the sitemap, and it helped massively. And AFAIK there is a nice integration with Webmaster Tools as well.
I am also curious about the topic, but does it cost anything to generate a sitemap?
In theory, anything that costs nothing and may have a potential gain, even if very small or very remote, can be defined as "worth it".
In addition, Google says: "Tell us about your pages with Sitemaps: which ones are the most important to you and how often they change. You can also let us know how you would like the URLs we index to appear." (Webmaster Tools)
I don't think the statement highlighted above (letting Google know how you would like your indexed URLs to appear) is possible with the traditional mechanisms that search engines use to discover URLs.