Right now we have a sitemap that is produced dynamically for around 400 products, and we submit it to Google for indexing purposes. It hits the server resources to produce the XML for all 400 products every time Google crawls the sitemap. Since most products are already indexed by Google, I'm thinking of reducing the dynamic sitemap to only the latest 50 products to save server resources.
I'm looking for an explanation: does a Google sitemap need to list all products so they keep getting indexed, or only the latest products that are not indexed yet? Please advise me about this.
Put every page you want to be indexed in your sitemap, even the pages which are old and already included in Google's index. Sitemaps aren't there to tell search engines about new content; they're there to tell them about all of the content you want to be indexed.
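If the worry is server load rather than coverage, a common approach is to keep every product in the sitemap but regenerate it on a schedule (e.g. a cron job) and serve the cached file, instead of rebuilding the XML on every crawl. A rough Python sketch, where fetch_product_urls() is just a stand-in for your real product query:

import xml.etree.ElementTree as ET

def fetch_product_urls():
    # Stand-in for your real database query; return (url, lastmod) pairs.
    return [
        ("https://example.com/product/1", "2016-01-10"),
        ("https://example.com/product/2", "2016-01-12"),
    ]

def write_sitemap(path="sitemap.xml"):
    # Build the <urlset> once and write it to a static file the web server
    # can hand straight to Google on each crawl.
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in fetch_product_urls():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    write_sitemap()

That way Google still sees all 400 products, but the expensive generation only happens as often as you schedule it.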
Related
We have a DSpace repository of research publications that the GSA is indexing via a web crawl, i.e. starting at the homepage and following all the links.
I'm thinking that using a connector to submit URLs for indexing from the sitemap.xml file might be more efficient. The GSA would then only need to index and recrawl the URLs in the sitemap and could ignore the rest of the site.
The suggestion from the GSA documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.
What do you think?
Thanks,
Georgina.
This might be outdated (so I'm not sure if it still works), but there's an example of a Python connector that will parse a sitemap.xml and send it as a Content Feed or a Metadata Feed.
Here are two links to help you:
https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py
https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation
If anything, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x
You can generate sitemaps by running "dspace generate-sitemaps" from the /bin directory. It will generate a sitemaps directory with links to all items in DSpace.
An output example:
<html><head><title>URL List</title></head><body><ul><li>http://localhost:8080//handle/123456789/1</li>
<li>http://localhost:8080//handle/123456789/2</li>
<li>http://localhost:8080//handle/123456789/3</li>
<li>http://localhost:8080//handle/123456789/5</li>
</ul></body></html>
You could easily create a GSA "Feed" that lists the URLs that you want to crawl. However, since your "Follow" patterns must include the host name of your web site, the crawler is going to follow all the pages that are linked from the pages in your feed.
If you truly only want to index the items in your "Site Map" then you should probably look at writing an Adaptor (4.x). You would then be responsible for writing the logic to parse your sitemap.xml file to extract the URLs you want crawled.
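The extraction step itself is small; here is a rough Python sketch that reads a sitemap.xml and pulls out the <loc> URLs (the sitemap URL is a placeholder, and it only illustrates the parsing logic, not a full Adaptor or the feed plumbing around it):

import urllib.request
import xml.etree.ElementTree as ET

# Standard sitemap namespace from sitemaps.org.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url):
    # Download and parse the sitemap, then collect every <loc> value.
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

if __name__ == "__main__":
    for url in urls_from_sitemap("http://localhost:8080/sitemap.xml"):
        print(url)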
I have a site built in CodeIgniter and I am trying to utilize Google's Webmaster Tools, which tell me to set up a robots.txt file. I want Google to index the whole site, but not the files which make up the site. So I don't want Google to look at the /system/ files or the /application/config/ files, but I do want every page to be indexed. Should I list out each file for Google not to index, or tell it to index everything, or tell it to index nothing?
Thanks!
Google only sees the pages/URLs your website makes available, so you don't block files, you block URLs. Your robots.txt should disallow the URL paths you don't want crawled; the files behind the scenes are irrelevant because they are never served as URLs.
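A minimal robots.txt for that setup can therefore just allow everything (this is only a sketch; add Disallow lines only for URL paths that actually exist on your site and that you want kept out):

# /system/ and /application/config/ never appear as URLs,
# so there is nothing to block for them.
User-agent: *
Disallow:

An empty Disallow value means nothing is blocked, so every page stays crawlable.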
I'm in the process of creating a sitemap for my website. I'm doing this because I have a large number of pages that users can normally only reach via a search form.
I've created an automated method for pulling the links out of the database and compiling them into a sitemap. However, for all the pages that are regularly accessible, and do not live in the database, I would have to manually go through and add these to the sitemap.
It strikes me that the regular pages are those that get found anyway by ordinary crawlers, so it seems like a hassle manually adding in those pages, and then making sure the sitemap keeps up to date on any changes to them.
Is it bad to just leave those out, if they're already being indexed, and have my sitemap contain only my dynamic pages?
Google will crawl any URLs (as allowed by robots.txt) it discovers, even if they are not in the sitemap. So long as your static pages are all reachable from the other pages in your sitemap, it is fine to exclude them. However, there are other features of sitemap XML that may incentivize you to include static URLs in your sitemap (such as modification dates and priorities).
If you're willing to write a script to automatically generate a sitemap for database entries, then take it one step further and make your script also generate entries for static pages. This could be as simple as searching through the webroot and looking for *.html files. Or if you are using a framework, iterate over your framework's static routes.
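As a rough sketch of that extra step (the webroot path and base URL below are assumptions), the script could walk the webroot for .html files and emit sitemap entries, using each file's modification time as the lastmod:

import os
import datetime

WEBROOT = "/var/www/html"          # assumption: adjust to your webroot
BASE_URL = "https://example.com"   # assumption: adjust to your domain

def static_entries():
    # Walk the webroot and yield (url, lastmod) pairs for every .html file.
    for dirpath, _dirs, files in os.walk(WEBROOT):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, WEBROOT).replace(os.sep, "/")
            lastmod = datetime.date.fromtimestamp(os.path.getmtime(path))
            yield "{}/{}".format(BASE_URL, rel), lastmod.isoformat()

for loc, lastmod in static_entries():
    print("<url><loc>{}</loc><lastmod>{}</lastmod></url>".format(loc, lastmod))

Merge those entries with the ones you already pull from the database and the whole sitemap stays up to date without any manual editing.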
Yes, I think it is not a good idea to leave them out. I think it would also be advisable to look for a way for your search pages to be found by a crawler without a sitemap. For example, you could add some kind of advanced search page where a user can select the search term in a form. Crawlers can also fill in those forms.
I'm in the process of setting up a Google Mini device to index an ASP.NET site which has a lot of dynamically generated content. I've created a dynamic site.map XML file which lists all of the dynamic URLs. This conforms to the XML sitemap format and is currently being indexed by Google, but it seems to be ignored by the Google Mini device.
I've added the site.map file to "Start crawling from the following URLs". When I view the crawl diagnostics the site.map file comes up, but none of the dynamic URLs contained within the site.map are being indexed. The Google Mini device is only indexing 100 URLs, whereas the site.map contains 10,000.
If I search for a phrase using the test centre, the search results include the site.map itself and not the URLs it points to.
Any ideas?
We've just had a consultant come in who has stated that the Google Mini cannot index the URLs contained in a sitemap.xml file. One alternative solution is to create an HTML page with all of the links in it.
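If you go the HTML-page route, you can generate that page from the existing site.map file so the two never drift apart; a rough Python sketch (the file names are assumptions):

import xml.etree.ElementTree as ET

# Standard sitemap namespace from sitemaps.org.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_to_link_page(sitemap_path="site.map", out_path="all-links.html"):
    # Read every <loc> from the sitemap and write a flat HTML list of links
    # that the Google Mini can crawl and follow.
    urls = [loc.text.strip() for loc in ET.parse(sitemap_path).iter(NS + "loc")]
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<html><head><title>URL List</title></head><body><ul>\n")
        for url in urls:
            out.write('<li><a href="{0}">{0}</a></li>\n'.format(url))
        out.write("</ul></body></html>\n")

if __name__ == "__main__":
    sitemap_to_link_page()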
The Google documentation says there's a limit of 50k URLs in the sitemaps you send to them, and I want my sitemap to be submitted by an automated job periodically. Therefore, shouldn't I just have the sitemap contain only the N most recent URLs added to my site? Yes, I know you can have multiple sitemaps, and I do have a separate one for the static HTML pages in the site. But I also need one for the database content that may not be reachable in one hop from the main pages, and I don't like the idea of a growing list of sitemaps (it may sound like 50k is more than enough, but I don't want to code with that assumption).
Sure: if you know your previous pages (from an older sitemap.xml, or simply crawled) are already indexed, you should be fine including only new links.
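As a sketch of that "most recent N" approach (the query, table and column names here are made up, just to show the shape of the job):

def recent_entries():
    # Stand-in for the real query; pick a LIMIT comfortably under 50k, e.g.:
    #   SELECT url, updated_at FROM pages ORDER BY updated_at DESC LIMIT 1000
    return [("https://example.com/item/42", "2012-03-01")]

def build_sitemap(entries):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for loc, lastmod in entries:
        lines.append("  <url><loc>{}</loc><lastmod>{}</lastmod></url>".format(loc, lastmod))
    lines.append("</urlset>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_sitemap(recent_entries()))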