Using a connector to crawl content using a sitemap.xml - google-search-appliance

We have a DSpace repository of research publications that the GSA is indexing via a web crawl, i.e. it starts at the homepage and follows all the links.
I'm thinking that using a connector to submit URLs for indexing from the sitemap.xml file might be more efficient. The GSA would then only need to index and recrawl the URLs in the sitemap and could ignore the rest of the site.
The suggestion from the GSA documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.
What do you think?
Thanks,
Georgina.

This might be outdated (so I'm not sure if it still works), but there's an example of a Python connector that will parse a sitemap.xml and send it as a content feed or a metadata-and-URL feed.
Here are two links to help you:
https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py
https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation
If anything, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x.
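If it helps to picture the core logic, here is a minimal sketch (not the connector's actual code) of extracting the URL list from a standard sitemaps.org sitemap.xml; the sitemap location is a placeholder you would adjust for your DSpace instance:

# Minimal sketch: extract the <loc> URLs from a standard sitemaps.org sitemap.xml.
# SITEMAP_URL is a placeholder; point it at your DSpace sitemap.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://localhost:8080/sitemap.xml"
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_sitemap_urls(sitemap_url):
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    # Each <url><loc>...</loc></url> entry holds one crawlable URL.
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

if __name__ == "__main__":
    for url in fetch_sitemap_urls(SITEMAP_URL):
        print(url)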

You can generate sitemaps by running "dspace generate-sitemaps" from the /bin directory. It will generate a sitemaps directory with links to all items in DSpace.
An output example:
<html><head><title>URL List</title></head><body><ul>
<li>http://localhost:8080//handle/123456789/1</li>
<li>http://localhost:8080//handle/123456789/2</li>
<li>http://localhost:8080//handle/123456789/3</li>
<li>http://localhost:8080//handle/123456789/5</li>
</ul></body></html>
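The generated HTML version is just a list of bare URLs inside <li> elements, so extracting them is straightforward. Here is a rough sketch; the file name is an assumption about how your sitemaps directory is laid out:

# Rough sketch: pull the item URLs out of a DSpace-generated HTML URL list like the one above.
from html.parser import HTMLParser

class URLListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Each <li> contains one URL as bare text, with no anchor tag.
        if self.in_li and data.strip():
            self.urls.append(data.strip())

parser = URLListParser()
with open("sitemaps/sitemap_index.html") as f:  # file name is an assumption
    parser.feed(f.read())
print(parser.urls)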

You could easily create a GSA "Feed" that lists the URLs that you want to crawl. However, since your "Follow" patterns must include the host name of your web site, the crawler is going to follow all the pages that are linked from the pages in your feed.
If you truly only want to index the items in your "Site Map" then you should probably look at writing an Adaptor (4.x). You would then be responsible for writing the logic to parse your sitemap.xml file to extract the URLs you want crawled.
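To make the feed option concrete, here is a hedged sketch that wraps a list of URLs in a metadata-and-url feed and pushes it to the appliance. The host name and datasource are placeholders, the port and endpoint are the defaults described in the GSA feeds documentation, and it assumes the third-party requests package is available:

# Hedged sketch: push a metadata-and-url feed built from a URL list to the GSA.
# GSA_HOST and DATASOURCE are placeholders; 19900/xmlfeed are the documented defaults.
from xml.sax.saxutils import quoteattr
import requests

GSA_HOST = "my-gsa.example.com"
DATASOURCE = "sitemap_urls"

def build_feed(urls):
    records = "\n".join(
        '    <record url=%s mimetype="text/html"/>' % quoteattr(u) for u in urls
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
            "<gsafeed>\n"
            "  <header>\n"
            "    <datasource>%s</datasource>\n"
            "    <feedtype>metadata-and-url</feedtype>\n"
            "  </header>\n"
            "  <group>\n%s\n  </group>\n"
            "</gsafeed>\n" % (DATASOURCE, records))

def push_feed(urls):
    # The GSA accepts feeds as a multipart POST with feedtype, datasource and data fields.
    response = requests.post(
        "http://%s:19900/xmlfeed" % GSA_HOST,
        files={
            "feedtype": (None, "metadata-and-url"),
            "datasource": (None, DATASOURCE),
            "data": ("feed.xml", build_feed(urls), "text/xml"),
        },
    )
    response.raise_for_status()

if __name__ == "__main__":
    push_feed(["http://localhost:8080/handle/123456789/1"])

Keep in mind the caveat above: a metadata-and-url feed only tells the crawler which URLs to fetch, so your follow patterns still control what else gets crawled from those pages.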

Related

Indexing Hash Bang #! Content Using Google Search Appliance (GSA)

Has anyone had success indexing content that contains #! (Hashbang) in the URL? If so, how did you do it?
We are hosting a third-party help center that requires the use of #! in the URL; however, we need the ability to index this content within our GSA. We are using version 7.0.14.G.238 of our GSA.
Here's an example of one of our help articles with a hashbang in the URL:
/templates/selfservice/example/#!portal/201500000001006/article/201500000006039/Resume-and-Cover-Letter-Reviews
I understand that #! requires JavaScript, is not the most SEO-friendly approach in the world, and that many popular sites (Facebook, Twitter, etc.) have deprecated its use.
While some JavaScript content is indexed, if you want to make sure there is definitely content in the index for this site you have two options: either make sure the site works without JavaScript, which a lot of JS frontend sites support, or use a content feed to push the data into the GSA instead. Turn off JS in your browser, access the site, and see whether the content links are still generated.
If you have access to the database, you could just send the content straight in. Read up on feeds, which can send data directly into the index, here: http://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/feedsguide/ or read up on connectors in general: https://support.google.com/gsa/topic/2721859?hl=en&ref_topic=2707841
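For the content-feed route specifically, the difference from a metadata-and-url feed is that the document body travels inside the feed record itself, so the GSA never has to fetch the #! URL. A rough sketch of a single-record content feed, with a placeholder datasource, URL, and body (check the feeds guide linked above for the full set of record attributes):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <datasource>helpcenter</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://help.example.com/article/201500000006039" mimetype="text/html">
      <content><![CDATA[<html><body>Resume and Cover Letter Reviews ...</body></html>]]></content>
    </record>
  </group>
</gsafeed>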

Does robots.txt stop Google from indexing my site or the files CodeIgniter uses?

I have a site built in CodeIgniter and I am trying to utilize Google's Webmaster Tools, which tell me to set up a robots.txt file. I want Google to index the whole site but not necessarily the files which make up the site. So I don't want Google to look at the /system/ files or the /application/config/ files, but I do want every page to be indexed. Should I list out each file for Google not to index, or tell it to index all, or tell it to index nothing?
Thanks!
Google only sees the pages/URLs your website makes available. So you don't block files, you block pages. So your robots.txt should contain the URLs you don't want indexed; the files behind the scenes are irrelevant.
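For reference, a minimal robots.txt along those lines could look like this; the Disallow lines for the framework directories are optional belt-and-braces, since CodeIgniter never serves them as pages anyway:

User-agent: *
# Framework directories are not served as pages, but blocking them does no harm.
Disallow: /system/
Disallow: /application/
# Everything else (all routed pages) remains crawlable.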

How to avoid custom url generation from joomla SH404SEF component?

I have been using the Joomla SH404SEF component on my site. My problem is that it generates two URLs for the same content, which causes a problem with the Google search engine since both URLs point to the same content.
Here are examples of the URLs generated by the component:
http://www.mysite.com/page.html - automatic URL
http://www.mysite.com/page/ - custom URL
When I purge URLs from the component options, it eliminates the non-.html URLs from the database, but it creates them again when we post a new page, etc.
Has anybody come across this issue and could give a suggestion on it?
Thanks in advance.
I think you'll find this is Joomla generating the URL, not SH404SEF. It has a habit of generating extras, especially where you have blog-style views, etc. The way I get around this problem is threefold:
1. Create a solid menu structure (now harder on Joomla 2.5, where aliases are created with date/time). This should take care of most issues. Make sure you mark unnecessary levels as noindex-nofollow.
2. Use a third-party tool to mark secondary URLs with a canonical tag (see the example after this list). Look at ITPMetaPro, but many others are available.
3. Work in Webmaster Tools to remove URLs from the index after following steps 1 and 2.
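For point 2, the canonical tag itself is just a link element in the head of the duplicate page pointing at the URL you want indexed, for example (using the URLs from the question):

<link rel="canonical" href="http://www.mysite.com/page.html" />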

In a sitemap, is it advisable to include links to every page on the site, or only ones that need it?

I'm in the process of creating a sitemap for my website. I'm doing this because I have a large number of pages that users can normally only reach via a search form.
I've created an automated method for pulling the links out of the database and compiling them into a sitemap. However, for all the pages that are regularly accessible, and do not live in the database, I would have to manually go through and add these to the sitemap.
It strikes me that the regular pages are the ones that get found anyway by ordinary crawlers, so it seems like a hassle to add those pages manually and then keep the sitemap up to date whenever they change.
Is it bad to just leave those out, if they're already being indexed, and have my sitemap contain only my dynamic pages?
Google will crawl any URLs (as allowed by robots.txt) it discovers, even if they are not in the sitemap. So long as your static pages are all reachable from the other pages in your sitemap, it is fine to exclude them. However, there are other features of sitemap XML that may incentivize you to include static URLs in your sitemap (such as modification dates and priorities).
If you're willing to write a script to automatically generate a sitemap for database entries, then take it one step further and make your script also generate entries for static pages. This could be as simple as searching through the webroot and looking for *.html files. Or if you are using a framework, iterate over your framework's static routes.
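As a rough sketch of that approach (the webroot path, base URL, and the get_database_urls helper are placeholders for your own setup):

# Rough sketch: merge database-driven URLs with static *.html pages found in the
# webroot into one sitemaps.org-format sitemap.
import os
import datetime

WEBROOT = "/var/www/html"            # placeholder
BASE_URL = "http://www.example.com"  # placeholder

def get_database_urls():
    # Placeholder: return the dynamic URLs pulled from your database.
    return []

def get_static_urls():
    urls = []
    for dirpath, _dirnames, filenames in os.walk(WEBROOT):
        for name in filenames:
            if name.endswith(".html"):
                relative = os.path.relpath(os.path.join(dirpath, name), WEBROOT)
                urls.append("%s/%s" % (BASE_URL, relative.replace(os.sep, "/")))
    return urls

def write_sitemap(path="sitemap.xml"):
    today = datetime.date.today().isoformat()
    entries = "\n".join(
        "  <url><loc>%s</loc><lastmod>%s</lastmod></url>" % (url, today)
        for url in get_database_urls() + get_static_urls()
    )
    with open(path, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                "%s\n</urlset>\n" % entries)

if __name__ == "__main__":
    write_sitemap()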
Yes, I think it is bad to leave them out. I think it would also be advisable to look for a way that your search pages can be found by a crawler without a sitemap. For example, you could add some kind of advanced search page where a user can select the search term in a form; crawlers can also fill in those forms.
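A minimal sketch of what such a page could look like; the field names and the /search endpoint are made up, and the key point is that the form submits via GET so each search result lives at a plain, linkable URL:

<form action="/search" method="get">
  <label>Search term:
    <select name="term">
      <option value="widgets">Widgets</option>
      <option value="gadgets">Gadgets</option>
    </select>
  </label>
  <input type="submit" value="Search" />
</form>
<!-- A result page would then live at a crawlable URL such as /search?term=widgets -->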

Should I submit a sitemap with only new links added?

The Google documentation says there's a limit of 50k URLs in sitemaps you send to them, and I want my sitemap to be submitted by an automated job periodically. Therefore, shouldn't I just have the sitemap contain only the N most recent URLs added to my site? Yes, I know you can have multiple sitemaps, and I do have a separate one for the static HTML pages in the site. But I also need one for the database content that may not be reachable in one hop from the main pages, and I don't like the idea of a growing list of sitemaps (it may sound like 50k is more than enough, but I don't want to code with that assumption).
Sure - if you know your previous pages (from an older sitemap.xml or simply crawled) are already indexed, you should be fine including only new links.
