I'm in the process of setting up a Google Mini device to index an ASP.Net site which has a lot of dynamically generated content. I've created a dynamic site.map XML file which lists all of the dynamic URLs. This conforms to the XML sitemap format and is currently being indexed by Google, but it seems to be ignored by the Google Mini device.
I've added the site.map file to "Start crawling from the following URLs". When I view the crawl diagnostics, the site.map file comes up, but none of the dynamic URLs contained within it are being indexed. The Google Mini is only indexing 100 URLs, whereas the site.map contains 10,000.
If I search for a phrase using the test centre, the search results include the site.map file itself rather than the URLs it points to.
Any ideas?
We've just had a consultant come in who has stated that the Google Mini cannot index the URLs contained in a sitemap.xml file. One alternative is to create an HTML page containing all of the links.
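For what it's worth, a minimal sketch of that workaround in Python could look like the following; the input and output file names are placeholders, not taken from the original setup:

# Rough sketch: turn a sitemap.xml into a plain HTML page of links that the
# Google Mini can crawl. File names here are assumptions.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_to_html(sitemap_path, html_path):
    # Collect every <loc> entry from the sitemap
    urls = [loc.text.strip() for loc in ET.parse(sitemap_path).iter(SITEMAP_NS + "loc") if loc.text]
    with open(html_path, "w", encoding="utf-8") as out:
        out.write("<html><head><title>URL List</title></head><body><ul>\n")
        for url in urls:
            out.write("<li><a href=%s>%s</a></li>\n" % (quoteattr(url), escape(url)))
        out.write("</ul></body></html>\n")

sitemap_to_html("site.map", "links.html")

Pointing the Mini's start URLs at the generated links.html (with follow patterns that allow the linked pages) should then let the crawler discover the full 10,000 URLs.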
I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Elastic or Solr. This search engine will serve as the back-end for a web-based search application.
I'm interested in rendering snippets of these documents in the search results for well-known types, such as Microsoft Word and PDF. How would one go about implementing document snippet rendering in search?
I'd be happy with serving up these snippets in any format, including as images. I just want to be able to give my users some kind of formatted preview of their results for well-known types.
Thank you!
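For the extraction-and-indexing part of that pipeline, a rough sketch using Apache Tika (via the tika-python wrapper) and the Elasticsearch Python client might look like this; the index name, field names, and snippet length are illustrative assumptions, and the question of rendering richer previews is left open:

# Sketch: extract text from well-known document types with Apache Tika
# and push it to Elasticsearch. Index and field names are assumptions.
from tika import parser                   # tika-python; starts a local Tika server (needs Java)
from elasticsearch import Elasticsearch   # elasticsearch-py 8.x style client

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

def index_document(path, index_name="documents"):
    parsed = parser.from_file(path)            # returns {"content": ..., "metadata": ...}
    content = (parsed.get("content") or "").strip()
    doc = {
        "path": path,
        "content": content,
        "content_type": parsed.get("metadata", {}).get("Content-Type"),
        "snippet": content[:500],              # crude text preview; image previews would need extra work
    }
    es.index(index=index_name, document=doc)

index_document("reports/example.docx")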
In my GSA front end I have an option that, when clicked, should show only results which don't have any files (PDF or any other).
So what I need is a way to modify my URL so that I get only results with no files. What should the URL parameter be?
Also, is there any reference for whether I can do this through the Google front end?
What do you mean by "show only results that don't have any files"? Do you mean don't show web pages that have embedded PDF documents, or don't show PDF results at all?
As far as the GSA is concerned, a PDF document is the same as an HTML document, and the GSA has no knowledge of whether a page has an embedded attachment.
If you are looking to exclude PDFs, Office files, etc., then you could create a different collection that excludes them, or you could use a different "client" (front end) whose "Remove URLs" setting excludes the URL patterns you don't want.
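To make that concrete: in the GSA search URL the collection is selected with the site parameter and the front end with the client parameter, so your UI option could simply point at a URL built like the one below. The collection and front-end names here are made up for illustration:

# Sketch: build a GSA search URL that targets a collection/front end
# configured to exclude PDF, Office files, etc. Names are assumptions.
from urllib.parse import urlencode

GSA_HOST = "http://gsa.example.com"   # assumed appliance hostname

def no_files_search_url(query):
    params = {
        "q": query,
        "site": "no_files_collection",   # collection whose URL patterns exclude *.pdf, *.doc, ...
        "client": "no_files_frontend",   # front end with the matching Remove URLs settings
        "output": "xml_no_dtd",
    }
    return "%s/search?%s" % (GSA_HOST, urlencode(params))

print(no_files_search_url("annual report"))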
We have a DSpace repository of research publications that the GSA is indexing via a web crawl, i.e. starting at the homepage and following all the links.
I'm thinking that using a connector to submit URLs for indexing from the sitemap.xml file might be more efficient. The GSA would then only need to index and recrawl the URLs in the sitemap and could ignore the rest of the site.
The suggestion from the GSA documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.
What do you think?
Thanks,
Georgina.
This might be outdated (so I'm not sure if it still works), but there's an example of a Python connector that will parse a sitemap.xml and send it as a content feed or metadata feed.
Here are two links to help you:
https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py
https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation
If anything, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x.
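If you end up writing your own, the core of that logic is small. Here's a stripped-down sketch (not taken from sitemap_connector.py) that reads a sitemap.xml and wraps the URLs in a metadata-and-url feed; the datasource name and file paths are assumptions:

# Sketch: read sitemap.xml and turn its URLs into a metadata-and-url feed
# that can be pushed to the appliance. Names and paths are assumptions.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
DATASOURCE = "dspace_sitemap"   # assumed datasource name

def build_feed(sitemap_path):
    urls = [loc.text.strip() for loc in ET.parse(sitemap_path).iter(SITEMAP_NS + "loc") if loc.text]
    records = "\n".join('<record url=%s mimetype="text/html"/>' % quoteattr(u) for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        "<gsafeed><header><datasource>%s</datasource>"
        "<feedtype>metadata-and-url</feedtype></header>\n"
        "<group>\n%s\n</group></gsafeed>\n" % (DATASOURCE, records)
    )

with open("dspace_feed.xml", "w", encoding="utf-8") as f:
    f.write(build_feed("sitemap.xml"))

The resulting file can then be posted to the appliance's feed interface (http://<appliance>:19900/xmlfeed) as described in the GSA feeds documentation; the sitemap_connector.py linked above does essentially this inside the connector framework.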
You can generate sitemaps by running "dspace generate-sitemaps" from the /bin directory. It will generate a sitemaps directory with links to all the items in DSpace.
An output example:
<html><head><title>URL List</title></head><body><ul><li>http://localhost:8080//handle/123456789/1</li>
<li>http://localhost:8080//handle/123456789/2</li>
<li>http://localhost:8080//handle/123456789/3</li>
<li>http://localhost:8080//handle/123456789/5</li>
</ul></body></html>
You could easily create a GSA "Feed" that lists the URLs that you want to crawl. However, since your "Follow" patterns must include the host name of your web site, the crawler is going to follow all the pages that are linked from the pages in your feed.
If you truly only want to index the items in your "Site Map" then you should probably look at writing an Adaptor (4.x). You would then be responsible for writing the logic to parse your sitemap.xml file to extract the URLs you want crawled.
I have a site built in CodeIgniter and I am trying to use Google's Webmaster Tools, which tell me to set up a robots.txt file. I want Google to index the whole site, but not the files which make up the site. So I don't want Google to look at the /system/ files or the /application/config/ files, but I do want every page to be indexed. Should I list out each file for Google not to index, tell it to index everything, or tell it to index nothing?
Thanks!
Google only sees the pages/URLs your website makes available. So you don't block files, you block pages. Your robots.txt should list the URL paths you don't want indexed; the files behind the scenes are irrelevant.
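If you'd still rather be explicit about the framework directories, a minimal robots.txt along these lines would block them while leaving everything else crawlable (assuming /system/ and /application/ are actually reachable as URL paths on your site):

User-agent: *
Disallow: /system/
Disallow: /application/

In a stock CodeIgniter install those directories usually ship with .htaccess rules denying access anyway, so this is belt-and-braces rather than a requirement.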
Can I use the Google Search API to search pages by HTML source? I want to build a different kind of search engine which will place limits on results based on the code of each site. For example, suppose I want to set a parameter which will exclude sites that contain headings. Is this possible?