Can I use the Google Search API to search pages by their HTML source? I want to build a customized search engine that restricts results based on the code of each site. For example, suppose I want a parameter that excludes sites containing headings. Is this possible?
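As a rough illustration of the kind of filter described (purely hypothetical; no Google search API queries pages by their HTML source), one could post-filter a list of result URLs by fetching each page and checking its markup for heading tags:

```python
import requests
from html.parser import HTMLParser

# Hypothetical post-filter: given a list of result URLs (e.g. from a search
# API), fetch each page and drop those whose HTML contains heading tags.
# The heading check mirrors the example in the question; the fetch-and-filter
# step is my own assumption, not a feature of any Google search API.

class HeadingDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.has_heading = True

def exclude_pages_with_headings(urls):
    kept = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        detector = HeadingDetector()
        detector.feed(html)
        if not detector.has_heading:
            kept.append(url)
    return kept
```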
I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Elasticsearch or Solr. This search engine will serve as the back-end for a web-based search application.
I'm interested in rendering snippets of these documents in the search results for well-known types, such as Microsoft Word and PDF. How would one go about implementing document snippet rendering in search?
I'd be happy with serving up these snippets in any format, including as images. I just want to be able to give my users some kind of formatted preview of their results for well-known types.
Thank you!
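A rough sketch of the extract-and-index half of such a pipeline, assuming Apache Tika (via the `tika` Python package) for content extraction and the Elasticsearch 8.x Python client for indexing; the cluster URL, index name, and field layout are illustrative only:

```python
from pathlib import Path

from elasticsearch import Elasticsearch
from tika import parser  # requires a reachable Tika server (auto-started by the package)

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def index_document(path: str, index: str = "documents") -> None:
    """Extract text and metadata from a file and index it for search."""
    parsed = parser.from_file(path)           # Tika handles Word, PDF, etc.
    content = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}
    es.index(
        index=index,
        document={
            "filename": Path(path).name,
            "content": content,
            "content_type": metadata.get("Content-Type"),
            # Storing the first few hundred characters gives the search UI
            # a cheap text snippet to render alongside each hit.
            "snippet": content.strip()[:300],
        },
    )
```

Rendering formatted image previews (thumbnails) of Word or PDF files would need a separate conversion step and is not covered by this sketch.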
In my GSA front end there is an option that, when clicked, should show only results which don't have any files (PDF or any other type).
So what I need is a way to modify my URL so that I get only results with no files. What should the URL parameter be?
Also, is there any reference for whether I can do this through the Google front end?
What do you mean by showing results that don't have any files? Do you mean don't show web pages that have embedded PDF documents, or don't show PDF results at all?
As far as the GSA is concerned, a PDF document is the same as an HTML document, and the GSA has no knowledge of whether a page has an embedded attachment.
If you are looking to exclude PDFs, Office files, etc., you could create a different collection that excludes those, or you could use a different "client" (front end) whose "Remove URLs" setting excludes the URL patterns you don't want.
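As an illustration of the URL-level approach (hedged: this assumes the standard GSA search protocol parameters `q`, `site`, `client`, and `output`, plus the `filetype:` query operator; the host name, collection, and front end names are placeholders), you could append negative `filetype:` terms to the query:

```python
from urllib.parse import urlencode

# Sketch of building a GSA search URL that filters out common file types by
# appending negative "filetype:" terms to the query. The host, collection
# ("site"), and front end ("client") values below are placeholders.
def build_gsa_url(query: str) -> str:
    excluded = ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"]
    q = query + " " + " ".join(f"-filetype:{ext}" for ext in excluded)
    params = {
        "q": q,
        "site": "default_collection",
        "client": "default_frontend",
        "output": "xml_no_dtd",
    }
    return "http://gsa.example.com/search?" + urlencode(params)

print(build_gsa_url("annual report"))
```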
I have a search bar which should show me only PDF files from Google when I start searching for something. Which API can I use for searching Google, and how can I integrate it into my code? Are there any tutorials available for it?
You'll use the Documents List API
You'll search by MIME type.
It'll look like this (but needs to be properly encoded):
GET https://docs.google.com/feeds/default/private/full/-/{http://schemas.google.com/g/2005#kind}application/pdf
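A minimal sketch of issuing that request from Python, assuming you already have a valid OAuth access token for the Documents List API; the token placeholder, the `requests` usage, and the `GData-Version` header are additions beyond the URL shown above:

```python
import urllib.parse

import requests

# The category path {http://schemas.google.com/g/2005#kind}application/pdf
# restricts the feed to PDF documents; it must be URL-encoded as noted above.
category = urllib.parse.quote(
    "{http://schemas.google.com/g/2005#kind}application/pdf", safe=""
)
url = f"https://docs.google.com/feeds/default/private/full/-/{category}"

ACCESS_TOKEN = "..."  # assumed: an OAuth access token with the Docs scope

response = requests.get(
    url,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "GData-Version": "3.0",
    },
)
print(response.status_code)
print(response.text[:500])  # Atom feed of matching PDF documents
```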
I am injecting some text into my pages, but I need to prevent search engines from indexing it. I read that some engines are now able to read this content. How can one prevent them from doing so?
Search engines cannot read Ajax content yet. The closest they come is Google, which supports it if you follow their Ajax crawling specification. But that does require you to use their specification; otherwise Google can't crawl Ajax content.
I'm in the process of setting up a Google Mini device to index an ASP.NET site which has a lot of dynamically generated content. I've created a dynamic site.map XML file which lists all of the dynamic URLs. This conforms to the XML sitemap format and is currently being indexed by Google, but it seems to be ignored by the Google Mini device.
I've added the site.map file to "Start crawling from the following URLs". When I view the crawl diagnostics, the site.map file comes up, but none of the dynamic URLs contained within it are being indexed. The Google Mini device is only indexing 100 URLs, whereas the site.map contains 10,000.
If I search for a phrase using the test centre, the search results include the site.map file itself rather than the URLs it points to.
Any ideas?
We've just had a consultant come in who stated that the Google Mini cannot index the URLs contained in a sitemap.xml file. One alternative solution is to create an HTML page containing all of the links.
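A short sketch of that workaround, assuming the sitemap uses the standard sitemaps.org 0.9 namespace and writing a flat links page the Mini can crawl; the input and output file names are placeholders:

```python
import xml.etree.ElementTree as ET

# Convert a sitemap XML file into a plain HTML page of links that a crawler
# such as the Google Mini can follow. File names below are placeholders.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_to_links_page(sitemap_path: str, html_path: str) -> None:
    tree = ET.parse(sitemap_path)
    urls = [loc.text for loc in tree.getroot().findall(".//sm:loc", NS)]

    with open(html_path, "w", encoding="utf-8") as out:
        out.write("<html><body>\n")
        for url in urls:
            out.write(f'<a href="{url}">{url}</a><br>\n')
        out.write("</body></html>\n")

sitemap_to_links_page("site.map", "links.html")
```

Pointing "Start crawling from the following URLs" at the generated links.html would then let the Mini discover the dynamic URLs by following ordinary anchor tags.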