GSA: How to show results that don't have any documents/attachments in them - google-search-appliance

In my GSA front end I have an option that, when clicked, should show only results that don't have any files (PDF or any other type).
So what I need is a way to modify my URL so that I get only results with no files. What should the URL parameter be?
Also, any reference on whether I can do this through the Google front end would be appreciated.

What do you mean by "show results that don't have any file"? Do you mean don't show web pages that have embedded PDF documents, or don't show PDF results at all?
As far as the GSA is concerned, a PDF document is the same as an HTML document, and the GSA has no "knowledge" of whether a page has an embedded attachment.
If you are looking to exclude PDFs, Office files, etc., then you could create a different collection that excludes those, or you could use a different "client" (front end) that uses the "Remove URLs" setting to exclude the URL patterns you don't want.
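As a rough illustration of the two options above, the search URL can select a collection with the site parameter and a front end with the client parameter. The sketch below only builds such a URL; the host name, the collection name "pages_only" and the front end name "no_files_frontend" are hypothetical placeholders, and the collection or front end would still have to be configured on the appliance to exclude the unwanted URL patterns.

```python
from urllib.parse import urlencode


def build_gsa_search_url(host, query, collection, frontend):
    """Build a GSA search URL limited to one collection and one front end."""
    params = {
        "q": query,
        "site": collection,           # collection to search (hypothetical name below)
        "client": frontend,           # front end whose settings, e.g. Remove URLs, apply
        "proxystylesheet": frontend,  # render results with that front end's stylesheet
        "output": "xml_no_dtd",
    }
    return f"http://{host}/search?{urlencode(params)}"


# Hypothetical host, collection and front end names, for illustration only.
url = build_gsa_search_url(
    "gsa.example.com", "annual report", "pages_only", "no_files_frontend"
)
print(url)
```

The "no files" option in the front end would then link to this URL instead of the default one.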

Related

Word / PDF document snippet rendering in search

I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Elastic or Solr. This search engine will serve as the back-end for a web-based search application.
I'm interested in rendering snippets of these documents in the search results for well-known types, such as Microsoft Word and PDF. How would one go about implementing document snippet rendering in search?
I'd be happy with serving up these snippets in any format, including as images. I just want to be able to give my users some kind of formatted preview of their results for well-known types.
Thank you!

Google Custom Search JSON API webpages only

I am building a search component that allows users to filter by type of response. You can see all responses, just the PDFs, or just the web pages. I have the first two parts down: all responses is a basic search, and you can filter for PDFs using &fileType=pdf in the query, but I'm not sure how to exclude the PDFs and return only web pages.
I can't find a similar "exclude" param such as -fileType, which seems to be supported in other similar APIs. Maybe I just need to format the URL the right way... If anyone has insight into how to accomplish something like this, I would appreciate it.
You can try adding -inurl:pdf to the query in your URL.
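A minimal sketch of how the three filter modes could be turned into request URLs, assuming the standard Custom Search JSON API endpoint with its key, cx and q parameters. The function name, the mode names and the placeholder credentials are made up for illustration. Note that -inurl:pdf is a heuristic: it drops any result whose URL contains "pdf", which is not exactly the same as excluding PDF files.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, terms, mode="all"):
    """Build a Custom Search request URL for one of three filter modes.

    mode is "all" (no filter), "pdf" (PDF results only) or
    "web" (try to exclude PDFs by URL).
    """
    params = {"key": api_key, "cx": engine_id}
    query = terms
    if mode == "pdf":
        params["fileType"] = "pdf"
    elif mode == "web":
        # Append the exclusion operator to the query text itself.
        query = f"{terms} -inurl:pdf"
    params["q"] = query
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; replace with a real API key and search engine ID.
web_only_url = build_search_url("API_KEY", "ENGINE_ID", "solar panels", mode="web")
pdf_only_url = build_search_url("API_KEY", "ENGINE_ID", "solar panels", mode="pdf")
print(web_only_url)
```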

Can I use Google Search API to search pages by HTML source?

Can I use Google Search API to search pages by HTML source? I want to build a different search engine that filters results based on the code of each site. For example, suppose I want to set a parameter which will exclude sites which contain headings. Is this possible?

Converting HTML Files into a PDF

I have a website that displays product information that the client wishes to offer in PDF format. I need a way to dynamically convert a particular HTML page into a PDF document and serve it to the end user on the fly. Does anybody know of a way to do this? There are way too many products to do this manually, and these products receive updates regularly, so a manual approach is out of the question.
EDIT: I forgot to mention that I need this to use either VB.NET or C#.
Have you tried iTextSharp?

Google mini ignoring sitemap

I'm in the process of setting up a Google Mini device to index an ASP.NET site which has a lot of dynamically generated content. I've created a dynamic site.map XML file which lists all of the dynamic URLs. This conforms to the XML sitemap format and is currently being indexed by Google, but it seems to be ignored by the Google Mini device.
I've added the site.map file to "Start crawling from the following URLs". When I view the crawl diagnostics, the site.map file comes up, but none of the dynamic URLs contained within the site.map are being indexed. The Google Mini device is only indexing 100 URLs, whereas the site.map contains 10,000.
If I search for a phrase using the test centre, the search results include the site.map itself and not the URL it points to.
Any ideas?
We've just had a consultant come in who has stated that the Google Mini cannot index the URLs contained in a sitemap.xml file. One alternative solution is to create an HTML page with all of the links within it.
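The HTML page of links could be generated from the existing sitemap rather than maintained by hand. The following is a sketch only, assuming the sitemap uses the standard sitemaps.org namespace; the function name and the sample URLs are made up for illustration. The resulting page would then be listed as a start URL in place of the site.map file.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by the standard XML sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_html(sitemap_xml):
    """Return a plain HTML page with one link per <loc> entry in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    items = "\n".join(
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>' for u in urls
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


# Made-up sample sitemap with two dynamic URLs.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/product.aspx?id=1</loc></url>
  <url><loc>http://www.example.com/product.aspx?id=2</loc></url>
</urlset>"""

links_page = sitemap_to_html(sample_sitemap)
print(links_page)
```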
