Word / PDF document snippet rendering in search - elasticsearch

I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Elastic or Solr. This search engine will serve as the back-end for a web-based search application.
I'm interested in rendering snippets of these documents in the search results for well-known types, such as Microsoft Word and PDF. How would one go about implementing document snippet rendering in search?
I'd be happy with serving up these snippets in any format, including as images. I just want to be able to give my users some kind of formatted preview of their results for well-known types.
Thank you!

Related

Google Custom Search Engine - search by image

Right now I'm using Custom Search Engine (CSE) to search the entire web by search term. The request looks like:
GET "https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=SEARCH_ENGINE_ID&q=SEARCH_TERM"
This request returns a list of search results.
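For reference, the text query above can be sketched in Python as follows. The endpoint and the `key`, `cx` and `q` parameters come from the request shown above; the helper names and the sample response are hypothetical, and the sketch assumes the JSON response carries its results in an `items` list whose entries have a `link` field.

```python
from urllib.parse import urlencode

# Endpoint taken from the request shown above.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key, engine_id, term):
    """Return the request URL for a plain text query (hypothetical helper)."""
    params = {"key": api_key, "cx": engine_id, "q": term}
    return CSE_ENDPOINT + "?" + urlencode(params)


def extract_links(response):
    """Collect result links from an already-parsed JSON response (a dict)."""
    return [item["link"] for item in response.get("items", [])]


url = build_cse_url("API_KEY", "SEARCH_ENGINE_ID", "red bicycle")

# Hypothetical response, reduced to the fields used here.
SAMPLE_RESPONSE = {
    "items": [
        {"title": "One", "link": "http://www.example.com/one"},
        {"title": "Two", "link": "http://www.example.com/two"},
    ]
}
links = extract_links(SAMPLE_RESPONSE)
print(url)
print(links)
```

`urlencode` takes care of escaping the search term, so spaces and special characters in the query do not need to be handled by hand.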
However, I need to implement search by image or image URL. Does the Google API provide such a URL parameter? Maybe something like "image_url", so that the request would look like the following:
https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=SEARCH_ENGINE_ID&image_url=http://www.example.com/image.png
Basically, I need to implement the same functionality as images.google.com/ but using my Custom Search Engine. Thanks.
Example of what I need to implement:
As far as I know, what you're looking for is only supported by the Vision AI service (at least partially; I have never used it myself). The Google CSE API only allows you to search for images, not by image, as you may have noticed.
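The part that is supported, searching for images by keyword, might look like the sketch below. It assumes the Custom Search JSON API's `searchType=image` parameter; the helper name is hypothetical, and nothing here performs a search by image.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_image_search_url(api_key, engine_id, term):
    """Return a CSE request URL that asks for image results (hypothetical helper)."""
    params = {
        "key": api_key,
        "cx": engine_id,
        "q": term,
        # Restricts the results to images; the query itself is still text.
        "searchType": "image",
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)


image_url = build_image_search_url("API_KEY", "SEARCH_ENGINE_ID", "golden gate bridge")

# Parse the URL back to check which parameters were sent.
query = parse_qs(urlsplit(image_url).query)
print(query["searchType"])
print(query["q"])
```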

image search with google

Google had a beautiful API which you can use to search for large images, but unfortunately they decided to disable it. Now you can use their "custom search engine", but it doesn't get even close to what that old API could do. For a start, the results you get are not the same as if you search in the common search page with your browser, and you can't specify the size of the images you are searching for.
Is there any programmatic way to get a list of the URLs of the images I would find on the regular Google search page, size included?
You can scrape the Google image search results and parse out the links to the images. The urllib2 library in Python can help you here.
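A minimal sketch of the parsing half of that suggestion, using only the standard library. In Python 3, urllib2's functionality lives in `urllib.request`, which would handle the download step; here the HTML is a hypothetical sample, and the markup of a real results page will differ and may change without notice.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags that have no usable src
            self.image_urls.append(src)


def extract_image_urls(html):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageLinkParser()
    parser.feed(html)
    parser.close()
    return parser.image_urls


# Hypothetical page content standing in for a downloaded results page.
SAMPLE_HTML = (
    '<div><img src="http://www.example.com/a.png">'
    '<img alt="no source">'
    '<img src="http://www.example.com/b.jpg"></div>'
)
image_urls = extract_image_urls(SAMPLE_HTML)
print(image_urls)
```

Note that this only yields the URLs present in the markup; image dimensions would have to come from `width`/`height` attributes if the page provides them, or from downloading each image.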

Docx generation - reuse

I'm looking to generate DOCX and PDF documents in my Java application. The best, most cost-effective solution seems to be xdocreport. I've started using it and it's good.
However, xdocreport doesn't seem to allow reuse of common sections across documents.
E.g.
I want to create two documents: an order and an invoice. Both have a customer section which should be identical. It would be nice if I could maintain a single customer template that can be applied to both documents.
Are there any libraries (free or paid) that have this functionality?
The commercial product Docmosis can create DOCX and PDF and has an insert/merge capability, meaning you can put common content into one template and merge, reference, or insert it into other templates. It has a Java API, and you can try the cloud service without installing anything to see if it suits your purposes.
Please note I work for Docmosis.
I hope that helps.

Can I use Google Search API to search pages by html source?

Can I use Google Search API to search pages by html source? I want to make a different search engine which will set limits on the engine based on the code of each site. For example, suppose I want to set a parameter which will exclude sites which contain headings. Is this possible?

Google mini ignoring sitemap

I'm in the process of setting up a Google Mini device to index an ASP.Net site which has a lot of dynamically generated content. I've created a dynamic site.map XML file which lists all of the dynamic URLs. This conforms to the XML sitemap format and is currently being indexed by Google, but seems to be ignored by the Google Mini device.
I've added the site.map file to the "Start crawling from the following URLs" setting. When I view the crawl diagnostics, the site.map file comes up, but none of the dynamic URLs contained within it are being indexed. The Google Mini device is only indexing 100 URLs, whereas the site.map contains 10,000.
If I search for a phrase using the test centre, the search results include the site.map itself and not the URL it points to.
Any ideas?
We've just had a consultant come in who has stated that the Google Mini cannot index the URLs contained in a sitemap.xml file. One alternative solution is to create an HTML page with all of the links within it.
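A rough sketch of that workaround in Python: read the sitemap XML and emit a plain HTML page of links that a crawler can follow. It assumes the file uses the standard sitemaps.org namespace; the function name and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by the standard sitemap format (assumed for this sketch).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_html(sitemap_xml):
    """Turn sitemap XML into an HTML page with one link per <loc> entry."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(u, quote=True))
        for u in urls
    )
    return "<html><body><ul>\n" + items + "\n</ul></body></html>"


# Hypothetical two-entry sitemap standing in for the real 10,000-URL file.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/page.aspx?id=1</loc></url>
  <url><loc>http://www.example.com/page.aspx?id=2&amp;lang=en</loc></url>
</urlset>"""

page = sitemap_to_html(SAMPLE_SITEMAP)
print(page)
```

The generated page could then be listed as a start URL in place of the site.map file. With 10,000 links it may be worth splitting the output across several pages, although whether the device limits links per page is not something the source states.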
