Elasticsearch - TikaOnDotNet Text Extraction page by page - elasticsearch

We are exploring Elasticsearch and are currently extracting text from MS Office documents, PDFs, .eml files and other file formats using TikaOnDotNet.
We want to store the document content page by page in Elasticsearch, so that we can tell users that the keyword they were looking for is on page number x.
I am not sure whether this is possible. If you could share your thoughts or point me in the right direction, it would be greatly appreciated.
Regards,
Hiten
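One possible direction, sketched under assumptions rather than taken from TikaOnDotNet or an existing answer: index one Elasticsearch document per page, so that every search hit carries its own page number. The sketch below assumes the text has already been split into one string per page (how to obtain that from the extractor is not shown), and the index name and the field names `documentId`, `page` and `content` are hypothetical. It only builds the newline-delimited JSON body that the `_bulk` endpoint accepts.

```typescript
// One Elasticsearch document per page, so a hit can report its page number.
// Field names are illustrative assumptions, not a required schema.
interface PageDoc {
  documentId: string; // identifier of the source file
  page: number;       // 1-based page number
  content: string;    // text of this page only
}

// Turn already-separated page texts into per-page documents.
function toPageDocs(documentId: string, pageTexts: string[]): PageDoc[] {
  return pageTexts.map((content, i) => ({ documentId, page: i + 1, content }));
}

// Build a bulk request body: an action line followed by a source line
// for each page, separated by newlines and ending with a newline.
function toBulkBody(indexName: string, docs: PageDoc[]): string {
  const lines: string[] = [];
  for (const doc of docs) {
    const id = `${doc.documentId}_${doc.page}`;
    lines.push(JSON.stringify({ index: { _index: indexName, _id: id } }));
    lines.push(JSON.stringify(doc));
  }
  return lines.join("\n") + "\n";
}
```

For example, `toBulkBody("pages", toPageDocs("report.pdf", pageTexts))` produces a body that can be posted to `_bulk`. A match query on `content` then returns hits whose `page` field says where the keyword was found.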

Related

Change image next to Website in Google search results

Good morning,
a customer of ours asked us if it was possible to change the image that Google shows next to his site in Google search results.
After several searches, we tried using different techniques all followed by re-indexing the page in order to instantly see the results.
We tried using structured data (both with ld+json and with microdata) and also the "og:image" and "og:title" attributes in the "meta" tags, but none of these tests changed the image displayed on the right side next to the site in Google results.
We expected that one of these methods would change the image, but nothing happened.
Therefore, we wondered whether it was possible to change that image or whether Google chose the best image based on its search parameters.
Thank you for your valuable help,
Best regards

Extract the screenshot page where the text is found in azure cognitive search

I have PDF documents stored in Azure Blobs that are indexed with Azure Search. I am searching for text in the content of the PDFs and everything works correctly. When I perform the search, is it possible that Azure returns a screenshot of the page where the text was found?
For example, if I search for the word 'information', which is on page 2 of a PDF, Azure would return a screenshot of that page.
Thanks.
You can find an example of this in the JFK sample. The sample uses an image store custom skill that is used to extract the images and an HOCR skill to extract the data necessary to overlay zones corresponding to the text. The full skillset can be found here.
The front-end can then use that data to build an HOCR viewer component.
I encourage you to read through the sample code to get the full details, which wouldn't fit in a Stack Overflow response.

Create Index page for ASCII doc

I have a lot of ASCII docs at different locations and I want to create an index page that links to these documents. The condition is that the index page should list links to all the documents, and a document should be displayed only when the user clicks on its link. I don't want to display the documents below the table of contents. I just want to display the table of contents on the index page.
Is there any way to do this?
If I understand you correctly, you wish to generate a multi-document website, but you want an index page that displays just the TOC, with the other documents served elsewhere. I believe the best way to get this effect would be to generate chunked XHTML output using the DocBook toolchain. I believe this should be possible with Asciidoctor tools, but I have only implemented this particular post-rendering toolchain with the original (Python-based) AsciiDoc rendering tool, as documented here. This setup is configurable to generate a TOC index page that links to chunked output (you can configure the level of chunking).
As you have already figured out, AsciiDoc's automated TOC generation only works on the present document, which requires including the subordinate documents to get their headings for the TOC. I can think of ways to sort of game this, such as including just the heading of each included document (include::path/to/document.adoc[lines=1]) and then hiding even those headings with CSS or something. The problem is that the links in the TOC will point internally, so you'd need to handle that somehow.
Another way is to use any of the static-site generators that support or can be readily extended to support AsciiDoc. What you're talking about is not an out-of-the-box feature that I'm aware of, but they all at least make it possible to generate an organized TOC-type navigation.
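As a rough illustration of the simplest fallback, which is none of the toolchains mentioned above but a small script that writes the index page itself, the sketch below generates an AsciiDoc page containing only a list of links. It assumes each document is rendered separately and served at a known URL. The `DocEntry` shape and the `buildIndexPage` name are hypothetical.

```typescript
// One entry per separately rendered document.
interface DocEntry {
  title: string; // link text shown on the index page
  href: string;  // where the rendered document is served
}

// Build an AsciiDoc index page that contains only a list of links,
// using the link: macro so no document body is included.
// Note: titles containing "]" would need escaping, which is not handled here.
function buildIndexPage(pageTitle: string, entries: DocEntry[]): string {
  const items = entries.map((e) => `* link:${e.href}[${e.title}]`);
  return [`= ${pageTitle}`, "", ...items, ""].join("\n");
}
```

Rendering the generated page gives a table-of-contents-style list, and each document opens only when its link is clicked. Keeping the entries in a single list also makes the ordering of the index explicit.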

Pass an image get a list of URLs matching the image, HOW?

I'm essentially trying to do a reverse image search, i.e. I want to pass in an image and get back a results list of instances on the web where that image is found. I know Google's old API that did this is deprecated. I see some answers on SO (e.g. Google custom search for images only) that talk about doing an image search with Google's Custom Search API, but every time I dig into the code they are retrieving images from a string rather than doing what I'm trying to do. Is there currently any API that will help me with this?
I'm sorry. I cannot write comments yet. How about this? https://github.com/tanaikech/goris
Recently, I found this. I don't know whether this is what you want.

Adobe InDesign Server examples

I'm new to Adobe InDesign Server and I'm having a hard time finding a good kitchen sink app. All the examples I got from the SDK seem to only partially work. All I'm trying to do is use a master page from InDesign on the server side and edit certain text fields, for example placing a first and last name in particular text fields. Does anyone know of a good place to get example code that shows all the features, or how I would approach this problem?
http://www.adobe.com/devnet/indesign/documentation.html#idserver has a lot of resources that are useful when starting out. In particular, http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs55-docs/InDesignServer/ids-solutions.pdf includes a number of code examples for various common operations.
As to your specific example, the typical way to go about it is:
1. Get the page object from the master pages list.
2. Iterate over each text field on the page.
3. Somehow identify the fields, for example by setting the script label in the template document and checking the labels of each text field you iterate through.
4. Set the contents of the text field.
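Steps 2 to 4 can be sketched roughly as follows. This is not InDesign Server code as such: plain objects with `label` and `contents` properties stand in for the text frames on the master page, and the function name and the `values` map are hypothetical. In a real script the frames would come from the page object obtained in step 1.

```typescript
// Minimal stand-in for a text frame: only the two properties used here.
interface LabeledFrame {
  label: string;    // script label set in the template document
  contents: string; // text shown in the frame
}

// Walk the frames, match each script label against the supplied values,
// and set the contents of every frame that has a matching label.
// Returns the number of frames that were filled.
function fillLabeledFrames(
  frames: LabeledFrame[],
  values: Record<string, string>
): number {
  let filled = 0;
  for (const frame of frames) {
    // Own-property check so unlabeled or unknown frames are left untouched.
    if (Object.prototype.hasOwnProperty.call(values, frame.label)) {
      frame.contents = values[frame.label];
      filled++;
    }
  }
  return filled;
}
```

Calling `fillLabeledFrames(frames, { firstName: "Ada", lastName: "Lovelace" })` would fill the frames labeled `firstName` and `lastName` and leave every other frame as it is. Identifying fields by label keeps the script independent of the order of frames on the page.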
A lot of the official InDesign documentation is partial.
Jongware also hosts the complete reference documentation:
http://www.jongware.com/idjshelp.html
The reason the IDS documentation isn't that exhaustive is probably that working with the server version is an extension of classic InDesign use. So, with the exception of some peculiarities detailed in the IDS SDK docs, you will find most of the help you need in the InDesign scripting guides ;)

Resources