Extract paragraph or sentence from pdf azure cognitive search - azure-blob-storage

I have a blob container where I store PDF files, and I am using Azure Cognitive Search to search for words or content in those PDFs. When I search for a word or sentence that appears in one of the PDFs in the container, Azure Cognitive Search returns the entire content of that PDF.
Is there a way to extract only the sentence or paragraph of the PDF where the word or sentence appears?
Is there a way to highlight the search input wherever it appears in the PDF?
Am I using the correct service for these two points?

Yes, there is a feature for exactly what you are looking for: hit highlighting, enabled through the highlight search parameter.
You only need to highlight the Content field:
POST /indexes/hotels-sample-index/docs/search?api-version=2020-06-30
{
    "search": "sandy beaches",
    "highlight": "Content"
}
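As a rough sketch of how a client might work with this, the Python below builds the same request body and pulls the highlighted fragments out of a response. The `@search.highlights` key and the `highlightPreTag`/`highlightPostTag` parameters belong to the Search REST API; the helper names and the sample response are made up for illustration, and no network call is made.

```python
import json


def build_highlight_query(search_text, field="Content",
                          pre_tag="<em>", post_tag="</em>"):
    """Return the JSON body for a search request with hit highlighting."""
    return {
        "search": search_text,
        "highlight": field,
        "highlightPreTag": pre_tag,
        "highlightPostTag": post_tag,
    }


def extract_highlights(response, field="Content"):
    """Collect the highlighted fragments for `field` from every hit."""
    fragments = []
    for hit in response.get("value", []):
        fragments.extend(hit.get("@search.highlights", {}).get(field, []))
    return fragments


# Hypothetical response, trimmed to the parts used here.
sample_response = {
    "value": [
        {
            "@search.score": 1.2,
            "Content": "...full text of the PDF...",
            "@search.highlights": {
                "Content": ["A short walk to <em>sandy</em> <em>beaches</em>."]
            },
        }
    ]
}

body = json.dumps(build_highlight_query("sandy beaches"))
snippets = extract_highlights(sample_response)
```

Each fragment is a short passage around the match, with the matched terms wrapped in the chosen tags, so the front end can show the fragment instead of the whole Content field.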

Related

How do I search in Rich Text with HyGraph API?

I have an FAQ model in HyGraph, and I want to search for a text and check whether it is contained in the Question and in the Answer. Question is a single-line text field, which works without problems, but Answer is rich text, and I am not able to search it through the API.
Any ideas?

JMeter Document upload - Content validation in converted pdf document

With regard to JMeter document upload and download,
I would like to know: can we validate and do an apples-to-apples content comparison (e.g. same text, spacing, lines, images, etc.) of a PDF document that was converted using LibreOffice/PDFBox in document upload scenarios, for different source document types such as Doc/Docx/Text/Jpg/Png/Rtf?
Scenario:
- Upload a Docx document (the document should be converted to PDF format, and the user can view it as a PDF)
- View the Docx document in PDF format after the upload
- Compare the Docx contents (e.g. text, spacing, lines, images, etc.) with the PDF document to check whether they are the same

Take a look at the Apache Tika project: if you add tika-app.jar to the JMeter classpath, you will be able to use the Document (text) "Field to Test" option of the Response Assertion, so you can check the document content against a reference text.
If that is not sufficient for your needs, take a look at JSR223 Test Elements, the Groovy language, and the Apache POI and Apache PDFBox APIs.
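To illustrate the idea of checking extracted document text against a reference text, here is a minimal sketch. It is written in Python for illustration rather than as a JSR223/Groovy script, it assumes the text has already been extracted from both documents, and the function names are hypothetical. It collapses whitespace before comparing, so it covers text content only, not exact spacing or images.

```python
import difflib


def normalize(text):
    """Collapse whitespace runs on each line and drop empty lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def compare_text(reference, extracted):
    """Return unified-diff lines; the list is empty when the texts match."""
    return list(difflib.unified_diff(
        normalize(reference), normalize(extracted),
        fromfile="reference", tofile="extracted", lineterm=""))
```

An empty result means the normalized texts are identical; otherwise the diff lines show which lines were lost or changed in the conversion.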

Extract the screenshot page where the text is found in azure cognitive search

I have PDF documents stored in Azure Blobs that are indexed with Azure Search. I am searching for text in the content of the PDFs, and everything works correctly. When I perform the search, is it possible for Azure to return a screenshot of the page where the text was found?
For example, if I search for the word 'information', which is on page 2 of a PDF, Azure would return a screenshot of that page.
Thanks.

You can find an example of this in the JFK sample. The sample uses an image store custom skill to extract the images and an HOCR skill to extract the data needed to overlay zones corresponding to the text. The full skillset is part of the sample.
The front end can then use that data to build an HOCR viewer component.
I encourage you to read through the sample code to get the full details, which wouldn't fit in a Stack Overflow response.

Google apps API, is it possible to search the text of a presentation?

I'd like to produce a list of all of the words that appear in a Google Docs presentation. I thought that the API would allow this, but it seems that only the Spreadsheets API allows searching the contents of a document?

This is correct: you can't get the content of a presentation with the Documents List API. However, you can easily download an exported version of a presentation, for example:
GET https://docs.google.com/feeds/download/presentations/Export
?docID=0AsJD12345&exportFormat=txt
You can use plain text output and just split up the words.
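A minimal sketch of the word-splitting step, assuming the exported plain text has already been downloaded into a string; the function name is hypothetical, and treating any run of letters, digits, or underscores as a word is just one possible definition.

```python
import re


def unique_words(text):
    """Return the distinct words in `text`, lowercased and sorted."""
    return sorted(set(re.findall(r"\w+", text.lower())))
```

For example, `unique_words("Hello, world! Hello again.")` gives `["again", "hello", "world"]`.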

Searchable PDF Files (Image+Text PDF) validation

I consider a PDF document searchable if I can get some text from every single page in the PDF.
But checking every page seems to take forever when I try to extract text from a PDF that contains somewhere between 500 and 2000 pages.
Is it possible for a PDF to contain text on one page but not on the rest?
What I am trying to do here is this: if the first page of the PDF contains text, then it is a searchable PDF; otherwise it is not.

Yes, it is very possible for a PDF to contain text on one page but not on the rest. You could very well have a 500-page PDF that contains only images on the first 499 pages but contains text on the last page.
Unless you want to open the PDF file yourself and scan it for text/text operations, you will need to use an existing third-party PDF library that allows you to extract text from a PDF.
Also, see Ferruccio's response to a related question, which is to use the IFilter interface, specifically made for search indexing and text extraction.
Try this version of Searcharoo, which lets you search Word and PDF documents.
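
Since text on the first page says nothing about the other pages, the check has to look at each page, but it can stop at the first page that has no text. The sketch below assumes some PDF library yields the extracted text of each page lazily; that extraction step is not shown, and the function names are hypothetical.

```python
def first_page_without_text(page_texts):
    """Return the 1-based number of the first page with no text, or None."""
    for number, text in enumerate(page_texts, start=1):
        # Treat None and whitespace-only strings as "no text".
        if not (text or "").strip():
            return number
    return None


def is_searchable(page_texts):
    """True when every page yielded some text."""
    return first_page_without_text(page_texts) is None
```

When `page_texts` is a generator, pages after the first empty one are never extracted, which keeps the cost low for documents that fail early.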
