Indexing PDF file in Elasticsearch using Java code

I am trying to index PDF files in Elasticsearch 6.3.2 using Java code. So far I have written the following code to save the PDF in ES. The code works, and I am able to save the Base64-encoded string of my PDF in ES. I want to understand whether the approach I am following is correct. Is there a better way of doing it?
Following is my code:
InputStream inputStream = new FileInputStream(new File("mypdf.pdf"));
try {
    byte[] fileByteStream = IOUtils.toByteArray(inputStream);
    // Encode the raw PDF bytes once; a second Base64 pass would be redundant
    String strEncoded = Base64.getEncoder().encodeToString(fileByteStream);
    inputStream.close();
    JSONObject correspondenceNode = new JSONObject();
    correspondenceNode.put("data", strEncoded);
    String strJsonValues = correspondenceNode.toString();
    HttpEntity entity = new NStringEntity(strJsonValues, ContentType.APPLICATION_JSON);
    elasticrestClient.put("/2018/documents/1", entity);
} catch (IOException e) {
    e.printStackTrace();
}
Basically what I am doing here is converting the PDF document into a Base64 string and saving it in ES, and when reading I convert it back.
Following is the code for decoding:
String responseBody = elasticrestClient.get("/2018/documents/1");
//some code to fetch the hits
JSONObject h = hitsArray.getJSONObject(0);
JSONObject source = h.getJSONObject("_source");
String object = source.getString("data");
// Decode once, matching the single encoding step above
byte[] decodedBytes = Base64.getDecoder().decode(object);
FileOutputStream fos = new FileOutputStream("download.pdf");
fos.write(decodedBytes);
fos.close();

This might be a correct way to store Base64 content in Elasticsearch, but a few pieces might be missing here:
You are not "indexing" the PDF per se in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF (see the sketch after this list).
You did not mention the mapping you are using. If you really want to keep the binary content around, you might want to define the Base64 field as a binary data type (also sketched below).
It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
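For illustration, here is a minimal sketch of those first two points, reusing the elasticrestClient wrapper from the question (the pipeline name, field names, and the 6.x-style mapping endpoint are assumptions):
// Sketch: define an ingest pipeline that runs the attachment processor
// (requires the ingest-attachment plugin) over the Base64 "data" field
String pipeline = "{\"description\":\"Extract PDF text\","
        + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";
elasticrestClient.put("/_ingest/pipeline/attachment",
        new NStringEntity(pipeline, ContentType.APPLICATION_JSON));

// Sketch: map the "data" field as binary so Elasticsearch stores it verbatim
// (binary fields are stored but not searchable)
String mapping = "{\"properties\":{\"data\":{\"type\":\"binary\"}}}";
elasticrestClient.put("/2018/_mapping/documents",
        new NStringEntity(mapping, ContentType.APPLICATION_JSON));

// Index the document through the pipeline so the extracted text lands in attachment.content
elasticrestClient.put("/2018/documents/1?pipeline=attachment", entity);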
Rather than storing the binary in Elasticsearch, I'd extract the text and metadata and index that plus a URL to the binary itself, like:
{
  "content": "Extracted text here",
  "meta": {
    // Metadata there
  },
  "url": "file://path/to/file"
}
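A sketch of that extract-and-index approach using Apache Tika (my example, not FSCrawler's code), again reusing the elasticrestClient helper from the question:
Tika tika = new Tika(); // org.apache.tika.Tika
try (InputStream is = new FileInputStream("mypdf.pdf")) {
    // parseToString extracts the plain text (throws TikaException on parse failure)
    String content = tika.parseToString(is);
    JSONObject doc = new JSONObject();
    doc.put("content", content);
    doc.put("url", "file:///path/to/mypdf.pdf");
    elasticrestClient.put("/2018/documents/1",
            new NStringEntity(doc.toString(), ContentType.APPLICATION_JSON));
}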
You can also look at FSCrawler (including its code) which does basically that.

Related

Tomcat Performance with Spring Boot API for File Upload

I have a Spring Boot API and one of the endpoints allows users to upload videos. My controller takes the file as a MultipartFile, and I store it in a temp folder accessible to Tomcat. Once I have it stored on disk, I push the video to an S3 bucket.
To me, this seems less than optimal: if I wanted to have 100 or 1000 users uploading at once, it seems really non-performant to write the files to disk first.
As a little background, I'm storing it on disk with the intention that if there is an issue pushing to S3 I can retry.
The code below might show what I'm doing better than the above:
public Video addVideo(@RequestParam("title") String title,
        @RequestParam("description") String description,
        @RequestParam(value = "file", required = true) MultipartFile file) {
    this.amazonS3ClientService.uploadFileToS3Bucket(file, title, description);
    // ... build and return the Video (elided in the question)
}
Method for storing the video file:
String fileNameWithExtension = awsS3FileName + "." + FilenameUtils.getExtension(multipartFile.getOriginalFilename());
// creating the file in the server (temporarily)
File file = new File(tomcatTempDir + fileNameWithExtension);
FileOutputStream fos = new FileOutputStream(file);
fos.write(multipartFile.getBytes());
fos.close();
PutObjectRequest putObjectRequest = new PutObjectRequest(this.awsS3Bucket, awsS3BucketFolder + uniqueId + "/" + fileNameWithExtension, file);
if (enablePublicReadAccess) {
    putObjectRequest.withCannedAcl(CannedAccessControlList.PublicRead);
}
// Upload a file as a new object with ContentType and title specified
amazonS3.putObject(putObjectRequest);
// removing the file created in the server
file.delete();
So my question is: is there a better way in Tomcat to:
A) Take in a file via a controller
B) Push it to S3
There is no other way to do it with multipart. The problem with multipart is that, to properly segment the parts from the request, they sometimes need to be skipped or be repeatable. That is impossible in memory without memory exploding. Therefore, Commons FileUpload caches them on disk after a certain threshold is reached.
Multipart requests are the worst way to do this. I highly recommend using either PUT or POST with content type application/octet-stream. You can take the bare request input stream and pass it to HttpClient to stream to your backend server. I did this 5 years ago and it works for gigabytes. I posted the solution on the Apache HttpClient mailing list.
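A minimal sketch of that octet-stream recommendation in a Spring MVC controller, streaming the raw body to S3 (the mapping path and object key are assumptions; amazonS3 and awsS3Bucket are the fields from the question):
@PostMapping(value = "/videos", consumes = "application/octet-stream")
public ResponseEntity<Void> uploadVideo(HttpServletRequest request) throws IOException {
    ObjectMetadata metadata = new ObjectMetadata();
    // Passing the length up front lets the AWS SDK stream instead of buffering the body
    metadata.setContentLength(request.getContentLengthLong());
    try (InputStream body = request.getInputStream()) {
        amazonS3.putObject(new PutObjectRequest(awsS3Bucket, "videos/some-key", body, metadata));
    }
    return ResponseEntity.ok().build();
}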
There is one possibility for how multipart could still work, under specific conditions:
All parts are in the correct physical order you want to read
Your write to the backend is fast enough to sustain the read from the front
Consume the root part, then go over to the next physical one, processing the request body lazily (a sketch follows below). JAX-WS RI (Metro) handles multipart requests for XOP/MTOM very nicely. Learn from that, because you won't be able to make it any better.
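For illustration, a sketch of that lazy processing with Commons FileUpload's streaming API, under the two conditions above (pushToBackend is a hypothetical helper, not part of the library):
// The streaming API involves no DiskFileItemFactory, so nothing is cached on disk
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iterator = upload.getItemIterator(request);
while (iterator.hasNext()) {
    FileItemStream item = iterator.next();
    try (InputStream partStream = item.openStream()) {
        // Each part can be read only once, in physical order, so push it
        // to the backend (e.g. S3) before advancing the iterator
        pushToBackend(item.getFieldName(), partStream); // hypothetical helper
    }
}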
Perhaps you can try to direct stream the input stream from your MultipartFile to S3.
Consider the following uploadFileToS3Bucket method:
public PutObjectResult uploadFileToS3Bucket(InputStream input, long size, String title, String description) {
    // Indicate the length of the information to avoid the need to compute it by the AWS SDK
    // See: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html#PutObjectRequest-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-
    ObjectMetadata objectMetadata = new ObjectMetadata();
    objectMetadata.setContentLength(size); // relies on the size reported by Spring; input.available() might also work
    // compute the object name as appropriate
    String key = "...";
    PutObjectRequest putObjectRequest = new PutObjectRequest(
            this.awsS3Bucket, key, input, objectMetadata
    );
    // The rest of your code
    if (enablePublicReadAccess) {
        putObjectRequest.withCannedAcl(CannedAccessControlList.PublicRead);
    }
    // Upload a file as a new object with ContentType and title specified
    return this.amazonS3.putObject(putObjectRequest);
}
Of course, you need to provide the service with the input stream obtained from the client request associated with the MultipartFile object:
public Video addVideo(
        @RequestParam("title") String title,
        @RequestParam("description") String description,
        @RequestParam(value = "file", required = true) MultipartFile file) throws IOException {
    try (InputStream input = file.getInputStream()) {
        this.amazonS3ClientService.uploadFileToS3Bucket(input, file.getSize(), title, description);
    }
    // ... build and return the Video (elided)
}
You can probably also play with the getBytes method of MultipartFile and create a ByteArrayInputStream to perform the operation.
In addVideo:
byte[] bytes = file.getBytes();
In uploadFileToS3Bucket:
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(bytes.length);
PutObjectRequest putObjectRequest = new PutObjectRequest(
this.awsS3Bucket, key, new ByteArrayInputStream(bytes), objectMetadata
);
I would prefer the first solution, but try to determine which option offers you the best performance.

Google Cloud Search - db.blobColumns

I'm trying to understand the property db.blobColumns in the database connector. I've got essentially a massive string of 500,000 characters and I want to use db.blobColumns to upload this text. From the name "blob", I am assuming that it expects a binary large object? If anyone has used this property before for large text files, please help! I'm at a loss with this particular situation.
Here are the docs: https://developers.google.com/cloud-search/docs/guides/database-connector#content-fields
I have tried using the db.blobColumn field with database BLOB (binary) content and it works well, extracting text from the file and doing OCR if it's an image. And yes, it also accepts text content in the form of the database's CLOB type.
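For orientation, a sketch of how the pieces of a connector configuration fit together when the content comes from a CLOB column (the table, column, and exact property spellings here are illustrative; the linked docs are authoritative):
db.allRecordsSql = SELECT id, title, big_text FROM documents
db.uniqueKeyColumns = id
db.blobColumn = big_text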
I suggest you take a look at the code of the database connector here. The two main files that matter are DatabaseAccess.java and DatabaseRepository.java.
private ByteArrayContent createBlobContent(Map<String, Object> allColumnValues) {
    byte[] bytes;
    Object value = allColumnValues.get(columnManager.getBlobColumn());
    if (value == null) {
        return null;
    } else if (value instanceof String) {
        bytes = ((String) value).getBytes(UTF_8);
    } else if (value instanceof byte[]) {
        bytes = (byte[]) value;
    } else {
        throw new InvalidConfigurationException( // allow SDK to send dashboard notification
            "Invalid Blob column type. Column: " + columnManager.getBlobColumn()
                + "; object type: " + value.getClass().getSimpleName());
    }
    return new ByteArrayContent(null, bytes);
}
The code snippet above, from the DatabaseRepository.java file, is responsible for generating the blob content (binary) that is pushed to Cloud Search. As the instanceof checks show, the content of a CLOB or BLOB column reaches this function as a String or a byte[] and is pushed as-is to Cloud Search.
Note from here:
Google Cloud Search will only index the first 10 MB of your content,
regardless of whether it's a text file or binary content.

Apache Commons CSV parser: Not able to read the values

I am using the Apache Commons CSV parser to convert a CSV to a map. I couldn't read some values in the map through the IntelliJ debugger: if I manually type map.get("key"), the value is null. However, if I copy-paste the key from the map, I get the data. I can't understand what is going wrong. Any pointers would help. Thanks.
Here is my CSV parser code:
private CSVParser parseCSV(InputStream inputStream) {
    System.out.println("What is the encoding " + new InputStreamReader(inputStream).getEncoding());
    try {
        return new CSVParser(new InputStreamReader(inputStream), CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreHeaderCase()
                .withSkipHeaderRecord()
                .withTrim());
    } catch (IOException e) {
        throw new IPRSException(e);
    }
}
There was a weird character at the start of the strings (reference: Reading UTF-8 - BOM marker). The following helped resolve the issue:
header = header.replace("\uFEFF", "");
In Java, use a UnicodeReader:
String path = "demo.csv";
CSVFormat.Builder builder = CSVFormat.RFC4180.builder();
CSVFormat format = builder.setQuote(null).setHeader().build();
InputStream in = new FileInputStream(new File(path));
// UnicodeReader (a common utility class, not part of the JDK) skips any BOM
CSVParser parser = new CSVParser(new BufferedReader(new UnicodeReader(in)), format);
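Alternatively, if no UnicodeReader implementation is at hand, commons-io's BOMInputStream gives a similar effect; a sketch, assuming commons-io is on the classpath:
// org.apache.commons.io.input.BOMInputStream consumes a leading UTF-8 BOM if present
InputStream in = new BOMInputStream(new FileInputStream(path));
CSVParser parser = new CSVParser(new InputStreamReader(in, StandardCharsets.UTF_8), format);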

Send an image rather than a link

I'm using the Microsoft Bot Framework with Cognitive Services to generate images from a source image that the user uploads via the bot. I'm using C#.
The Cognitive Services API returns a byte[] or a Stream representing the treated image.
How can I send that image directly to my user? All the docs and samples seem to point to me having to host the image at a publicly addressable URL and send a link. I can do this, but I'd rather not.
Does anyone know how to simply return the image, kind of like the Caption Bot does?
You should be able to use something like this:
var message = activity.CreateReply("");
message.Type = "message";
message.Attachments = new List<Attachment>();
var webClient = new WebClient();
byte[] imageBytes = webClient.DownloadData("https://placeholdit.imgix.net/~text?txtsize=35&txt=image-data&w=120&h=120");
string url = "data:image/png;base64," + Convert.ToBase64String(imageBytes);
message.Attachments.Add(new Attachment { ContentUrl = url, ContentType = "image/png" });
await _client.Conversations.ReplyToActivityAsync(message);
The image source of an HTML image element can be a data URI that contains the image directly, rather than a URL for downloading it. The following overloaded functions take any valid image and encode it as a JPEG data URI string that may be assigned directly to the src property of an HTML image element. If you know the format of the returned image ahead of time, you might save some processing by skipping the JPEG re-encoding and just returning the image encoded as Base64 with the appropriate data URI prefix.
public string ImageToBase64(System.IO.Stream stream)
{
    // Create bitmap from stream
    using (System.Drawing.Bitmap bitmap = System.Drawing.Bitmap.FromStream(stream) as System.Drawing.Bitmap)
    {
        // Save to memory stream as JPEG to set a known format. Could also use PNG with changes
        // to the bitmap save and the returned data prefix below
        byte[] outputBytes = null;
        using (System.IO.MemoryStream outputStream = new System.IO.MemoryStream())
        {
            bitmap.Save(outputStream, System.Drawing.Imaging.ImageFormat.Jpeg);
            outputBytes = outputStream.ToArray();
        }
        // Encode the image byte array and prepend the proper prefix for image data.
        // The result can be used as an HTML image source directly
        string output = string.Format("data:image/jpeg;base64,{0}", Convert.ToBase64String(outputBytes));
        return output;
    }
}
public string ImageToBase64(byte[] bytes)
{
    using (System.IO.MemoryStream inputStream = new System.IO.MemoryStream())
    {
        inputStream.Write(bytes, 0, bytes.Length);
        inputStream.Position = 0; // rewind so Bitmap.FromStream reads from the start
        return ImageToBase64(inputStream);
    }
}

Windows phone: what to do with webresponse object?

I am trying to create a simple RSS feed reader. I got this WebResponse object, but how do I extract text and links from it?
WebResponse response = request.EndGetResponse(result);
Also, can anyone tell me what the underlying framework behind Windows Phone 8 is? Is it known as Silverlight? I want to know so that I can make relevant Google searches and not bother you people time and again.
It depends on what you want to do with it. You will need to get the Stream, and from there you can turn it into a string (or other primitive type), use Json.NET to convert from JSON to an object, or use the stream to create an image.
using (var response = webRequest.EndGetResponse(asyncResult))
{
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        // need a string?
        string result = reader.ReadToEnd();
        // Need to convert from json?
        MyObject obj = Newtonsoft.Json.JsonConvert.DeserializeObject<MyObject>(result);
    }
}
