Google Cloud Search - db.blobColumns - google-api

I'm trying to understand the db.blobColumns property in the database connector. I have what is essentially a massive string of 500,000 characters, and I want to use db.blobColumns to upload this text. Given the name "blob", I assume the property expects a binary large object? If anyone has used this property before for large text content, please help! I'm at a loss with this particular situation.
Here are the docs: https://developers.google.com/cloud-search/docs/guides/database-connector#content-fields

I have tried using the db.blobColumn field with database BLOB (binary) content and it works well, extracting text from the file and doing OCR if it's an image. But yes, it also accepts text content in the form of the database's CLOB type.
I suggest you take a look at the code of the database connector here. The two main files that matter are DatabaseAccess.java and DatabaseRepository.java.
private ByteArrayContent createBlobContent(Map<String, Object> allColumnValues) {
  byte[] bytes;
  Object value = allColumnValues.get(columnManager.getBlobColumn());
  if (value == null) {
    return null;
  } else if (value instanceof String) {
    bytes = ((String) value).getBytes(UTF_8);
  } else if (value instanceof byte[]) {
    bytes = (byte[]) value;
  } else {
    throw new InvalidConfigurationException( // allow SDK to send dashboard notification
        "Invalid Blob column type. Column: " + columnManager.getBlobColumn()
            + "; object type: " + value.getClass().getSimpleName());
  }
  return new ByteArrayContent(null, bytes);
}
The above snippet from DatabaseRepository.java is responsible for generating the blob content (binary) that is pushed to Cloud Search. A CLOB column reaches this function as a String and a BLOB as a byte[]; either way it is turned into bytes and pushed as-is to Cloud Search.
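To make that concrete for your case, here is a minimal, hypothetical sketch (not part of the connector; the 500,000-character string and the column handling simply mirror the snippet above) showing that a CLOB-style String and a BLOB-style byte[] end up as the same kind of ByteArrayContent:

import com.google.api.client.http.ByteArrayContent;
import java.nio.charset.StandardCharsets;

public class BlobContentSketch {
  // Mirrors createBlobContent: a String (CLOB) is UTF-8 encoded, a byte[] (BLOB) is used as-is.
  static ByteArrayContent toContent(Object value) {
    byte[] bytes;
    if (value instanceof String) {
      bytes = ((String) value).getBytes(StandardCharsets.UTF_8);
    } else if (value instanceof byte[]) {
      bytes = (byte[]) value;
    } else {
      throw new IllegalArgumentException("Unsupported blob column type: " + value.getClass());
    }
    return new ByteArrayContent(null, bytes);
  }

  public static void main(String[] args) {
    String largeText = "x".repeat(500_000); // stands in for your 500,000-character string (Java 11+)
    ByteArrayContent fromClob = toContent(largeText);                                   // text column
    ByteArrayContent fromBlob = toContent(largeText.getBytes(StandardCharsets.UTF_8));  // binary column
    System.out.println(fromClob.getLength() + " bytes vs " + fromBlob.getLength() + " bytes");
  }
}

So a large text value simply ends up UTF-8 encoded, exactly like binary content would.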
Note from here:
Google Cloud Search will only index the first 10 MB of your content, regardless of whether it is a text file or binary content.

Related

Tomcat Performance with Spring Boot API for File Upload

I have a Spring Boot API and one of the endpoints allows users to upload videos. My controller takes the file as a MultipartFile and stores it in a temp folder accessible to Tomcat. Once I have it stored on disk, I push the video to an S3 bucket.
To me this seems less than optimal: if I wanted 100 or 1000 users uploading at once, it seems really non-performant to write the files to disk first.
As a little background, I'm storing it on disk with the intention that if there is an issue pushing to S3 I can retry.
The code below might show what I'm doing better than the description above:
public Video addVideo(@RequestParam("title") String title,
                      @RequestParam("description") String description,
                      @RequestParam(value = "file", required = true) MultipartFile file) {
    this.amazonS3ClientService.uploadFileToS3Bucket(file, title, description);
}
Method for storing Video file:
String fileNameWithExtension = awsS3FileName + "." + FilenameUtils.getExtension(multipartFile.getOriginalFilename());
// creating the file on the server (temporarily)
File file = new File(tomcatTempDir + fileNameWithExtension);
FileOutputStream fos = new FileOutputStream(file);
fos.write(multipartFile.getBytes());
fos.close();
PutObjectRequest putObjectRequest = new PutObjectRequest(this.awsS3Bucket, awsS3BucketFolder + uniqueId + "/" + fileNameWithExtension, file);
if (enablePublicReadAccess) {
    putObjectRequest.withCannedAcl(CannedAccessControlList.PublicRead);
}
// Upload the file as a new object with ContentType and title specified
amazonS3.putObject(putObjectRequest);
// removing the file created on the server
file.delete();
So my question is: is there a better way in Tomcat to:
A) Take in a file via a controller
B) Push it to S3
There is no other way to do it with multipart. The problem with multipart is that, to properly segment parts from the request, parts sometimes need to be skipped or be repeatable. That is impossible in memory without blowing up memory, which is why Commons FileUpload caches them on disk after a certain threshold is reached.
Multipart requests are the worst way to do this. I highly recommend using either PUT or POST with content type application/octet-stream. You can take the bare request input stream and pass it to HttpClient to stream to your backend server. I did this 5 years ago and it works for gigabytes. I posted the solution on the Apache HttpClient mailing list.
There is one possibility for this to work, but only under specific conditions:
All parts arrive in the physical order you want to read them in.
Your writes to the backend are fast enough to sustain the reads from the front.
Consume the root part and then move on to the next physical one, processing the request body lazily. JAX-WS RI (Metro) has very nice handling of multipart requests for XOP/MTOM; learn from that, because you won't be able to make it any better.
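Building on the octet-stream suggestion, a rough sketch of a Spring controller that streams the raw request body straight to S3 (this is not the poster's code: the endpoint path, the "my-bucket" name, the key scheme, and the injected AmazonS3 client are assumptions for illustration, and it assumes the client sends a Content-Length header rather than chunked encoding):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
public class RawUploadController {

    private final AmazonS3 amazonS3;

    public RawUploadController(AmazonS3 amazonS3) {
        this.amazonS3 = amazonS3;
    }

    // The client POSTs the file body directly as application/octet-stream; no multipart parsing involved.
    @PostMapping(value = "/videos/{title}", consumes = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public ResponseEntity<Void> upload(@PathVariable String title,
                                       @RequestHeader("Content-Length") long contentLength,
                                       HttpServletRequest request) throws Exception {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(contentLength); // avoids the SDK buffering the stream to compute its length
        try (InputStream body = request.getInputStream()) {
            amazonS3.putObject(new PutObjectRequest("my-bucket", "videos/" + title, body, metadata));
        }
        return ResponseEntity.ok().build();
    }
}

Since nothing is kept on disk, retrying a failed S3 push would have to come from the client (or from a queue), which is the trade-off of this approach.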
Perhaps you can try to stream the input stream from your MultipartFile directly to S3.
Consider the following uploadFileToS3Bucket method:
public PutObjectResult uploadFileToS3Bucket(InputStream input, long size, String title, String description) {
    // Indicate the length of the content to avoid the AWS SDK having to compute it
    // See: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html#PutObjectRequest-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-
    ObjectMetadata objectMetadata = new ObjectMetadata();
    objectMetadata.setContentLength(size); // relies on the size reported by Spring; input.available() may also work
    // compute the object key as appropriate
    String key = "...";
    PutObjectRequest putObjectRequest = new PutObjectRequest(
            this.awsS3Bucket, key, input, objectMetadata
    );
    // The rest of your code
    if (enablePublicReadAccess) {
        putObjectRequest.withCannedAcl(CannedAccessControlList.PublicRead);
    }
    // Upload the file as a new object with ContentType and title specified
    return amazonS3.putObject(putObjectRequest);
}
Of course, you need to provide the service with the input stream obtained from the client request associated with the MultipartFile object:
public Video addVideo(
        @RequestParam("title") String title,
        @RequestParam("description") String description,
        @RequestParam(value = "file", required = true) MultipartFile file) throws IOException {
    try (InputStream input = file.getInputStream()) {
        this.amazonS3ClientService.uploadFileToS3Bucket(input, file.getSize(), title, description);
    }
}
Probably you can also play with the getBytes method of MultipartFile and create a ByteArrayInputStream to perform the operation.
In addVideo:
byte[] bytes = file.getBytes();
In uploadFileToS3Bucket:
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(bytes.length);
PutObjectRequest putObjectRequest = new PutObjectRequest(
this.awsS3Bucket, key, new ByteArrayInputStream(bytes), objectMetadata
);
I would prefer the first solution, but try to determine which option offers you the best performance.

Indexing PDF file in ElasticSearch using Java Code

I am trying to index PDF files in Elasticsearch 6.3.2 using Java code. So far I have written the following code to save the PDF in ES. The code is working fine and I am able to save the Base64-encoded string of my PDF in ES. I want to understand whether the approach I am following is correct or not. Is there a better way of doing it?
Following is my code:
InputStream inputStream = new FileInputStream(new File("mypdf.pdf"));
try {
    byte[] fileByteStream = IOUtils.toByteArray(inputStream);
    String base64String = new String(Base64.getEncoder().encodeToString(fileByteStream).getBytes(), "UTF-8");
    String strEncoded = Base64.getEncoder().encodeToString(base64String.getBytes("utf-8"));
    inputStream.close();
    JSONObject correspondenceNode = new JSONObject();
    correspondenceNode.put("data", strEncoded);
    String strJsonValues = correspondenceNode.toString();
    HttpEntity entity = new NStringEntity(strJsonValues, ContentType.APPLICATION_JSON);
    elasticrestClient.put("/2018/documents/1", entity);
} catch (IOException e) {
    e.printStackTrace();
}
Basically what I am doing here is converting the PDF document into a Base64 string and saving it into ES, and while reading, I convert it back.
Following is the code for decoding:
String responseBody = elasticrestClient.get("/2018/documents/1");
//some code to fetch the hits
JSONObject h = hitsArray.getJSONObject(0);
source = h.getJSONObject("_source");
String object = (source.getString("data"));
byte[] decodedStr = Base64.getDecoder().decode( object );
FileOutputStream fos = new FileOutputStream("download.pdf");
fos.write(Base64.getDecoder().decode(new String( decodedStr, "utf-8" )));
fos.close();
This might be correct for storing Base64 content in Elasticsearch, but a few pieces might be missing here:
You are not "indexing" the PDF as per say in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF.
You did not say anything about the mapping you are using. If you "really" want to keep the binary content around, you might want to define the Base64 field as a binary data type.
It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
Instead, I'd extract the text and metadata and index that, plus a URL to the binary itself. Like:
{
"content": "Extracted text here",
"meta": {
// Meta data there
},
"url": "file://path/to/file"
}
You can also look at FSCrawler (including its code), which does basically that.
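If you do want Elasticsearch itself to extract and index the PDF text (the ingest attachment route mentioned above), a rough sketch could look like the following. It assumes the ingest-attachment plugin is installed, reuses the put(path, entity) helper and the fileByteStream variable from the question's code, and the pipeline name pdf-attachment is made up for illustration:

// Create an ingest pipeline that uses the ingest attachment plugin to extract text from the "data" field
String pipelineJson = "{\n"
    + "  \"description\": \"Extract PDF content\",\n"
    + "  \"processors\": [ { \"attachment\": { \"field\": \"data\" } } ]\n"
    + "}";
elasticrestClient.put("/_ingest/pipeline/pdf-attachment",
    new NStringEntity(pipelineJson, ContentType.APPLICATION_JSON));

// Index the document through the pipeline; "data" holds a single Base64 encoding of the PDF bytes
String docJson = new JSONObject()
    .put("data", Base64.getEncoder().encodeToString(fileByteStream))
    .toString();
elasticrestClient.put("/2018/documents/1?pipeline=pdf-attachment",
    new NStringEntity(docJson, ContentType.APPLICATION_JSON));

Note that the attachment processor expects a plain, single Base64 encoding of the PDF bytes (so the double encoding in the question is unnecessary), and it writes the extracted text to an attachment.content field, which is what actually makes the PDF searchable.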

NFC External record is returning in wrong format?

I've successfully written an external record to an NFC tag. When I use a 3rd party tag reader to evaluate the external record that was written, I see the appropriate value, which is a single, positive integer.
However, when I run my code (below) to see what the value of the payload (external record) is on the tag (using a Toast) in order to incorporate that value into an "if" statement, I get different values. So far, I've seen the following:
[B@41fb4278 or [B@41fb1190.
At this point, the value of the external record is just "2". How can I just return/write simply 2?
protected void onNewIntent(Intent intent) {
    super.onNewIntent(intent);
    if (intent.hasExtra(NfcAdapter.EXTRA_TAG)) {
        Tag tag = intent.getParcelableExtra(NfcAdapter.EXTRA_TAG);
        byte[] payload = "2".getBytes(); // this is where the ID (payload) for the tag is assigned
        NdefRecord[] ndefRecords = new NdefRecord[2];
        ndefRecords[0] = NdefRecord.createExternal("com.example.bmt_admin", "externaltype", payload);
        ndefRecords[1] = NdefRecord.createApplicationRecord("com.example.bmt_01");
        NdefMessage ndefMessage = new NdefMessage(ndefRecords);
        writeNdefMessage(tag, ndefMessage);
        Toast.makeText(this, "NFC Scan: " + payload, Toast.LENGTH_SHORT).show();
    }
}
Thanks for any help!!
payload is defined as byte[]. When you use payload in your Toast statement, you are using a reference to that array, so what you see is effectively the address of the array, not its contents. When you want a string representation of a byte[], you can use, for example:
String s = new String(payload);
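Applied to the Toast in the question (the explicit UTF-8 charset is an assumption on my part; it matches how "2".getBytes() encodes on Android, whose default charset is UTF-8):

// requires: import java.nio.charset.StandardCharsets;
Toast.makeText(this, "NFC Scan: " + new String(payload, StandardCharsets.UTF_8), Toast.LENGTH_SHORT).show();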

Is there any way of suppressing ‘Do you want to save changes to xxx.pdf before closing’ dialog using ABCpdf

We are reading an Adobe form template using ABCpdf, populating the form fields with values retrieved from a database, merging them into a single PDF document, and sending the document back as a file stream in the HTTP response to users of an ASP.NET MVC app.
This approach is working fine and the PDF documents are generated successfully. But when a user chooses to open the generated PDF file and then tries to close it, they are prompted with a 'Do you want to save changes to xxx.pdf before closing' dialog from Adobe Acrobat. Is there any way of suppressing this message using ABCpdf?
Following is the code we are using to generate the PDF.
public byte[] GeneratePDF(Employee employee, String TemplatePath)
{
    string[] FieldNames;
    Doc theDoc;
    MemoryStream MSgeneratedPDFFile = new MemoryStream();
    // Get the PDF template and read all the form fields inside the template
    theDoc = new Doc();
    theDoc.Read(HttpContext.Current.Server.MapPath(TemplatePath));
    FieldNames = theDoc.Form.GetFieldNames();
    // Navigate through each form field and populate employee details
    foreach (string FieldName in FieldNames)
    {
        Field theField = theDoc.Form[FieldName];
        switch (FieldName)
        {
            case "Your_First_Name":
                theField.Value = employee.FirstName;
                break;
            default:
                theField.Value = theField.Name;
                break;
        }
        // Remove form fields and replace them with text
        theField.Focus();
        theDoc.Color.String = "240 240 255";
        theDoc.FillRect();
        theDoc.Rect.Height = 12;
        theDoc.Color.String = "220 0 0";
        theDoc.AddText(theField.Value);
        theDoc.Delete(theField.ID);
    }
    return theDoc.GetData();
}
Today I ran into this problem too, but with a PDF with no form fields. I ran @CharlieNoTomatoes' code and confirmed the FieldNames collection was definitely empty.
I stepped through the various stages of my code and found that if I saved the PDF to the file system and opened it from there, it was fine. That narrowed it down to the code that took the ABCpdf data stream and sent it directly to the user (I normally don't bother actually saving to disk). I found this in the WebSuperGoo docs, and it suggested my server might be sending some extra rubbish in the Response, causing the file to be corrupted.
Adding Response.End(); did the trick for me. The resulting PDF files no longer displayed the message.
byte[] theData = _thisPdf.Doc.GetData();
var curr = HttpContext.Current;
curr.Response.Clear();
curr.Response.ContentType = "application/pdf";
curr.Response.AddHeader("Content-Disposition", "attachment; filename=blah.pdf");
curr.Response.Charset = "UTF-8";
curr.Response.AddHeader("content-length", theData.Length.ToString());
curr.Response.BinaryWrite(theData);
curr.Response.End();
I ran into this problem, as well. I found a hint about "appearance streams" here:
The PDF contains form fields and the NeedAppearances entry in the interactive form dictionary is set to true. This means that the conforming PDF reader will generate an appearance stream where necessary for form fields in the PDF and as a result the Save button is enabled. If the NeedAppearances entry is set to false then the conforming PDF reader should not generate any new appearance streams. More information about appearance streams in PDF files and how to control them with Debenu Quick PDF Library.
So, I looked for "appearance" things in the websupergoo doc and was able to set some form properties and call a field method to get rid of the "Save changes" message. In the code sample above, it would look like this:
Edit: After exchanging emails with the fast and helpful WebSuperGoo support about the same message appearing for a PDF created with AddImageHtml, which WASN'T fixed by just setting the form NeedAppearances flag, I added the lines involving Catalog and Atom to remove the core document NeedAppearances flag that gets set during AddImageHtml.
public byte[] GeneratePDF(Employee employee, String TemplatePath)
{
    string[] FieldNames;
    Doc theDoc;
    MemoryStream MSgeneratedPDFFile = new MemoryStream();
    // Get the PDF template and read all the form fields inside the template
    theDoc = new Doc();
    theDoc.Read(HttpContext.Current.Server.MapPath(TemplatePath));
    FieldNames = theDoc.Form.GetFieldNames();
    // Tell the PDF viewer not to create its own appearances
    theDoc.Form.NeedAppearances = false;
    // Generate appearances when needed
    theDoc.Form.GenerateAppearances = true;
    // Navigate through each form field and populate employee details
    foreach (string FieldName in FieldNames)
    {
        Field theField = theDoc.Form[FieldName];
        switch (FieldName)
        {
            case "Your_First_Name":
                theField.Value = employee.FirstName;
                break;
            default:
                theField.Value = theField.Name;
                break;
        }
        // Update the appearance for the field
        theField.UpdateAppearance();
        // Remove form fields and replace them with text
        theField.Focus();
        theDoc.Color.String = "240 240 255";
        theDoc.FillRect();
        theDoc.Rect.Height = 12;
        theDoc.Color.String = "220 0 0";
        theDoc.AddText(theField.Value);
        theDoc.Delete(theField.ID);
    }
    Catalog cat = theDoc.ObjectSoup.Catalog;
    Atom.RemoveItem(cat.Resolve(Atom.GetItem(cat.Atom, "AcroForm")), "NeedAppearances");
    return theDoc.GetData();
}

How to successfully parse images from within content:encoded tags using SAX?

I am trying to parse and display images from a feed that has the image URL inside the content:encoded tags. An example is this:
Note: http://someImage.jpg is not a real image link; this is just an example. This is what I have done so far.
public void startElement(String uri, String localName, String qName, Attributes atts) {
    chars = new StringBuilder();
    if (qName.equalsIgnoreCase("content:encoded")) {
        if (!atts.getValue("src").toString().equalsIgnoreCase("null")) {
            feedStr.setImgLink(atts.getValue("src").toString());
            Log.d(TAG, "inside if " + feedStr.getImgLink());
        } else {
            feedStr.setImgLink("");
            Log.d(TAG, feedStr.getImgLink());
        }
    }
}
I believe this part of my code needs to be tweaked. First, when qName equals "content:encoded", the parsing stops: the application just runs endlessly and displays nothing. Second, if I change that initial if to anything qName cannot equal, like "purplebunny", everything works perfectly, except there are no images. What am I missing? Am I using atts.getValue properly? I have used Log to see what ends up in imgLink, and it is always null.
You can store the content:encoded data in a String. Then you can extract the image with the Jsoup library.
Example:
Suppose content:encoded raw data stored in Description variable.
Document doc = Jsoup.parse(Description);
Element image =doc.select("img").first();
String url = image.absUrl("src");
