AutoML Vision: Dataset import takes long time and fails eventually - google-cloud-automl

I am currently trying to import a single-label dataset that contains ~7300 images. I create the dataset from a single CSV file in the following format (paths shortened):
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
[...]
However, the import process failed after processing for over 7 hours (which I find unusually long based on previous experience) with the following error:
File unreadable or invalid gs://[...]
The strange thing is: The files were there and I was able to download and view them on my machine. And once I removed all entries from the CSV except the two "unreadable or invalid" ones and imported this CSV file (same bucket), it worked like a charm and took just a few seconds.
Another dataset with 500 other images caused the same strange behavior.
I have imported and trained a few AutoML Vision models before and I can't figure out what is going wrong this time. Any ideas or debugging tips appreciated. The GCP project is "its-2018-40128940-automl-vis".
Thanks in advance!

File unreadable or invalid is returned when a file either cannot be accessed in GCS (for example because of its size or permissions) or when its format is considered invalid, for example when an image is in a different format than its extension suggests, or in a format that is not supported by the image service.
When there are errors, the pipeline may be slow because it currently retries with exponential backoff. It tries to detect non-retryable errors and fail fast, but errs on the side of retrying when unsure.
It would be best if you could ensure the images are in the right format, for example by re-converting them into one of the supported formats.
Depending on your platform, there are tools to do that.
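For example, here is a minimal Python sketch of such a check, assuming the google-cloud-storage and Pillow packages and the two-column CSV shown in the question (the local file name is hypothetical):
import csv, io
from google.cloud import storage
from PIL import Image

client = storage.Client()

with open("dataset.csv") as f:  # "dataset.csv" is a hypothetical local copy of the import CSV
    for gcs_uri, label in csv.reader(f):
        bucket_name, blob_path = gcs_uri[len("gs://"):].split("/", 1)
        data = client.bucket(bucket_name).blob(blob_path).download_as_bytes()
        try:
            img = Image.open(io.BytesIO(data))
            img.verify()  # raises if the file is truncated or corrupt
            print(gcs_uri, img.format)  # compare the real format with the file extension
        except Exception as exc:
            print("BAD:", gcs_uri, exc)
Images whose real format doesn't match their extension (e.g. a PNG saved as .jpg) can then be re-saved with Pillow before re-importing.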

When I checked a file that had been uploaded via the GCP Storage UI and compared it with our programmatic upload, I found that to match it we have to upload the file with the following configuration (Node.js Cloud Storage client):
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

// Upload the local CSV into the bucket without gzip transcoding
await storage.bucket(bucketName).upload(`./${csv_file}`, {
  destination: `csv/${csv_file}`,
  gzip: false, // store the object as-is (no `Content-Encoding: gzip`)
  metadata: {},
});

Related

How to create a partially modifiable binary file format?

I'm creating my own custom binary file format.
I use the RIFF standard for encoding data, and it seems to work pretty well.
But there are some additional requirements:
Binary files can be large, up to 500 MB.
Data is saved into the binary file in real time, at intervals, whenever data in the application changes.
The application may run in the browser.
The problem I face is that when I want to save data, I need to read everything into memory and rewrite the whole binary file.
This isn't a problem while the data is small, but as it grows the real-time saving feature doesn't scale.
So the main requirements for this binary file format are:
Being able to partially read the binary file (because the file is huge).
Being able to partially write changed data into the file without rewriting the whole file.
A streaming protocol like .m3u8 is not an option; we can't split the file into chunks and point to them with separate URLs.
Any guidance on how to design a binary file system that scales in this scenario?
There was an answer from another user here that has since been deleted.
It seems great to me.
If you claim your answer back, I'll delete this one.
They said:
If we design the file format to support appending, then we can add whatever data we want without rewriting the whole file.
This idea gave me a very good starting point.
So I can append more and more changes at the end of the file,
then mark the old chunks of data in the middle of the file as obsolete.
I can reuse these obsolete slots later if I want to.
The downside is that I need to clean up the obsolete slots whenever I get a chance to rewrite the whole file.
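A rough Python sketch of this append-and-obsolete idea, assuming a simple RIFF-like layout of 4-byte tag, 4-byte little-endian length, then payload (the FREE tombstone tag is made up for illustration):
import struct

CHUNK_HEADER = struct.Struct("<4sI")  # 4-byte tag + little-endian uint32 payload length

def append_chunk(path, tag, payload):
    # New or changed data is always appended at the end; nothing else is rewritten.
    with open(path, "ab") as f:
        f.write(CHUNK_HEADER.pack(tag, len(payload)))
        f.write(payload)

def obsolete_chunk(path, offset):
    # Mark an old chunk as obsolete in place by overwriting only its tag,
    # leaving its length and payload untouched so the slot can be reused later.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"FREE")

def iter_chunks(path):
    # Partial reads: walk the headers and skip payloads that aren't needed.
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(CHUNK_HEADER.size)
            if len(header) < CHUNK_HEADER.size:
                break
            tag, length = CHUNK_HEADER.unpack(header)
            yield offset, tag, length
            f.seek(length, 1)  # skip over the payload
Compaction, the cleanup mentioned above, can then be a separate pass that copies every chunk whose tag is not FREE into a fresh file.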

Azure Databricks - Receive error Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data

I have been through many links trying to solve this problem. However, none have helped me, primarily because I am facing this error on Azure Databricks.
I am trying to read Excel files located in the ADLS curated zone. There are about 25 Excel files. My program loops through the Excel files and reads them into a PySpark DataFrame. However, after reading about 9 Excel files, I receive the error below:
Py4JJavaError: An error occurred while calling o1481.load.
: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 6111064, Raw/compressed size: 61100, ratio: 0.009998
I installed the Maven package for org.apache.poi.openxml4j, but when I try to call it with the following simple import statement, I receive the error "No module named 'org'":
import org.apache.poi.openxml4j.util.ZipSecureFile
Does anyone have any ideas on how to set ZipSecureFile.setMinInflateRatio() to 0 in Azure Databricks?
Best regards,
Sree
The "Zip bomb detected" exception will occur if the expanded file crosses the default MinInflateRatio set in the Apache jar. Apache includes a setting called MinInflateRatio which is configurable via ZipSecureFile.setMinInflateRatio() ; this will now be set to 0.0 by default to allow large files.
Checkout known issue in POI: https://bz.apache.org/bugzilla/show_bug.cgi?id=58499
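One commonly suggested workaround (a sketch, not verified on every runtime) is to call the static POI method through Spark's py4j gateway from a Python cell, assuming the POI/openxml4j jars are already on the cluster classpath via the library you installed:
# Call the static Apache POI method through Spark's py4j gateway.
# Assumes org.apache.poi / openxml4j is on the cluster classpath.
jvm = spark.sparkContext._jvm
jvm.org.apache.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(0.0)  # disable the ratio check
The import statement in the question fails because org.apache.poi.openxml4j is a Java package, not a Python module; it has to be reached through the JVM gateway as above, or the same two lines can be run from a Scala cell. Note that this changes the setting in the driver JVM only; if the ratio check is triggered on the executors, the same call may need to run there as well.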

How can I C-STORE pixel data of a DICOM from a C-MOVE properly?

I have one running server handling C-MOVE, two running servers handling C-STORE, and a remote PACS server (GEPACS).
When I issue a C-MOVE command from the remote PACS to the C-STORE handlers, one server (py-netdicom) builds and saves the file properly and the other (go-netdicom) does not.
So there were a couple of problems in go-netdicom.
I fixed the code so it can handle hexadecimals; that was originally not supported in go-netdicom.
This fixed almost every problem in my case, but it still cannot store pixel data properly.
For example, I got 9117252 bytes in the original signal from the remote PACS and saved the data as-is, but it actually needs to be 18000000 bytes (I got an error). Even CT images are short by about a factor of 3 (I got approximately 180000 bytes, but need 524288).
I think the problem might be caused by the encapsulation of the pixel data, but I'm not sure.
Are there any tips, or can someone help?
Thank you.
EDIT 4: I've got a clue. link here
Somehow the C-STORE command has a transfer syntax of its own.
This tells the SCP what kind of data (compressed or not) it will get from the SCU.
But I still don't have an idea which part of go-netdicom has to be changed.
I'll remove the "python" tag because this is not related to Python anymore.
I found the solution.
Somehow, GEPACS sends a certain transfer syntax for JPEG compression.
If go-netdicom doesn't have the TransferSyntaxUID, it picks GEPACS's first transfer syntax, and that one was for JPEG compression.
I just put big endian and explicit VR (the GEPACS default) when the transfer syntax is empty,
which is in contextmanager.go, line 101 in AssociateRequest, and line 127.
Hope this result helps someone.
Thank you.
The problem here is that go-netdicom uses the first PresentationContext sent in the A_ASSOCIATE_RQ (as you can see in the last image). So it accepts "2.16.840.1.113709.1.2.2", which is a private transfer syntax that is not in the DICOM standard, so nobody can handle the C-STORE in the end.
If you are reading this, maybe you do not use go-netdicom, but the problem could be the same if the error involves the transfer syntax "2.16.840.1.113709.1.2.2". The Centricity PACS documentation says: "It is expected that other vendors' applications will ignore all Presentation Context proposed with the GE Private Compress Express Transfer Syntax".
And that is what we are supposed to do. I see a list of open PRs in go-netdicom, so I suppose it is not maintained anymore, so I will post the change for go-netdicom here. I made these changes in contextmanager.go and it works like a charm:
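For anyone on the Python side instead, here is a rough pynetdicom sketch of a C-STORE SCP that only supports standard transfer syntaxes, so a private syntax such as "2.16.840.1.113709.1.2.2" is never accepted during association negotiation (the storage path and port below are just illustrative):
from pynetdicom import AE, evt, AllStoragePresentationContexts
from pydicom.uid import ImplicitVRLittleEndian, ExplicitVRLittleEndian, ExplicitVRBigEndian

def handle_store(event):
    # Save the dataset with the file meta that matches the negotiated transfer syntax.
    ds = event.dataset
    ds.file_meta = event.file_meta
    ds.save_as(ds.SOPInstanceUID + ".dcm", write_like_original=False)
    return 0x0000  # Success

ae = AE()
# Support only standard transfer syntaxes; anything else, such as the GE
# private syntax, is rejected when the presentation contexts are negotiated.
for context in AllStoragePresentationContexts:
    ae.add_supported_context(
        context.abstract_syntax,
        [ImplicitVRLittleEndian, ExplicitVRLittleEndian, ExplicitVRBigEndian],
    )

ae.start_server(("0.0.0.0", 11112), evt_handlers=[(evt.EVT_C_STORE, handle_store)])
Because the SCP never accepts the private presentation context, the PACS should fall back to one of the standard syntaxes it also proposed, which is what the Centricity documentation quoted above expects.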

URI not found in CSV row - import error for AutoML NL

I have been running several independent multi-class NL models on an identical data set (to compare performance to a multi-label model) and had no problems importing the data or running the models. I've just been through the identical preparation process, uploaded the file to the bucket and now get this error on import:
Uri is not found in CSV row ",NotWarm".
Warm and NotWarm are my labels. A sample of the csv is below so you can see the format:
"To ensure you get the best possible service, we stagger the cut-off time for next day delivery from 5pm right up until Midnight.",Warm
You’ll be able to see if Next Day Delivery is still available when you place your order.,NotWarm
"You can choose a home delivery option, which lets you have your order delivered to an address of your choice.",Warm
"Some eligible items also let you choose Click + Collect, where your order is delivered to a local store.",NotWarm
I've double-checked all the advice about preparing datasets on the AutoML help pages. The file itself has been encoded in UTF-8 using Notepad++, so there should be nothing amiss with the CSV format. The file is identical to those I've used previously except for the labels.
Has something changed in the AutoML NL process? It has been over a month since my last model was created.
Thanks in advance for any guidance.
SOLVED
I tagged all my labels with unique numbers to determine which line of data the upload was failing on. It turned out some blank lines had crept into the file, so I was trying to assign a label to an empty string. I removed the empty lines and everything works. Hopefully this helps someone else.
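For anyone hitting the same error, here is a small Python sketch (assuming the two-column text,label format shown above; the file name is hypothetical) that flags the offending rows:
import csv

with open("training_data.csv", newline="", encoding="utf-8") as f:  # hypothetical file name
    for line_no, row in enumerate(csv.reader(f), start=1):
        # A blank line parses to [], and a row like ",NotWarm" has an empty text column;
        # either would trigger the import error shown above.
        if not row or not row[0].strip():
            print(f"line {line_no}: empty text column -> {row}")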

Adding processing code to a webpage using processing.js

I have created a Processing sketch (a .pde file) that draws a time series (coffee production vs. time) from data in an Excel-exported .tsv table. Can anyone tell me how to include this in my webpage?
I have tried processing.js, but it does not show anything in the browser.
Without more information to go on: you probably have your .tsv file in a "data" directory but aren't explicitly loading it from "./data/myfile.tsv", relying instead on Processing to auto-resolve the path. If you intend to run your sketch online, always include "data/" in your file locations, because the browser resolves locations relative to "where the page is right now".
