The correct way to upload a PDF to Google Cloud AutoML - google-cloud-automl

I would like to know whether there is a clear guide showing the steps for uploading a PDF to GCP AutoML NLP.
1.) I have uploaded the PDF to a bucket, e.g. ABC.pdf
2.) Set up training.jsonl, replacing the file location with ABC.pdf:
{
  "document": {
    "input_config": {
      "gcs_source": {
        "input_uris": [ "gs://automl/ABC.pdf" ]
      }
    }
  }
}
3.) I create a new CSV file and paste the GCS link to the JSONL into it:
gs://automl/training.jsonl
4.) When I create the dataset in AutoML, it shows the following:
Error: Has critical error in root level csv gs://automl/order.csv line 1: Expected 2 columns, but found 1 columns only.
The guide doesn't say which columns are required. Thank you for your assistance.

In the CSV file, every line should have the set label and the content, e.g.:
TRAIN,gs://automl/training.jsonl
Otherwise it should start with a comma, to indicate that the first column (the set label) is empty, e.g.:
,gs://automl/training.jsonl
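The two-column line above can be produced with Python's csv module. A minimal sketch; the file name order.csv and the bucket path are taken from the question:

```python
import csv

# Each row: (set label, GCS URI of the JSONL file).
# Leave the label empty ("") to let AutoML assign the split itself.
rows = [
    ("TRAIN", "gs://automl/training.jsonl"),
]

with open("order.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

This writes the two columns the error message is asking for: a set label and a GCS URI.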

Related

How to read Dynamic files in Cypress

While doing automation I have to download files and store them in a Cypress folder. Downloading and storing works, but I am not sure how to read those files, since each download is prefixed with a random number.
For example, in the cypress/Animesh folder I can see files like 1234_abc.json, 2345_abc.json, 3334_abc.json, 3454_abc.json.
How do I read the first file?
I am having a similar issue. I can successfully download the file, but I cannot find a way to soft-wait for it to be ready. I'm looking for something that checks there is at least one file in the folder. I'm using the findFile task, but it doesn't wait, and I'm not sure how to make it wait:
// cypress/plugins/index.js — the stray closing brace in the snippet suggests
// this task is registered inside the standard on('task', { ... }) wrapper
const globby = require('globby');

module.exports = (on) => {
  on('task', {
    findFile: (mask) => {
      if (!mask)
        throw new Error('Missing a file mask to search');
      return globby(mask).then((list) => {
        if (!list.length)
          throw new Error(`Could not find files matching mask "${mask}"`);
        return list[0];
      });
    },
  });
};

Get most recent file from Azure blob storage

My Azure blob storage has several files, like:
name          last modified
data-GUID1    Jan 1, 20
data_guid2    Jan 2, 20
How would I grab the file with the most recent 'last modified', like data_guid2?
Currently I hard-code the name:
file_location = /dbfs/mnt/blob/container/data_Guid1
Thanks in advance.
You can get a list of all the file names, then write whatever custom code you want to find the most recent one (e.g. the one with the largest number at the end).
You can get this list using the dbutils.fs.ls("") function: https://kb.databricks.com/data-sources/wasb-check-blob-types.html
The URL for your blob container will have the following format:
wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/
If you are having trouble with this approach, or if you would also like the "last modified" timestamps for the files, see the same link for additional ways to list the files in a blob directory.
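The selection step itself is simple once you have a listing. A minimal plain-Python sketch; the timestamps are hypothetical, and in Databricks the list would be built from the directory-listing output rather than written by hand:

```python
# Hypothetical listing of (file name, last-modified time in epoch seconds);
# in practice, build this list from the blob directory listing.
files = [
    ("data-GUID1", 1577836800),  # Jan 1, 2020
    ("data_guid2", 1577923200),  # Jan 2, 2020
]

# The most recent file is the one with the largest modification time.
most_recent = max(files, key=lambda f: f[1])[0]
print(most_recent)  # data_guid2
```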

Automate downloading of multiple xml files from web service with power query

I want to download multiple xml files from a web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
and manipulates it to get list of file names such as: PUB_DailyMeterDataD1_201812041627.xml in a column on an excel spreadsheet.
I hoped to get a function to run against this list of names to get all the data, so first I worked on one file: PUB_DailyMeterDataD1_201812041627
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an xml table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all xml files available from the service. The function requires a variable to be substituted for the filename. I try this as preparation for the function:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = (Web.Contents("https://reports.sem-o.com/documents/Filename")),
(followed by the manipulating M code)
This doesn't work.
then this:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I am stuck here. Can you help?
Thanks,
Conor
You append strings with the "&" operator in Power Query. [Somename] is the format for referencing a field within a table; a normal variable is just referenced with its name. So in your example:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds like you have an existing query that drills down to a list of filenames and you are trying to use it to import them from the URL. Assuming the column you got the filenames from is called "Filename", you could add a custom column containing:
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
And it will load the table onto the row of each of the filenames.

How to separate tables contained in an Excel file into different CSVs?

I have an Excel file with multiple tables separated by blank rows, and I want to save each table to a separate CSV file with a script. How could I do it?
Thanks for the help.
UPDATE:
INPUT EXAMPLE:
Excel Example
OUTPUT EXAMPLE:
I want every one of those columns in a file like this one:
276.1722 54.318 50.6335
276.373 52.573 51.4047
277.0097 50.864 51.9912
277.9329 49.4127 52.8294
279.0832 47.9623 53.3041
280.3554 46.5477 53.5295
281.3679 44.9695 53.8862
282.4689 43.4235 54.1254
283.4763 41.8019 54.0885
284.5859 40.3595 53.5828
285.7263 38.941 52.988
286.8929 37.5684 52.3438
288.0729 36.2914 51.5373
289.0561 35.1335 50.4119
289.7246 34.2113 48.8901
290.0624 33.3207 47.2446
290.1395 32.2516 45.6541
290.0895 31.2818 44.0091
289.7804 30.5224 42.2812
289.211 29.8383 40.5862
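A minimal stdlib-only Python sketch of the splitting step, assuming the sheet has first been exported to a single CSV and that the tables are separated by fully blank rows (the file names and sample values are illustrative):

```python
import csv

def split_tables(rows):
    """Group consecutive non-blank rows into separate tables."""
    tables, current = [], []
    for row in rows:
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:           # a blank row closes the current table
            tables.append(current)
            current = []
    if current:
        tables.append(current)  # last table may not be followed by a blank row
    return tables

# Example: two tables separated by a blank row.
rows = [
    ["276.1722", "54.318", "50.6335"],
    ["276.373", "52.573", "51.4047"],
    [""],
    ["277.0097", "50.864", "51.9912"],
]

for i, table in enumerate(split_tables(rows)):
    with open(f"table_{i}.csv", "w", newline="") as f:
        csv.writer(f).writerows(table)
```

Each group of rows is written to its own table_N.csv; reading the sheet into `rows` in the first place could be done with a gem or library of your choice.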
Alternate way of storing the data:
You could store the tables in one Excel file in different sheets, and use these gems depending on whether you are working with the old or new Excel format:
'roo', 'roo-xls', 'spreadsheet', 'write_xlsx'
Loop through the sheets and perform the same logic, instead of placing the tables throughout a single sheet.

Validation : how to check if the file being uploaded is in excel format? - Apache POI

Is there any way I can check whether the file being uploaded is in Excel format? I am using the Apache POI library to read Excel files, and I am checking the uploaded file's extension while reading the file.
Code snippet for getting the extension
String suffix = FilenameUtils.getExtension(uploadedFile.getName());
courtesy BalusC : uploading-files-with-jsf
String fileExtension = FilenameUtils.getExtension(uploadedFile.getName());
if ("xls".equals(fileExtension)) {
//rest of the code
}
I am sure this is not the proper way of validation.
Sample code for browse button
<h:inputFileUpload id="file" value="#{sampleInterface.uploadedFile}"
valueChangeListener="#{sampleInterface.uploadedFile}" />
Sample code for upload button
<h:commandButton action="#{sampleInterface.sampleMethod}" id="upload"
value="upload"/>
A user could change the extension of a doc or a movie file to "xls" and upload it; then it would certainly throw an exception while reading the file.
Just hoping somebody could throw some input my way.
You can't check that before feeding it to POI. Just catch the exception POI can throw during parsing. If it throws one, show a FacesMessage to the end user saying that the uploaded file is not in a supported Excel format.
Please try to be more helpful to the poster. Of course you can test before POI.
Regular tests, to be performed before the try/catch, include the following. I suggest a fail-fast approach.
Is it a "good" file?
if file.isDirectory() -> die and exit
if !file.isReadable() -> die and exit
if file.available() <= 100 -> die and exit (includes file size zero)
if file.size() >= some ridiculously large number -> die and exit (check your biggest Excel file and multiply by 10)
The file seems good, but does its content look like Excel?
Does it start with "ÐÏà" (the OLE2 header bytes D0 CF 11 E0 read as text)? -> if not, die
Does it contain the text "Sheet"? -> if not, die
Some other internal Excel bytes that I expected from you guys here.
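The header check above can be sketched in Python as a stand-in for the Java version. The two signatures below are the standard OLE2 compound-file header used by legacy .xls files and the ZIP header used by .xlsx; note the ZIP check only proves the file is a ZIP archive, not necessarily a workbook:

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx (a ZIP container)

def looks_like_excel(path):
    """Return True if the file starts with an .xls or .xlsx signature."""
    with open(path, "rb") as f:
        head = f.read(8)
    return head.startswith(OLE2_MAGIC) or head.startswith(ZIP_MAGIC)
```

A file renamed from .mov to .xls fails this check immediately, so POI never sees it; genuinely corrupt workbooks that do start with the right bytes are still caught by the try/catch around the POI parse.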
