URI not found in CSV row - import error for AutoML NL - google-cloud-automl

I have been running several independent multi-class NL models on an identical data set (to compare performance to a multi-label model) and had no problems importing the data or running the models. I've just been through the identical preparation process, uploaded the file to the bucket and now get this error on import:
Uri is not found in CSV row ",NotWarm".
Warm and NotWarm are my labels. A sample of the csv is below so you can see the format:
"To ensure you get the best possible service, we stagger the cut-off time for next day delivery from 5pm right up until Midnight.",Warm
You’ll be able to see if Next Day Delivery is still available when you place your order.,NotWarm
"You can choose a home delivery option, which lets you have your order delivered to an address of your choice.",Warm
"Some eligible items also let you choose Click + Collect, where your order is delivered to a local store.",NotWarm
I've double checked all the advice about preparing datasets on the AutoML help pages. The file itself has been encoded in UTF-8 using Notepad++ so there should be nothing amiss with the CSV format. The file is identical to those I've used previously except for the labels.
Has something changed in the AutoML NL process? It has been over a month since my last model was created.
Thanks in advance for any guidance.

SOLVED
I tagged all my labels with unique numbers to determine which line of data the upload was failing on. It turned out some blank lines had crept into the file, so I was trying to assign a label to an empty string. Removing the empty lines fixed it. Hopefully this helps someone else.
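For anyone hitting the same thing, a quick way to catch this before upload is to drop empty rows from the CSV. A minimal Python sketch (the file names are just placeholders):

import csv

# Hypothetical file names - substitute your own.
src = "training_data.csv"
dst = "training_data_clean.csv"

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        # Skip rows that are completely empty or have no text in the first
        # column - these are what trigger "Uri is not found in CSV row".
        if not row or not row[0].strip():
            continue
        writer.writerow(row)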

Related

AutoML Vision: Dataset import takes long time and fails eventually

I am currently trying to import a single-label dataset that contains ~7300 images. I use a single CSV file in the following format to create the dataset from (paths shortened):
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
[...]
However, the import process failed after processing for over 7 hours (which I find unusually long based on previous experience) with the following error:
File unreadable or invalid gs://[...]
The strange thing is: The files were there and I was able to download and view them on my machine. And once I removed all entries from the CSV except the two "unreadable or invalid" ones and imported this CSV file (same bucket), it worked like a charm and took just a few seconds.
Another dataset with 500 other images caused the same strange behavior.
I have imported and trained a few AutoML Vision models before and I can't figure out what is going wrong this time. Any ideas or debugging tips appreciated. The GCP project is "its-2018-40128940-automl-vis".
Thanks in advance!
"File unreadable or invalid" is returned when a file either cannot be accessed in GCS (for example because of its size or permissions) or when the file format is considered invalid, e.g. the image is in a different format than its extension suggests, or in a format the image service does not support.
When there are errors, the pipeline may be slow because it currently retries with exponential backoff. It tries to detect non-retryable errors and fail fast, but errs on the side of retrying when unsure.
It would be best to ensure the images are in the right format, for example by re-converting them into one of the supported formats. Depending on your platform there are tools to do that.
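For instance, a rough Python sketch using Pillow to re-encode everything as proper JPEGs before import (the folder names and the Pillow approach are my own assumptions, not from the original answer):

from pathlib import Path

from PIL import Image  # pip install Pillow

src_dir = Path("images")       # hypothetical input folder
dst_dir = Path("images_jpeg")  # hypothetical output folder
dst_dir.mkdir(exist_ok=True)

for path in src_dir.iterdir():
    if not path.is_file():
        continue
    try:
        with Image.open(path) as img:
            # Re-encode as a real JPEG so the extension matches the content.
            img.convert("RGB").save(dst_dir / (path.stem + ".jpg"), "JPEG")
    except OSError:
        # Pillow could not read it - likely one of the "unreadable or invalid" files.
        print(f"Could not convert {path}")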
Comparing against a file uploaded through the GCP Storage UI, we found we had to upload the CSV with the following configuration to match it:
// Requires the Cloud Storage client: npm install @google-cloud/storage
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

// bucketName and csv_file are defined elsewhere in our script.
storage.bucket(bucketName).upload(`./${csv_file}`, {
  destination: `csv/${csv_file}`,
  // Upload the raw, uncompressed CSV (no gzip transcoding),
  // matching what a plain upload through the UI produces.
  gzip: false,
  metadata: {},
});

Apache Nifi MergeContent output data inconsistent?

Fairly new to using NiFi. Need help with the design.
I am trying to create a simple flow with dummy CSV files (for now) in an HDFS directory and prepend some text data to each record in each flow file.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(This may be a poor way to achieve what I'm after, but I read somewhere that a UUID is the best bet for generating a unique session id, so I thought of splitting the incoming data into one flow file per line and generating a UUID for each.)
But somehow, as you can see, the order of the data gets mixed up: the first 3 rows of the output are not the same as the input. With the test data I am using (50,000 entries), the records end up on other lines; multiple tests show the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried the defragment method in the merge, but it didn't work. I would appreciate it if someone could explain what is happening and how I can keep the data in the same order while adding a unique session_id and timestamp to each record. Is there some parameter I need to change to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this with a clean template again. It could be some parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split and join the data back to their corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely on instinct and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an issue for I/O in the cluster, since each record = one flow file = one unique UUID? What is a comfortable number of flow files for NiFi to handle? (I know it depends on the cluster config and will try to get more details about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?
In this case you would need to define a schema for your CSV that includes all the columns in your data, plus the session id and timestamp. Then, using an UpdateRecord processor, you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This streams the content record by record, updates each one, and writes it back out, keeping it all as one flow file.
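Just to illustrate the idea outside NiFi (this is not a NiFi template; the file names are placeholders and the column layout comes from the sample data above), the record-oriented approach amounts to streaming the file and prepending the two fields to each record, roughly like this in Python:

import csv
import time
import uuid

# Hypothetical file names, for illustration only.
with open("dummy1.csv", newline="") as fin, \
     open("dummy1_with_ids.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    timestamp = time.strftime("%a %b %d %H:%M:%S %Z %Y")
    for record in csv.reader(fin):
        # Prepend a unique session id and a timestamp, preserving record order.
        writer.writerow([str(uuid.uuid4()), timestamp] + record)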
This looks promising. Can you share a simple template that pulls files from HDFS > processes them > writes them back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions, but let me see if I can create a generic template and I will share it.
Thank you for your wisdom! :)

How to use RRDTool/Cacti to count "user activities" in apache access logs?

Goal
I wish to use RRDTool to count logical "user activity" from our web application's apache/tomcat access logs.
Specifically we want to count, for a period, occurrences of several url patterns.
Example
We have two applications (call them 'foo' and 'bar')
These url's interest us. They indicate when users 'did interesting stuff'.
/foo/hop
/foo/skip
/foo/jump
/bar/crawl
/bar/walk
/bar/run
Basically we want to know, for a given interval (10 minutes, hour, day, etc.), how many users hopped, skipped, jumped, crawled, walked, etc.
Reference/Starting point
This article on importing access logs into RRDTool seemed like a helpful starting point.
http://neidetcher.com/programming/2014/05/13/just-enough-rrdtool.html
However, to clarify: that example uses the access log directly, whereas we want to group a handful of URLs into 'buckets' and count the 'number in each bucket'.
Some Scripting Required..
I could do this with bash, grep, and wc, iterating through the patterns and sending output to an 'intermediate results' text file. That said, I believe RRDTool could do this with minimal 'outside coding', but I am unclear on the details.
Some points
I mention 'two applications' because we actually serve them up from separate servers with different log file formats. I'd like to get them into the same RRA file.
Eventually I'd like to report this in cacti; initially however, I wanted to understand RRDTool details
Open to doing any coding, but I would like to keep it as efficient as possible, both administratively and in terms of computer resources. (By administratively, I mean easy to monitor new instances.)
I am very new to RRDTool and am RTFM'ing (and walking through the tutorial). I'm used to relational databases, spreadsheets, etc., and don't yet have my head around all the nuances of the RRA format.
Thanks in advance!
You could set up a separate RRD file with an ABSOLUTE-type data source for each URL you want to track.
Then you tail the log file, and whenever one of the interesting URLs rushes by, you call:
rrdtool update url-xyz.rrd N:1
The ABSOLUTE data source type is like a counter, but it gets reset every time it is read. Your counter will just count to one, but that should not be a problem.
In the example above I am using N: and not the timestamp from the access log. You could also use the log timestamp if you are not doing this in real time, but beware that you cannot update the same RRD file twice for the same timestamp. N: uses millisecond timestamps internally and will thus probably avoid this problem.
On the other hand it may make more sense to accumulate matching log entries with the same timestamp and only update rrdtool with that number once the timestamp on the logfile changes.
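A rough sketch of that tail-and-update loop in Python (the log path, URL list, and RRD file names below are placeholders, and it assumes the RRD files with ABSOLUTE data sources already exist):

import subprocess
import time

# Hypothetical paths and file names - adjust to your setup.
ACCESS_LOG = "/var/log/apache2/access.log"
URLS = {
    "/foo/hop": "foo-hop.rrd",
    "/foo/skip": "foo-skip.rrd",
    "/foo/jump": "foo-jump.rrd",
    "/bar/crawl": "bar-crawl.rrd",
    "/bar/walk": "bar-walk.rrd",
    "/bar/run": "bar-run.rrd",
}

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(ACCESS_LOG):
    for url, rrd in URLS.items():
        if url in line:
            # "N:1" = one hit at the current time, as in the answer above.
            subprocess.run(["rrdtool", "update", rrd, "N:1"], check=False)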

QTP VB Script to update data excel placed in QC

I'm trying to automate a set of test cases that pass inputs from one to the next. For instance, if I have 5 test cases, the 1st test case passes its output to the 2nd, the 2nd to the 3rd, and so on.
Another point to note is that I won't be doing batch execution, and there will be a certain time gap between each test case.
So what I'm trying to do is write the outputs to some Excel sheet and read them back during the next execution. I have searched and tried some code, but nothing has worked out.
So please share some ideas on how to update, at run time, an Excel sheet that is stored in QC. Thanks!
What you're essentially saying is that you have test runs separated by some indeterminate amount of time, and you need to share data between runs. The answer is you need persisted storage of your data. You could use a database, flat file, Excel spreadsheet, or anything else that will let you programmatically write data in one run then read it in the next.
Excel spreadsheets are one such solution. You said you tried it and it did not work. That likely means that the method you used to write or read the data was incorrect, and not that there was a problem with the concept. If you provide some more specifics about exactly what you tried and where it failed, hopefully the community will be able to assist you.
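Purely to illustrate that write-in-one-run, read-in-the-next pattern (shown here in Python with openpyxl for brevity; in QTP you would drive Excel through its COM object instead, and the file and cell names are made up):

from openpyxl import Workbook, load_workbook

SHARED_FILE = "testcase_outputs.xlsx"  # hypothetical shared workbook

def save_output(value):
    """At the end of test case N: persist its output."""
    wb = Workbook()
    wb.active["A1"] = value
    wb.save(SHARED_FILE)

def load_input():
    """At the start of test case N+1: read the previous output."""
    wb = load_workbook(SHARED_FILE)
    return wb.active["A1"].value

save_output("order-12345")  # test case 1 writes
print(load_input())         # a later test case reads "order-12345"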
I believe you have input Excel sheet(s) in QC. What you can do is download the Excel file from QC to the local machine, store the output from the 1st test case in this sheet, and upload it back to QC. You can then use it as input to the next test case.

JMeter - saving results + configuring "graph results" time span

I am using JMeter and have 2 questions (I have read the FAQ + Wiki etc):
I use the Graph Results listener. It seems to have a fixed span, e.g. 2 hours (just guessing; this is not indicated anywhere AFAIK), after which it wraps around and starts drawing on the same canvas from the left again. Hence after a long weekend run it only shows the results of the last 2 hours. Can I configure that span or other properties (beyond the check boxes I see on the Graph Results listener itself)?
Can I save the results of a run and open them later? I know I can save the test plan or parts of it. I am unclear whether I can separately save just the test results data, then later open it and perform comparisons etc. Furthermore, can I open it with listeners that weren't part of the original test? (I think of the test as accumulating data, and later on I want to view and interpret the data using different "viewers".)
Thanks,
-- Shaul
Don't know about 1. Regarding 2: listeners typically have a configuration field for "Write All Data to a File", which lets you specify the file name. You can use the Simple Data Writer to store results efficiently for later analysis.
You can load results from a previous test into a visualizer by choosing "Write All Data to a File" and browsing for the file you wish to load. Somewhat counterintuitively, selecting a file for writing also loads that file into the visualizer and displays the results. Just make sure you don't run the test again while that file is selected, otherwise you will lose your saved test data. :-)
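And once results are on disk, they can also be compared outside JMeter entirely. A small Python sketch, assuming results were saved in the default CSV format with its "label" and "elapsed" columns (the file name is a placeholder):

import csv
from collections import defaultdict

# Hypothetical results file written by a listener / Simple Data Writer.
totals = defaultdict(lambda: [0, 0])  # label -> [sample count, total elapsed ms]

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["label"]][0] += 1
        totals[row["label"]][1] += int(row["elapsed"])

for label, (count, elapsed) in totals.items():
    print(f"{label}: {count} samples, avg {elapsed / count:.1f} ms")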
Well, I later found a JMeter group discussing the issue raised in my first question, and B.Ramann gave me an excellent suggestion to use a better graph instead, found here.
-- Shaul
