How to delete all frames in H2O? - h2o

I accidentally imported more data into 0xdata's H2O Flow than I actually intended. How can I delete all my data frames?
I already tried Data -> List All Frames -> Delete, but I get the following error message:
Error evaluating cell
Error calling DELETE/....ink
Object 'nfs:....lnk' not found for argument: key
Is there another way to erase those data frames? Where are those data frames physically stored?

Can you please provide more details about your environment: which version of H2O, and which platform?
I would recommend retrying with the latest H2O (see http://h2o.ai/download).

Agreed with @Michal that more info is needed.
If you're using R (I would recommend using R or Python), use h2o.removeAll().
If you're using the Flow UI select Data -> List All Frames then select the check box for all frames and then click Delete selected frames.
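If you go the Python route instead, a minimal sketch looks like this (assuming the cluster backing Flow is reachable at the default localhost:54321; in R the equivalent call is h2o.removeAll()):

import h2o

h2o.connect()       # attach to the running H2O cluster that backs Flow
h2o.remove_all()    # delete every frame and model currently held by the cluster

Note that this wipes models as well as frames, so only use it if you really want an empty cluster.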

Related

How to prevent doubling values on Stream analytics due to blob input

We see an issue on Stream Analytics when using a blob reference input: upon restarting the stream, it prints double values for things joined to it. I assume this is an issue with having more than one blob active during the time it restarts. Currently we pull the files from a folder path in ADLS structured as Output/{date}/{time}/Output.json, which ends up being Output/2021/04/16/01/25/Output.json. These files have a key that the data matches on in the stream with:
IoTData
LEFT JOIN kauiotblobref kio
ON kio.ParentID = IoTData.ConnectionString
which I don't see any issue with, but those files are actually getting created every minute, on the minute, by an Azure Function. So it may be possible that during the start of Stream Analytics it grabs the last file and the one that gets created right after it. (That would be my guess, but I'm not sure how we would fix that.)
Here's a visual of the issue in Power BI (screenshots omitted: one showing the peak, one showing the trough).
This is easily explained when looking at the Cosmos DB container for the device it's capturing from: there are two entries with the same value, assetID, and timestamp, but different recordIDs (which just means Cosmos DB counted them as two separate events). This shouldn't be possible, because we can't send duplicates with the same timestamp from a device.
This seems to be a core issue with blob storage inputs on Stream Analytics, since the job traditionally takes more than a minute to start. The best way I've found to resolve it is to stop the corresponding Azure Functions before starting the stream back up. I'm working to automate that through CI/CD pipelines, which is good practice anyway for editing the stream.
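A hedged sketch of what that automation step could look like, driving the Azure CLI from Python in the pipeline (resource names are placeholders; the command for starting the Stream Analytics job is left as a comment because it depends on the az stream-analytics extension installed on the agent):

import subprocess

RG = "my-resource-group"        # placeholder resource group
FUNC_APP = "my-function-app"    # placeholder Function App that writes the reference blobs

def az(*args):
    # run an Azure CLI command and fail the pipeline step if it errors
    subprocess.run(["az", *args], check=True)

az("functionapp", "stop", "--name", FUNC_APP, "--resource-group", RG)
# ...start the Stream Analytics job here, e.g. via the az stream-analytics
# extension or an ARM/Bicep deployment task for this resource group...
az("functionapp", "start", "--name", FUNC_APP, "--resource-group", RG)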

I lost many pages, need to recover them and fix the error: Fatal exception of type "MediaWiki\Revision\RevisionAccessException"

MediaWiki 1.32 installation on XAMPP on Windows 10.
After using it for a few months, this error suddenly started appearing on many pages: Fatal exception of type "MediaWiki\Revision\RevisionAccessException"
I cannot see my data, undo any changes, or edit the pages anymore; they're locked.
I have lots of data on those pages which I need to recover, and I need to make the pages editable again.
The wiki site was created on MediaWiki 1.32.
Tried: rolling back to previous versions of MediaWiki and restoring the database; didn't work.
Tried: moving the MediaWiki installation and importing the database on a different system (Linux, MySQL, lighttpd); didn't work.
... There is no short answer; you should check the database schema and determine where your text data is.
https://www.mediawiki.org/wiki/Manual:Database_layout
Edit: You may try this:
SELECT P.page_namespace, P.page_title, R.rev_id, C.content_id, C.content_address,
       CONVERT(T.old_text USING utf8) AS page_text
FROM page P
INNER JOIN revision R ON R.rev_id = P.page_latest
INNER JOIN slots S ON R.rev_id = S.slot_revision_id
INNER JOIN content C ON S.slot_content_id = C.content_id
INNER JOIN text T ON CONCAT('tt:', T.old_id) = C.content_address;
If you succeed in extracting the data, be aware of namespaces and other metadata you may need to restore.
https://www.mediawiki.org/wiki/Manual:Namespace
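If that query does return your text, a small script along these lines can dump each page to a file while you rebuild the wiki (a sketch assuming PyMySQL, placeholder credentials, and no table prefix; adjust to your setup):

import pymysql

conn = pymysql.connect(host="localhost", user="wikiuser", password="***",
                       database="wikidb", charset="utf8mb4")

QUERY = """
SELECT P.page_namespace, P.page_title, CONVERT(T.old_text USING utf8) AS page_text
FROM page P
INNER JOIN revision R ON R.rev_id = P.page_latest
INNER JOIN slots S ON R.rev_id = S.slot_revision_id
INNER JOIN content C ON S.slot_content_id = C.content_id
INNER JOIN text T ON CONCAT('tt:', T.old_id) = C.content_address
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    for ns, title, text in cur.fetchall():
        # page_title is stored as binary; decode it and make it filesystem-safe
        title = title.decode("utf-8", "replace") if isinstance(title, bytes) else title
        text = text.decode("utf-8", "replace") if isinstance(text, bytes) else text
        with open(f"{ns}_{title.replace('/', '_')}.wiki", "w", encoding="utf-8") as f:
            f.write(text)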

AutoML Vision: Dataset import takes long time and fails eventually

I am currently trying to import a single-label dataset that contains ~7300 images. I use a single CSV file in the following format to create the dataset (paths shortened):
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
gs://its-2018-40128940-automl-vis-vcm/[...].jpg,CAT_00
[...]
However, the import process failed after processing for over 7 hours (which I find unusually long based on previous experience) with the following error:
File unreadable or invalid gs://[...]
The strange thing is: The files were there and I was able to download and view them on my machine. And once I removed all entries from the CSV except the two "unreadable or invalid" ones and imported this CSV file (same bucket), it worked like a charm and took just a few seconds.
Another dataset with 500 other images caused the same strange behavior.
I have imported and trained a few AutoML Vision models before and I can't figure out what is going wrong this time. Any ideas or debugging tips appreciated. The GCP project is "its-2018-40128940-automl-vis".
Thanks in advance!
File unreadable or invalid is returned when a file either cannot be accessed from GCS (cannot be read due to file size or permissions) or when the file format is considered invalid, for example when an image is in a different format than its extension suggests, or in a format that is not supported by the image service.
When there are errors, the pipeline may be slow because it currently retries with exponential backoff. It tries to detect non-retryable errors and fail fast, but errs on the side of retrying if unsure.
It would be best if you could ensure the images are in the right format, for example by re-converting them into one of the supported formats.
Depending on your platform there are tools to do that.
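As a rough sketch of that re-conversion (assuming Pillow is installed and the images have been synced locally, e.g. with gsutil -m rsync), this rewrites any .jpg whose bytes are not actually JPEG:

import pathlib
from PIL import Image

for path in pathlib.Path("images").rglob("*.jpg"):
    with Image.open(path) as img:
        needs_fix = img.format != "JPEG"          # extension says .jpg, bytes disagree
        fixed = img.convert("RGB") if needs_fix else None
    if needs_fix:
        fixed.save(path, "JPEG")                  # overwrite in place with a real JPEG

After re-uploading the converted files to the bucket, the same CSV can be imported again.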
When I checked a file that had been uploaded via the GCP Storage UI, I found that to match it we have to upload the file programmatically with the following configuration:
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

// bucketName and csv_file are defined elsewhere in the function
storage.bucket(bucketName).upload(`./${csv_file}`, {
  destination: `csv/${csv_file}`,
  // upload the CSV as-is, without gzip transcoding
  gzip: false,
  metadata: {},
});

Apache Nifi MergeContent output data inconsistent?

Fairly new to using NiFi. Need help with the design.
I am trying to create a simple flow with dummy CSV files (for now) in an HDFS directory and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(This may be a poor way to achieve what I am trying to get, but I saw somewhere that a UUID is the best bet when it comes to generating a unique session id, so I thought of extracting each line from the incoming data into its own flowfile and generating a UUID for it.)
But somehow, as you can see, the order of the data is getting messed up. The first 3 rows are not the same in the output; with the test data I am using (50,000 entries), those rows end up somewhere else in the file. Multiple tests show the data order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the Defragment merge strategy, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can get the data back in the same order, with a unique session_id and timestamp for each record. Is there some parameter I need to change or modify to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. It could be some parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split and join the data back to their corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an issue with the I/O in the cluster, since that would mean each record = one flowfile = one unique UUID? What is a comfortable number of flowfiles that NiFi can handle? (I know it depends on cluster config, and I will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
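For reference, the UpdateRecord side of that could look roughly like the following (property names are those of the stock UpdateRecord processor; the CSVReader/CSVRecordSetWriter services and their schema are assumptions you would configure for your data):

UpdateRecord
  Record Reader:                CSVReader (schema = original columns + session_id + timestamp)
  Record Writer:                CSVRecordSetWriter
  Replacement Value Strategy:   Literal Value
  /session_id:                  ${UUID()}
  /timestamp:                   ${now()}

The two /... entries are dynamic properties whose names are record paths, exactly as described above.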
This looks promising. Can you share a simple template that pulls files from HDFS, processes them, and writes them back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions, but let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)

How to rename a data frame in H2o Flow?

Using the H2O Flow interface, I am not able to figure out how to rename a previously created data frame.
I was trying to find a way via the getFrameSummary command, but there is no rename option.
Any workaround? Thanks in advance.
I don't know of a way to change the frame id from Flow once the data has been parsed. If you want to change it when you're uploading, you can do that in the parse setup (see the ID field).
Though that is not helpful if you're talking about renaming frames that were created in the modeling process. An alternative is to open up R or Python, connect to your H2O cluster, and change it from there using the h2o.assign() function (same function name in R/Py).
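A minimal sketch of that from the Python client (the R call looks the same; "old_frame_id" and "new_frame_id" are placeholder ids):

import h2o

h2o.connect()                                # attach to the cluster backing Flow
frame = h2o.get_frame("old_frame_id")        # fetch the existing frame by its id
renamed = h2o.assign(frame, "new_frame_id")  # re-key it under the new frame id
# if the old id still shows up in Flow afterwards, h2o.remove("old_frame_id") clears it

The renamed frame will then appear under the new id in Flow's List All Frames view.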
