H2O Driverless AI cannot import data

I tried to import my data in JSON format, but the import took forever and I could do nothing except wait.
The files consist of a list of images, and for each image, you can find the following fields:
id - the id of the image
band_1, band_2 - the flattened image data. Each band is a 75x75 image stored as a flat list, so each list has 5625 elements. Note that these values are not the usual non-negative integers found in image files, since they have physical meaning: they are floating-point numbers with units of dB. Band 1 and Band 2 are signals characterized by radar backscatter produced from different polarizations at a particular incidence angle.
inc_angle - the incidence angle at which the image was taken. Note that this field has missing data marked as "na", and the images with "na" incidence angles are all in the training data to prevent leakage.
is_iceberg - the target variable, set to 1 if it is an iceberg, and 0 if it is a ship.
Please advise what I can do to try this product on my data. I want to predict the probability that an image is an iceberg.

Reposting Branden Murray's solution: you can convert your JSON to CSV (a sketch follows the list below).
Also, here are the currently supported file formats for Driverless AI as of version 1.1.6 (May 29, 2018):
File formats supported:
Plain text formats of columnar data (.csv, .tsv, .txt)
Compressed archives (.zip, .gz)
Excel files
Feather binary files
Python datatable binary directories
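A rough sketch of the JSON-to-CSV conversion (assuming the JSON is a top-level list of records with the fields described above; the use of pandas and the file names are my own choices, not part of the original answer):

import pandas as pd

# Load the JSON: a list of records with id, band_1, band_2, inc_angle, is_iceberg.
df = pd.read_json("train.json")

# Flatten each 5625-element band list into its own columns so every cell is scalar.
band1 = pd.DataFrame(df.pop("band_1").tolist()).add_prefix("band_1_")
band2 = pd.DataFrame(df.pop("band_2").tolist()).add_prefix("band_2_")
flat = pd.concat([df, band1, band2], axis=1)

# Turn the "na" strings into real missing values so they are parsed as such.
flat["inc_angle"] = pd.to_numeric(flat["inc_angle"], errors="coerce")

flat.to_csv("train.csv", index=False)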


PyTorch: wrapping multiple records in one file?

Is there a standard way of encoding multiple records (in this case, data from multiple .png or .jpeg images) in one file that PyTorch can read? Something similar to TensorFlow's "TFRecord" or MXNet's "RecordIO", but for PyTorch.
I need to download image data from S3 for inference, and it's much slower if my image data is in many small .jpg files rather than fewer files.
Thanks.
One option is to store batches of images together in a single .npz file. NumPy's np.savez saves multiple arrays into a single file (np.savez_compressed additionally compresses them). Then load the file back as NumPy arrays and use torch.from_numpy to convert them to tensors.
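A minimal sketch of that approach (the shapes, dtypes, and file names are hypothetical):

import numpy as np
import torch

# Hypothetical batch: 64 RGB images of size 224x224 packed into one array.
images = np.zeros((64, 3, 224, 224), dtype=np.float32)
labels = np.zeros(64, dtype=np.int64)

# One file per batch instead of 64 small .jpg files on S3.
np.savez_compressed("batch_0000.npz", images=images, labels=labels)

# Later (e.g. after downloading from S3), load and convert to tensors.
with np.load("batch_0000.npz") as batch:
    x = torch.from_numpy(batch["images"])
    y = torch.from_numpy(batch["labels"])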

H2O “OUTPUT - CLUSTER MEANS” section does not report metrics correctly

(Note: this is related to a question I posted before, “H2O (open source) for K-means clustering”.)
I am using K-Means on our data set of about 100 features (some of them are timestamps).
(1) I checked the “OUTPUT - CLUSTER MEANS” section and the timestamp field has a value like “1.4144556086883196e+22”. Our timestamp field covers data from 2018, and the Unix time (in milliseconds) for 2018 looks like “1541092918000”, so it cannot be a number as big as “1.4144556086883196e+22”. My understanding is that the numbers in the “OUTPUT - CLUSTER MEANS” section should be close to the raw data (before standardization). Right?
(2) About standardization: can you use this example https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/test/resources/hex/genmodel/algos/kmeans/model.ini#L21-L27 and tell me how the input data is converted to standardized values? Say I have a raw vector of values (a, b, c, d, 1.8), and I keep only the last element and omit the others. How can I know if it is close to center_2 below in this example? Can you show me how H2O converts the raw data using standardize_means, standardize_mults, and standardize_modes? I am sure H2O has a way to compute standardized values from the model output, but I cannot find the place or the formula.
center_2 = [2.0, 0.0, -0.5466317772145349, 0.04096506994984166, 2.1628815416218337]
Thanks.
1) I'm not sure whether you are seeing a timestamp in Flow or whether you mean your dataset contains a timestamp that H2O-3 has converted. Either way, it sounds like you may have encountered a bug. The timestamps you see in H2O-3 are milliseconds since the Unix epoch, so you have to divide by 1000 before using a Unix time converter (for example, 1541092918000 ms / 1000 = 1541092918 s, which falls in November 2018; https://currentmillis.com/ can do the conversion). But given that your number is so many orders of magnitude larger than any plausible epoch value, I'm leaning towards a bug - any code you can provide to make it reproducible would be great.
1a) When you check “standardize” in Flow, then in addition to “OUTPUT - CLUSTER MEANS” (which is not standardized) you will see “OUTPUT - STANDARDIZED CLUSTER MEANS”, so the non-standardized output should reflect the units of your input.
2) Standardization in H2O-3 is described in the documentation, which says it "standardizes numeric columns to have zero mean and unit variance." The link you provided points to a test model that has been saved as a MOJO, and I'm not sure it makes sense as an example. But in general, standardization in H2O-3 works exactly as standardization is defined.
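To make (2) concrete: as far as I can tell from h2o-genmodel, a numeric value x in column i is standardized as (x - mean_i) * mult_i, where mult_i is 1 / standard deviation. A small Python sketch under that assumption (the means and mults below are placeholders, not values from the linked model.ini):

# Assumption: standardize_means holds each numeric column's mean and
# standardize_mults holds 1/sd, applied as (x - mean) * mult.
standardize_means = [0.0, 0.0, 3.2, 0.1, 1.5]   # placeholder values
standardize_mults = [1.0, 1.0, 0.8, 2.0, 0.5]   # placeholder values

def standardize(raw):
    return [(x - m) * s for x, m, s in zip(raw, standardize_means, standardize_mults)]

center_2 = [2.0, 0.0, -0.5466317772145349, 0.04096506994984166, 2.1628815416218337]

row = standardize([2.0, 0.0, 4.0, 0.12, 5.8])
# Squared Euclidean distance to the (already standardized) cluster center:
dist2 = sum((x - c) ** 2 for x, c in zip(row, center_2))
print(dist2)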

How to implement a PNG decoder completely from scratch

I started working on a PNG encoding/decoding library for learning purposes, so I want to implement every part of it by hand.
I've gotten pretty far with it, but now I'm a bit stuck. Here are the things I have successfully implemented already:
I can load a PNG binary and go through its bytes
I can read the signature and the IHDR chunk for metadata
I can read the IDAT chunks and concatenate the image data into a buffer
I can read and interpret the zlib headers from the above-mentioned image data
And here is where I got stuck. I vaguely know the steps from here, which are:
Extract the zlib compressed data according to its headers
Figure out the filtering methods used and "undo" them to get the raw data
If everything went correctly, I now have raw RGB data in the form of [<R of 1st pixel>, <G of 1st pixel>, <B of 1st pixel>, <R of 2nd pixel>, <G of 2nd pixel>, etc...]
My questions are:
Is there any easy-to-understand implementation (maybe with examples) or guide to the zlib extraction? I found the official specifications hard to understand.
Can there be multiple filtering methods used in the same file? How to figure these out? How to figure out the "borders" of these differently filtered parts?
Is my understanding of how the final data will look correct? What about the alpha channel, or when a palette is used?
Yes. You can look at puff.c, which is an inflate implementation written with the express purpose of being a guide to how to decode a deflate stream.
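If you want to sanity-check your own inflate code against a known-good implementation, note that the concatenated IDAT bytes form a single zlib stream (header included), so, for example, Python's zlib module can serve as a reference:

import zlib

def reference_inflate(idat_data: bytes) -> bytes:
    # idat_data: the concatenated payload bytes of all IDAT chunks.
    # Compare this output against what your from-scratch inflate produces.
    return zlib.decompress(idat_data)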
Each line of the image can use a different filter, which is specified in the first byte of the decompressed line.
Yes, if you get it all right, then you will have a sequence of pixels, where each pixel is a grayscale value, G; grayscale with an alpha channel, GA; RGB (red, green, blue, in that order); or RGBA. (When a palette is used, each pixel is instead an index into the PLTE chunk.)
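To make the per-line filtering concrete, here is a minimal Python sketch of undoing the five standard filter types (0 = None, 1 = Sub, 2 = Up, 3 = Average, 4 = Paeth), assuming 8-bit samples and no interlacing; bpp is bytes per pixel (3 for RGB, 4 for RGBA):

def paeth(a, b, c):
    # Paeth predictor from the PNG spec: pick the neighbor closest to a + b - c.
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c

def unfilter(decompressed: bytes, width: int, height: int, bpp: int = 3) -> bytearray:
    stride = width * bpp
    out = bytearray()
    prev = bytearray(stride)  # the row "above" the first row is all zeros
    pos = 0
    for _ in range(height):
        ftype = decompressed[pos]  # first byte of each line is the filter type
        line = bytearray(decompressed[pos + 1 : pos + 1 + stride])
        pos += 1 + stride
        for i in range(stride):
            a = line[i - bpp] if i >= bpp else 0  # left (already reconstructed)
            b = prev[i]                           # up
            c = prev[i - bpp] if i >= bpp else 0  # upper-left
            if ftype == 1:
                line[i] = (line[i] + a) & 0xFF             # Sub
            elif ftype == 2:
                line[i] = (line[i] + b) & 0xFF             # Up
            elif ftype == 3:
                line[i] = (line[i] + (a + b) // 2) & 0xFF  # Average
            elif ftype == 4:
                line[i] = (line[i] + paeth(a, b, c)) & 0xFF  # Paeth
            # ftype == 0 (None): the byte is already the raw value
        out += line
        prev = line
    return out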

How to save visualization output of NVIDIA DIGITS to a folder?

I am using NVIDIA DIGITS for image classification. After I train my network and test the model on the test data, I want to save the visualizations and statistics that DIGITS generates to a folder I define. How can I do that?
For example, how can I save each image square of the following image separately to a folder I specify in the program written for this part?

Read an image pixel by pixel in Ruby

I'm trying to open an image file and store a list of pixels by color in a variable/array so I can output them one by one.
Image type: Could be BMP, JPG, GIF or PNG. Any of them is fine and only one needs to be supported.
Color Output: RGB or Hex.
I've looked at a couple of libraries (RMagick, Quick_Magick, Mini_Magick, etc.) and they all seem like overkill. Heroku also has some sort of difficulty with ImageMagick, so my tests don't run. My application is in Sinatra.
Any suggestions?
You can use RMagick's each_pixel method for this. each_pixel takes a block; for each pixel, the block is passed the pixel, the column number, and the row number of the pixel. It iterates over the pixels from left to right and top to bottom.
So something like:
pixels = []
img.each_pixel do |pixel, c, r|
  pixels.push(pixel)
end
# pixels now contains each individual pixel of img
I think ChunkyPNG should do it for you. It's pure Ruby, reasonably lightweight, memory-efficient, and provides access to pixel data as well as image metadata.
If you are only opening the file to display the bytes, and don't need to manipulate it as an image, then it's a simple process of opening the file like any other, reading X number of bytes, then iterating over them. Something like:
File.open('path/to/image.file', 'rb') do |fi|
  byte_block = fi.read(1024)
  byte_block.each_byte do |b|
    puts b  # each_byte yields Integers, so b is already a numeric byte value
  end
end
That will merely output bytes as decimal. You'll want to look at the byte values and build up RGB values to determine colors, so maybe using each_slice(3) and reading in multiples of 3 bytes will help.
Various image formats contain differing header and trailer blocks that store information about the image, the data format, and EXIF information for the capturing device, depending on the type. If you are going to read a file and output the bytes directly, going with something uncompressed, such as uncompressed TIFF, would probably be good. Once you've decided on that, you can jump into the file to skip the headers if you want, or just read those too to see or learn what's in them. Wikipedia's image file formats page is a good jumping-off point for more info on the various formats available.
If you only want to see the image data, then one of the high-level libraries will help, as they have interfaces to grab particular sections of the image. But actually accessing the bytes isn't hard, nor is jumping around in the file.
If you want to learn more about the EXIF block, used to describe many different vendors' JPEG and TIFF formats, ExifTool can be handy. It's written in Perl, so you can look at how the code works. The docs nicely show the header blocks and fields, and you can read/write values using the app.
I'm in the process of testing a new router, so I haven't had a chance to test that code, but it should be close. I'll check it in a bit and update the answer if it doesn't work.
