find and replace lines of data in h2o flow - h2o

I am wondering how to find and replace certain values of data in H2O flow. For example, if a line of data is '0.003 8.938 0.005 9.999' I would like to find all of the '9.999' and replace with NaN, so sort of like sed?
Is there a way to do this directly in the Flow interface?
Thanks

At this point in time, Flow is primarily a modeling tool rather than a data munging tool, so there is currently no way to do this in the Flow interface. sed is an efficient tool for find/replace, so that's what I'd recommend. Once you've edited the file, you can import it into the H2O cluster for modeling. Alternatively, you can do more advanced munging on the data as an H2OFrame using the R or Python API for H2O.
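If you would rather script the replacement in Python than in sed, here is a minimal sketch with pandas (the file name, the space separator and the lack of a header row are assumptions); the cleaned file can then be imported into the H2O cluster as described above:

import numpy as np
import pandas as pd

# Read the space-separated data, e.g. a line like '0.003 8.938 0.005 9.999'
df = pd.read_csv("data.txt", sep=" ", header=None)

# Replace the 9.999 sentinel with NaN (the sed-style find/replace)
df = df.replace(9.999, np.nan)

df.to_csv("data_clean.csv", index=False, header=False)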

Related

Do ASCII Point Clouds need R G B data?

I am trying to open NASA's Clementine LIDAR data of the Lunar surface (Link: https://pds-geosciences.wustl.edu/missions/clementine/lidar.html). The data is saved as a *.tab file which is really an ASCII file. When I look for the data I need (x,y,z,r,g,b) I can only find data points x,y,z but not r,g,b.
Main Question If my goal is to open this file in CloudCompare and develop a mesh/dem file from this, do I need r,g,b data?
Side Questions If so, how do you recommend I get it from data that is from the 90s? Or at least, how would I go about opening this?
As far as I know, there is no need to have R,G,B in CloudCompare, but you will definitely have to prepare the data in a format that CloudCompare can read. That shouldn't be a problem using simple Python scripts, for example along the lines of the sketch below. If you need further help, let me know.
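A minimal sketch of such a script, assuming the .tab file is whitespace-separated ASCII and that the coordinate columns sit at the positions used below (the file name and column indices are hypothetical; check them against the dataset's label file):

# Convert an ASCII .tab point file into a plain 'x y z' file that CloudCompare can open.
with open("lidar.tab") as src, open("lidar_xyz.txt", "w") as dst:
    for line in src:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        x, y, z = fields[0], fields[1], fields[2]  # adjust indices to the real column layout
        dst.write(f"{x} {y} {z}\n")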
You do not need the RGB data to open it on CloudCompare, just the point coordinates. There is no way of retrieving the radiometric information of this point cloud unless you have some imagery of the same epoch. Then you could associate the RGB to the point cloud by using collinearity equations, for instance.

icCube - ETL - Creating a File dump and then use it as data source

Following the question on how to execute a file dump in icCube, I would like to know whether it is possible to:
create a file dump
then use it as a data source
I tried to build a sequence of data views, but I cannot get it to work, and I wonder if it is even possible at all.
(The reason I would like to do this is that my main data source is an OData feed and I need a lot of data manipulation before I can load it. I anticipate that it will be much easier to do this on CSV files.)
This is not possible, as the rationale behind the ETL support is to transform data tables as returned by the data sources.

Best ETL Packages In Python

I have 2 use cases:
Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).
I am currently using:
SQLAlchemy for extracts (works well generally).
PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out).
An internal tool for the scheduling component (which stores the transforms in XML and then loads from the XML, and seems rather long and complicated).
I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).
You can try pyetl, an ETL framework written in Python 3:
from pyetl import Task, DatabaseReader, DatabaseWriter

reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")

# column-name mapping between source and target, plus per-column transforms
columns = {"id": "uuid", "name": "full_name"}
functions = {"id": str, "name": lambda x: x.strip()}

Task(reader, writer, columns=columns, functions=functions).start()
How about Python and Pandas? This is what we use for our ETL processing.
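For the ~50m+ row extracts that time out with PETL, one common pattern is to stream the data in chunks with SQLAlchemy and pandas rather than loading everything at once. A minimal sketch, with hypothetical connection strings, table and column names:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings; substitute your Oracle/PostgreSQL/Redshift URLs.
source = create_engine("postgresql://user:pass@source-host/source_db")
target = create_engine("postgresql://user:pass@redshift-host:5439/warehouse")

# Stream the extract in chunks so the whole table never has to fit in memory.
for chunk in pd.read_sql_query("SELECT * FROM big_table", source, chunksize=100_000):
    chunk["name"] = chunk["name"].str.strip()  # example transform
    chunk.to_sql("big_table", target, if_exists="append", index=False)

(For bulk loads into Redshift specifically, the COPY command from S3 is usually much faster than row-by-row inserts; the sketch only illustrates the chunking idea.)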
I'm using Pandas to access my ETL files; try doing something like this:
Create a class with all your queries in it.
Create another class that processes the actual data warehouse, using Pandas and Matplotlib for the graphs (a rough sketch follows below).
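A minimal sketch of that two-class layout (the connection URL, query and column names are all hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

class Queries:
    # Keep all SQL in one place.
    DAILY_SALES = "SELECT sale_date, amount FROM sales"

class Warehouse:
    def __init__(self, url):
        self.engine = create_engine(url)

    def daily_sales(self):
        return pd.read_sql_query(Queries.DAILY_SALES, self.engine)

    def plot_daily_sales(self, path="daily_sales.png"):
        df = self.daily_sales()
        df.groupby("sale_date")["amount"].sum().plot()
        plt.savefig(path)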
Consider having a look at the convtools library; it provides lots of data-processing primitives, is pure Python and has zero dependencies.
Since it generates ad hoc Python code under the hood, it sometimes outperforms pandas/polars, so it can fill some gaps in your workflows, especially if they are dynamic in nature.

Parse Freebase RDF dump with MapReduce

I downloaded the rdf data dump from Freebase and what I need to extract is the name of every entity in English in Freebase.
Do I have to use Hadoop and MapReduce to do this, if so how? Or is there another way to extract the entity names?
It would be nice if each entity title / name were on its own line in a .txt file
You could use Hadoop, but for such simple processing, you'd spend more time uncompressing and splitting the input than you would save in being able to do the search in parallel. A simple zgrep would accomplish your task in much less time.
Something along the lines of this:
zegrep $'name.*#en\t\\.$' freebase-public/rdf/freebase-rdf-2013-09-15-00-00.gz | cut -f 1,3 | gzip > freebase-names-20130915.txt.gz
will give you a compressed two-column file of Freebase MIDs and their English names. You'll probably want to make the grep a little more specific to avoid false positives (and test it, which I haven't done). The dump is over 20 GB compressed, so this will take a while, but less time than even getting started on preparing a Hadoop job.
If you want to do additional filtering, such as only extracting entities with a type of /common/topic, you may find that you need to move to a scripting language like Python so you can look at and evaluate multiple lines at once.
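A minimal sketch of the same extraction in Python, assuming the dump is gzipped, tab-separated N-Triples with one triple per line (file names are illustrative, and as above the predicate test may need tightening to avoid false positives):

import gzip

src_path = "freebase-rdf-2013-09-15-00-00.gz"  # the downloaded dump
dst_path = "freebase-names-en.txt"             # one MID/name pair per line

with gzip.open(src_path, "rt", encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
    for line in src:
        fields = line.rstrip("\n .").split("\t")
        if len(fields) < 3:
            continue
        subject, predicate, obj = fields[:3]
        # keep only English type.object.name triples
        if "type.object.name" in predicate and obj.endswith("@en"):
            dst.write(subject + "\t" + obj + "\n")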
No, I don't think you need to use Hadoop and MapReduce for this. You can easily create a web service to extract the RDF and write it to a file. The blog post at [1] explains how you can extract RDF data using the WSO2 Data Services Server. Similarly, you can use the WSO2 DSS data federation feature [2] to extract RDF data and send it to an Excel sheet.
[1] - http://sparkletechthoughts.blogspot.com/2011/09/extracting-rdf-data-using-wso2-data.html
[2] - http://prabathabey.blogspot.com/2011/08/data-federation-with-wso2-data-service.html
There's a screencast for Google Compute Engine that shows you how to do this as well.

working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have consists of HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
The other is to find a way to use raw HDF files in map-reduce programs.
So far I have not been successful in finding any Java code that reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks
Ayush
Here are some resources:
SciHadoop (uses NetCDF but might already have been extended to HDF5).
You can either use JHDF5 or the lower-level official Java HDF5 interface to read data out of any HDF5 file in the map-reduce task.
For your first option, you could use a conversion tool like HDF dump to dump the HDF files to text format. Otherwise, you can write a program that uses a Java library for reading HDF files and writes the data out to text (a scripting-language equivalent is sketched below).
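If a scripting language is acceptable for the conversion step, here is a minimal sketch with Python's h5py that dumps the datasets of an HDF5 file to text files (the input file name is hypothetical, and HDF4 files would need a different reader):

import h5py
import numpy as np

def dump(name, obj):
    # Write each 1-D/2-D dataset to its own text file; np.savetxt cannot handle
    # higher-rank arrays, so those are skipped in this sketch.
    if isinstance(obj, h5py.Dataset) and obj.ndim <= 2:
        np.savetxt(name.replace("/", "_") + ".txt", np.atleast_2d(obj[...]), fmt="%s")

with h5py.File("input.h5", "r") as f:
    f.visititems(dump)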
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls of the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might have to go down a somewhat harder path and develop a similar solution yourself.
If you cannot find suitable Java code and can work in another language, then you can use Hadoop Streaming.
SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is a good option. It is built on a variant of MapReduce that has been shown to run many scientific applications much more efficiently than Hadoop.
