We have our data stored in Hive text files and Parquet files. Is there any way to load directly from these into H2O, or do we have to go through an intermediate step like CSV or a pandas DataFrame?
Yes, you can find all the information you need here.
H2O currently supports the following file types:
CSV (delimited) files (including GZipped CSV)
ORC
SVMLight
ARFF
XLS
XLSX
Avro version 1.8.0 (without multifile parsing or column type modification)
Parquet
Notes:
ORC is available only if H2O is running as a Hadoop job.
Users can also import Hive files that are saved in ORC format.
When doing a parallel data import into a cluster:
If the data is an unzipped CSV file, H2O can do offset reads, so each node in your cluster can directly read its part of the CSV file in parallel.
If the data is zipped, H2O has to read the whole file and unzip it before doing the parallel read.
So, if you are reading very large data files from HDFS, it is best to use unzipped CSV. But if the data has to travel farther than the LAN, then it is best to use zipped CSV.
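For the Parquet case from the question, a minimal sketch with the H2O Python client, assuming an HDFS path (the path is a placeholder):

import h2o

h2o.init()

# H2O parses Parquet directly, so no CSV or pandas intermediate is needed
frame = h2o.import_file("hdfs://namenode/path/to/data.parquet")
frame.describe()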
Related
I'm trying to restore some historic backup files that were saved in Parquet format, and I want to read from them once and write the data into a PostgreSQL database.
I know the backup files were saved using Spark, but there is a strict restriction: I can't install Spark on the DB machine, nor read the Parquet files using Spark on a remote device and write them to the database using spark_df.write.jdbc. Everything needs to happen on the DB machine, in the absence of Spark and Hadoop, using only Postgres and Bash scripting.
My file structure is something like:
foo/
foo/part-00000-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
foo/part-00001-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
foo/part-00002-2a4e207f-4c09-48a6-96c7-de0071f966ab.c000.snappy.parquet
..
..
I expect to read the data and schema from each Parquet folder like foo, create a table using that schema, and write the data into that table, using only Bash and the Postgres CLI.
You can use Spark to convert the Parquet files to CSV format, then move the files to the DB machine and import them with any tool:
spark.read.parquet("...").write.csv("...")
import pandas as pd
from sqlalchemy import create_engine

# load the CSV produced by the Spark conversion above
df = pd.read_csv('mypath.csv')
df.columns = [c.lower() for c in df.columns]  # postgres doesn't like capitals or spaces

# note the '@' between the credentials and the host; replace them with your own
engine = create_engine('postgresql://username:password@localhost:5432/dbname')
df.to_sql("my_table_name", engine)
I made a library to convert from parquet to Postgres’ binary format: https://github.com/adriangb/pgpq
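A rough sketch of using it, based on the project's README; treat the exact API, the DSN, and the target table (which must already exist with a matching schema) as assumptions to verify against the project's documentation:

import psycopg
import pyarrow.dataset as ds
from pgpq import ArrowToPostgresBinaryEncoder

# read all the part files under foo/ as one Arrow dataset
dataset = ds.dataset("foo/")
encoder = ArrowToPostgresBinaryEncoder(dataset.schema)

with psycopg.connect("postgresql://user:pass@localhost:5432/dbname") as conn:  # placeholder DSN
    with conn.cursor() as cursor:
        # stream the Parquet batches into Postgres' binary COPY format
        with cursor.copy("COPY my_table FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.write(encoder.write_header())
            for batch in dataset.to_batches():
                copy.write(encoder.write_batch(batch))
            copy.write(encoder.finish())
    conn.commit()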
I have a data source that generates hourly files in CSV format, which are pushed to S3. Then, using Glue, I do some ETL and push the transformed data back to S3.
The other department which consumes this data wants the files to be consolidated into a single file for yesterday.
I have written a python program that consolidates yesterday's 24 files into a single CSV file.
Now it is also needed that the single consolidated file should also be available in Parquet.
I created a crawler to generate my CSV table, and then I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe is because of the Snappy compression. But I want to create a single one. How can I do this in Glue? Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can convert to a DataFrame, call repartition(1), and then call write, as in the sketch below.
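A minimal sketch of that approach in a Glue job; the database, table name, and S3 path are hypothetical placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# read the crawled CSV table as a DynamicFrame (names are placeholders)
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_csv_table")

# repartition(1) collapses the data into one partition,
# so the writer emits a single Parquet part file
df = dynamic_frame.toDF().repartition(1)
df.write.mode("overwrite").parquet("s3://my-bucket/consolidated/")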
I am trying to read Parquet files from an S3 bucket in NiFi.
To read the files, I have used the ListS3 and FetchS3Object processors and then the ExtractAttribute processor. Up to there it looked fine.
The files are parquet.gz files, and by no means was I able to generate a flowfile from them. My final purpose is to load the files into NoSQL (Snowflake).
FetchParquet works with HDFS, which we are not using.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back as text.
Can somebody please suggest a workaround?
It depends what you need to do with the Parquet files.
For example, if you wanted to get them to your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine, because that scenario is just moving bytes around; it doesn't matter whether the content is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do to get it into a database, then you need to use FetchParquet, convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/Put Parquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml with an S3 filesystem, as sketched after the link below.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
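For reference, the core-site.xml in question looks roughly like this; the bucket name and credentials are placeholders, and the exact property set depends on your s3a/Hadoop versions:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>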
After creating a DataFrame, I can save it in Avro, CSV, or Parquet format.
Is there any other format available for a DataFrame or RDD in which data can be saved to Hadoop HDFS?
From What Is Apache Hadoop?:
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
With that, you can use HDFS to store files in virtually any format, including Avro, CSV, Parquet, etc.
In Spark, you specify the format of a DataFrame using the format method and the location in storage using the save method.
format(source: String): DataFrameWriter[T] Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
save(path: String): Unit Saves the content of the DataFrame at the specified path.
You could also use the shortcut that defines both the format and the path on storage, using the format-specific methods like json(path: String), parquet(path: String), or the like.
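A minimal sketch of both forms; the HDFS paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # toy DataFrame

# explicit form: pick the data source with format, the location with save
df.write.format("json").save("hdfs:///tmp/out_json")

# format-specific shortcut: one call sets both
df.write.parquet("hdfs:///tmp/out_parquet")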
RDD save*
pyspark.RDD.saveAsHadoopDataset
pyspark.RDD.saveAsHadoopFile
pyspark.RDD.saveAsNewAPIHadoopDataset
pyspark.RDD.saveAsNewAPIHadoopFile
pyspark.RDD.saveAsPickleFile
pyspark.RDD.saveAsSequenceFile
pyspark.RDD.saveAsTextFile
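A minimal sketch of two of the RDD variants; the paths are again placeholders:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([("k1", 1), ("k2", 2)])

# plain text, one element per line
rdd.saveAsTextFile("hdfs:///tmp/out_text")

# Hadoop SequenceFile of key/value pairs
rdd.saveAsSequenceFile("hdfs:///tmp/out_seq")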
DataFrame save
pyspark.sql.DataFrame.save
pyspark.sql.DataFrameWriter.save
pyspark.sql.DataFrame.saveAsParquetFile
pyspark.sql.DataFrame.saveAsTable
pyspark.sql.DataFrameWriter.saveAsTable
Last but not least, see the Spark DataFrame docs to better understand how to use the DataFrameWriter.
I have months' worth of data from a single domain stored in HDFS in Avro container files. Each file has the schema for all the data in that file, of course. How do I process all the data using Hive or Pig? It seems both Hive and Pig need the avsc file or some form of table structure definition up front, i.e. even if I use Avro tools to extract the avsc from each file, I will have to load each dataset using a different avsc file, and I cannot process all of them using one job or DDL + query.
Isn't it possible for Hive and Pig to pull the avsc at runtime based on the Avro container file being processed? Is this already implemented and I'm just not finding it, or is it too difficult to implement?