Using ParquetWriter, is it possible to write data into Parquet with bucketing? - parquet

I am writing data into Parquet files programmatically with AvroParquetWriter, but I also want to write the Parquet files with bucketing. Is that possible?
Thanks in advance!!
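For reference: AvroParquetWriter itself writes a single file and has no bucketing option; bucketing is a table-layout convention (e.g. in Hive or Spark) where each record is routed to one of N files by hashing a bucket key. A rough sketch of doing that routing by hand, with one AvroParquetWriter per bucket (the file naming and the hash-mod scheme here are assumptions, and a query engine will only treat the output as bucketed if its own bucketing hash matches):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.IOException;

public class BucketedWriter implements AutoCloseable {
    private final ParquetWriter<GenericRecord>[] writers;

    @SuppressWarnings("unchecked")
    public BucketedWriter(String baseDir, Schema schema, int numBuckets) throws IOException {
        writers = new ParquetWriter[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            // One file per bucket, e.g. bucket_00000.parquet (naming is an assumption)
            writers[i] = AvroParquetWriter.<GenericRecord>builder(
                    new Path(baseDir, String.format("bucket_%05d.parquet", i)))
                .withSchema(schema)
                .build();
        }
    }

    // Route each record to a bucket by hashing its bucket key
    public void write(GenericRecord record, String bucketKey) throws IOException {
        int bucket = (bucketKey.hashCode() & Integer.MAX_VALUE) % writers.length;
        writers[bucket].write(record);
    }

    @Override
    public void close() throws IOException {
        for (ParquetWriter<GenericRecord> w : writers) {
            w.close();
        }
    }
}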

Related

Can ParquetWriter or AvroParquetWriter store the schema separately?

Do you know whether ParquetWriter or AvroParquetWriter can store the schema separately, without the data?
Currently the schema is written into the Parquet file:
ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(file.getName()))
    .withSchema(payload.getSchema())
    .build();
Do you know if it is possible to write only the data, without the schema, into a Parquet file?
Thank you!
#ЭльфияВалиева. No, the Parquet metadata (schema) in the footer is required: it is what gives Parquet readers the schema they need to read the data.
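If the goal is simply to have the schema available as a separate artifact (in addition to, not instead of, the copy in the footer), you can dump the Avro schema to a sidecar file while writing. A small sketch continuing the snippet above (the .avsc file name is an assumption):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// payload.getSchema() is the same org.apache.avro.Schema passed to withSchema(...)
String schemaJson = payload.getSchema().toString(true); // pretty-printed JSON
Files.write(Paths.get(file.getName() + ".avsc"),
        schemaJson.getBytes(StandardCharsets.UTF_8));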

How do I use Sqoop to save data in the parquet-avro file format?

I need to move my data from a relational database to HDFS, but I would like to save the data in the parquet-avro file format. Looking at the Sqoop documentation, it seems my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of the blog/picture below, the way parquet-avro works is that it is a Parquet file with the Avro schema embedded, plus a converter to convert and save an Avro object to a Parquet file and vice versa.
My initial assumption is that if I use the Sqoop option --as-parquetfile, then the data saved to the Parquet file will be missing the Avro schema and the converter won't work. However, looking at the Sqoop code that saves the data in Parquet format, it does seem to use an Avro-related utility, but I'm not sure what's going on. Could someone clarify? If I cannot do this with Sqoop, what other options do I have?
parquet-avro is mainly a convenience layer so that you can read/write data that is stored in Apache Parquet as Avro objects. When you read the Parquet file again with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). Thus you should be fine with --as-parquetfile.
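To illustrate that round trip, here is a minimal sketch (assuming parquet-avro is on the classpath and the Path-based builder is available in your version) that reads a Parquet file back as Avro GenericRecords, with the schema taken from the Parquet footer:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadParquetAsAvro {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a Parquet file written with --as-parquetfile or AvroParquetWriter
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                // The Avro schema here was inferred from the Parquet schema in the footer
                System.out.println(record);
            }
        }
    }
}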

How to move HBase tables to HDFS in Parquet format?

I have to build a tool which will move our data from HBase (HFiles) to HDFS in Parquet format.
Please suggest the best way to move data from HBase tables to Parquet tables.
We have to move 400 million records from HBase to Parquet. How can this be achieved, and what is the fastest way to move the data?
Thanks in advance.
Regards,
Pardeep Sharma.
Please have a look at this project: tmalaska/HBase-ToHDFS,
which reads an HBase table and writes the output as Text, Seq, Avro, or Parquet.
Example usage for Parquet:
Exports the data to Parquet
hadoop jar HBaseToHDFS.jar ExportHBaseTableToParquet exportTest c export.parquet false avro.schema
I recently open-sourced a patch to HBase which tackles the problem you are describing.
Have a look here: https://github.com/ibm-research-ireland/hbaquet
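If you would rather roll your own exporter instead of using those tools, the core loop is just an HBase scan feeding an AvroParquetWriter. A rough single-threaded sketch (the table name, the column family/qualifier mirroring the "c" family in the command above, and the two-field Avro schema are all assumptions; a 400-million-row export would normally be run as a MapReduce or Spark job instead):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class HBaseToParquet {
    public static void main(String[] args) throws Exception {
        // Two-field schema used for illustration only
        Schema schema = SchemaBuilder.record("Row").fields()
            .requiredString("rowkey")
            .requiredString("value")
            .endRecord();

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("exportTest"));
             ResultScanner scanner = table.getScanner(new Scan());
             ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("export.parquet"))
                     .withSchema(schema)
                     .build()) {
            for (Result result : scanner) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("rowkey", Bytes.toString(result.getRow()));
                // Column family "c" and qualifier "value" are placeholders for your own columns
                rec.put("value", Bytes.toString(
                    result.getValue(Bytes.toBytes("c"), Bytes.toBytes("value"))));
                writer.write(rec);
            }
        }
    }
}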

How to read the Parquet schema in a non-MapReduce Java program

Is there a way to read Parquet file column names directly from the metadata, without MapReduce? Please give an example. I am using Snappy as the compression codec.
You can use either ParquetFileReader, or the existing parquet-tools utility (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools) to read a Parquet file from the command line.
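A minimal sketch with ParquetFileReader (assuming a reasonably recent parquet-hadoop; the Snappy codec does not matter here, since only the footer metadata is read):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(args[0]), conf))) {
            // The schema lives in the footer metadata, no data pages are decoded
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            for (Type field : schema.getFields()) {
                System.out.println(field.getName());
            }
        }
    }
}

From the command line, the parquet-tools schema command prints the same information.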

Using ParquetFileWriter to write data into Parquet files?

I am a newbie to Parquet!
I have tried the example code below to write data into a Parquet file using ParquetWriter:
http://php.sabscape.com/blog/?p=623
The above example uses ParquetWriter, but I want to use ParquetFileWriter to write data efficiently into Parquet files.
Please suggest an example of how we can write Parquet files using ParquetFileWriter.
You can probably get some ideas from a Parquet column reader that I wrote here.
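Note that ParquetFileWriter is the low-level API that deals directly with footers, row groups and pages, which is rarely what you want for plain record writing. If the goal is just to write Parquet records without Avro, one simpler route is the example ParquetWriter bundled with parquet-hadoop; a sketch under that assumption (schema, file name and values are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteWithGroups {
    public static void main(String[] args) throws Exception {
        // Schema declared in Parquet's own message-type syntax
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required binary name (UTF8); required int32 age; }");

        SimpleGroupFactory groups = new SimpleGroupFactory(schema);
        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("people.parquet"))
                .withType(schema)
                .build()) {
            Group row = groups.newGroup()
                .append("name", "alice")
                .append("age", 30);
            writer.write(row);
        }
    }
}

If you really do need ParquetFileWriter itself, you end up managing row groups, column chunks and page writes by hand; higher-level writers like the one above do that bookkeeping for you.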

Resources