I have Parquet data that is encrypted at the file level and is read in as an InputStream. I want to extract individual Parquet records from this InputStream. Is there any way to do this? In Avro it is possible with a DatumReader. I am not supposed to write my data to disk in between.
Download to a tmp file first, then read it back:

// Write the InputStream out to a temp file, then point the reader at that path
ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
        .withConf(conf)
        .build();
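If you cannot touch disk at all, one workaround is to implement Parquet's org.apache.parquet.io.InputFile interface over a byte array and hand that to the reader. This is a minimal sketch, assuming you can decrypt and buffer the whole stream into memory; the class name InMemoryInputFile is made up, and the InputFile-based reader builders only exist in reasonably recent parquet-mr releases:

import java.io.ByteArrayInputStream;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

// Hypothetical adapter: exposes an in-memory byte[] as a Parquet InputFile,
// so nothing is ever written to disk.
public class InMemoryInputFile implements InputFile {
    private final byte[] data;

    public InMemoryInputFile(byte[] data) { this.data = data; }

    @Override
    public long getLength() { return data.length; }

    @Override
    public SeekableInputStream newStream() {
        // ByteArrayInputStream already tracks a position; expose seek on top of it.
        SeekableByteArrayInputStream in = new SeekableByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(in) {
            @Override public long getPos() { return in.position(); }
            @Override public void seek(long newPos) { in.seek(newPos); }
        };
    }

    private static final class SeekableByteArrayInputStream extends ByteArrayInputStream {
        SeekableByteArrayInputStream(byte[] buf) { super(buf); }
        long position() { return pos; }            // pos is protected in ByteArrayInputStream
        void seek(long newPos) { pos = (int) newPos; }
    }
}

With that in place you can read records straight from memory, e.g. with the Avro binding: AvroParquetReader.<GenericRecord>builder(new InMemoryInputFile(bytes)).build() (check that your parquet-avro version has the builder(InputFile) overload).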
Related
I am writing data into Parquet files programmatically with AvroParquetWriter, but I also want to write the Parquet files with bucketing. Is it possible to do the same with bucketing?
Thanks in advance!!
I have an OutputStream and I want to create a Parquet file using this OutputStream. Is it possible to do that?
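For the OutputStream question just above: the same in-memory trick works in the write direction by implementing org.apache.parquet.io.OutputFile, since Parquet writes strictly forward and only needs to know the current position. A sketch under the same version assumption; the class name StreamOutputFile is made up:

import java.io.IOException;
import java.io.OutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Hypothetical adapter: lets a ParquetWriter write into any OutputStream.
public class StreamOutputFile implements OutputFile {
    private final OutputStream out;

    public StreamOutputFile(OutputStream out) { this.out = out; }

    @Override
    public PositionOutputStream create(long blockSizeHint) {
        return new PositionOutputStream() {
            private long pos = 0;   // Parquet only needs a forward-moving position

            @Override public long getPos() { return pos; }

            @Override public void write(int b) throws IOException {
                out.write(b);
                pos++;
            }

            @Override public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
                pos += len;
            }

            @Override public void close() throws IOException { out.close(); }
        };
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) {
        return create(blockSizeHint);
    }

    @Override public boolean supportsBlockSize() { return false; }
    @Override public long defaultBlockSize() { return 0; }
}

Recent parquet-avro releases have an AvroParquetWriter.builder(OutputFile) overload, so something like AvroParquetWriter.<GenericRecord>builder(new StreamOutputFile(yourStream)) should then work.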
Why do I have to convert an RDD to a DF in order to write it as Parquet, Avro, or other types? I know writing an RDD in these formats is not supported. I was actually trying to write a Parquet file with the first line containing only the header date and the other lines containing the detail records. A sample file layout:
2019-04-06
101,peter,20000
102,robin,25000
I want to create a Parquet file with the above contents. I already have a CSV file sample.csv with the above contents. When the CSV file is read as a dataframe, it contains only the first field, since the first row has only one column.
rdd = sc.textFile('hdfs://somepath/sample.csv')
df = rdd.toDF()
df.show()
Output:
2019-04-06
101
102
Could someone please help me with converting the entire contents of the RDD into a dataframe? Even when I try reading the file directly as a DF instead of converting from an RDD, the same thing happens.
Your file only has "one column" as far as Spark's reader is concerned, so the dataframe output will only reflect that.
You didn't necessarily do anything wrong, but your input file is malformed if you expect there to be more than one column, and if so, you should be using spark.read.csv() instead of sc.textFile() (see the sketch after this answer).
Why do I have to convert an RDD to DF in order to write it as parquet, avro or other types?
Because those formats need a schema, and an RDD has none.
trying to write a parquet file with first line containing only the header date and other lines containing the detail records
A CSV file header needs to describe all columns; there cannot be an isolated header line sitting above all the rows.
Parquet/Avro/ORC/JSON do not have column headers the way CSV does, but the same rule applies: every record has to fit one schema.
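Putting that together, reading with an explicit schema and writing Parquet looks roughly like this. The sketch uses Spark's Java API (the same options exist in PySpark, and the DDL schema string needs Spark 2.3+); the column names id/name/salary and the output path are assumptions, since the file has no header row:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate();

        // No header row in the file, so supply the schema explicitly.
        // DROPMALFORMED discards rows that don't match it, such as the lone date line.
        Dataset<Row> df = spark.read()
                .schema("id INT, name STRING, salary INT")
                .option("mode", "DROPMALFORMED")
                .csv("hdfs://somepath/sample.csv");

        // Parquet takes its schema from the dataframe, which is exactly why a bare RDD is not enough.
        df.write().parquet("hdfs://somepath/sample_parquet");
    }
}

If you actually need the date, carry it as an extra column on every row rather than as a one-off header line.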
I am trying to read Parquet files from an S3 bucket in NiFi.
To read the files I have used the ListS3 and FetchS3Object processors and then an ExtractAttribute processor. Up to there it looked fine.
The files are parquet.gz files, and by no means was I able to generate flowfiles from them. My final purpose is to load the files into NoSQL (Snowflake).
FetchParquet works with HDFS, which we are not using.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back as text.
Can somebody please suggest any workaround?
It depends what you need to do with the Parquet files.
For example, if you wanted to get them to your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine, because that scenario is just moving bytes around; it doesn't matter whether they are Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do in order to get it into a database, then you need to use FetchParquet to convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/PutParquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml with an S3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
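For reference, the linked thread boils down to pointing the processor's Hadoop Configuration Resources property at a core-site.xml along these lines. This is only a sketch: the bucket name and credentials are placeholders, and static keys are just one option (IAM roles or credential providers work too):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://your-bucket</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>

The s3a connector classes (hadoop-aws and its AWS SDK dependency) also have to be visible to the processor, typically via its Additional Classpath Resources property.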
I am a newbie to Parquet!
I have tried the example code below to write data into a Parquet file using ParquetWriter.
http://php.sabscape.com/blog/?p=623
The above example uses ParquetWriter, but I want to use ParquetFileWriter to write data efficiently into Parquet files.
Could you please suggest an example of how to write Parquet files using ParquetFileWriter?
You can probably get some ideas from a Parquet column reader that I wrote here.
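Be warned that ParquetFileWriter is a low-level API where you assemble row groups and pages yourself, and it is rarely faster than ParquetWriter in practice. What follows is only a minimal sketch of the call sequence for a single required int32 column with PLAIN encoding; exact signatures vary between parquet-mr versions (the constructor and the no-statistics writeDataPage overload used here are deprecated in newer releases), and the output path is made up:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetFileWriterSketch {
    public static void main(String[] args) throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; }");

        ParquetFileWriter writer = new ParquetFileWriter(
                new Configuration(), schema, new Path("/tmp/example.parquet"));

        writer.start();                                   // writes the PAR1 magic bytes
        writer.startBlock(3);                             // one row group holding 3 records
        writer.startColumn(schema.getColumns().get(0), 3, CompressionCodecName.UNCOMPRESSED);

        // PLAIN-encoded int32 values are just little-endian 4-byte integers;
        // a required top-level column has no repetition/definition levels to encode.
        ByteBuffer values = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        values.putInt(101).putInt(102).putInt(103);
        BytesInput page = BytesInput.from(values.array());

        writer.writeDataPage(3, (int) page.size(), page,
                Encoding.BIT_PACKED, Encoding.BIT_PACKED, Encoding.PLAIN);
        writer.endColumn();
        writer.endBlock();
        writer.end(new HashMap<String, String>());        // writes the footer and closes the file
    }
}

Unless you really need page-level control like this, ParquetWriter (or AvroParquetWriter) handles the encoding, statistics, and page sizing for you and is the safer choice.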