How to read a Parquet file from an S3 bucket in NiFi? - apache-nifi

I am trying to read Parquet files from an S3 bucket in NiFi.
To read the files I have used the ListS3 and FetchS3Object processors, followed by an ExtractAttribute processor. Up to there it looked fine.
The files are parquet.gz files, and by no means was I able to generate usable flowfiles from them. My final purpose is to load the data into Snowflake.
FetchParquet works with HDFS, which we are not using.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back out as text.
Can somebody please suggest a workaround?

It depends on what you need to do with the Parquet files.
For example, if you just want to get them onto your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine. That scenario is just moving bytes around, so it doesn't matter whether the content is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do in order to get it into a database, then you need to use FetchParquet and convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/Put Parquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml that defines an S3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
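If you do go the script route mentioned in the question, a standalone sketch along these lines could do the Parquet-to-text conversion (NiFi's ExecuteScript runs Jython, which cannot load pyarrow's native libraries, so a script like this would typically be invoked via ExecuteStreamCommand or run outside NiFi). The bucket, key, and output names are placeholders, and it assumes boto3 and pyarrow are installed:

import gzip
import io

import boto3
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="path/data.parquet.gz")  # placeholder S3 object
raw = obj["Body"].read()

# If the object is gzip-wrapped as a whole, unwrap it first; Parquet's own
# page compression (snappy/gzip) is handled internally by pyarrow.
if raw[:2] == b"\x1f\x8b":
    raw = gzip.decompress(raw)

table = pq.read_table(io.BytesIO(raw))
pacsv.write_csv(table, "data.csv")  # plain CSV that a Snowflake COPY INTO can load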

Related

Amazon Glue - Create Single Parquet

I have a data source that generates hourly files in CSV format, which are pushed to S3. Using Glue I then do some ETL and push the transformed data back to S3.
The other department which consumes this data wants the files consolidated into a single file for yesterday.
I have written a Python program that consolidates yesterday's 24 files into a single CSV file.
Now the single consolidated file also needs to be available in Parquet.
I created a crawler to generate my CSV table, and I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe is because of the Snappy compression. But I want to create a single one. How can I do this in Glue? Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can break out to a DataFrame, call repartition(1) and then call write, as in the sketch below.
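A hedged sketch of that approach inside a Glue job, in PySpark; the catalog database, table, and output path are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="daily_csv"  # placeholder catalog entries
)

# Collapse to a single partition so the write produces one Parquet part file.
(dyf.toDF()
    .repartition(1)
    .write.mode("overwrite")
    .parquet("s3://my-bucket/consolidated/"))  # placeholder output location

As for when a single file makes sense: repartition(1) forces all the data through one writer task, so it trades away write parallelism for the convenience of a single output file, which is usually only worth it for modest volumes or for consumers that insist on one file.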

appending to ORC file

I'm new to big data and related technologies, so I'm unsure if we can append data to an existing ORC file. I'm writing the ORC file using the Java API, and when I close the Writer I'm unable to open the file again to write new content to it, basically to append new data.
Is there a way I can append data to the existing ORC file, either using Java Api or Hive or any other means?
One more clarification: when saving a java.util.Date object into an ORC file, the ORC type is stored as:
struct<timestamp:struct<fasttime:bigint,cdate:struct<cachedyear:int,cachedfixeddatejan1:bigint,cachedfixeddatenextjan1:bigint>>,
and for java BigDecimal it's:
<margin:struct<intval:struct<signum:int,mag:struct<>,bitcount:int,bitlength:int,lowestsetbit:int,firstnonzerointnum:int>
Are these correct and is there any info on this?
Update 2017
Yes, now you can! Hive provides new support for ACID, and you can also append data to your table using append mode, mode("append"), with Spark.
Below is an example:
Seq((10, 20)).toDF("a", "b").write.mode("overwrite").saveAsTable("tab1")
Seq((20, 30)).toDF("a", "b").write.mode("append").saveAsTable("tab1")
sql("select * from tab1").show
Or see a more complete example with ORC here; below is an extract:
val command = spark.read.format("jdbc").option("url" .... ).load()
command.write.mode("append").format("orc").option("orc.compression","gzip").save("command.orc")
No, you cannot append directly to an ORC file. Nor to a Parquet file. Nor to any columnar format with a complex internal structure with metadata interleaved with data.
Quoting the official "Apache Parquet" site...
Metadata is written after the data to allow for single pass writing.
Then quoting the official "Apache ORC" site...
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file (...)
The file’s tail consists of 3 parts; the file metadata, file footer
and postscript.
Well, technically, nowadays you can append to an HDFS file; you can even truncate it. But these tricks are only useful for some edge cases (e.g. Flume feeding messages into an HDFS "log file", micro-batch-wise, with fflush from time to time).
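For illustration, a minimal sketch of such an append using the hdfs (WebHDFS) Python client; the NameNode URL, user, and path are placeholders:

from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")  # placeholder NameNode URL
# append=True adds the bytes to the end of the existing file instead of replacing it
client.write("/logs/events.log", data=b"one more micro-batch\n", append=True)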
For Hive transaction support they use a different trick: creating a new ORC file on each transaction (i.e. micro-batch) with periodic compaction jobs running in the background, à la HBase.
Yes, this is possible through Hive, in which you can basically 'concatenate' newer data. From the official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-WhatisACIDandwhyshouldyouuseit?
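For reference, Hive's ALTER TABLE ... CONCATENATE statement merges the small ORC files of a table or partition; a minimal sketch driving it through beeline, where the JDBC URL, table, and partition spec are placeholders:

import subprocess

jdbc_url = "jdbc:hive2://hive-server:10000/default"  # placeholder connection string
statement = "ALTER TABLE events PARTITION (dt='2020-01-01') CONCATENATE"

# beeline -u <url> -e <statement> runs a single HiveQL statement and exits
subprocess.run(["beeline", "-u", jdbc_url, "-e", statement], check=True)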

What's the recommended way of loading data into Hive from compressed files?

I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its existing format. That means if your files are text files you should load them as TEXTFILE, and if your files are sequence files you should load them as SEQUENCEFILE.
For Hive the compression format doesn't matter, because it will decompress the files on the fly, using the file extension as a reference (provided the compression codec is configured properly in Hadoop).
The suggestion in the page you are sharing is that it's better to work with sequence files than with compressed text files. That is because a gzip file is not splittable: if you have a very big gzip file, the whole file has to be processed by a single mapper, so the work cannot be distributed in parallel among the cluster nodes.
Hive's suggestion is therefore to convert compressed text files into sequence files to avoid that limitation. It is only about performance.
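A hedged sketch of that two-step flow, issued from PySpark with Hive support enabled; the table names and S3 path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Load the gzipped text as-is; the .gz files are decompressed on the fly when read.
spark.sql("CREATE TABLE IF NOT EXISTS logs_text (line STRING) STORED AS TEXTFILE")
spark.sql("LOAD DATA INPATH 's3://my-bucket/logs/2020-01-01/' INTO TABLE logs_text")

# 2. Optionally rewrite into a splittable format so later jobs can parallelize.
spark.sql("CREATE TABLE IF NOT EXISTS logs_seq (line STRING) STORED AS SEQUENCEFILE")
spark.sql("INSERT OVERWRITE TABLE logs_seq SELECT line FROM logs_text")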
If your files are small (less than one Hadoop block, 128 MB by default), then it doesn't matter.

Avro file type for images?

I am trying to figure out this case in Hadoop.
What is the best file format, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?
SequenceFiles are key-value oriented, so I think Avro files will work better?
I use SequenceFiles to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, hence they can store images efficiently. As the keys in a SequenceFile I usually use the original image file names.
SequenceFiles are used in many image-processing products, such as OpenIMAJ. You can use existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle has better performance than SequenceFile, but it can be used only by HIPI.
If you don't have a large number of files (less than 1M), you can try storing them without packaging them into one big file, and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images and I don't know of any project that uses it.
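A hedged sketch of the filename-as-key approach with PySpark; the HDFS paths are placeholders:

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Pack the raw image bytes into a SequenceFile, keyed by the original file path.
images = sc.binaryFiles("hdfs:///data/images/*.jpg")  # (path, bytes) pairs
images.mapValues(bytearray).saveAsSequenceFile("hdfs:///data/images_seq")

# Later jobs read the bundle back as (path, bytes) records and decode as needed.
packed = sc.sequenceFile("hdfs:///data/images_seq")
print(packed.keys().take(5))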

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to build the table from the log files, I wrote a regex expression to parse the data and store it into different columns of an external table. I know that a SerDe can be used to read data in JSON format, which means each log file line could be a JSON object. Are there any Hadoop performance advantages if my log files are in JSON format compared to CSV format?
If you can process the output of the table (that you created with the regex), why do more processing? Try to avoid unnecessary steps.
I think the main issue here is which format is faster to read. I believe CSV will give better speed than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once in memory.
