We are using aws glue etl jobs to convert the s3 Json or CSV to parquet format and the result we are saving in nnew s3 .
This job is running periodically.
We are facing an issue,for example if we have 10json files each time it runs it is creating new 10parquet files so it becomes 10 20 30 40....and so on we only want to see 10 files .
Is there any way we can override existing parquet files .We are using glue generated Python script only.
Can we convert only the updated files or can we overdue all files ?
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a parquet file using python.
Related
Need help in processing incremental files.
Scenario: Source team is creating file in every 1hr in s3 (hrly partitioned). I would like to process in every 4hr. The Glue etl will read the s3 files (partitioned hrly) and process to store in different s3 folders.
Note : Glue ETL is called from airflow.
Question How can I make sure that I only process the incremental files ( let’s say 4 files in each execution)?
Sounds like a use case for Bookmarks
For example, your ETL job might read new partitions in an Amazon S3
file. AWS Glue tracks which partitions the job has processed
successfully to prevent duplicate processing and duplicate data in the
job's target data store.
I have my data source which generates hourly files in csv format which are pushed to S3. Then using Glue I do some ETL and push the transformed data again back to S3.
The other department which consumes this data wants the files to be consolidated into a single file for yesterday.
I have written a python program that consolidates yesterday's 24 files into a single CSV file.
Now it is also needed that the single consolidated file should also be available in Parquet.
I created a crawler to generate my csv table and then I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe because of the snappy compression. But I want to create a single one. How can I do this in Glue ?Secondly I would like to understand that when to use multiple Parquet files and when it makes sense to create a single one.
You can break out to DataFrames, call repartition(1) and then call write.
I am trying to read parquet file from s3 bucket in nifi.
to read the file I have used processor listS3 and fetchS3Object and then ExtractAttribute processor. till there it looked fine.
the files are in parquet.gz file and by no mean i was able to generate the flowfile from them, My final purpose is to load the file in noSql(SnowFlake).
FetchParquet works with HDFS which we are not used.
My next option is to use executeScript processor (with python) to read these parquet file and save them back to text.
Can somebody please suggest any work around.
It depends what you need to do with the Parquet files.
For example, if you wanted to get them to your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine. This is because this scenario is just moving around bytes and doesn't really matter whether it is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do for getting it into a database, then you need to use FetchParquet and convert from Parquet to some other format like Avro, Json, or Csv, and then send that to one of the database processors.
You can use Fetch/Put Parquet processors, or any other HDFS processors, with s3 by configuring a core-site.xml with an s3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
We have our data stored in hive text files and parquet files is there anyway to load directly from these into H2O or do we have to go through an intermediate step like csv or pandas dataframe?
yes, you can find all the information you need here
H2O currently supports the following file types:
CSV (delimited) files (including GZipped CSV)
ORC
SVMLight
ARFF
XLS
XLSX
Avro version 1.8.0 (without multifile parsing or column type modification)
Parquet
Notes:
ORC is available only if H2O is running as a Hadoop job.
Users can also import Hive files that are saved in ORC format.
When doing a parallel data import into a cluster:
If the data is an unzipped csv file, H2O can do offset reads, so each node in your cluster can be directly reading its part of the csv file in parallel.
If the data is zipped, H2O will have to read the whole file and unzip it before doing the parallel read.
So, if you have very large data files reading from HDFS, it is best to use unzipped csv. But if the data is further away than the LAN, then it is best to use zipped csv.
I have a use case where i have all my 4 tb of data in HBase tables that i have interrogated with HIVE tables .
Now i want to extract 5 k files out of this 30 tables that i have created in HIVE.
This 5K files will be created by predefined 5K queries.
Can somebody suggest me what approach i should follow for this?
Required time for this is 15 hrs .
Should i write java code to generate all this files .
File generation is fast .Out of 5k text files there are 50 files that takes around 35 minutes rest of all creates very fast .
I have to generate zipped file and have to send it to client using ftp.
If I understand your question right, you can accomplish your task by first exporting the query results via one of methods from here : How to export a Hive table into a CSV file?, compressing the files in a zip archive and then FTP'ing them. You can write a shell script to automate the process.