I have a data source that generates hourly files in CSV format, which are pushed to S3. Then, using Glue, I do some ETL and push the transformed data back to S3.
The other department that consumes this data wants the files consolidated into a single file covering the previous day.
I have written a Python program that consolidates yesterday's 24 files into a single CSV file.
Now the single consolidated file also needs to be available in Parquet.
I created a crawler to generate my CSV table, and I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe is because of the Snappy compression. I want to create a single one. How can I do this in Glue? Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can convert the DynamicFrame to a Spark DataFrame, call repartition(1), and then call write.
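A minimal sketch of what that might look like inside a Glue job; the database, table, and output path below are hypothetical, not taken from the question:

```python
# Hypothetical Glue job sketch: convert the DynamicFrame to a DataFrame,
# force a single partition, and write one Parquet part file.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",               # assumed Data Catalog database
    table_name="daily_consolidated_csv",  # assumed crawler-created table
)

df = dyf.toDF()  # break out of the DynamicFrame into a Spark DataFrame
(df.repartition(1)
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/parquet/daily/"))  # assumed output prefix
```

Spark will still write a folder at that path; repartition(1) just guarantees it contains a single part-*.parquet file (plus marker files, depending on the committer).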
Related
I have to process some raw data files in CSV with cleansing transformations and load them as .parquet files in the cleanse layer. The raw layer file (CSV) and the cleanse layer file should have the same name.
But I cannot save the .parquet file with the given name; a directory is created, and underneath it the .parquet files are saved with random names. Please help me figure out how to accomplish this.
This is how Parquet files are designed to be: a collection of multiple row groups.
The name you give your Parquet output is actually the folder that the chunks of data are saved under.
If you want a single file, you will have to use a different file format, and you will likely lose the parallelization capabilities offered by parquet for both read and write.
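For illustration, a small PySpark sketch of the behaviour described above, together with one common workaround when a fixed file name is really required: write a single part with coalesce(1), then rename that part file afterwards. All paths and names here are hypothetical:

```python
# Spark writes Parquet output as a directory of part files; the path you give
# is the directory name. To end up with one file under a chosen name, write a
# single part and rename it afterwards via the Hadoop FileSystem API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("s3://raw-layer/input_20230101.csv")

tmp_dir = "s3://cleanse-layer/_tmp_input_20230101"
df.coalesce(1).write.mode("overwrite").parquet(tmp_dir)  # -> _tmp.../part-00000-....parquet

# Locate the single part file and rename it to the desired name.
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI("s3://cleanse-layer"), conf)
Path = jvm.org.apache.hadoop.fs.Path
part = [s.getPath() for s in fs.listStatus(Path(tmp_dir))
        if s.getPath().getName().startswith("part-")][0]
fs.rename(part, Path("s3://cleanse-layer/input_20230101.parquet"))
```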
I'm trying to read a large dataset of Parquet files piece by piece, perform some operation on each piece, and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.
How can I use PyArrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files; see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
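Until then, a minimal sketch of the file-at-a-time approach implied above, assuming the dataset is just a directory of Parquet files and process() stands in for whatever operation you need:

```python
# Read one Parquet file at a time so only a single file is held in memory.
import glob
import pyarrow.parquet as pq

def process(table):
    # Placeholder for the real per-chunk operation.
    print(table.num_rows)

for path in sorted(glob.glob("data/*.parquet")):
    table = pq.read_table(path)  # complete read of one file
    process(table)
    del table                    # release it before moving to the next file
```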
I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a way to achieve this simply? For example, on day one I receive 100 small files and combine them into one file, then on day two I add/append more files to the previously created file, and so on.
If the files all have the same "schema" (say, CSV or JSON), then you can write a very basic Pig / Spark job to read a whole folder of tiny files and write it back out somewhere else, which will very likely merge the files into larger sizes based on the HDFS block size.
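As an illustration, a minimal PySpark version of that read-the-folder-and-rewrite job; the paths and the target partition count are made up for the example:

```python
# Read a whole folder of small CSV files and rewrite them as a few larger files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.option("header", "true").csv("hdfs:///data/incoming/2023-01-01/")

# Coalesce to a handful of partitions so each output file lands closer to the HDFS block size.
df.coalesce(4).write.mode("overwrite").csv("hdfs:///data/compacted/2023-01-01/", header=True)
```

coalesce avoids a full shuffle; repartition(4) would also work if you prefer the shuffle in exchange for more evenly sized output files.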
You've also mentioned Hive, so you could use an external table over the small files and a CTAS query to create a separate table, thereby launching a MapReduce job, much the same as Pig would do.
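Roughly, that external-table-plus-CTAS pattern looks like the sketch below. The statements are issued through Spark's Hive support only to keep the example self-contained; on a plain Hive setup, running the same statements in the Hive CLI triggers the MapReduce job mentioned above. Table names, columns, paths, and the ORC output format are all arbitrary choices for the example:

```python
# External table over the folder of small files, then a CTAS into a compacted table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS small_files_ext (id STRING, value STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/incoming/'
""")

# CTAS rewrites the data into larger files under the new table's location.
spark.sql("""
    CREATE TABLE compacted_table
    STORED AS ORC
    AS SELECT * FROM small_files_ext
""")
```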
IMO, if possible, the optimal solution is to set up a system "upstream" of Hadoop that batches your smaller files into larger files and then dumps them out to HDFS. Apache NiFi is a useful tool for this purpose.
I have a huge number of JSON files, >100 TB in total; each JSON file is 10 GB bzipped, each line contains a JSON object, and they are stored on S3.
If I want to transform the JSON into CSV (also stored on S3) so I can import it into Redshift directly, is writing custom code using Hadoop the only choice?
Would it be possible to do ad hoc queries on the JSON files without transforming the data into another format? (I don't want to convert them into another format every time I need to run a query, since the source keeps growing.)
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting for this. By using the JsonSerde, you can easily transform the data into CSV format. This only requires inserting the data into a CSV-formatted table from the JSON-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
Also, a good library for the CSV format is:
https://github.com/ogrodnek/csv-serde
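To make the pattern concrete, here is a rough sketch of the JSON-table-to-CSV-table flow. The statements are run through Spark's Hive support purely to keep the example self-contained (on EMR you would normally run them in the Hive CLI), and the column names, S3 paths, and SerDe class are illustrative assumptions rather than values taken from the links above:

```python
# JSON external table -> CSV external table via a single INSERT ... SELECT.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The JsonSerde jar must already be on the classpath (e.g. ADD JAR / --jars).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_json (user_id STRING, ts STRING, amount DOUBLE)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-bucket/raw-json/'
""")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_csv (user_id STRING, ts STRING, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/csv-for-redshift/'
""")

# The heavy lifting: rewrite the JSON data as CSV, ready for Redshift COPY.
spark.sql("INSERT OVERWRITE TABLE events_csv SELECT user_id, ts, amount FROM events_json")
```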
The EMR cluster can be short-lived and only needed for that one job, and it can also run on low-cost Spot Instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to build the table from the log files, I wrote a regex to parse the data and store it into different columns of an external table. I know that a SerDe can be used to read data in JSON format, which means that each log file line could be a JSON object. Are there any Hadoop performance advantages if my log files are in JSON format compared to CSV format?
If you can process the output of the table (that you created with the regex), why do additional processing? Try to avoid unnecessary work.
I think the main issue here is which format is faster to read. I believe CSV will be faster than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it, once in memory.