A question about read efficiency when using Azure Blob Storage: is it faster to read from multiple small files (e.g. 5 MB each) or from one large file (e.g. > 200 MB)?
In my current project, I need to persist stream data in Azure Blob Storage in Avro format and read it back afterwards. For example, I can either persist the data every 15 minutes into a separate Avro file, which generates 4*24 = 96 files per day, or use an AppendBlob to append all data to one file, which generates a single Avro file per day. When reading the data for the past few days from Blob Storage, which approach would be more efficient?
Related
My data source generates hourly files in CSV format, which are pushed to S3. Using Glue, I then do some ETL and push the transformed data back to S3.
Another department that consumes this data wants yesterday's files consolidated into a single file.
I have written a Python program that consolidates yesterday's 24 files into a single CSV file.
Now the single consolidated file also needs to be available in Parquet.
I created a crawler to generate my CSV table, and I have a Glue job that converts the single transformed file into Parquet, but I am getting multiple parts of the Parquet file, which I believe is because of the Snappy compression. I want to create a single file instead. How can I do this in Glue? Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.
You can convert the DynamicFrame to a DataFrame, call repartition(1) on it, and then call write.
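A minimal sketch of that suggestion, assuming the Glue job already holds a DynamicFrame; the function name `write_single_parquet` and the output path are hypothetical. The number of part files follows the number of partitions in the DataFrame rather than the compression codec, which is why collapsing to one partition yields one part file.

```python
def write_single_parquet(dynamic_frame, output_path):
    """Write a Glue DynamicFrame to `output_path` as a single Parquet part file.

    `dynamic_frame` is assumed to be an awsglue DynamicFrame; `output_path`
    is a hypothetical location such as an S3 prefix.
    """
    # Convert the Glue DynamicFrame to a Spark DataFrame.
    df = dynamic_frame.toDF()
    # Collapse to one partition so Spark emits exactly one part file.
    single = df.repartition(1)
    # Write Parquet; "overwrite" replaces any previous output at this path.
    single.write.mode("overwrite").parquet(output_path)
    return output_path
```

Spark still writes a directory at the output path, containing one part file plus marker files. Note that repartition(1) funnels all data through a single task, so it suits a modest daily file better than a very large dataset.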
I'm trying to read a large dataset of Parquet files piece by piece, do some operation on each piece, and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.
How can I use PyArrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
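Given the file-level granularity described above, one workable pattern is to iterate over the files and read each one completely before moving on. The sketch below assumes the dataset is a list of file paths; `iter_parquet_tables` and `total_rows` are hypothetical helper names, and the `read_table` parameter exists only so the reader can be swapped out.

```python
def iter_parquet_tables(paths, read_table=None):
    """Yield one table per Parquet file so only one file is loaded at a time."""
    if read_table is None:
        # Imported lazily; pyarrow.parquet.read_table reads one whole file.
        import pyarrow.parquet as pq
        read_table = pq.read_table
    for path in paths:
        # Each file is read completely. The previous table can be freed
        # once the caller drops its reference to it.
        yield read_table(path)


def total_rows(paths, read_table=None):
    """Example operation: count rows across all files, one file at a time."""
    count = 0
    for table in iter_parquet_tables(paths, read_table):
        count += table.num_rows
    return count
```

Peak memory is then bounded by the largest single file rather than by the whole dataset, which only helps if no individual file is too large on its own.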
I would like to know the best way to handle large files, around 3-4 gigabytes, stored as Oracle SecureFile blobs.
The scenario: I am planning to upload large files to an Oracle database over a WCF service. I split each file into smaller chunks of 200 MB and upload them one by one. On the Oracle side, I append each chunk to a single blob until the whole file is uploaded, so everything happens sequentially. I am thinking of uploading the chunks in parallel to speed up the upload, but this will not be possible to handle on the Oracle end, because I can't update a single blob from multiple uploads; the bytes would be written out of order. Would it be better to insert each chunk as a separate blob and then merge them into a single blob record on the Oracle side?
Thanks
Jay
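The ordering concern in the question can be illustrated with a small in-memory sketch: each chunk carries a sequence number, chunks are stored independently (possibly in parallel), and the merge step concatenates them by sequence number rather than by arrival order. Everything here is hypothetical; a plain dict stands in for a staging table with one row per chunk, and no Oracle or WCF API is used.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Return (sequence_number, payload) pairs covering `data` in order."""
    return [(i // chunk_size, data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def upload_chunks_parallel(chunks, store, max_workers=4):
    """Store each chunk under its sequence number; arrival order is irrelevant."""
    def upload(chunk):
        seq, payload = chunk
        store[seq] = payload  # stand-in for inserting one row per chunk

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(upload, chunks))  # list() forces completion and surfaces errors


def merge_chunks(store, expected_count):
    """Concatenate chunks by sequence number after checking none are missing."""
    if sorted(store) != list(range(expected_count)):
        raise ValueError("missing or unexpected chunks")
    return b"".join(store[seq] for seq in range(expected_count))
```

The design point is that the sequence number, not the time of arrival, determines the byte order, so the merge can run once after all chunks are present.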
Based on the FAQ at Parse.com:
What is the difference between database storage and file storage?
Database storage refers to data stored as Parse Objects, which are
limited to 128 KB in size. File storage refers to static assets that
are stored using the Parse File APIs, typically images, documents, and
other types of binary data.
Just want some clarification here:
So Strings, Arrays, etc. are considered Parse Objects and fall under database storage, and the URL of a file also counts as database storage since it is part of a Parse Object. But the actual files themselves are considered File storage?
Thanks.
Yes. Any file that you upload to Parse goes to the File storage, the rest is stored in the database including the URL of such files.
I have a huge number of JSON files stored on S3, more than 100 TB in total. Each file is 10 GB bzipped, and each line contains one JSON object.
If I want to transform the JSON into CSV (also stored on S3) so that I can import it into Redshift directly, is writing custom code using Hadoop the only choice?
Would it be possible to run ad hoc queries on the JSON files without transforming the data into another format? I don't want to convert the data every time I need to query it, since the source keeps growing.
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting. By using the JsonSerde, you can easily transform the data into CSV format. This only requires inserting the data from the JSON-formatted table into a CSV-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
A good library for the CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived, existing only for that one job, and it can run on low-cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
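To make the transformation itself concrete, here is a minimal single-file sketch in plain Python rather than Hive. It shows the per-line JSON to CSV mapping that the Hive job would perform in a distributed fashion; it is not meant for 100 TB. The function name and the `columns` argument are hypothetical, and it assumes flat JSON objects, one per line.

```python
import bz2
import csv
import json


def json_lines_bz2_to_csv(src_path, dst_path, columns):
    """Convert a bzip2-compressed JSON-lines file to CSV with the given columns."""
    rows = 0
    # Mode "rt" decompresses on the fly, so the file is streamed line by line.
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Missing keys become empty fields so every row has the same width.
            writer.writerow([record.get(col, "") for col in columns])
            rows += 1
    return rows
```

No header row is written, so every line of the output is data. The column order passed in would need to match the target table's column order.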