Reading a large CSV file in Azure Logic Apps using Azure Storage Blob - azure-blob-storage

I have a large line delimited (not comma) CSV file (1.2 million lines - 140mb) that contains both data and metadata from a test. The first 50 or so lines are metadata which I need to extract to populate an SQL table.
I have built a Logic App which uses the Azure Blob Storage connector as a trigger. This CSV file is copied into the blob and it triggers the app to do it's stuff. For small files under 50mb this works fine however I get this error for larger files.
InvalidTemplate. Unable to process template language expressions in action 'GetMetaArray' inputs at line '0' and column '0': 'The template language function 'body' cannot be used when the referenced action outputs body has large aggregated partial content. Actions with large aggregated partial content can only be referenced by actions that support chunked transfer mode.'.
The output query is take(split(body('GetBlobContent'), decodeUriComponent('%0D%0A')),100)
The query allows me to put the line delimited meta data into an array so I can perform some queries against it to extract data I convert into variables and use them to check the file for consistency (e.g meta data must meet certain criteria)
I understand that the "Get Blob Content V2" supports chunking natively however from the error it seems like I cannot use the body function to return my array. Can anyone offer any suggestions how I get around this issue? I only need to use a tiny proportion of this file
Thanks Jonny

Related

ORA-29285: file write error

I'm trying to extract data from an Oracle table. I'm using utl file for that and I'm receiving the error ORA-29285: file write error. The weird here is if I try extract the data directly from the table return the error, if I extract the data using a simple view the error is returned as well, BUT if I extract the data using a view with an ORDER BY the extraction is well succeed. I can't understand where the error is, I already look for the length of lines and nothing. Any suggestion from which can be?
I extract a lot of other data through the utl_file and I'm well succed. This data in specific is at the first time uploaded to Oracle table directly from a csv file with ANSI encoding. However I have other data uploaded by the same way and then I can export correctly. I checked the encoding too in order to reduce the possible mistakes and I found nothing.
Many thanks,
Priscila Ferreira

Codeigniter PHPExcel reader displaying data very slow

I am using Codeigniter PHPExcel.
Basically i am using an excel files to read data and display it on the website through PHPExcel. Currently it is taking time to load.
What it basically is doing is creating JSON files through PHPExcel libraries and the data is being read through JSON once the page has been loaded.
But, I am facing a slow load now. When i went through the JSON file i saw that the size is around 3.5 MB and i have more than 3 files through which i am reading the data.
Can anyone suggest me any workarounds for the optimisation? I have read about "Reading in Chunks".
Can we read for the few rows for the first request, like basic filtering we generally do while fetching from database?
May be you should use and DB to prevent loading files for many time? If you use JSON format - MongoDB or PostrgesQL (with json field) will be perfect. Or just parse fields from excel file and load this to normalized DB.

Analyzing huge amount of JSON files on S3

I have huge amount of json files, >100TB size in total, each json file is 10GB bzipped, and each line contain a json object, and they are stored on s3
If I want to transform the json into csv (also stored on s3) so I can import them into redshift directly, is writing custom code using hadoop the only choice?
Would it be possible to do adhoc query on the json file without transform the data into other format (since I don't want to convert them into other format first every time I need to do query as the source is growing)
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting for this. By using the JsonSerde, you can easily transform the data into csv format. This would only require you to do a insert the data into a CSV formatted table from the JSON formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
Also a good library used for CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived and only necessary for that one job, which can also span across low cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header field with some metadata and then the actual csv data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to process load such data using kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.

Huge files in hadoop: how to store metadata?

I have a use case to upload some tera-bytes of text files as sequences files on HDFS.
These text files have several layouts ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
creating a key, value class per text file layout and use it to create and upload as sequence files ?
create SequenceFile.Metadata header in each file being uploaded as sequence file individually ?
Any inputs are appreciated !
Thanks
I prefer storing meta data with the data and then designing your application to be meta data driven, as opposed to embedding meta data in the design or implementation of your application which then means updates to metadata require updates to your app. Ofcourse there are limits to how far you can take a metadata driven application.
You can embed the meta data with the data such as by using an encoding scheme like JSON or you could have the meta data along side the data such as having records in the SeqFile specifically for describing meta data perhaps using reserved tags for the keys so as to given metadata its own namespace separate from the namespace used by the keys for the actual data.
As for the recommendation of whether this should be packaged into separate Hadoop files, bare in mind that Hadoop can be instructed to split a file into Splits (input for map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single hdfs file is that it more closely resembles the unit of containment of your original data.
As for the recommendation about key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other values. The more compact the key, the faster the comparison. Thus if you can store a dense version of the key that would be preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same then it will also help performance. So, for instance, serializing a Java class as the key would not be recommended because the text stream begins with the package name of your class which is likely to be the same as every other class and thus key in the file.
If you want data and its metadata bundled together, then AVRO format is the appropriate one. It allows schema evolution also.
The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to make the Key, the data itself is the value as a Text. SequenceFiles are designed for storing key/value pairs, if that's not what your data is then don't use a SequenceFile. You could just upload unprocessed text files and input those to Hadoop.
For best performance, do not make each file terabytes in size. The Map stage of Hadoop runs one job per input file. You want to have more files than you have CPU cores in your Hadoop cluster. Otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128MB, but for best results you should measure this yourself.

Resources