Matillion S3 Load component issue

I am loading data from an S3 bucket into a landing table via the S3 Load component in the Matillion ETL tool.
I have records like the one below:
000D3A8B328E|"Rila Borovets" AD||83634A3C|DDFS
The field separator is | and while loading such records I get the following error: "err code: 1214 Delimited value missing end quote".
I want to load the values exactly as they come from the source, without removing the double quotes.
I have millions of records; a few contain double quotes and the rest do not, and I only have the issue with the records that contain double quotes.
How do I handle this scenario?

You don't say which Matillion ETL product you're using there (could be Matillion ETL for Snowflake, for Redshift or for Delta Lake on Databricks).
If you're targeting Snowflake, choose File Type CSV; the only non-default parameter you need to set in your S3 Load component is the Field Delimiter, which should be changed to |.
If you're targeting Redshift, choose Data File Type Delimited; again, the only non-default parameter is the Delimiter, which should be set to |.

Related

Reading a large CSV file in Azure Logic Apps using Azure Storage Blob

I have a large line-delimited (not comma-delimited) CSV file (1.2 million lines, 140 MB) that contains both data and metadata from a test. The first 50 or so lines are metadata, which I need to extract to populate an SQL table.
I have built a Logic App which uses the Azure Blob Storage connector as a trigger. The CSV file is copied into the blob container and triggers the app to do its stuff. For small files under 50 MB this works fine; however, I get this error for larger files:
InvalidTemplate. Unable to process template language expressions in action 'GetMetaArray' inputs at line '0' and column '0': 'The template language function 'body' cannot be used when the referenced action outputs body has large aggregated partial content. Actions with large aggregated partial content can only be referenced by actions that support chunked transfer mode.'.
The expression I use is take(split(body('GetBlobContent'), decodeUriComponent('%0D%0A')), 100)
The expression lets me put the line-delimited metadata into an array so I can run some queries against it, extract data into variables, and use those to check the file for consistency (e.g. the metadata must meet certain criteria).
I understand that "Get Blob Content V2" supports chunking natively; however, from the error it seems I cannot use the body function to return my array. Can anyone suggest how to get around this issue? I only need a tiny proportion of this file.
Thanks, Jonny
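
For comparison, the same "read only the first few lines of the blob" idea outside of Logic Apps might look roughly like the sketch below, assuming the azure-storage-blob Python SDK; the connection string, container and blob names are placeholders.

# Sketch only: download just the head of a large blob instead of its whole body.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",      # placeholder
    container_name="uploads",            # placeholder
    blob_name="test-results.csv",        # placeholder
)

# Fetch only the first 64 KB, which comfortably covers ~50 metadata lines.
head = blob.download_blob(offset=0, length=64 * 1024).readall().decode("utf-8")

# Equivalent of take(split(body(...), '\r\n'), 100): keep the first 100 lines.
metadata_lines = head.split("\r\n")[:100]
print(metadata_lines[:5])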

Glue Crawler excluding many files from table after running on S3 json GZIP data

I have a Lambda that ingests JSON data from a load balancer and writes each individual JSON record with a PUT to a Kinesis stream. The Kinesis stream feeds a Kinesis Firehose delivery stream, which deposits GZIP files into an S3 bucket under the prefix 'raw'. Example JSON record:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking": "true"}}
I created an X-Ray trace in the producing Lambda so I have an idea of how many PUT requests were made (i.e. how many individual JSON records). In the time period I had this ingestion turned on, I sent about 18,000 records to the Kinesis stream. I then ran the crawler on the prefix "raw" with default settings, except that in the "Crawler's output" section I checked "Update all new and existing partitions with metadata from the table" to avoid the HIVE_PARTITION_SCHEMA_MISMATCH error. The crawler runs and successfully detects the schema, which looks like this:
column        data type
level         string
hash          string
source        string
hat_name      string
event_type    string
payload       string   <-- (the only nested JSON field; it can have many possible internal structures)
partition_0   string
partition_1   string
partition_2   string
partition_3   string
Once the table is created, I notice that there are only about 4,000 records, when it should have about four times that amount. When I later reran the crawler, I noticed one line in the logs that says:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler
I examined some of the excluded files; the majority of them had valid JSON data, but in one or two of them the file had a truncated JSON record at the end, like so:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking":
"true"}}{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payl
What do I need to do in Glue to have all records loaded into the table? I should have around 18,000, not 4,200. I think one issue is that the schema may not match exactly on some records, but I validate in the Kinesis producer that each record is valid JSON with the appropriate top-level fields. The second issue I see is the file with the truncated JSON record; I am assuming this may be an issue with Firehose batching the files? Any help is appreciated.
Note: I have tried to manually create the JSON table, defining all top-level fields, and I still have the same problem: it only finds around 4,200 entries when I query in Athena.
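
As an aside, records that have been concatenated back-to-back without newlines, like the excerpt above, can be pulled apart with an incremental JSON decoder. The sketch below uses only the Python standard library, the file name is a placeholder, and it simply stops when it reaches a truncated trailing record.

import gzip
import json

def iter_records(path):
    # Yield each JSON object from a GZIP file of back-to-back records.
    decoder = json.JSONDecoder()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        buf = f.read()
    pos = 0
    while pos < len(buf):
        try:
            record, end = decoder.raw_decode(buf, pos)
        except json.JSONDecodeError:
            break  # truncated trailing record, like the "payl..." fragment above
        yield record
        pos = end
        while pos < len(buf) and buf[pos].isspace():
            pos += 1  # skip any whitespace or newlines between records

# Hypothetical usage:
# for rec in iter_records("raw/part-0000.gz"):
#     print(rec["event_type"])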

Pig: write record types in a single file to multiple outputs

I have the following data in a single file
"HD",003498,"20160913:17:04:10","D3ZYE",1
"EH","XXX-1985977-1",1,"01","20151215","20151215","20151229","20151215","2304",,,"36-126481000",1340.74,61808.00,1126.62,0.00,214.12,0.00,0.00,0.00,"30","20151229","00653845",,,"PARTS","001","ABI","20151215","Y","Y","N","36-126481000",
I would like to use Pig to read this single file and then segregate the records into different files based on the first column.
In the same light, I was looking for a way to first treat each record as the following construct:
recTypCd, recordData
and then later on just treat recordData as a CSV record.
That way, after I store the records with the same record type in separate files, I can simply load each file into its own external Hive table using a CSV SerDe.
You can use SPLIT in Pig based on your condition, for example:
-- load each line whole, derive the record type from the first field, then split
lines = LOAD 'input_file' USING TextLoader() AS (line:chararray);
typed = FOREACH lines GENERATE SUBSTRING(line, 1, 3) AS recTypCd, line;
SPLIT typed INTO hd IF recTypCd == 'HD', eh IF recTypCd == 'EH';
hd_out = FOREACH hd GENERATE line;
eh_out = FOREACH eh GENERATE line;
STORE hd_out INTO 'op1';
STORE eh_out INTO 'op2';

How to store streaming data in Cassandra

I am new to Cassandra and very confused. I know that Cassandra's write speed is very fast. I want to store Twitter data coming from Storm. When I googled this, every answer told me to build an SSTable and load it into the cluster. If I have to build an SSTable every time, how is it possible to store streaming Twitter data in Cassandra?
Please help me.
How can I store log data, which is generated at 1,000 logs per second?
Please correct me if I am wrong.
I think a single Cassandra node can handle 1,000 logs per second without bulk loading if your schema is good. It also depends on the size of each log.
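For example, here is a minimal sketch of writing records directly as they arrive, assuming the DataStax Python driver; the keyspace, table and column names are hypothetical placeholders.

# Sketch only: direct inserts from a stream, no SSTables or bulk loading involved.
import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("twitter")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO tweets (id, created_at, body) VALUES (?, ?, ?)"
)

def store(tweet_body):
    # One prepared INSERT per incoming record.
    session.execute(insert, (uuid.uuid4(), datetime.datetime.utcnow(), tweet_body))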
Or you could use Cassandra's COPY FROM CSV command.
For this you need to create a table first.
Here's an example from the DataStax website:
CREATE TABLE airplanes (
    name text PRIMARY KEY,
    manufacturer text,
    year int,
    mach float
);
COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';
You need to specify the column names in the order in which they are stored in the CSV. For values containing a comma (,), you can enclose them in double quotes (") or use a different delimiter.
For more details, refer to http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/copy_r.html

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a CSV file which, unfortunately, contains multiple data tables. Actually, it's not really a pure CSV file.
It contains a header section with some metadata, and the actual CSV data parts are then separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to process and load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.
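
For illustration, such a pre-processing step might look roughly like the sketch below, written in Python rather than awk; it assumes the separator lines look exactly like //------------- as in the example, and the input/output file names are placeholders.

# Sketch only: split the mixed file into header.csv, table_1.csv, table_2.csv, ...
# so each part can be fed to a normal Kettle CSV file input step.

def split_sections(path):
    sections, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("//-------------"):
                sections.append(current)
                current = []
            else:
                current.append(line)
    sections.append(current)
    return sections

sections = split_sections("report.csv")          # placeholder input file
with open("header.csv", "w", encoding="utf-8") as out:
    out.writelines(sections[0])
for i, table in enumerate(sections[1:], start=1):
    # The first line of each section is the "Table N;;;;" marker; drop it if unwanted.
    with open("table_%d.csv" % i, "w", encoding="utf-8") as out:
        out.writelines(table)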
