Incremental data mapping - C++11

How can we efficiently load data from an incrementally growing CSV file without repeatedly reading the whole file?
I have used the timestamp information in the file to load new data every 5 minutes, but how can we make this work when no timestamp information is available?
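One timestamp-free approach is to persist the byte offset where the previous read stopped and seek past it on the next poll, so only newly appended rows are parsed. Below is a minimal sketch of that idea, shown in Java purely for illustration (with C++11 the same thing works via std::ifstream with tellg()/seekg()); the data.csv and data.csv.offset file names are placeholders.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: only the rows appended since the last poll are read,
// by persisting the byte offset where the previous read stopped.
// "data.csv" and "data.csv.offset" are placeholder file names.
public class IncrementalCsvReader {
    public static void main(String[] args) throws IOException {
        Path csv = Path.of("data.csv");
        Path offsetFile = Path.of("data.csv.offset");

        long offset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim())
                : 0L;

        try (RandomAccessFile raf = new RandomAccessFile(csv.toFile(), "r")) {
            raf.seek(offset);                 // skip everything already processed
            String line;
            while ((line = raf.readLine()) != null) {
                System.out.println(line);     // process the newly appended CSV row here
            }
            offset = raf.getFilePointer();    // remember where this poll stopped
        }
        Files.writeString(offsetFile, Long.toString(offset));
    }
}
```

If the stored offset ever exceeds the current file size, the file was rewritten rather than appended to, and a full reload is needed.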

Related

What is the best way of handling data resubmission in a data warehouse?

Let's assume that we have a data warehouse comprising four components:
Extract: source data is extracted from an Oracle database to a flat file, one flat file per source table. The extraction date is kept as part of the flat file name, and each record carries an insert/update date from the source system.
Staging area: temporary tables used to load the extracted data into the database.
Operational data store: staged data is loaded into the ODS. The ODS keeps the full history of all loaded data, and the data is typecast. Surrogate keys are not yet generated.
Data warehouse: data is loaded from the ODS, surrogate keys are generated, dimensions are historized, and finally fact data is loaded and attached to the proper dimensions.
So far so good, and regular delta loading poses no issue. However, the question I ask myself is this: I have regularly encountered situations where, for whatever reason, you want to resubmit previously extracted data into the loading pipeline. Let's assume, for instance, that we select all the extracted flat files from the last 15 days and push them through the ETL process again.
There is no new extraction from the source systems. Previously loaded files are re-used and fed into the ETL process.
Data is then reloaded into the staging tables, which have been truncated beforehand.
Now the data has to move to the ODS, and here I have a real headache about how to proceed.
Alternative 1: just insert the new rows. So we would have:
new row: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
But then, when loading into the DWH, we need some kind of insert/update mechanism; we cannot do a straight insert, as it would create duplicate facts.
Alternative 2: apply insert/update logic at the ODS level. With the previous example we would:
check whether the ODS table already contains a row with natural key ID001, extraction date 10/1/2022, source system modification timestamp 10/1/2022
insert if not found
Alternative 3: purge the previously loaded data from the ODS, i.e.
purge all the data whose extraction date falls within the last 15 days
load the data from staging.
Alternative 1 is performant but shifts the insert/update work to the DWH level, so the performance killer is still there.
Alternative 2 requires an insert/update (sketched below), which for millions of rows does not seem optimal.
Alternative 3 looks good, but it feels wrong to delete data from the ODS.
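To make Alternative 2 concrete, the insert-if-missing step could look roughly like the following set-based statement, issued here through plain JDBC (the connection URL and the stg_customer/ods_customer table and column names are invented for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustrative only: table/column names (stg_customer, ods_customer, ...) are made up.
public class OdsInsertIfMissing {
    public static void main(String[] args) throws Exception {
        String sql =
            "INSERT INTO ods_customer (natural_key, extraction_date, src_modified_ts, payload) " +
            "SELECT s.natural_key, s.extraction_date, s.src_modified_ts, s.payload " +
            "FROM stg_customer s " +
            "WHERE NOT EXISTS (" +
            "  SELECT 1 FROM ods_customer o " +
            "  WHERE o.natural_key = s.natural_key " +
            "    AND o.extraction_date = s.extraction_date " +
            "    AND o.src_modified_ts = s.src_modified_ts)";

        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "secret");
             PreparedStatement ps = con.prepareStatement(sql)) {
            int inserted = ps.executeUpdate(); // one set-based statement per staging table
            System.out.println("Rows inserted into ODS: " + inserted);
        }
    }
}
```

Even as a single set-based statement, every staged row still has to probe the ODS key index, which is exactly the cost I am worried about for millions of rows.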
What is your view on this? In other words, my question is how to reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or fill in missing data.
There are two primary methods to load data into your data warehouse:
Full load: the entire staged data set is dumped and completely replaced with the new, updated data. No additional information, such as timestamps or technical audit columns, is needed.
Incremental load / delta load: only the difference between the source and target data is loaded through the ETL process into the data warehouse. There are two types of incremental load, depending on data volume: streaming incremental load and batch incremental load.
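As a rough illustration of a batch incremental load, the delta is usually driven by a watermark such as the highest source modification timestamp already present in the target (the JDBC connection details and the stg_customer/ods_customer names below are invented):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Illustrative watermark-driven delta load; table and column names are made up.
public class DeltaLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "secret")) {

            // 1. Read the watermark: the latest source timestamp already loaded.
            Timestamp watermark;
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT MAX(src_modified_ts) FROM ods_customer");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                watermark = rs.getTimestamp(1);
                if (watermark == null) {
                    watermark = new Timestamp(0L); // empty target -> effectively a full load
                }
            }

            // 2. Copy only the rows that changed in the source since the watermark.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO ods_customer (natural_key, src_modified_ts, payload) " +
                    "SELECT natural_key, src_modified_ts, payload " +
                    "FROM stg_customer WHERE src_modified_ts > ?")) {
                ps.setTimestamp(1, watermark);
                System.out.println("Delta rows loaded: " + ps.executeUpdate());
            }
        }
    }
}
```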

Blob Storage read efficiency

A question about read efficiency when using Azure Blob Storage: is it faster to read from multiple small files (e.g. 5 MB each) or from one large file (e.g. > 200 MB)?
In my current project, I need to persist streaming data to Azure Blob Storage in Avro format and read it back afterwards. For example, I can either persist the data every 15 minutes into a separate Avro file, which generates 4*24 = 96 files per day, or I can use an AppendBlob to append all data to one file, which generates a single Avro file per day. When reading the data of the past few days from Blob Storage, which approach would be more efficient?
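For what it's worth, the many-small-files layout would typically be read by listing the blobs under a day's prefix and streaming each one, which also makes it easy to fetch them in parallel. A rough sketch with the azure-storage-blob Java SDK (the container name, the "events/2023-01-15/" prefix, and the connection-string environment variable are placeholders):

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;

import java.io.InputStream;
import java.io.OutputStream;

// Rough sketch: list the Avro blobs written for one day and stream each of them.
// Container name, prefix and environment variable are made up for illustration.
public class ReadDayOfAvroBlobs {
    public static void main(String[] args) throws Exception {
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient()
                .getBlobContainerClient("telemetry");

        for (BlobItem item : container.listBlobs()) {
            if (!item.getName().startsWith("events/2023-01-15/")) {
                continue; // only the 96 fifteen-minute files of that day
            }
            try (InputStream in = container.getBlobClient(item.getName()).openInputStream()) {
                // hand the stream to an Avro reader here; each blob is decoded independently,
                // so the small blobs can also be fetched in parallel
                long bytes = in.transferTo(OutputStream.nullOutputStream());
                System.out.println(item.getName() + ": " + bytes + " bytes");
            }
        }
    }
}
```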

How to read and perform batch processing using Spring Batch annotation config

I have two different files with different data. Each file contains 10K records per day.
Ex:
Yesterday's file:
Productname,price,date
T shirt,500,051221
Pant,1000,051221
Today's file:
Productname,price,date
T shirt,800,061221
Pant,1800,061221
I want to create a final output file containing the price difference between today's and yesterday's files.
Ex:
Productname,price
T shirt,300
Pant,800
I have to do this using Spring Batch.
I have tried a batch configuration with two different steps, but it is only able to read the data, not do the processing, because I need the data from both files for the processing, while in my setup the steps read one file after the other.
Could anyone help me with some sample code?
I would suggest saving the flat-file data into the database for yesterday's and today's dates (maybe in two separate tables, or in the same table if you can easily tell the two sets of records apart). Read this stored data using a JdbcCursorItemReader or PagingItemReader, perform the calculation/logic/massaging of the data at the processor level, and create a new flat file or save to the DB as convenient. Out of the box, Spring Batch does not provide a facility to read data and perform calculations across files.
Alternative suggestion: read the data from both flat files, keep it in a cache, and read from the cache to do the further processing.
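As a rough sketch of that second suggestion (the PriceRecord type, file path and CSV layout are assumptions based on the example above, and the Spring Batch reader/writer wiring is omitted), yesterday's file is cached in a map once and the processor for today's file emits the per-product price difference:

```java
import org.springframework.batch.item.ItemProcessor;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Invented record type for one CSV line such as "T shirt,500,051221".
record PriceRecord(String productName, int price, String date) {}

// Processor for today's file: emits "product,priceDifference" lines.
public class PriceDiffProcessor implements ItemProcessor<PriceRecord, String> {

    private final Map<String, Integer> yesterdayPrices = new HashMap<>();

    // Cache yesterday's file once, before the step that reads today's file runs.
    public PriceDiffProcessor(Path yesterdayCsv) throws IOException {
        for (String line : Files.readAllLines(yesterdayCsv)) {
            String[] parts = line.split(",");
            if (parts.length < 3 || !parts[1].chars().allMatch(Character::isDigit)) {
                continue; // skip the header line
            }
            yesterdayPrices.put(parts[0], Integer.parseInt(parts[1]));
        }
    }

    @Override
    public String process(PriceRecord today) {
        Integer oldPrice = yesterdayPrices.get(today.productName());
        if (oldPrice == null) {
            return null; // product absent yesterday: returning null filters the item out
        }
        return today.productName() + "," + (today.price() - oldPrice);
    }
}
```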

Load on demand for dynamic report

I have a huge CSV file (about 2 GB) for which I would like to generate a dynamic report. Is there any possibility of loading only the first few MB of data into the report and, on scrolling to the next page, loading the next few MB, and so on? The data to be visualized is huge, so I want good performance and to avoid crashes.
Our JRCsvDataSource implementation (assuming this is the one you want to use) does not consume memory while reading CSV data from the file or input stream, as it does not hold on to any values and just parses the data row by row. But if the amount of data is huge, then the report output itself will be huge, in which case you need to use a report virtualizer during report filling and then during viewing/exporting.
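A minimal fill sketch combining JRCsvDataSource with a file-based virtualizer might look like this ("huge.csv" and "report.jasper" are placeholders for your data file and compiled report template):

```java
import net.sf.jasperreports.engine.JRParameter;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.data.JRCsvDataSource;
import net.sf.jasperreports.engine.fill.JRFileVirtualizer;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Placeholder file names; report.jasper is assumed to be an already compiled template.
public class FillHugeCsvReport {
    public static void main(String[] args) throws Exception {
        JRCsvDataSource csv = new JRCsvDataSource(new File("huge.csv"));
        csv.setUseFirstRowAsHeader(true);
        csv.setFieldDelimiter(',');

        // Keep at most 100 report pages in memory; the rest is swapped to temp files.
        JRFileVirtualizer virtualizer =
                new JRFileVirtualizer(100, System.getProperty("java.io.tmpdir"));

        Map<String, Object> params = new HashMap<>();
        params.put(JRParameter.REPORT_VIRTUALIZER, virtualizer);

        try {
            JasperPrint print = JasperFillManager.fillReport("report.jasper", params, csv);
            System.out.println(print.getPages().size() + " pages filled");
            // view or export 'print' here; pages are paged in from disk on demand
        } finally {
            virtualizer.cleanup(); // remove the temporary swap files
        }
    }
}
```

The virtualizer keeps only the configured number of pages in memory and swaps the rest to temporary files, which is what prevents very large outputs from exhausting the heap.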

Analyzing a huge number of JSON files on S3

I have a huge amount of JSON files stored on S3, >100 TB in total; each JSON file is 10 GB bzipped, and each line contains a JSON object.
If I want to transform the JSON into CSV (also stored on S3) so I can import it into Redshift directly, is writing custom code using Hadoop the only choice?
Would it be possible to run ad hoc queries on the JSON files without transforming the data into another format? (Since the source keeps growing, I don't want to convert it into another format every time I need to run a query.)
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting. Using the JsonSerde, you can easily transform the data into CSV format: it only requires inserting the data into a CSV-formatted table from the JSON-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
Also a good library used for CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived and only needed for that one job, and it can also run on low-cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
