SSIS load validation

I'm importing data from a text file using Bulk Insert in a Script Component in an SSIS package.
The package ran successfully and the data was imported into SQL Server.
Now how do I validate the completeness of the data?
1. I can get the row counts from the source and the destination and compare them,
but my manager wants to know how we can verify that all the data has come across as-is, without any issues.
If we were comparing two tables, we could probably join them on all fields and see if anything is missing.
I'm not sure how to compare a text file and a SQL table.
One way I could do it is to write code that reads the file line by line, queries the database for that record, and compares each and every field. We have millions of records, so this is not going to be a simple task.
Is there any other way to validate all of the data?
Any suggestions would be much appreciated
Thanks
Ned

Well, you could take the same file, do a Lookup against the SQL destination, and send any rows whose columns don't match to a Row Count transformation.
Here's a generic example of how you can do this.
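As a rough T-SQL sketch of the same comparison idea, assuming the text file has been re-loaded into a hypothetical staging table dbo.FileStaging and the package loaded dbo.CustomerImport (both names are placeholders):

-- 1) Row counts on both sides
SELECT
    (SELECT COUNT(*) FROM dbo.FileStaging)    AS staging_rows,
    (SELECT COUNT(*) FROM dbo.CustomerImport) AS loaded_rows;

-- 2) Full-row comparison; both queries should return zero rows if the load is complete.
--    EXCEPT is set-based, so exact duplicate rows collapse - the row counts above cover that case.
SELECT * FROM dbo.FileStaging
EXCEPT
SELECT * FROM dbo.CustomerImport;

SELECT * FROM dbo.CustomerImport
EXCEPT
SELECT * FROM dbo.FileStaging;

-- 3) Cheap order-independent aggregate check over all columns
SELECT CHECKSUM_AGG(CHECKSUM(*)) AS staging_checksum FROM dbo.FileStaging;
SELECT CHECKSUM_AGG(CHECKSUM(*)) AS loaded_checksum  FROM dbo.CustomerImport;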

Related

Loading csv and writing bad records with individual errors

I am loading a CSV file into my database using SQL*Loader. My requirement is to create an error file combining the error records from the .bad file with their individual errors from the log file. Meaning, if a record failed because the date is invalid, "Invalid date" should be written against that record in a separate error-description column. Is there any way SQL*Loader provides to combine the two? I am a newbie to SQL*Loader.
The database being used is Oracle 19c.
You might be expecting a little bit too much of SQL*Loader.
How about switching to an external table? In the background it still uses SQL*Loader, but the source data (which resides in a CSV file) is accessible to you by means of a table.
What does that mean for you? You'd write some (PL/)SQL code to fetch data from it. Therefore, if you wrote a stored procedure, there are numerous options you could use - perform various validations, store valid data in one table and invalid data in another, decide what to do with invalid values (discard them? modify them to something else? ...), handle exceptions - basically, everything PL/SQL offers.
Note that this option (generally speaking) requires the file to reside on the database server, in a directory that is the target of an Oracle directory object. The user that will be manipulating the CSV data (i.e. the external table) will have to be granted privileges on that directory by its owner - the SYS user.
SQL*Loader, on the other hand, runs on a local PC, so you don't need access to the server itself, but - as I said - it doesn't provide that much flexibility.
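A minimal sketch of the external-table approach, assuming a hypothetical directory object EXT_DIR pointing at the folder holding the file (here called data.csv), and hypothetical target_table / error_table columns:

-- As SYS (or a suitably privileged user):
--   CREATE DIRECTORY ext_dir AS '/path/on/db/server';
--   GRANT READ, WRITE ON DIRECTORY ext_dir TO your_user;

CREATE TABLE csv_ext (
  id        VARCHAR2(20),
  load_date VARCHAR2(30),
  amount    VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ext_dir:'data.bad'
    LOGFILE ext_dir:'data.log'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('data.csv')
)
REJECT LIMIT UNLIMITED;

-- Validate row by row and route good/bad records with their error text:
BEGIN
  FOR r IN (SELECT * FROM csv_ext) LOOP
    BEGIN
      INSERT INTO target_table (id, load_date, amount)
      VALUES (r.id, TO_DATE(r.load_date, 'YYYY-MM-DD'), TO_NUMBER(r.amount));
    EXCEPTION
      WHEN OTHERS THEN
        INSERT INTO error_table (id, raw_date, raw_amount, error_description)
        VALUES (r.id, r.load_date, r.amount, SQLERRM);
    END;
  END LOOP;
  COMMIT;
END;
/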
It is hard to give you a code answer without an example.
If you want to do this, I can suggest two ways.
From Linux:
If you loaded the data and skipped the errors, you have to do two executions.
That is not an easy way and it is not effective.
From Oracle:
Create a table with VARCHAR2 columns of the same lengths as in the original.
Load the data from the bad file - adapt your CTL so it accepts everything - and try to load it into the second table.
Finally, MERGE the columns back into the original (see the sketch below).
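A sketch of that second approach with hypothetical names (orders as the original table, orders_bad_stg as the all-VARCHAR2 staging table loaded from the .bad file via the adapted control file):

-- Staging table: every column VARCHAR2 so nothing is rejected on datatype
CREATE TABLE orders_bad_stg (
  order_id   VARCHAR2(20),
  order_date VARCHAR2(30),
  amount     VARCHAR2(30)
);

-- After loading orders.bad into orders_bad_stg, fold the convertible rows back
-- into the original table. Values that still fail conversion will raise here
-- and need the row-by-row handling shown in the previous sketch.
MERGE INTO orders o
USING (
  SELECT TO_NUMBER(order_id)               AS order_id,
         TO_DATE(order_date, 'YYYY-MM-DD') AS order_date,
         TO_NUMBER(amount)                 AS amount
  FROM   orders_bad_stg
) s
ON (o.order_id = s.order_id)
WHEN MATCHED THEN
  UPDATE SET o.order_date = s.order_date, o.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (order_id, order_date, amount)
  VALUES (s.order_id, s.order_date, s.amount);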

Serializing query result

I have a financial system with all of its business logic located in the database, and I have to code an automated workflow for batch processing of transactions, which consists of the steps listed below:
A user or an external system inserts some data into a table.
Before further processing, a snapshot of this data in the form of a CSV file with a digital signature has to be made. The CSV snapshot itself and its signature have to be saved in the same input table. The program then updates successfully signed rows to make them available for the further steps of the code.
...further steps of code
The obvious trouble is step 2: I don't know how to assign the results of a query, as a BLOB representing a CSV file, to a variable. It seems like basic stuff, but I couldn't find it. The CSV format was chosen by the users because it is human-readable. The signing itself can be done with a request to an external system, so that's not an issue.
Restrictions:
there is no application server which could process the data, so I have to do it with PL/SQL
there is no way to save a local file; everything must be done on the fly
I know that normally one would do all the work in the application layer or with some local files, but unfortunately this is not the case here.
Any help would be highly appreciated, thanks in advance
I agree with #william-robertson: you just need to create a comma-delimited string (say, a header row plus data rows) and write it to a CLOB. I recommend an "insert" trigger (there are lots of SQL tricks you can do to make that easier). Usage of that CSV string will need to be owned by the part of the application that reads it in and needs to do something with it.
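A minimal PL/SQL sketch of that idea, with hypothetical table and column names (transactions_in as the input table, transactions_snapshot as the place the snapshot is stored):

DECLARE
  l_csv  CLOB;
  l_line VARCHAR2(4000);
BEGIN
  DBMS_LOB.CREATETEMPORARY(l_csv, TRUE);

  -- header row
  l_line := 'ID,TX_DATE,AMOUNT' || CHR(10);
  DBMS_LOB.WRITEAPPEND(l_csv, LENGTH(l_line), l_line);

  -- data rows (hypothetical input table and status column)
  FOR r IN (SELECT id, tx_date, amount
            FROM   transactions_in
            WHERE  status = 'NEW'
            ORDER BY id) LOOP
    l_line := r.id || ','
           || TO_CHAR(r.tx_date, 'YYYY-MM-DD') || ','
           || TO_CHAR(r.amount) || CHR(10);
    DBMS_LOB.WRITEAPPEND(l_csv, LENGTH(l_line), l_line);
  END LOOP;

  -- keep the snapshot next to the data it describes; the signature returned by
  -- the external system can be stored alongside it in the same row
  INSERT INTO transactions_snapshot (snap_csv, created_at)
  VALUES (l_csv, SYSTIMESTAMP);

  DBMS_LOB.FREETEMPORARY(l_csv);
  COMMIT;
END;
/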
I understand you stated you need to create a CSV, but see if you could do XML instead. Then you could use DBMS_XMLGEN to generate the necessary snapshot into a database column directly from the query for it.
I do not accept the idea that a CSV is human-readable (actually try it sometime as straight text). What is valid is that Excel displays it in human-readable form. But Excel should also be able to display the XML in human-readable form. Further, if needed, the data in it can be ported directly back into the original columns.
Just an alternative idea.
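A sketch of that alternative, reusing the same hypothetical names (with a snap_xml column this time); DBMS_XMLGEN.GETXML builds the snapshot as a CLOB directly from the query:

DECLARE
  l_xml CLOB;
BEGIN
  l_xml := DBMS_XMLGEN.GETXML(
             'SELECT id, tx_date, amount FROM transactions_in WHERE status = ''NEW''');

  INSERT INTO transactions_snapshot (snap_xml, created_at)
  VALUES (l_xml, SYSTIMESTAMP);
  COMMIT;
END;
/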

Talend: migrate data maintaining the same row sequence in input db and output db

I am migrating data from Sybase to Oracle using Talend. I am using tSybaseInput for the input and tOracleOutput for the output DB. I am mapping them through a tMap in some jobs and directly in others.
After running the job, the row order is not maintained, i.e. the order in which the data comes from Sybase is not the same as the order reflected in Oracle. I need the order to be the same so that I can validate the data later by outputting the data of both DBs to CSVs and then comparing them (right now I am sorting them with Unix sort, but that seems wrong).
Please suggest a way to maintain the row order of the input DB in the output DB.
Also, is my method of validation correct, or should I try something else?
The character sets and sort orders between the two vendors may be slightly different, which is probably why you are seeing a change in the order. You may want to add a numeric key value to your tables in the Sybase DB, which can then be used to force a particular order once the data is imported into Oracle.
As for validation, if you are already using the Unix command line, once you have a key value you should just be able to use diff to compare the two CSV files without having to involve Excel. Alternatively, you can add both Sybase and Oracle as data sources for Excel and query the data directly into your worksheets instead of generating CSVs.
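A sketch of the key-based idea with hypothetical table names (src_orders in Sybase, tgt_orders in Oracle), where a numeric row_seq key is migrated along with the data:

-- Extract both sides in the same deterministic order before writing the CSVs:
SELECT row_seq, order_id, amount FROM src_orders ORDER BY row_seq;  -- Sybase input
SELECT row_seq, order_id, amount FROM tgt_orders ORDER BY row_seq;  -- Oracle output

-- If the Sybase extract is loaded into an Oracle staging table instead, MINUS
-- compares the two sets directly and avoids the CSV diff entirely:
SELECT row_seq, order_id, amount FROM sybase_staging
MINUS
SELECT row_seq, order_id, amount FROM tgt_orders;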

SSIS does not pull the whole source data rows

I've got an SSIS package whose source is an Oracle view:
Select * From VwWrkf
When I execute it, I get only a third of the data. There are about 1.5 million rows, but only about 450K load into Tabular.
Any reason why that could be?
Use fast load on the destination OLE DB task, which clears the buffer faster and allows all records to be processed. It may be that, as the buffer fills up and records are not processed, the rest of the records are not fetched, or the connection times out.
The issue was the date format of a particular section of the report. It did something which Microsoft did not like.
A related document can be found here.
It has nothing to do with "SSIS does not pull the whole source data rows".
If you preview the table data, it shows only sample data, right? Likewise in the case of select count(*). If you run the data flow, it will pick up all the data from the source and load it into the target table.
If you still have doubts, instead of checking the SSIS preview, can you load the data into a destination temp table and check whether all the data is being loaded into that temp table?
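A quick way to do that check, assuming a hypothetical staging table dbo.VwWrkf_Staging that the data flow has been pointed at:

-- Destination side, after the data flow runs into the staging table
SELECT COUNT(*) AS loaded_rows FROM dbo.VwWrkf_Staging;

-- Source side, run directly against the Oracle view
SELECT COUNT(*) AS source_rows FROM VwWrkf;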

csv viewer on windows environment for 10MM-line file

We need a CSV viewer which can look at 10MM-15MM rows in a Windows environment, and each column should have some filtering capability (some regex or text searching is fine).
I strongly suggest using a database instead, and running queries (e.g. with Access). With proper SQL queries you should be able to filter on the columns you need to see, without handling such huge files all at once. You may need to have someone write a script to load each row of the CSV file (and future CSV file changes) into the database.
I don't want to be the end user of that app. Store the data in SQL. Surely you can define criteria to query on before generating a .csv file. Give the user an online interface with the column headers and filters to apply. Then generate a query based on the selected filters, providing the user only with the lines they need.
This will save many people time, headaches and eye sores.
We had this same issue and used a 'report builder' to build the criteria for the reports prior to actually generating the downloadable csv/Excel file.
As others have suggested, I would also choose a SQL database. It's already optimized to perform queries over large data sets. There are a couple of embedded databases like SQLite or FirebirdSQL (embedded):
http://www.sqlite.org/
http://www.firebirdsql.org/manual/ufb-cs-embedded.html
You can easily import a CSV into a SQL database with just a few lines of code and then build a SQL query, instead of writing your own solution to filter large tabular data.
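A minimal sketch with the sqlite3 command-line shell, assuming a hypothetical file data.csv whose first row holds the column names (colA is a placeholder column):

sqlite3 bigdata.db
.mode csv
.import data.csv rows
-- with .mode csv, .import creates the table from the header row if it does not already exist
CREATE INDEX idx_rows_cola ON rows(colA);
-- filter on a column with LIKE (or GLOB for simple patterns)
SELECT * FROM rows WHERE colA LIKE '%pattern%' LIMIT 100;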
