Need suggestion on strategy to handle informatica error "value larger than specified precision allowed this column" as it is causing repeated failures - business-intelligence

We are facing Informatica ERROR "ORA-01438: value larger than specified precision allowed for this column" again and again due to incorrect value entered by user in source system (Oracle EBS).
What strategy we follow currently is:
Informatica ETL fails due to this error
We request user to correct the value in source system
Informatica ETL fails until the value is corrected in source
After value correction, ETL completed successfully.
But we need a strategy to handle this kind of incorrect values which are causing repetative failure to whole execution plan.
Note: Issue is not with floating number.
For example: If there is column unit price having precision Number(8,2), users are putting larger values by mistake such as 123456789123.00 , 9876541236487.00 etc.

Your question isn't a problem with a pure, technical solution, but a question about data quality and data handling.
If the users are "putting larger values by mistake" then why doesn't the source system prevent this at data entry? Why correct a mistake that should never exist in the first place?
If you "need a strategy" as you claim: Do things right from the start.

The field size of your source should match what is set for the field size on the source system then whether the user entered a wrong value or not it will go through your mapping. You shouldnt have a more restrictive source than allowed on the source system (although it probably boils down to one party not fulfilling the interface agreement between your teams and that party changes their code)

Related

Oracle Sequence to change currval why not just drop it and recreate?

I want to increase the CURRVAL of a set of Sequences.
One option is to drop the Sequence and recreate it with a starting value equal to the value I want to increase to however I've come across a number of suggestions (here, here and here) which don't drop it but instead :
restrict access
change the "increment by" to whatever value you with to increase CURRVAL by
pull a value
change "increment by" back to 1
un-restrict access
This is all fine but I'm left wondering what's so bad about dropping the Sequence ?
I get it that the triggers will go invalid but the next time they're used they'll be recompiled - is this really such a big deal ?
Is there something I've overlooked ?
It becomes a big deal when there are lots of complex dependencies on either the sequence or between the trigger code and other pl/sql packages.
As a general practice, you want to minimize situations that will result in invalidated pl/sql code (stored procedures, packages, and trigger code). The auto compile will run on-demand the next time the code is called. But if you have a high volume transaction system or a large-scale application with complex dependencies, those delays (or subsequent exceptions) caused by recompiling may be unacceptable.
If your particular setup is such that the trigger code is very simple with no dependencies and you can perform the DDL during a time where the trigger will not be fired, it is probably acceptable to drop and recreate the synonym and then recompile the trigger code.

Design a dimension with multiple data sources

I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source.
My Example:
I have 2 data sources - the Ordering System and the Execution System. The Ordering system has details about payment and what should happen; the Execution System has details on what actually happened (how long it took etc, who executed on the order). Data from both systems is need to created a single fact.
In both the Ordering and Execution system they is a Location table. The business keys from both systems are mapped via an esb . There are attributes in both systems that make up the complete picture about a single location. Billing information is in the Ordering system, latitude and longitude are in the Execution system. And Location Name exists in both systems.
How do you design a SCD accomodate changes from both systems to the dimension?
We follow a fairly strict Kimball methodology - fyi, but I am open to looking at everyone's solutions.
Not necessarily an answer but here are my thoughts:
You've already covered the real options in your comment. Either:
A. Merge it beforehand
You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference to be stored in addition to normal DW data
OR
B. Merge it in the dimension
Put both records in the dimension and allow the reporting tool to 'merge' it by, for example, grouping by location name. This means you don't need prior merging logic you just dump it in the dimension
However you have two constraints that I feel makes the choice between A & B clearer
Firstly, you need an SCD (Type 2 I assume). This means Option B could get very complicated as when there is a change in one source record you have to go find the the other record and change it as well - very unpleasant for option B. You still need some kind of pre-stored key to link them, which means option B is no longer simple
Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match
So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.
You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.
My thinking is actually very trivial. First you need to be able to conclude what is your master dataset on Geo+Location and granularity.
My method will be:
DIM loading
Say below is my target
Dim_Location = {Business_key, Longitude, Latitude, Location Name}
Dictionary
Business_key = Always maps to master record from source system (in this case it is the execution system). Imagine now the unique key from business is combined (longitude, latitude) for this table.
Location Name = Again, since we assume the "Execution system" is master for our data then it will host from Source="Execution System".
The above table is now loaded for Fact lookup.
Fact Loading
You have already integrated record between execution system and billing system. It's a straight forward lookup and load in staging since it exists with necessary combination of geo_location.
Challenging scenarios
What if execution system has a late arriving records on orders?
What if same geo_location points to multiple location names? Not possible but worth profiling the data for errors.

Handling Duplicates in Data Warehouse

I was going through the below link for handling Data Quality issues in a data warehouse.
http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/
"
Responding to Quality Events
I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best choice.
"
In some dimensional feeds (like Client list), sometimes we get a same Client twice (the two records having difference in certain attributes). What is the best solution in this scenario?
I don't want to reject both records (as that would mean incomplete client data).
The source systems are very slow in fixing the issue, so we get the same issues every day. That means a manual fix to the problem also is tough as it has to be done every day (we receive the client list everyday).
Selecting a single record is not possible as we don't know what the correct value is.
Having both the records in our warehouse means our joins are disrupted. Because of two rows for the same ID, the fact table rows are doubled (in a join).
Any thoughts?
What is the best solution in this scenario?
There are a lot of permutations and combinations with your scenario. The big questions is "Are the differing details valid or invalid? as this will change how you deal with them.
Valid Data Example: Record 1 has John Smith living at 12 Main St, Record 2 has John Smith living at 45 Main St. This is valid because John Smith moved address between the first and second record. This is an example of Valid Data. If the data is valid you have options such as create a slowly changing dimension and track the changes (end date old record, start date new record).
Invalid Data Example: However if the data is INVALID (eg your system somehow creates duplicate keys incorrectly) then your options are different. I doubt you want to surface this data, as it's currently invalid and, as you pointed out, you don't have a way to identify which duplicate record is "correct". But you don't want your whole load to fail/halt.
In this instance you would usually:
Push these duplicate rows to a "Quarantine" area
Push an alert to the people who have the power to fix this operationally
Optionally select one of the records randomly as the "golden detail" record (so your system will still tally with totals) and mark an attribute on the record saying that it's "Invalid" and under investigation.
The point that Kimball is trying to make is that Option 1 is not desirable because it halts your entire system for errors that will happen, Option 2 isn't ideal because it means your aggregations will appear out of sync with your source systems, so Option 3 is the most desirable as it still leads to a data fix, but doesn't halt the process or the use of the data (but it does alert the users that this data is suspect).

Default values in target tables

I have some mappings, where business entities are being populated after transformation logic. The row volumes are on the higher side, and there are quite a few business attributes which are defaulted to certain static values.
Therefore, in order to reduce the data pushed from mapping, i created "default" clause on the target table, and stopped feeding them from the mapping itself. Now, this works out just fine when I am running the session in "Normal" mode. This effectively gives me target table rows, with some columns being fed by the mapping, and the rest taking values based on the "default" clause on the table DDL.
However, since we are dealing with higher end of volumes, I want to run my session in bulk mode (there are no pre-existing indexes on the target tables).
As soon as I switch the session to bulk mode, this particular feature, (of default values) stops working. As a result of this, I get NULL values in the target columns, instead of defined "default" values.
I wonder -
Is this expected behavior ?
If not, am I missing out on some configuration somewhere ?
Should I be making a ticket to Oracle ? or Informatica ?
my configuration -
Informatica 9.5.1 64 bit,
with
Oracle 11g r2 (11.2.0.3)
running on
Solaris (SunOS 5.10)
Looking forward to help here...
Could be expected behavior.
Seem that bulk mode in Informatica use "Direct Path" API in Oracle (see for example https://community.informatica.com/thread/23522 )
From this document ( http://docs.oracle.com/cd/B10500_01/server.920/a96652/ch09.htm , search Field "Defaults on the Direct Path") I gather that:
Default column specifications defined in the database are not
available when you use direct path loading. Fields for which default
values are desired must be specified with the DEFAULTIF clause. If a
DEFAULTIF clause is not specified and the field is NULL, then a null
value is inserted into the database.
This could be the reason of this behaviour.
I don't believe that you'll see a great benefit from not including the defaults, particularly in comparison to the benefits of a direct path load. If the data is going to be readonly then consider compression also.
You should also note that SQL*Net features compression for same values in the same column, so even in conventional path inserts the network overhead is not as high as you might think.

Effect of renaming table column on explain/execution plans

I have a table with 300+ columns and hundreds of thousands of records. I need to re-name one of the existing columns.
Is there anything that I need to be worried about? Will this operation have any effect on the explain plans etc ?
Notes:
I am working on a live production database on Oracle 11g.
This column is not being used currently. It's not populated for any of the rows and I am 100% sure none of the existing queries refer to this column.
If "working on a live production database" means that you are going to try to do this without testing in lower environments while people are working, I would strongly caution against that plan.
Existing query plans that involve the table you're doing DDL on will be invalidated so those queries will need to be hard parsed again. That can easily be an expensive operation if there are large numbers of such queries. It is certainly possible that some query plans will change because something else has changed (i.e. statistics are different, settings are different, bind variables are different, etc.) They won't change because of the column name change but the column name change may result in changed plans.
Any queries that you're executing will, obviously, need to use the new name as soon as you rename the column. That generally means that you need to do a coordinated release where you modify the code (including stored procedures) as well as the column name. That, in turn, generally implies that you're doing this as part of a build that includes at least a bit of downtime. You probably could, if you have the enterprise edition, do edition-based redefinition without downtime but that adds complexity to the process and is something that you would absolutely need to test thoroughly before implementing it in prod.

Resources