When would I need to modify a knowledge module in ODI?

I have come across an ODI project that contains a lot of user-defined KMs, and I don't understand why they were modified. Is there any particular scenario where the existing KMs don't work?

There are many reasons for writing your own KM or modifying the existing ones, for example:
logging to your own paths/tables;
reading metadata from flex fields (metadata such as default values for some columns, the base table name used for temporary tables, the type of load: full/incremental, etc.);
performing transformations/staging steps that differ from the standard KMs;
customizing your CKM: creating an error table where you can see the rows in error, creating a table of correct results, and so on;
saving the temporary tables according to your own naming standard, and so on.
The benefit of writing your own KMs is that the limit is your imagination (or almost). You can do plenty of things. The standard KMs are very good, but there are moments when you reach their limits, and from there on you should create your own.
Hope that this helps you.


How can I load a large amount of data into an Oracle database from .csv files without risking dropped or mismatched data?

I’m in the middle of trying to migrate a large amount of data into an Oracle database from existing Excel files.
Due to the large number of rows loaded each time (10 000 and more), it is not possible to use SQL Developer for this task.
In every worksheet there’s data that needs to go into different tables, while at the same time keeping the relations and not dropping any data.
For now, I use one .CSV file per table and map them together afterwards. This, though, comes with a great risk of adding the wrong FK and thereby screwing up the whole thing. And I don’t have the time, energy or will for clean-ups, even if it is my own mess…
My initial thought was to do a bulk transfer with SQL*Loader, using some kind of PL/SQL script, maybe in a .ctl file (the one used for mapping the properties), but it seems like I’m quite far out in the bush with that one… (or am I…?)
The other thought was to create a simple program in C#, use FastMember, and load the database that way. (But that means I need to take the time to actually write the program, however small it is.)
I can’t possibly be the only one who has had this issue, but using my not-too-elevated ninja Googling skills ends up with either SQL Developer (which is not an alternative) or the bulk copy approach with SQL*Loader (where I need to map it all together afterwards).
Are there any alternative solutions to my problem, or are the solutions above the ones I need to cope with?
Did you consider using CSV files as external tables? As they act as if they were ordinary Oracle tables, you can write (PL/)SQL against them, inserting data into different tables in the target schema. That might give you some more freedom & control over what you are doing.
Behind the scenes, it is still SQL*Loader.
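The idea can be sketched outside Oracle as well. The following is a minimal illustration, not an Oracle external table: it uses Python's built-in sqlite3 as a stand-in, loads the raw CSV rows into a staging table, and then lets SQL resolve the foreign keys by joining on a natural key instead of mapping them by hand. The table names, columns, and sample data are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of one worksheet: each row mixes parent and child data.
CSV_DATA = """customer_name,order_ref,amount
Acme,A-1,100
Acme,A-2,250
Globex,G-1,75
"""

def load(conn, csv_text):
    """Stage the raw rows, then fill parent and child tables with plain SQL."""
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE staging (customer_name TEXT, order_ref TEXT, amount REAL);
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            order_ref TEXT UNIQUE NOT NULL,
            amount REAL
        );
    """)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cur.executemany(
        "INSERT INTO staging VALUES (:customer_name, :order_ref, :amount)", rows)
    # Parents first: one row per distinct natural key.
    cur.execute(
        "INSERT INTO customer (name) SELECT DISTINCT customer_name FROM staging")
    # Children next: the FK is looked up by a join, never typed in by hand.
    cur.execute("""
        INSERT INTO orders (customer_id, order_ref, amount)
        SELECT c.id, s.order_ref, s.amount
        FROM staging s JOIN customer c ON c.name = s.customer_name
    """)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
staged = load(conn, CSV_DATA)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
# Comparing the two counts is a cheap check that no row was dropped.
```

With an Oracle external table, the staging step would be replaced by the external table definition over the CSV file, and the two INSERT ... SELECT statements would stay essentially the same.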

Multidimensional data types

So I was thinking... Imagine you have to write a program that would represent a schedule of a whole college.
That schedule has several dimensions (e.g.):
time
location
individual(s) attending it
lecturer(s)
subject
You would have to be able to display the schedule from several standpoints:
everything held in one location in a certain timeframe
everything attended by an individual in a certain timeframe
everything taught by a certain lecturer in a certain timeframe
etc.
How would you save such data, and yet keep the ability to view it from different angles?
The only way I could think of was to save it in every form you might need it:
E.g. you have a folder "students" in which each student has a file that says when, why, and where he has to be. However, you also have a folder "locations" in which each location has a file that says who has to be there, why, and when. The more angles you have, the more storage you need per piece of information.
But that seems highly inefficient, space-wise.
Is there any other way?
My knowledge of Javascript is 0, but I wonder if such things would be possible with it, even in this space inefficient form.
If not that, I wonder if it would work in any other standard (C++, C#, Java, etc.) language, primarily in Java...
EDIT: Could this be done by using MySQL database?
Basically, you are trying to first store data and then present it under different views.
SQL databases were made exactly for that: on one side you define a schema and instantiate it in a database to store your data (the language for this is called Data Definition Language, DDL); on the other you query it with the query language (SQL) to produce what you call "views". SQL databases even have "view" objects, so that these views can be defined inside the database (rather than having to put the query code in the application code).
MySQL can certainly do that. Note that it is also possible to compile an SQL engine for JavaScript (SQLite, for example) and use local web storage to store the data.
There is another aspect to your question: optimization of the queries. While SQL can do most of the query work for your views, it is sometimes preferable to store actual copies of the query results in so-called "datamarts" (this is called de-normalizing), so that the hard work of selecting rows or computing aggregate/group functions is done once per period of time (imagine that a specific view changes only on Monday), and requesters then just have to read these results. In this case it is important to separate, at least semantically, primary data from secondary data (and for performance/user rights reasons, physical separation is often a good idea).
Note that since you mentioned MySQL, I wrote about SQL, but almost any database technology (hierarchical, object-oriented, XML...) could do what you are looking for, as long as the particular implementation you use is flexible enough for your data and queries.
So in short:
I would use a SQL database to store the data
make appropriate views / requests
if I need huge request performance, make appropriate de-normalized data available
the language is not important there, any will do
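As a rough sketch of the points above, assuming a very small, hypothetical schema (one table of class sessions, one attendance table, and one view), the same stored data can be read from each angle without duplicating it. Python's built-in sqlite3 is used here only to keep the example self-contained; the same SQL idea applies to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each fact is stored once; the table and column names are illustrative.
    CREATE TABLE class_session (
        id INTEGER PRIMARY KEY,
        subject TEXT, lecturer TEXT, location TEXT,
        starts_at TEXT, ends_at TEXT
    );
    CREATE TABLE attendance (
        session_id INTEGER REFERENCES class_session(id),
        student TEXT,
        PRIMARY KEY (session_id, student)
    );
    INSERT INTO class_session VALUES
        (1, 'Algebra', 'Dr. Hill', 'Room 101', '2024-03-04 09:00', '2024-03-04 10:00'),
        (2, 'Physics', 'Dr. Kent', 'Room 101', '2024-03-04 11:00', '2024-03-04 12:00'),
        (3, 'Algebra', 'Dr. Hill', 'Room 202', '2024-03-05 09:00', '2024-03-05 10:00');
    INSERT INTO attendance VALUES (1, 'alice'), (2, 'alice'), (3, 'bob');
    -- A view object: the "student angle" defined inside the database.
    CREATE VIEW student_schedule AS
        SELECT a.student, s.subject, s.lecturer, s.location, s.starts_at, s.ends_at
        FROM attendance a JOIN class_session s ON s.id = a.session_id;
""")

def sessions_where(column, value, start, end):
    """Sessions for one location or lecturer within [start, end)."""
    # The column name comes from a fixed whitelist, never from user input.
    if column not in ("location", "lecturer"):
        raise ValueError(column)
    sql = (f"SELECT subject, starts_at FROM class_session "
           f"WHERE {column} = ? AND starts_at >= ? AND starts_at < ? "
           f"ORDER BY starts_at")
    return conn.execute(sql, (value, start, end)).fetchall()

def student_sessions(student, start, end):
    """Sessions attended by one student within [start, end), read via the view."""
    return conn.execute(
        "SELECT subject, location, starts_at FROM student_schedule "
        "WHERE student = ? AND starts_at >= ? AND starts_at < ? "
        "ORDER BY starts_at",
        (student, start, end)).fetchall()

room_101 = sessions_where("location", "Room 101", "2024-03-04", "2024-03-05")
hill = sessions_where("lecturer", "Dr. Hill", "2024-03-01", "2024-04-01")
alice = student_sessions("alice", "2024-03-04", "2024-03-05")
```

Adding another angle means adding another query or view, not another copy of the data.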

Database design: Same table structure but different table

My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand it if table 2 stores archival, seldom-used data.
But I can't understand it if table 2 stores live data that can change constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seem to be several possible explanations for why you would want "customer_temp".
As you noted, it could be for archival purposes, to allow analyzing the data. In that case, however, the historical data should be aggregated according to some interesting query, and using the table for live data does not sound plausible.
As Oded noted, there could be business logic that differentiates between a customer and a potential customer.
Or it could be a security feature that requires logging all attempts to register a customer, in addition to storing approved customers.
Any time I see a permanent table named "customer_temp" I see a red flag. This typically means that someone was working through a problem as they went along and didn't think ahead about it.
As for the structure you describe there are some advantages. For example the tables could be indexed differently or placed on different File locations for performance.
But typically these advantages aren't worth the cost of keeping the structures in sync when changes are made (adding a column to two tables, searching for two sets of dependencies, etc.).
If you really need them to be treated differently, then it's better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
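A minimal sketch of that view-based alternative, assuming a single hypothetical table with a `verified` flag and two views on top of it (all names are illustrative, and sqlite3 is used only to keep the example runnable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One physical table; the flag replaces the move between two tables.
    CREATE TABLE customer_all (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        verified INTEGER NOT NULL DEFAULT 0
    );
    -- Views give each kind of customer its own name, so callers
    -- do not have to remember to filter on the flag.
    CREATE VIEW customer AS
        SELECT id, email FROM customer_all WHERE verified = 1;
    CREATE VIEW customer_pending AS
        SELECT id, email FROM customer_all WHERE verified = 0;
""")

def register(email):
    """Store a new, not yet verified customer and return its id."""
    cur = conn.execute("INSERT INTO customer_all (email) VALUES (?)", (email,))
    return cur.lastrowid

def verify(customer_id):
    """Verification is a one-column update instead of a row move."""
    conn.execute("UPDATE customer_all SET verified = 1 WHERE id = ?", (customer_id,))

a = register("a@example.com")
b = register("b@example.com")
verify(a)
verified = [row[0] for row in conn.execute("SELECT email FROM customer")]
pending = [row[0] for row in conn.execute("SELECT email FROM customer_pending")]
```

A schema change such as a new column then only touches one table, and the views can be adjusted if the two kinds of customer ever need to expose different columns.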
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

Building Oracle DB; Good Directory Layout

I'm looking for advice on how best to organize a new Oracle schema and its dependent files in my project directory, with the sequences, triggers, DDL, etc. I've been using one monolithic file called schema.sql for some time, but I'm wondering if there's a best practice. Something like...
database/
    tables/
        person.sql
        group.sql
    sequences/
        person.sequence
        group.sequence
    triggers/
        new_person.trigger
Penny for your thoughts or a URL that I may have missed!
Thank you!
Storing DDL by object type is a reasonable approach-- anything is likely to be easier to navigate than a monolithic SQL script. Personally, though, I'd much rather have DDL organized by function. If you're building an accounting system, for example, you probably have a series of objects to manage accounts payable and a separate set of objects to manage accounts receivable along with some core objects for managing the general ledger accounts. That would lead to something along the lines of
database/
    general_ledger/
        tables/
        packages/
        sequences/
    accounts_receivable/
        tables/
        packages/
        sequences/
    accounts_payable/
        tables/
        packages/
        sequences/
As the system gets more complex, that hierarchy would naturally get deeper over time. This sort of approach would more naturally mirror the way non-database code is stored in source control. You wouldn't have a single directory of Java classes in a directory structure like
middle_tier/
    java/
        Foo.java
        Bar.java
You would organize the classes that implement the same sorts of business logic together and separate from the classes that implement different bits of business logic.
One item to consider is those SQLs which can act as 'latest only' scripts. These include CREATE OR REPLACE PROCEDURE/FUNCTION/TRIGGER etc. You run the latest version and you are not worried about what may have previously existed in the database.
On the other hand you have tables where you may start off with a CREATE TABLE followed by several ALTER TABLEs as changes to the schema evolve. And if you are doing an upgrade you may want to apply several of the ALTER TABLE scripts (preferably in order).
I'd argue against a 'functional grouping' unless it is really obvious where the lines are drawn. You probably don't want to be in a position where you have a USERS table in one group and a USER_AUTHORITIES in another and an AUTHORITY group in a third.
If you do have decent separation, then they are probably in separate schemas and you do want to keep schemas distinct (since you can have the same object names in different schemas).
The division-by-object-type arrangement, with the addition of a "schema" directory below the database directory works well for me.
I've worked with source control systems that have the additional division-by-function layer - if there are many objects it adds additional searching if you're trying to cross-reference the source control file with the object that you see in a database GUI navigator that generally groups objects by type. It's also not always clear how an object should be classified this way.
Consider adding a "grants" directory for the grants made by that schema to other schemas or roles, with one file per grantee. If you have "rule-based" grants such as "the APPLICATION_USER role always gets SELECT on all of schema X's tables", then write a PL/SQL anonymous block to perform this action. (You might be tempted to reverse-engineer the grants after they get put in place by some ad-hoc method, but it's easy to miss something when new tables or views are added to the application).
Standardize on a delimiter for all scripts and you'll make your life easier if you start deploying through a build utility such as Ant. Using "/" (vs. ";") works for both SQL statements and PL/SQL anonymous blocks.
In our projects we use a somewhat combined approach: we have the core of our program in the root and the other functionality in subfolders:
root/
    plugins/
        auth/
        mail/
        report/
        etc.
In all these folders we have both DDL and DML scripts, and almost all of them can be run more than once: e.g. all packages are defined as create or replace..., all data insertion scripts check whether the data already exists, and so on. This lets us run almost all scripts without worrying that we might break something.
Obviously this approach can't be applied to create table and similar statements. For these scripts we have hand-written a small bash script that extracts the specified files and runs them without failing on particular ORA errors, such as ORA-00955: name is already used by an existing object.
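The core of such a wrapper is deciding which ORA errors in a script's output are harmless on a re-run. The sketch below is in Python rather than bash and is not our actual script: the set of tolerated codes and the sample output are assumptions made for the illustration.

```python
import re

# Codes tolerated on a re-run (an assumption; extend to suit your own scripts).
IGNORABLE = {"ORA-00955"}  # name is already used by an existing object

ORA_PATTERN = re.compile(r"\bORA-\d{5}\b")

def fatal_errors(output, ignorable=IGNORABLE):
    """Return the ORA codes found in the captured output that are not tolerated."""
    return [code for code in ORA_PATTERN.findall(output) if code not in ignorable]

# Hypothetical captured output of two script runs.
rerun_output = """CREATE TABLE cl_oper (id NUMBER)
ERROR at line 1:
ORA-00955: name is already used by an existing object
"""
broken_output = """ALTER TABLE cl_oper ADD (note VARCHAR2(100))
ERROR at line 1:
ORA-00942: table or view does not exist
"""

rerun_fatal = fatal_errors(rerun_output)
broken_fatal = fatal_errors(broken_output)
```

A deployment wrapper would stop (or exit non-zero) only when the returned list is not empty.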
Also, all files are mixed together in the directories but differ by extension: .seq for a sequence, .tbl for a table, .pkg for a package interface, .bdy for a package body, .trg for a trigger, and so on...
Also we have a naming convention denoting prefixes for all of our files: we can have cl_oper.tbl table with cl_oper.seq and cl_oper.trg sequence and triggers and cl_oper_processing.pkg together with cl_oper_processing.bdy with logic for mentioned objects. With this naming convention in file managers it's very easy to see all the files connected with some unit of logic for our project (whilst the grouping in directories by object types does not provide this).
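One side benefit of type-specific extensions is that a script can derive a run order from them. The following is only a sketch: the order chosen here (sequences, tables, package interfaces, package bodies, triggers) is an assumption made for the example, not something prescribed above.

```python
from pathlib import PurePosixPath

# Assumed run order by extension; files with other extensions are skipped.
RUN_ORDER = [".seq", ".tbl", ".pkg", ".bdy", ".trg"]

def deployment_order(filenames):
    """Sort known script files by object type first, then by name."""
    known = [f for f in filenames if PurePosixPath(f).suffix in RUN_ORDER]
    return sorted(
        known,
        key=lambda f: (RUN_ORDER.index(PurePosixPath(f).suffix), f))

files = [
    "cl_oper.trg",
    "cl_oper_processing.bdy",
    "cl_oper.tbl",
    "cl_oper_processing.pkg",
    "cl_oper.seq",
    "README.txt",
]
ordered = deployment_order(files)
```

Because the second sort key is the file name, files sharing a prefix such as cl_oper stay next to each other within each object type.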
Hope this information helps you somehow. Please leave comments if you have any questions.

Separating Demo data in Live system

If we put aside the rights and wrongs of putting demo data into a live system for a minute (that's a whole separate discussion!), we are being asked to store some demo data in our live system so that it can be credibly demonstrated without the appearance of smoke + mirrors (we want to use the same login page for example)
Since I'm sure this is a challenge many other people must have, I'd be interested to know what approaches people have devised for separating this data so that it doesn't get in the way of day-to-day operations on their systems.
As I alluded to above, I'm aware that this probably isn't best practice. :-)
Can you instead segregate the data into a new database and just redirect your connection strings (they're not hard-coded, right? right?) to point to the demo database? This way, live data isn't tainted, and your code looks identical. We actually run a three-tier deployment system this way: we do local development, deploy to QC environments that receive snapshots of the live data every few months, and then deploy to live when testing is complete.
FWIW, we're looking at using Oracle's row-level security / virtual private database feature to separate the demo data from the rest.
I've often seen it on certain types of live systems.
For example, point of sale systems in a supermarket: cashiers are trained on the production point of sale terminals.
The key is to carefully identify the test or training data. I wouldn't say that there's any explicit best practice for how to model this in a database; it's going to be application-specific.
You really have to carefully define the scope of what is covered by the test/training scenarios. For example, you don't want the training/test transactions to appear in production reports (but you may want to be able to create reports with this data for training/test purposes).
Completely disagree with Joe. Oracle has a tool to do this regardless of implementation. Before I read your answer I was going to say VPD... But that could have an impact on Production.
Remember that every table in a query changes from
SELECT * FROM tableA
to
SELECT * FROM (SELECT * FROM tableA WHERE Data_quality = 'PROD' <or however you do it>)
Every table with a policy, that is...
So assuming your test data has to span EVERY table, every table will have to have a policy, and every table will be filtered before the SQL can begin working.
You can even hide that column from the users. You'll need to write the policy with some deftness if you do. You'll have to create that value based on how the data is inserted and expose the column to certain admin accounts for maintenance.
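The rewrite described above can be imitated in a few lines to make its effect visible. This is only an emulation under stated assumptions: it does not use Oracle's VPD feature or any Oracle API, the policy dictionary and helper function are hypothetical, and sqlite3 is used just to run the resulting query.

```python
import sqlite3

# Hypothetical policies: table name -> predicate added to every read of that table.
POLICIES = {"tableA": "Data_quality = 'PROD'"}

def apply_policy(table):
    """Return the table name, wrapped in a filtering subquery if it has a policy."""
    predicate = POLICIES.get(table)
    if predicate is None:
        return table
    return f"(SELECT * FROM {table} WHERE {predicate})"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, Data_quality TEXT);
    INSERT INTO tableA VALUES (1, 'PROD'), (2, 'DEMO'), (3, 'PROD');
    CREATE TABLE tableB (id INTEGER);
    INSERT INTO tableB VALUES (10);
""")

# The caller writes an ordinary query; the policy decides what is really read.
prod_rows = conn.execute(
    f"SELECT id FROM {apply_policy('tableA')} ORDER BY id").fetchall()
unfiltered = conn.execute(f"SELECT id FROM {apply_policy('tableB')}").fetchall()
```

The sketch also shows the cost mentioned above: every table that can hold demo rows needs its own entry, and each read of such a table carries the extra filter.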
