Comparing delimited files for database update - go

Any suggestions on packages (or methodologies) that might help with this? I need to take a ~40MB file we receive weekly and determine what changed from the previous to the current file. Whatever those changes are, then need to be made to a single simple database table. In a previous life I've accomplished similar via Linux "diff" with -Hae parameters, resulting in an "ed script". The contents were then handled by a PERL program, using Tie::File to reference the changed record in the previous file. In an effort to strengthen my Go skills I'm trying to utilize it for this current task. https://github.com/sergi/go-diff looks like it might be the ticket, but I'm not sure "patch" output will quite do what I need (easily).
Fixed width and/or delimited text files are still commonly used, does anyone have any samples or pointers or suggestions on packages that might help in dealing with them in this way?

Are you sure you need the intermediate step? 40 MB is not very much, and your database engine is specialized in handling data like that..
For instance with postgresql just load the new data into a temporary table:
create table temptable (
a varchar,
b varchar,
c varchar
);
copy temptable from '/path/to/csv/newdata.txt' delimiter ',' csv;
Then you can use simple SQL query to get the lines that do not have exact match in the old data, for example:
select *
from temptable t
where not exists (
select 1
from oldtable o
where t.a=o.a and t.b=o.b and t.c=o.c
)
If you did not save the data from previous week's batch, then just remember to copy it to an other table for storing. Now the real question is what you want to do with the information, but you should be able to handle most scenarios.

Related

Is there any way to handle source flat file with dynamic structure in informatica power center?

I have to load the flat file using informatica power center whose structure is not static. Number of columns will get changed in the future run.
Here is the source file:
In the sample file I have 4 columns right now but in the future I can get only 3 columns or I may get another set of new columns as well. I can't go and change the code every time in the production , I have to use the same code by handling this situation.
Expected result set is:-
Is there any way to handle this scenario? PLSQL and unix will also work here.
I can see two ways to do it. Only requirement is - source should decide a future structure and stick to it. Come tomorrow if someone decides to change the structure, data type, length, mapping will not work properly.
Solutions -
create extra columns in source towards the end. If you have 5 columns now, extra columns after 5th column will be pulled in as blank. Create as many as you want but pls note, you need to transform them as per future structure and load into proper place in target.
This is similar to above solution but in this case, read the line as single column in source+ source qualifier as large string of length 40000.
Then split the columns as per delimiter in Informatica expression transformation. This splitting can be done by following below thread. This can be also tricky if you have 100s of columns.
Split Flat File String into multiple columns in Informatica

Check all table columns for a value

Ok, tricky question I am trying to figure out where a database schema is storing a particular pointer. I know the pointer value I just don't what table it is in or what column. I know the pointer is 123123123. How do I check all table columns to see if any of them have that value?
Thanks.
In h2 you can use fulltext search, but then you would need to add all tables in the search scope and indexing.
If you need to index only primary keys, then it might be better but you still need to come up with individual FT_CREATE_INDEX() calls for each table. You can automate this with several languages or with ETLs (like scriptella).
If you've enough disk space, you could dump a SQL from your db and use a viewer for big files like glogg.
The advantage of the first solution is no external tools but you need to work out a specific indexing script for SQL for any existing or new table. The 2nd solution is a 1 time fix.
I use SQL Search from RedGate. It's free and it helps you find any text anywhere in the database.
https://www.red-gate.com/products/?gclid=CjwKEAjwiYG9BRCkgK-G45S323oSJABnykKAE7IH_EMhnmq7OdLdXljfIkdGZrDD6OnOrT4VB0agahoCVn3w_wcB

ETL for processing history records

I am in sort of a DWH project (not quite, but still). And there is this issue we constantly run into which I was wondering if there would be a better solution. Follows
We receive some big files with records containing the all states a user have been into, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
...
And we are usually inly interested in the latest state of each user. So far so good, with just a little sorting and we could get the way we want it. Only problem is, these files are usually big.. like 20-60gb, sorting these guys sometimes is a pain since the logic for sorting isn't usually so straight forward.
What we do generally is load everything into our Oracle and use intermediary tables and materialized views to have it done. Still, sometimes performance bites us.
20-60gb might be big, but not that big. I mean, should be a somewhat more specialised way to deal with these records, shouldn't it?
I imagine two basic ways of seeing tackling the issue:
1) Programming outside the DBMS, scripts and compiled things. But maybe this is not very flexible unless some bigger amount of time is invested developing something. Also, I might have to busy myself administrating the box resources, whereas I wish not to worry with that.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provide to sort and clip the data. This would be my case, though, I am not sure we are using all the tools or simply doing it the right way that would be for Oracle 10g.
Question is then:
You have a 60gb file with millions of historical records like the one above and your user want a table in DB with the last state for each user.
how would you guys do?
thanks!
There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather then tablespaces. And they can be parallel enabled (providing your file meet certain criteria). Find out more.
So, you might have an external table like this
CREATE TABLE user_status_external (
uid NUMBER(6),
status VARCHAR2(10),
sdate DATE
ORGANIZATION EXTERNAL
(TYPE oracle_loader
DEFAULT DIRECTORY data_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY newline
BADFILE 'usrsts.bad'
DISCARDFILE 'usrsts.dis'
LOGFILE 'usrsts.log'
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(
uid INTEGER EXTERNAL(6),
status CHAR(10),
sdate date 'yyyymmdd' )
)
LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;
Note that you need read and write permissions on the DATA_DIR directory object.
Having created the external table you can load the only desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
, sq.status
, sq.sdate
from (
select /*+ parallel (et,4) */
et.uid
, et.status
, et.sdate
, row_number() over (partition by et.uid order by et.sdate desc) rn
from user_status_external et
) sq
where sq.rn = 1
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.
I'd go with something along the lines of what APC said as a first go. However, I think parallel tables can only load data in parallel if the data is in multiple files, so you might have to cut the files into several. How are the files generated? A 20 - 60GB file is a real pain to deal with - can you get the generation of the files changed so you get X 2GB files for example?
After getting all the records into the database, you might run into problems attempting to sort 60GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I helped large sorts by hash partitioning the data on one of the fields to be sorted, in this case user_id. Then Oracle only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash partition table - at this stage make sure you do insert /*+ append nologging */ to avoid undo generation and redo generation. If your database has force_logging set to true, the nologging hint will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well half the size of your data on disk if it has repeating fields - the disk IO saved when writing it usually more than beats any extra CPU consumed when loading it.
I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
select uid,state,date from
(
select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
from my_table t;
)
where rnum = 1;
Then refresh fully when you need to.
Edit: Any don't forget to rebuild stats and probably throw a unique index on uid.
I would write a program to iterate over each record and retain only those which are more recent than record previously seen. At the end, insert the data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
if savedrow is null
save row
else
if row is more desirable than savedrow
save row
end
end
end
send saved rows to database
The point it, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than the last row we saved. At the end, you'd have a list of rows, one-per-user, each of which has the most recent date you saw.
You could general the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)

How to get the Oracle external table "dump file" without doing "CREAT TABLE"

I have a to develop a PL/SQL procedure that dumps the content of a table when an error occurs during an application transaction, the content of the dump must match the content of the table before the ROLLBACK of the transaction.
I thought about using external table as the dump format of the table (TYPE ORACLE_DATAPUMP). After going through the Oracle documentation, I found that the only way to that is by executing:
CREATE TABLE tabtest_test (
F1 NUMBER,
F2 CHAR(10))
ORGANIZATION EXTERNAL (
TYPE ORACLE_DATAPUMP
DEFAULT DIRECTORY USER_DUMP_DEST
LOCATION ('tabtest.dmp’));
The problem is that by executing the “CREATE TABLE”, Oracle performs an implicit commit within our failed transaction which needs to be rolled back after the dump of the table.
I thought about using the “PRAGMA AUTONOMOUS_TRANSACTION;” to execute the “CREATE TABLE”, but it doesn’t really fit our need as it dumps the content of the table outside our application transaction.
My question: is there a way to get the 'tabtest.dmp’ without doing a “CREATE TABLE” ? for example by accessing directly the Oracle API responsible for this.
Regards.
How about creating the external table once, as part of your application setup process?
Failing that, you could create it at the beginning of the transaction that might need it. If there is an error, populate it; if the transaction finishes successfully, drop it.
If (and it's a big IF) you can use AUTONOMOUS_TRANSACTIONS to create the table in a separate transaction, I think this is what you need to do. If you manage to create the table within the scope of your current transaction, and write your data to that newly-created table, that data should, by all rights, disappear as soon as you do your ROLLBACK.
The problems you're experiencing here are a subset of the large class of issues known as "Problems Which Occur When Trying To Treat A Relational Database As A Flat File". Relational databases are great when used AS DATABASES, but are really bad at being flat files. It's kind of like animals on the farm - sheep are great AS SHEEP, but make lousy cows. Cows make lousy goats. Goats - great animals - intelligent, affectionate (yep), low-maintenance, won't hear a word spoken against 'em - but NOT what you want in a draft animal - use a horse, ox, or mule for that. Basically, you should pick horses for courses (pardon the expression). A database makes a crappy flat file, and vice versa. Use what's appropriate.
IMO you'd be better off writing your data to a flat file, and perhaps this file could be mapped in as an external table. You might want to write the file in something like CSV format that lots of other tools can process. YMMV.
Share and enjoy.
Why do you need to use external tables? You could just read the file using UTL_FILE.

Oracle Data Versioning/Partitioning Strategies/Best Practices

not sure if the subject entirely conveys what I'm trying to achieve, but let me explain:
We are building an application that uses Oracle as storage backend. Each year, last years dataset will be "Archived", and a new instance created and populated from scratch.
What are the options to do this within the same schema?
Keep version information on a record level (we presume this will be too slow for our use-case).
Keep version information on a table level, so for each new version, we will re-create all the tables but with a new version prefix. (We like this solution, since we can do it all in code).
?
Is there not something like partitions/personalities/namespaces available that will allow us to achieve this in Oracle?
My oracle experience is rather limited, any assistance will be greatly appreciated!
The RDBMS conceptual model is not very good at maintaining temporal versions of data. So it is not just Oracle which is lacking in this regard.
I am unclear why you think keeping version information at the record level will be too slow. Too slow in creating a new version? Or too slow where it comes to data retrieval during regular operations?
Here is how you could do it. Given a table CUSTOMERS with a business key of CUSTOMER_REF I might normally build it like this (I am using abbreviated syntax rather than best practice for reasons of space):
create table customers
( id number not null primary key
, customer_ref number not null unique key
, name varchar2(30) not null )
/
The versioned equivalent would look like this:
create table customers
( id number not null primary key
, customer_ref number not null
, version_number number
, name varchar2(30) not null
, constraint whatever unique (customer_ref, version_number) )
/
This works by keeping the current version of VERSION_NUMBER null, and only populating it at archival time. Any lookup is going to have to include and version_number is null. This will be a bit of a pain and you may need to include the column in any additional indexes you build.
Obviously maintaining all versions of the records in the same table will increase the size of your tables, which might have an effect on performance. Oracle's Partitioning option can definitely help here. It also would give you a neat way of creating next year's set of data. However, it is a chargeable extra on top of the Enterprise License, so it is an expensive option. Find out more..
The most time consuming aspect of this will be managing foreign key relationships in the new version of the table. Presuming you choose to use synthetic primary keys, the archival process will have to generate new IDs and then painstakingly cascade them to their dependent records in the new versions of referencing foreign keys.
Thinking about this makes discreet tables for each version seem very attractive. For ease of use I would keep the current version un-prefixed, so that archiving becomes a process simply of
create table customers_n as select * from customers;
You might want to avoid downtime while creating the versioned tables. In that case you could use materialized views to capture the tables' state during the run-up to the archival switchover. When the clock strikes twelve you can switch off the refresh. (caveat: this is thinking on the fly, I have never done anything like this so try before you buy.)
One pertinent advantage of multiple tables (and Partitioning) is that you can move the archived records to a READ ONLY tablespace. This not only preserves them from unwanted change, it also means you can exclude them from subsequent backups.
edit
I notice you have commented that the archived data can occasionbally be amended. In taht case moving it to READ ONLY tablespaces is not a go-er.
The only thing I wil add to what APC said is regarding your asking for "namespaces".
A namespace in Oracle is a schema, whereby you can have the same object name(s) in each schema.
Of course this all depends on how your app must access multiple versions, but I would lean towards a different schema for each year before I would use some sort of naming convention to maintain versions of tables in the same schema. The reason is, eventually you will have a nightmares. At least with different schemas, all DDL can be the same, all references to objects will be the same, and tools like ER modellers and query tools will work within the context of that schema. Data models change, so at some point you may need to run some compare tools, and if all your tables are named funky with some sort of version postfix, that won't work well.
Add a schema can be copied / moved with export or data pump quickly using the fromuser/touser or remap_schema options, so you won't need much code, except to do any cleanup of last years data out of the new version.
I find schemas are very useful as "containers" and most apps I host only have schema level privileges, so I'm guaranteed the app can be easily and quickly moved from instance to instance, or multiple copies of the app can be hosted side-by-side on the same instance.
Might the schema change between years. For example, in 2010 you have fifteen columns but in 2011 you add a sixteenth.
If so, will the same application work on both 2010 and 2011 data.
If the schema is static, I'd go for table with a 'YEAR' column and use VPD/RLS/FGAC to apply a YEAR = '2010' predicate.
I'd only worry about partitioning if performance was a problem.
1) Interval partition it by year and some date field in the row.
2) Add it at the end of each table and populate it with a sequence and trigger.
3) Then partition by interval year on this col.

Resources