Oracle SQL*Loader, referencing calculated values - performance

Hope you're having a nice day. I'm learning how to use functions with SQL*Loader and I have a question about it. Let's say I have this table:
table a
--------------
code
name
dept
birthdate
secret
The data.csv file contains this data:
name
dept
birthdate
and I'm using this control file to load the data with SQL*Loader:
LOAD DATA
INFILE "data.csv"
APPEND INTO TABLE a
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  code      "getCode(:name, :dept)",
  name,
  dept,
  birthdate,
  secret    "getSecret(getCode(:name, :dept), :birthdate)"
)
This works like a charm: it gets the values from my getCode and getSecret functions. However, I would like to reference the value already calculated by getCode so I don't have to nest function calls in getSecret, like this:
getSecret(**getCode(:name, :dept)**, :birthdate)
I've tried to do it like this:
getSecret(**:code**, :birthdate)
but it gets the original value from the file (i.e. NULL) rather than the value calculated by the function (I guess because the expressions are evaluated on the fly). So my question is whether there is a way to avoid these nested calls for previously calculated values, so I don't lose performance recalculating the same values over and over again. (The real table I'm loading is about 10 times bigger and nests a lot of functions for these previously calculated values, so I suspect that is hurting performance.)
Any help would be appreciated. Thanks!
Follow-up
Sorry, but I haven't used external tables before (I'm kind of new here). How could I implement this using external tables, considering all the calculated values I need to get from the functions I developed? (I tried a trigger, see "SQL Loader, Trigger saturation?", and it killed the database.)

I'm not aware of a way of doing this.
If you switched to using external tables you'd have a lot more freedom for this sort of thing -- common table expressions, leveraging subquery caching, that sort of stuff.
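For example, here is a minimal sketch of that approach. It assumes a directory object (called data_dir here) pointing at the folder that holds data.csv; the external table name, date mask and column sizes are my guesses, while the target table and functions mirror your question. The scalar subquery around getCode gives you scalar subquery caching, so repeated (name, dept) pairs don't re-invoke the function:
CREATE TABLE a_ext (
  name      VARCHAR2(100),
  dept      VARCHAR2(100),
  birthdate VARCHAR2(20)          -- kept as text here; converted on the way in
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir      -- assumed directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('data.csv')
)
REJECT LIMIT UNLIMITED;

INSERT INTO a (code, name, dept, birthdate, secret)
SELECT t.code,
       t.name,
       t.dept,
       t.birthdate,
       getSecret(t.code, t.birthdate)
FROM (
       SELECT (SELECT getCode(e.name, e.dept) FROM dual) AS code,   -- scalar subquery: cached per distinct (name, dept)
              e.name,
              e.dept,
              TO_DATE(e.birthdate, 'YYYY-MM-DD') AS birthdate       -- adjust the mask to your file's date format
       FROM   a_ext e
     ) t;
Because the whole load is now a single INSERT ... SELECT, you also get normal SQL execution plans, parallel DML if you want it, and only one call site per function per row.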

Related

Oracle - build dimension from a file-based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file populated via a Google Form, which means I don't have any sort of reference from a source system such as auto-incrementing keys/IDs. What would be the best approach to building a star schema under this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements),...,...,...
If I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
My proposed solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (an underscore would be the "glue" to concatenate the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each one of them. Then, when loading into the dimension, I could compare which keys do not exist yet in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them; I had thought of having another file mapping which branches are active, with an expiration date column. Basically I'm trying to simulate what I could do if the data were in a database instead of CSV files.
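Something like the following is what I have in mind for the compare-and-insert step (branch_stg, branch_dim and the column names are just placeholders for illustration):
MERGE INTO branch_dim d
USING (
  SELECT DISTINCT
         region || '_' || country || '_' || branch || '_' || branch_location AS branch_key,
         region, country, branch, branch_location
  FROM   branch_stg
) s
ON (d.branch_key = s.branch_key)
WHEN NOT MATCHED THEN
  INSERT (branch_key, region, country, branch, branch_location)
  VALUES (s.branch_key, s.region, s.country, s.branch, s.branch_location);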
This is all I can think of so far; do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key          | timeUUID:data | timeUUID:country | timeUUID:source |
========================================================================
timeUUID:section | JSON Object   | USA              | example.com     |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma: the columns. A compound column name with timeUUID lets me sort and do a time-based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
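Putting those tweaks together, a rough CQL sketch (the table and column names are invented here, so adjust to your model) could look like:
CREATE TABLE section_visits (
    section text,
    day     text,            -- e.g. '20130115'; simpler than a timeuuid for the partition
    event   timeuuid,
    payload text,            -- the raw JSON object
    country text,
    source  text,
    PRIMARY KEY ((section, day), event)
) WITH CLUSTERING ORDER BY (event DESC);

-- one section across several days
SELECT * FROM section_visits
WHERE section = 'foo' AND day IN ('20130114', '20130115');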

Querying Large Datasets in Cassandra

I am by experience an RDBMS programmer. I am working on a scientific research problem involving genomic data. I was assigned to explore Cassandra since we needed a Big Data, scalable and cheap (free) solution. Setting Cassandra up and loading it with data was seductively trivial and similar to my experience with traditional DBs like Oracle and MySQL. My problem is finding a simple strategy to query data, since this is a fundamental requirement for all data repositories. The data I am working with is mutation datasets which contain positional information as well as calculated numerical measures regarding the data. I set up an initial static column family that looks like this:
CREATE TABLE variant (
  chrom text,
  pos int,
  ref text,
  alt text,
  aa text,
  ac int,
  af float,
  afr_af text,
  amr_af text,
  an int,
  asn_af text,
  avgpost text,
  erate text,
  eur_af text,
  ldaf text,
  mutation_id text,
  patient_id int,
  rsq text,
  snpsource text,
  theta text,
  vt text,
  PRIMARY KEY (chrom, pos, ref, alt)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
CREATE INDEX af_variant_idx ON variant (af);
As you can see, there is a natural primary key of positional data (chrom, pos, ref and alt). This data is not meaningful from a querying point of view. Much more interesting to my clients currently is to extract data with an 'AF' value below a certain threshold. I am using Java RESTful services to interact with this database using the CQL JDBC driver. It quickly became apparent that directly querying this table by AF would not work, since it seems the SELECT statement must identify the row keys you want to look at. I found some confusing discussions on this point, but since there are fewer than 100 distinct values of AF, what I decided to do was build a lookup table that looks like this:
CREATE TABLE af_lookup (
  af_id float,
  column1 text,
  column2 text,
  value text,
  PRIMARY KEY (af_id, column1, column2)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
This was meant to be a dynamic table with very wide rows. I populated it based on the data stored in my static column family. The 'AF' value is the key, and the compound key from the other table is concatenated with '-' (e.g. 1-129-T-G) and stored as a string in a dynamic column name. This worked OK, but I still do not understand how all of these things work together. Dynamic column families seem to only work as advertised using CQL 2, but I really need to use range operators like >, <, >= and <=. It seems like this is theoretically possible, but I have not found a solution in the last 4 weeks of trying a number of different tools (I tried Astyanax as well as the JDBC driver).
I have two primary problems. The first is the RPC timeout limitation when querying these data, which could produce tens of thousands to millions of records. The second problem is how to present these data in HTML by fetching only the data that has not been presented already (previous/next links), similar to the way OpsCenter displays column family records. This doesn't seem possible given the limitation of not being able to use >, <, >= and <=. Based on my experience this is probably a lack of understanding on my part of how this product really works rather than a lack of capability of the product (databases wouldn't be very useful if they were only good at handling writes).
Is there anyone out there that has encountered this issue and solved it before? I would really appreciate sharing an example of how to implement a C* solution using java web services to display a large number of results that will have to be paginated through.
You may want to explore and use PlayOrm for Cassandra, as it can resolve your problems with the timeout limitation and pagination. PlayOrm returns a cursor when you query; as your first page reads in the first 20 results and displays them, the next page can just use the same cursor in your session, and it picks up right where it left off without rescanning the first 20 rows again.
Visit http://buffalosw.com/wiki/An-example-to-begin-with-PlayOrm/ for the cursor example and http://buffalosw.com/products/playorm/ for all features and more details about PlayOrm.
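If you want to stay in plain CQL instead, one hedged option for the pagination part is to page through a wide af_lookup row by restarting from the last clustering key the previous page returned, along these lines:
-- first page for one AF value
SELECT column1, column2, value
FROM af_lookup
WHERE af_id = 0.05
LIMIT 1000;

-- next page: restart just after the last column1 seen on the previous page
-- (if column1 is not unique within the row, restart on both clustering columns)
SELECT column1, column2, value
FROM af_lookup
WHERE af_id = 0.05 AND column1 > '1-129-T-G'
LIMIT 1000;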

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know: if the same kind of situation arises and you don't want to keep so many columns in a single table, what are the other options?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring most columns have valid default values (NULL, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no SELECT *). Any future optimization will involve indexes with a subset of the columns.
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table you put your most frequently used columns, and in the second table the lesser-used columns. Generally the first one should be smaller. You can now speed things up in the first table with various indexes; in our design, we even had the first table running on the MEMORY engine (RAM), since we only had read queries. If you need a combination of columns from table1 and table2, you join both tables on the primary key, as in the sketch below.
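A rough sketch of that split (the table and column names are invented for illustration):
CREATE TABLE orders_hot (
  order_id     INT PRIMARY KEY,   -- same primary key as the original wide table
  status       VARCHAR(20),       -- frequently used columns live here
  total        DECIMAL(10,2),
  created_at   DATE
);

CREATE TABLE orders_cold (
  order_id     INT PRIMARY KEY,   -- same key again, 1:1 with orders_hot
  notes        VARCHAR(4000),     -- rarely used columns live here
  internal_ref VARCHAR(100)
);

-- when you need columns from both halves, join on the shared key
SELECT h.order_id, h.status, c.notes
FROM orders_hot h
JOIN orders_cold c ON c.order_id = h.order_id;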
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns depend on the primary key, but there may also be dependencies between columns: e.g. there are multiple values of COL42 associated with each value of COL23. A join dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF, and it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
      1  NAME  Joe
      1  AGE   44
      1  WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact it behaves somewhat like a column-organized DB, as you only pull in the rows you need (i.e. less I/O) rather than an entire wide row when you only need a few columns (i.e. more I/O).
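To illustrate the "harder to use" part: getting one wide row back out of the IRP table typically means pivoting, along these lines (column and item names taken from the sample above):
SELECT key_col,
       MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
       MAX(CASE WHEN item = 'AGE'  THEN value END) AS age,
       MAX(CASE WHEN item = 'WGT'  THEN value END) AS wgt
FROM irp_table
GROUP BY key_col;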

Is it possible to traverse rowtype fields in Oracle?

Say I have something like this:
somerecord SOMETABLE%ROWTYPE;
Is it possible to access the fields of somerecord without knowing the field names?
Something like somerecord[i] such that the order of fields would be the same as the column order in the table?
I have seen a few examples using dynamic SQL, but I was wondering if there is a cleaner way of doing this.
What I am trying to do is generate/get the DML (insert query) for a specific row in my table, but I haven't been able to find anything on this.
If there is another way of doing this I'd be happy to use it, but I would also be very curious to know how to do the first part of this question, as it's more versatile.
Thanks
This doesn't exactly answer the question you asked, but might get you the result you want...
You can query the USER_TAB_COLUMNS view (or the other similar *_TAB_COLUMN views) to get information like the column name (COLUMN_NAME), position (COLUMN_ID), and data type (DATA_TYPE) on the columns in a table (or a view) that you might use to generate DML.
You would still need to use dynamic SQL to execute the generated DML (or at least generate static SQL separately).
However, this approach won't work for identifying the columns in an arbitrary query (unless you create a view of it). If you need that, you might need to resort to DBMS_SQL (or other tools).
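For the table case, here is a minimal sketch of that idea (it assumes 11gR2+ for LISTAGG; the table name SOMETABLE and the bind-variable style are just placeholders). It builds the skeleton of an INSERT statement from USER_TAB_COLUMNS:
DECLARE
  l_cols  VARCHAR2(32767);
  l_binds VARCHAR2(32767);
  l_sql   VARCHAR2(32767);
BEGIN
  SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_id),
         LISTAGG(':' || LOWER(column_name), ', ') WITHIN GROUP (ORDER BY column_id)
  INTO   l_cols, l_binds
  FROM   user_tab_columns
  WHERE  table_name = 'SOMETABLE';

  l_sql := 'INSERT INTO sometable (' || l_cols || ') VALUES (' || l_binds || ')';

  -- inspect (or EXECUTE IMMEDIATE) the generated DML
  DBMS_OUTPUT.put_line(l_sql);
END;
/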
Hope this helps.
As far as I know there is no clean way of referencing record fields by their index.
However, if you have a lot of different kinds of updates of the same table, each with its own set of columns to update, you might want to avoid dynamic SQL and instead statically populate your record with values, then issue update someTable set row = someTableRecord where someTable.id = someTableRecord.id;.
This approach has its own drawbacks, such as updating every column even if it is unchanged, and thus creating additional redo log data, but I believe it should be considered.
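For example, a small sketch (the id and name columns of someTable are my inventions here, purely for illustration):
DECLARE
  r someTable%ROWTYPE;
BEGIN
  SELECT * INTO r FROM someTable WHERE id = 42;

  r.name := 'New name';        -- statically set whichever fields you need

  UPDATE someTable
  SET    ROW = r
  WHERE  id = r.id;
END;
/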
