Round-robin table using only basic data structures - performance

I have a list of elements (database tables) that may have relations with each other, including themselves.
Let's assume I have table1, table2, table3 and table4. What I would like to create and use in a loop would be structure like this:
|table1|table2|table3|table4||
table1| 1:1 | | | ||
table2| | 1:1 | | ||
table3| | | 1:1 | ||
table4| | | | 1:1 ||
Note that while it is not necessary to have 1:1 relations pre-filled, it is important for me to be able easily address any of these cells in a loop.
There will be while(True) loop that will iterate over and over again until the iteration without change occurs.
What I am currently struggling with is to find an efficient and pythonic way of doing this with basic python structures, while keeping decent performance and easy addressing.
Example of what I tried
tableNames = d.items(); # returns me this format ([(tableName, ['column1', 'column2']), ... ])
for tName in tableNames:
# some sort of comparism and conditions resulting in whatever decision
# for example, that I need to insert information about
# relation on [actualtable][table2]
fancyStructure[actualtable][table2] = "1:N";
Is it a good idea to use dictionary? Respectively dictionary of dictionaries? What is the best way?
And is it okay to pre-initialize that dictionary to the values that I will change in time? Note that this script will work with millions or even billions of records.
EDIT:
Input:
dict([('tableName', ['columnName', 'columnName2']), ('tableName2', ['columnName', 'columnName2'])])
Expected output:
tableName : tableName [relation],
tableName : tableName2[relation],
tableName2 : tableName[relation],
tableName2 : tableName2[relation]
Relations will be linked to both of mentioned tables separated by colon and at the beginning there will be no value, or empty.

Related

How does a multi-column index work in oracle?

I'm building a table to manage some articles:
Table
| Company | Store | Sku | ..OtherColumns.. |
| 1 | 1 | 123 | .. |
| 1 | 2 | 345 | .. |
| 3 | 1 | 123 | .. |
Scenario
Most time company, store and sku will be used to SELECT rows:
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
..but sometimes the company will not be available when accessing the table.
SELECT * FROM stock s WHERE s.store = 1 AND s.sku = 123;
..Sometimes all articles will be selected for a store.
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1;
The Question
How to properly index the table?
I could add three indexes - one for each select, but i think oracle should be smart eneugh to re-use other indexes.
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
You can think of the index key as conceptually being the 'concatenation' of the all of the columns, and generally you need to have a leading element of that key in order to get benefit from the index. So for an index on (company,store,sku) then
WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
can potentially benefit from the index
WHERE s.store = 1 AND s.sku = 123;
is unlikely to benefit (but see footnote below)
WHERE s.company = 1 AND s.store = 1;
can potentially benefit from the index.
In all cases, I say "potentially" etc, because it is a costing decision by the optimizer. For example, if I only have (say) 2 companies and 2 stores then a query on company and store, whilst it could use the index is perhaps better suited to not to do so, because the volume of information to be queried is still a large percentage of the size of the table.
In your example, it might be the case that an index on (store,sku,company) would be "good enough" to satisfy all three, but that depends on the distribution of data. But you're thinking the right way, ie, get as much value from as few indexes as possible.
Footnote: There is a thing called a "skip scan" where we can get value from an index even if you do not specify the leading column(s), but you will typically only see that if the number of distinct values in those leading columns is low.
first - do you need index at all? Indexes are not for free. If your table is small enoguh, perhaps you don't need index at all.
Second - what is data structure? You have store column in every scenario - I can imagine situation in which filtering data on store dissects source data to enough degree to be good enough for you.
However if you want to have maximum reasonable performance benefit you need two:
(store, sku, company)
(store, company)
or
(store, company, sku)
(store, sku)
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Yes
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
Probably not, but I can imagine scenarios in which it might happen (not for the index seek operation which is really primary purpose of indices)
You dissect data in order of columns. So you group data by first element and order them by first columns sorting order, then within these group you group the same way by second element etc.
So when you don't use first element of index in filtering, the DB would have to access all "subgroups" anyway.
I recommend reading about indexes in general. Start with https://en.wikipedia.org/wiki/B-tree and try to draw how it behaves on paper or write simple program to manage simplified version. Then read on indexes in database - any db would be good enough.

SSRS Reporting - count number of group rows

I am trying to the count of number of groups in my report I know I could do it in the SQL however trying to avoid adding redundant data to my dataset if I can.
I have a MainDataSet that could have multiple entries per distinct group item. All I want is the no. of groups not the count of items within the group.
For example words starting with alphabet letters, lets say I have 2 groups A and B only (NB: number of groups can change dynamically as I filter the MainDataSet based on user parameter selection):
Group | Data
------|-----
A | Apple
A | Ant
B | Balloon
B | Book
B | Bowl
Final Result:
Group | Index | NGroups
A | 1 | 2
B | 2 | 2
I know I can get the Index using a aggregate function as follows:
RunningValue(Fields!Group.Value, CountDistinct, "TablixName")
But how do I get the NGroups value?
I guess I could also create another dataset based on the MainDataSet (make use of a sql function) and do:
SELECT 'X' AS GroupCount, COUNT(Distinct Group) AS NGroups
FROM dbo.udf_MainDataSet()
WHERE FieldX = #Parameter1
Then use a LookUp:
Lookup("X", Fields!GroupCount.Value, Fields!NGroups.Value, "NewDataSet")
But is there a simple solution that I am not seeing?
CountDistinct(Fields!Group.Value, "TablixName")

Spotfire: Using one selection as a range for another datatable

I've searched quite a bit for this and can't find a good solution anywhere to what seems to me like a normal problem for this product.
I've got a data table (in memory) that is from a rollup table(call it 'Ranges'). Basically like so:
id | name | f1 | f2 | totals
0 | Channel1 | 450 | 680 | 51
1 | Channel2 | 890 | 990 | 220
...and so on
Which creates a bar chart with Name on the X and Totals on the Y.
I have another table that is an external link to a large (500M+ rows) table. That table (call it 'Actuals') has a column ('Fc') that can fit inside the F1 and F2 values of Ranges.
I need a way for Spotfire Analyst (v7.x) to use the selection of the the bar chart for Ranges to trigger this select statement:
SELECT * FROM Actuals WHERE Actuals.Fc between [Ranges].[F1] AND [Ranges].[F2]
But there aren't any relationships (Foreign keys) between the two data sources, one is in memory (Ranges) and the other is dynamic loaded.
TLDR: How do I use the selected rows from one visualization as a filter expression for another visualization's data?
My choice for the workaround:
Add a button which says 'Load Selected Data'
This will run the following code, which will store the values of F1 and F2 in a Document Property, which you can then use to filter your Dynamically Loaded table and trigger a refresh (either with the refresh code or by setting it to load automatically).
rowIndexSet=Document.ActiveMarkingSelectionReference.GetSelection(Document.Data.Tables["IL_Ranges"]).AsIndexSet()
if rowIndexSet.IsEmpty != True:
Document.Properties["udF1"] = Document.Data.Tables["IL_Ranges"].Columns["F1"].RowValues.GetFormattedValue(rowIndexSet.First)
Document.Properties["udF2"] = Document.Data.Tables["IL_Ranges"].Columns["F2"].RowValues.GetFormattedValue(rowIndexSet.First)
if Document.Data.Tables.Contains("IL_Actuals")==True:
myTable=Document.Data.Tables["IL_Actuals"]
if myTable.IsRefreshable and myTable.NeedsRefresh:
myTable.Refresh()
This is currently operating on the assumption that you will not allow your user to view multiple ranges at a time, and simply shows the first one selected.
If you DO want to allow them to view multiple ranges, you can run a cursor through your IL_Ranges table to either get the Min and Max for each value, and limit the Actuals between the min and max, or you can create a string that will essentially say 'Fc between 450 and 680 or Fc between 890 and 990', pass that through to a stored procedure as a string, which will execute the quasi-dynamic statement, and grab the resulting dataset.

SELECT ... LIMIT 1 query results in more than one row?

I noticed that LIMIT queries will return more than the expected number of rows when they are executed against tables that contain nested or repeated data. For example, the following query run against the persons sample data set from the developer guide produces the following results:
% bq query 'SELECT fullName, children.name FROM [persons.person] LIMIT 1'
+----------+---------------+
| fullName | children_name |
+----------+---------------+
| John Doe | Jane |
| John Doe | John |
+----------+---------------+
It looks like BQL is applying the LIMIT operator before flattening the results as opposed to the other way around (which I think would make more sense).
Is this a bug in the BQL implementation or is this the expected behavior? If this is the expected behavior can someone please provide an explanation for why this makes sense?
This is expected given the way BigQuery flattens query results. When you run the query, the LIMIT 1 applies to the repeated record. Then the results get flattened in the output, and you get two rows. A workaround is to use an explicit flatten operation. For example:
SELECT fullName, children.name
FROM (FLATTEN([persons.person], children.name) LIMIT 1
This will return only a single row.

Dynamic database design for variable length spatio-temporal data in Oracle (need a schema design)

Currently I am working on a research project, where I need to store spatio-temporal data and analyze them efficiently. I am giving the exact requirement below.
The research is going on meteorological data, so the data attributes are temperature, humidity, pressure, wind-speed, wind-direction etc. The number of attributes is previously unknown to us, depending on requirement we may need to add more attributes (Table having dynamic attribute and different datatype nature). Again the data is captured from various locations, from various height and in a certain time duration as well as time interval.
So, what should be the best way to design a schema for the requirement? We must have to find out relation efficiently.
The purpose of the project is not only to store database, also need to manipulate the data.
Sample data in table format -
location | time | height | pressure | temperature | wind-direction | ...
L1 | 2011-12-18 08:04:02 | 7 | 1009.6 | 28.3 | east | ...
L1 | 2011-12-18 08:04:02 | 15 | 1008.6 | 27.9 | east | ...
L1 | 2011-12-18 08:04:02 | 27 | 1007.4 | 27.4 | east | ...
L1 | 2011-12-18 08:04:04 | 7 | 1010.2 | 28.4 | north-east | ...
L1 | 2011-12-18 08:04:04 | 15 | 1009.4 | 28.2 | north-east | ...
L1 | 2011-12-18 08:04:04 | 27 | 1008.9 | 27.6 | north-east | ...
L2 | 2011-12-18 08:04:02 | ..... so on
Here I need to design a schema for the above sample data where Location is a spatial location that can be implemented using oracle MDSYS.SDO_GEOMETRY type.
Constraints are:
The no of attributes (table column) is unknown during development. In runtime any new attribute(let say - humidity, refractive index etc.) can be added. So we can't design attribute specific table schema.
    1.1) for this constraint I thought to use a schema like -
           tbl_attributes(attr_id_pk, attr_name, attr_type);          
tbl_data(loc, time, attr_id_fk, value);
     The my design the attribute value must be varchar type, and as required I thought to cast (not a good idea at all).
     But finding relational data with this schema is very difficult using SQL query only. For example I want to find -
          1.1.1) avg pressure for location L1 when wind direction is east and temperature in between 27-28
         1.1.2) locations, where pressure is maximum at 15 height.
     1.2) I am also thinking to edit table schema during runtime, which is again not a good idea I think.
We will use a loader application, which will be taking care of this dynamic insertion depending on the schema (what ever it maybe).
Need to retrieve statistical data efficiently as some example is given above [1.1.*].
I am not completely sure I understand what you mean when you say that
The no of attributes (table column) is unknown during development. In
runtime any new attribute(let say - humidity, refractive index etc)
can be added.
first of all, I suppose that this is not really happening at random: i.e. when you get a new bunch of data from the field you know (before importing) that these have an extra dimension or two. Correct?
Also, the fact that in this new data batch you get "refractive index" will not make the older data magically acquire a proper value for this dimension.
Therefore I would go for a classical Object-to-RDBMS mapping where you have:
a header table with things that exist for every measurement: i.e. time and space, possibly the source (i.e. lab, sensor, team which provided the data) and an autogenerated key.
one or more detail table where the values are defined as proper fields.
Example:
Header
location | time | height | source |Key |
L1 | 2011-12-18 08:04:02 | 7 | team-1 | 002020013 |
L1 | 2011-12-18 08:04:02 | 15 | team-1 | 002020017 |
L1 | 2011-12-18 08:04:02 | 27 | Lab-X | 002020018 |
L1 | 2011-12-18 08:04:04 | 7 | Lab-Y | 002020021 |
L1 | 2011-12-18 08:04:04 | 15 | Lab-X | 002020112 |
Atmospheric data (basic)
Key | pressure | temp | wind-dir |
002020013 | 1009.6 | 28.3 | east |
002020017 | 1019.3 | 29.2 | east |
002020018 | 1011.6 | 26.9 | east |
Light-sensor data
Key | refractive-ind | albedo | Ultraviolet |
002020017 | 79.6 | .37865 | 7.0E-34 |
002020018 | 67.4 | .85955 | 6.5E-34 |
002020021 | 91.6 | .98494 | 8.1E-34 |
In other words: every different set of data will use one or more subtables (these you can add "dynamically", if needed) and you can still create queries by standard means, you will just have to join subtables (where possible: i.e. if you want to analyze by Wind Directions AND refractive index, you can - but only when you have set of data which have both values) by using the reference keys to keep these consistent).
I believe this more efficient than using text fields with CSV inside, or data blobs or using a key-values associations.
I would definitely go with 1.2 (edit table schema during runtime), at least to begin with. Any sufficiently advanced configuration is indistinguishable from programming; don't think you can magically avoid making changes to your program.
Don't be scared of alter table. Yes, the upfront costs are higher - you may need a process (not just a program) to ensure your schema stays clean. And there are some potential locking problems (that have solutions). But if you do it right you only have to pay the price once for each change.
With a completely generic solution you will pay a small price with every query. Your queries will be complicated, slower, ugly, and more likely to fail. You can never write a query like select avg(value) ..., it may or may not work, depending on how the data is accessed. You can use a PL/SQL function to catch exceptions, or use inline views and hints to force a specific access pattern. Either way, your queries are more complicated and slower, and you have to make sure that everybody understands these problems before they use the data.
And with a generic solution the optimizer will suck because it knows nothing about your data. Oracle can't predict how many rows will be returned by where attr_name = 'temperature' and is_number(value) = 28.4. But it can make a very good guess for where temperature = 28.4. You may have significantly more bad plans (i.e. slow queries) with generic columns.
Thank you for the quick response and good guidance. I have gotten some concepts from the both answers and decided to go with a mix model. I don't know whether I am in the write path or not. I want comments on the model. Below I am describing the complete conceptual model with MySQL code snippet.
Conceptual model
For dynamicity - (no of column is not defined previously) I have created 4 tables as follows -
geolocation(locid int, name varchar, geometry spatial_type) - to store information of a particular location, may be defined with spatial feature.
met_loc_event(loceventid int, locid* int, record_time timestamp, height float) - this is to identify a perticular event in a place with sudden height.
metfeatures(featureid int, name varchar, type varchar) - to store feature (ie. Column) details with a data type, that type field will help to cast data as required.
metstore(loceventid* int, featureid* int, value varchar) - to store an atom value for a feature at a particular time.
Up to that part I design a column orientation to store a dynamic nature of table. But as you suggest this is not a good design for quering (some will not work like arithmetic functions) the database. This is also not good if we consider performance.
For efficient query needs (to avoid to much joining and to avoid casting value during query) - I extend the model with some helper view, I write store procedure to generate views from the stored database.
First I created views for each feature (by taking value from feature table, so no of entry will be no of feature view initially) with the help of met_loc_event, metfeatures and metstore tables. These views store locid, record_time, height, and caste value according to feature type
Next from these views, I created a row oriented view named metrelview - which consist of all relation data row wise as like normal table. I have planned to fire query to the view, so the query performance will be improved.
This view generation procedure needs to execute whenever any insert, update or delete operation will be there in features table.
Below is the MySQL procedure that I have developed for the view generation
CREATE PROCEDURE `buildModel`()
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE fid INTEGER;
DECLARE fname VARCHAR(45);
DECLARE ftype VARCHAR(45);
DECLARE cur_fatures CURSOR FOR SELECT `featureid`, `name`, `type` FROM `metfeatures`;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
SET #viewAlias = 'v_';
SET #metRelView = "metrelview";
SET #stmtCols = "";
SET #stmtJoin = "";
START TRANSACTION;
OPEN cur_fatures;
read_loop: LOOP
FETCH cur_fatures INTO fid, fname, ftype;
IF done THEN
LEAVE read_loop;
END IF;
IF fname IS NOT NULL THEN
SET #featureView = CONCAT(#viewAlias, LOWER(fname));
IF ftype = 'float' THEN
SET #featureCastStr = "`value`+0.0";
ELSEIF ftype = 'int' THEN
SET #featureCastStr = "CAST(`value` AS SIGNED)";
ELSE
SET #featureCastStr = "`value`";
END IF;
SET #stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", #featureView, "`");
SET #stmtCreateView = CONCAT("CREATE VIEW `", #featureView, "` AS SELECT le.`loceventid` AS loceventid, le.`locid`, le.`rectime`, le.`height`, ", #featureCastStr, " AS value FROM `metlocevent` le JOIN `metstore` ms ON (le.`loceventid`=ms.`loceventid`) WHERE ms.`featureid`=", fid);
PREPARE stmt FROM #stmtDeleteView;
EXECUTE stmt;
PREPARE stmt FROM #stmtCreateView;
EXECUTE stmt;
SET #stmtCols = CONCAT(#stmtCols, ", ", #featureView, ".`value` AS ", #featureView);
SET #stmtJoin = CONCAT(#stmtJoin, " ", "LEFT JOIN ", #featureView, " ON (le.`loceventid`=", #featureView,".`loceventid`)");
END IF;
END LOOP;
SET #stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", #metRelView, "`");
SET #stmtCreateView = CONCAT("CREATE VIEW `", #metRelView, "` AS SELECT le.`loceventid`, le.`locid`, le.`rectime`, le.`height`", #stmtCols, " FROM `metlocevent` le", #stmtJoin);
PREPARE stmt FROM #stmtDeleteView;
EXECUTE stmt;
PREPARE stmt FROM #stmtCreateView;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
CLOSE cur_fatures;
COMMIT;
END;
N.B. - I tried to call the procedure with any event in features table, so that every thing should be automated. But as MySQL is not supported dynamic query with function or trigger, I cant do it automatically
I also want criticism before i finalize as accepted model, I am not a DBA so, if you can help me how to improve performance for the model will be very helpful for me.
This sounds like a homework assignment whose underlying subject is: use-cases for abandoning strict normal-form design principles.
The solution to this conundrum is to develop a three-stage solution. Stage 1 is runtime adaptability using the flexible AttributeType, AttributeValue approach, so that rapidly incoming data can be captured and put somewhere temporarily in a quasi-structured manner. Stage 2 involves the analysis of that runtime data to see where the model must be extended with additional columns and validation tables to accommodate any new attributes. Stage 3 is the importing of the as-yet-unimported data into the revised model, which never relaxes its strict datatyping and declarative referential integrity constraints.
As they say: Life, friends, is a trade-off.

Resources