Where/how to store distributed configuration data - Oracle

A single installation of our product stores its configuration in a set of database tables.
None of the installations 'know' about any other installation.
It's always been common for customers to install multiple copies of our product in different datacentres, which are geographically far apart. This means that the configuration information needs to be created once, then exported to the other systems. Some of the config is then modified to suit local conditions, e.g. changing IP addresses. This is a clunky, error-prone approach.
We're now getting requests for a more seamless way to share global data, while still allowing local modifications.
If it weren't for the local modifications bit then we could use Oracle's data replication features.
Due to HA requirements, having all the configuration in a single database isn't an option.
Has anyone else encountered this problem and have you ever figured out a good programmatic solution for this? Know of any good papers that might describe a partial or full solution?
We're *nix-based and use Oracle. Changes should be replicated to all nodes quickly (within a second or two).

I'm not sure how possible it is for you to change the way you handle your configuration, but we implemented something similar to this using the idea of local overrides. Specifically, you have two configuration tables that are identical (call them CentralConfig and LocalConfig). CentralConfig is maintained at the central location and is replicated out to your satellite locations, where it is read-only. LocalConfig can be set up at the local site. The procedure that queries configuration data looks in the LocalConfig table first and, if the value isn't found there, retrieves it from the CentralConfig table.
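As a rough sketch, the pair of tables might look like this (the column names are my assumptions, not from the original design):

-- Two structurally identical tables: CentralConfig is replicated to the
-- satellite sites read-only; LocalConfig is writable at each site.
CREATE TABLE CentralConfig (
    name  VARCHAR2(100) NOT NULL,
    value VARCHAR2(4000),
    CONSTRAINT centralconfig_pk PRIMARY KEY (name)
);

CREATE TABLE LocalConfig (
    name  VARCHAR2(100) NOT NULL,
    value VARCHAR2(4000),
    CONSTRAINT localconfig_pk PRIMARY KEY (name)
);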
For example, if you were trying to do this with the values in the v$parameter table, you could query your configuration using the FIRST_VALUE function in SQL analytics:
SELECT DISTINCT
       NAME
     , FIRST_VALUE(VALUE) OVER (PARTITION BY NAME
                                ORDER BY localsort) VALUE
  FROM (SELECT t.*
             , 0 localsort
          FROM local_parameter t
        UNION
        SELECT t.*
             , 1 localsort
          FROM v$parameter t
       )
 ORDER BY NAME;
The localsort column in the unions is there just to make sure that the local_parameter values take precedence over the v$parameter values.
In our system, it's actually much more sophisticated than this. In addition to the "name" for the parameter you're looking up, we also have a "context" column that describes the context we are looking for. For example, we might have a parameter "timeout" that is set centrally, but even locally, we have multiple components that use this value. They may all be the same, but we may also want to configure them differently. So when the tool looks up the "timeout" value, it also constrains by scope. In the configuration itself, we may use wildcards when we define what we want for scope, such as:
CONTEXT        NAME     VALUE
-------------  -------  -----
Comp Engine A  timeout     15
Comp Engine B  timeout     10
Comp Engine %  timeout      5
%              timeout     30
The configuration above says: for all components, use a timeout of 30; for Comp Engines of any type, use a timeout of 5; but for Comp Engines A and B, use 15 and 10 respectively. The last two configurations might be maintained in CentralConfig, the other two in LocalConfig, and you would resolve the settings this way:
SELECT DISTINCT
       NAME
     , FIRST_VALUE(VALUE) OVER (PARTITION BY NAME
                                ORDER BY TRANSLATE(Context
                                                 , '%_'
                                                 , CHR(1) || CHR(2)) DESC
                                       , localsort
                               ) VALUE
  FROM (SELECT t.*
             , 0 localsort
          FROM LocalConfig t
         WHERE 'Comp Engine A' LIKE Context
        UNION
        SELECT t.*
             , 1 localsort
          FROM CentralConfig t
         WHERE 'Comp Engine A' LIKE Context
       )
 ORDER BY NAME;
It's basically the same query, except that I'm inserting that TRANSLATE expression before my localsort and I'm constraining on Context. What it's doing is converting the % and _ characters to chr(1) & chr(2), which will make them sort after alphanumeric characters in the descending sort. In this way, the explicitly defined "Comp Engine A" will come before "Comp Engine %", which in turn will come before "%". In cases where the contexts are defined identically, local config takes precedence over central ones; if you wanted local to always trump central, even in cases when central was scoped more tightly, you'd just reverse the positions of the two sort terms.

The way we're doing this is similar to Steve's.
First, you need a central configuration service that holds all the configuration you want to apply across the distributed environment. Whenever you need to change the config, change it in the central configuration service. On each production host you can then run a polling script that pulls the updated configuration down.
For a more sophisticated solution, you need a strategy to avoid pushing a bad configuration batch to every server at once, which would be a disaster. You may need a simple lock, or a staged (grey/canary) release process. A rough sketch of the pull step follows.
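If the per-host stores are Oracle databases (as in the question), one way to sketch that pull step is a periodically scheduled MERGE over a database link; the link name is hypothetical, the table names are from the earlier answer:

MERGE INTO CentralConfig loc
USING (SELECT name, value FROM CentralConfig@central_cfg_link) rmt
   ON (loc.name = rmt.name)
 WHEN MATCHED THEN UPDATE SET loc.value = rmt.value
 WHEN NOT MATCHED THEN INSERT (name, value) VALUES (rmt.name, rmt.value);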

Related

SSIS Google Analytics - Reporting dimensions & metrics are incompatible

I'm trying to extract data from Google Analytics, using KingswaySoft - SSIS Integration Toolkit, in Visual Studio.
I've set the metrics and dimensions, but I get this error message:
Please remove transactions to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
I've tried removing the transactions metric and it works, but that metric is really necessary.
Metrics: sessionConversionRate, sessions, totalUsers, transactions
Dimensions: campaignName, country, dateHour, deviceCategory, sourceMedium
Any idea on how to solve it?
I'm not sure how helpful this suggestion is, but could a possible workaround be to use two queries?
Query 1: Existing query without transactions
Query 2: The same dimensions with transactionId included
The idea would be to use the SSIS Aggregate component to group by the original dimensions and count the transactions. You could then merge the queries together via a merge join.
Would that work?
The API supports what it supports. So if you've attempted to pair things that are incompatible, you won't get any data back. Things that seem like they should totally work go together like orange juice and milk.
While I worked on the GA stuff through Python, an approach we found that helped us work around incompatible metrics and totals was to make multiple pulls using the same dimensions. As the data sets are at the same level of grain, as long as you match up each dimension in the set, you can have all the metrics you want.
In your case, I'd have 2 data flows, followed by an Execute SQL Task that brings the data together for the final table
DFT1: Query1 -> Derived Column -> Stage.Table1
DFT2: Query2 -> Derived Column -> Stage.Table2
Execute SQL Task
SELECT
    T1.*, T2.Metric_A, T2.Metric_B, ... T2.Metric_Z
INTO
    #T
FROM
    Stage.T1 AS T1
    INNER JOIN Stage.T2 AS T2
        ON T2.Dim1 = T1.Dim1 /* etc */ AND T2.Dim7 = T1.Dim7;

-- Update rows where you now have solid data, i.e.
-- isDataGolden exists in the "data" section of the response
-- (usually within 7? days, but possibly sooner)
UPDATE
    X
SET
    metric1 = T.metric1 /* etc */
FROM
    dbo.X AS X
    INNER JOIN #T AS T
        ON T.Dim1 = X.Dim1
WHERE
    X.isDataGolden IS NULL
    AND T.isDataGolden IS NOT NULL;

-- Add new data, but be aware that not all nodes might have
-- reported in yet.
INSERT INTO
    dbo.X
SELECT
    *
FROM
    #T AS T
WHERE
    NOT EXISTS (SELECT * FROM dbo.X AS X WHERE X.Dim1 = T.Dim1 /* etc */);

How to identify a new pattern in a URL with a machine learning algorithm (Text mining)

I am trying to identify new patterns after analyzing a number of URLs. So let's say I am investigating the hypothetical website Yoohle.com, and their URLs have the following structure.
domain = yoohle.com
q= search phrase
lan= language used
pr= partner_id
br= browser_id
so a sample url will look like this
www.yoohle.com/test_folder/test_page?q=hello+world&lan=en&pr=stackoverflow&br=chrome
If I am investigating the web traffic of this website and see an abnormal increase month over month, I would like to find out what's causing it. In this example I can just parse out the URL and look at the pr= value, since it will tell me if there is a new partnership (maybe stackoverflow is going to be powered by yoohle.com and that drives the increase, etc.).
The question is, how can I build something robust that can compare 2 (or more) months and tell me exactly what's driving the increase. I want to get something like, "we are seeing an increase and it is driven by the following pattern"
www.yoohle.com/test_folder/test_page%pr=stackoverflow%
The tricky part is that, unlike in this example, you do not know what the tokens mean; I will not know which token stands for partner_id. Another issue is that looking at the tokens one by one is misleading, because lan=en will also go up with a new partner, assuming the users still have English as their language.
My idea is to analyze the tokens by looking at all the combinations, but that is very costly (4! in this example, and probably 10!+ for other websites). Also, analyzing the tokens alone is not going to solve the problem, since I still need to analyze the values of the tokens.
I tried k-means clustering and the Apriori algorithm, and did some research on URL/text mining, but could not get what I want. Any ideas on how to approach building such an algorithm would be appreciated.
Imagine that you are seeing realtime data, so we are talking about analyzing around 100K URLs in a given month.
I would go about it the following way. You can create the following table (a rough DDL sketch follows the column list):
URL
time
time_month -- time rounded to month, for demonstration purposes
q_bol -- boolean flag whether the q (search phrase) parameter was used
q -- q (search phrase) parameter value
lan -- language parameter value
lan_bol -- boolean flag whether the language parameter was used
pr -- partner parameter value
pr_bol -- boolean flag whether the partner parameter was used
br -- browser parameter value
br_bol -- boolean flag whether the browser parameter was used
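A rough DDL sketch of that table (the PostgreSQL column types are my assumptions):

create table urldata (
    url        text,
    "time"     timestamp,
    time_month date,
    q_bol      boolean,
    q          text,
    lan_bol    boolean,
    lan        text,
    pr_bol     boolean,
    pr         text,
    br_bol     boolean,
    br         text
);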
Now you can write a query along these lines:
with t as (
    select
        time_month,
        q_bol, lan_bol, pr_bol, br_bol,
        count(*) as cnt
    from urldata
    where time_month >= '2013-02-01'::date
      and time_month <  '2013-04-01'::date   -- last two months of data
    group by time_month, q_bol, lan_bol, pr_bol, br_bol
), u as (
    select
        q_bol, lan_bol, pr_bol, br_bol,
        coalesce(t2.cnt, 0) - coalesce(t1.cnt, 0) as abs_change,   -- change in pattern MoM
        case when coalesce(t1.cnt, 0) = 0 then 0
             else coalesce(t2.cnt, 0)::numeric / t1.cnt
        end as rel_change                                          -- relative change
    from (select * from t where time_month = '2013-02-01'::date) t1
    full outer join (select * from t where time_month = '2013-03-01'::date) t2
        using (q_bol, lan_bol, pr_bol, br_bol)
)
select * from u where abs_change > 5000 or rel_change > 3;
The query above gives you the parameter patterns whose count changed by more than 5,000 month over month, or grew to more than three times the previous month's count. If your SQL system supports GROUP BY ROLLUP, it will also give you higher-level aggregations (combinations of three parameters, two parameters, one parameter), as sketched below.
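For example, a rough sketch of the rolled-up aggregation (assuming a database that supports GROUP BY ROLLUP, e.g. a recent PostgreSQL):

select
    time_month,
    q_bol, lan_bol, pr_bol, br_bol,
    count(*) as cnt
from urldata
group by time_month, rollup (q_bol, lan_bol, pr_bol, br_bol);
-- rollup adds subtotal rows for (q, lan, pr), (q, lan), (q) and the grand total, per month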
You can do much the same with the values of the parameters. Because you do not know in advance which tokens and values will be present, you can parse each URL into the following pair of tables:
-- urls
id_url
url
time
-- parameters
id_url
token
value
Then you will need to rewrite the query above accordingly, e.g. using PostgreSQL's array aggregation function, array_agg().
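A rough sketch of what that could look like with the two tables above (PostgreSQL syntax, names as defined above): collapse each URL's tokens into a sorted array, so that every distinct token combination becomes one countable pattern.

with url_patterns as (
    select
        u.id_url,
        date_trunc('month', u.time)::date   as time_month,
        array_agg(p.token order by p.token) as token_pattern
    from urls u
    join parameters p using (id_url)
    group by u.id_url, date_trunc('month', u.time)
)
select time_month, token_pattern, count(*) as cnt
from url_patterns
group by time_month, token_pattern
order by cnt desc;

These per-pattern counts can then be compared month over month exactly as in the earlier query.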

SSIS - Join tables from different servers whose names are based on a variable

I have a simple query based on tables from two different linked servers. I need both servers to be changeable, since we're moving from DEV to UAT to Production. I'm using an expression to set the Connection String and Password for server A. So, using that as a base, I set up a Data Flow Task and an 'OLE DB Source' to extract the data I need. Ultimately I'd like my query to look like this:
Select * from A.Payments p1
Full Outer Join ?.Payments p2 on p1.Id = p2.Id
where p1.OrderDesc is null or p2.OrderDesc is null
Is there a way around it? Can I use a variable or some kind of dynamic query? I haven't managed to pass a project parameter in and run one. Thank you very much for your help.
This is done by making the Data Source SQL an expression.
Right-click the Data Flow Task and then click the ellipsis [...] beside "Expressions". In there you will find that one of the available properties you can set is the SqlCommand for your Data Flow Source.
It's not the most intuitive thing to be fair.

How to inline a variable in PL/SQL?

The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
    org_from VARCHAR2 := NULL,
    org_to   VARCHAR2 := NULL)

-- [...]
JOIN organisations org
  ON (cust.org_id = org.id
      AND ((org_from IS NULL) OR (org_from <= org.no))
      AND ((org_to   IS NULL) OR (org_to   >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
  ON (cust.org_id = org.id
      AND ((:B1 IS NULL) OR (:B1 <= org.no))
      AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
  ON (cust.org_id = org.id
      AND ((10 IS NULL) OR (10 <= org.no))
      AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence, or can I assume that inlining variables is indeed better (in this case)? The reason I ask is that I think bind variables force Oracle to devise a general execution plan, whereas inlined values allow it to use very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely different way to improve the query execution plan, other than inlining variables (note I have tried quite a few hints as well, but I'm not an expert in that field)?
In one of your comments you said:
"Also I checked various bind values.
With bind variables I get some FULL
TABLE SCANS, whereas with hard-coded
values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard-code the values, the optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired subset of records.
So, what to do? As you say, this query is only run once a month, so I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a subset of organisations (a rough sketch follows).
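A rough sketch of what that might look like (the customers table name and the ref-cursor parameter are my assumptions, not from the original post):

CREATE OR REPLACE PROCEDURE do_stuff(
    org_from  IN  VARCHAR2 := NULL,
    org_to    IN  VARCHAR2 := NULL,
    p_results OUT SYS_REFCURSOR)
IS
BEGIN
    IF org_from IS NULL AND org_to IS NULL THEN
        -- No restriction: the full-scan plan is the right one here.
        OPEN p_results FOR
            SELECT cust.*
              FROM customers cust
              JOIN organisations org ON cust.org_id = org.id;
    ELSE
        -- Restriction supplied: a plain range predicate lets the optimizer
        -- consider an index on org.no.
        OPEN p_results FOR
            SELECT cust.*
              FROM customers cust
              JOIN organisations org ON cust.org_id = org.id
             WHERE org.no >= NVL(org_from, org.no)
               AND org.no <= NVL(org_to, org.no);
    END IF;
END do_stuff;
/

Each branch is compiled as its own cursor, so each gets a plan suited to its own predicate.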
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.no BETWEEN 10 AND 11
...whereas this is likely to be better suited to a full table scan...
WHERE org.no BETWEEN 10 AND 1199999
That is where bind variable peeking comes into play (depending on the distribution of values, of course).
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
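For instance, something along these lines (the index name and the customers table name are hypothetical, not from the original post):

SELECT /*+ INDEX(org org_no_idx) */ cust.*
  FROM customers cust
  JOIN organisations org ON cust.org_id = org.id
 WHERE ((org_from IS NULL) OR (org_from <= org.no))
   AND ((org_to IS NULL) OR (org_to >= org.no));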
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
  FROM <<some other table>> cust,
       organisations org
 WHERE cust.org_id = org.id
   AND (((org_from IS NULL) OR (org_from <= org.no))
        AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQL plan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is true, and the same for optimizer_use_sql_plan_baselines. If both are true, the following is happening:
The first time a query is started, it is parsed and gets a new plan.
The second time, this plan is stored in the sql_plan_baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the SQL plan baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = null.
You can use dbms_spm.evolve_sql_plan_baseline to evolve the new plan and have it automatically accepted if its performance is at least 1.5 times better than that of the current plan.
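A rough sketch of that check and the evolve call (the SQL handle below is a placeholder, not a real value):

SELECT sql_handle, plan_name, enabled, accepted
  FROM dba_sql_plan_baselines
 WHERE accepted = 'NO';

DECLARE
    l_report CLOB;
BEGIN
    l_report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
                    sql_handle => 'SYS_SQL_0123456789abcdef',  -- placeholder handle from the query above
                    verify     => 'YES',
                    commit     => 'YES');
    DBMS_OUTPUT.PUT_LINE(l_report);
END;
/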
I hope this helps.
I added this as a comment, but will offer it up here as well. I hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a NUMBER. In your hardcoded example, you use numbers to do the comparisons.
JOIN organisations org
  ON (cust.org_id = org.id
      AND ((10 IS NULL) OR (10 <= org.no))
      AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
    org_from VARCHAR2 := NULL,
    org_to   VARCHAR2 := NULL)
So to compare VARCHAR2 to NUMBER, Oracle has to do implicit conversions, and this may cause the full scans.
Solution: change the procedure to pass in numbers.
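A minimal sketch of the suggested change (keeping the defaults from the original signature):

PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)
-- [...] the comparisons against org.no then need no implicit conversion:
--       AND ((org_from IS NULL) OR (org_from <= org.no))
--       AND ((org_to   IS NULL) OR (org_to   >= org.no))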

Cache and SqlCacheDependency (ASP.NET MVC)

We need to return a subset of records, and for that we use the following command:
using (SqlCommand command = new SqlCommand(
    "SELECT ID, Name, Flag, IsDefault FROM (SELECT ROW_NUMBER() OVER (ORDER BY @OrderBy DESC) as Row, ID, Name, Flag, IsDefault FROM dbo.Languages) results WHERE Row BETWEEN ((@Page - 1) * @ItemsPerPage + 1) AND (@Page * @ItemsPerPage)",
    connection))
I set a SqlCacheDependency declared like this:
SqlCacheDependency cacheDependency = new SqlCacheDependency(command);
But immediately after I run the command.ExecuteReader() instruction, the HasChanged base property of the SqlCacheDependency object becomes true, although I did not change the result of the query in any way! And because of this, the result of the query is not kept in the cache.
HttpRuntime.Cache.Insert( cacheKey, list, cacheDependency, Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(AppConfiguration.CacheExpiration.VeryLowActivity));
Is it because the command has 2 SELECT statements? Is it ROW_NUMBER()? If yes, is there any other way to paginate results?
Please help! After too many hours on this, a little help will be greatly appreciated! Thank you.
Running into the same issue and finding the same answers online without any help, I was researching the XML invalid subscription response from Profiler.
I found an example on the MSDN support site that had a slightly different order of code. When I tried it, I realized the problem: don't open your connection object until after you've created the command object and the cache dependency object. Here is the order you must follow, and all will be good:
Be sure to enable notifications (SqlCacheDependencyAdmin) and run SqlDependency.Start first
Create the connection object
Create the command object and assign command text, type, and connection object (any combination of constructors, setting properties, or using CreateCommand).
Create the sql cache dependency object
Open the connection object
Execute the query
Add item to cache using dependency.
If you follow this order, meet all the other requirements on your SELECT statement, and don't have any permissions issues, this will work!
I believe the issue has to do with how the .NET framework manages the connection, specifically what settings are set. I tried overriding this in my sql command test but it never worked. This is only a guess - what I do know is changing the order immediately solved the issue.
I was able to piece it together from the following MSDN posts.
This post covers one of the more common causes of the invalid subscription, and shows how the .NET client sets connection properties that conflict with what notifications require.
https://social.msdn.microsoft.com/Forums/en-US/cf3853f3-0ea1-41b9-987e-9922e5766066/changing-default-set-options-forced-by-net?forum=adodotnetdataproviders
Then this post was from a user who, like me, had reduced his code to the simplest format. My original code pattern was similar to his.
https://social.technet.microsoft.com/Forums/windows/en-US/5a29d49b-8c2c-4fe8-b8de-d632a3f60f68/subscriptions-always-invalid-usual-suspects-checked-no-joy?forum=sqlservicebroker
Then I found this post, also a very simple reduction of the problem, only his was a simple issue: needing two-part names for tables. In his case the suggestion resolved the issue. After looking at his code I noticed the main difference was waiting to open the connection object until AFTER the command object AND the dependency object were created. My only assumption is that, under the hood (I have not yet fired up Reflector to check, so it is only an assumption), the connection object is opened differently, or the order of events and commands happens differently, because of this association.
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/bc9ca094-a989-4403-82c6-7f608ed462ce/sql-server-not-creating-subscription-for-simple-select-query-when-using-sqlcachedependency?forum=sqlservicebroker
I hope this helps someone else in a similar issue.
Just a guess, but could it be because your SELECT statement doesn't have an ORDER BY clause?
If you don't specify an explicit ordering then it's possible for the query to return the results in any order each time it is run. Maybe this is causing the SqlCacheDependency object to think that the results have changed.
Try adding an ORDER BY clause:
SELECT ID, Name, Flag, IsDefault
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY @OrderBy DESC) AS Row,
           ID, Name, Flag, IsDefault
    FROM dbo.Languages
) AS results
WHERE Row BETWEEN ((@Page - 1) * @ItemsPerPage + 1) AND (@Page * @ItemsPerPage)
ORDER BY Row
I'm no expert on SqlCacheDependency; in fact, I found this question whilst looking for answers to my own issues with it! However, I believe the reason your SqlCacheDependency is not working is that your SQL contains a nested subquery.
Take a look at the documentation, which lists what you can/cannot use in your SQL: Creating a Query for Notification.
"....The statement must not contain subqueries, outer joins, or self-joins....."
I also found some invaluable troubleshooting info from a guy at Redgate here: Using and Monitoring SQL 2005 Query Notification, which helped me solve my own problem. By using SQL Profiler to trace the QN events he suggests, I was able to spot that my connection was incorrectly using the 'SET ARITHABORT OFF' option, causing my notifications to fail.
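For reference, here is a quick sketch of the session options that, to my understanding, query notifications expect (the same list as for indexed views); if a Profiler trace shows your connection setting any of these differently, that is a likely culprit:

SET ANSI_NULLS ON;
SET ANSI_PADDING ON;
SET ANSI_WARNINGS ON;
SET CONCAT_NULL_YIELDS_NULL ON;
SET QUOTED_IDENTIFIER ON;
SET NUMERIC_ROUNDABORT OFF;
SET ARITHABORT ON;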
