SSIS Google Analytics - Reporting dimensions & metrics are incompatible

I'm trying to extract data from Google Analytics, using KingswaySoft - SSIS Integration Toolkit, in Visual Studio.
I've set the metrics and dimensions, but I get this error message:
Please remove transactions to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
I've tried removing the transactions metric and it works, but this metric is really necessary.
Metrics: sessionConversionRate, sessions, totalUsers, transactions
Dimensions: campaignName, country, dateHour, deviceCategory, sourceMedium
Any idea on how to solve it?

I'm not sure how helpful this suggestion is, but could a possible workaround involve having two queries?
Query 1: Existing query without transactions
Query 2: The same dimensions with transactionId included
The idea would be to use the SSIS Aggregate component to group by the original dimensions and count the transactions. You could then merge the queries together via a merge join.
Would that work?
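For illustration, here is roughly the logic the Aggregate and Merge Join would implement, expressed as a T-SQL sketch (the staging table names Stage.Query1 and Stage.Query2 are hypothetical; the dimension and metric names come from the question):
-- Sketch only: Stage.Query1 / Stage.Query2 are assumed staging tables, not actual objects.
-- Query 2 includes transactionId; roll it up to the grain of Query 1, then join on the shared dimensions.
SELECT
    q1.campaignName, q1.country, q1.dateHour, q1.deviceCategory, q1.sourceMedium,
    q1.sessionConversionRate, q1.sessions, q1.totalUsers,
    q2.transactions
FROM Stage.Query1 AS q1
LEFT JOIN
(
    SELECT
        campaignName, country, dateHour, deviceCategory, sourceMedium,
        COUNT(DISTINCT transactionId) AS transactions
    FROM Stage.Query2
    GROUP BY campaignName, country, dateHour, deviceCategory, sourceMedium
) AS q2
    ON  q2.campaignName   = q1.campaignName
    AND q2.country        = q1.country
    AND q2.dateHour       = q1.dateHour
    AND q2.deviceCategory = q1.deviceCategory
    AND q2.sourceMedium   = q1.sourceMedium;
The derived subquery plays the role of the Aggregate component (Group By on the five dimensions, a distinct count of transactionId), and the outer join plays the role of the Merge Join; in the actual data flow, both Merge Join inputs would need to be sorted on the join keys.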

The API supports what it supports, so if you've attempted to pair things that are incompatible, you won't get any data back. Things that seem like they should totally work together can end up mixing like orange juice and milk.
While I worked on the GA stuff through Python, an approach we found helped us deal with incompatible metrics and totals was to make multiple pulls using the same dimensions. Since the data sets are at the same level of grain, as long as you match up each dimension in the set, you can have all the metrics you want.
In your case, I'd have 2 data flows, followed by an Execute SQL Task that brings the data together for the final table:
DFT1: Query1 -> Derived Column -> Stage.Table1
DFT2: Query2 -> Derived Column -> Stage.Table2
Execute SQL Task
SELECT
T1.*, T2.Metric_A, T2.Metric_B, ... T2.Metric_Z
INTO
#T
FROM
Stage.Table1 AS T1
INNER JOIN
Stage.Table2 AS T2
ON T2.Dim1 = T1.Dim1 /* etc */ AND T2.Dim7 = T1.Dim7
-- Update rows once you have solid data, i.e.
-- isDataGolden exists in the "data" section of the response
-- (usually within 7? days but possibly sooner)
UPDATE
X
SET
metric1 = T.metric1 /* etc */
FROM
dbo.X AS X
INNER JOIN #T AS T
ON T.Dim1 = X.Dim1
WHERE
X.isDataGolden IS NULL
AND T.isDataGolden IS NOT NULL;
-- Add new data but be aware that not all nodes might have
-- reported in.
INSERT INTO
dbo.X
SELECT
*
FROM
#T AS T
WHERE
NOT EXISTS (SELECT * FROM dbo.X AS X WHERE X.Dim1 = T.Dim1 /* etc */);

Related

Neo4j building initial graph is slow

I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
Added an index to the field pn
Getting a warning: "This query builds a cartesian product between disconnected patterns." I understand why I am getting this warning, but no relationship exists yet, and that's what I am building in this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
Regarding the cartesian product warning, disregard it; the cartesian product is necessary in order to get the two nodes so you can create the relationship. This should be a 1 x 1 = 1 row operation, so it's only going to be costly if multiple nodes are being matched per side, or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to make your query apply in batches. So if 100 contacts are being loaded by a user, you do NOT want to execute 100 separate queries, each adding a single contact. Instead, pass the list of contacts as a parameter, then UNWIND the list and apply the query once to process the entire batch.
Something like:
UNWIND $batch as row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query.
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person) USING INDEX p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2

BIRT Report Designer - split actual and budget data stored in one table into columns and add a variance

I have financial data in the following format in a SQL database and I have to live with this format unfortunately (example dummy data below).
I have however been struggling to get it into the following layout in a BIRT report.
I have tried creating a data cube with Package, Flow and Account as dimensions and Balance as a measure, but that groups actual PER and actual YTD next to each other, and budget PER and YTD next to each other, etc., so it is not quite what I need.
The other idea I had was to create four new calculated columns: the first would only have a value if it were a line for actual and PER, the next only if it were actual and YTD, etc., but I could not get the IF function working in the calculated column.
What are the options? Can someone point me in the direction of how to best create the above layout from this data structure so I can take it from there?
Thanks in advance.
I am not sure what DB you are using in the back end, but this is how I did it with SQL Server.
The important bit happens in the Data Set. Here is the SQL for my Data Set:
SELECT
Account,
Package,
Flow,
Balance
FROM data
UNION
SELECT DISTINCT
Account,
'VARIANCE',
Flow,
(SELECT COALESCE(SUM(Balance), 0) FROM data WHERE Account = d.Account AND Flow = d.Flow AND Package = 'ACTUAL')
- (SELECT COALESCE(SUM(Balance), 0) FROM data WHERE Account = d.Account AND Flow = d.Flow AND Package = 'BUD') AS Balance
FROM data d
This gives me a table containing the original ACTUAL and BUD rows plus a calculated VARIANCE row for each Account and Flow.
Then I created a DataCube that contained
Groups/Dimensions
Account
Flow
Package
Summary Fields/Measures
Balance
Then I created a CrossTab Report that was based on that DataCube
And this produces the desired layout, with ACTUAL, BUD and VARIANCE as separate columns.
Hopefully this helps.

How to sort an optimized keys only query

I'm using the Datastore native API to access the GAE database (for well-studied, specific reasons). I wanted to optimize the code and use memcache in my requests instead of directly grabbing the values; the issue is that my query is sorted.
When I call findProductsByFiltersQuery.setKeysOnly(); on my query, I receive this error:
The provided keys-only multi-query needs to perform some sorting in
memory. As a result, this query can only be sorted by the key
property as this is the only property that is available in memory.
The weird thing is that it only starts happening beyond a certain complexity of the request. For example, this request fails:
SELECT __key__ FROM Product WHERE dynaValue = _Rs:2 AND productState = PUBLISHED AND dynaValue = _RC:2 AND dynaValue = RF:1 AND dynaValue = ct:1030003 AND dynaValue = _RS:4 AND dynaValue = _px:2 AND itemType = NEWS ORDER BY modificationDate DESC
while this one passes:
SELECT __key__ FROM Product WHERE itemType = CI AND productState = PUBLISHED ORDER BY modificationDate DESC
Can someone explain to me why this is happening, and if ordering is not possible when getting only the keys, what is that feature for? Since results are paginated, it is useless to get a badly ordered set of keys from the first filtering request. So how is this supposed to work?
Please also notice that when I run a very long non-keys-only request, I receive this message:
Splitting the provided query requires that too many subqueries are
merged in memory.
at com.google.appengine.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:99)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:71)
Can someone explain to me how there can be in-memory processing when the values are indexed? Or is it only the dev-mode server that produces this error?
In-memory queries are necessary when you use OR, IN, and != operators in Datastore. As described in this blog post, queries using these operators are split in the client into multiple Datastore queries. For example:
SELECT * FROM Foo WHERE A = 1 OR A = 2
gets split into two queries:
SELECT * FROM Foo WHERE A = 1
SELECT * FROM Foo WHERE A = 2
If you add ORDER BY B to your query, both sub-queries get this order:
SELECT * FROM Foo WHERE A = 1 ORDER BY B
SELECT * FROM Foo WHERE A = 2 ORDER BY B
While each of these Datastore queries returns results sorted by B, the union of the queries is not. On the client side, the SDK merges the results from the two queries, ordered by B.
In order to do this, the Datastore queries must actually return the ordered property, otherwise the SDK won't know the correct way to merge these together.
If you are writing queries with a large number of filters, make sure to only use AND filters. This will allow all the operations to be performed only in the Datastore, in which case no in-memory sorting is necessary.

SSIS - Join tables from different servers whose names are based on a variable

I have a simple query based on tables from two different linked servers. I need both servers to be changeable since we're moving from DEV to UAT to Production. I'm using an expression to set the Connection String and Password for server A. So, using that as a base I set a Data Flow Task and an 'OLE DB Source' to extract the data I need. Ultimately I'd like my query to look like this:
Select * from A.Payments p1
Full Outer Join ?.Payments p2 on p1.Id = p2.Id
where p1.OrderDesc is null or p2.OrderDesc is null
Is there a way around it? Can I use a variable or some kind of dynamic query? I haven't managed to pass a project parameter into the query and run it. Thank you very much for your help.
This is done by making the Data Source SQL an expression.
Right click the Data Flow and then click the ellipsis [...] beside "Expressions". In there you will find that one of the available properties you can set is the SqlCommand for your Data Flow Source.
It's not the most intuitive thing to be fair.
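For illustration, the SqlCommand expression ends up just being a string assembled from your parameters. A minimal sketch, assuming a hypothetical project parameter $Project::ServerB that holds the name of the second linked server, and reusing the query from the question:
"SELECT * FROM A.Payments p1 "
+ "FULL OUTER JOIN " + @[$Project::ServerB] + ".Payments p2 ON p1.Id = p2.Id "
+ "WHERE p1.OrderDesc IS NULL OR p2.OrderDesc IS NULL"
The source's data access mode has to be "SQL command" for the SqlCommand property to take effect, and the evaluated string must be valid SQL once the parameter value is substituted in; the same pattern works with a package variable instead of a project parameter.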

Where\How to store distributed configuration data

A single installation of our product stores its configuration in a set of database tables.
None of the installations 'know' about any other installation.
It's always been common for customers to install multiple copies of our product in different datacentres, which are geographically far apart. This means that the configuration information needs to be created once, then exported to other systems. Some of the config is then modified to suit local conditions, e.g. changing IP addresses, etc. This is a clunky, error-prone approach.
We're now getting requests for the ability to have a more seamless strategy for sharing of global data, but still allowing local modifications.
If it weren't for the local modifications bit then we could use Oracle's data replication features.
Due to HA requirements, having all the configuration in one database isn't an option.
Has anyone else encountered this problem and have you ever figured out a good programmatic solution for this? Know of any good papers that might describe a partial or full solution?
We're *nix based, and use Oracle. Changes should be replicated to all nodes pretty quickly (a second or 2).
I'm not sure how possible it is for you to change the way you handle your configuration, but we implemented something similar to this by using the idea of local overrides. Specifically, you have two configuration tables that are identical (call them CentralConfig and LocalConfig). CentralConfig is maintained at the central location, and is replicated out to your satellite locations, where it is read-only. LocalConfig can be set up at the local site. Your procedure which queries configuration data first looks for the data in the LocalConfig table, and if not found, retrieves it from the CentralConfig table.
For example, if you were trying to do this with the values in the v$parameter table, you could query your configuration using the FIRST_VALUE function in SQL analytics:
SELECT DISTINCT
NAME
, FIRST_VALUE(VALUE) OVER(PARTITION BY NAME
ORDER BY localsort
) VALUE
FROM (SELECT t.*
, 0 localsort
FROM local_parameter t
UNION
SELECT t.*
, 1 localsort
FROM v$parameter t
)
ORDER BY NAME;
The localsort column in the unions is there just to make sure that the local_parameter values take precedence over the v$parameter values.
In our system, it's actually much more sophisticated than this. In addition to the "name" for the parameter you're looking up, we also have a "context" column that describes the context we are looking for. For example, we might have a parameter "timeout" that is set centrally, but even locally, we have multiple components that use this value. They may all be the same, but we may also want to configure them differently. So when the tool looks up the "timeout" value, it also constrains by scope. In the configuration itself, we may use wildcards when we define what we want for scope, such as:
CONTEXT NAME VALUE
------------- ------- -----
Comp Engine A timeout 15
Comp Engine B timeout 10
Comp Engine % timeout 5
% timeout 30
The configuration above says, for all components, use a timeout of 30, but for Comp Engines of any type, use a timeout of 5, however for Comp Engines A & B, use 15 & 10 respectively. The last two configurations may be maintained in CentralConfig, but the other two may be maintained in LocalConfig, and you would resolve the settings this way:
SELECT DISTINCT
NAME
, FIRST_VALUE(VALUE) OVER(PARTITION BY NAME
ORDER BY TRANSLATE(Context
, '%_'
, CHR(1) || CHR(2)
) DESC
, localsort
) VALUE
FROM (SELECT t.*
, 0 localsort
FROM LocalConfig t
WHERE 'Comp Engine A' LIKE Context
UNION
SELECT t.*
, 1 localsort
FROM CentralConfig t
WHERE 'Comp Engine A' LIKE Context
)
ORDER BY NAME;
It's basically the same query, except that I'm inserting that TRANSLATE expression before my localsort and I'm constraining on Context. What it's doing is converting the % and _ characters to chr(1) & chr(2), which will make them sort after alphanumeric characters in the descending sort. In this way, the explicitly defined "Comp Engine A" will come before "Comp Engine %", which in turn will come before "%". In cases where the contexts are defined identically, local config takes precedence over central ones; if you wanted local to always trump central, even in cases when central was scoped more tightly, you'd just reverse the positions of the two sort terms.
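To make the shape concrete, here is a minimal sketch of the two tables and the example rows above (the DDL details such as types and keys are assumptions; only the Context/Name/Value shape and the sample values come from the description):
-- Hypothetical DDL for the two identical config tables; CentralConfig is replicated out read-only.
CREATE TABLE CentralConfig (
    Context  VARCHAR2(100) NOT NULL,   -- may contain % / _ wildcards
    Name     VARCHAR2(100) NOT NULL,
    Value    VARCHAR2(4000),
    CONSTRAINT pk_centralconfig PRIMARY KEY (Context, Name)
);
CREATE TABLE LocalConfig (
    Context  VARCHAR2(100) NOT NULL,
    Name     VARCHAR2(100) NOT NULL,
    Value    VARCHAR2(4000),
    CONSTRAINT pk_localconfig PRIMARY KEY (Context, Name)
);
-- The broadly-scoped rows live centrally...
INSERT INTO CentralConfig (Context, Name, Value) VALUES ('%', 'timeout', '30');
INSERT INTO CentralConfig (Context, Name, Value) VALUES ('Comp Engine %', 'timeout', '5');
-- ...while the site-specific overrides live locally.
INSERT INTO LocalConfig (Context, Name, Value) VALUES ('Comp Engine A', 'timeout', '15');
INSERT INTO LocalConfig (Context, Name, Value) VALUES ('Comp Engine B', 'timeout', '10');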
The way we're doing this is similar to Steve's.
First you need a central configuration service to hold all the configuration you want to apply across the distributed environment. Every time you want to modify the config, modify it in the central configuration service. On each production host you can run a small loop script that pulls down the updated configuration.
For a more sophisticated solution, you need some strategy to avoid pushing a bad configuration batch to all servers at once, which would be a disaster. Maybe you need a simple lock or a grey-release (canary) process.
