VisitorID in BigQuery doesn't match Sessions in Google Analytics

I've looked around the web and I keep getting the same answer: to count sessions in BigQuery, take count(distinct concat(fullvisitorID, string(visitID))). But in some cases, that's not even getting me close to the sessions in Google Analytics. Is there any other way to count sessions better? Here's what I'm trying to do:
SELECT hits.customdimensions.value val,
count(*) as pageviews,
exact_count_distinct(CONCAT([fullVisitorId], STRING([visitid]))) sessions
FROM [xxx.ga_sessions_20150619]
where hits.customdimensions.index = 7 and lower(hits.type) = 'page'
group by val
order by pageviews desc
LIMIT 1000
For some custom dimension values, that gets close to GA, but others are off by twice the amount. Is there any way to get a better session count in BQ?

Well, I can't really speak to your GA data itself (of course, check to make sure you're not sampling the data at all), but if you run the following query, you'll pull the sum of each of the session counts per fullVisitorId:
SELECT SUM(sessionsPerUser)
FROM (SELECT fullVisitorId, COUNT(visitNumber) AS sessionsPerUser
FROM [xxx.ga_sessions_2017yyzz]
GROUP BY fullVisitorId)
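As a rough illustration of the two counting methods discussed above, here is a minimal Python sketch on made-up rows (the IDs and timestamps are invented, not real GA data). It assumes a table with exactly one row per session; under that assumption, counting distinct (fullVisitorId, visitId) pairs and summing per-visitor counts should agree.

```python
from collections import Counter

# Toy session rows: (fullVisitorId, visitId, visitNumber). All values are made up.
rows = [
    ("1001", 1434700000, 1),
    ("1001", 1434703600, 2),
    ("1002", 1434700500, 1),
    ("1003", 1434701000, 1),
    ("1003", 1434790000, 2),
]

# Method 1: distinct (fullVisitorId, visitId) pairs, as in the question's recipe.
distinct_sessions = len({(visitor, visit_id) for visitor, visit_id, _ in rows})

# Method 2: count rows per visitor, then sum the counts, as in the answer's query.
sessions_per_user = Counter(visitor for visitor, _, _ in rows)
summed_sessions = sum(sessions_per_user.values())

print(distinct_sessions, summed_sessions)
```

If the two numbers differ on real data, it may be worth checking which rows each query actually sees (for example, hit-level filters such as the custom dimension index or the hit type) before comparing against GA.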

Related

How to take data in portions from Oracle using Mybatis?

In my application I am making a query to oracle and getting data this way
<select id="getAll" resultType="com.mappers.MyOracleMapper">
SELECT * FROM "OracleTable"
</select>
I get all the data, but there is a lot of it and processing everything at once takes too long: the response from the database arrives after 3-4 minutes, which is inconvenient.
How can I receive rows in portions without using the id field (it does not exist; I do not know why)? That is, take the first portion of rows, for example the first 50, process them, and then take the next portion. Ideally, the portion size would be controlled by a variable in the properties file.
I can't work out how to do this in MyBatis, which is new to me. Thanks in advance.
there is such a field and it is unique
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
doesn't work, because the Oracle version is earlier than 12c
If you want to read millions of rows that's going to take time. It's normal to expect a few minutes to read and receive all the data over the wire.
Now, you have two options:
Use a Cursor
In MyBatis you can read the result of the query using the buffering a cursor gives you. The cursor reads a few hundred rows at a time and your app reads them one by one. Your app doesn't notice that behind the scenes there is buffering. Pretty good. For example, you can do:
Cursor<Client> clients = this.sqlSession.selectCursor("getAll");
for (Client c : clients) {
// process one client
}
Keep in mind that cursors remain open until the end of the transaction. If you close the transaction (or exit the method marked as @Transactional) the cursor won't be usable anymore.
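The buffering idea behind a cursor can be sketched outside MyBatis too. The following is a minimal Python illustration, assuming an in-memory SQLite database as a stand-in for Oracle; the `clients` table and the `iter_rows` helper are hypothetical names for this example. The caller sees one row at a time while rows are pulled from the database in batches.

```python
import sqlite3

def iter_rows(conn, sql, batch_size=500):
    """Yield rows one at a time while fetching them from the DB in batches."""
    cur = conn.cursor()
    cur.execute(sql)
    try:
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            yield from batch
    finally:
        # Release the cursor once iteration ends or is abandoned.
        cur.close()

# Stand-in data: SQLite in memory instead of the real Oracle table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO clients (name) VALUES (?)",
    [(f"client-{i}",) for i in range(1200)],
)

processed = 0
for client_id, name in iter_rows(conn, "SELECT id, name FROM clients ORDER BY id"):
    processed += 1  # process one client here

print(processed)
```

As with the MyBatis cursor, the connection has to stay open for as long as the loop is running.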
Use Manual Pagination
This solution can work well for the first pages of the result set, but it becomes increasingly inefficient and slooooooow the more you advance in the result set. Use it only as a last resort.
The only case where this strategy can be efficient is when you have the chance of implementing "key set pagination". I assume it's not the case here.
You can modify your query to perform explicit pagination. For example, you can do:
<select id="getPage" resultType="com.mappers.MyOracleMapper">
  <!-- Sort in the innermost query, then number the rows, then cut the page -->
  select * from (
    SELECT rownum rnum, x.*
    FROM (
      SELECT * FROM OracleTable ORDER BY id
    ) x
    WHERE rownum &lt;= #{endingRow}
  )
  where rnum >= #{startingRow}
</select>
You'll need to provide the extra parameters startingRow and endingRow.
NOTE: It's imperative that you include an ORDER BY clause. Otherwise the pagination logic is meaningless. Choose any ordering you want, preferably one that is backed by an existing index.
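For comparison, "key set pagination" mentioned above can be sketched as follows, assuming a unique, indexed column is available. This is a Python illustration against in-memory SQLite, not Oracle: the `oracle_table` schema, `fetch_page` and `PAGE_SIZE` are hypothetical, and the `LIMIT` clause would need a ROWNUM wrapper on Oracle versions before 12c. Each page starts after the last key already seen, so the database does not have to skip over earlier rows.

```python
import sqlite3

PAGE_SIZE = 50  # hypothetical setting; could be read from a properties file

def fetch_page(conn, last_seen_id, page_size=PAGE_SIZE):
    """Return the next page of rows whose id is greater than last_seen_id."""
    return conn.execute(
        "SELECT id, payload FROM oracle_table WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()

# Stand-in data: 120 rows with a unique, indexed id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oracle_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO oracle_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 121)],
)

last_seen_id = 0
page_sizes = []
while True:
    page = fetch_page(conn, last_seen_id)
    if not page:
        break
    page_sizes.append(len(page))  # process the page here
    last_seen_id = page[-1][0]    # remember where this page ended

print(page_sizes)
```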

Most efficient way to select in bulk from a multi million records table

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:
SELECT * FROM A a WHERE a.id not in (select b.id from B)
Where A is a "complex" entity in the sense that it inherits (InheritanceType.JOINED) from other entities and that several of its attributes are other entities (@OneToOne and @ManyToOne).
The query itself takes a few minutes to yield results, hence my desire to execute it as few times as possible.
Here are the different approaches I tried to get those A elements as efficiently as possible:
Pagination using setFirstResult/setMaxResults
Does the job, but pretty slowly, as the query seems to be executed every time (around 50 elements processed/sec).
Getting IDs first, A objects next
Keeping all the IDs in memory is doable, so I execute once
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
and then select a from A a WHERE a.id = :id, which goes relatively fast as the id column is indexed. This is currently the most efficient solution (around 100 elements processed/sec).
Using ScrollableResults
I had high hopes for this solution, but it ended up being slower than the other alternatives, leaving me at around 20 elements processed/sec...
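One possible variation on the "IDs first" approach is to load the objects in batches with an IN clause instead of one query per ID. The sketch below is a plain Python and SQLite illustration, not Hibernate code; the tables `a` and `b`, the batch size of 100 and the `chunked` helper are assumptions made for the example. Whether this is faster in the real setup would need to be measured.

```python
import sqlite3

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in data: 1000 rows in a, every odd id also present in b.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(i, f"a-{i}") for i in range(1, 1001)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1, 1001, 2)])

# Run the expensive anti-join once and keep only the IDs in memory.
ids = [
    row[0]
    for row in conn.execute(
        "SELECT a.id FROM a WHERE a.id NOT IN (SELECT b.id FROM b) ORDER BY a.id"
    )
]

loaded = 0
queries = 0
for batch in chunked(ids, 100):
    placeholders = ",".join("?" * len(batch))
    rows = conn.execute(
        f"SELECT id, data FROM a WHERE id IN ({placeholders})", batch
    ).fetchall()
    queries += 1
    loaded += len(rows)  # process the batch here

print(len(ids), queries, loaded)
```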
As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.
Hence my questions:
Are there (factually) other approaches to efficiently tackle this kind of problem ?
Is it normal that ScrollableResults performed so poorly ? Is there something I should have paid attention to while implementing this solution?
EDIT:
Here's the execution plan

Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database containing approximately 50,000 products (50 MB)
Assume no db type, we interact with it via REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but (network) efficient way to fetch pages from a result set so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
"totalResults": 2458,
"firstResult": 100,
"pageSize": 100,
"results": [
{"some": "item"},
{"some": "other item"},
// 98 more ...
]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page. Because by the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of the result set. Even a small change can have a big impact: one item removed from the DB, that happened to be on page 0 of the result set, will change what results we will get when requesting all subsequent pages.
Goal
I am looking for a mechanism to make the definition of the result set independent of future database changes, so if someone was looking for shoes and got a result set of 2458 items, he could actually fetch all pages of that result set reliably even if later changes in the DB would otherwise affect it. (For this purpose, I plan not to really delete items, but to set a removed flag on them.)
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with for now is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I am hoping it is not too broad or theoretical. I have a web-based DB, just no good idea on how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.
Versioning the DB is the answer for result set consistency.
Each record has a primary ID, a modification counter (version number) and a timestamp of modification/creation. Instead of modifying record r in place, you add a new record with the same ID, version number + 1 and sysdate as the modification time.
In the fetch response you add the DB request_time (do not use the client timestamp, because the client and server clocks may differ). The first page is served normally, but you return sysdate as request_time. Subsequent pages are served differently: you add a condition like modification_time <= request_time for each versioned table.
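A minimal sketch of this versioning idea, assuming an in-memory SQLite table and integer timestamps to keep the example deterministic. The `products` schema, the `removed` flag column and the `snapshot` helper are illustrative choices, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id INTEGER, version INTEGER, modified_at INTEGER,
        name TEXT, removed INTEGER DEFAULT 0,
        PRIMARY KEY (id, version)
    )
""")

def snapshot(conn, request_time):
    """Latest version of each product as of request_time, skipping removed ones."""
    return conn.execute(
        """
        SELECT p.id, p.name FROM products p
        WHERE p.modified_at <= :t
          AND p.version = (SELECT MAX(v.version) FROM products v
                           WHERE v.id = p.id AND v.modified_at <= :t)
          AND p.removed = 0
        ORDER BY p.id
        """,
        {"t": request_time},
    ).fetchall()

# Initial state at time 10; the first page is requested at time 20.
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 10, "boot", 0),
    (2, 1, 10, "sandal", 0),
])
request_time = 20
first_view = snapshot(conn, request_time)

# Later changes are new versions, never in-place updates or deletes.
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", [
    (1, 2, 30, "boot (renamed)", 0),  # modification
    (2, 2, 31, "sandal", 1),          # soft delete
    (3, 1, 32, "sneaker", 0),         # new product
])
later_view = snapshot(conn, request_time)  # same request_time, same result
current_view = snapshot(conn, 40)          # a fresh request sees the changes

print(first_view == later_view, current_view)
```

The client only has to send the request_time it received with the first page along with every later page request.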
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. The frontend can then request something like next_page with the unique ID that it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation to a "removed" flag, because that makes sure that none of the entries in the result set is deleted. You can discard the result set of the query from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry.
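The server-side cache described above could look roughly like this. It is a minimal in-process Python sketch: the `ResultSetCache` class, its method names and the 300-second lifetime are hypothetical, and a real deployment would more likely use a shared store.

```python
import time
import uuid

class ResultSetCache:
    """Keeps the ordered IDs of a result set under an opaque token with a TTL."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expires_at, ids)

    def store(self, ids):
        """Cache the IDs and return the token the frontend sends back later."""
        token = uuid.uuid4().hex
        self._entries[token] = (self._clock() + self._ttl, list(ids))
        return token

    def page(self, token, page_number, page_size):
        """Return the IDs on the given zero-based page of a cached result set."""
        entry = self._entries.get(token)
        if entry is None or entry[0] < self._clock():
            self._entries.pop(token, None)  # drop expired entries lazily
            raise KeyError("unknown or expired result set")
        ids = entry[1]
        start = page_number * page_size
        return ids[start:start + page_size]

cache = ResultSetCache(ttl_seconds=300.0)
token = cache.store([3, 4, 6, 9, 12, 15, 21])  # IDs in the user's sort order
first_page = cache.page(token, 0, 3)
second_page = cache.page(token, 1, 3)

print(first_page, second_page)
```

Because the IDs are stored in the order of the original query, custom sorting is preserved across pages.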

Limit query size for specific users

Is there a way to limit the rows returned for a user in Oracle?
We have some users that can query tables with millions of records, decreasing the performance of the database, so I would like to know if there is some way to set a maximum number of records per user.
For example, If I have the table: APP.HISTORY with 10,000,000 records and the user 'dummy', I would like to set for dummy user that can only read 10,000 records from it.
For example if 'dummy' execute:
select * from APP.HISTORY
It will only return 10,000 records, instead of trying to fetch all 10,000,000 records.
There isn't any built-in functionality to limit the number of results per user.
However, even if you could, that wouldn't necessarily help you resolve your performance concern.
Consider for example a query like:
select *
from (select *
from app.history
order by some_field desc)
where rownum < 2
According to your requirements, user dummy would be able to run this and get back the single result he's interested in. However, assuming some_field is not indexed, then, even though this query will return a single record, it still has to order all 10,000,000 records to produce that single row.
As suggested by OldProgrammer in the comments, consider using resource consumer groups (Oracle Database Resource Manager), which are a very flexible and configurable way of throttling CPU and I/O usage.
Otherwise, if you don't trust user dummy to write smart and efficient queries, then don't give him direct access to the database.

Is it a good idea to store and access an active query resultset in Coldfusion vs re-querying the database?

I have a product search engine using Coldfusion8 and MySQL 5.0.88
The product search has two display modes: Multiple View and Single View.
Multiple displays basic record info, Single requires additional data to be polled from the database.
Right now a user does a search and I'm polling the database for
(a) total records and
(b) records FROM to TO.
The user always goes to Single view from his current resultset, so my idea was to store the current resultset for each user and avoid querying the database again for (waste a) the overall number of records and (waste b) the single record I already queried before, and then fetch only the detail information I still need for the Single view.
However, I'm getting nowhere with this.
I cannot cache the current resultset-query, because it's unique to each user(session).
The queries are running inside a CFINVOKED method inside a CFC I'm calling through AJAX, so the whole query runs and afterwards the CFC and CFINVOKE method are discarded, so I can't use query of query or variables.cfc_storage.
So my idea was to store the current resultset in the Session scope, which would be updated with every new search the user runs (either pagination or a completely new search). The maximum number of results stored would be the number of results displayed.
I can store the query all right, using:
<cfset Session.resultset = query_name>
This stores the whole query with results, like so:
query
CACHED: false
EXECUTIONTIME: 2031
SQL: SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON
...
LEFT JOIN f ON
...
WHERE a.aktiv = "ja"
AND
... 20 conditions ...
SQLPARAMETERS: [array]
1) ... 20+ parameters
RESULTSET:
[Record # 1]
a: true
style: 402
price: 2.3
currency: CHF
...
[Record # 2]
a: true
style: 402abc
...
This would be overwritten every time a user does a new search. However, if a user wants to see the details of one of these items, I don't need to query (total number of records & get one record) if I can access the record I need from my temp storage. This way I would save two database trips worth 2031 execution time each to get data which I already pulled before.
The tradeoff would be every user having a resultset of up to 48 results (max number of items per page) in Session.scope.
My questions:
1. Is this feasible or should I requery the database?
2. If I have a structure/array/object like the above, how do I pick the record I need out of it by style number, i.e. how do I access the resultset? I can't just loop over the stored query (I have tried this for a while now...).
Thanks for help!
KISS rule. Just re-query the database unless you find the performance is really an issue. With the correct index, it should scale pretty well. If it does become an issue, you can simply add query caching there.
QoQ would introduce overhead (on the CF side, memory & computation), and might return stale data (where the query in session is older than the one on DB). I only use QoQ when the same query is used on the same view, but not throughout a Session time span.
Feasible? Yes, depending on how many users and how much data this stores in memory, it's probably much better than going to the DB again.
It seems like the best way to get the single record you want is a query of query. In CF you can create another query that uses an existing query as its data source. It would look like this:
<cfquery name="subQuery" dbtype="query">
SELECT *
FROM Session.resultset
WHERE style = <cfqueryparam value="#SelectedStyleVariable#" cfsqltype="cf_sql_varchar">
</cfquery>
Note that if you are using CFBuilder, it will probably flag an error for the missing datasource. This is a bug in CFBuilder: you are not required to have a datasource if your dbtype is "query".
Depending on how many records, what I would do is have the detail data stored in application scope as a structure where the ID is the key. Something like:
APPLICATION.products[product_id].product_name
.product_price
.product_attribute
Then you would really only need to query for the ID of the item on demand.
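The "structure where the ID is the key" idea is language independent. Here is a small Python sketch of it, using made-up rows shaped like the resultset in the question (the second price is invented for the example); in ColdFusion the equivalent would be a struct keyed by style or product ID.

```python
# Rows shaped like the stored resultset; the values are illustrative only.
resultset = [
    {"style": "402", "price": 2.3, "currency": "CHF"},
    {"style": "402abc", "price": 4.1, "currency": "CHF"},
]

# Build the lookup once per search, keyed by style number.
by_style = {row["style"]: row for row in resultset}

# Picking one record is then a key lookup instead of a loop or another query.
selected = by_style.get("402abc")
missing = by_style.get("999")  # None when the style is not in the current page

print(selected, missing)
```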
And to improve the "on demand" query, you have at least two "in code" options:
1. A query of query, where you query the entire collection of items once, and then query from that for the data you need.
2. Verity or SOLR to index everything and then you'd only have to query for everything when refreshing your search collection. That would be tons faster than doing all the joins for every single query.
