Google Webmaster data quality issues

I am running into a strange problem.
We have a standard implementation that pulls data from Search Console and stores it in a database. We cross-checked the data during the implementation and it matched.
Lately we have seen huge differences between what Search Console reports and the data retrieved from the API. In some cases the API data is only 10% lower than the Search Console figures, but in others it is 50% lower.
Is anyone aware of these issues, and has anyone run into this recently?

I had this problem for about a month and have finally fixed it.
This was my original request:
service, flags = sample_tools.init(
    argv, 'webmasters', 'v3', __doc__, __file__,
    scope='https://www.googleapis.com/auth/webmasters.readonly')
I fixed it by removing the ".readonly" from the end of the scope. With the read-only scope I was getting sampled data.
My scope now looks like this and returns full results:
service, flags = sample_tools.init(
    argv, 'webmasters', 'v3', __doc__, __file__,
    scope='https://www.googleapis.com/auth/webmasters')

I'm having the same issue of reconciling to the console. How are you storing the data, i.e. your database table structure?
Have you read about the differences in the aggregation between page and property? These can cause discrepancies.
https://support.google.com/webmasters/answer/6155685?hl=en#urlorsite
For example, when a search query returns multiple pages from your site, aggregating by property counts that as 1 impression. When you group by page, the same search counts once for each of your pages shown in the results, e.g. 3 or 4. Therefore impressions aggregated by query and by date will be lower than impressions aggregated by page.
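To make the difference concrete, here is a minimal sketch of the two counting rules. The `searches` data and both function names are made up for illustration; this is not how Search Console computes its numbers internally, only a model of the rule described above.

```python
from collections import defaultdict

# Hypothetical data: each entry is one search, listing which pages
# of the same site appeared in the results for that search.
searches = [
    {"query": "blue widgets", "pages": ["/a", "/b", "/c"]},
    {"query": "blue widgets", "pages": ["/a"]},
    {"query": "red widgets", "pages": ["/b", "/d"]},
]


def impressions_by_property(searches):
    """One impression per search, however many of the site's pages were shown."""
    return len(searches)


def impressions_by_page(searches):
    """One impression per page per search in which that page was shown."""
    counts = defaultdict(int)
    for search in searches:
        for page in set(search["pages"]):
            counts[page] += 1
    return dict(counts)


by_property = impressions_by_property(searches)
by_page = impressions_by_page(searches)

print(by_property)            # 3
print(sum(by_page.values()))  # 6
```

The same three searches give 3 impressions at property level but 6 when summed over pages, so totals stored per page and totals stored per query will not reconcile with each other.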

Related

How trustworthy are query result counts?

I just ran a query in Logs Explorer for a 24 hour period and it returned 2409575 results.
I then ran the exact same query, changing only the start time and end time to define a time window that is a subset of the previous one, and it returned 2656840 results, which is more than before.
How can this be? My only conclusion from this is that the stated number of log results cannot be trusted. Can someone please inform me of what the expectations are for the log results tally? Can it be trusted?

Thank you for reporting this. The counts shown in the Log Fields panel and the Histogram are reliable. Only the total query results appear to have an issue, due to the nature of the incremental and approximate counts as logs are read. The final value should have been accurate. An internal issue has been created and the team will be fixing this soon.
Disclaimer: I work in Cloud Logging.

Why is some data missing when querying the Chrome UX Report API?

When querying the Chrome UX Report API I sometimes get a 404 error, "chrome ux report data not found". The documentation says: if 404, the CrUX API doesn't have any data for the given origin.
For every URL I query I get some metrics. There is no URL where all metrics are missing, and for most URLs I get all the data.
But there are cases where data for certain metrics is missing. For one URL the FID data is missing (data for all other metrics exists); for other URLs FID, LCP and CLS are missing (data for FCP exists).
Is it some kind of API glitch? What should I do to get data for all queried metrics?
PS: if I query the same URLs now and again after 30 minutes, I get different results. Different metrics are missing for the same URLs: in the first query FCP is missing, in the second query LCP and CLS... Why is that?
The image shows what the missing data looks like:

FCP is the only metric guaranteed to exist. If a user visits a page but it doesn't have an FCP, CrUX throws it away. It's theoretically possible for some users to experience FCP but not LCP, for example if they navigate away in between events. Newer metrics like CLS weren't implemented in Chrome until relatively recently (2019) so users on much older versions of Chrome will not report any CLS values. There are also periodic metric updates and Chrome may require that metrics reflect the latest implementation in order to be aggregated in CrUX.
The results should be stable for roughly 1 full day. If you're seeing changes after only 30 minutes, it's possible that you happened to catch it during the daily update.
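Since a missing metric is expected behaviour rather than an error, the client can simply treat each metric as optional. A minimal sketch, assuming the response has the `record.metrics.<name>.percentiles.p75` shape that the CrUX API returns; the `split_metrics` helper and the sample response are made up for illustration:

```python
# Metric names as they appear as keys in the CrUX API response.
EXPECTED_METRICS = (
    "first_contentful_paint",
    "largest_contentful_paint",
    "first_input_delay",
    "cumulative_layout_shift",
)


def split_metrics(response, expected=EXPECTED_METRICS):
    """Return (p75 values of the metrics present, names of the metrics missing)."""
    metrics = response.get("record", {}).get("metrics", {})
    present = {}
    missing = []
    for name in expected:
        data = metrics.get(name)
        if data is None:
            missing.append(name)
        else:
            present[name] = data.get("percentiles", {}).get("p75")
    return present, missing


# Hypothetical response for a URL where only FCP and LCP were aggregated.
sample_response = {
    "record": {
        "key": {"url": "https://example.com/"},
        "metrics": {
            "first_contentful_paint": {"percentiles": {"p75": 1800}},
            "largest_contentful_paint": {"percentiles": {"p75": 2500}},
        },
    }
}

present, missing = split_metrics(sample_response)
print(present)
print(missing)
```

Storing the missing metrics as nulls, instead of failing the whole record, keeps the data usable for the URLs where only some metrics are available.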

API User Usage Report: Inconsistent Reporting

I'm using a JVM to perform API calls to the Google Apps Administrator API.
I've noticed that with the User Usage Reports I'm not getting complete data for a field I'm interested in (num_docs_externally_visible) or for the fields that go into that field's calculation. I generally request a single day's usage report at a time, across my entire user base (~40k users).
According to the developer documentation, I should be able to see that field in a report 2-6 days afterwards; however, after running reports for the first 3 weeks of February, I've only gotten it for 60% of the days. The pattern appears to be random (I have streaks of up to 4 days in a row with the item appearing and up to 3 days in a row without it, but there is no consistency to this).
Has anyone else experienced this issue? And if so, were you able to resolve it? Or should I expect this behavior to continue if this is an issue with what the API is returning outside of my control?

I think it's only natural that the data you get is not yet complete; it takes a certain number of days for the complete data to arrive.
This SO question is not exactly the same as yours, but I think it will help you, especially the part about needing to use your account's time zone.

Google Analytics: incorrect number of sessions when grouping by Custom Dimension

For a while I have successfully queried the number of sessions for my website, including the number of sessions per 'Lang Code' and per 'Distribution Channel'; both Custom Dimensions I have created in Analytics with their own slot and their Scope Type set to 'Session'.
Recently the number of sessions has decreased significantly when I group by a Custom Dimension, e.g. Lang Code.
The following query gives me a number of say 900:
https://ga-dev-tools.appspot.com/query-explorer/?start-date=2015-10-17&end-date=2015-10-17&metrics=ga%3Asessions
Whereas this query returns around a quarter of that, say ~220:
https://ga-dev-tools.appspot.com/query-explorer/?start-date=2015-10-17&end-date=2015-10-17&metrics=ga%3Asessions&dimensions=ga%3Adimension14
Now, my initial reaction was that 'Lang Code' was not set on all pages, but I checked and this data is guaranteed to be included on all pages of my website.
Also, no changes have been made to the Analytics View I'm querying.
The same issue occurred a couple of weeks ago and at the time I fixed this by changing the Scope Type of said Custom Dimensions to Session, but now I'm no longer sure if this was the correct fix or if this was just a temporary glitch since:
the issue didn't occur before
the issue now reoccurs
Does anyone have any idea what may have caused this data discrepancy?
P.S. to make things stranger, for daily reporting we run this query every night (around 2am), and then the numbers are actually correct, so apparently it makes a difference at what time the query is executed?

REST API for infinite-scrolled query results

I'm building an internal server which contains a database of customer events. The webpage which allows access to the events is going to utilize an infinite scroll/dynamic loading scheme for display of live events as well as for browsing the results of queries to the database. So, you might query the database and maybe get 200k results. The webpage would display the 'first' 50 and allow you to scroll and scroll and scroll to see more and more results (loading perhaps 50 more at time).
I'm supposed to be using a REST API for the database access (a C# server). I'm unsure what the API should look like so that it remains RESTful. I've come up with 3 options. The question is: are any of them RESTful, and which is the most RESTful (is there such a thing? If not, I'll pick one of the RESTful ones)?
Option 1.
GET /events?query=asdfasdf&first=1&last=50
This simply does the query and specifies the range of results to return. The server, unable to keep state, would have to requery the database each time the infinite scroll occurs (though perhaps using the first/last hints to stop early). That seems bad, and there isn't any feedback about how many results are forthcoming.
Option 2 :
GET /events/?query=asdfasdf
GET /events/details?id1=asdf&id2=qwer&id3=zxcv&id4=tyui&...&id50=vbnm
This option first does a query which returns the list of event ids but no further details. The webpage simply has the list of all the ids (so at least it knows the count). The webpage holds onto the event id list and, as infinite scroll/dynamic load is needed, makes another query for the event details of the specified ids. Each id would nominally be a GUID, so about 36 characters per id (plus &id##= for 41 characters). At 50 ids per request, the URL would be quite long, 2000+ characters. The URL limit mentioned elsewhere on SO is around 2k. Maybe if I limit it to 40 ids per request this would be fine. It'd be nice to simply have a comma-separated list instead of all the query parameters. Can you make a query parameter like ?ids=qwer,asdf,zxcv,wert,sdfg,rtyu,gfhj, ... ,vbnm ?
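For what it's worth, a comma-separated value is just an ordinary query parameter that the server splits itself. The sketch below is in Python only to illustrate the idea (the real server here is C#); `build_details_url` and `parse_ids` are made-up helper names, and the `ids` parameter name is taken from the example above.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlsplit


def build_details_url(ids, base="/events/details"):
    """Client side: pack all ids into a single comma-separated parameter."""
    # safe="," keeps the commas literal instead of percent-encoding them as %2C.
    return base + "?" + urlencode({"ids": ",".join(ids)}, safe=",")


def parse_ids(url):
    """Server side: read the single 'ids' parameter and split it on commas."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("ids", [""])[0]
    return [part for part in raw.split(",") if part]


# 50 GUIDs in their 36-character string form.
ids = [str(uuid.uuid4()) for _ in range(50)]
url = build_details_url(ids)

print(len(url))
print(parse_ids(url) == ids)
```

With one separator character per id instead of a `&id##=` prefix, 50 GUIDs come to a bit under 1,900 characters, so the request stays below the roughly 2k limit mentioned above.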
Option 3 :
POST /events/?query=asdfasdf
GET /events/results/{id}?first=1&last=50
This would post the query to the server and cause it to create a results resource. The ID of the results resource would be returned and would then be used to get blocks of the query results which in turn contain the event details needed for the webpage. The return from the POST XML could contain the number of records and other useful information besides the ID. Either the webpage would have to later delete the resource when the query page closed or the server would have to clean them up once they expire (days or weeks later).
I am concerned that Option 1, while RESTful, is horrible for the server. I'm not sure requesting so many resources at once, like the second GET in Option 2, is really RESTful or practical (it seems like there has to be a better way). I'm not sure Option 3 is RESTful at all or, if it is, whether it's sort of cheating the REST thing by creating state via a POST (or should that be PUT?).

Option 3 worked out fine. It required the server to maintain the query results and there was a bit of debate about how many queries (from various users) should simultaneously be saved as there would be no way to know when a user was actually done with a query.
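For anyone weighing the same trade-off, here is a rough sketch of the server-side bookkeeping Option 3 implies: a results resource created by the POST, paged by the GET, and expired after a time-to-live because the server cannot know when a user is done. It is written in Python purely as an illustration and is not the C# implementation that was actually built; `ResultStore`, its method names and the one-hour TTL are all assumptions.

```python
import time
import uuid


class ResultStore:
    """In-memory store of saved query results, each with an expiry time."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._results = {}  # result id -> (expires_at, rows)

    def create(self, rows):
        """Backs POST /events/?query=...: save the rows, return (id, count)."""
        rows = list(rows)
        result_id = str(uuid.uuid4())
        self._results[result_id] = (self._clock() + self._ttl, rows)
        return result_id, len(rows)

    def page(self, result_id, first, last):
        """Backs GET /events/results/{id}?first=..&last=.. (1-based, inclusive)."""
        entry = self._results.get(result_id)
        if entry is None or entry[0] <= self._clock():
            self._results.pop(result_id, None)
            return None  # unknown or expired: the handler would answer 404
        return entry[1][first - 1:last]

    def purge_expired(self):
        """Periodic cleanup; returns how many saved results were dropped."""
        now = self._clock()
        expired = [key for key, (expires_at, _) in self._results.items()
                   if expires_at <= now]
        for key in expired:
            del self._results[key]
        return len(expired)


store = ResultStore(ttl_seconds=60)
result_id, total = store.create(range(1, 201))  # stand-in for 200 event rows
first_page = store.page(result_id, 1, 50)
print(total, len(first_page))
```

Returning the count from `create` gives the webpage the "how many results are coming" feedback that Option 1 lacked, and the TTL bounds how many abandoned queries the server holds at once.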
