Google BigQuery API returns "too many free query bytes scanned for this project" - google-api

I am using Google's BigQuery API to retrieve results from their n-gram dataset, so I send multiple queries like "SELECT ngram from trigram_dataset where ngram == 'natural language processing'".
I'm basically using the same code posted here (https://developers.google.com/bigquery/bigquery-api-quickstart), with my query statement substituted in.
On every run, I have to obtain a new authorization code and enter it in the console, which authorizes my program to send queries to Google BigQuery under my project ID. However, after sending 5 queries, it just returns "message" : "Exceeded quota: too many free query bytes scanned for this project".
According to Google BigQuery's policy, the free quota is 100 GB/month, and I don't think I've come anywhere near that. Someone suggested in a previous thread that I should enable billing to use the free quota, which I did, but it's still giving me the same error. Is there any way to check the remaining quota, or otherwise resolve this problem? Thank you very much!

The query you've mentioned scans 1.12 GB of data, so you should be able to run it 89 times in a month.
The way the quota works is that you start out with 100 GB of monthly quota -- if you use it up, you don't have to wait an entire month; you get about 3.3 GB of additional quota every day.
My guess (please confirm) is that you ran a bunch of queries and used up your 100 GB monthly free quota, then waited a day, and were only able to run a few queries before hitting the quota cap again. If this is not the case, please let me know and provide your project id, and I can take a look in the logs.
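One way to keep track of your usage is to check how many bytes a query would scan before actually running it, via a dry run. A minimal sketch, assuming the current google-cloud-bigquery Python client rather than the raw v2 REST API from the quickstart (the table reference follows the shorthand used in the question):

from google.cloud import bigquery

# A dry run is free and reports how many bytes the query would scan,
# which is what counts against the free quota.
client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT ngram
FROM trigram_dataset
WHERE ngram = 'natural language processing'
"""

job = client.query(sql, job_config=config)
print(f"{job.total_bytes_processed / 1e9:.2f} GB would count against the quota")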
Also, note that this isn't the most efficient use of BigQuery; an option would be to batch multiple requests together. In this case you could do something like:
SELECT ngram
FROM trigram_dataset
WHERE ngram IN (
'natural language processing',
'some other trigram',
'three more words')
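If you are calling the API programmatically, the batch can also be passed as a query parameter. A sketch of the same idea, assuming the current Python client and standard SQL (the fully qualified table path is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# 'your_dataset.trigram_dataset' is a placeholder for the real table path
sql = """
SELECT ngram
FROM `your_dataset.trigram_dataset`
WHERE ngram IN UNNEST(@ngrams)
"""

config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ArrayQueryParameter("ngrams", "STRING", [
        "natural language processing",
        "some other trigram",
        "three more words",
    ])
])

for row in client.query(sql, job_config=config).result():
    print(row.ngram)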

Related

Initial Elasticsearch bulk index/insert/upload is really slow, how do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (there are no documents in the index prior to this).
I have a 3-node ES setup with 16 GB of RAM and an 8 GB JVM heap, 1 index, 5 shards.
I have turned off refresh ("-1"), set replicas to 0, and increased the index buffer size to 30%.
On the upload side I have 22 threads, each sending bulk requests of 150 docs. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net::HTTP (for the network call), and the ES Bulk API (no gem).
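For reference, a minimal Python sketch of the kind of requests the script is sending (the original is in Ruby; the cluster address, index name, and document shape here are made up for illustration):

import json
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "documents"            # hypothetical index name

# The settings described above: disable refresh and replicas during the load
requests.put(f"{ES}/{INDEX}/_settings", json={
    "index": {"refresh_interval": "-1", "number_of_replicas": 0}
})

# A _bulk body is newline-delimited JSON: one action line, then one source line
docs = [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": INDEX, "_type": "_doc", "_id": doc["id"]}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

resp = requests.post(f"{ES}/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"})
print(resp.json().get("errors"))   # True if any item in the batch failed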
For all of my nodes and upload machines the CPU, memory, and SSD disk IO are low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million docs everything slows to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is so much slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased that connection pool, which helped, but it still slows to a crawl at about 1 million records; it did reach 2 million over about 8 hours of running.
I also tried an experiment on a big machine that runs the upload job, using 80 threads at 1,000 documents per upload. I did some calculations and found that my documents are about 7-10 KB each, so each bulk index request is 7-10 MB. This reached 1M documents faster, but once you get there everything slows to a crawl. The machine stats are still really low. I see output from the threads about every 5 minutes or so in the job logs, about the same time I see the ES count change.
The ES machines still have low CPU and memory usage. Disk IO is around 3.85 MB/s, and network bandwidth was at 55 MB/s and drops to about 20 MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which may keep a connection open, or try something totally different.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those, and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is a great blog post, no-offset, which explains
why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
how to avoid it: impose a unique order on your entries (for example by ID, or a combination of date and ID, but something absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that iteration, order by ID again, but with the condition that the ID must be greater than the last one from the previous run, fetch the next 10 entries, and again remember the last ID. Repeat until done (see the sketch after the next paragraph).
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.
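A minimal sketch of that keyset-pagination loop, assuming a PostgreSQL table with a unique, increasing id column (psycopg2, the connection string, and the documents table are illustrative; the same pattern works with ActiveRecord):

import psycopg2

conn = psycopg2.connect("dbname=app")   # hypothetical connection string
cur = conn.cursor()

last_id = 0
while True:
    # No OFFSET: start each batch where the previous one ended
    cur.execute(
        "SELECT id, payload FROM documents WHERE id > %s ORDER BY id LIMIT 150",
        (last_id,),
    )
    batch = cur.fetchall()
    if not batch:
        break
    # ... hand the batch to the bulk indexer here ...
    last_id = batch[-1][0]   # remember the last id for the next iteration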

Elasticsearch to Google BigQuery

How do we send data from Elasticsearch to Google BigQuery? Is there any specific connector?
I have been looking into various options and need the data to be available in Google BigQuery in real time.
I found the google_bigquery Logstash output plugin, which might be useful, but I have never used it personally.
Experiment with the settings depending on how much log data you generate, how quickly you need to see "fresh" data, and how much data you could afford to lose in the event of a crash. For instance, if you want to see recent data in BQ quickly, you could configure the plugin to upload data every minute or so (provided you have enough log events to justify that).
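If you would rather push events yourself instead of going through the plugin, BigQuery also has a streaming insert API. A rough sketch using the Python client (the table path and row fields are made up):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.search_events"   # placeholder table

rows = [
    {"timestamp": "2019-01-01T00:00:00Z", "message": "hello"},
]

# Streamed rows become queryable within seconds, which suits "real time" needs
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert errors:", errors)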

Generate number of search requests over a given year

Does anyone know if there is a way to generate a report that details how many search requests the GSA has handled over a given timeframe?
On the admin console: Reports > Search Logs.
Select the appropriate collection or All Collections.
Select the desired date range for the Report Timeframe.
From memory, this only has access to a maximum of 90 days of historical data, so if you haven't been regularly exporting this data then you'll need to extrapolate the values from what is available.
As pointed out by #BigMikeW, the logs only retain data for 90 days. Unless you download them every 90 days, you won't get it.
The other way is to integrate with Google Analytics and pass all search data to GA's search behavior reports. That way you can use GA to play around with and export a year or even more of data. This is what I use.

What will be the wait time before BigQuery executes a query?

Every time I execute a query in Google BigQuery, the Explanation tab shows that there is an average wait time involved. Is it possible to know the wait time as a percentage or in seconds?
Since BigQuery is a managed service, a lot of customers around the globe are using it. It has an internal scheduling system based on the billingTier (explained here: https://cloud.google.com/bigquery/pricing#high-compute) and other internals of your project. Based on this, the query is scheduled for execution depending on cluster availability, so there will be some minimum time before it finds a cluster of machines to execute your job.
I have never seen significant wait times there. If you do have this issue, contact Google support so they can look at your project. If you edit your original question and add a job ID, a Google engineer may check whether or not there is an issue.
It's currently not exposed in the UI.
But you can find a similar concept in the API (search for "wait" on the following page):
https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource
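As a rough way to see how long a particular job sat in the queue before it started running, you can compare its creation and start timestamps. A sketch assuming the Python client and a job ID taken from the UI or jobs list (the ID shown is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("your_job_id")   # placeholder job ID

queued = (job.started - job.created).total_seconds()
ran = (job.ended - job.started).total_seconds()
print(f"queued for {queued:.1f}s, executed in {ran:.1f}s")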
Is it possible to reduce the BigQuery execution wait time to a minimum?
Purchase more BigQuery Slots.
Contact your sales representative or support for more information.

API User Usage Report: Inconsistent Reporting

I'm using a JVM to perform API calls to the Google Apps Administrator API.
I've noticed that with the User Usage Reports, I'm not getting complete data for a field I'm interested in (num_docs_externally_visible) and the fields that feed into its calculation. I generally request a single day's usage report at a time, across my entire user base (~40k users).
According to the developer documentation, I should be able to see it in a report 2-6 days later; however, after running reports for the first 3 weeks of February, I've only gotten it for 60% of the days. The pattern appears to be random (I've had streaks of up to 4 days in a row of the item appearing and 3 days in a row of it not, but there is no consistency to it).
Has anyone else experienced this issue? If so, were you able to resolve it? Or, if this is an issue with what the API returns and therefore outside my control, should I expect this behavior to continue?
I think it's only natural that the data you get is not yet complete; it takes a certain number of days to receive the complete data.
This SO question is not exactly the same as yours, but I think it will help you, especially the part about needing to use your account's time zone.
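A small sketch of how you could check, day by day, whether the field has been populated yet, assuming the Admin SDK Reports API via the Python client library with domain-wide delegation (the service account file, admin subject, and the "docs:" parameter prefix are assumptions based on the field named in the question):

from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",   # placeholder key file
    scopes=["https://www.googleapis.com/auth/admin.reports.usage.readonly"],
    subject="admin@your-domain.com",   # placeholder admin to impersonate
)
service = build("admin", "reports_v1", credentials=creds)

report = service.userUsageReport().get(
    userKey="all",
    date="2016-02-01",                              # one day at a time
    parameters="docs:num_docs_externally_visible",  # assumed parameter name
).execute()

for usage in report.get("usageReports", []):
    names = {p["name"] for p in usage.get("parameters", [])}
    if "docs:num_docs_externally_visible" not in names:
        # the metric has simply not been computed for this user/date yet
        print(usage["entity"]["userEmail"], "no data yet")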
