BigQuery - Can only fetch 100,000 records - Ruby

I have a table with around 50,000,000 records.
I would like to fetch one column of the whole table:
SELECT id FROM `project.dataset.table`
Running this code in the Web Console takes around 80 seconds.
However, when doing this with the Ruby gem, I'm limited to fetching only 100,000 records per query. With the #next method I can access the next 100,000 records.
require "google/cloud/bigquery"
#big_query = Google::Cloud::Bigquery.new(
project: "project",
keyfile: "keyfile"
)
#dataset = #big_query.dataset("dataset")
#table = #dataset.table("table")
queue = #big_query.query("SELECT id FROM `project.dataset.table`", max: 1_000_000)
stash = queue
loop do
queue = queue.next
unless queue
break
else
O.timed stash.size
stash += queue
end
end
The problem with this is that each request takes around 30 seconds. max: 1_000_000 has no effect; I'm stuck at 100,000 rows per page. At that rate the whole fetch takes over 4 hours, which is not acceptable.
What am I doing wrong?

You should instead run an export job; that way you will end up with the results as file(s) on GCS, and downloading from there is easy.
https://cloud.google.com/bigquery/docs/exporting-data
A Ruby example is here: https://github.com/GoogleCloudPlatform/google-cloud-ruby/blob/master/google-cloud-bigquery/lib/google/cloud/bigquery.rb
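For illustration, a minimal sketch of such an export using the Python client (the Ruby gem exposes the analogous Table#extract_job); the bucket name is hypothetical, and since an export job dumps whole tables, you would first materialize SELECT id into a destination table if you only want that column:

from google.cloud import bigquery

client = bigquery.Client(project="project")

# Export the table to GCS as CSV. The * wildcard lets BigQuery shard the
# output into multiple files, which is required for tables over 1 GB.
extract_job = client.extract_table(
    "project.dataset.table",     # or the table you materialized SELECT id into
    "gs://my-bucket/ids-*.csv",  # hypothetical bucket you control
)
extract_job.result()  # blocks until the export job completes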

Related

What is the difference between a SOQL for loop vs a SOQL list

As per the documentation, a SOQL for loop retrieves all sObjects using calls to query and queryMore, whereas a list for loop retrieves a fixed list of records in one go. It is advisable to use a SOQL for loop over a list for loop to avoid heap size limit errors.
Total heap size limit: 6 MB synchronous and 12 MB asynchronous.
In the case below, say each record takes 2 KB, so 50,000 records take 50,000 × 2 = 100,000 KB (roughly 100 MB in conList), which causes a heap size limit error since the allowed limit is 6 MB for synchronous execution.
List<Contact> conList = new List<Contact>();
conList = [SELECT Id, Phone FROM Contact];
To avoid this we should use a SOQL for loop: the con variable below holds only a chunk of records at a time (one record when iterating a single sObject, or a batch of up to 200 when iterating a List), i.e. only a few KB of heap at once, thus preventing the heap size limit error.
for (List<Contact> con : [SELECT Id, Name FROM Contact]) {
    // con holds at most 200 records per iteration
}
Question - what does it mean that the SOQL for loop's con variable will have 1 record at a time, i.e. 2 KB of data at a time?
The main difference is that with a SOQL for loop you can't use the retrieved records outside of the loop. When you store the records in a List, you can manipulate that list inside the for loop, but you can also use it in other operations at a later time.
To give you an example:
List<Contact> conList = [SELECT Id, Name FROM Contact LIMIT 100];
for (Contact c : conList) {
    c.Title = 'Mr/Mrs';
}
update conList; // I am able to use the same list in the update call.
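Regarding the original "1 record at a time" question, here is a loose analogy in Python (not Apex semantics, just the memory behaviour): materializing a list holds every record on the heap at once, while consuming a generator holds one item at a time, the way the SOQL for loop hands you one chunk per iteration.

# Loose Python analogy for the heap behaviour, not an Apex equivalent.
def fetch_rows(n=50_000):
    """Stand-in for a query result; yields one record at a time."""
    for i in range(n):
        yield {"id": i, "phone": "555-0100"}

all_rows = list(fetch_rows())  # like conList = [SELECT ...]: every record in memory
for row in fetch_rows():       # like a SOQL for loop: one record in memory at a time
    pass                       # process row here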

How to add data to one table from a large number of tables

I have a problem with my project. I have more than 45 spreadsheets with 6 sheets each. My script must find matching rows between those tables and another table, then insert each matching row into a results table. It works, but there is one problem: the standard quota is 100 requests per 100 seconds. I tried to work around it with time.sleep(1) after each request, but with more than 45 tables it takes far too long to find all the matching rows.
if i == x:
    dopler_cell_list = dopler.range(f'A{len(lol2) + length}:AI{len(lol2) + length}')
    time.sleep(1)
    for cell in dopler_cell_list:
        cell.value = output_cell_list[count].value
        time.sleep(1)
        count += 1
How can I make it faster?
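Since the quota counts requests, the main lever is cutting the number of requests rather than sleeping between them. A minimal sketch, assuming the gspread library that the range/cell calls above suggest: set the cell values locally, then push the whole range with one update_cells call, so a row costs one write request instead of one per cell.

# Batching sketch; dopler, lol2, length, output_cell_list as in the question.
import time

row = len(lol2) + length
dopler_cell_list = dopler.range(f'A{row}:AI{row}')
for cell, src in zip(dopler_cell_list, output_cell_list):
    cell.value = src.value             # local assignment, no API call
dopler.update_cells(dopler_cell_list)  # a single write request for the row
time.sleep(1)                          # stay under 100 requests / 100 s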

LINQ AsNoTracking running slow

I need a non-edited (read-only) list of items from my DB.
It was running slow, so I was trying to squeeze out some speed increases.
So I added AsNoTracking to the LINQ query, and it ran slower!
The following code took on average 7.43 seconds; AsNoTracking comes after the Where:
var result = await _context.SalesOrderItems.Where(x => x.SalesOrderId == SalesOrderId ).AsNoTracking().ToListAsync();
The following code took on average 8.62 seconds; AsNoTracking comes before the Where:
var result = await _context.SalesOrderItems.AsNoTracking().Where(x => x.SalesOrderId == SalesOrderId ).ToListAsync();
The following code took on average 6.95 seconds; there is no AsNoTracking:
var result = await _context.SalesOrderItems.Where(x => x.SalesOrderId == SalesOrderId ).ToListAsync();
So am I missing something? I always thought AsNoTracking() should run faster and be ideal for read-only lists.
Also, this table has two child tables.
The first time a query is run it must be compiled. If entities are already tracked by the context, then a tracking query will return those instances rather than creating new ones.
That is the reason why tracked entities might execute faster than with AsNoTracking().
https://github.com/aspnet/EntityFrameworkCore/issues/14366
But an execution time of about 7 seconds indicates that this is not a tracking vs. no-tracking issue. It suggests a database issue (a non-indexed column, for example): a query against about 1.5 billion records returns in roughly 30 ms, not seconds, if the data is set up properly.

LINQ Query Where Contains

I'm attempting to make a LINQ Where/Contains query quicker.
The data set contains 256,999 clients. The ids variable is just a simple list of GUIDs and might contain only 3 records.
The query below can take up to a minute to return the 3 records, because the logic goes through all 256,999 records to check which of them are within the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to flip this around and have the query check whether the three records are within the pool of 256,999; that way it should be much quicker.
I don't want to do a loop, as the 3 records could be far more (thousands); the more loops, the more hits to the DB.
I don't want to grab all the DB records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 from the DB, it takes a second. This is where the Ids come from (a filtered, small and simple list).
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is that only one field is returned and it's only a single-table query.
The main query contains sub queries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more details information.
SELECT Clients.Id AS ClientId, Clients.ClientRef AS ClientRef,
       Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname AS FullName,
       [Address1], [Address2], [Address3], [Town], [County], [Postcode],
       Clients.Consent AS Consent,
       CONVERT(nvarchar(10), Clients.Dob, 103) AS FormatedDOB,
       CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END AS Gender,
       CONVERT(nvarchar(10), MAX(Assessments.TestDate), 103) AS LastVisit,
       CASE WHEN MAX(CONVERT(integer, Assessments.Submitted)) = 1 THEN 'true' ELSE 'false' END AS Submitted,
       CASE WHEN MAX(CONVERT(integer, Assessments.GPSubmit)) = 1 THEN 'true' ELSE 'false' END AS GPSubmit,
       CASE WHEN MAX(CONVERT(integer, Assessments.QualForPay)) = 1 THEN 'true' ELSE 'false' END AS QualForPay,
       Clients.UserIds AS LinkedUsers
FROM Clients
LEFT JOIN Assessments ON Clients.Id = Assessments.ClientId
LEFT JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname,
         [Address1], [Address2], [Address3], [Town], [County], [Postcode],
         Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds
         -- commented out: Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains element, since the pool of Ids is smaller than the main pool.
The way I've sped it up for now: I used String.Join on the list of Ids and added them to a WHERE ... IN clause within the main SQL. That has brought the time down to a second or so.
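For what it's worth, here is that WHERE ... IN idea sketched in Python (the language used elsewhere on this page) with a pyodbc-style cursor; conn and ids are assumptions, and parameter placeholders are safer than joining raw GUID strings into the SQL.

# Hypothetical sketch: hand the small id list to the database so the
# filtering happens there, not client-side. conn and ids are assumptions.
placeholders = ", ".join("?" for _ in ids)  # one placeholder per id
sql = f"SELECT * FROM Clients WHERE Clients.Id IN ({placeholders})"
rows = conn.cursor().execute(sql, list(ids)).fetchall()  # parameters avoid injection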

streaming and bulk update to elasticsearch

As part of data analysis, I collect records I need to store in Elasticsearch. As of now I gather the records in an intermediate list, which I then write via a bulk update.
While this works, it has its limits when the number of records is so large that they do not fit into memory. I am therefore wondering if it is possible to use a "streaming" mechanism, which would allow me to:
- persistently open a connection to Elasticsearch
- continuously update in a bulk-like way
I understand that I could simply open a connection to Elasticsearch and classically update as data become available, but this is about 10 times slower, so I would like to keep the bulk mechanism:
import elasticsearch
import elasticsearch.helpers
import elasticsearch.client
import random
import string
import time
index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')
if elasticsearch.client.IndicesClient(es).exists(index=index):
    ret = elasticsearch.client.IndicesClient(es).delete(index=index)

data = list()
for i in range(1, 10000):
    data.append({'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))})

start = time.time()
# this version takes 25 seconds:
# for _ in data:
#     res = es.bulk(index=index, doc_type="document", body=_)
# and this one takes 2 seconds:
elasticsearch.helpers.bulk(client=es, index=index, actions=data, doc_type="document", raise_on_error=True)
print(time.time()-start)
You can always simply split the data into n approximately equally sized chunks, each small enough to fit in memory, and then do n bulk updates. That seems like the easiest solution to me.
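A minimal sketch of the generator-based variant of this, assuming the same es and index as in the question's snippet: elasticsearch.helpers.streaming_bulk consumes any iterable and batches it into bulk requests internally (500 actions per request by default), so the full list never has to exist in memory.

import random
import string

import elasticsearch.helpers

def gen_docs(n=10000):
    # Yield documents one at a time instead of building a big list first.
    for _ in range(n):
        yield {'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))}

# streaming_bulk sends chunked bulk requests as it consumes the generator
# and yields an (ok, item) tuple per document.
for ok, item in elasticsearch.helpers.streaming_bulk(es, gen_docs(), index=index, doc_type="document"):
    if not ok:
        print("failed:", item)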
