Is there a way to get the row number or index of a specific record in a large data set? I came across a solution using list.IndexOf(), but it does not work with large data sets because it freezes. I've also tried .Select((entry, index) => new { Id = entry.ID, Index = index }), but it threw an exception saying the LINQ query cannot be translated. I cannot call .ToList() or .AsEnumerable() before fetching the records, as that also freezes on larger data sets. A sketch of both attempts is below.
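For reference, a minimal sketch of the two attempts described above (EF-style; context.Entries, target and targetId are hypothetical names):

// attempt 1: materialize everything, then IndexOf -- freezes on large sets
var all = context.Entries.ToList();
int index = all.IndexOf(target);

// attempt 2: the indexed Select overload -- throws, because the LINQ
// provider cannot translate an (entry, index) lambda to SQL
var row = context.Entries
    .Select((entry, index) => new { Id = entry.ID, Index = index })
    .FirstOrDefault(x => x.Id == targetId);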
Thank you.
I have a view on a Cloudant database that is designed to show events that are happening in the next 24 hours:
function (doc) {
  // activefrom and activeto are in UTC
  // set start to the current time in UTC milliseconds
  var m = new Date();
  var start = m.getTime();
  // end is start plus 24 hours of milliseconds
  var end = start + (24*60*60*1000);
  // only want approved disruptions for today that are not changed conditions
  if (doc.properties.status === 'Approved' && doc.properties.category != 'changed' && doc.properties.activefrom && doc.properties.activeto) {
    if (doc.properties.activeto > start && doc.properties.activefrom < end) {
      emit([doc.properties.category, doc.properties.location], doc.properties.timing);
    }
  }
}
This works fine most of the time, but every now and then the view does not show the expected results.
If I edit the view, even just by adding a comment, the output changes to the expected results. If I re-edit the view and remove the change, the output reverts to the incorrect results.
Is this because of the time-sensitive nature of the view? Is there a better way to achieve the same result?
The date indexed by your MapReduce function is the time at which the server doing the work performs the indexing operation.
Cloudant views are not necessarily generated at the point that data is added to the database. Sometimes, depending on the amount of work the cluster is having to do, the Cloudant indexer is not triggered until later. Documents can even remain unindexed until the view is queried. In that circumstance, the date in your index would not be "the time the document was inserted" but "the time the document was indexed/queried", which is probably not your intention.
Not only that, different copies (replicas) of the database's shards may process the view build at different times, giving you inconsistent results depending on which server you asked!
You can solve the problem by indexing something from your source document, e.g. if your document looked like:
{
  "timestamp": 1519980078159,
  "properties": {
    "category": "books",
    "location": "Rome, IT"
  }
}
You could generate an index using the timestamp value from your document; the view you create would then be consistent across all shards and deterministic.
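For example, a minimal sketch of a deterministic map function, reusing the activefrom field from the original view; it keys on activefrom alone (a simplification of the original overlap test), and the 24-hour window is supplied at query time via startkey/endkey rather than computed during indexing:

function (doc) {
  // only approved disruptions that are not changed conditions
  if (doc.properties &&
      doc.properties.status === 'Approved' &&
      doc.properties.category !== 'changed' &&
      doc.properties.activefrom && doc.properties.activeto) {
    // key on a value stored in the document, never on indexing time
    emit(doc.properties.activefrom, doc.properties.timing);
  }
}

The client then computes now and now + 24 hours at request time, e.g. ?startkey=<now>&endkey=<now+24h>, so every copy of the index returns the same answer for the same request.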
I have a table with around 50,000,000 records.
I would like to fetch one column of the whole table
SELECT id FROM `project.dataset.table`
Running this code in the Web Console takes around 80 seconds.
However when doing this with the Ruby Gem, I'm limited to fetch only 100,000 records per query. With the #next method I can access the next 100,000 records.
require "google/cloud/bigquery"
#big_query = Google::Cloud::Bigquery.new(
project: "project",
keyfile: "keyfile"
)
#dataset = #big_query.dataset("dataset")
#table = #dataset.table("table")
queue = #big_query.query("SELECT id FROM `project.dataset.table`", max: 1_000_000)
stash = queue
loop do
queue = queue.next
unless queue
break
else
O.timed stash.size
stash += queue
end
end
The problem with this is that each request takes around 30 seconds. max: 1_000_000 is of no use; I'm stuck at 100,000 rows per page. This way the query takes over 4 hours, which is not acceptable.
What am I doing wrong?
You should instead run an export job; that way you will have the result as file(s) on GCS, and downloading from there is easy.
https://cloud.google.com/bigquery/docs/exporting-data
There is a Ruby example at https://github.com/GoogleCloudPlatform/google-cloud-ruby/blob/master/google-cloud-bigquery/lib/google/cloud/bigquery.rb
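A minimal sketch of such an export, assuming a GCS bucket you control (the bucket name and file pattern are placeholders):

require "google/cloud/bigquery"

big_query = Google::Cloud::Bigquery.new(
  project: "project",
  keyfile: "keyfile"
)
table = big_query.dataset("dataset").table("table")

# export to sharded CSV files on GCS; the * wildcard lets BigQuery
# split a large table across multiple files
job = table.extract_job("gs://my-bucket/table-export-*.csv")
job.wait_until_done!
raise job.error["message"] if job.failed?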
I'm attempting to make a LINQ Where/Contains query quicker.
The data set contains 256,999 clients. ids is just a simple list of GUIDs and in this case contains only 3 records.
The query below can take up to a minute to return the 3 records, because the logic goes through all 256,999 records to see which of them are within the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to flip the query around and have it check whether the three records are within the pool of 256,999; in a way, that should be much quicker.
I don't want to loop, as the 3 records could be far more (thousands), and more loops mean more hits to the DB.
I don't want to grab all 256,999 records from the DB and then run the query, as it would take nearly the same amount of time.
If I grab just the ids for all 256,999 records from the DB, it takes a second. That is where the ids come from (a filtered, small and simple list).
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
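For instance, a sketch only (SQL Server assumed; the exact columns depend on where the time actually goes, and the table/column names here come from the query in the follow-up below):

-- index the column the ids are matched/joined on; Clients.Id is likely
-- already the clustered primary key, so the join column is the usual gap
CREATE NONCLUSTERED INDEX IX_Assessments_ClientId
    ON Assessments (ClientId);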
The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains sub-queries (below), so I get the ids from a quick and easy query, then use them to get the more detailed information.
SELECT Clients.Id AS ClientId, Clients.ClientRef AS ClientRef,
       Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname AS FullName,
       [Address1], [Address2], [Address3], [Town], [County], [Postcode],
       Clients.Consent AS Consent,
       CONVERT(nvarchar(10), Clients.Dob, 103) AS FormatedDOB,
       CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END AS Gender,
       CONVERT(nvarchar(10), MAX(Assessments.TestDate), 103) AS LastVisit,
       CASE WHEN MAX(CONVERT(integer, Assessments.Submitted)) = 1 THEN 'true' ELSE 'false' END AS Submitted,
       CASE WHEN MAX(CONVERT(integer, Assessments.GPSubmit)) = 1 THEN 'true' ELSE 'false' END AS GPSubmit,
       CASE WHEN MAX(CONVERT(integer, Assessments.QualForPay)) = 1 THEN 'true' ELSE 'false' END AS QualForPay,
       Clients.UserIds AS LinkedUsers
FROM Clients
LEFT JOIN Assessments ON Clients.Id = Assessments.ClientId
LEFT JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname,
         [Address1], [Address2], [Address3], [Town], [County], [Postcode],
         Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds
         -- ,Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains element, as the pool of ids is smaller than the main pool.
The way I've sped it up for now: I do a String.Join on the list of ids and add them as a WHERE ... IN clause within the main SQL. This has reduced the time to a second or so, roughly as sketched below.
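A minimal sketch of that workaround (selectAndJoins and groupByAndOrderBy are placeholder names for the pieces of the SQL above; inlining the values is only reasonable here because the ids are GUIDs generated by the application, not free-form user input):

// build a WHERE ... IN clause from the small list of ids
var inList = string.Join(",", ids.Select(id => "'" + id + "'"));

// the filter must be inserted before GROUP BY, not appended at the end
var sql = selectAndJoins
        + " WHERE Clients.Id IN (" + inList + ") "
        + groupByAndOrderBy;

returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).ToList();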
I have a document structure, JobData, that stores time-based data starting from time 0 up to time t, in ticks. Usually there is a document per second.
public class JobData
{
    public long Ticks { get; set; }
    public double JobValue { get; set; }
}
For simplicity I am showing only one parameter, JobValue, but in reality it is a complex graph of data. My question is: given an input time in ticks, what kind of query would be best for finding the last JobData at or before a given tick?
So if the database has a document at 1000 ticks and then the next one at 2000 ticks, and the user wants to find the state at 1500 ticks, he/she should get the JobData at 1000 ticks as the answer.
The query I am using now is:
var jobData = documentSession.Query<JobData>().Where(t => t.Ticks <= 1500).OrderByDescending(t => t.Ticks).FirstOrDefault();
Is this the right and most efficient query? I have thousands of these JobData nodes and want to just get to the one that is the closest.
Thanks!
Ahmad,
Yes, that is the way to go about it. And it would be very fast.
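For reference, RavenDB answers that dynamic query by creating an auto-index over Ticks the first time it runs. If you prefer to define it explicitly, here is a sketch of an equivalent static index (RavenDB 3.x-style client API assumed; the class name is made up):

using System.Linq;
using Raven.Client.Indexes;

public class JobData_ByTicks : AbstractIndexCreationTask<JobData>
{
    public JobData_ByTicks()
    {
        // index only the field used for range filtering and sorting
        Map = jobs => from job in jobs
                      select new { job.Ticks };
    }
}

// querying against the static index:
var jobData = documentSession.Query<JobData, JobData_ByTicks>()
    .Where(t => t.Ticks <= 1500)
    .OrderByDescending(t => t.Ticks)
    .FirstOrDefault();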