I have a huge CSV database of ~5M rows with the following fields:
start_ip,end_ip,country,city,lat,long
I am storing these in LevelDB, using start_ip as the key and the rest as the value.
How can I retrieve the records for which
( ip_key > start_ip and ip_key < end_ip )
Alternative solutions are welcome too.
I assume that your keys are the hash values of the IP and that the hashes are 64-bit unsigned integers. If that's not the case, just modify the code below to account for the proper keys.
void MyClass::ReadRecordRange(const uint64 startRange, const uint64 endRange)
{
// Get the start slice and the end slice
leveldb::Slice startSlice(static_cast<const char*>(static_cast<const void*>(&startRange)), sizeof(startRange));
leveldb::Slice endSlice(static_cast<const char*>(static_cast<const void*>(&endRange)), sizeof(endRange));
// Get a database iterator
shared_ptr<leveldb::Iterator> dbIter(_database->NewIterator(leveldb::ReadOptions()));
// Possible optimization suggested by Google engineers
// for critical loops. Reduces memory thrash.
for(dbIter->Seek(startSlice); dbIter->Valid() && _options.comparator->Compare(dbIter->key(), endSlice) <= 0; dbIter->Next())
{
// get the key
dbIter->key().data();
// get the value
dbIter->value().data();
// TODO do whatever you need to do with the key/value you read
}
}
Note that _options is the same leveldb::Options with which you opened the database instance. You want to use the comparator specified in the options so that the order in which you read the records is the same as the order in the database.
If you're not using boost or tr1, you can either use something similar to shared_ptr or delete the leveldb::Iterator yourself. If you don't delete the iterator, you'll leak memory and get asserts in debug mode.
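For the original question (finding the single range that contains a given IP), a range scan is not strictly needed: if the ranges do not overlap, it is enough to find the greatest start_ip that is not larger than the lookup IP and then check its end_ip. Below is a minimal in-memory sketch of that idea in Python, not LevelDB code. The RangeIndex class and its method names are hypothetical, the bounds are treated as inclusive, and non-overlapping ranges are an assumption. With LevelDB the same idea would likely mean storing start_ip as a fixed-width big-endian integer so that the default bytewise ordering matches numeric ordering, then using Seek() and, if needed, Prev() on the iterator.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


class RangeIndex:
    """Hypothetical helper: lookup over non-overlapping (start, end) ranges."""

    def __init__(self, rows):
        # rows: iterable of (start_ip, end_ip, payload) tuples
        converted = [(ip_to_int(s), ip_to_int(e), p) for s, e, p in rows]
        self._rows = sorted(converted, key=lambda r: r[0])
        self._starts = [r[0] for r in self._rows]

    def lookup(self, ip):
        n = ip_to_int(ip)
        # Index of the last range whose start is <= n
        i = bisect.bisect_right(self._starts, n) - 1
        if i < 0:
            return None
        start, end, payload = self._rows[i]
        # Inclusive upper bound (an assumption; the question uses strict <)
        return payload if n <= end else None


index = RangeIndex([
    ("1.0.0.0", "1.0.0.255", "AU"),
    ("1.0.1.0", "1.0.3.255", "CN"),
])
```

Each lookup costs one binary search, so 5M rows fit comfortably if the table can be held in memory.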
Currently I am using gorm to retrieve data from a database (PostgreSQL, to be specific) and scan it into an array; the data stored in the database is also an array. The problem I am facing is that after scanning the data into an empty int array, it changes into the gorm.Record type, which can't be used for basic array operations like appending, iterating, etc.
Here's a related part of my code:
var winner_selection []int64 // empty int64 array
db.Table("giveaways").Select("Participants").Where("Name = ?", name).Scan(&winner_selection) // scanning the value in array
How can I retrieve the data directly as an array so that it stays an array, or is there any way to convert the gorm.Record type to an array?
I'm using EF Core but I'm not really an expert with it, especially when it comes to details like querying tables in a performant manner...
So what I try to do is simply get the max-value of one column from a table with filtered data.
What I have so far is this:
protected override void ReadExistingDBEntry()
{
using Model.ResultContext db = new();
// Filter Tabledata to the Rows relevant to us. the whole Table may contain 0 rows or millions of them
IQueryable<Measurement> dbMeasuringsExisting = db.Measurements
.Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
&& meas.MachineId == DBMatchingItem.Id);
if (dbMeasuringsExisting.Any())
{
// the max value we're interested in. Still dbMeasuringsExisting could contain millions of rows
iMaxMessID = dbMeasuringsExisting.Max(meas => meas.MessID);
}
}
The equivalent SQL to what I want would be something like this.
select max(MessID)
from Measurement
where MeasuringInstanceGuid = Globals.MeasProgInstance.Guid
and MachineId = DBMatchingItem.Id;
While the above code works (it returns the correct value), I think it has a performance issue when the database table is getting larger, because the max filtering is done at the client-side after all rows are transferred, or am I wrong here?
How to do it better? I want the database server to filter my data. Of course I don't want any SQL script ;-)
This can be addressed by casting the result to a nullable type, so that you do not get an error when no rows match, and then applying a default value for the int. Alternatively, you can just assign the result to a nullable int. Note that this assumes the ID is an integer. The same principle would apply to a Guid as well.
int MaxMessID = dbMeasuringsExisting.Max(p => (int?)p.MessID) ?? 0;
There is no need for the Any() statement as that causes an additional trip to the database which is not desirable in this case.
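The reason the nullable cast helps is that a MAX aggregate over zero rows yields NULL on the database side. The sketch below demonstrates this with Python's built-in sqlite3 module as a stand-in database; it is not EF Core code. The table and column names are taken from the SQL in the question, while the max_mess_id helper and the sample rows are made up for illustration.

```python
import sqlite3


def max_mess_id(conn, instance_guid, machine_id):
    """Return the largest MessID among the filtered rows, or 0 if none match."""
    row = conn.execute(
        "SELECT MAX(MessID) FROM Measurement "
        "WHERE MeasuringInstanceGuid = ? AND MachineId = ?",
        (instance_guid, machine_id),
    ).fetchone()
    # MAX over an empty set is NULL (None in Python), so fall back to 0
    return row[0] if row[0] is not None else 0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Measurement "
    "(MessID INTEGER, MeasuringInstanceGuid TEXT, MachineId INTEGER)"
)
conn.executemany(
    "INSERT INTO Measurement VALUES (?, ?, ?)",
    [(3, "g1", 1), (9, "g1", 1), (5, "g2", 1)],
)
```

Only a single aggregated value travels back to the client, and a single query covers both the empty and the non-empty case.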
I made an auto-increment index:
box.space.metric:create_index('primary', {
parts = {{'id', 'unsigned'}},
sequence = true,
})
Then I try to pass nil in the id field:
metric.id = nil
When I try to insert these values, I get an error:
Tuple field 1 type does not match one required by operation: expected unsigned
What value do I have to pass for an auto-increment field?
Second question: if I use a Tarantool cluster with several instances (for example, a cartridge-based application), is it appropriate to use auto-increment indexes? Could there be cases where duplicate keys appear on different instances?
It is not possible to pass nil. When you assign nil, you erase the field. Use box.NULL instead.
Better still, use some kind of cluster-wide id, which works well across the cluster, instead of auto-increment, which works only inside one node.
For cluster-wide ids I would suggest UUID or something like ULID (for example, from https://github.com/moonlibs/id).
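To show why such ids avoid cross-node collisions, here is a rough Python sketch of a ULID-style id: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is only an illustration of the format, not the API of moonlibs/id or of Tarantool; the new_ulid function name is hypothetical.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U), in ascending ASCII order
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Build a 26-character ULID-style id (hypothetical helper)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness           # 128 bits in total
    chars = []
    for _ in range(26):                                 # 26 * 5 bits >= 128
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the high bits, ids sort by creation time, and the random part makes a clash between nodes very unlikely without any coordination.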
I was redirected here after emailing the author of Dexie (David Fahlander). This is my question:
Is there a way to append to an existing Dexie entry? I need to store things that are large in dexie, but I'd like to be able to fill large entries with a rolling buffer rather than allocating one huge buffer and then doing a store.
For example, I have a 2 GB file I want to store in Dexie. I want to store that file by writing 32 KB at a time into the same store, without having to allocate 2 GB of memory in the browser. Is there a way to do that? The put method seems to only overwrite entries.
Thanks for putting your question here at stackoverflow :) This helps me build up an open knowledge base for everyone to access.
There's no way in IndexedDB to update an entry without also instantiating the whole entry. Dexie adds the update() and modify() methods, but they only emulate a way to alter certain properties. In the background, the entire document will always be loaded in memory temporarily.
IndexedDB also has Blob support, but when a Blob is stored into IndexedDB, its entire content is cloned/copied into the database by specification.
So the best way to deal with this would be to dedicate a table for dynamic large content and add new entries to it.
For example, let's say you have the tables "files" and "fileChunks". You need to incrementally grow the "file", and each time you do that, you don't want to instantiate the entire file in memory. You could then add the file chunks as separate entries into the fileChunks table.
let db = new Dexie('filedb');
db.version(1).stores({
files: '++id, name',
fileChunks: '++id, fileId'
});
/** Returns a Promise with ID of the created file */
function createFile (name) {
return db.files.add({name});
}
/** Appends contents to the file */
function appendFileContent (fileId, contentToAppend) {
return db.fileChunks.add ({fileId, chunk: contentToAppend});
}
/** Read entire file */
function readEntireFile (fileId) {
return db.fileChunks.where('fileId').equals(fileId).toArray()
.then(entries => {
return entries.map(entry=>entry.chunk)
.join(''); // join = Assume chunks are strings
});
}
Easy enough. If you want appendFileContent to be a rolling buffer (with a max size and erase old content), you could add truncate methods:
function deleteOldChunks (fileId, maxAllowedChunks) {
return db.fileChunks.where('fileId').equals(fileId)
.reverse() // Important, so that we delete old chunks
.offset(maxAllowedChunks) // offset = skip
.delete(); // Deletes everything older than the N most recent records
}
You'd get other benefits as well, such as the ability to tail a stored file without loading its entire content into memory:
/** Tail a file. This function only shows an example on how
* dynamic the data is stored and that file tailing would be
* simple to do. */
function tailFile (fileId, maxLines) {
let result = [], numNewlines = 0;
return db.fileChunks.where('fileId').equals(fileId)
.reverse()
.until(() => numNewlines >= maxLines)
.each(entry => {
result.unshift(entry.chunk);
numNewlines += (entry.chunk.match(/\n/g) || []).length;
})
.then (()=> {
let lines = result.join('').split('\n')
.slice(1); // First line may be cut off
let overflowLines = lines.length - maxLines;
return (overflowLines > 0 ?
lines.slice(overflowLines) :
lines).join('\n');
});
}
The reason I know that chunks will come in the correct order in readEntireFile() and tailFile() is that IndexedDB queries always return results ordered primarily by the queried column and secondarily by the primary key, which here is an auto-incremented number.
This pattern could be used for other cases, like logging etc. In case the file is not string based, you would have to alter this sample a little. Specifically, don't use array.join() or string.split().
Is there a way to get all the objects, in key/value format, that share the same secondary index value? I know we can get the list of keys for a secondary index (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need all the objects too. I don't want to perform a separate query for each key to get the object details if there is a way around it.
I am completely new to Riak and I am mainly a front-end guy, so please bear with me if my question is novice level.
In Riak, it's sometimes the case that the better way is to do separate lookups for each key. Coming from other databases this seems strange, and likely inefficient. However, you may find that an index query followed by a bunch of single-object gets is faster than a map/reduce over all the objects in a single go.
Try both approaches and see which turns out faster for your dataset. Variables that affect this include the size of the data being queried, the size of each document, the power of your cluster, and the load the cluster is under.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
function(v, kd, args) {
return [v.key];
}"""
)
results = query.run()
bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # .. do something with the object
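The memory-efficiency point can be made explicit by wrapping the per-key gets in a generator, so that only one object is held at a time. This is a small sketch that assumes only a bucket object with a get(key) method, as in the code above. The iter_objects helper and the FakeBucket class are hypothetical; FakeBucket merely stands in for a real Riak bucket so the example runs without a cluster.

```python
def iter_objects(bucket, keys):
    """Yield (key, object) pairs one at a time instead of building a list."""
    for key in keys:
        yield key, bucket.get(key)


class FakeBucket:
    """Hypothetical in-memory stand-in for a Riak bucket, for illustration only."""

    def __init__(self, data):
        self._data = data
        self.gets = 0  # counts how many fetches have actually happened

    def get(self, key):
        self.gets += 1
        return self._data[key]


bucket = FakeBucket({"k1": {"n": 1}, "k2": {"n": 2}})
pairs = iter_objects(bucket, ["k1", "k2"])
first = next(pairs)  # only the first object has been fetched at this point
```

Each object is fetched only when the consumer asks for the next pair, so memory use on the client does not grow with the size of the result set.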
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
function(v, kd, args) {
var obj = Riak.mapValuesJson(v)[0];
return [ {
'key': v.key,
'data': obj,
} ];
}"""
)
results = query.run()