I have a User struct with ID and LoginName fields, and I want this struct to be accessible by either of these fields with a single call to the DB. I know BoltDB is not supposed to handle arbitrary field indexing etc. (unlike SQL), but this case is a little different, as I happen to know in advance the additional field to be used as an index.
So is there some kind of secondary key or multi-key indexing? Or maybe some strategy that I fail to see? If not, I'll just implement it with two calls; I just prefer a "cleaner" solution...
Thanks!
No, it's not there. BoltDB is a lot like Go: clean and simple. And building a layer on top is easy. BoltDB even lets you implement update transactions trivially, so two or more buckets can be updated, or not, atomically. So creating an update transaction that keeps two or more buckets in sync is easy. But it sounds like you know that and just wanted to check that you aren't missing something.
There is no secondary key indexing in BoltDB, but you can implement it.
You can store a LoginName-to-ID mapping in another bucket, and it will technically be the "secondary key" for your struct. That is, first obtain the primary key value from the secondary key, and then the User struct.
If most of your calls look up by LoginName, flip it: store the User struct under the LoginName key and keep an ID-to-LoginName mapping instead, and vice versa.
Be careful: you have to maintain consistency on your own.
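To make that concrete, here is a minimal sketch of the two-bucket approach, assuming the bbolt package (go.etcd.io/bbolt); the bucket names "users" and "users_by_login" and the helper functions are made up for illustration:

package userstore

import (
    "encoding/json"
    "errors"
    "strconv"

    bolt "go.etcd.io/bbolt"
)

type User struct {
    ID        int    `json:"id"`
    LoginName string `json:"login_name"`
}

var errNotFound = errors.New("user not found")

// putUser writes the struct under its ID and the LoginName->ID mapping
// in a single update transaction, so the two buckets stay in sync.
func putUser(db *bolt.DB, u User) error {
    return db.Update(func(tx *bolt.Tx) error {
        users, err := tx.CreateBucketIfNotExists([]byte("users"))
        if err != nil {
            return err
        }
        byLogin, err := tx.CreateBucketIfNotExists([]byte("users_by_login"))
        if err != nil {
            return err
        }
        buf, err := json.Marshal(u)
        if err != nil {
            return err
        }
        id := []byte(strconv.Itoa(u.ID))
        if err := users.Put(id, buf); err != nil {
            return err
        }
        return byLogin.Put([]byte(u.LoginName), id)
    })
}

// getUserByLogin resolves the secondary key to the primary key, then loads
// the struct, all inside one read transaction.
func getUserByLogin(db *bolt.DB, login string) (User, error) {
    var u User
    err := db.View(func(tx *bolt.Tx) error {
        byLogin := tx.Bucket([]byte("users_by_login"))
        users := tx.Bucket([]byte("users"))
        if byLogin == nil || users == nil {
            return errNotFound
        }
        id := byLogin.Get([]byte(login))
        if id == nil {
            return errNotFound
        }
        return json.Unmarshal(users.Get(id), &u)
    })
    return u, err
}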
I have an object Company and multiple methods that can be used to get this object. Ex. GetById, GetByEmail, GetByName.
What I'd like is to cache those method calls with a possibility to invalidate all cache entries related to one object at once.
For example, a company is cached. There are 3 entries in cache with following keys:
Company:GetById:123
Company:GetByEmail:foo@bar.com
Company:GetByName:Acme
All three keys are related to one company.
Now let's assume that company has changed. Then I would like to invalidate all keys related to this company. I didn't find any built-in solution for that purpose.
Tagging cache entries with some common id (companyId for example) and then removing all entries by it would be great, but this feature doesn't seem to exist.
So to answer your question directly: you'd probably want to maintain all the keys related to your company in a list, scan through that list, and delete each associated key with a DEL (or UNLINK) command.
So something like:
LPUSH companies-keys:Acme Company:GetById:123 Company:GetByEmail:foo@bar.com Company:GetByName:Acme
Then
RPOP companies-keys:Acme
and for each entry you get out of the list:
UNLINK keyname
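Stitched together with the go-redis client, that invalidation loop might look like the sketch below; the client options and list key just follow the example above:

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func invalidateCompany(ctx context.Context, rdb *redis.Client, listKey string) error {
    for {
        // Pop the next cached key that was registered for this company.
        key, err := rdb.RPop(ctx, listKey).Result()
        if err == redis.Nil {
            return nil // list exhausted: all related entries are gone
        }
        if err != nil {
            return err
        }
        // UNLINK reclaims the memory asynchronously, like a non-blocking DEL.
        if err := rdb.Unlink(ctx, key).Err(); err != nil {
            return err
        }
    }
}

func main() {
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    if err := invalidateCompany(context.Background(), rdb, "companies-keys:Acme"); err != nil {
        fmt.Println("invalidate failed:", err)
    }
}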
To answer it not so directly: you may want to consider using a Hash rather than separate keys; that way you can just modify one of the fields in the hash rather than having to invalidate all the keys associated with the object.
So you could create it with:
HSET companies:123 id 123 email foo@bar.com name acme
Then you could update a particular entry in the company record with HMSET:
HMSET companies:123 email bar@foo.com
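The same hash commands through go-redis might look like this (a sketch using the same imports and client as the earlier snippet; note that recent Redis versions let HSET set one or many fields, which covers the HMSET case too):

func updateCompanyHash(ctx context.Context, rdb *redis.Client) error {
    // Create the company hash with several fields at once.
    if err := rdb.HSet(ctx, "companies:123",
        "id", "123",
        "email", "foo@bar.com",
        "name", "acme",
    ).Err(); err != nil {
        return err
    }
    // Later, update just the email field; the other fields stay cached.
    return rdb.HSet(ctx, "companies:123", "email", "bar@foo.com").Err()
}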
Since being able to look up a given record by different fields sounds really important to your use case, you may also want to consider adding RediSearch and indexing the fields you want to search on. For the set of fields listed above, an index of:
FT.CREATE companies-idx ON HASH PREFIX 1 companies: SCHEMA id TAG email TEXT name TEXT
might be appropriate - then you could look up a company with a given email like:
FT.SEARCH companies-idx "@email: foo"
I've just figured out a big mistake I made when creating my DynamoDB structure.
I've created 11 tables, one of which is the table most frequently referred to, while the others are complementary tables.
For example, I have a table called "Names" where I hold names (together with other info), and another table called "NamesMappings" holding all the names added to the "Names" table. Each time a user wants to add a name to the "Names" table, he first tries to put the name into "NamesMappings", and only if that succeeds (meaning the name doesn't already exist) does he add the name to the "Names" table. This procedure helps because the name is not unique and is not the primary key in the "Names" table; with this technique I don't have to search the "Names" table to check whether the name exists - instead I try to add it to "NamesMappings", and only if that succeeds do I know it is a unique name.
First of all, I would like to ask whether this is a common approach or there is a better one?
Next, I figured out that with this design I quickly reached 11 tables, each with 5 provisioned read and 5 provisioned write capacity units, which adds up to 55 provisioned read and 55 provisioned write units under the free tier. Then I understood why I get all these charges each month: as the number of tables grows, and I leave the provisioned capacity at its default (both read and write capacity of 5), I get more and more provisioned capacity.
So, what should my conclusion be from this understanding? Should I try to reduce the number of tables, even if it takes more effort to perform scanning and querying inside a table? Or should I keep splitting tables as I do now, but reduce the capacity of these mapping tables that are used only to indicate whether an item exists in another table?
If I understand your problem correctly, you're missing the whole concept of NoSQL databases.
Your Names table should have a Hash key (which is similar to a primary key) holding a uniformly generated identifier (a UUID is a great candidate). This automatically makes the table queryable by that unique identifier. You said, however, that you don't know the ID - you only know the Name. This leads me to think you could create a Global Secondary Index (GSI) on the Name attribute inside the Names table so you can also query by Name. Up to this point, your table structure should look like this:
id | name
Both of them are independently queryable, which gives you a lot of flexibility already.
Now, let's say you want to add the NameMapping attribute (I don't know what it looks like): you can simply add it to the Names table, getting rid of the NamesMappings table and greatly reducing the number of WCUs and RCUs across your account. Your table structure should now look like this:
id | name | mappings
where mappings is, let's say, a JSON object.
Since you can only query on top-level attributes in DynamoDB, you can now perform a query against the name attribute, which has a GSI configured. If the query returns nothing, then the name is unique. But let's say you still need some data inside the mappings object; then you could query by name and, in your code, apply a map/filter/reduce operation on the mappings attribute and decide what to do next.
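For illustration, a query against such a GSI could look roughly like this in Go, assuming the AWS SDK for Go v2; the table name, the index name "name-index" and the sample value are placeholders for whatever you actually configure:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    client := dynamodb.NewFromConfig(cfg)

    // Query the GSI on the name attribute; zero items means the name is unused.
    out, err := client.Query(context.TODO(), &dynamodb.QueryInput{
        TableName:              aws.String("Names"),
        IndexName:              aws.String("name-index"),
        KeyConditionExpression: aws.String("#n = :name"),
        ExpressionAttributeNames: map[string]string{
            "#n": "name", // "name" needs an alias because it is a reserved word
        },
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":name": &types.AttributeValueMemberS{Value: "Alice"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("name already taken:", len(out.Items) > 0)
}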
Remember that duplication is just fine in the NoSQL world. This may look scary if you come from a purely SQL background, but in NoSQL databases data should be stored in such a way that you can fetch all the needed information in one go, thereby avoiding "joins" (joins are still possible in a NoSQL database, but since there are no strong relationships between entities, you need to perform these joins manually at the code level). To give you some real context, imagine you have an Orders table where you keep track of the ordered Products and the Store that the Order belongs to: you'd save both the Products and the Store objects (and not their IDs, as would happen the SQL way) inside the Order object, so if you want to query for a given OrderId in the future, you wouldn't need to make extra calls (aka "joins") to the Product/Store tables to fetch the information, since everything would already be stored inside the Order object.
We are building a Search API in our company for some of our entities - events, leagues and sports - each of which has a name property, and we are having difficulties implementing the business requirements.
TL;DR: What data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
1. The data structure needs to be sorted so the following requirements are easier to implement; therefore insertion should not break this property.
2. The data structure needs to hold information about its entities: the node key (the entity's name property) will be used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
3. The data structure needs to support deletion by id, which is also a property of all entities.
4. It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaaa.." and "aaaz" should appear (e.g. query = "aaa"; index = "aaa", "aaab", "aaaab", "aaaz"; result should be "aaa", "aaab", "aaaab").
5. We need to search by localized node key.
What have we done so far?
We started our first iteration using the built-in red-black tree (SortedSet in C#), and for nodes we had a structure that holds the name property of the entity and all events related to that name property. With one helper method we satisfied business requirements (1), (2) and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects placed in the SortedSet. We do that because deletion requests come only with an id and we cannot recreate the entity from the id, so at insertion time we need to build such a map (maybe augmentation can help?). With this we covered requirement (3).
Now we need to support (5); however, with every iteration (each new business requirement we receive) it is getting harder and harder to implement, and I almost feel like we need to change our data structure in order to address the business criteria better.
What's the problem with the localization?
We can create a new SortedSet and re-use the implementation, but this comes with a huge trade-off. Let me elaborate.
We have around 100 clients, each of which has 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client) which are basically the default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (say English) is the same as the base one. Having said all of that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which in turn introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (it could be AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully solve a lot of the issues and workarounds we have had to make so far and address future requirements better, so implementation is faster and more accurate. However, like I said, I am not an expert in data structures, and sadly I am unable to map these business requirements to a data structure within the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, with the product data as the value. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    public string ProductName { get; set; }
    public string ProductId { get; set; } // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could add a "deleted" flag to those index entries and rebuild the structure periodically to remove the flagged items. That way, a deletion just requires a sequential scan to set the flags, with no immediate restructuring of the index. That's less than ideal, but if insertions and deletions are infrequent, it's quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
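As a rough sketch of that layout (written in Go purely to illustrate the shape; the type and field names are made up):

package catalog

import "sort"

type Product struct {
    ID   string
    Name string
    // ... the rest of the product data
}

// IndexEntry mirrors the class above: one entry per (name, product id) pair,
// including customer-specific names, with a "deleted" flag instead of
// immediate removal from the index.
type IndexEntry struct {
    ProductName string
    ProductID   string
    Deleted     bool
}

type Catalog struct {
    products map[string]Product // primary structure, keyed by product id
    names    []IndexEntry       // secondary name index, kept sorted by ProductName
}

func NewCatalog() *Catalog {
    return &Catalog{products: make(map[string]Product)}
}

func (c *Catalog) Add(p Product, names ...string) {
    c.products[p.ID] = p
    for _, n := range names {
        c.names = append(c.names, IndexEntry{ProductName: n, ProductID: p.ID})
    }
    sort.Slice(c.names, func(i, j int) bool {
        return c.names[i].ProductName < c.names[j].ProductName
    })
}

// RemoveByID deletes from the primary map and flags matching index entries;
// a periodic rebuild can later drop the flagged entries for real.
func (c *Catalog) RemoveByID(id string) {
    delete(c.products, id)
    for i := range c.names {
        if c.names[i].ProductID == id {
            c.names[i].Deleted = true
        }
    }
}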
I am trying to wrap my head around Bleve, and I understand everything that is going on in the tutorials, videos and documentation. However, I get very confused when using it with BoltDB and don't know how to start.
Say I have an existing BoltDB database called data.db populated with values of struct type Person
type Person struct {
    ID   int    `json:"id"`
    Name string `json:"name"`
    Age  int    `json:"age"`
    Sex  string `json:"sex"`
}
How do I index this data so that I can do a search? How do I handle the indexing of data that will be stored in the database in the future?
Any help will be highly appreciated.
Bleve uses BoltDB as one of several backend stores, and that store is separate from where you keep your application data. To index your data in Bleve, simply add it to the index:
index.Index(person.ID, person)
That index exists separately from your application data (whether it's in Bolt, Postgres, etc).
To retrieve your data, you'll need to construct a search request using bleve.NewSearchRequest(), then call Index.Search(). This will return a SearchResult which includes a Hits field where you can retrieve the ID for your object. You can use this to look up the object in your application data store.
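A minimal end-to-end sketch of that flow, assuming the blevesearch/bleve v2 package (github.com/blevesearch/bleve/v2); the index path, sample person and query are illustrative:

package main

import (
    "fmt"
    "strconv"

    "github.com/blevesearch/bleve/v2"
)

type Person struct {
    ID   int    `json:"id"`
    Name string `json:"name"`
    Age  int    `json:"age"`
    Sex  string `json:"sex"`
}

func main() {
    // Create the Bleve index; it lives in its own directory, separate from data.db.
    index, err := bleve.New("people.bleve", bleve.NewIndexMapping())
    if err != nil {
        panic(err)
    }

    // Index a person. Bleve document IDs are strings, so convert the int ID;
    // use the same key you use in BoltDB so hits can be resolved back to Bolt.
    p := Person{ID: 1, Name: "Ada", Age: 36, Sex: "F"}
    if err := index.Index(strconv.Itoa(p.ID), p); err != nil {
        panic(err)
    }

    // Search, then use the hit IDs to fetch the full structs from BoltDB.
    query := bleve.NewMatchQuery("Ada")
    res, err := index.Search(bleve.NewSearchRequest(query))
    if err != nil {
        panic(err)
    }
    for _, hit := range res.Hits {
        fmt.Println("found BoltDB key:", hit.ID)
    }
}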
Disclaimer: I am the author of BoltDB.
How you index your data depends on how you want to query for it.
If you want to query by any arbitrary fields, like {Age:15, Name:"Bob"} then BoltDB isn't an awesome fit for your problem.
BoltDB is simply a key-value store with fast access to sequential keys and efficient prefix seeking. It's not really a replacement for general-purpose databases.
You likely want something more like a document store (e.g. MongoDB) or an RDBMS (e.g. PostgreSQL).
If you just want something that uses simple files and is embedded, you could also use SQLite with a Go driver.
If you want to search by only a single field, like ID or Name, then use that as the key.
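If you do key the bucket by the field you search on, Bolt's cursor gives you the prefix seeking mentioned above. A small sketch, assuming the bbolt package (go.etcd.io/bbolt) and a hypothetical "people" bucket keyed by Name:

package peoplestore

import (
    "bytes"
    "fmt"

    bolt "go.etcd.io/bbolt"
)

// findByNamePrefix prints every record whose key starts with the given prefix.
func findByNamePrefix(db *bolt.DB, prefix string) error {
    return db.View(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("people"))
        if b == nil {
            return nil // nothing stored yet
        }
        c := b.Cursor()
        p := []byte(prefix)
        for k, v := c.Seek(p); k != nil && bytes.HasPrefix(k, p); k, v = c.Next() {
            fmt.Printf("%s => %s\n", k, v)
        }
        return nil
    })
}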
If lookup speed doesn't matter at all, I guess you can use Bolt and just iterate over the entire DB, parse the JSON and check the fields. But that's probably the worst approach you could take.
I am making a site where each account will have an ID.
But I didn't want to make it incremental, meaning:
id=1
id=2
...
id=1000
What I want is to have random IDs:
id=2355
id=5647734
id=23532
...
(The reason is to prevent robots from checking all account profiles by just incrementing an ID in the URL - and maybe other reasons, but that is not the question.)
But, I am worried about performance on registration.
It will be something like this:
while (RANDOM_ID is not taken): generate new RANDOM_ID
When generating a new ID for the new account, I will query the database (MySQL) to check whether the ID already exists, for each generation.
Is there any better solution for this?
Is there any disadvantage of using random IDs?
Thanks in advance.
There are many, many reasons not to do this:
Your solution, as written, is not transactionally-safe; two transactions at the same time could both generate the same "random" ID.
If you serialize the transaction in order to make it safe, you will slaughter performance because the query will keep every single collision row locked until it finds a spare ID.
Using a random ID as the primary key will fragment the hell out of your clustered index. This is bad enough with UUIDs - the whole point of an auto-generated identity column is that you can get a safe sequence out of it.
Why not use a regular primary key, but just not use it in any of your URLs? Generate a secondary non-sequential ID along with it - such as a UUID - index it, and use that column in any public-facing parts of your application instead of the primary key, if you are really worried about security.
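In Go, for example, that could look roughly like the sketch below, assuming the google/uuid package and the go-sql-driver/mysql driver; the accounts table and its public_id column are hypothetical:

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
    "github.com/google/uuid"
)

func main() {
    db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/app")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Keep the auto-increment primary key for internal use; expose only public_id.
    publicID := uuid.NewString()
    if _, err := db.Exec("INSERT INTO accounts (public_id, name) VALUES (?, ?)", publicID, "alice"); err != nil {
        log.Fatal(err)
    }

    // URLs reference the indexed public_id column, never the sequential id.
    var name string
    if err := db.QueryRow("SELECT name FROM accounts WHERE public_id = ?", publicID).Scan(&name); err != nil {
        log.Fatal(err)
    }
    log.Println(name)
}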
You can use UUIDs. A UUID is a unique identifier generated partly from a timestamp (version 4 UUIDs are generated from random bits instead). It's almost certainly guaranteed to be unique, so you don't have to do a query to check.
I do not know what language you're using, but there should be a library or sample code for this in most languages.
Yes, you can use a UUID, but keep your auto_increment field. Just add a new field and set it to something like md5(microtime(true).rand()) or whatever other method you like, and use that unique key throughout the site to build links instead of exposing the primary key in URLs.