How would you represent a relational entity as a single unit of retrievable data in BerkeleyDB? - ruby

BerkeleyDB is the database equivalent of a Ruby hash or a Python dictionary, except that you can store multiple values for a single key.
My question is: If you wanted to store a complex datatype in a storage structure like this, how could you go about it?
In a normal relational table, if you want to represent a Person, you create a table with columns of particular data types:
Person
-id:integer
-name:string
-age:integer
-gender:string
When it's written out like this, you can see how a person might be understood as a set of key/value pairs:
id=1
name="john"
age=18
gender="male"
Decomposing the person into individual key/value pairs (name="john") is easy.
But in order to use the BerkeleyDB format to represent a Person, you would need some way of recomposing the person from its constituent key/value pairs.
For that, you would need to impose some artificial encapsulating structure to hold a Person together as a unit.
Is there a way to do this?
EDIT: As Robert Harvey's answer indicates, there is an entity persistence feature in the Java edition of BerkeleyDB. Unfortunately, because I will be connecting to BerkeleyDB from a Ruby application using Moneta, I will be using the standard edition, which I believe requires me to create a custom solution in the absence of this support.

You can always serialize the data (called marshalling in Ruby) to a string and store that instead. The serialization can be done in several ways.
With YAML (advantage: human readable, with multiple implementations in different languages):
require 'yaml'; str = person.to_yaml
With Marshal (Ruby-only, and even Ruby-version specific):
Marshal.dump(person)
This will only work if the person's class is a self-contained entity that does not refer to other objects you don't want included. References to other persons, for example, would need to be handled differently.
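For either format, reading the person back is the mirror image of writing it out. A minimal round-trip sketch, assuming a simple Person class with plain attributes:

require 'yaml'

class Person
  attr_accessor :id, :name, :age, :gender
end

person = Person.new
person.id, person.name, person.age, person.gender = 1, "john", 18, "male"

str = person.to_yaml            # serialize to a human-readable string
restored = YAML.load(str)       # recompose the Person (newer Rubies may need YAML.unsafe_load for custom classes)
restored.name                   # => "john"

binary = Marshal.dump(person)   # or the Ruby-only binary format
restored = Marshal.load(binary)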

If your datastore is able to store it (and BerkeleyDB is, AFAICT), I'd just store one representation of the object's attributes keyed by the object's id, without splitting the attributes across different keys.
E.g. given:
Person
-id:1
-name:"john"
-age:18
-gender:"male"
I'd store the YAML representation in BerkeleyDB under the key person_1:
--- !ruby/object:Person
attributes:
  id: 1
  name: john
  age: 18
  gender: male
If you instead need to store each attribute under its own key in the datastore (why?), you should make sure the keys for a person record incorporate its identifying attribute, which for an ActiveRecord is the id.
In this case you'd store these keys in BerkeleyDB:
person_1_name="john";
person_1_age=18;
person_1_gender="male";
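Either way, since you'll be going through Moneta, its hash-like interface keeps this simple. A rough sketch of the single-blob approach (the :Memory adapter is just a stand-in here; you'd plug in whichever adapter backs onto your BerkeleyDB store):

require 'moneta'
require 'yaml'

store = Moneta.new(:Memory)           # stand-in adapter; swap in your BerkeleyDB-backed one

store["person_1"] = person.to_yaml    # one key, one self-contained unit
john = YAML.load(store["person_1"])   # recompose the whole Person in one read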

Have a look at this documentation for an Annotation Type Entity:
http://www.oracle.com/technology/documentation/berkeley-db/je/java/com/sleepycat/persist/model/Entity.html

Related

What's the best way to store workout information?

I'm playing around with a workout app (Android), and want to match workouts to dates. The basic structure is:
Each date has zero or one workouts.
Each workout has one or more exercises.
Each exercise has a name, and one or more sets.
Each set has a weight, and one or more repetitions.
I'm considering a JSON file, where:
Each date attribute has a list of exercise objects.
Each exercise object has a name, and a list of set objects.
Each set object has a weight attribute and a repetitions attribute.
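For example, a rough sketch of what I have in mind (the date, exercise name and numbers are made up):

{
  "2013-05-01": [
    {
      "name": "bench press",
      "sets": [
        { "weight": 80, "repetitions": 8 },
        { "weight": 85, "repetitions": 6 }
      ]
    }
  ]
}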
Thoughts?
If you are doing it with Android, use classes to represent the different entities you have mentioned.
To persist the information inside the phone, I suggest you use the built-in SQLite database.
If you plan to build the app as the front end for a REST API or web service, then yes, you can use JSON to exchange information with the server. On the server you would persist the data in a database of your choice. I would go with a relational database like MySQL, but for the model you are proposing it would also be feasible to go with a NoSQL alternative.

Google Datastore bulk retrieve data using urlsafe

Is there a way in Google DataStore to bulk fetch entities using their urlsafe key values?
I know about ndb.get_multi([list]), which takes a list of keys and retrieves the entities in bulk, which is more efficient. But in our case we have a webpage with a few hundred entities, embedded with the entities' urlsafe key values. At first we were only doing operations on single entities, so we were able to use the urlsafe value to retrieve the entity and do the operation without much trouble. Now we need to change multiple entities at once, and looping over them one by one does not sound like an efficient approach. Any thoughts?
Is there any advantage to using the entity's key ID directly (versus the key's urlsafe value)? get_by_id() in the documentation does not appear to support getting entities in bulk (it takes only one ID).
If the only way to retrieve entities in bulk is by their keys, yet exposing the keys on the webpage is not a recommended approach, does that mean we're stuck when it comes to bulk operations on a page with a few hundred entities?
The keys and the urlsafe strings are exactly in a 1:1 relationship. When you have one you can obtain the other:
urlsafe_string = entity_key.urlsafe()
entity_key = ndb.Key(urlsafe=urlsafe_string)
So if you have a bunch of urlsafe strings you can obtain the corresponding keys and then use ndb.get_multi() with those keys to get all entities, modify them as needed then use ndb.put_multi() to save them back into the datastore.
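A minimal sketch of that flow (urlsafe_strings and the status property are placeholders for whatever your page and model actually use):

keys = [ndb.Key(urlsafe=s) for s in urlsafe_strings]
entities = ndb.get_multi(keys)             # one batched fetch instead of N single gets
for entity in entities:
    if entity:                             # get_multi returns None for keys with no entity
        entity.status = 'processed'        # placeholder property
ndb.put_multi([e for e in entities if e])  # one batched write back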
As for using IDs: that only works conveniently if you do not use entity ancestry. Otherwise, to obtain a key you need both the ID and the entity's parent key (or its entire ancestry), which is not convenient; better to use urlsafe strings in this case.
But for entities with no parents (aka root entities in the respective entity groups) the entity keys and their IDs are always in a 1:1 relationship and again you can obtain one if you have the other:
entity_key_id = entity_key.id()
entity_key = ndb.Key(MyModel, entity_key_id)
So again from a bunch of IDs you can obtain keys to use with ndb.get_multi() and/or ndb.put_multi().
Using IDs can have a cosmetic advantage over the urlsafe strings: they are typically shorter and easier on the eyes when they appear in URLs or in the page's HTML code :)
Another advantage of using IDs is the ability to split large entities or to deal in a simpler manner with entities in a 1:1 relationship. See re-using an entity's ID for other entities of different kinds - sane idea?
For more info on keys and IDs see Creating and Using Entity Keys.

Confusion about hash tables

I am currently studying for some interviews, and I've heard that at some of these interviews people are sometimes asked to build a data structure from scratch, including a hash table. However, I am having some trouble really understanding hash tables from a programming perspective.
I've been building these data structures from scratch using C++, and I know that using templates I can create linked lists, dynamic arrays, binary search trees, etc., that can store basically whatever type of object (as long as that object is the only type stored in that instance of the structure). So I would assume I could create a template, or "generic", hash table that, depending on the instance, could store a particular type of object. But two things confuse me:
I know that, through a hash function, different keys are mapped to different indices in the array that makes up the hash table. But let's say you are using the hash table you created to store objects of type Book, and then you create another hash table to store objects of type People. Obviously, different types of objects will have different member attributes, and one of those attributes would have to be the key. Would this mean that every object you would ever want to store in your hash table would have to have at least one attribute with the same name? Your hash function has to have some key value to hash, so it needs to know which attribute of the object to use as the key. For example, would every object you want to store have to have an attribute called "key" that the hash function maps to an index of the array? Otherwise, how would it know what "key" to hash?
This also leads to the problem of the hash function itself. I've read that, depending on the dataset you're given, some hash functions are better than others. So if the hash function depends on the dataset, how could you possibly create a hash table data structure that can store any type of object?
So am I just overthinking this? Should I just learn to create a simple hash table that hashes integers when practicing for my interviews? And are hash tables in real life created generically, or do people usually come up with a different hash table depending on the type of data they have?
If this question is better suited for the Computer Science theory stack exchange, please let me know. I am just finding these little details are keeping me from truly understanding this data structure.
You need to separate the hash table from the hash function; these are different functionalities.
There are two common practices to keep your hash table generic and still be able to properly hash objects.
The first is to assume your template type (call it T) implements a hash method, and use it. You don't care how it is implemented, as long as you have it.
The other option is to take, in addition to the template type, a template function hash(T) that needs to be provided when declaring a hash table.
This basically solves both problems: the user, who knows the data distribution better than the library writer, supplies the hash function, and the supplied hash function works on the supplied type, regardless of what the "key" is.
If you choose the second option, you can implement default hash functions for the known primitive types, so users won't need to reinvent the wheel each time they use the library with standard types.
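To make the second option concrete, this is exactly how std::unordered_map is parameterized. A sketch using the Book type from your question (picking isbn as the key attribute is just for illustration):

#include <string>
#include <unordered_map>

struct Book {
    std::string isbn;   // whichever member you decide acts as the key
    std::string title;
};

// The user of the container supplies the hash (and equality) for their own type,
// so the container never needs to know which attribute is "the key".
struct BookHash {
    std::size_t operator()(const Book& b) const {
        return std::hash<std::string>()(b.isbn);
    }
};

struct BookEqual {
    bool operator()(const Book& a, const Book& b) const {
        return a.isbn == b.isbn;
    }
};

std::unordered_map<Book, int, BookHash, BookEqual> copies_in_stock;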

What's the difference between using serialize and store methods

I couldn't find much information online, but it seems either method used in the model enables the same functionality. How are they different, and when should one be used over the other?
Example code:
class User < ActiveRecord::Base
  store :extra_stuff
  serialize :extra_stuff_too
end
Thanks!
store wraps serialize so that you can keep a hash in a single column on your record. You can't, however, query the data inside a store.
serialize basically saves the data as YAML in the column.
serialize can store an array of things:
[thing1, thing2, thing3]
store deals in hashes of key/value pairs:
{thing1: "thing1 value", thing2: "thing2 value"}
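A sketch of the difference in use, assuming the users table has text columns named settings and tags (both names are placeholders):

class User < ActiveRecord::Base
  store :settings, accessors: [:theme, :per_page]   # hash column with typed accessors
  serialize :tags, Array                            # whole column is one YAML array
end

user = User.new
user.theme = "dark"             # reads/writes a key inside the settings hash
user.tags  = ["admin", "beta"]  # replaces the serialized array
user.save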

Would you use an array or a custom-made class for simple data manipulation? (ruby)

I can do a bit of coding in Ruby. I have only just touched objects and I am not so object literate; I mean, I do not think in objects yet :-)
I have data that I scrape from a forum on a regular basis. I need fields like
author, date posted, title, category, number of views, etc. = an array, from my point of view.
Then I want to be able to do these things in Ruby:
save the whole lot (a quick solution is CSV or XML; later probably some SQL database)
sort it by field
load/read my file to update fields, do some statistics, and extract some data
add new fields easily in case I need to
edit and modify my "file/database" outside Ruby.
I believe that I can do every operation (change the number of views of a post, change the date of the last reply in a post, etc.) using either an array or an object.
So my question is: would you use a custom class/object or an array?
Could you tell me why?
It would seem logical to me, at least, to make an object for storing and working with the data that you're scraping. Typically, you'd have instance variables for each of the fields that you have mentioned (author, title, category, views, date_posted) and probably some methods to populate them from the scraped data as well as read/write them.
In terms of storing the data for these objects, using an ORM such as ActiveRecord or DataMapper makes this very easy. An ORM lets you map the data in a data store, such as MySQL, to the corresponding Ruby objects. It will also provide a bunch of convenience methods for saving, updating and querying those objects.
However, it might be a good learning experience to try writing your own methods to map the data to XML files.
Do you mean "would you use an array or a custom-made class" to process this data?
What I would probably do is create a class that stores the data you want internally, as an array or hash. You would then have methods on that class you could call to perform the tasks that you describe.
An object encapsulates data together with behaviour, i.e. functions or operations that can be performed on that data. An array, however, is just a data structure that holds a collection of elements; data structures basically expose data and have no meaningful functions of their own.
Since you want to perform save, sort, update, statistics, etc. operations on your collected data, it makes sense to have a Post object with data/attributes (author, date posted, title, category, etc.) and the operations/methods you would like to perform on that data. Abstracting your object's data and behaviour into a class will make your code easier to maintain and understand: you can see the responsibility of the class from the methods defined on it, and see how those methods change the state of your object by manipulating its attributes/data.
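A minimal sketch of such a class (field names taken from your list; the CSV part is one way to cover the save/sort/edit-outside-Ruby operations you mention):

require 'csv'

class Post
  attr_accessor :author, :title, :category, :views, :posted_on

  def initialize(author, title, category, views, posted_on)
    @author, @title, @category = author, title, category
    @views, @posted_on = views, posted_on
  end

  def to_row
    [author, title, category, views, posted_on]
  end
end

# Sort by a field, then save the lot as CSV so it can be edited outside Ruby
def save_posts(posts, path)
  CSV.open(path, "w") do |csv|
    csv << %w[author title category views posted_on]
    posts.sort_by(&:views).each { |post| csv << post.to_row }
  end
end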
