In biology, a dichotomous key (see example) is a flowchart of yes/no decisions that helps you identify a species, like "Does it have fur? yes/no" --> "Does it have feathers? yes/no".
Right now, I have a bunch of them as simple rows in a database, but there is no relationship between each key and its parent; it's just a flat list of rows, and all the yes/no logic is handled in code.
It feels like there should be some way to store this kind of related information in a "graph" of some sorts, but I'm not sure of the best approach.
I know how to use relational databases, but have minimal experience with GraphQL. Am I even on the right path? What's the best way to store things like this (essentially taxonomic trees with mutually exclusive siblings) in a sane data store, preferably so that an algorithm could generate a tree by looking at the data?
Any graph database would work well for your needs.
I would generally not recommend storing these in a relational database unless you want to use some of the NoSQL features of the database (i.e., JSONB in PostgreSQL). Even in that case, I would probably recommend limiting how deeply you nest each record.
If you have relatively low nesting, you can probably get away with it. But keep in mind, every level of nesting is another database query you need to perform. There are performance tricks you can apply, but they're typically not worth it.
Though really, you could probably store this as a hash or something, unless it has a lot of records (like a key with 500 options). I wouldn't store them as separate rows; I would just serialize the data to store it, then deserialize it to get back something you can use. That's really not searchable, though; if you need searchability, look at NoSQL database structures.
Your database table would then look like this (JSON-serialized, pretty-printed for easier reading):
Name       | Dichotomous Keys
Vertebrate | {
  "Does it have fur?":      {"yes": "Mammal", "no": "Does it have feathers?"},
  "Does it have feathers?": {"yes": "Bird",   "no": "Does it have dry skin?"},
  "Does it ...?":           {"yes": "...",    "no": "..."},
  ...
}
Then use a bit of logic to dig into it. You could compact the hash further, but it's probably not worth it. The above could become a RAM issue, but you can also think of the hash as a form of data cache; as long as you are storing a few KB (and not MB) of data in a field, you should be fine.
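For illustration, here's a minimal sketch of that lookup logic in Ruby, assuming the column has been deserialized from JSON (the names here are hypothetical):

```ruby
require 'json'

# The "Dichotomous Keys" column, deserialized into a Ruby hash of
# question => {"yes" => ..., "no" => ...} pairs.
keys = JSON.parse(<<~JSON)
  {
    "Does it have fur?":      {"yes": "Mammal", "no": "Does it have feathers?"},
    "Does it have feathers?": {"yes": "Bird",   "no": "Does it have dry skin?"}
  }
JSON

# Follow the yes/no answers until we reach a value that is not itself
# another question; that value is the classification.
def identify(keys, question, answers)
  node = question
  node = keys[node][answers.shift] while keys.key?(node)
  node
end

puts identify(keys, "Does it have fur?", %w[no yes]) # => "Bird"
```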
Note:
GraphQL is definitely not what you are looking for; that's for building APIs. The "graph" part comes from being able to request data in the form of a graph that describes exactly what data you want, rather than the "here is the data you get, take it or leave it" response most APIs will give you. For example (this is definitely not a valid GraphQL request, but it's similar enough; actual ones are a little more "wordy"): `Fetch user(id=100): { [:id, :email, :phone] }`, to which the GraphQL server would respond with `{ user: { id: 100, email: something@something.com, phone: 123-456-7890 } }` instead of the massive amount of data an API *might* otherwise send about a user.
Related
Is it possible, using the AWS Ruby SDK (or just DynamoDB in general), to get an item or items from a table that uses a primary key only, and where that primary key ends with a certain string?
I haven't come across anything in the docs that explicitly answers this question, either in the Ruby DDB docs or the general docs for DDB. I'm not saying the question is not answered, but if it is, I can't find it.
If it is possible, could someone provide an example for ruby or link to the docs where an example exists?
Although @Ryan is correct and this can be done with query, just bear in mind that you're doing a "full-table scan" here. That might be OK for a one-time job, but probably not the best practice for a routine task (and of course not as part of your API calls).
If your use case involves quickly finding objects based on their suffix in a specific field, consider extracting that suffix (assuming it's a fixed-size suffix) into another field and adding a secondary index on it (see the sketch below). If you want to query arbitrary-length suffixes, I would create a lookup table and update it with the possible suffixes (or some of them, to save calls), then filter when querying.
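A minimal sketch of the fixed-size-suffix idea; the table name, attribute names, and the `id_suffix-index` GSI are all hypothetical:

```ruby
require 'aws-sdk-dynamodb'

SUFFIX_LEN = 4 # assumed fixed suffix length
client = Aws::DynamoDB::Client.new

# On write, store the key's suffix as its own attribute so a global
# secondary index on it can serve "ends with" lookups without a scan.
def put_with_suffix(client, table, id, attrs = {})
  item = attrs.merge('id' => id, 'id_suffix' => id[-SUFFIX_LEN..-1])
  client.put_item(table_name: table, item: item)
end

# On read, query the GSI instead of scanning the whole table.
resp = client.query(
  table_name: 'my_table',
  index_name: 'id_suffix-index',
  key_condition_expression: 'id_suffix = :s',
  expression_attribute_values: { ':s' => 'ab12' }
)
```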
It looks like you would want to use the Query method on the SDK to find the items you're looking for. It seems that "EndsWith" is not available as a comparison operator in the SDK, though, so you would need to use CONTAINS and then check your results locally.
This should lead to the best performance, letting DynamoDB do the initial heavy lifting and then further pruning the results once you receive them.
http://docs.aws.amazon.com/sdkforruby/api/Aws/DynamoDB/Client.html#query-instance_method
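A rough sketch of that contains-then-filter-locally approach with the Ruby SDK; note it uses `scan`, since DynamoDB's key conditions don't support `contains()` on the key itself, and the table and attribute names are hypothetical:

```ruby
require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new(region: 'us-east-1')
suffix = 'xyz'
matches = []

# Let DynamoDB prune to items whose key merely *contains* the suffix,
# then keep only the true "ends with" matches locally.
params = {
  table_name: 'my_table',
  filter_expression: 'contains(id, :s)',
  expression_attribute_values: { ':s' => suffix }
}

loop do
  resp = client.scan(params)
  matches.concat(resp.items.select { |item| item['id'].end_with?(suffix) })
  break unless resp.last_evaluated_key # page through the full table
  params[:exclusive_start_key] = resp.last_evaluated_key
end
```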
I have a MarkLogic database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and it is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;

<response> {
  for $doc in fn:doc()/ns:entity[$fromCount to $toCount]
  return
    <doc> omitted for brevity </doc>
} </response>
The problem is that the query gets slower the further through the document set the requested page is; presumably because it's having to load every document in order, check whether it's the right type, and iterate until it's found $fromCount ns:entity elements before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although they are in different directories, so xdmp:directory() might be an option; something I'll look into).
There also isn't currently an index on the ns:entity element; would that help? It's always the root node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: API for its built-in paging, but it seems like overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root nodes of a certain type in the database. Does MarkLogic maintain such a thing? If not, would an index solve the problem?
Edit: It turns out that the answer in my case is to use the xdmp:directory() option, since MarkLogic apparently stores a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it's having to load every document in order, check whether it's the right type, and iterate until it's found $fromCount ns:entity elements before it even begins building the response
The usual answer is cts:search plus the unfiltered option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n) even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically the database is guarding against returning false positives, unless you tell it not to.
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example, if the last item on page 1 was "cat", you would use "cat" as arg2 when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n) - but at a much smaller scale.
I am new to the idea of programming algorithms. I can work with simplistic ideas, but my current project requires that I create something a bit more complicated.
I'm trying to create a categorization system based on keywords and subsets of 'general' categories that filter down into more detailed categories, and it should require as little work as possible from the user.
E.g.:
Sports >> Baseball >> Pitching >> Nolan Ryan
So, if a user decides they want to talk about "Baseball" and they filter the search, I would like to also include "Sports".
User enters: "baseball"
User is then taken to Sports >> Baseball
Now I understand that this would be impossible without a living, breathing, dynamic program that connects those two categories in some way. It would also require 'some' user input initially, and many more inputs throughout the lifetime of the software in order to maintain it and keep it up to date.
But alas, asking for such an algorithm would be frivolous without detailing very concrete specifics about what I'm trying to do, and I'm not trying to ask for a handout.
Instead, I am curious if people are aware of similar systems that have already been implemented and if there is documentation out there describing how it has been done. Or even some real life examples of your own projects.
In short, I have a 'plan' but it requires more user input than I really want. I feel getting more info on the subject would be the best course of action before jumping head first into developing this program.
Thanks
IMHO it isn't as hard as you think. What you want is called tagging, and you can do it automatically just by setting up the correlations between tags (i.e., a tag can have its meaningful information plus its relation to other tags). Then, if the user selects a tag, you relate it to others by looking it up in your ADT collection (which can be as simple as an array):
Tag: Sport
Related tags:
  Football
  Soccer
  ...
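For instance, a minimal sketch in Ruby, assuming a plain Hash as the collection of tag correlations (all names here are hypothetical):

```ruby
# Each tag maps to the tags "under" it.
RELATED_TAGS = {
  'sports'   => ['baseball', 'football', 'soccer'],
  'baseball' => ['pitching'],
  'pitching' => ['nolan ryan']
}

# Given a tag, find the tags that point to it (its "parents"), so that
# selecting "baseball" can also pull in "sports".
def parents_of(tag)
  RELATED_TAGS.select { |_, related| related.include?(tag) }.keys
end

parents_of('baseball') # => ["sports"]
```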
I'm hoping this helps!
It sounds like what you want to do is create a tree/menu structure, and then be able to rapidly retrieve the "breadcrumb" for any given key in the tree.
Here's what I would think (a sketch follows these steps):
Create the tree with all the branches. It's okay if you want branches to share keys, as long as you can give the user a choice when multiple are found ("Multiple found, please choose which one?").
For every key in the tree, generate the breadcrumb. This is time-consuming, and if the tree is very large and updates regularly, then it may be better done offline, in the cloud, via Hadoop, etc.
Store the key and the breadcrumb in a key/value store such as Redis, or in memory/cached as desired. You'll want every value to be an array if you want to share keys across categories/branches.
When the user selects a key, the key is looked up in the store; if the resulting value contains only one match, you simply construct the breadcrumb to take the user where you want them to go. If it has multiple, you give them a choice.
I would even say, if you need something more organic - say a user can create a "new topic" dynamically from anywhere - then you might not want to use a tree at all after the initial import; instead, just update your key/value store in real time.
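A minimal sketch of the first three steps, assuming the tree is a nested Hash (the shape and names are hypothetical):

```ruby
TREE = {
  'Sports' => {
    'Baseball' => { 'Pitching' => { 'Nolan Ryan' => {} } }
  }
}

# Walk the tree once and record the breadcrumb for every key. Keys that
# appear on multiple branches accumulate multiple breadcrumbs, which is
# when you'd prompt the user to choose.
def breadcrumbs(tree, trail = [], out = Hash.new { |h, k| h[k] = [] })
  tree.each do |key, children|
    crumb = trail + [key]
    out[key.downcase] << crumb.join(' >> ')
    breadcrumbs(children, crumb, out)
  end
  out
end

LOOKUP = breadcrumbs(TREE) # persist to Redis/memory as desired
LOOKUP['baseball'] # => ["Sports >> Baseball"]
```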
I'm trying to make an algorithm that easily simplifies and groups synonyms (with mismatches, capitals, acronyms, etc.) into only one term. I suppose there should exist a standard way to build a structure such that, when you look up a string with possible mismatches, if the string exists in the structure, it returns a normalized string key. In short, sometimes the same concept can be written in several ways, but I only want to keep the concept.
For instance: suppose I want to normalize or simplify the appearances of
"General Director", "General Manager", "G, Dtor", "Gen Dir", ...
into
"GEN_DIR"
and keep only this result for further reference.
By the way, I suppose that building a Hash with key/value pairs like
hash["General Director"]="GEN_DIR"
hash["General Manager"]="GEN_DIR"
hash["G, Dtor"]="GEN_DIR"
hash["G, Dir"]="GEN_DIR"
could be a solution, but I suspect that there are more elegant or adequate solutions.
I would also need a way to persist this associative structure easily, without any database, because it should grow as I find more mismatches of the same word or sentence. A possible approach, I think, is to define this structure by means of a DSL, but I'm open to suggestions.
Well, there is no rule, at least not a clear one.
My aim is to scrape from the web some "structured" data that is sometimes incorrectly or incompletely typed. Some fields are descriptions and can be left as is. But some fields are supposed to be "sets" and aren't correctly typed (as in my example). A human reading them immediately knows what they mean and can associate them with their meaning.
But I would like to automate as much as possible the process of reducing those possible mismatches to only one "string" (or symbol) before, for instance, saving it into a database. So, what I would need is a kind of hash or dictionary, as sawa correctly stated, that I can use to look up any such dirty string and get the normalized string or symbol.
Also, of course, it would be desirable for this hash (or whatever else it could be) to learn from new mismatches in some way and add a new association automatically (possibly based on a distance measure between the mismatched string and the normalized string: if it's lower than X, a new association is built). The whole association (i.e., the hash) should grow as new mismatches and concepts arise, and it should be persisted somewhere (possibly in an XML file, or something like what Mori answered below) for future use.
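Something like this minimal sketch is what I have in mind; the threshold, the file name, and the inlined edit-distance function are all assumptions:

```ruby
require 'yaml'

STORE    = 'synonyms.yml'
MAX_DIST = 5 # the hypothetical threshold "X"

# Plain Levenshtein edit distance, inlined to avoid gems or databases.
def levenshtein(a, b)
  row = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    prev, row[0] = row[0], i
    b.each_char.with_index(1) do |cb, j|
      prev, row[j] = row[j], [row[j] + 1, row[j - 1] + 1,
                              prev + (ca == cb ? 0 : 1)].min
    end
  end
  row[b.size]
end

hash = File.exist?(STORE) ? YAML.load_file(STORE) : {}
hash['General Director'] ||= 'GEN_DIR' # seed association

# Look up a dirty string; if it's unknown but close enough to a known
# mismatch, learn it by copying that mismatch's normalized value.
def normalize(hash, dirty)
  return hash[dirty] if hash.key?(dirty)
  nearest = hash.keys.min_by { |k| levenshtein(dirty.downcase, k.downcase) }
  hash[dirty] = hash[nearest] if nearest &&
    levenshtein(dirty.downcase, nearest.downcase) <= MAX_DIST
end

normalize(hash, 'General Directer') # => "GEN_DIR" (and now learned)
File.write(STORE, hash.to_yaml)     # persist for future runs
```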
Any new ideas?
I'm looking into Thinking Sphinx for its potential to solve an indexing problem. It looks like it has a very specific API for telling it which fields to index on a model. I don't like having this layer of abstraction in my way without being able to sidestep it. The thing is, I don't trust Sphinx to be able to interpret my model properly, as this model could have any conceivable property. Basically, I want to encode JSON in an RDBMS. In a way, I'm looking to make an RDBMS behave like MongoDB (RDBMSes have features I don't want to do without). If TS or some other index could be made to understand my models, this could work. Is it possible to manually provide key/value pairs to TS?
"person.name.first" => "John", "person.name.last" => "Doe", "person.age" => 32,
"person.address" => "123 Main St.", "person.kids" => ["Ed", "Harry"]
Is there another indexing tool that could be used from Ruby to index JSON?
(By the way, I have explored a wide variety of NoSQL databases. I am trying to address a very specific set of requirements.)
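For reference, a minimal sketch of producing the dotted key/value pairs shown above from a nested structure (plain Ruby; the shape is hypothetical):

```ruby
def flatten_keys(node, prefix = nil, out = {})
  node.each do |key, value|
    path = [prefix, key].compact.join('.')
    if value.is_a?(Hash)
      flatten_keys(value, path, out) # recurse into nested objects
    else
      out[path] = value # leaves (including arrays) become values
    end
  end
  out
end

person = { 'person' => { 'name' => { 'first' => 'John', 'last' => 'Doe' },
                         'age' => 32, 'address' => '123 Main St.',
                         'kids' => ['Ed', 'Harry'] } }
flatten_keys(person)
# => {"person.name.first"=>"John", "person.name.last"=>"Doe",
#     "person.age"=>32, "person.address"=>"123 Main St.",
#     "person.kids"=>["Ed", "Harry"]}
```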
As Matchu has pointed out in the comments, Sphinx usually interacts directly with the database. This is why Thinking Sphinx is built like it is.
However, Sphinx (but not Thinking Sphinx) can also accept XML data formats - so if you want to go down that path, feel free. You're going to have to understand the underlying Sphinx structure much more deeply than you would if using a normal relational database/ActiveRecord and Thinking Sphinx approach. Riddle may be useful for building a solution, but you'll still need to understand Sphinx itself first.
Basically, when you're specifying what you want to index - that is, when you want to build your own index - you're using the Map part of Map/Reduce. CouchDB supports exactly this. The only problem I ran into with Couch is that I wanted to query other document objects as the basis of my Map/Reduce, since those documents would contain metadata about how I want to build my indexes. This goes against the grain of Map/Reduce, however, as you have to map each document in isolation, with no external data. If you need external data, it would instead be denormalized into your documents.