What would be the preferred means of describing the size of a 3-dimensional object, such as a tumor, in FHIR?
Extending one of the existing resources, such as Observation, with 3 fields for x, y, z extents seems reasonable, but not particularly general.
Ideally, one might extend complex data types to include a new type. However, I don't see any provisions for extending data types.
Is there a preferred approach to the creation of new models for concepts such as 3-dimensional measurements?
FHIR has an extensibility model (http://www.hl7.org/fhir/extensibility.html) that covers not only extending the Resources but also the datatypes.
That said, I think you should model this as 4 Observations: one representing the main grouped measurement (with no value) and 3 Observations (for x, y, z) to which the main measurement refers using Observation.related (type=hasComponent).
This pattern is a bit verbose. If we were to look at extensions, I don't think this is an extension on Observation, but rather an extension on Quantity. That would raise the question: which of the 3 dimensions goes into Observation.value?
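To make the grouping concrete, here is a minimal sketch of the 4-Observation pattern as Python dicts standing in for FHIR JSON; the codes, ids, and units are illustrative assumptions, not official bindings.

```python
# Minimal sketch of the 4-Observation pattern: one grouping Observation with
# no value, plus one Observation per dimension, linked via Observation.related
# (type=hasComponent). Codes/ids/units below are placeholders, not real bindings.

def dimension_observation(obs_id, axis, value_mm):
    """One Observation holding a single linear extent of the tumor."""
    return {
        "resourceType": "Observation",
        "id": obs_id,
        "code": {"text": f"tumor extent, {axis} axis"},  # placeholder code
        "valueQuantity": {"value": value_mm, "unit": "mm"},
    }

def grouped_tumor_size(x_mm, y_mm, z_mm):
    """Grouping Observation (no value[x]) referring to the three extents."""
    dims = [
        dimension_observation("tumor-x", "x", x_mm),
        dimension_observation("tumor-y", "y", y_mm),
        dimension_observation("tumor-z", "z", z_mm),
    ]
    group = {
        "resourceType": "Observation",
        "id": "tumor-size",
        "code": {"text": "tumor size, 3 dimensions"},
        # note: no value[x] on the grouping Observation itself
        "related": [
            {"type": "has-component",
             "target": {"reference": f"Observation/{d['id']}"}}
            for d in dims
        ],
    }
    return group, dims

group, dims = grouped_tumor_size(30, 22, 18)
print(len(group["related"]))  # 3 related dimension Observations
```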
I'm working on a Seq2Seq model to perform abstractive summarization using the GloVe pre-trained word embeddings. Do I need to make two embedding matrices: one that covers the source vocabulary and one that covers the summary vocabulary?
No, the common practice is to share the embedding matrices even in machine translation where the words are from different languages.
Sometimes, the embedding matrix is also used as an output projection matrix when generating model output (see, e.g., the Attention Is All You Need paper); however, this is only possible if you use a vocabulary of tens of thousands of (sub)words, as opposed to the very large vocabulary of GloVe.
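A minimal sketch of this weight-tying idea, with NumPy standing in for the model's layers; the shapes and names are illustrative assumptions.

```python
# One embedding matrix E is shared by encoder and decoder, and E^T doubles as
# the output projection (the tying used in "Attention Is All You Need").
import numpy as np

vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab_size, d_model))  # shared embedding matrix

def embed(token_ids):
    # lookup used for source *and* summary tokens alike
    return E[token_ids]

def output_logits(hidden):
    # reuse E (transposed) as the output projection over the vocabulary
    return hidden @ E.T

src = embed(np.array([1, 2, 3]))   # encoder input, shape (3, d_model)
logits = output_logits(src)        # shape (3, vocab_size)
print(logits.shape)
```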
Assume I have a number of pictures. Let’s say 10 pictures which are annotated by 50 people each.
So Pic 1 might be "beach, vacation, relax, sand, sun…". I have now trained word2vec on domain-specific content. I have the vectors of each word and can represent them. But what I want now is to create ONE final vector representing each picture – one vector that represents the 50 annotations (beach, vacation, relax, sand, sun…).
Let's assume each vector has 100 dimensions – do I just add up the first dimension of all 50 vectors, then the 2nd dimension of all 50 vectors, etc., across all 100 dimensions?
I am very thankful for any comments that might help me!
I tried this, but I am not sure if this is the right way to do it.
I also tried doc2vec, but I guess this is problematic, as the word order of the annotations is irrelevant – but relevant for doc2vec?
A few thoughts:
A list of annotations isn't quite like natural-language narrative text – in either the relative frequencies of tokens or the importance of neighboring tokens. So you may want to try out an extra-wide range of training parameters. For example, using a giant window (far larger than each of your texts) could essentially negate the (possibly arbitrary) ordering of the annotations, putting every word in every other word's context. (That'd increase training time, but might help in other ways.) Also, look into the newly-tunable ns_exponent parameter – the paper referenced from the gensim docs suggests values very different from the default may help in certain recommendation contexts.
That said, the most-simple way to combine many vectors into one is to average them all together. Whether that works well for your purposes, you'd have to test. (It unavoidably loses the information of a larger set of independent vectors, but if other aspects of your modeling are strong enough – enough training data, enough dimensions – the really important shared aspects may be retained in the summary vector.)
(You can see some code for averaging word-vectors in another recent answer.)
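For completeness, here is a hedged sketch of the averaging approach, with random stand-in vectors in place of a trained word2vec model (in gensim you would look tokens up in model.wv instead).

```python
# Average a picture's 50 annotation vectors into one 100-d picture vector.
# `vectors` is a stand-in for a trained word2vec model's word -> vector map.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["beach", "vacation", "relax", "sand", "sun"]
vectors = {w: rng.normal(size=100) for w in vocab}   # stand-in embeddings

def picture_vector(annotations, vectors):
    """Element-wise mean of the vectors of all in-vocabulary annotations."""
    known = [vectors[w] for w in annotations if w in vectors]
    if not known:
        return np.zeros(100)
    return np.mean(known, axis=0)

v = picture_vector(["beach", "vacation", "relax", "sand", "sun"], vectors)
print(v.shape)  # (100,)
```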
You could also try Doc2Vec. It is no more ordering-dependent than Word2Vec – some modes use a window-sized context in which neighboring words influence each other. (But other modes don't, and, as mentioned above, an oversized window can essentially make neighboring-distances less relevant.)
I'm aiming at providing one-search-box-for-everything model in search engine project, like LinkedIn.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I have to identify those dimensions and the corresponding keywords in a big query before I can retrieve the final result from the database.
Points
I have access to all the possible values for all the dimensions. For example, I have all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field holding the actual value.
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as-it-is.
Iterate through the results up to some cutoff and recognize the first document that introduces a new dimension.
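The gazetteer idea behind this index can be sketched in a few lines of Python: since all values for every dimension are known and overlap little, a plain lookup table can tag each query phrase with its dimension. The names and values below are made up for illustration.

```python
# Toy gazetteer: known phrase -> dimension, mirroring the Lucene index above
# ({name:..., dimension:...}). Entries are illustrative only.
gazetteer = {
    "ieee": "conference", "acm": "conference",
    "information retrieval": "topic", "design patterns": "topic",
    "authorxyz": "author", "authoabc": "author",
    "javaranch": "website",
}

def tag_query(query):
    """Return {dimension: keyword} for every known phrase found in the query."""
    q = query.lower()
    found = {}
    for phrase, dim in gazetteer.items():
        if phrase in q:
            found[dim] = phrase
    return found

print(tag_query("information retrieval papers at IEEE by authorXYZ"))
```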
Problems
Not sure when to stop recognizing the dimensions from the result set. For example, the query may contain only two dimensions but the results may match 3 dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.
Solution 1: How about solving your problem using Named Entity Recognition (NER) from Natural Language Processing? NER can be done using simple Regular Expressions (in cases where the data is fairly static), or you can use a Machine Learning technique like Hidden Markov Models to figure out the named entities in your sequence data. The reason I stress HMMs over other supervised Machine Learning algorithms is that you have sequential data, with each state dependent on the previous or next state. NER would output the dimensions along with the corresponding names. After that, your search becomes a vertical search problem, and you can simply search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now coming to the implementation part: I assume you know Java, as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in, and you can train and test the model on your data set. I am also assuming you have a large data set.
Solution 2: Try to model this problem as a property graph problem. Check out something like Neo4j. I suggest this as your problem falls under a schema-less domain. Your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
Solution 3: Since you said you have all possible values of the dimensions, why not simply convert all your unstructured text into structured data using Regular Expressions? And since you do not have a fixed schema, store the data in a NoSQL key-value database. Most of them provide Lucene integrations for full-text search; then simply search those databases.
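As a sketch of Solution 3, here is a hypothetical regular expression that turns one family of query strings into a structured record; the pattern and field names are assumptions for illustration only.

```python
# Turn a semi-structured string into a structured record with a regex,
# ready to store in a key-value store. Pattern/fields are made up.
import re

pattern = re.compile(
    r"(?P<topic>.+?) paper(?:s)? at (?P<conference>\w+) by (?P<author>\w+)"
)

def to_record(text):
    """Return a dict of named groups, or None if the text does not match."""
    m = pattern.search(text)
    return m.groupdict() if m else None

rec = to_record("information retrieval papers at IEEE by authorXYZ")
print(rec)
```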
What you need to do is calculate the similarity between the query and the document set you are searching. Measures like cosine similarity should serve your need. A hack you can use is to calculate the TF-IDF for each document and build an index from that score, from which you can choose the appropriate one. I would recommend you look into the Vector Space Model to find a method that serves your need.
Give this algorithm a look as well:
http://en.wikipedia.org/wiki/Okapi_BM25
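For reference, here is a compact sketch of the Okapi BM25 scoring formula over a toy corpus; k1 and b are the usual tunables (1.5 and 0.75 here), and the IDF uses the non-negative "+1" variant.

```python
# Okapi BM25 over a toy corpus of pre-tokenized documents.
import math

docs = [
    ["information", "retrieval", "paper"],
    ["design", "patterns", "paper"],
    ["multi", "threaded", "programming"],
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N   # average document length

def idf(term):
    n = sum(term in d for d in docs)    # documents containing the term
    return math.log((N - n + 0.5) / (n + 0.5) + 1)  # "+1" variant, always >= 0

def bm25(query, doc, k1=1.5, b=0.75):
    score = 0.0
    for term in query:
        tf = doc.count(term)
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * tf * (k1 + 1) / denom
    return score

scores = [bm25(["retrieval", "paper"], d) for d in docs]
print(scores.index(max(scores)))  # 0: the doc matching both terms wins
```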
I have to implement the connected-components labeling algorithm in Fortran. I have a clear idea of how to scan the matrix, but what about storing and recovering equivalence classes? I guess that in many other programming languages this is an easy task, but I have to do it in Fortran. How can I do it?
First Edit: Following the pseudocode on Wikipedia for the connected-components algorithm, what I have no idea how to do in Fortran is:
linked[label] = union(linked[label], L)
Here are some fragments of an answer. It looks like you need to implement a data structure which represents a set of labels. The first decision you have to make is to decide how to model a label. I see 3 obvious approaches:
Use integers.
Use character variables of length 1 (or 2 or whatever you want).
Define a type with whatever components you want it to have.
The second decision is how to implement a set of labels. I see 3 obvious approaches:
Use an array of labels (array of integers, array of character(len=2), array of type(label), it doesn't matter) whose size is fixed at compile time. You have to be fairly certain that the size you hard-wire is always going to be large enough. This is not a very appealing approach; I should probably not have mentioned it.
Use an array of labels whose size is set at run-time. This means using an allocatable array. You'll have to figure out how to set this to the right size at run-time, if it is possible at all.
Implement a type representing a set of labels. This type might, for example, model a set as a linked list. But that is not the only way to model the set; the type might model the set of labels as an array, and do some fancy footwork to re-size the array if required. By defining a type, of course, you give yourself the freedom to change the internal representation of the set without modifying the code which uses the functionality exposed by the set type.
Depending on the choices you have made it should be quite straightforward to implement a union function to add a new label to an existing set of labels.
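To illustrate the union logic independently of the storage choice, here is a sketch (in Python, for brevity) of the classic union-find bookkeeping; the single integer parent array translates directly to a Fortran integer allocatable array.

```python
# Equivalence classes of labels kept in one integer array of parents
# (union-find). In Fortran this would be: integer, allocatable :: parent(:)
parent = list(range(100))   # initially, every label is its own class

def find(x):
    """Follow parents to the representative (root) label of x's class."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps chains short
        x = parent[x]
    return x

def union(a, b):
    """Merge the equivalence classes of labels a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)   # keep the smaller label as root

union(2, 5)
union(5, 9)
print(find(9))  # 2: labels 2, 5 and 9 are now one class
```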
Note though, that there are many other ways to tackle this problem. You might, for example, start with a set of already-defined component labels and drop from the set the ones you don't need to use.
Since you seem to be new to Fortran, here's a list of language features you need to be familiar with to implement the foregoing.
How much of the Fortran 2003 standard your compiler implements.
Defining, and using, derived types.
Allocatable arrays, allocating arrays, moving allocations.
Arrays of derived types.
Type-bound procedures.
Pointers, and targets.
I am simulating a NAO robot that has published physical properties for its links and joints (such as dimensions, link mass, center of mass, mass moment of inertia about that COM, etc).
The upper torso will be static, and I would like to get the lumped physical properties for the static upper torso. I have the math lined out (inertia tensors with rotation and parallel axis theorem), but I am wondering what the best method is to structure the data.
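As a sanity-check sketch of the lumping math (assuming each link's inertia tensor has already been rotated into the common torso frame), here is the parallel-axis-theorem combination in Python with made-up numbers.

```python
# Combine several bodies' masses, COMs, and COM-frame inertia tensors into one
# lumped mass, COM, and inertia tensor about that COM. Inputs are assumed to be
# expressed in a common (torso) frame already; the numbers below are made up.
import numpy as np

def shift_inertia(I_com, m, d):
    """Parallel axis theorem: inertia about a point offset by d from the COM."""
    d = np.asarray(d, dtype=float)
    return I_com + m * (np.dot(d, d) * np.eye(3) - np.outer(d, d))

def lump(bodies):
    """bodies: list of (mass, com (3,), I_com (3,3)) tuples in a common frame."""
    M = sum(m for m, _, _ in bodies)
    com = sum(m * np.asarray(c, float) for m, c, _ in bodies) / M
    I = sum(shift_inertia(np.asarray(I0, float), m, np.asarray(c, float) - com)
            for m, c, I0 in bodies)
    return M, com, I

M, com, I = lump([
    (1.0, [0.0, 0.0, 0.0], np.eye(3) * 0.01),
    (2.0, [0.0, 0.0, 0.3], np.eye(3) * 0.02),
])
print(M, com)  # 3.0, COM at z = 0.2
```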
Currently, I am just defining everything as rules, a method I got from looking at data Import[]'d from structs in a MAT file. I refer to attributes/properties with strings so that I don't have to worry about symbols being defined. Additionally, it makes it easier to generate the names for the different degrees of freedom.
Here is an example of how I am defining this:
http://pastebin.com/VNBwAVbX
I am also considering using some OOP package for Mathematica, but I am unsure of how to easily define it.
For this task, I had taken a look at Lichtblau's slides but could not figure out a simple way to apply them to defining my data structure. I ended up using MathOO, which got the job done, since efficiency was not much of a concern and it was more or less a one-time deal.