Optimizing Data Translation - data-structures

Our business deals with houses and over the years we have created several business objects to represent them. We also receive lots of data from outside sources, and send data to external consumers. Every one of these represents the house in a different way and we spend a lot of time and energy translating one format into another. I'm looking for some general patterns or best practices on how to deal with this situation. How can I write a universal data translator that is flexible, extensible, and fast?
Background: A house generally has 30-40 attributes such as size, number of bedrooms, roof type, construction material, siding material, etc. These are typically represented as key/value pairs. A typical translation problem is that one vendor will represent the number of bedrooms as a single key/value pair: NumBedrooms=3, while a different vendor will have a key/value pair per bedroom: Bedroom=master, Bedroom=small, Bedroom=small.
There's nothing particularly hard about the translation, but we spend a lot of time and energy writing and testing translations. How can I optimize this?
Thanks
(My environment is .Net)

The best place to start is by creating an "internal representation", which is the representation that your processing will always use. Then create translators from and to "external representations" as needed. I'd imagine that this is what you are already doing, but it should be mentioned for completeness. The optimization comes from being able to selectively write import and export translators only when you need them.
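As a rough illustration of the internal-representation idea, here is a minimal Python sketch using the bedroom example from the question. The House class, the function names, and the "unknown" placeholder are all assumptions made for illustration; a real internal model would carry the other 30-40 attributes too.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Pairs = List[Tuple[str, str]]  # external data as key/value pairs


@dataclass
class House:
    """Hypothetical internal representation; only bedrooms are modelled."""
    bedrooms: List[str] = field(default_factory=list)


def from_vendor_a(pairs: Pairs) -> House:
    """Vendor A style: a single NumBedrooms=3 pair."""
    count = int(dict(pairs)["NumBedrooms"])
    # Vendor A does not say what kind of bedroom each one is.
    return House(bedrooms=["unknown"] * count)


def from_vendor_b(pairs: Pairs) -> House:
    """Vendor B style: one Bedroom=<kind> pair per bedroom."""
    return House(bedrooms=[value for key, value in pairs if key == "Bedroom"])


def to_vendor_a(house: House) -> Pairs:
    return [("NumBedrooms", str(len(house.bedrooms)))]


def to_vendor_b(house: House) -> Pairs:
    return [("Bedroom", kind) for kind in house.bedrooms]
```

With N external formats this needs at most 2N small translators instead of one per pair of formats, and each can be tested against the internal model in isolation.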
A good implementation strategy is to externalize the transformation if you can. If you can get your inputs and outputs into XML documents, then you can write XSLT transforms between your internal and external representations. The goal is to be able to set up a pipeline of transformations from an input XML document to your internal representation. If everything is represented in XML and using a common protocol (say... hmm... HTTP), then the process can be controlled using configuration. BTW - this is essentially the Pipes and Filters design pattern.
Take a look at Yahoo pipes, Apache Cocoon, XML pipeline, and NetKernel for inspiration.

My employer back in the 90s faced this problem. We had a standard format we converted the customers' data to and from, as D.Shawley suggests.
I went further and designed a simple format-description language; we described our standard format in that language and then, for a new dataset, we'd write up its format too. Then a program would take both descriptions and convert the data from one format to the other, with automatic type conversions, safety checks, etc. (This came in handy for some other operations as well, not just these initial/final conversions.)
The particulars probably won't help you -- chances are you deal with completely different kinds of data. You can likely profit from the general principle, though. The "data definition language" needn't necessarily be a fancy thing with a parser and scanner; you might define it directly with a data structure in IronPython, say.
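To make the "description as a data structure" idea concrete, here is a small Python sketch. It is not the language described above, only a guess at the general shape: each format is a plain dictionary mapping a shared field name to an external key and a type, and one generic function converts between any two descriptions. The format names and fields are invented for illustration.

```python
# Each description maps a shared field name to (external key, type).
# STANDARD and VENDOR_X are hypothetical formats.
STANDARD = {"bedrooms": ("bedrooms", int), "size_sqft": ("size_sqft", float)}
VENDOR_X = {"bedrooms": ("NumBedrooms", str), "size_sqft": ("SqFt", str)}


def convert(record, source, target):
    """Convert one record from the source description to the target one."""
    out = {}
    for name, (src_key, _src_type) in source.items():
        if name not in target:
            continue  # the target format has no such field
        if src_key not in record:
            raise KeyError(f"missing field {src_key!r}")  # simple safety check
        dst_key, dst_type = target[name]
        out[dst_key] = dst_type(record[src_key])  # automatic type conversion
    return out
```

Adding a new dataset then means writing one more dictionary rather than one more translator.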

Related

Natural Language Processing for Smart Homes

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
1. Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
2. Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
3. Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
4. General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
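The first two points (dropping filler words and mapping synonyms) can be sketched with plain dictionaries. This is only a minimal illustration; the word lists are invented and would have to be built from the objects and actions the house actually supports.

```python
import re

# Hypothetical word lists. "my" is deliberately NOT a stop word.
STOP_WORDS = {"please", "the", "a", "an"}
SYNONYMS = {
    "enable": "turn on",
    "switch on": "turn on",
    "disable": "turn off",
    "switch off": "turn off",
}


def normalize(sentence):
    """Lower-case, drop stop words, and rewrite synonyms to one canonical form."""
    words = [w for w in re.findall(r"[a-z']+", sentence.lower())
             if w not in STOP_WORDS]
    text = " ".join(words)
    for phrase, canonical in SYNONYMS.items():
        # \b keeps the replacement from firing inside longer words.
        text = re.sub(rf"\b{re.escape(phrase)}\b", canonical, text)
    return text
```

After this step "enable lights" and "turn on lights" are the same string, so later stages only need to handle the canonical wording.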
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend on the NLP problem, you may use the Wit API (http://wit.ai), which maps natural language sentences to JSON.
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am in no way a pioneer in NLP (I love it, though), but let me try my hand at this one. For your project I would suggest going through the Stanford Parser (SP).
1. From your problem definition I guess you don't need anything other than verbs and nouns. SP generates POS (part-of-speech) tags that you can use to prune the words you don't require.
2. For this I can't think of any better option than what you have in mind right now.
3. For this, again, you can use the grammatical dependency structure from SP, and I am pretty sure that it is good enough to tackle this problem.
4. This is where your research part lies. I guess you can find enough patterns using grammatical dependencies (GD) and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm would be efficient enough to handle every kind of input sentence (structured + unstructured), but something that is more than 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
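The shape of that mapping can be shown with a toy scorer. In this Python sketch, keyword overlap stands in for whatever learned model ends up producing the confidences; the command identifiers and keyword sets are made up, and only the sentence is used as input (speaker location and time of day are ignored).

```python
# Hypothetical command catalogue: identifier -> keywords that describe it.
COMMANDS = {
    "KITCHEN_LIGHT_ON": {"kitchen", "light", "on"},
    "KITCHEN_LIGHT_OFF": {"kitchen", "light", "off"},
    "PORCH_LIGHT_ON": {"porch", "light", "on"},
}
THRESHOLD = 0.70  # tunable


def score(sentence):
    """Return a confidence in [0.0, 1.0] for every command."""
    words = set(sentence.lower().split())
    return {cmd: len(words & keys) / len(keys) for cmd, keys in COMMANDS.items()}


def choose(sentence):
    """Return the best-matching command, or None if nothing is confident enough."""
    scores = score(sentence)
    best = max(scores, key=scores.get)
    return best if scores[best] > THRESHOLD else None
```

Swapping in a trained classifier later only changes score(); the threshold logic in choose() stays the same.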
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be one of the supervised ones. To generate training data, have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work, so other people can help. You can then use these as your training set for your ML algorithm.

Format for representing GIS data

Is there an open data format for representing such GIS data as roads, localities, sublocalities, countries, buildings, etc.?
I expect that format would define address structure and names for components of address.
What I need is a data format to return in response to reverse geocoding requests.
I looked for it on the Internet, but it seems that every geocoding provider defines its own format.
Should I design my own format?
Does my question make any sense at all? (I'm a newbie to GIS).
In case I have not made myself clear I don't look for such data formats as GeoJSON, GML or WKT, since they define geometry and don't define any address structure.
UPD. I'm experimenting with different geocoding services and trying to isolate them into separate module. I need to provide one common interface for all of them and I don't want to make up one more data format (because on the one hand I don't fully understand domain and on the other hand the field itself seems to be well studied). The module's responsibility is to take partial address (or coordinates) like "96, Dubininskaya, Moscow" and to return data structure containing house number (96), street name (Dubininskaya), sublocality (Danilovsky rn), city (Moscow), administrative area (Moskovskaya oblast), country (Russia). The problem is that in different countries there might be more/less division (more/less address components) and I need to unify these components across countries.
Nope, unfortunately there is not.
Why, you may ask?
Because different nations and countries have vastly different formats and requirements for storing addresses.
Here in the UK, for example, defining a postcode involves quite a complex set of rules, whereas ZIP codes in the US are 5-digit numbers (optionally followed by a 4-digit extension), written after a simple 2-letter state abbreviation.
Then you have to consider the question: what exactly constitutes an address? Again, this differs not just from country to country, but sometimes drastically within the same territory.
For example (here in the UK):
Smith and Sons Butchers, 10 High Street, Some Town
Mr Smith, 10 High Street, Some Town
The Occupier, 10 High Street, Some Town
Smith and Sons Butchers, High Street, Some Town
These are all valid addresses in the UK, and in all cases the post would arrive at the correct destination; a GPS, however, may have trouble.
A GPS database might be set up so that each building is a square bit of geometry, with the ID being the house number.
That would give us the ability to say exactly where number 10 is, which means the last lookup above, the one with no house number, immediately fails.
Plots may be indexed by name of business; again, that's fine until you start using person names or generic titles.
There's so much variation, that it's simply not possible to create one unified format that can encompass every possible rule required to allow any application on the planet to format any geo-coded address correctly.
So how do we solve the problem?
Simple, by narrowing your scope.
Deal ONLY with a specific set of defined entities that you need to work with.
Hold only the information you need to describe what you need to describe (Always remember YAGNI* here)
Use standard data transmission formats such as JSON, XML and CSV; this increases the chance that code you don't control can read your data output with little extra work.
(* YAGNI = You ain't gonna need it)
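A narrowed-scope response could look something like the following Python sketch. The field names are taken from the components listed in the question (house number, street, sublocality, city, administrative area, country); treating everything except the country as optional is an assumption, made so that countries with fewer divisions simply leave fields out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Address:
    """Hypothetical common result type for all geocoding back ends."""
    country: str
    administrative_area: Optional[str] = None
    city: Optional[str] = None
    sublocality: Optional[str] = None
    street: Optional[str] = None
    house_number: Optional[str] = None


def to_json(address):
    """Serialize only the components that are actually present."""
    present = {k: v for k, v in asdict(address).items() if v is not None}
    return json.dumps(present, ensure_ascii=False, sort_keys=True)
```

Each provider-specific module then only has to fill in an Address, and callers never see the provider's own format.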
Now, to dig in deeper however:
When it comes to actual GIS data, there's a lot of standard format files, the 3 most common are:
Esri Shape Files (*.shp)
Keyhole Markup Language (*.kml)
Comma separated values (*.csv)
All of the mainstream GIS packages, free and paid-for, can work with any of these 3 file types, and many more.
Shape files are by far the most common ones you're going to come across; just about every bit of geospatial data I've come across in my years in I.T. has been in a shape file. I would, however, NOT recommend storing your data in them for processing: they are quite a complex format, often slow and sequential to access.
If your geometry files are to be consumed in other systems, however, you can't go wrong with them.
They also have the added bonus that you can attach attributes to each item of data too, such as address details, names etc.
The problem is, there is no standard as to what you would call the attribute columns, or what you would include, and, probably more drastically, the column names are limited to 10 characters in length (and are conventionally UPPERCASE), because the attributes live in a dBASE (.dbf) table.
KML files are another format that's quite universally recognized, and because they're XML-based and used by Google, you can include a lot of extra data in them that is technically self-describing to the machine reading it.
Unfortunately, file sizes can be incredibly bulky even just for a handful of simple geometries, this trade off does mean though that they are pretty easy to handle in just about any programming language on the planet.
And that brings us to the humble CSV.
The mainstay of data transfer (not just geospatial) ever since time began.
If you can put your data in a database table or a spreadsheet, then you can put it in a CSV file.
Again, there are no standards, other than how columns may or may not be quoted and what the separation points are, so readers have to know ahead of time what each column represents.
Also there's no "Pre-Made" geographic storage element (In fact there's no data types at all) so your reading application, also will need to know ahead of time what the column data types are meant to be so it can parse them appropriately.
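To illustrate the point about readers needing to know the column types up front, here is a small Python sketch using the standard csv module. The column names and the sample row are invented; the schema dictionary is the out-of-band knowledge a CSV file cannot carry itself.

```python
import csv
import io

# The reader's prior knowledge: column name -> conversion function.
SCHEMA = {"name": str, "lat": float, "lon": float}


def read_points(text):
    """Parse CSV text into typed records according to SCHEMA."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in SCHEMA.items()}
            for row in reader]


sample = "name,lat,lon\nSome Town,54.5,-1.5\n"
points = read_points(sample)
```

Without SCHEMA every value would come back as a string, which is exactly the situation described above.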
On the plus side however, EVERYTHING can read them, whether they can make sense of them is a different story.

Performance interaction in SCORM

I understand almost all of the interaction types specified by the SCORM data model element cmi.interactions.n.type (true_false, multiple_choice, fill_in, long_fill_in, matching, performance, sequencing, likert, numeric, other); what remains is the performance type. I found an explanation by Ostyn, but it remains ambiguous.
The Performance interaction is the most flexible and rich of the
standard interaction types in SCORM. It allows the capture of a number
of arbitrary steps performed by a learner, along with information
about every step. (Claude Ostyn)
AFAIK it does exactly that, i.e. stores arbitrary data related to an ambiguous non-standard interaction (e.g. a 3D simulation). LMS are not supposed to do anything with interaction data anyway, at least not regarding completion and grading, so it is mostly used by instructional designers who need a deeper insight into what the learners are doing and then adjust the training, e.g. exercise difficulty.

Method for runtime comparison of two programs' objects

I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
1. For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
2. Let Na = # of unique object names encountered in A, similarly for Nb and # of objects in B.
3. Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
4. For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
5. Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
6. Find discrepancies in the objects between A and B based on when the sequences of hash values diverge for any objects with divergent sequences.
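The steps above can be sketched roughly as follows in Python. This is only one reading of the procedure: the trace is assumed to be a list of (name, value) assignments, repr() is assumed to be a stable serialization of each value, and the pairing of an object in A with an object in B is supplied by hand rather than discovered automatically.

```python
import hashlib


def digest(value):
    """Short hash of a value's repr (assumes repr is deterministic)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:12]


def build_table(assignments):
    """One row per assignment: {name: hash or None}, with LOCF.

    None plays the role of NULL for objects not yet observed.
    """
    names = sorted({name for name, _ in assignments})
    current = dict.fromkeys(names)
    table = []
    for name, value in assignments:
        current[name] = digest(value)
        table.append(dict(current))
    return table


def hash_sequence(table, name):
    """Successive distinct hashes taken by one object (repeats collapsed)."""
    seq = []
    for row in table:
        h = row[name]
        if h is not None and (not seq or seq[-1] != h):
            seq.append(h)
    return seq


def first_divergence(seq_a, seq_b):
    """Index where two hash sequences first differ, or None if identical."""
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            return i
    return None if len(seq_a) == len(seq_b) else min(len(seq_a), len(seq_b))
```

Because the comparison works on per-object sequences rather than on rows, the two programs do not need the same number of intermediate objects or the same names.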
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
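One possible way around the precision problem is to round floating-point values before hashing them. The Python sketch below is an assumption-laden illustration, not an established technique from the question: the choice of 9 significant digits is arbitrary, and two values that straddle a rounding boundary can still hash differently, so a direct tolerance check is kept for pairs that have already been matched.

```python
import hashlib
import math


def tolerant_digest(value, digits=9):
    """Hash a value after rounding any floats to `digits` significant digits.

    Lists and tuples are canonicalized to lists, so they hash the same.
    """
    def canon(v):
        if isinstance(v, float):
            return float(f"{v:.{digits}g}")
        if isinstance(v, (list, tuple)):
            return [canon(x) for x in v]
        return v

    return hashlib.sha256(repr(canon(value)).encode()).hexdigest()


def values_match(a, b, rel_tol=1e-9):
    """Direct comparison for two floats that are already paired up."""
    return math.isclose(a, b, rel_tol=rel_tol)
```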
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing the execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. objects) than with the actual code that produces the objects. I've just skimmed it, but will review it more carefully for methodology. More importantly, this suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area of security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.

How do you represent music in a data structure?

How would you model a simple musical score for a single instrument written in regular standard notation? Certainly there are plenty of libraries out there that do exactly this. I'm mostly curious about different ways to represent music in a data structure. What works well and what doesn't?
Ignoring some of the trickier aspects like dynamics, the obvious way would be a literal translation of everything into Objects - a Score is made of Measures, which are made of Notes. Synthesis, I suppose, would mean figuring out the start/end time of each note and blending sine waves.
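For reference, the "obvious" object model might look like this minimal Python sketch. The class fields, the pitch naming ("C4"), and the assumption of a single monophonic line played back to back at a fixed tempo are all simplifications chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Note:
    pitch: str    # e.g. "C4"; the naming scheme is an assumption
    beats: float  # duration in beats


@dataclass
class Measure:
    notes: List[Note] = field(default_factory=list)


@dataclass
class Score:
    tempo_bpm: float
    measures: List[Measure] = field(default_factory=list)

    def timeline(self):
        """Yield (pitch, start_seconds, end_seconds) for each note in order."""
        seconds_per_beat = 60.0 / self.tempo_bpm
        clock = 0.0
        for measure in self.measures:
            for note in measure.notes:
                end = clock + note.beats * seconds_per_beat
                yield note.pitch, clock, end
                clock = end
```

The timeline is what a synthesizer would consume; chords, rests, and ties are exactly the places where this strict hierarchy starts to strain.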
Is the obvious way a good way? What are other ways to do this?
Many people doing new common Western music notation projects use MusicXML as a starting point. It provides a complete representation of music notation that you can subset to meet your needs. There is now an XSD schema definition that projects like ProxyMusic use to create MusicXML object models. ProxyMusic creates these in Java, but you should be able to do something similar with other XML data binding tools in other languages.
As one MusicXML customer put it:
"A very important benefit of all of your hard work on MusicXML as far as I am concerned is that I use it as a clear, structured and very ‘real-world practical’ specification of what music ‘is’ in order to design and implement my application’s internal data structures."
There's much more information available - XSDs and DTDs, sample files, a tutorial, a list of supported applications, a list of publications, and more - at
http://www.makemusic.com/musicxml
MIDI is not a very good model for a simple musical score in standard notation. MIDI lacks many of the basic concepts of music notation. It was designed to be a performance format, not a notation format.
It is true that music notation is not hierarchical. Since XML is hierarchical, MusicXML uses paired start-stop elements for representing non-hierarchical information. A native data structure can represent things more directly, which is one reason that MusicXML is just a starting point for the data structure.
For a more direct way of representing music notation that captures its simultaneous horizontal and vertical structure, look at the Humdrum format, which uses more of a spreadsheet/lattice model. Humdrum is especially used in musicology and music analysis applications where its data structure works particularly well.
MIDI files would be the usual way to do this. MIDI is a standard format for storing data about musical notes, including start and end times, note volume, which instrument it's played on, and various special characteristics; you can find plenty of prewritten libraries (including some open source) for reading and writing the files and representing the data in them in terms of arrays or objects, though they don't usually do it by having an object for each note, which would add up to a lot of memory overhead.
The instruments defined in MIDI are just numbers from 1 to 128 which have symbolic names, like violin or trumpet, but MIDI itself doesn't say anything about what the instruments should actually sound like. That is the job of a synthesizer, which takes the high-level MIDI data and converts it into sound. In principle, yes, you can create any sound by superposing sine waves, but that doesn't work that well in practice because it becomes computationally intensive once you get to playing a few tracks in parallel; also, a simple Fourier spectrum (the relative intensities of the sine waves) is just not adequate when you're trying to reproduce the real sound of an instrument and the expressiveness of a human playing it. (I've written a simple synthesizer to do just that, so I know how hard it can be to produce a decent sound.) There's a lot of research being done in the science of synthesis, and more generally DSP (digital signal processing), so you should certainly be able to find plenty of books and web pages to read about it if you'd like.
Also, this may only be tangentially related to the question, but you might be interested in an audio programming language called ChucK. It was designed by people at the crossroads of programming and music, and you can probably get a good idea of the current state of sound synthesis by playing around with it.
Music in a data structure, standard notation, ...
Sounds like you would be interested in LilyPond.
Most things about musical notation are almost purely mechanical (there are rules and guidelines even for the complex, non-trivial parts of notation), and LilyPond does a beautiful job of taking care of all those mechanical aspects. What's left is input files that are simple to write in any text editor. In addition to PDFs, LilyPond can also produce MIDI files.
If you felt so inclined, you could generate the text files algorithmically with a program and call LilyPond to convert them to notation and a MIDI file for you.
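Generating such an input file from a program can be as simple as string formatting. The Python sketch below builds a minimal LilyPond source with a \layout block (for the printed score) and a \midi block (for the MIDI file); the helper name and the note list are hypothetical, and a real file would normally also start with a \version statement matching the installed LilyPond.

```python
def lilypond_source(notes):
    """Build LilyPond input text from (pitch, duration) pairs.

    Pitches use LilyPond's own syntax, e.g. ("c'", 4) is a quarter-note
    middle C and ("e'", 2) is a half-note E above it.
    """
    body = " ".join(f"{pitch}{duration}" for pitch, duration in notes)
    return (
        "\\score {\n"
        f"  {{ {body} }}\n"
        "  \\layout { }\n"
        "  \\midi { }\n"
        "}\n"
    )


source = lilypond_source([("c'", 4), ("d'", 4), ("e'", 2)])
```

Writing the string to a file such as tune.ly and running lilypond tune.ly on it should then produce both the engraved score and the MIDI output.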
I doubt you could find a more complete and concise way to express music than an input file for LilyPond.
Please understand that music and musical notation are not hierarchical and can't be modelled (well) by strict adherence to hierarchical thinking. Read this for more information on that subject.
Have fun!
Hmmm, fun problem.
Actually, I'd be tempted to turn it into Command pattern along with Composite. This is kind of turning the normal OO approach on its head, as you are in a sense making the modeled objects verbs instead of nouns. It would go like this:
a Note is a class with one method, play(), and a ctor taking length and tone.
you need an Instrument which defines the behavior of the synth: timbre, attack, and so on.
You would then have a Score, which has a TimeSignature, and is a Composite pattern containing Measures; the Measures contain the Notes.
Actually playing it means interpreting some other things, like Repeats and Codas, which are other Containers. To play it, you interpret the hierarchical structure of the Composite, inserting a note into a queue; as the notes move through the queue based on the tempi, each Note has its play() method called.
Hmmm, might invert that; each Note is given as input to the Instrument, which interprets it by synthesizing the wave form as required. That comes back around to something like your original scheme.
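A compact Python sketch of that Composite arrangement, using the inverted form where each Note is handed to the Instrument: the Instrument here is only a stand-in that records what it was asked to play, the tempo-driven queue is left out, and class details beyond the names mentioned above are assumptions.

```python
class Instrument:
    """Stand-in for a synth: it only records what it is asked to play."""

    def __init__(self, name):
        self.name = name
        self.played = []

    def sound(self, tone, length):
        self.played.append((tone, length))


class Note:
    """Leaf of the composite; playing it is the 'command'."""

    def __init__(self, length, tone):
        self.length = length
        self.tone = tone

    def play(self, instrument):
        instrument.sound(self.tone, self.length)


class Container:
    """Composite node: playing it plays each child in order."""

    def __init__(self, children=()):
        self.children = list(children)

    def play(self, instrument):
        for child in self.children:
            child.play(instrument)


class Measure(Container):
    pass


class Repeat(Container):
    def __init__(self, children=(), times=2):
        super().__init__(children)
        self.times = times

    def play(self, instrument):
        for _ in range(self.times):
            super().play(instrument)


class Score(Container):
    def __init__(self, time_signature, children=()):
        super().__init__(children)
        self.time_signature = time_signature
```

Because a Note only knows its length and tone, the same Score can be played on any Instrument without changing the structure.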
Another approach to the decomposition is to apply Parnas' Law: you decompose in order to keep secret places where requirements could change. But I think that ends up with a similar decomposition; You can change the time signature and the tuning, you can change the instrument --- a Note doesn't care if you play it on a violin, a piano, or a marimba.
Interesting problem.
My music composition software (see my profile for the link) uses Notes as the primary unit (with properties like starting position, length, volume, balance, release duration etc.). Notes are grouped into Patterns (which have their own starting positions and repetition properties) which are grouped into Tracks (which have their own instrument or instruments).
Blending sine waves is one method of synthesizing sounds, but it's pretty rare (it's expensive and doesn't sound very good). Wavetable synthesis (which my software uses) is computationally inexpensive and relatively easy to code, and is essentially unlimited in the variety of sounds it can produce.
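For readers unfamiliar with the technique, here is a bare-bones Python sketch of a wavetable oscillator. It is not the implementation used in the software mentioned above: it stores a single cycle of a waveform (a sine here, purely for simplicity), reads it back without interpolation, and the table size and sample rate are arbitrary choices.

```python
import math

TABLE_SIZE = 1024
# One cycle of the waveform. Any sampled shape could be stored instead.
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def render(table, frequency, seconds, sample_rate=44100):
    """Read through the table at a speed that yields the requested pitch."""
    step = frequency * len(table) / sample_rate  # table positions per sample
    phase = 0.0
    samples = []
    for _ in range(int(seconds * sample_rate)):
        samples.append(table[int(phase)])  # nearest-lower lookup, no interpolation
        phase = (phase + step) % len(table)
    return samples
```

The per-sample cost is one lookup and one addition regardless of how complex the stored waveform is, which is where the low computational cost comes from.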
The usefulness of a model can only be evaluated within a given context. What is it you are trying to do with this model?
Many respondents have said that music is non-hierarchical. I sort of agree with this, but instead suggest that music can be viewed hierarchically from many different points of view, each giving rise to a different hierarchy. We may want to view it as a list of voices, each of which has notes with on/off/velocity/etc attributes. Or we may want to view it as vertical sonorities for the purpose of harmonic analysis. Or we may want to view it in a way suitable for contrapuntal analysis. Or many other possibilities. Worse still, we may want to see it from these different points of view for a single purpose.
Having made several attempts to model music for the purposes of generating species counterpoint, analysing harmony and tonal centers, and many other things, I have been continuously frustrated by music's reluctance to yield to my modelling skills. I'm beginning to think that the best model may be relational, simply because to a large extent, models based on the relational model of data strive not to take a point of view about the context of use. However, that may simply be pushing the problem somewhere else.

Resources