Questions within questions for Tin Can API?

Does Tin Can API support questions within questions?
If so, what would be the specification for passing data to an LRS?
I was thinking of adding IDs to each sub-question.

This would be much easier to answer if you could provide an example, but the flexibility of the Tin Can API is such that you can literally capture anything (which is also part of the complexity) with more or less grace.
Some immediate options come to mind:
Use a single interaction activity statement (likely with type choice) and use the formatting allowed for multi-value responses (e.g. golf[,]tetris).
Use multiple statements: one combined statement (necessary if there is an overall result) for a single main activity, plus one statement per sub-question. Each sub-question has its own activity, and the main activity is stored in its context.contextActivities.parent list. When there is a combined statement, I would also include a reference to it in each sub-question statement's context.statement property so that you can tie them all together.
Use result, context, and activity definition extensions to capture anything. This should be a last resort: it usually makes setting things up simple but adds significant complexity on the reporting side. Tempting as the simplicity is, you should usually avoid extensions unless you are trying to capture a specific type of data point (like geo-location data, math equations, etc.).
Which of the above makes the most sense is probably determined by what sort of response is being given, and whether questions are nested such that there is an overall result and sub-results or whether there is just an overall result.
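As a rough illustration of the second option, the sketch below builds a combined statement plus one statement per sub-question as plain Python dictionaries. The actor, the example.com activity IDs, and the helper function names are all made up for illustration; only the property names (context.contextActivities.parent, context.statement, the [,] delimiter) come from the options above. Treat it as a starting point under those assumptions, not a definitive statement design.

```python
import uuid

# ADL "answered" verb; the actor below is a hypothetical learner.
ANSWERED = {"id": "http://adlnet.gov/expapi/verbs/answered",
            "display": {"en-US": "answered"}}
ACTOR = {"mbox": "mailto:learner@example.com", "name": "Example Learner"}


def combined_statement(main_activity_id, success):
    """Statement carrying the overall result for the main question."""
    return {
        "id": str(uuid.uuid4()),
        "actor": ACTOR,
        "verb": ANSWERED,
        "object": {"objectType": "Activity", "id": main_activity_id},
        "result": {"success": success},
    }


def sub_question_statement(combined, sub_id, responses):
    """Statement for one sub-question, tied back to the combined statement."""
    main_id = combined["object"]["id"]
    return {
        "id": str(uuid.uuid4()),
        "actor": ACTOR,
        "verb": ANSWERED,
        "object": {
            "objectType": "Activity",
            "id": main_id + "/" + sub_id,  # hypothetical ID scheme for sub-questions
            "definition": {
                "type": "http://adlnet.gov/expapi/activities/cmi.interaction",
                "interactionType": "choice",
            },
        },
        # Multi-value responses are joined with the [,] delimiter.
        "result": {"response": "[,]".join(responses)},
        "context": {
            "contextActivities": {"parent": [{"id": main_id}]},
            "statement": {"objectType": "StatementRef", "id": combined["id"]},
        },
    }
```

Each dictionary would then be serialized to JSON and sent to the LRS; reporting can group the sub-question statements either by the parent activity or by the referenced statement ID.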

Related

Is there a process for munging data from many different formats in RapidMiner?

I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:
Processes files on a schedule that are dropped into a folder (this
one I think I know but I'd love tips on this as scheduled processes
are new to me)
Automatically identifies input filetype and routes to the correct operator ("Read CSV" for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases, attributes are named the same way as our ingestion format and in others they are not (phone vs phone # vs Phone for example)
The attributes we process mostly consist of name, id, phone, email, address. Also, in some cases names are split first/last and in some they are full name.
I recognize that munging files for such simple attributes shouldn't be that hard but the number of files we receive and lack of order makes it very difficult to streamline a process without a bit of automation. I'm also going to move to a standardized receiving format but for a number of reasons that's on the horizon and not an immediate solution.
I appreciate any tips or guidance you can share.
Your question is relatively broad, so unfortunately I can't give you a complete answer. But here are some ideas on how I would tackle the points you mentioned:
For a full process scheduling RapidMiner Server is what you are
looking for. In that case you can either define a schedule (e.g.,
check regularly for new files) or even define a web service to
trigger the process.
For selecting the correct operator depending on file type, you could
use a combination of "Loop Files" and macro extraction to get the
correct type and then use either "Branch" or "Select Subprocess" for
switching to different input routes.
The "Select Attributes" operator has some very powerful options to
select specific subsets only. In your example I would go for a
regular expression akin to [pP]hone.* to get the different spelling
variants. Also very helpful in that case would be the "Reorder
Attributes" operator and "Rename by Replacing" to create a common
naming schema.
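To make the renaming idea concrete outside of RapidMiner, here is a small Python sketch of what a pattern like [pP]hone.* catches and how header variants could be mapped onto one naming schema. Only the phone pattern comes from the text above; the email and id patterns and the function names are assumptions for illustration.

```python
import re

# Canonical name -> pattern for the header variants it should absorb.
# Only the phone pattern is taken from the answer; the others are guesses.
HEADER_PATTERNS = {
    "phone": re.compile(r"[pP]hone.*"),
    "email": re.compile(r"[eE]-?mail.*"),
    "id": re.compile(r"(?i)id"),
}


def canonical_name(header):
    """Return the canonical attribute name, or the header unchanged if nothing matches."""
    cleaned = header.strip()
    for name, pattern in HEADER_PATTERNS.items():
        if pattern.fullmatch(cleaned):
            return name
    return cleaned


def rename_columns(headers):
    """Map every incoming header onto the common naming schema."""
    return [canonical_name(h) for h in headers]
```

Note that [pP]hone.* is deliberately loose, so it would also match a header such as "phonebook"; tighten the pattern if your files contain such columns.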
A general tip when building more complex process pipelines is to organize your different tasks in sub-processes and use the "Execute Process" operator. This makes everything much more readable and maintainable. Also a good error handling strategy is important to handle unforeseen data formats.
For more elaborate answers and tips from many advanced RapidMiner users, I also highly recommend the RapidMiner community.
I hope this gives a good starting point for your project.

Algorithm of Hanning Window in DigitalMicrograph

Since we cannot specify the "parameters" associated with the built-in filter functions in DM (see previous question), I would like to write my own scripts to construct the filter I need.
However, I cannot figure out the algorithm of the "Hanning Window" used in DM, specifically the "strength" parameter (when strength=1.0, it is just a typical Hann function).
Does someone know the underlying algorithm?
I do not know the exact internal computation of the current filters. However, before you jump into re-creation you may want to check if the following script-command exists and works in your GMS version.
BasicImage IUHanningWindowFilter( BasicImage im, Number power )
This is not an officially supported command, but I believe it has been in the software for quite some time now.
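If the command is not available and you do end up re-creating the filter, one guess worth testing is that the parameter acts as an exponent on the ordinary Hann function, which would be consistent with a value of 1.0 giving the typical Hann window. The 1D Python sketch below shows only that guess; the function name is made up, and nothing here is confirmed to be what DM computes internally.

```python
import math


def hann_power_window(n_points, power=1.0):
    """1D Hann window raised to `power`.

    ASSUMPTION: treating the parameter as an exponent is a guess,
    not DM's documented algorithm. power=1.0 gives the plain Hann window.
    """
    if n_points < 2:
        return [1.0] * n_points
    window = []
    for i in range(n_points):
        hann = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n_points - 1)))
        # Clamp tiny negative rounding errors before exponentiating.
        window.append(max(0.0, hann) ** power)
    return window
```

Comparing the output of such a function against the result of the built-in filter on a uniform test image would show quickly whether the exponent interpretation holds.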

TDD strategy when implementing a multi-stage process?

At the moment I'm developing a piece of code which first gathers sentences from a set of documents, then tokenises these, then uses the results to analyse recurring frequencies of token sequences, including case variations (upper case/lower case/leading cap/other), then prints out the results.
Now I want to introduce two more stages before printing out the results:
1. firstly, removing "stop words" (i.e. words or short sequences the frequency of which can never be of interest, such as, in English, "the", "of the", "of which", etc.) - these stop words/"stop sequences" to be taken from a database table
2. secondly, bringing up a dialog enabling the user to identify sequences of new stop words, which would then remove the token sequences involved and also add the sequence in question to the database table.
The thing is, this is a multi-stage process, and I'm just wondering what TDD experts do faced with a situation like this: do I create a new test method for each individual stage...? The problem being that each individual stage requires the use of "live memory data" from the previous stage: another possibility could be to somehow serialise this data and then deserialise it when testing for the next stage... but then this would involve the app code doing things which were of benefit only for the testing code, i.e. it would mean tweaking ("distorting"?) the app code for the benefit of the testing code, which seems wrong in principle...
Also, if anyone can point me in the direction of a book or site which helps TDD newbs like myself go to "the next level" I would be very grateful.
later
To the person who marked this as "favorite": I've now got hold of a book called "Growing Object-Oriented Software, Guided by Tests", which is well-reviewed and appears to be for someone wanting to move from beginner to intermediate. First impressions good.
Any views on this book by experts also welcome, of course...
On the face of it, you seem to be building a pipeline. From what I can tell, you're currently implementing all of it within a single class, which stores both the data that's being worked on and implements the methods that do the processing. One approach that you could take would be to break down the problem into smaller chunks. Rather than having a single class, you have a class for each stage of the pipeline and another class for orchestrating the process which is responsible for plugging the stages together in the correct order.
So, scanning through what you've described, you appear to have the following processors:
DocumentReader (reads documents from somewhere into in memory document)
SentenceExtractor (document/list of documents in, list of sentences out)
1 or more SentenceAnalysers (sentences in, statistics out), you might want to break this down depending on the type of analysis and how complex it is.
StopWordExtractor (StopWordProvider and sentences in, sentences out)
Additional supporting classes would also be needed to write new stop words to the database and, depending on how the StopWordProvider is implemented, to keep it in sync as the user selects new ones.
Essentially, what I'm saying is that you appear to be doing too much in a single location. If you're really happy that the code as you've described it is a single unit, then there is nothing wrong with you testing it all in one place, but then your inputs will be your starting documents/sentences and your outputs will be the end of the process. If you agree with me that really, there are several distinct components involved in the process that could change independently, then I would suggest breaking the process down into smaller classes and testing that those perform as expected for given sets of inputs/outputs...
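As a minimal sketch of that breakdown (in Python, with made-up names and deliberately naive logic), each stage below takes plain in-memory input and returns plain output. That means a test for a later stage can hand-build its input rather than serialising the output of the previous stage.

```python
import re
import unittest


def extract_sentences(documents):
    """Documents in, list of sentences out (naive split on ., ! and ?)."""
    sentences = []
    for doc in documents:
        sentences.extend(s.strip() for s in re.split(r"[.!?]+", doc) if s.strip())
    return sentences


def tokenise(sentence):
    """Sentence in, list of tokens out (whitespace split only)."""
    return sentence.split()


class StopWordFilter:
    """Removes stop words supplied by a provider (here: any iterable of words)."""

    def __init__(self, stop_words):
        self._stop_words = {w.lower() for w in stop_words}

    def apply(self, tokens):
        return [t for t in tokens if t.lower() not in self._stop_words]


def run_pipeline(documents, stop_word_filter):
    """Orchestrator: plugs the stages together in the correct order."""
    return [stop_word_filter.apply(tokenise(s)) for s in extract_sentences(documents)]


class StopWordFilterTest(unittest.TestCase):
    def test_removes_stop_words_regardless_of_case(self):
        # Hand-built tokens: no earlier stage has to run for this test.
        stage = StopWordFilter(["the", "of"])
        self.assertEqual(stage.apply(["The", "colour", "of", "magic"]),
                         ["colour", "magic"])
```

In production the stop words would come from the database table; in the test they are just a list, so the app code is not bent to suit the tests.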

Mongo Db design (embed vs references)

I've read a lot of documents, Q&A etc about that topic (embed or to use references).
I understand the reasons for using one approach or the other, but I haven't seen anyone discuss (or ask about) a case like this:
I have two entities, A and B, in a ONE_TO_MANY relation (one A can belong to many B documents). I can embed A in B (the denormalization approach), and that's fine (I clearly understand it). But what if I later want to modify a field of an A document that is already used in many B documents? By "modify" I don't mean replacing A with A'; I mean changing that exact A record. In the embedded case, this means I have to apply the change to every B document that already holds a copy of A.
Based on the description here: http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
What if we later want to change the address:name field that is used in many documents?
What if we need the list of available addresses in the system?
How fast will those operations be in MongoDB?
It depends on which operations are used most. If you are mostly inserting and selecting documents, and only occasionally (e.g. once a month) need to modify many nested sub-documents, I think storing A inside B is good practice; it's what MongoDB is designed for. You save a lot of time by selecting one document without having to join others, and you can live with a slower update once in a while.
How fast the update ops will be is obviously dependent on volume of data.
Another consideration when choosing between embedded docs and references is whether the volume of data in a single document would exceed 16 MB. That's a lot of documents, mind.
In some cases however, it simply doesn't make sense to denormalise entire documents especially where they're used/referenced elsewhere.
Take a User document for example, you wouldn't usually denormalise all user attributes across each collection that needs to reference a user. Instead you reference the user [with maybe some denormalised user detail].
Obviously each additional denormalised value (unless it was an audit) would need to be updated when the referenced User changes, but you could queue the updates for a background process to deal with - rather than making the caller wait.
I'll throw in some more advice as to speed.
If you have a sub-document called A that is embedded in lots of documents - and you want to change instances of A ...
Careful that the documents don't grow too much with a change. That will hurt performance if A grows too big because it will force Mongo to move the document in memory.
It obviously depends on how many embedded instances you have. The more you have, the slower it will be.
It depends on how you match the sub-document. If you are finding A without an index, it's going to be slow. If you are using range operators to identify it, it will be slow.
Someone already mentioned the size of documents will most likely affect the speed.
The best advice I heard about whether to link or embed was this ... if the entity (A in this case) is mutable ... if it is going to mutate/change often ... then link it, don't embed it.
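To illustrate the trade-off without a running server, the sketch below models both designs with plain Python dictionaries; the collection contents and function names are invented for the example. With embedding, changing one A means writing to every B that holds a copy; with references, it is one write, but every read of B needs a second lookup to resolve A.

```python
# Embedded: every B document carries its own copy of A.
embedded_bs = [
    {"_id": 1, "a": {"_id": "a1", "name": "Old name"}},
    {"_id": 2, "a": {"_id": "a1", "name": "Old name"}},
    {"_id": 3, "a": {"_id": "a2", "name": "Other"}},
]

# Referenced: A lives once; B documents only store its id.
a_by_id = {
    "a1": {"_id": "a1", "name": "Old name"},
    "a2": {"_id": "a2", "name": "Other"},
}
referenced_bs = [
    {"_id": 1, "a_id": "a1"},
    {"_id": 2, "a_id": "a1"},
    {"_id": 3, "a_id": "a2"},
]


def rename_embedded(bs, a_id, new_name):
    """Touch every B that embeds the given A; returns the number of writes."""
    writes = 0
    for b in bs:
        if b["a"]["_id"] == a_id:
            b["a"]["name"] = new_name
            writes += 1
    return writes


def rename_referenced(as_by_id, a_id, new_name):
    """Touch the single A document; returns the number of writes."""
    as_by_id[a_id]["name"] = new_name
    return 1


def read_b_with_a(b, as_by_id):
    """Referenced reads need a second lookup to resolve A."""
    return {**b, "a": as_by_id[b["a_id"]]}
```

In MongoDB itself the embedded rename would be a multi-document update filtered on the embedded A's identifier, which is where an index on that field matters.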

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving soCaseInsensitive off and just converting case.
IMHO, the best approach is to ensure data quality at Post time. Reasons:
You (usually) know the nature of the data. So, e.g., you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase, which a general component like TdxMemDataSet is forced to use.
You enter the data only once, whereas searching, sorting, and filtering, which all go through TdxMemDataSet's internal uppercasing engine, are repeated actions. There are also other chained actions that will trigger this engine without you realizing it (e.g. a TcxGrid that is sorted by default with GridMode:=True (I assume that you use the DevEx. components), with a class acting like a broker that passes the sort message to the underlying dataset).
Usually data entry is done in steps, one or a few records in a batch. The only notable exception is data acquisition applications. But in both cases users tolerate much greater response times, which gives you room to play with. (IOW, how much would an UpperCase call add to a record post that lasts 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering etc.). Keep data retrieval as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when (if) you write other modules: otherwise you need to remember to AnsiUpperCase the data in every module, in every language you write. A classic example is using external tools to access the data (e.g. db managers executing an SQL SELECT over the data).
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, it is better to guarantee that the data is in the format you want at the moment you save it, for the reasons plainth already explained. So, in this specific case, make sure the GUID is written in upper case (or lower, it's a matter of taste). If it is SQL Server or another database server that has a GUID datatype, make sure the SELECT does the work, if applicable and possible, even the sort.
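The "normalise once when saving" idea can be sketched language-independently. The Python snippet below is a toy stand-in, not DevExpress code: the class and function names are hypothetical, and a dictionary plays the role of the case-sensitive index.

```python
def normalise_guid(value):
    """Upper-case a GUID string once, at write time (GUIDs are plain ASCII)."""
    return value.strip().upper()


class GuidIndex:
    """Toy stand-in for an indexed dataset with case-sensitive lookups."""

    def __init__(self):
        self._rows = {}

    def post(self, guid, row):
        # Pay the case conversion once per record, when it is saved.
        self._rows[normalise_guid(guid)] = row

    def locate(self, guid):
        # Only the search key is converted; stored keys are compared as-is.
        return self._rows.get(normalise_guid(guid))
```

With the stored values already in one case, each lookup converts a single search string instead of asking the index to fold case on every comparison.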

Resources