For example, if I want to develop a chitchat bot, and I already have a corpus with dialog like the following:
-- Are you a student?
-- No, I am a scientist.
...
Can I train a model directly on this data without going through the regular NLU and Core processes, such as create the NLU data and stories data? In chitchat, it's hard to define the intent, if my corpus is a little large. There can be potentially too many intents. So it seems not to be good to use intent-slot-dm flow to develop the model?
Does RASA provide a way to directly train on the data?
This is currently not possible with the Rasa Stack.
If you model goal oriented use cases, this should be not a big problem. However, for chitchat this can indeed be a bit tricky. Try to see whether you can break down the most frequent chitchat messages in intents, and then handle them in stories.
Related
I am going to build a chat bot from scratch with rasa.The biggest difficulty now is how to automate production training data.Training data includes nlu.md and stories.md .
I have tried rasa-nlu-trainer and Chatito,But there are still a lot of manual operations,If there are tens of thousands of corpora in the future.How to mark the data to make the data meet the data format of nlu.md and stories.md
Is there an automated tool or program to do this? Thanks a lot!
Well, if you're doing anything ML related, your data is the most important thing that you'll need for the model to learn from. And because we want the model to learn from that data, we create the data and then train the model with it. What you're asking for is for something to somehow create the data for it. It's precisely because there doesn't exist anything like that that we create datasets to train the AI on, by ourselves, so that the model learns form it. So, if you automate the data creation process, what do you expect the model to learn?
So, you can't create the data automatically because if that were possible, we would already have had Artificial General Intelligence (AGI) by now.
But if your goal is to just format the data then you can just write a script for that.
Events in an event store (event sourcing) are most often persisted in a serialized format with versions to represent a changed in the model or schema for an event type. I haven't been able to find good documentation showing the actual model or schema for an actual event (often data table in event store schema if using a RDBMS) but understand that ideally it should be generic.
What are the most basic fields/properties that should exist in an event?
I've contemplated using json-api as a specification for my events but perhaps that's too "heavy". The benefits I see are flexibility and maturity.
Am I heading down the "wrong path"?
Any well defined examples would be greatly appreciated.
I've contemplated using json-api as a specification for my events but perhaps that's too "heavy". The benefits I see are flexibility and maturity.
Am I heading down the "wrong path"?
Don't overlook forward and backward compatibility.
You should plan to review Greg Young's book on event versioning; it doesn't directly answer your question, but it does cover a lot about the basics of interpreting an event.
Short answer: pretty much everything is optional, because you need to be able to change it later.
You should also review Hohpe's Enterprise Integration Patterns, in particular his work on messaging, which details a lot of cases you may care about.
de Graauw's Nobody Needs Reliable Messaging helped me to understan an important point.
To summarize: if reliability is important on the business level, do it on the business level.
So while there are some interesting bits of meta data tracking that you may want to do, the domain model is really only going to look at the data; and that is going to tend to be specific to your domain.
You also have the fun that the representation of events that you use in the service that produces them may not match the representation that it shares with other services, and in particular may not be the same message that gets broadcast.
I worked through an exercise trying to figure out what the minimum amount of information necessary for a subscriber to look at an event to understand if it cares. My answers were an id (have I seen this specific event before?), a token that tells you the semantic meaning of the message (is that something I care about?), and a location (URI) to get a richer representation if it is something I care about.
But outside of the domain -- for example, when you are looking at the system as a whole trying to figure out what is going on, having correlation identifiers and causation identifiers, time stamps, signatures of the source location, and so on stored in a consistent location in the meta data can be a big help.
Just modelling with basic types that map to Json to write as you would for an API can go a long way.
You can spend a lot of time generating overly complex models if you throw too much tooling at it - things like Apache Thrift and/or Protocol Buffers (or derived things) will provide all sorts of IDL mechanisms for you to generate incidental complexity with.
In .NET land and many other platforms, if you namespace the types you can do various projections from the types
Personally, I've used records and DUs in F# as a design and representation tool
you get intellisense, syntax hilighting, and types you can use from F# or C# for free
if someone wants to look, types.fs has all they need
I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that Nifi is really aiming to create a space which allows for flow-based processing. After playing around with Nifi a bit, what I've also come to realize is it's capability to model/shape the data in a way that is useful for me. Is it fair to say that Nifi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi employs dataflow through principles and design closely related to flow based programming (FBP) as a means, the function is a matter of getting data from point A to B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense though, it is simply getting the data to those that need it (systems, users, etc) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk that may be in line with this idea of data shaping in the context of Data Science at a Meetup. The slides of which are available [2].
If you have any additional questions, I would point you to check out the community mailing lists [3] and ask any additional questions so you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks so I'll be careful to use that term here. What I do think in what you're asking is very clear is that Apache NiFi is a great system to use to help mold the data into the right format and schema and content you need for your follow-on analytics and processing. NiFi has an extensible model so you can add processors that can do this or you can use the existing processors in many cases and you can even use the ExecuteScript processors as well so you can write scripts on the fly to manipulate the data.
We are currently in the process of evaluating FHIR for use as part of our medical record infrastructure. For the EHR data (Allergies, Visits, Rx, etc..) the HL7 FHIR seems to have an appropriate mapping.
However, lots of data that we deal with is related to personal Fitness - think Fitbit or Apple HealthKit:
Active exercise (aerobic or workout): quantity, energy, heart-rate
Routine activities such as daily steps or water consumption
Sleep patterns/quality (odd case of inter-lapping states within the same timespan)
Other user-provided: emotional rating, eating activity, women's health, UV
While there is the Observation resource, this still seems best fit (!) for the EHR domain. In particular, the user fitness data is not collected during a visit and is not human-verified.
The goal is to find a "standardized FIHR way" to model this sort of data.
Use an Observation (?) with Extensions? Profiles? Domain-specific rules?
FHIR allows extraordinary flexibility, but each extension/profile may increase the cost of being able to exchange the resource directly later.
An explanation on the appropriate use of an FHIR resource - including when to Extend, use Profiles/tags, or encode differentiation via Coded values - would be useful.
Define a new/custom Resource type?
FHIR DSTU2 does not define a way to define a new Resource type. Wanting to do so may indicate that the role of resources - logical concept vs. an implementation interface? - is not understood.
Don't use FHIR at all? Don't use FHIR except on summary interchanges?
It could also be the case that FHIR is not suitable for our messaging format. But would it be any "worse" to go FIHRa <-> FIHRb than x <-> FIHRc when dealing with external interoperability?
The FHIR Registry did not seem to contain any User-Fitness specific Observation Profiles and none of the Proposed Resources seem to add appropriate resource-refinements.
At the end of the day, it would be nice to be able to claim to be able to - with minimal or no translation, ie. in a "standard manner" - be able to exchange User Fitness data as an FHIR stream.
Certainly the intent is to use Observation, and there's lots of projects already doing this.
There's no need for extensions, it's just a straight forward use. Note that this: " In particular the user fitness data is not collected during a visit and is not human-verified" doesn't matter. There's lots of EHR data of dubious provenance...
You just need to use the right codes, and bingo, it all works. I've provided a bit more detail to the answer here:
http://www.healthintersections.com.au/?p=2487
Systems have to sometimes accommodate the possibility of real world bad data. Consider that some data originates with paper forms. And forms inherently have a limited means of validating data.
Example 1: On one form users are expected to enter an integer distance (in miles) into a blank. We capture the information as written as a string since we don't always end up getting integer values.
Example 2: On another form we capture a code. That code should map to one of the codes in our system. However, sometimes the code written on the form is incorrect. We capture the code and allow it to exist with an invalid value until some future time of resolution. That is, we temporarily allow bad data since it's important to record the record even if some of it is invalid.
I'm interested in learning more about how systems accommodate bad data, that is, human error. Databases are supposed to be bastions of data integrity, but the real world is messy and people make mistakes. Systems must allow us to reflect those mistakes.
What are some ways systems you've developed accommodate human error? What practices have you used? What lessons have you learned?
Any further reading on the topic? (I had trouble Googling it.)
I agree with you, whatever we do there's no guarantee that we can get rid of bad or incorrect data. Especially, but not only, if it comes to user input. In my experience the same problems exist in complex integration projects, in which you have to integrate and merge (often inconsistent) data retrieved from different systems.
A good strategy is to decouple the input from the operational system itself. First, place user (or external system) provided data in a separate datastore (e.g. different schema). In a second step load this data into your operational datastore, but only if it confirms to strict rules (e.g. use address verification software to verify a given address). This Extract, Transform, Load (ETL) approach is fairly common in Data Warehousing (DWH) solutions, but can be applied programmatically in transactional systems as well (in my experience).
The above approach often leads to asynchronous processes in which the input is subitted first and (maybe) at a later time the external entity (user or system) retrives feedback whether its data was correct or not.
EDIT: For further readings I recommend to have a look at DWH concepts. Alhtough, you may not want to build such a thing, you could partially apply those concepts:
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Data_warehouse
http://en.wikipedia.org/wiki/Data_cleansing
A government department I worked in does a lot of surveys, most of which are (were) still paper based.
All the results were OCR'd into the system.
As part of the OCR process a digital scan of the forms is kept.
Data is then validated, data that is undecipherable or which fails validation is flagged.
When a human operator reviews the digital data they can modify the data if they are confident that they can correctly interpret what the code could not; they (here's the cool bit) can also bring up the scan of the paper based original, and use that to determine what the user was trying to say.
On a different thread; at some point you want to validate the data coming in against any expected data ranges that you want it to conform to; buy rejecting it at the point of entry you give the user a chance to correct it - the trade off is that every time you reject it you increase the chance of them abandoning the whole process.
At some point in your system you need to specify the rules which will be used for validation. At the end of the day a system is only going to be as smart as those rules. You can develop these yourself into the code (probably the business logic) or you might use a 3rd party component.
having flexible control over the validation is pretty important as they are likely to change overtime.
To be honest with you, one point of migrating from paper-based systems to IT is to remove these errors and make sure all data is always correct. I doubt any correctly planned and developed IT system (especially business financial systems) would allow such errors. Not in the company I am working for anyway...
There are lots of software tools that address the kinds of problems you mention. There are platforms and tools that let you define rules for scrubbing and transforming data and handling validation errors. Those techniques are widely used for Data Integration and Business Intelligence applications. Google for "Data Quality" or "Data Integration".
The easiest thing to do is to (this is not always possible) design the interface where users enter the data to limit as much as possible the amount of text that they need to enter. In my experience this seems to be where a lot of problems come from. One simple example of this is to provide a select, or auto-complete select field
One thing that you could do is do everything possible to determine if the data is correct before going into the db. I try to give the user entering the data as much feedback as possible so they can (ideally) fix some of the issues before the data gets persisted. For example, it is a very quick check to determine if the data being entered is of the correct type.
I got started in legal systems before the PC era. Litigation support databases routinely have to accommodate factually incorrect, incomplete, and contradictory information. It takes a different way of thinking.
The short version . . .
Instead of recording a single fact, you record multiple assertions about a fact. It boils down to designing a database to store data from assertions like these.
In an interview at 2011-01-03 08:13, Neil Rimes told Officer Cane
that he was at home from 2011-01-02 20:00 until 2011-01-03 08:13.
In an interview at 2011-01-03 08:25, Liza Nevers told Officer Cane
that Neil Rimes came home at 2011-01-02 23:45.
In a deposition at 2011-05-13 10:22, Cody Maxon told attorney Kurt
Schlagel that he saw Neil Rimes at Kroger at 2011-01-03 03:00