I'm creating a class that needs to parse user contact info to determine if the presented user already exists in the db. Because the source is unvalidated, user generated data I have to test for matches under a variety of conditions.
The content is presented in 3 fields - Name (first & last are combined); Company Name; Email
I need to return a result based on each of these possible match conditions:
Exact Match
Email Match
Domain Name Only
Full Name Exact
Last Name Only
Institution Match
I have a rough idea of how I'd go about coding this and am sure that the result would be inferior to what would be produced by a formal TDD approach. My TDD learning curve is just past the very basics but I don't have the depth to see how the above scenario is staged and developed thru the full lifecycle.
I'd like some help structuring the project from an architectural point of view.
thx
Seems like tou already listed the primary positive test cases in your list of match types. So take those from the top, write a small test for the first case (exact match), warch it fail, make it pass, iterate until exact match works. Then do the same for the other match types.
Related
I'm trying to figure out how the patient searching on EPIC FHIR is working.
Testing all on the sandbox here: https://fhir.epic.com/Documentation?docId=testpatients.
The docs:
Starting in May 2019, Patient.Search requests require one of the following minimum data sets by > default in order to match and return a patient record:
FHIR ID
{IDType}|{ID}
SSN identifier
Given name, family name, and birthdate
Given name, family name, legal sex, and phone number/email
This is working correctly (returning one patient):
/api/FHIR/R4/Patient?family=Lin&given=Derrick&birthdate=1973-06-03
But this is also returning same record (extra character in family, wrong gender)
:
/api/FHIR/R4/Patient?family=Lina&given=Derrick&birthdate=1973-06-03&gender=female
Also this one is returning one record (extra character in family, no given name):
/api/FHIR/R4/Patient?family=Lina&birthdate=1973-06-03
Not sure what I am doing wrong, or is it expected behaviour?
There is a bunch of history here, but Epic's current Patient.search behaves more like Patient.$match. Specifically, the criteria provided to Patient.search are combined using (approximately) OR logic rather than AND logic. Behind the scenes, it is actually more of weighted score, but ultimately, the more criteria you provide, the more possible results you might get. This is often counterintuitive if you are used to how REST API query params normally work. Technically it is spec legal though, since FHIR has a blurb in there about servers being able to return other appropriate results as it sees fit.
https://build.fhir.org/search.html#Introduction
However, the server has the prerogative to return additional search results if it believes them to be relevant.
We don't have any specific updates right now, but there may be changes coming in Soon(tm).
I'm surprised the last one is returning any results, but regarding the first two searches, this is quite possible or even expected with Epic. Epic has special logic in the background that evaluates the parameter values you pass in against certain criteria, such as whether the name matches exactly, the name is similar, the birthdate matches exactly, etc. As a result, oftentimes not only exact matches but also similar matches will be returned by the Patient.Search API. The weighting of the criteria is customizable by Epic customer, so some may have stricter logic than others.
I'd recommend always validating the returned result against your input parameters to verify you are working with an exact match.
I am writing an app that has a sign-up form. This article made me doubt everything I knew about human names. My question is: does a person's name necessarily have positive length? Or can I validate names in this way and be confident that I have not denied anyone their identity?
P.S.: one might ask why am I validating at all. The answer is that this is for a school project and proper validation is a part of the mark. The article above proves that person's name can be pretty much any string of positive length but I don't know if zero length is OK.
With all types of programming, you have to draw a distinction between what is meaningful in the real world, and what is meaningful for your software solution.
How the data is to be used will validate what type of validation is required.
For instance, if your software interfaces with a government API, and the government API requires a first name and surname, you should do the same.
If you're interacting with bank accounts, you may have a single string which represents that account name, which many or may not be a human name or not, but may have other constraints around length.
If the name is only to be used for display purposes, maybe there is no point to capture the name at all, and instead you should capture a preferred display name (which doesn't needlessly assume a certain number of name components).
When writing software, you should target to make as few assumptions as possible, unless those assumptions will cause an increase in complexity of your software solution. If the software requires people to have non-empty names, then you should validate at the border that this is true.
In addition, if you were my student, you would have already lost marks for conflating null, and an empty string. In this instance, null would represent you lack data about the name, and an empty string would indicate that user has specified that their name is empty.
Also, if you decide not to validate something, you should at least leave a comment to indicate that you thought of it. If you do something unusual, it's possible a future developer may come along and fix the "bug". In addition, this helps you avoid losing marks.
I am new at the idea of programming algorithms. I can work with simplistic ideas, but my current project requires that I create something a bit more complicated.
I'm trying to create a categorization system based on keywords and subsets of 'general' categories that filter down into more detailed categories that requires as little work as possible from the user.
I.E.
Sports >> Baseball >> Pitching >> Nolan Ryan
So, if a user decides they want to talk about "Baseball" and they filter the search, I would like to also include 'Sports"
User enters: "baseball"
User is then taken to Sports >> Baseball
Now I understand that this would be impossible without a living - breathing dynamic program that connects those two categories in some way. It would also require 'some' user input initially, and many more inputs throughout the lifetime of the software in order to maintain it and keep it up to date.
But Alas, asking for such an algorithm would be frivolous without detailing very concrete specifics about what I'm trying to do. And i'm not trying to ask for a hand out.
Instead, I am curious if people are aware of similar systems that have already been implemented and if there is documentation out there describing how it has been done. Or even some real life examples of your own projects.
In short, I have a 'plan' but it requires more user input than I really want. I feel getting more info on the subject would be the best course of action before jumping head first into developing this program.
Thanks
IMHO It isn't as hard as you think. What you want is called Tagging and you can do it Automatically just by setting the correlation between tags (i.e. a Tag can have its meaningful information plus its reation with other ones. Then, if user select a Tag well, you related that with others via looking your ADT collection (can be as simple as an array).
Tag:
Sport
Related Tags
Football
Soccer
...
I'm hoping this helps!
It sounds like what you want to do is create a tree/menu structure, and then be able to rapidly retrieve the "breadcrumb" for any given key in the tree.
Here's what I would think:
Create the tree with all the branches. It's okay if you want branches to share keys - as long as you can give the user a "choice" of "Multiple found, please choose which one... ?"
For every key in the tree, generate the breadcrumb. This is time-consuming, and if the tree is very large and updating regularly then it may be something better done offline, in the cloud, or via hadoop, etc.
Store the key and the breadcrumb in a key/value store such as redis, or in memory/cached as desired. You'll want every value to have an array if you want to share keys across categories/branches.
When the user selects a key - the key is looked up in the store, and if the resulting value contains only one match, then you simply construct the breadcrumb to take the user where you want them to go. If it has multiple, you give them a choice.
I would even say, if you need something more organic, say a user can create "new topic" dynamically from anywhere else, then you might want to not use a tree at all after the initial import - instead just update your key/value store in real-time.
To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. I.e. English capitalizes only the first word of the sentence and proper nouns. For the gramatical rules you could look for all of the words that have the first letter of the word capitalized and check it against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have a specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.
This is silly, but I haven't found this information. If you have names of concepts and suitable references, just let me know.
I'd like to understand how should I validate a given named id for a generic entity, like, say, an email login, just like Yahoo, Google and Microsoft do.
I mean... If you do have an user named foo, trying to create foo2 will be denied, as it is likely to be someone trying to mislead users by using a fake id.
Coming to mind:
Levenshtein Distance
Hamming Distance
You're going to have to take a two pass approach.
The first is a potential RegEx expression to validate that the entity name meets your specifications as much as possible. For example, disallowing certain characters.
The second is to perform some type of fuzzy search during the name creation. This could be as simple as a LIKE '%value%' where clause or as complicated as using some type of full-text search and limiting hits to a certain relevance rating.
That said, I would guess the failure rate (both false positives and false negatives ) match would be high enough to justify not doing this.
Good luck.