How to differentiate code terminology in MEDICAL SERVICE LINES? - sentinel

In the MEDICAL_SERVICE_LINES table, there is a field ‘PROCEDURE’. The data dictionary notes that this is ‘CPT, HCPCS, or ICD-10-PCS (less commonly)’. Is there a field that indicates which of these terminologies the code is from?
Can you use modifiers to help identify it? Or are the code formats the best tool, for example:
5 numbers or 4 numbers and a letter (in that order)
1 letter and 4 numbers (in that order).
This customer receives PLAID and is not in Sentinel. (data dictionary here)

The code formats are the best way to distinguish definitively what type of code it is. The modifiers are not always filled out (some claims may not have modifiers attached to the procedure).
Your layout of the code format is correct (see section HCPCS Coding here for additional confirmation). HCPCS Level 1 consists of CPT codes. HCPCS Level 2/3 is what we typically regard as just "HCPCS".
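As a rough illustration of classifying the PROCEDURE field by format alone, here is a minimal Python sketch. The function name and the regular expressions are assumptions made for this example: the first two patterns follow the formats described above, and the third assumes ICD-10-PCS codes are 7 characters drawn from digits and letters other than I and O. It is not a validated code-set check.

```python
import re

# Assumed patterns, based on the code formats described in this thread.
CPT_RE = re.compile(r"\d{4}[0-9A-Z]")        # 5 digits, or 4 digits then a letter
HCPCS_II_RE = re.compile(r"[A-Z]\d{4}")      # 1 letter then 4 digits
# Assumption: ICD-10-PCS codes are 7 characters, digits or letters except I and O.
ICD10_PCS_RE = re.compile(r"[0-9A-HJ-NP-Z]{7}")


def classify_procedure_code(code):
    """Guess the terminology of a procedure code from its shape alone."""
    value = code.strip().upper()
    if CPT_RE.fullmatch(value):
        return "CPT"
    if HCPCS_II_RE.fullmatch(value):
        return "HCPCS Level II"
    if ICD10_PCS_RE.fullmatch(value):
        return "ICD-10-PCS"
    return "UNKNOWN"


if __name__ == "__main__":
    for sample in ["99213", "0001F", "J1100", "0DT70ZZ", "???"]:
        print(sample, "->", classify_procedure_code(sample))
```

Anything that falls through to UNKNOWN is worth reviewing by hand rather than forcing into one of the three buckets.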


create a URL shortener with Base 62?

I understood the process to shorten the URL with base 62 at How do I create a URL shortener?.
Steps given are
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).
For this example, I will use 125₁₀ (125 with a base of 10).
Now you have to convert 125₁₀ to X₆₂ (base 62)
My question is: why not just create a unique numerical key and return it? What is the advantage of converting the numerical key to base 62 and ending up with an alphanumeric string?
Is it because the final alphanumeric string will be much shorter than the unique numerical key?
Yes. The idea is to make it short and usable in a URL. A number in base 62 will use fewer characters than the same number in base 10. Notice also that URL shorteners use short hosts, such as
I can see you understand that, yes, a number written in base 62 takes fewer characters than the same number in base 10, just as a number in base 10 takes fewer characters than the same number in base 2 (e.g. 0101 is 3 characters longer than just '5').
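To make the length difference concrete, here is a small Python sketch of the conversion. The alphabet order (a-z, then A-Z, then 0-9) and the function names are assumptions for illustration; any fixed ordering of the 62 characters works as long as encoding and decoding agree.

```python
import string

# Assumed alphabet order: a-z, A-Z, 0-9 (62 characters in total).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
BASE = len(ALPHABET)


def encode_base62(number):
    """Convert a non-negative integer id to its base 62 string."""
    if number < 0:
        raise ValueError("id must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(text):
    """Convert a base 62 string back to the integer id."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


if __name__ == "__main__":
    print(encode_base62(125))            # 125 = 2 * 62 + 1
    print(len(str(10**9)), len(encode_base62(10**9)))
```

With this alphabet, a ten-digit id such as 1,000,000,000 fits in six characters.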
So, I'll answer specifically "Why".
Sometimes a link is shortened to be more visually pleasing. A company worried about their public perception likely doesn't want their links to look like an error code due to how long they are so they resort to shortening. That's why some url shortening services allow you to add your own "vanity url" which customizes the domain name, so that a link can be shortened and branded.
Other times a link is shortened to minimize character count when working with constraints, like Twitter. For example, at my company we shortened the links in our automated Twilio messages because SMS messages that contain more than 160 characters are technically 2 concatenated messages so it is more expensive to send.
And finally, if the link is being shared through a medium that cannot be directly clicked on (e.g. verbally, on paper), making it shorter makes it much easier to type into an address bar manually. (Imagine trying to type the URL to this SO question when someone is reading it to you.) I assume this is also at least partially why the base used for these links usually stops at around 62. If you start including other arbitrary characters to raise the base and consequently make the link marginally shorter, it becomes harder to communicate, read and type. ("" vs "🈲}♠ ")

Algorithm for translating MLB play-by-play records into descriptive text

I'm trying to collect a dataset that could be used for automatically generating baseball articles.
I have play-by-play records of MLB games from that I would like to be written out to plain text, as those that could possibly appear as part of a recap news article.
Here are some examples of the play-by-play records:
The following is what I would like to achieve:
For the first example
I want it to be written out as something like:
"semim001 (Marcus Semien) was on three balls and two strikes in the second inning as the away player. He hit the ball into play after one called strike, one ball, three fouls, and another two balls. The fly ball was caught by the right outfielder."
The plays are formatted in the following way:
The first field is the inning, an integer starting at 1.
The second field is either 0 (for visiting team) or 1 (for home team).
The third field is the Retrosheet player id of the player at the plate.
The fourth field is the count on the batter when this particular event (play) occurred. Most Retrosheet games do not have this information, and in such cases, "??" appears in this field.
The fifth field is of variable length and contains all pitches to this batter in this plate appearance and is described below. If pitches are unknown, this field is left empty, nothing is between the commas.
The sixth field describes the play or event that occurred.
Explanations for all the symbols in the fifth and sixth field can be found on this Retrosheet page.
With Python 3, I've been able to format all the info of invariable length into a formatted sentence, which is all but the last two fields. I'm having difficulty in thinking of an efficient way to unparse (correct me if this is the wrong term to use here) the fifth and sixth fields, the pitches and the events that occurred, due to their variable length and wide variety of things that can occur.
I think I could write out all the rules based on the info on the Retrosheet website, but I'm looking for suggestions for a smarter way to do this. I wrote natural language processing as tags, hoping this could be a trivial problem in that field. Any pointers will be greatly appreciated!
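One table-driven way to start on the pitch field is sketched below in Python. The sample record is hypothetical, reconstructed from the sentence about semim001 above, and the pitch-code table covers only a handful of codes (B, C, S, F, X) as I understand them from the Retrosheet documentation; the event field would need its own, larger rule table.

```python
from itertools import groupby

# Partial, assumed mapping of pitch codes to (singular, plural) phrases.
PITCH_NAMES = {
    "B": ("ball", "balls"),
    "C": ("called strike", "called strikes"),
    "S": ("swinging strike", "swinging strikes"),
    "F": ("foul", "fouls"),
}


def parse_play(record):
    """Split one play record into the six fields described above."""
    inning, team, player_id, count, pitches, event = record.split(",", 5)
    return {
        "inning": int(inning),
        "home": team == "1",
        "player_id": player_id,
        "count": count,
        "pitches": pitches,
        "event": event,
    }


def describe_pitches(pitches):
    """Turn a pitch string into a phrase, merging runs of the same code."""
    phrases = []
    for code, run in groupby(pitches):
        length = len(list(run))
        if code == "X":  # ball put into play by the batter
            phrases.append("ball put in play")
            continue
        singular, plural = PITCH_NAMES.get(code, ("unknown pitch", "unknown pitches"))
        phrases.append(f"{length} {singular if length == 1 else plural}")
    return ", ".join(phrases)


if __name__ == "__main__":
    play = parse_play("2,0,semim001,32,CBFFFBBX,9/F")  # hypothetical record
    print(describe_pitches(play["pitches"]))
```

Grouping consecutive identical codes keeps the sentence short, and unknown codes degrade to a placeholder instead of failing, which makes gaps in the table easy to spot.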

iphone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
HongKong: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adapted to different countries, e.g. "()", "-" and space.
Note that I can handle the parsing logic myself; what I don't know is where to find the telephone number formats for most countries.
Where could I find such information, or open source code that implements it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example:
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."
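The table idea above can be sketched in Python as follows. The keys (country name plus digit count) and the function name are assumptions for illustration, and the three patterns are simply the ones listed in this answer, not an authoritative list of national formats.

```python
# Pattern table keyed by (country, number of digits); "d" marks a digit slot.
FORMATS = {
    ("usa", 11): "+d (ddd) ddd-dddd",
    ("hk", 11): "+ddd ddd-ddd-dd",
    ("china_mobile", 13): "+dd ddd-dddd-dddd",
}


def format_phone(country, digits):
    """Drop the digits of a phone number into the pattern for its country."""
    pattern = FORMATS.get((country, len(digits)))
    if pattern is None:
        raise KeyError(f"no format for {country!r} with {len(digits)} digits")
    if pattern.count("d") != len(digits):
        raise ValueError("pattern and digit count disagree")
    remaining = iter(digits)
    pieces = []
    for char in pattern:
        # Each "d" consumes the next digit; everything else is copied as-is.
        pieces.append(next(remaining) if char == "d" else char)
    return "".join(pieces)


if __name__ == "__main__":
    print(format_phone("usa", "17328653286"))
    print(format_phone("china_mobile", "8613569523685"))
```

Keying on the digit count as well as the country is one way to handle countries whose numbers come in several lengths.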

Algorithms or Patterns for reading text

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.
These companies email the prices to our client each day, and of course the emails are all formatted differently. It is impossible to have any of the companies change their format - they will not do it.
Some look sort of like this:
This is example text that could be many lines long...
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59
Others look sort of like this:
PRODUCT       PRICE   + / -
------------  -------- -------
Location 1
1             2007.30 +048.20
2             2022.50 +048.20
Maybe some multiline text here about a holiday or something...
Location 2
1             2017.30 +048.20
2             2032.50 +048.20
Currently we have individual parsers written for each company's email format. But these formats change slightly pretty frequently. We can't count on the prices being on the same row or column each time.
It's trivial for us to look at the emails and determine which price goes with which product at which location. But not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks - I'll learn what I need to to make this work, I just don't know what I need to learn. Is this a lex/parsing problem? More similar to OCR?
The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really need the code to just be flexible enough that a new product line or whitespace or something doesn't make the file unparsable.
Thanks for any suggestions about where to start.
I think this problem would be suitable for a proper parser generator. Regular expressions are too difficult to test and debug if they go wrong. However, I would go for a parser generator that is as simple to use as if it were part of the language.
For this type of task I would go with pyparsing: it gives you a real parser (it builds recursive-descent parsers rather than LR ones) without a difficult grammar to define, and it has very good helper functions. The code is easy to read too.
from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums

aaa = """ This is example text that could be many lines long...
another line
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
stuff in here you want to ignore
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59 """

# In place of "Location" you could use any other kind of match, such as a regex.
result = (
    SkipTo("Location").suppress()               # ignore everything before a block
    + OneOrMore(Word(alphas) + Word(nums))      # "Location 1", "Product 1", ...
    + OneOrMore(Word(nums + "$."))              # "$20.99", "$21.99", ...
)
all_results = OneOrMore(Group(result))
parsed = all_results.parseString(aaa)
for block in parsed:
    print(block)
This returns a list of lists.
['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
You can group things as you want but for simplicity I have just returned lists. Whitespace is ignored by default which makes things a lot simpler.
I do not know if there are equivalents in other languages.
You have given two pattern samples for text files.
I think these can be handled with scripting.
Something like: AWK, sed, grep with bash scripting.
One pattern in the first sample,
Section starts with keyword Location [Number]
second line of section has columns describing product names
third line of section has columns with prices for the products
There can be variable number of products per section.
There can be variable number of sections per file.
Products and prices are always on their designated lines of a section.
Whitespace separation identifies the (product,price) column-association.
Number of products in a section matches the number of prices in that section.
The collected data would probably be assimilated in a database.
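The rules listed above for the first sample translate fairly directly into code. Here is a minimal sketch in Python rather than AWK or sed; the function name and the three regular expressions are assumptions based only on the sample shown in the question (in particular, that product names look like a word followed by a number).

```python
import re

# Assumed patterns, derived only from the first sample email.
LOCATION_RE = re.compile(r"Location\s+(\d+)")
PRODUCT_RE = re.compile(r"[A-Za-z]+ \d+")       # e.g. "Product 1"
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")        # e.g. "$20.99"


def parse_price_email(text):
    """Return (location, product, price) rows from a first-style email."""
    lines = [line.strip() for line in text.splitlines()]
    rows = []
    for index, line in enumerate(lines):
        match = LOCATION_RE.fullmatch(line)
        if not match or index + 2 >= len(lines):
            continue
        # Line after the location: product names. Line after that: prices.
        products = PRODUCT_RE.findall(lines[index + 1])
        prices = PRICE_RE.findall(lines[index + 2])
        if not products or len(products) != len(prices):
            raise ValueError(f"product/price mismatch after {line!r}")
        location = f"Location {match.group(1)}"
        rows.extend(
            (location, product, float(price))
            for product, price in zip(products, prices)
        )
    return rows


if __name__ == "__main__":
    sample = """This is example text that could be many lines long...
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59"""
    for row in parse_price_email(sample):
        print(row)
```

Raising on a product/price count mismatch is deliberate: when a sender changes their layout, a loud failure is easier to deal with than silently wrong prices in the database.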
The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.
Trying to write the parse engine more generally than that would, I think, be skirting the edge of overprogramming it.

Creating an id from name and address data. Hash/Digest

My problem:
I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and be represented in the smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important - doesn't need to be one-way) and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?
In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?
>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
Application Context (for those that are interested):
This will be used for a record linkage app. Given an input name and address we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match I need to generate this pseudo/generated/derived id from the search input (entity's name and address data). Every search record should result in an output record with either a real id (the actual database id resulting from a match/link) or this pseudo/generated/derived id. The pseudo id will be prefixed with a character (e.g. N) to differentiate it from a real id.
I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well-studied hashing algorithms, the length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones are generally less collision-prone than something you could come up with yourself.
Use a cryptographic hash for its collision resistance, not its other qualities
Use as many bytes from the hash as you want (truncate)
convert to alpha-numeric characters
You can also truncate the alpha-numeric string instead of the hash
An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.
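import base64, hashlib, re

N_HASH_CHARS = 11  # how many alpha-numeric characters to keep

def digest(name, address):
    raw = hashlib.md5((name + "|" + address).encode("utf-8")).digest()
    b64 = base64.b64encode(raw).decode("ascii")
    alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", b64)
    return alnum_hash[:N_HASH_CHARS]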
How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log(62,2)). 11 characters give you 65.5 bits of entropy, so by the birthday bound a collision only becomes likely once you approach roughly 2**32.7 ids (about 7 billion).
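The arithmetic behind those figures can be checked with a few lines of Python. This is a sketch using the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2**bits)), and it assumes the ids are uniformly distributed; the function names are made up for this example.

```python
import math


def entropy_bits(n_chars, alphabet_size=62):
    """Bits of entropy in an id of n_chars uniformly random characters."""
    return n_chars * math.log2(alphabet_size)


def collision_probability(n_ids, bits):
    """Birthday approximation of the chance that any two ids collide."""
    exponent = -n_ids * (n_ids - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values


if __name__ == "__main__":
    bits = entropy_bits(11)
    print(f"{bits:.1f} bits")
    for n_ids in (10**6, 10**9, 7 * 10**9):
        print(n_ids, collision_probability(n_ids, bits))
```

Running it for your expected number of records shows how much margin a given id length leaves.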
A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.
I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:
Create a SHA1 hash from the data, and store it in a table with a serial number field.
Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.
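A minimal sketch of that lookup table, using Python's built-in sqlite3 module with an in-memory database. The table and column names and the function name are assumptions for illustration; any database with a unique index and an auto-incrementing key would serve.

```python
import hashlib
import sqlite3


def open_id_table():
    """Create an in-memory table mapping a hash to a serial number."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ids ("
        " serial INTEGER PRIMARY KEY AUTOINCREMENT,"
        " hash TEXT UNIQUE NOT NULL)"
    )
    return conn


def get_or_create_id(conn, data):
    """Return the serial for this data, inserting a new row if needed."""
    digest = hashlib.sha1(data.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE does nothing when the hash is already present.
    conn.execute("INSERT OR IGNORE INTO ids (hash) VALUES (?)", (digest,))
    row = conn.execute(
        "SELECT serial FROM ids WHERE hash = ?", (digest,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_id_table()
    print(get_or_create_id(conn, "Oscar|Grouch|1|Sesame|12345"))
    print(get_or_create_id(conn, "Oscar|Grouch|1|Sesame|12345"))  # same id again
```

The serial stays short regardless of hash length, at the cost of needing the table to be available wherever ids are generated.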
I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?
So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).
If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.
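As a rough idea of what a canonical form could look like, here is a small Python sketch covering only letter case, punctuation and whitespace. The function name is an assumption, and real address canonicalization (for example treating "St" and "Street" as the same) needs considerably more than this.

```python
import re


def canonicalize(value):
    """Normalize case, punctuation and whitespace before hashing or comparing."""
    text = value.casefold()                   # standardize letter case
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


if __name__ == "__main__":
    print(canonicalize("  Oscar   GROUCH "))
    print(canonicalize("1 Sesame St."))
```

Applying the same function to both the stored records and the search input means trivial differences no longer produce different hashes.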
EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.
Is that feasible in your situation?
Well, if there's more than one person at the same address with the same name, you're toast here (without adding code to detect this and add a discriminator of some kind).
But assuming that is not an issue, the street address and zip code portion of the full address is sufficient to guarantee uniqueness there, so adding enough data from the name should take care of the issue...
Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address or person encountered... and then use the autogenerated alpha-numeric key...
You could use AAAAA01 for the first person at the first address,
AAAAA02 for the second person at the first address,
AAAAB07 for the seventh resident at the second address, etc.
If you don't have any way to generate and maintain these entity-key mappings, then you need to use the full street address/zip and full name, or a hash value of the same, although the hash value approach has a small chance of generating duplicates...
