Algorithms or Patterns for reading text - algorithm

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.
These companies email the prices to our client each day, and of course the emails are all formatted differently. It is impossible to have any of the companies change their format - they will not do it.
Some look sort of like this:
This is example text that could be many lines long...
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59
Others look sort of like this:
PRODUCT       PRICE   + / -
------------  -------- -------
Location 1
1             2007.30 +048.20
2             2022.50 +048.20
Maybe some multiline text here about a holiday or something...
Location 2
1             2017.30 +048.20
2             2032.50 +048.20
Currently we have individual parsers written for each company's email format. But these formats change slightly pretty frequently. We can't count on the prices being on the same row or column each time.
It's trivial for us to look at the emails and determine which price goes with which product at which location. But not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks - I'll learn what I need to to make this work, I just don't know what I need to learn. Is this a lex/parsing problem? More similar to OCR?
The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really need the code to just be flexible enough that a new product line or whitespace or something doesn't make the file unparsable.
Thanks for any suggestions about where to start.

I think this problem would be suitable for proper parser generator. Regular expressions are too difficult to test and debug if they go wrong. However, I would go for a parser generator that is simple to use as if it was part of a language.
For these type of tasks I would go with pyparsing as its got the power of a full lr parser but without a difficult grammer to define and very good helper functions. The code is easy to read too.
from pyparsing import *
aaa =""" This is example text that could be many lines long...
another line
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
stuff in here you want to ignore
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59 """
result = SkipTo("Location").suppress() \
# in place of "location" could be any type of match like a re.
+ OneOrMore(Word(alphas) + Word(nums)) \
+ OneOrMore(Word(nums+"$.")) \
all_results = OneOrMore(Group(result))
parsed = all_results.parseString(aaa)
for block in parsed:
print block
This returns a list of lists.
['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
You can group things as you want but for simplicity I have just returned lists. Whitespace is ignored by default which makes things a lot simpler.
I do not know if there are equivalents in other languages.

You have given two pattern samples for text files.
I think these can be handled with scripting.
Something like: AWK, sed, grep with bash scripting.
One pattern in the first sample,
Section starts with keyword Location [Number]
second line of section has columns describing product names
third line of section has columns with prices for the products
There can be variable number of products per section.
There can be variable number of sections per file.
Products and prices are always on their designated lines of a section.
Whitespace separation identifies the (product,price) column-association.
Number of products in a section matches the number of prices in that section.
The collected data would probably be assimilated in a database.

The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.
Trying to write the parse engine more generally than that would, I think, be skirting the edge of overprogramming it.

Related

How to differentiate code terminology in MEDICAL SERVICE LINES?

In the MEDICAL_SERVICE_LINES table, there is a field ‘PROCEDURE’. The data dictionary notes that this is ‘CPT, HCPCS, or ICD-10-PCS (less commonly)’. Is there a field that indicates which of these terminologies the code is from?
Can you use modifiers to help identify? Or are the code formats the best tool like:
CPT:
5 numbers or 4 numbers and a letter (in that order)
HCPCS:
1 letter and 4 numbers (in that order).
This customer receives PLAID and is not in Sentinel. (data dictionary here)
The code formats would be the best to distinguish definitively what type of code it is. The modifiers are not filled out all the time (some claims may not have modifiers attached to the procedure).
Your layout of the code format is correct (see section HCPCS Coding here for additional confirmation). HCPCS Level 1 is comprised of CPT codes. HCPCS Level 2/3 is what we typically regard as just "HCPCS"

Algorithm for translating MLB play-by-play records into descriptive text

I'm trying to collect a dataset that could be used for automatically generating baseball articles.
I have play-by-play records of MLB games from retrosheet.org that I would like to be written out to plain text, as those that could possibly appear as part of a recap news article.
Here are some examples of the play-by-play records:
play,2,0,semim001,32,.CBFFFBBX,9/F
play,2,0,phegj001,01,FX,S7/G
play,2,0,martn003,01,CX,3/G
play,2,1,youne003,00,,NP
The following is what I would like to achieve:
For the first example
play,2,0,semim001,32,.CBFFFBBX,9/F,
I want it to be written out as something like:
"semim001 (Marcus Semien) was on three balls and two strikes in the second inning as the away player. He hit the ball into play after one called strike, one ball, three fouls, and another two balls. The fly ball was caught by the right outfielder."
The plays are formatted in the following way:
The first field is the inning, an integer starting at 1.
The second field is either 0 (for visiting team) or 1 (for home team).
The third field is the Retrosheet player id of the player at the plate.
The fourth field is the count on the batter when this particular event (play) occurred. Most Retrosheet games do not have this information, and in such cases, "??" appears in this field.
The fifth field is of variable length and contains all pitches to this batter in this plate appearance and is described below. If pitches are unknown, this field is left empty, nothing is between the commas.
The sixth field describes the play or event that occurred.
Explanations for all the symbols in the fifth and sixth field can be found on this Retrosheet page.
With Python 3, I've been able to format all the info of invariable length into a formatted sentence, which is all but the last two fields. I'm having difficulty in thinking of an efficient way to unparse (correct me if this is the wrong term to use here) the fifth and sixth fields, the pitches and the events that occurred, due to their variable length and wide variety of things that can occur.
I think I could write out all the rules based on the info on the Retrosheet website, but I'm looking for suggestions for a smarter way to do this. I wrote natural language processing as tags, hoping this could be a trivial problem in that field. Any pointers will be greatly appreciated!

Reporting Multiple Values & Sorting

Having a bit of an issue and unsure if it's actually possible to do.
I'm working on a file that I will enter target progression vs actual target reporting the % outcome.
PAGE 1
¦NAME ¦TAR 1 %¦TAR 2 %¦TAR 3 %¦TAR 4 %¦OVERALL¦SUB 1¦SUB 2¦SUB 3¦
¦NAME1¦ 114%¦ 121%¦ 100%¦ 250%¦ 146%¦ 2¦ 0¦ 0%¦
¦NAME2¦ 88%¦ 100%¦ 90%¦ 50%¦ 82%¦ 0¦ 1¦ 0%¦
¦NAME3¦ 82%¦ 54%¦ 64%¦ 100%¦ 75%¦ 6¦ 6¦ 15%¦
¦NAME4¦ 103%¦ 64%¦ 56%¦ 43%¦ 67%¦ 4¦ 4¦ 24%¦
¦NAME5¦ 87%¦ 63%¦ 89%¦ 0%¦ 60%¦ 3¦ 2¦ 16%¦
Now I already have it sorting all rows by the Overall % column so I can quickly see at a glance but I am creating a second page that I need to reference points.
So on the second page I would like to somehow sort and reference different columns for example
PAGE 2
TOP TAR 1¦Name of top %¦Top %¦
TOP TAR 2¦Name of top %¦Top %¦
Is something like this possible to do?
Essentially I'm creating an Employee of the Month form that automatically works out who has topped what.
I'm willing to drop a paypal donation for whoever can figure this out for me as I've been doing it manually every month and would appreciate the time saved
I don't think a complicated array formula is necessary for this - I am suggesting a fairly standard Index/Match approach.
First set up the row titles - you can just copy and transpose them from Page 1, or use a formula in A2 of Page 2 like
=transpose('Page 1'!B1:E1)
The use them in an index/match to get the data in the corresponding column of the main sheet and find its maximum (in C2)
=max(index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)))
Finally look up the maximum in the main sheet to find the corresponding name:
=index('Page 1'!A:A,match(C2,index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)),0))
If you think there could be a tie for first place with two or more people getting the same score, you could use a filter to get the different names:
So if the max score is in B8 this time (same formula)
=max(index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0)))
the different names could be spread across the corresponding row using transpose (in C8)
=ArrayFormula(TRANSPOSE(filter('Page 1'!A:A,index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0))=B8)))
I have changed the test data slightly to show these different scenarios
Results

Summing XML Data in MSWord Mail Merge

I have a report card written in Word that uses an XML file for its input. In the XML file, if a student remains in the same section all three trimesters there will be one node for that class; if they change sections at the trimester they'll have one node for each section. The nodes look something like this (greatly simplified):
<ReportCardSectionFB Abs1="2" Abs2="11" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Jones, Jennifer" TermCode="Year" SectionID="ELMATH1-4" />
<ReportCardSectionFB Abs1="1.50" Abs2="6" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Smith, Tina" TermCode="Year" SectionID="ELMATH1-3" />
There is no indicator within the XML as to which trimester the node belongs to.
In the Word document, we're pulling the absence data with the following mail merge command:
{MERGEFIELD "ReportCardSectionFB[#PeriodStart='3']/ #Abs1" \# 0.# \* MERGEFORMAT }
That's not working in this situation: it only gets the absence data from the first node it comes across, i.e.: 2.0. Is there a way to get the sum of #Abs1 for all period 3 classes, i.e.: 3.5? If not, is there a way to only get the last #Abs1 for period 3, i.e.: 1.5?
I recommend you to use this 3rd party product, which can use xml as input and is capable of merging it with MS Word template. I is also much more powerful than the built-in Word's mail merge. You can see some examples here.
You could also try summing the absences in Synergy - there's a new checkbox under AttDef1, 2, etc. that adds up all the absences for the data range - Include all day data for the entire date range regardless of section enrollment or section timeframe. That way the absences should be the same for each section, if that works for your district.
You can also try the SET function in Word to nest the MERGEFIELDS as bookmarks and use the Word operator functions to then add the bookmarks.

iphone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
HongKong: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adopted to different countries, e.g. "()", "-" and space.
Note the parsing logic is doable to me, however, I don't know where to get the knowledge of most countries' telephone number format.
where could i found such knowledge, or an open source code that implemented it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."

Resources