Household mail merge (code golf) - code-golf

I wrote some mail merge code the other day and although it works I'm a bit turned off by the code. I'd like to see what it would look like in other languages.
As input, the routine takes a list of contacts:
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
It will then merge lines with the same address and surname into one record (assume the rows are unsorted). The code should also be flexible enough that fields can be supplied in any order (so it will need to take field indexes as parameters). For a family of two it concatenates both first name fields. For a family of three or more the first name is set to "The" and the last name is set to "surname Family". The expected output is:
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
My C# implementation of this is:
var source = File.ReadAllLines(@"sample.csv").Select(l => l.Split(','));
var merged = HouseholdMerge(source, 0, 1, new[] {1, 2, 3, 4, 5});
public static IEnumerable<string[]> HouseholdMerge(IEnumerable<string[]> data, int fnIndex, int lnIndex, int[] groupIndexes)
{
    Func<string[], string> groupby = fields => String.Join("", fields.Where((f, i) => groupIndexes.Contains(i)));
    var groups = data.OrderBy(groupby).GroupBy(groupby);
    foreach (var group in groups)
    {
        string[] result = group.First().ToArray();
        if (group.Count() == 2)
        {
            result[fnIndex] += " and " + group.ElementAt(1)[fnIndex];
        }
        else if (group.Count() > 2)
        {
            result[fnIndex] = "The";
            result[lnIndex] += " Family";
        }
        yield return result;
    }
}
I don't like how I've had to do the groupby delegate. I'd like it if C# had some way to convert a string expression to a delegate, e.g. Func groupby = f => "f[2] + f[3] + f[4] + f[5] + f[1];" I have a feeling something like this can probably be done in Lisp or Python. I look forward to seeing nicer implementations in other languages.
Edit: Where did the community wiki checkbox go? Some mod please fix that.
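On the grouping-key complaint above: in Python, one way to build a key function directly from a list of field indexes is operator.itemgetter. This is only a minimal sketch, assuming the rows are already split into lists and that at least one group index is given; the function name household_merge and its parameter names are illustrative, not taken from any answer here.

```python
from collections import OrderedDict
from operator import itemgetter

def household_merge(rows, fn_idx, ln_idx, group_idxs):
    """Merge rows that share the same values in the group_idxs fields."""
    key = itemgetter(*group_idxs)  # key function built from the indexes
    groups = OrderedDict()         # keeps first-seen order of households
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    for members in groups.values():
        result = list(members[0])
        if len(members) == 2:
            result[fn_idx] += " and " + members[1][fn_idx]
        elif len(members) > 2:
            result[fn_idx] = "The"
            result[ln_idx] += " Family"
        yield result

rows = [line.split(",") for line in [
    "Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004",
    "Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7",
    "Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004",
]]
merged = list(household_merge(rows, 0, 1, [1, 2, 3, 4, 5, 6, 7]))
for row in merged:
    print(",".join(row))
```

Because the key is just the tuple of selected fields, changing the field order only means passing different indexes.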

Ruby — 181 155
Name/surname indexes are set in the code as a and b. Input data is read from ARGF.
a,b=0,1
[*$<].map{|i|i.strip.split ?,}.group_by{|i|i.rotate(a).drop 1}.map{|i,j|k,l,m=j
k[a]+=' and '+l[a]if l
(k[a]='The';k[b]+=' Family')if m
puts k*','}

Python - not golfed
I'm not sure what the order of the rows should be if the indices are not 0 and 1 for the input file
import csv
from collections import defaultdict
class HouseHold(list):
    def __init__(self, fn_idx, ln_idx):
        self.fn_idx = fn_idx
        self.ln_idx = ln_idx
    def append(self, item):
        self.item = item
        list.append(self, item[self.fn_idx])
    def get_value(self):
        fn_idx = self.fn_idx
        ln_idx = self.ln_idx
        item = self.item
        addr = [j for i,j in enumerate(item) if i not in (fn_idx, ln_idx)]
        if len(self) < 3:
            fn, ln = " and ".join(self), item[ln_idx]
        else:
            fn, ln = "The", item[ln_idx]+" Family"
        return [fn, ln] + addr
def source(fname):
    with open(fname) as in_file:
        for item in csv.reader(in_file):
            yield item
def household_merge(src, fn_idx, ln_idx, groupby):
    res = defaultdict(lambda:HouseHold(fn_idx, ln_idx))
    for item in src:
        key = tuple(item[x] for x in groupby)
        res[key].append(item)
    return res.values()
data = household_merge(source("sample.csv"), 0, 1, [1,2,3,4,5,6,7])
with open("result.csv", "w") as out_file:
    csv.writer(out_file).writerows(item.get_value() for item in data)

Python - 178 chars
import sys
d={}
for x in sys.stdin:F,c,A=x.partition(',');d[A]=d.get(A,[])+[F]
print"".join([" and ".join(v)+c+A,"The"+c+A.replace(c,' Family,',1)][2<len(v)]for A,v in d.items())
Output
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004

Python 2.6.6 - 287 Characters
This assumes you can hard code a filename (named i). If you want to take input from command line, this goes up ~16 chars.
from itertools import*
for z,g in groupby(sorted([l.split(',')for l in open('i').readlines()],key=lambda x:x[1:]), lambda x:x[2:]):
 l=list(g);r=len(l);k=','.join(z);o=l[0]
 if r>2:print'The,'+o[1],"Family,"+k,
 elif r>1:print o[0],"and",l[1][0]+","+o[1]+","+k,
 else:print','.join(o),
Output
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
I'm sure this could be improved upon, but it is getting late.

Haskell - 341 321
(Changes as per comments).
Unfortunately Haskell has no standard split function which makes this rather long.
Input to stdin, output on stdout.
import List
import Data.Ord
main=interact$unlines.e.lines
s[]=[]
s(',':x)=s x
s l@(x:y)=let(h,i)=break(==k)l in h:(s i)
t[]=[]
t x=tail x
h=head
m=map
k=','
e l=m(t.(>>=(k:)))$(m c$groupBy g$sortBy(comparing t)$m s l)
c(x:[])=x
c(x:y:[])=(h x++" and "++h y):t x
c x="The":((h$t$h x)++" Family"):(t$t$h x)
g a b=t a==t b

Lua, 434 bytes
x,y=1,2 s,p,r,a=string.gsub,pairs,io.read,{}for j,b,c,d,e,f,g,h,i in r('*a'):gmatch('('..('([^,]*),'):rep(7)..'([^,]*))\n')
do k=s(s(s(j,b,''),c,''),'[,%s]','')for l,m in p(a)do if not m.f and (m[y]:match(c) and m[9]==k) then z=1
if m.d then m[x]="The"m[y]=m[y]..' family'm.f=1 else m[x]=m[x].." and "..b m.d=1 end end end if not z then
a[#a+1]={b,c,d,e,f,g,h,i,k} end z=nil end for k,v in p(a)do v[9]=nil print(table.concat(v,','))end

Related

Automated Scheduling

When I run the function with no names entered at the input prompts, it gives everyone the appropriate time based on the variable listed. If an input variable contains a name, it gives everyone time off instead of just that individual.
There are some lists associated with this as well, and that part works fine.
These are the required resources:
from openpyxl import Workbook
from datetime import timedelta, datetime
import random
def add_agents_w1():
    num = 4
    c = ["B","C","D","E","F"]
    tow1 = input("IF anyone taking time off enter 1st person now: \n")
    tow12 = input("If someone else is taking time off enter 2nd person now: \n")
    for x in tl:
        ws1[f"A{num}"] = x
        ws1[f"I{num}"] = x
        for f in c:
            if x in tow1 or tow12:
                ws1[f'{f}{num}'] = "OFF"
            else:
                ws1[f"{f}{num}"] = s2t
        num += 1
    wb.save(dest_filename)
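One likely cause, sketched under the assumption that tow1 and tow12 are plain strings returned by input(): Python parses `x in tow1 or tow12` as `(x in tow1) or tow12`, so any non-empty tow12 makes the condition truthy for every agent. The helper name is_off and the sample names below are hypothetical.

```python
def is_off(agent, *names_off):
    """True if agent exactly matches one of the non-empty names entered."""
    return agent in [name.strip() for name in names_off if name.strip()]

tow1, tow12 = "Ann", "Bob"      # stand-ins for the two input() answers
agents = ["Ann", "Bob", "Cid"]  # stand-in for the tl list

# Original form: (x in tow1) or tow12, truthy whenever tow12 is non-empty.
buggy = [bool(x in tow1 or tow12) for x in agents]
# Explicit membership test against both names.
fixed = [is_off(x, tow1, tow12) for x in agents]

print(buggy)
print(fixed)
```

Comparing against a list of names also avoids substring matches, since `"Ann" in "Anna"` is true for strings.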

Extract multiple protein sequences from a Protein Data Bank along with Secondary Structure

I want to extract protein sequences and their corresponding secondary structure from any Protein Data Bank, say RCSB. I just need short sequences and their secondary structure. Something like:
ATRWGUVT Helix
It is fine even if the sequences are long, but I want a tag at the end that denotes the secondary structure. Is there any programming tool or anything else available for this?
As I've shown above I want only this much minimal information. How can I achieve this?
import sys
from Bio.PDB import *
from distutils import spawn
Extract sequence:
def get_seq(pdbfile):
    p = PDBParser(PERMISSIVE=0)
    structure = p.get_structure('test', pdbfile)
    ppb = PPBuilder()
    seq = ''
    for pp in ppb.build_peptides(structure):
        seq += pp.get_sequence()
    return seq
Extract secondary structure with DSSP as explained earlier:
def get_secondary_struc(pdbfile):
    # get secondary structure info for whole pdb.
    if not spawn.find_executable("dssp"):
        sys.stderr.write('dssp executable needs to be in folder')
        sys.exit(1)
    p = PDBParser(PERMISSIVE=0)
    ppb = PPBuilder()
    structure = p.get_structure('test', pdbfile)
    model = structure[0]
    dssp = DSSP(model, pdbfile)
    count = 0
    sec = ''
    for residue in model.get_residues():
        count = count + 1
        # print residue,count
        a_key = list(dssp.keys())[count - 1]
        # index 2 of each DSSP entry is the secondary structure code
        sec += dssp[a_key][2]
    print(sec)
    return sec
This should print both sequence and secondary structure.
You can use DSSP.
The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is:
H = α-helix
B = residue in isolated β-bridge
E = extended strand, participates in β ladder
G = 3-helix (3₁₀ helix)
I = 5 helix (π-helix)
T = hydrogen bonded turn
S = bend
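To get output in the "ATRWGUVT Helix" shape asked for, the sequence and the DSSP string can be cut into runs of identical codes. This is a minimal sketch, assuming both strings have the same length and are aligned residue by residue (which is not guaranteed when they come from different builders); the label_segments helper, the DSSP_NAMES labels and the sample strings are made up for illustration.

```python
from itertools import groupby

# Labels follow the DSSP summary above; "-" (no assigned structure) is an assumption.
DSSP_NAMES = {
    "H": "Helix", "B": "Bridge", "E": "Strand", "G": "3-10 helix",
    "I": "Pi helix", "T": "Turn", "S": "Bend", "-": "Coil",
}

def label_segments(seq, sec):
    """Split seq into runs sharing one DSSP code; return (subsequence, label) pairs."""
    if len(seq) != len(sec):
        raise ValueError("sequence and secondary structure must be the same length")
    segments, start = [], 0
    for code, run in groupby(sec):
        length = len(list(run))
        segments.append((seq[start:start + length], DSSP_NAMES.get(code, code)))
        start += length
    return segments

for subseq, label in label_segments("ATRWGAVTKL", "HHHHHHTTEE"):
    print(subseq, label)
```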

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how gensim's TaggedDocument and LabeledSentence work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog!
# imports assumed; LabeledSentence exists only in older gensim releases
import os
import random
from gensim.models.doc2vec import Doc2Vec, LabeledSentence, TaggedDocument
from nltk.corpus import stopwords
class MyLabeledSentences(object):
    def __init__(self, dirname, dataDct={}, sentList=[]):
        self.dirname = dirname
        self.dataDct = {}
        self.sentList = []
    def ToArray(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for item_no, sentence in enumerate(fin):
                    self.sentList.append(LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no]))
        return self.sentList
class MyTaggedDocument(object):
    def __init__(self, dirname, dataDct={}, sentList=[]):
        self.dirname = dirname
        self.dataDct = {}
        self.sentList = []
    def ToArray(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for item_no, sentence in enumerate(fin):
                    self.sentList.append(TaggedDocument([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no]))
        return self.sentList
sentences = MyLabeledSentences(some_dir_name)
model_l = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
sentences_l = sentences.ToArray()
model_l.build_vocab(sentences_l)
for epoch in range(15):
    random.shuffle(sentences_l)
    model_l.train(sentences_l)
    model_l.alpha -= 0.002 # decrease the learning rate
    model_l.min_alpha = model_l.alpha
sentences = MyTaggedDocument(some_dir_name)
model_t = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
sentences_t = sentences.ToArray()
model_t.build_vocab(sentences_t)
for epoch in range(15):
    random.shuffle(sentences_t)
    model_t.train(sentences_t)
    model_t.alpha -= 0.002 # decrease the learning rate
    model_t.min_alpha = model_t.alpha
My question is: is model_l.docvecs['some_word'] the same as model_t.docvecs['some_word']?
Can you point me to good sources for getting a grasp of how TaggedDocument or LabeledSentence works?
LabeledSentence is an older, deprecated name for the same simple object-type to encapsulate a text-example that is now called TaggedDocument. Any objects that have words and tags properties, each a list, will do. (words is always a list of strings; tags can be a mix of integers and strings, but in the common and most-efficient case, is just a list with a single id integer, starting at 0.)
model_l and model_t will serve the same purposes, having trained on the same data with the same parameters, using just different names for the objects. But the vectors they'll return for individual word-tokens (model['some_word']) or document-tags (model.docvecs['somefilename_NN']) will likely be different – there's randomness in Word2Vec/Doc2Vec initialization and training-sampling, and introduced by ordering-jitter from multithreaded training.
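The point that any object with words and tags will do can be illustrated without gensim at all. This is a minimal sketch, assuming a stand-in type MyDoc (a hypothetical name) built with collections.namedtuple; it only shows the shape of the object and does no training.

```python
from collections import namedtuple

# Hypothetical stand-in with the two attributes the answer describes.
MyDoc = namedtuple("MyDoc", ["words", "tags"])

def to_docs(lines, prefix):
    """Wrap each line as a (words, tags) object with one tag per line."""
    return [MyDoc(words=line.lower().split(), tags=["%s_%s" % (prefix, i)])
            for i, line in enumerate(lines)]

docs = to_docs(["The quick brown fox", "Jumps over the lazy dog"], "sample")
for doc in docs:
    print(doc.tags, doc.words)
```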

Positional Argument Undefined

I am working on a larger project to write code so the user can play Connect 4 against the computer. Right now, the user can choose whether or not to go first and the board is drawn. While trying to make sure that the user can only enter legal moves, I have run into a problem where my function legal_moves() takes 1 positional argument, and 0 are given, but I do not understand what I need to do to make everything agree.
#connect 4
#using my own formating
import random
#define global variables
X = "X"
O = "O"
EMPTY = "_"
TIE = "TIE"
NUM_ROWS = 6
NUM_COLS = 8
def display_instruct():
"""Display game instructions."""
print(
"""
Welcome to the second greatest intellectual challenge of all time: Connect4.
This will be a showdown between your human brain and my silicon processor.
You will make your move known by entering a column number, 1 - 7. Your move
(if that column isn't already filled) will move to the lowest available position.
Prepare yourself, human. May the Schwartz be with you! \n
"""
)
def ask_yes_no(question):
"""Ask a yes or no question."""
response = None
while response not in ("y", "n"):
response = input(question).lower()
return response
def ask_number(question,low,high):
"""Ask for a number within range."""
#using range in Python sense-i.e., to ask for
#a number between 1 and 7, call ask_number with low=1, high=8
low=1
high=NUM_COLS
response = None
while response not in range (low,high):
response=int(input(question))
return response
def pieces():
"""Determine if player or computer goes first."""
go_first = ask_yes_no("Do you require the first move? (y/n): ")
if go_first == "y":
print("\nThen take the first move. You will need it.")
human = X
computer = O
else:
print("\nYour bravery will be your undoing... I will go first.")
computer = X
human = O
return computer, human
def new_board():
board = []
for x in range (NUM_COLS):
board.append([" "]*NUM_ROWS)
return board
def display_board(board):
"""Display game board on screen."""
for r in range(NUM_ROWS):
print_row(board,r)
print("\n")
def print_row(board, num):
"""Print specified row from current board"""
this_row = board[num]
print("\n\t| ", this_row[num], "|", this_row[num], "|", this_row[num], "|", this_row[num], "|", this_row[num], "|", this_row[num], "|", this_row[num],"|")
print("\t", "|---|---|---|---|---|---|---|")
# everything works up to here!
def legal_moves(board):
"""Create list of column numbers where a player can drop piece"""
legals = []
if move < NUM_COLS: # make sure this is a legal column
for r in range(NUM_ROWS):
legals.append(board[move])
return legals #returns a list of legal columns
#in human_move function, move input must be in legal_moves list
print (legals)
def human_move(board,human):
"""Get human move"""
legals = legal_moves(board)
print("LEGALS:", legals)
move = None
while move not in legals:
move = ask_number("Which column will you move to? (1-7):", 1, NUM_COLS)
if move not in legals:
print("\nThat column is already full, nerdling. Choose another.\n")
print("Human moving to column", move)
return move #return the column number chosen by user
def get_move_row(turn,move):
move=ask_number("Which column would you like to drop a piece?")
for m in range (NUM_COLS):
place_piece(turn,move)
display_board()
def place_piece(turn,move):
if this_row[m[move]]==" ":
this_row.append[m[move]]=turn
display_instruct()
computer,human=pieces()
board=new_board()
display_board(board)
move= int(input("Move?"))
legal_moves()
print ("Human:", human, "\nComputer:", computer)
Right at the bottom of the script, you call:
move= int(input("Move?"))
legal_moves()
# ^ no arguments
This does not supply the necessary board argument, hence the error message.
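As a rough illustration of passing the board explicitly, here is one possible shape for legal_moves that depends only on its argument. It is a sketch, not the asker's final design: it assumes the board is a list of columns as built by new_board(), uses a 7-column board, and treats a column as legal while it still holds an empty cell.

```python
NUM_ROWS = 6
NUM_COLS = 7   # assumed value for this sketch
EMPTY = " "

def new_board():
    """Board stored as a list of columns, each a list of NUM_ROWS cells."""
    return [[EMPTY] * NUM_ROWS for _ in range(NUM_COLS)]

def legal_moves(board):
    """Return the indexes of columns that still have at least one empty cell."""
    return [col for col, cells in enumerate(board) if EMPTY in cells]

board = new_board()
board[3] = ["X"] * NUM_ROWS   # fill column 3 completely
legals = legal_moves(board)   # the board is passed as the argument
print(legals)
```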

How to extract string from large file only if specific string appears previous using Ruby?

I am trying to extract information from a large file and cannot figure out how to extract strings from file lines only when a previous line in the same record within the file has been matched by regex. An example of one record in the file is as follows:
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM)
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization, by a patient or research subject, etc,...
This file contains over 20,000 records like this example. I want to identify a small percent of those records using the "MH" field. In this example, I want to find "Informed Consent", and then use regex to extract the information in the FX, AN, and MS fields only within that record. So far, I have opened the file, accessed the hash that the MH terms are stored in, and been able to extract those terms from the records in the file. I also have a functioning regex that identifies the content in the "FX" field.
File.open('mesh_descriptor.bin').each do |file_line|
file_line = file_line.chomp
# read each key of candidate_descriptor_keys
candidate_descriptor_keys.each do |cand_term|
if file_line =~ /^MH\s=\s(#{cand_term})$/
mesh_header = $1
puts "MH from Mesh Descriptor file is: #{mesh_header}"
if file_line =~ /^FX\s=\s(.*)$/
see_also = $1
puts " See_Also from Descriptor file is: #{see_also}"
end
end
end
end
The hash contains the following MH (keys):
candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]
I had success extracting "FX" when I put the statement outside of the "if" statement to extract "MH", but all of the "FX" from the whole file were retrieved - not what I need. I thought putting the "if" statement for "FX" within the previous "if" statement would restrict the results to only those found when the first statement is true, but I am getting no results (also no errors) with this strategy. What I would like as a result is:
> Informed Consent
> Disclosure
> Mental Competency
> Therapeutic Misconception
> Treatment Refusal
as well as the strings within the "AN" and "MS" fields for only those records matching "MH". Any suggestions would be helpful!
I think this may be what you are looking for, but if not, let me know and I will change it. Look especially at the very end to see if that is the sort of output (for input having two records, both with an "MH" field) you want. I will also add an "explanation" section at the end once I have understood your question correctly.
I have assumed that each record begins
*NEW RECORD
and you wish to identify all lines beginning "MH" whose field is one of the elements of:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
and for each match, you would like to print the contents of the lines for the same record that begin with "FX", "AN" and "MS".
Code
NEW_RECORD_MARKER = "*NEW RECORD"
def getem(fname, candidate_descriptor_keys)
line = 0
found_mh = false
File.open(fname).each do |file_line|
file_line = file_line.strip
case
when file_line == NEW_RECORD_MARKER
puts # space between records
found_mh = false
when found_mh == false
candidate_descriptor_keys.each do |cand_term|
if file_line =~ /^MH\s=\s(#{cand_term})$/
found_mh = true
puts "MH from line #{line} of file is: #{cand_term}"
break
end
end
when found_mh
["FX", "AN", "MS"].each do |des|
if file_line =~ /^#{des}\s=\s(.*)$/
see_also = $1
puts " Line #{line} of file is: #{des}: #{see_also}"
end
end
end
line += 1
end
end
Example
Let's begin by creating a file, starting with a "here document" that contains two records:
records =<<_
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization
*NEW RECORD
MH = Obesity
AQ = ES HI LJ PX SN ST
ENTRY = Obesity
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = 1st FX
FX = 2nd FX
AN = Only AN
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Only MS
_
If you puts records you will see it is just a string. (You'll see that I shortened two of them.) Now write it to a file:
File.write('mesh_descriptor', records)
If you wish to confirm the file contents, you could do this:
puts File.read('mesh_descriptor')
We also need to define the array candidate_descriptor_keys:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
We can now execute the method getem:
getem('mesh_descriptor', candidate_descriptor_keys)
MH from line 2 of file is: Informed Consent
Line 7 of file is: FX: Disclosure
Line 8 of file is: FX: Mental Competency
Line 9 of file is: FX: Therapeutic Misconception
Line 10 of file is: FX: Treatment Refusal
Line 13 of file is: AN: competency to consent
Line 16 of file is: MS: Voluntary authorization
MH from line 18 of file is: Obesity
Line 23 of file is: FX: 1st FX
Line 24 of file is: FX: 2nd FX
Line 25 of file is: AN: Only AN
Line 28 of file is: MS: Only MS
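For comparison, the same per-record flag idea can be sketched in Python. This is only an illustration under the same assumptions as the Ruby method (records start with "*NEW RECORD" and the MH line comes before the FX, AN and MS lines); the function name extract, the sample data and the returned dictionary layout are hypothetical.

```python
import re

NEW_RECORD_MARKER = "*NEW RECORD"
FIELD = re.compile(r"^(\w+)\s=\s(.*)$")

def extract(lines, wanted_mh, fields=("FX", "AN", "MS")):
    """Return {MH term: [(field, value), ...]} for records whose MH is wanted."""
    results, current = {}, None
    for raw in lines:
        line = raw.strip()
        if line == NEW_RECORD_MARKER:
            current = None            # reset the flag at each record boundary
            continue
        match = FIELD.match(line)
        if not match:
            continue
        name, value = match.groups()
        if name == "MH" and value in wanted_mh:
            current = results.setdefault(value, [])
        elif current is not None and name in fields:
            current.append((name, value))
    return results

sample = """*NEW RECORD
MH = Obesity
FX = 1st FX
AN = Only AN
*NEW RECORD
MH = Other
FX = ignored
""".splitlines()
found = extract(sample, {"Obesity"})
print(found)
```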

Resources