Sorting multiple columns from a CSV file in Python

I am a secondary school teacher attempting to find an appropriate way to teach KS4 pupils techniques that will allow them to write data to a CSV file and then read that same data from a file and display it back in Python in an organised structure.
My pupils and I have a clear understanding of how to write the data to the file. However, getting the data from the file back into Python and then sorting it is becoming rather tricky and complex to explain.
I've created a program that allows the user to input their name followed by 3 separate numbers all of which get written to the CSV file in the format shown below...
James , 3 , 7 , 4
David , 5 , 5 , 9
Steven , 8 , 3 , 9
These results are saved in a file "7G1.csv"
So far I have a functioning program that sorts alphabetically, by highest score (highest to lowest) and by average (highest to lowest). Through research and by piecing together techniques from numerous sources I have arrived at the following program, but can anybody suggest an easier, more efficient approach that could be understood by 16-year-olds?
import csv
G1 = open('7G1.csv')
csv_G1 = csv.reader(G1)
list7G1 = []
for column in csv_G1:
    column[1] = int(column[1])
    column[2] = int(column[2])
    column[3] = int(column[3])
    minimum = min(column[1:4])
    column.append(minimum)
    maximum = max(column[1:4])
    column.append(maximum)
    average = round(sum(column[1:4])/3)
    column.append(average)
    list7G1.append(column[0:7])
group_menu = 0
while group_menu != 4:
    group_menu = int(input("Which group / class do you want to look at?\n1.7G1?\n2.7G2?\n3.7G3?\n4.Quit? "))
    if group_menu == 1:
        print ("You have chosen to focus on group 7G1.")
        menu = int(input("\nDo you want to...\n1.Sort Alphabetically?\n2.Sort Highest to Lowest?\n3.Sort Average Highest to Lowest?\n4.Exit Group? "))
        while menu != 4:
            if menu == 1:
                print("You have chosen to Sort Alphabetically...")
                namesList = [[x[0], x[5]] for x in list7G1]
                print("\nSorted Alphabetically with Highest Scores \n")
                for names in sorted(namesList):
                    print (names)
            elif menu == 2:
                print("You have chosen to Sort Highest to Lowest...")
                highestScore = [[x[5], x[0]] for x in list7G1]
                print("\nScores Highest to Lowest \n")
                for hightolow in sorted(highestScore, reverse = True):
                    print (hightolow)
            elif menu == 3:
                print("You have chosen to Sort Average Highest to Lowest")
                averageScore = [[x[6], x[0]] for x in list7G1]
                print("\nAverage Scores \n")
                for average in sorted(averageScore, reverse = True):
                    print(average)
            elif menu == 4:
                print("You have chosen to exit this group")
            else:
                print("This is not a valid option")
            menu = int(input("\nDo you want to...\n1.Sort Alphabetically?\n2.Sort Highest to Lowest?\n3.Sort Average Highest to Lowest?\n4.Exit Group? "))
Any suggestions on how to simplify this program would be hugely appreciated.
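One possible simplification, offered as a sketch rather than a definitive answer: store each pupil as a small dictionary and let sorted() do the work through its key argument. The io.StringIO object below is only a stand-in for open('7G1.csv') so the example runs on its own, and the names pupils, by_name, by_highest and by_average are made up for illustration.

```python
import csv
import io

# Stand-in for open('7G1.csv'); the rows mirror the sample data in the question.
sample = io.StringIO("James , 3 , 7 , 4\nDavid , 5 , 5 , 9\nSteven , 8 , 3 , 9\n")

pupils = []
for row in csv.reader(sample):
    name = row[0].strip()                       # remove the spaces around the name
    scores = [int(value) for value in row[1:4]]  # int() ignores surrounding spaces
    pupils.append({
        "name": name,
        "highest": max(scores),
        "average": round(sum(scores) / len(scores)),
    })

# One sorted() call per ordering; key picks the field to sort on.
by_name = sorted(pupils, key=lambda p: p["name"])
by_highest = sorted(pupils, key=lambda p: p["highest"], reverse=True)
by_average = sorted(pupils, key=lambda p: p["average"], reverse=True)

for pupil in by_average:
    print(pupil["name"], pupil["average"])
```

Using named fields instead of positions such as x[5] and x[6] may be easier for pupils to follow, since each sort line states what it sorts by.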

Related

word2vec recommendation system KeyError: "word '21883' not in vocabulary"

The code works absolutely fine for the data set containing 500000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a key error: word "***" not in vocabulary. It does not happen for every data point, but it does for most of them. The data set is in Excel format. [1]: https://i.stack.imgur.com/YCBiQ.png
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix this problem!
purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)
purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)
model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples = model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)
# extract all vectors
X = model[model.wv.vocab]
X.shape
products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()
def similar_products(v, n = 6):
    ms = model.similar_by_vector(v, topn= n+1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking-up was not in the training data fed to Word2Vec, or did not appear enough (default min_count=5) times.
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors and instead just drag down the quality of surrounding words' vectors, so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.
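To see which tokens would be dropped before training, one option is to count occurrences yourself. This is only a sketch using the standard library, not part of Word2Vec: rare_tokens is a hypothetical helper, and the tiny purchases list is made-up data standing in for purchases_train.

```python
from collections import Counter

def rare_tokens(sequences, min_count=5):
    """Return {token: count} for tokens seen fewer than min_count times."""
    counts = Counter(token for sequence in sequences for token in sequence)
    return {token: n for token, n in counts.items() if n < min_count}

# Made-up example: '84029' appears 5 times, the other codes only once.
purchases = [
    ["21883", "84029"],
    ["84029", "84029"],
    ["84029", "22423", "84029"],
]

print(rare_tokens(purchases))
```

Any token listed here would be missing from the vocabulary with the default min_count of 5, which would explain why a smaller data set produces the error for most lookups.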

COUNTIF over a moving window

I have a column wherein datapoints have been assigned a "1" or "2". I would like to use a function similar to COUNTIF in Excel, but over a moving window, e.g. =COUNTIF(G2:G31, 2) to determine how many "2"s exist in that given window
You might be able to use tibbletime.
1) Since you are interested in state being 1 or 2, we can recode it into a logical (boolean). Assuming your data.frame is named df,
df$state <- df$state == 2
2) Logicals are cool, because we can simply sum them, and get the number of TRUE values:
# total number of rows with state == 2:
sum(df$state)
3) Make a rollify function, cf. the link:
library(tibbletime)
rolling_sum <- rollify(sum, window = 30)
df$countif = rolling_sum(df$state)
This approach does not, however, handle the leading 29 rows, where the window is still incomplete. For those you can in your case use:
df$countif[1:29] <- cumsum(df$state[1:29])
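For comparison, the same idea can be sketched in plain Python. This is an illustration under assumptions, not a translation of tibbletime: rolling_count is a hypothetical helper, the window is taken to be the current row plus the rows before it, and the leading rows use whatever partial window is available (matching the cumsum fix above).

```python
from collections import deque

def rolling_count(values, target, window=30):
    """Count how often target occurs in the last `window` values at each position."""
    recent = deque(maxlen=window)  # holds the values currently inside the window
    running = 0                    # number of targets inside the window
    counts = []
    for value in values:
        # If the window is full, the oldest value is about to drop out.
        if len(recent) == window and recent[0] == target:
            running -= 1
        recent.append(value)
        if value == target:
            running += 1
        counts.append(running)
    return counts

print(rolling_count([2, 1, 2, 2, 1], target=2, window=3))
```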

Asking how to sort in a csv being repeated

I have a quiz with results that are saved into a .csv file. After the questions are answered, it asks the user whether the results should be sorted alphabetically or from highest to lowest. This question is then repeated no matter what answer is entered; however, if "highest to lowest" is entered a few times after "alphabetically", it works.
print ("Would you like to see the results alphabetically or by highest to lowest?")
alpha = input()
while alpha != "alphabetically":
    alpha = str(input ("Would you like to see the results alphabetically or by highest to lowest? "))
    break
while alpha != "highest to lowest":
    alpha = str(input ("Would you like to see the results alphabetically or by highest to lowest? "))
    break
def updatefile(file,sortby,Classnumber): #this shortens the code by about 3 lines per file update
    if Class == Classnumber:
        with open(file,'a') as f:
            file_writer = csv.writer(f, delimiter = ',', lineterminator='\n')
            file_writer.writerow((name,score))
        sortcsv(file,sortby)
if alpha == "alphabetically":
    updatefile('Class 1 Results.csv',0,"1") #saves space using shortened code, makes the code use alphabetical sorting
    updatefile('Class 2 Results.csv',0,"2")
    updatefile('Class 3 Results.csv',0,"3")
elif alpha == "highest to lowest":
    updatefile('Class 1 Results.csv',1,"1") #makes the code use highest to lowest sorting
    updatefile('Class 2 Results.csv',1,"2")
    updatefile('Class 3 Results.csv',1,"3")
Ok, let's step through that input block and see what's going on.
print ("Would you like to see the results alphabetically or by highest to lowest?")
alpha = input()
OK so far, but it's puzzling why you printed the prompt separately this time, yet passed it to input() everywhere else.
while alpha != "alphabetically":
    alpha = str(input ("Would you like to see the results alphabetically or by highest to lowest? "))
Now if the user didn't enter "alphabetically" at the first prompt, they will be prompted over and over until they do. That's probably not what you're going for. (Also, you don't need str() around input().)
while alpha != "highest to lowest":
    highesttolowest = str(input ("Would you like to see the results alphabetically or by highest to lowest? "))
    break
Now that your user finally entered "alphabetically", you prompt them again (since "alphabetically" != "highest to lowest"). Two major problems here, though. First, the unconditional break renders the while loop pointless, as it will always exit after one loop. Second, you're assigning the input to a new variable highesttolowest, but later on you're still testing against alpha, so the results of this prompt won't ever be checked.
The simplest way to do what you seem to want is to use membership testing to check for both conditions at the same time. Replace that whole block with:
alpha = ''
while alpha not in ("alphabetically", "highest to lowest"):
    alpha = input("Would you like to see the results alphabetically or by highest to lowest? ")

In a CSV file, how can a Python coder remove all but an X number of duplicates across rows?

Here is an example CSV file for this problem:
Jack,6
Sam,10
Milo,9
Jacqueline,7
Sam,5
Sam,8
Sam,10
Let's take the context to be the names and scores of a quiz these people took. We can see that Sam has taken this quiz 4 times but I want to only have an X number of the same person's result (They also need to be the most recent entries). Let's assume we wanted no more than 3 of the same person's results.
I realised it probably wouldn't be possible to achieve having no more than 3 of each person's result without some extra information. Here is the updated CSV file:
Jack,6,1793
Sam,10,2079
Milo,9,2132
Jacqueline,7,2590
Sam,5,2881
Sam,8,3001
Sam,10,3013
The third column is essentially the number of seconds from the "Epoch", which is a reference point for time. With this, I thought I could simply sort the file in terms of lowest to highest for the epoch column and use set() to remove all but a certain number of duplicates for the name column while also removing the removed persons score as well.
In theory, this should leave me with the 3 most recent results per person but in practice, I have no idea how I could adapt the set() function to do this unless there is some alternative way. So my question is, what possible methods are there to achieve this?
You could use a defaultdict of a list, and each time you add an entry check the length of the list: if it's more than three items pop the first one off (or do the check after cycling through the file). This assumes the file is in time sequence.
from collections import defaultdict
# looping over a csv file gives one row at a time
# so we will emulate that
raw_data = [
('Jack', '6'),
('Sam', '10'),
('Milo', '9'),
('Jacqueline', '7'),
('Sam', '5'),
('Sam', '8'),
('Sam', '10'),
]
# this will hold our information, and works by providing an empty
# list for any missing key
student_data = defaultdict(list)
for row in raw_data: # note 1
    # separate the row into its component items, and convert
    # score from str to int
    name, score = row
    score = int(score)
    # get the current list for the student, or a brand-new list
    student = student_data[name]
    student.append(score)
    # after adding the score to the end, remove the oldest score
    # so that we have no more than three items in the list
    if len(student) > 3:
        student.pop(0)
# print the items for debugging
for item in student_data.items():
    print(item)
which results in:
('Milo', [9])
('Jack', [6])
('Sam', [5, 8, 10])
('Jacqueline', [7])
Note 1: to use an actual csv file you want code like this:
import csv
with open('some_file.csv', newline='') as raw_file:
    csv_file = csv.reader(raw_file)
    for row in csv_file:
        ...
To handle the timestamps, and as an alternative, you could use itertools.groupby:
from itertools import groupby, islice
from operator import itemgetter
raw_data = [
('Jack','6','1793'),
('Sam','10','2079'),
('Milo','9','2132'),
('Jacqueline','7','2590'),
('Sam','5','2881'),
('Sam','8','3001'),
('Sam','10','3013'),
]
# Sort by name in natural order, then by timestamp from highest to lowest
sorted_data = sorted(raw_data, key=lambda x: (x[0], -int(x[2])))
# Group by user
grouped = groupby(sorted_data, key=itemgetter(0))
# And keep only three most recent values for each user
most_recent = [(k, [v for _, v, _ in islice(grp, 3)]) for k, grp in grouped]
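Another way to combine the two ideas above, sketched under the assumption that every row has the three columns of the updated CSV (name, score, timestamp): sort by timestamp first, then let a deque with maxlen discard the oldest entries automatically. keep_most_recent is a hypothetical name, and the rows are the question's sample data.

```python
from collections import defaultdict, deque

def keep_most_recent(rows, limit=3):
    """Return rows in timestamp order, keeping at most `limit` newest per name."""
    # A deque with maxlen silently drops its oldest item when a new one is added.
    recent = defaultdict(lambda: deque(maxlen=limit))
    for name, score, stamp in sorted(rows, key=lambda r: int(r[2])):
        recent[name].append((name, score, stamp))
    kept = [row for per_name in recent.values() for row in per_name]
    return sorted(kept, key=lambda r: int(r[2]))

rows = [
    ('Jack', '6', '1793'),
    ('Sam', '10', '2079'),
    ('Milo', '9', '2132'),
    ('Jacqueline', '7', '2590'),
    ('Sam', '5', '2881'),
    ('Sam', '8', '3001'),
    ('Sam', '10', '3013'),
]

result = keep_most_recent(rows)
for row in result:
    print(row)
```

The returned rows keep their original tuple shape, so they could be passed straight to csv.writer(...).writerows() to rewrite the file.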

Formula for calculating Exotic wagers such as Trifecta and Superfecta

I am trying to create an application that will calculate the cost of exotic parimutuel wagers. I have found several formulas for certain types of bets but never one that solves all the scenarios for a single bet type. If I could find an algorithm that could calculate all the possible combinations, I could use it to solve my other problems.
Additional information:
I need to calculate the permutations of groups of numbers. For instance;
Group 1 = 1,2,3
Group 2 = 2,3,4
Group 3 = 3,4,5
What are all the possible permutations for these 3 groups of numbers, taking 1 number from each group per permutation? No repeats per permutation, meaning a number cannot appear in more than 1 position. So 2,4,3 is valid but 2,4,4 is not valid.
Thanks for all the help.
Like most interesting problems, your question has several solutions. The algorithm that I wrote (below) is the simplest thing that came to mind.
I found it easiest to think of the problem like a tree-search: The first group, the root, has a child for each number it contains, where each child is the second group. The second group has a third-group child for each number it contains, the third group has a fourth-group child for each number it contains, etc. All you have to do is find all valid paths from the root to leaves.
However, for many groups with lots of numbers this approach will prove to be slow without any heuristics. One thing you could do is sort the list of groups by group-size, smallest group first. That would be a fail-fast approach that would, in general, discover that a permutation isn't valid sooner than later. Look-ahead, arc-consistency, and backtracking are other things you might want to think about. [Sorry, I can only include one link because it's my first post, but you can find these things on Wikipedia.]
## Algorithm written in Python ##
## CodePad.org has a Python interpreter
Group1 = [1,2,3] ## Within itself, each group must be composed of unique numbers
Group2 = [2,3,4]
Group3 = [3,4,5]
Groups = [Group1,Group2,Group3] ## Must contain at least one Group
Permutations = [] ## List of valid permutations
def getPermutations(group, permSoFar, nextGroupIndex):
    for num in group:
        nextPermSoFar = list(permSoFar) ## Make a copy of the permSoFar list
        ## Only proceed if num isn't a repeat in nextPermSoFar
        if nextPermSoFar.count(num) == 0:
            nextPermSoFar.append(num) ## Add num to this copy of nextPermSoFar
            if nextGroupIndex != len(Groups): ## Call next group if there is one...
                getPermutations(Groups[nextGroupIndex], nextPermSoFar, nextGroupIndex + 1)
            else: ## ...or add the valid permutation to the list of permutations
                Permutations.append(nextPermSoFar)
## Call getPermutations with:
## * the first group from the list of Groups
## * an empty list
## * the index of the second group
getPermutations(Groups[0], [], 1)
## print results of getPermutations
print('There are', len(Permutations), 'valid permutations:')
print(Permutations)
This is the simplest general formula I know for trifectas.
A=the number of selections you have for first; B=number of selections for second; C=number of selections for third; AB=number of selections you have in both first and second; AC=no. for both first and third; BC=no. for both 2nd and 3rd; and ABC=the no. of selections for all of 1st,2nd, and third.
the formula is
(AxBxC)-(ABxC)-(ACxB)-(BCxA)+(2xABC)
So, for your example ::
Group 1 = 1,2,3
Group 2 = 2,3,4
Group 3 = 3,4,5
the solution is: (3x3x3)-(2x3)-(1x3)-(2x3)+(2x1) = 14. Hope that helps.
There might be an easier method that I am not aware of. Now does anyone know a general formula for First4?
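The formula can be checked against a brute-force count. The sketch below is only a verification aid; trifecta_count and brute_force_count are hypothetical names, and the overlaps AB, AC, BC and ABC are computed as set intersections of the three selections.

```python
from itertools import product

def trifecta_count(first, second, third):
    """Apply (AxBxC)-(ABxC)-(ACxB)-(BCxA)+(2xABC) to three lists of selections."""
    a, b, c = set(first), set(second), set(third)
    ab, ac, bc = len(a & b), len(a & c), len(b & c)
    abc = len(a & b & c)
    return (len(a) * len(b) * len(c)
            - ab * len(c) - ac * len(b) - bc * len(a)
            + 2 * abc)

def brute_force_count(*legs):
    """Count picks of one number per leg in which no number is repeated."""
    return sum(1 for combo in product(*legs) if len(set(combo)) == len(combo))

groups = ([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(trifecta_count(*groups), brute_force_count(*groups))
```

Both functions give 14 for the example groups. brute_force_count accepts any number of legs, so it can also be used for four-leg bets where no closed formula is given here.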
Revised after a few years:-
I logged back into my SE account after a while, noticed this question, and realised that what I'd written didn't even answer it:
Here is some python code
import itertools
def explode(value, unique):
    legs = [ leg.split(',') for leg in value.split('/') ]
    if unique:
        return [ tuple(ea) for ea in itertools.product(*legs) if len(ea) == len(set(ea)) ]
    else:
        return [ tuple(ea) for ea in itertools.product(*legs) ]
calling explode works on the basis that each leg is separated by a /, and each position by a ,
for your trifecta calculation you can work it out by the following:-
result = explode('1,2,3/2,3,4/3,4,5', True)
stake = 2.0
cost = stake * len(result)
print(cost)
for a superfecta
result = explode('1,2,3/2,4,5/1,3,6,9/2,3,7,9', True)
stake = 2.0
cost = stake * len(result)
print(cost)
for a pick4 (Set Unique to False)
result = explode('1,2,3/2,4,5/3,9/2,3,4', False)
stake = 2.0
cost = stake * len(result)
print(cost)
Hope that helps
As a punter I can tell you there is a much simpler way:
For a trifecta, you need 3 selections. Say there are 8 runners: the total number of possible permutations is 8 (total runners) * 7 (remaining runners after the winner is omitted) * 6 (remaining runners after the winner and 2nd are omitted) = 336
For an exacta (with 8 runners): 8 * 7 = 56
Quinellas are an exception, as you only need to take each bet once, because 1/2 pays as well as 2/1, so the answer is 8*7/2 = 28
Simple
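The same arithmetic can be written with the standard library, as a small sketch assuming Python 3.8 or later (where math.perm and math.comb are available). Note that it covers only a full box over all runners, not different selections per position.

```python
import math

runners = 8

trifecta = math.perm(runners, 3)   # ordered picks of 3: 8 * 7 * 6
exacta = math.perm(runners, 2)     # ordered picks of 2: 8 * 7
quinella = math.comb(runners, 2)   # unordered picks of 2: 8 * 7 / 2

print(trifecta, exacta, quinella)
```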
The answer supplied by luskin is correct for trifectas. He posed another question I needed to solve regarding First4. I looked everywhere but could not find a formula. I did however find a simple way to determine the number of unique permutations, using nested loops to exclude repeated sequences.
Public Function fnFirst4PermCount(arFirst, arSecond, arThird, arFourth) As Integer
Dim intCountFirst As Integer
Dim intCountSecond As Integer
Dim intCountThird As Integer
Dim intCountFourth As Integer
Dim intBetCount As Integer
'Dim arFirst(3) As Integer
'Dim arSecond(3) As Integer
'Dim arThird(3) As Integer
'Dim arFourth(3) As Integer
'arFirst(0) = 1
'arFirst(1) = 2
'arFirst(2) = 3
'arFirst(3) = 4
'
'arSecond(0) = 1
'arSecond(1) = 2
'arSecond(2) = 3
'arSecond(3) = 4
'
'arThird(0) = 1
'arThird(1) = 2
'arThird(2) = 3
'arThird(3) = 4
'
'arFourth(0) = 1
'arFourth(1) = 2
'arFourth(2) = 3
'arFourth(3) = 4
intBetCount = 0
For intCountFirst = 0 To UBound(arFirst)
For intCountSecond = 0 To UBound(arSecond)
For intCountThird = 0 To UBound(arThird)
For intCountFourth = 0 To UBound(arFourth)
If (arFirst(intCountFirst) <> arSecond(intCountSecond)) And (arFirst(intCountFirst) <> arThird(intCountThird)) And (arFirst(intCountFirst) <> arFourth(intCountFourth)) Then
If (arSecond(intCountSecond) <> arThird(intCountThird)) And (arSecond(intCountSecond) <> arFourth(intCountFourth)) Then
If (arThird(intCountThird) <> arFourth(intCountFourth)) Then
' Debug.Print "First " & arFirst(intCountFirst), " Second " & arSecond(intCountSecond), "Third " & arThird(intCountThird), " Fourth " & arFourth(intCountFourth)
intBetCount = intBetCount + 1
End If
End If
End If
Next intCountFourth
Next intCountThird
Next intCountSecond
Next intCountFirst
fnFirst4PermCount = intBetCount
End Function
This function takes four arrays, one for each position. I left in test code (commented out) so you can see how it works for 1/2/3/4 in each of the four positions.
