In a CSV file, how can a Python coder remove all but an X number of duplicates across rows?

Here is an example CSV file for this problem:
Jack,6
Sam,10
Milo,9
Jacqueline,7
Sam,5
Sam,8
Sam,10
Let's take the context to be the names and scores of a quiz these people took. We can see that Sam has taken this quiz 4 times, but I want to keep only an X number of the same person's results (and they need to be the most recent entries). Let's assume we wanted no more than 3 of the same person's results.
I realised it probably wouldn't be possible to achieve having no more than 3 of each person's results without some extra information. Here is the updated CSV file:
Jack,6,1793
Sam,10,2079
Milo,9,2132
Jacqueline,7,2590
Sam,5,2881
Sam,8,3001
Sam,10,3013
The third column is essentially the number of seconds from the "Epoch", which is a reference point for time. With this, I thought I could simply sort the file from lowest to highest on the epoch column and use set() to remove all but a certain number of duplicates in the name column, while also removing the removed person's score.
In theory, this should leave me with the 3 most recent results per person, but in practice I have no idea how I could adapt the set() function to do this, unless there is some alternative way. So my question is: what possible methods are there to achieve this?

You could use a defaultdict of lists, and each time you add an entry check the length of the list: if it has more than three items, pop the first one off (or do the check after cycling through the file). This assumes the file is in time sequence.
from collections import defaultdict

# looping over a csv file gives one row at a time
# so we will emulate that
raw_data = [
    ('Jack', '6'),
    ('Sam', '10'),
    ('Milo', '9'),
    ('Jacqueline', '7'),
    ('Sam', '5'),
    ('Sam', '8'),
    ('Sam', '10'),
]

# this will hold our information, and works by providing an empty
# list for any missing key
student_data = defaultdict(list)

for row in raw_data:  # note 1
    # separate the row into its component items, and convert
    # score from str to int
    name, score = row
    score = int(score)
    # get the current list for the student, or a brand-new list
    student = student_data[name]
    student.append(score)
    # after adding the score to the end, remove the first scores
    # until we have no more than three items in the list
    if len(student) > 3:
        student.pop(0)

# print the items for debugging
for item in student_data.items():
    print(item)
which results in:
('Milo', [9])
('Jack', [6])
('Sam', [5, 8, 10])
('Jacqueline', [7])
Note 1: to use an actual csv file you want code like this:
import csv

raw_file = open('some_file.csv')
csv_file = csv.reader(raw_file)
for row in csv_file:
    ...
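If the file is not already in time sequence, you can combine this with the timestamp column from the updated file: read all the rows, sort them on the epoch column, then apply the same trimming trick. A minimal sketch, assuming a headerless file named scores.csv with name,score,epoch rows (the filename and layout are my assumption):
import csv
from collections import defaultdict

with open('scores.csv') as f:  # hypothetical file: name,score,epoch per row
    rows = list(csv.reader(f))

# oldest first, so the most recent results are the ones that survive the trim
rows.sort(key=lambda r: int(r[2]))

recent = defaultdict(list)
for name, score, epoch in rows:
    recent[name].append(int(score))
    if len(recent[name]) > 3:
        recent[name].pop(0)  # drop the oldest surplus score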

To handle the timestamps, and as an alternative, you could use itertools.groupby:
from itertools import groupby, islice
from operator import itemgetter

raw_data = [
    ('Jack', '6', '1793'),
    ('Sam', '10', '2079'),
    ('Milo', '9', '2132'),
    ('Jacqueline', '7', '2590'),
    ('Sam', '5', '2881'),
    ('Sam', '8', '3001'),
    ('Sam', '10', '3013'),
]

# Sort by name in natural order, then by timestamp from highest to lowest
sorted_data = sorted(raw_data, key=lambda x: (x[0], -int(x[2])))
# Group by user
grouped = groupby(sorted_data, key=itemgetter(0))
# And keep only the three most recent values for each user
most_recent = [(k, [v for _, v, _ in islice(grp, 3)]) for k, grp in grouped]
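For the sample data above, most_recent comes out as follows (the scores are still strings, since the rows were never converted):
print(most_recent)
# [('Jack', ['6']), ('Jacqueline', ['7']), ('Milo', ['9']), ('Sam', ['10', '8', '5'])]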

Related

Hashmap (O(1)) supporting joker/match-all keys

The title is not so clear, because I cannot put my problem in a sentence (If you have a better title for this question, please suggest). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline | Free Baggage |
===================================================
| NYC | London | American | 20KG |
---------------------------------------------------
| NYC | * | Southwest | 30KG |
---------------------------------------------------
| * | * | Southwest | 25KG |
---------------------------------------------------
| * | LA | * | 20KG |
---------------------------------------------------
| * | * | * | 15KG |
---------------------------------------------------
and so on ...
This table describes the free baggage amount that the airlines provide on different routes. You can see that some rows have a * value, meaning that they match all possible values (those values are not necessarily known).
So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination and airline are known), and we intend to find the baggage amount for each of the flights in the most efficient way (iterating the list is obviously not efficient, as it costs O(N) per lookup). More than one rule may match a flight, but we will assume that in this case either the first match or the most specific one is preferred (whichever is simpler for you to continue with).
If there were no * signs in the table, the problem would be easy, and we could use a Hashmap or Dictionary with a Tuple of values as a key. But with the presence of those * (let's say match-all) keys, it is not so straightforward to provide a general solution.
Please note that the above example was just an example, and I need a solution that can be used for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method having time complexity equal or close to O(1) like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments, the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...
A trivial, quite easy solution could be something like an in-memory SQLite database. But it would probably be something in O(log2(N)), not O(1). The main advantage is that it's easy to set up, and IF performance is good enough, it could be the final solution.
Here, the key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
From scratch, I can't think of any solution that, having N rows and M columns, isn't at least in O(M)... But usually, you'll have way fewer columns than rows. Quickly (I may have skipped a detail, I'm writing this on the fly), I can propose this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think about this as a primary key in databases, and we'll call it PK. Knowing PK gives you instantly, in O(1), the required data. You'll have N rows grand total.
For each row NOT containing any *, you'll insert in a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact matches. It will be in O(1), BUT the tuple you're looking for may not be in it... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks, don't care as long as both are consistent).
This hash contains at most N entries. Without any joker, it will act near as a standard hashmap, but for the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index, dedicated to this column.
The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
For each row, the index entry will be: ( <VECDATA value>, PK ). The container is stored in a vector of indexes, INDEX[i] (with 0<=i<M).
Same as MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...
Searching a row
So, user search for a given tuple.
Search it in MAINHASH. If found, return it, search done.
Upgrade (see below): search also in CACHE before going to step #2.
For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returns a vector of PK, EXACT[i]) AND for * (returns another vector of PK, FUZZY[i]).
With these two vectors, build another (temporary) hash TMPHASH, associating ( PK, integer COUNT ). It's quite simple: COUNT is initialized to 1 if the entry comes from EXACT, and 0 if it comes from FUZZY.
For the next column, build EXACT and FUZZY again (see #2). But instead of making a new TMPHASH, you'll MERGE the results into the existing one rather than creating a new temporary hash.
The method is: if TMPHASH doesn't have this PK entry, trash this entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, and reinject it into TMPHASH.
Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, then you don't have any suitable answer. Return that to the user. If it contains only one entry, same: return it to the user directly.
For more than one element in TMPHASH:
Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated with the current maximum COUNT.
Developer's choice: in case of multiple COUNTs at the same maximum value, you can either return them all, the first one, or the last one.
COUNT is obviously always strictly lower than M - otherwise, you would have found the tuple in MAINHASH. Compared to M, this value can give a confidence mark to your result (=100*COUNT/M% of confidence).
You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
Since it would be way too complicated to properly update CACHE when adding/modifying something in VECDATA, simply purge CACHE when that occurs. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by allowing you to redefine operators and having all base containers available, but it should work.
Exact matches / cached matches are in O(1). Fuzzy search is in O(n.M), n being the number of matching rows (and 0<=n<N, of course).
Without further research, I can't see anything better than that. It will consume an obscene amount of memory, but you said that won't be an issue.
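To make the idea concrete, here is a minimal Python sketch of the fuzzy part; it is my own illustration, not Wisblade's exact design: the MAINHASH fast path and the CACHE are omitted, and the EXACT/FUZZY vectors are merged straight into a COUNT table as described above.
from collections import defaultdict

# rows play the role of VECDATA; the row index is the PK
rows = [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
]
M = 3  # number of searchable columns

# INDEX[i]: column value -> set of PKs, one hashmap per column
indexes = [defaultdict(set) for _ in range(M)]
for pk, row in enumerate(rows):
    for i in range(M):
        indexes[i][row[i]].add(pk)

def lookup(query):
    counts = None  # TMPHASH: PK -> number of exact column matches
    for i, value in enumerate(query):
        exact = indexes[i].get(value, set())  # EXACT[i]
        fuzzy = indexes[i].get('*', set())    # FUZZY[i]
        step = {pk: (1 if pk in exact else 0) for pk in exact | fuzzy}
        if counts is None:
            counts = step
        else:
            # merge: keep only PKs that matched (exactly or via *) every column so far
            counts = {pk: counts[pk] + step[pk] for pk in counts.keys() & step.keys()}
    if not counts:
        return None
    best = max(counts, key=counts.get)  # most exact matches = most specific rule
    return rows[best][M]

print(lookup(("NYC", "LA", "Southwest")))  # -> 30KG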
I would recommend doing this with tries that are decorated with a little extra data. For routes, you want to know the lowest route ID, so we can match the first available route. For flights, you want to track how many flights are left to match.
What this allows you to do, for instance, is realize partway through the match, ONLY ONCE, that flights from city1 to city2 might match routes starting with city1, city2, or city1, *, or *, city2, or *, *, without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = # unmatched flights for flights
        # value = lowest route id for routes.
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight(self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i + 1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route(self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i + 1)
        else:
            self.item = route

def match_flight_baggage(route_search, flight_search):
    # Construct a heap of one search to do.
    tmp_id = 0
    todo = [((0, tmp_id), route_search, flight_search)]
    # This will hold baggage by flight number.
    matched = {}
    while 0 < len(todo):
        priority, route_search, flight_search = heapq.heappop(todo)
        if 0 == flight_search.value:  # There are no flights left to match
            # Already matched all flights.
            pass
        elif flight_search.item is not None:
            # We found a match!
            matched[flight_search.item.flight_no] = route_search.item.baggage
            flight_search.remove_flight()
        else:
            for key, r_search in route_search.trie.items():
                if key == '*':  # Found wildcard.
                    for a_search in flight_search.trie.values():
                        if 0 < a_search.value:
                            heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                            tmp_id += 1
                elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                    heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                    tmp_id += 1
    return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)

route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)

print(match_flight_baggage(route_search, flight_search))
As Wisblade notices in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M) which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 2^3 = 8 possibilities to check in the hashmap. Not bad! And possibly you find an answer much earlier.
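A quick Python sketch of that strategy, trying the masks with the fewest wildcards first so the most specific rule wins (the rules dict mirrors the table above; the names are mine):
from itertools import product

rules = {
    ("NYC", "London", "American"): "20KG",
    ("NYC", "*", "Southwest"): "30KG",
    ("*", "*", "Southwest"): "25KG",
    ("*", "LA", "*"): "20KG",
    ("*", "*", "*"): "15KG",
}

def lookup(flight):
    # masks with the fewest wildcards first, i.e. most specific first
    masks = sorted(product([False, True], repeat=len(flight)), key=sum)
    for mask in masks:
        key = tuple('*' if wild else field for wild, field in zip(mask, flight))
        if key in rules:
            return rules[key]
    return None

print(lookup(("NYC", "LA", "Southwest")))  # -> 30KG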

CSV - Processing each group of contiguous rows having the same values for certain fields

I have a large CSV file with the following headers: "sku", "year", "color", "price", "discount", "inventory", "published_on", "rate", "demographic" and "tags".
I would like to perform various calculations for each contiguous group of rows having the same values for "sku", "year" and "color". I will refer to this partition of the file as each group of rows. For example, if the file looked like this:
sku,year,color,price,discount,...
100,2019,white,24.61,2.3,...
100,2019,white,29.11,2.1,...
100,2019,white,33.48,2.9,...
100,2019,black,58.12,1.3,...
200,2018,brown,44.15,3.1,...
200,2018,brown,53.07,3.2,...
100,2019,white,16.91,2.9,...
there would be four groups of rows: rows 1, 2 and 3 (after the header row), row 4 alone, rows 5 and 6, and row 7 alone. Notice that the last row is not included in the first group even though it has the same values for the first three fields. That is because it is not contiguous with the first group.
An example of a calculation that might be performed for each group of rows would be to determine the total inventory for the group. In general, the measure to be computed is some function of the values contained in all the rows of the group of rows. The specific calculations for each group of rows is not central to my question. Let us simply assume that each group of rows is passed to some method which returns the measure of interest.
I wish to return an array containing one element per group of rows, each element (perhaps an array or hash) containing the common values of "sku", "year" and "color" and the calculated measure of interest.
Because the file is large it must be read line-by-line, rather than gulping it into an array.
What's the best way to do this?
Enumerator#chunk is perfect for this.
CSV.foreach('path/to/csv', headers: true).
    chunk { |row| row.values_at('sku', 'year', 'color') }.
    each do |(sku, year, color), rows|
  # process `rows` with the current `[sku, year, color]` combination
end
Obviously, that last each can be replaced by map or flat_map, as needed.
Here is an example of how that might be done. I will read the CSV file line-by-line to minimize memory requirements.
Code
require 'csv'

def doit(fname, common_headers)
  CSV.foreach(fname, headers: true).
    slice_when { |csv1, csv2| csv1.values_at(*common_headers) !=
                              csv2.values_at(*common_headers) }.
    each_with_object({}) { |arr, h|
      h[arr.first.to_h.slice(*common_headers)] = calc(arr) }
end

def calc(arr)
  arr.sum { |csv| csv['price'].to_f }.fdiv(arr.size).round(2)
end
The method calc needs to be customized for the application. Here I am computing the average price for each contiguous group of records having the same values for "sku", "year" and "color".
See CSV::foreach, Enumerable#slice_when, CSV::Row#values_at, CSV::Row#to_h and Hash#slice.
Example
Now let's construct a CSV file.
str =<<~END
sku,year,color,price
1,2015,red,22.41
1,2015,red,33.61
1,2015,red,12.15
1,2015,blue,36.18
2,2015,yellow,9.08
2,2015,yellow,13.71
END
fname = 't.csv'
File.write(fname, str)
#=> 129
The common headers must be given:
common_headers = ['sku', 'year', 'color']
The average prices are obtained by executing doit:
doit(fname, common_headers)
#=> {{"sku"=>"1", "year"=>"2015", "color"=>"red"}=>22.72,
# {"sku"=>"1", "year"=>"2015", "color"=>"blue"}=>36.18,
# {"sku"=>"2", "year"=>"2015", "color"=>"yellow"}=>11.4}
Note:
((22.41 + 33.61 + 12.15)/3).round(2)
#=> 22.72
((36.18)/1).round(2)
#=> 36.18
((9.08 + 13.71)/2).round(2)
#=> 11.4
The methods foreach and slice_when both return enumerators. Therefore, for each contiguous block of lines from the file having the same values for the keys in common_headers, memory is acquired, calculations are performed for those lines and then that memory is released (by Ruby). In addition, memory is needed to hold the hash that is returned at the end.

Input to different attributes values from a random.sample list

So this is what I'm trying to do, and I'm not sure how, because I'm new to Python. I've searched for a few options and I'm not sure why this doesn't work.
So I have 6 different nodes in Maya, called aiSwitch. I need to generate different random numbers from 0 to 5 and input each value into aiSwitch*.index.
In short the result should be:
aiSwitch1.index = (random number from 0 to 5)
aiSwitch2.index = (another random number from 0 to 5, different from the one before)
And so on until aiSwitch6.index.
I tried the following:
import maya.cmds as mc
import random

allswitch = mc.ls('aiSwitch*')
for i in allswitch:
    print i
    S = range(0, 6)
    print S
    shuffle = random.sample(S, len(S))
    print shuffle
    for w in shuffle:
        print w
        mc.setAttr(i + '.index', w)
This is the result I get from the prints:
aiSwitch1 <-- from print i
[0,1,2,3,4,5] <--- from print S
[2,3,5,4,0,1] <--- from print Shuffle (random.sample results)
2
3
5
4
0
1 <--- from print w, every separated item in the random.sample list.
Now, this happens for every aiSwitch, cause it's in a loop of course. And the random numbers are always a different list cause it happens every time the loop runs.
So where is the problem then?
aiSwitch1.index = 1
And all the other aiSwitch*.index always take only the last item in the list by the time I get to do the setAttr. It seems that w is retaining the last value of the for loop. What I don't quite understand is how to:
Get a random value from 0 to 5
Input that value in aiSwitch1.index
Get another random value from 0 to 5, different from the one before
Input that value in aiSwitch2.index
Repeat until aiSwitch6.index.
I did get it to work with the following form:
allSwitch = mc.ls('aiSwitch*')
for i in allSwitch:
    mc.setAttr(i + '.index', random.uniform(0, 5))
This gave a random number from 0 to 5 to every aiSwitch*.index, but some of them repeat. I think this works because the value is generated every time the loop runs, hence setting the attribute to a random number, but the numbers repeat and I was trying to avoid that. I also tried a shuffle but failed to get any values from it.
My main mistake seems to be that I'm generating a list and sampling it, but failing to assign each different item from that list to a different aiSwitch*.index node. And I'm running out of ideas for this.
Any clues would be greatly appreciated.
Thanks.
Jonathan.
Here is a somewhat Pythonic way: shuffle the list of indices, then iterate over it using zip (which is useful for iterating over structures in parallel, which is what you need to do here):
import random
import maya.cmds as mc

index = list(range(6))
random.shuffle(index)

allSwitch = mc.ls('aiSwitch*')
for i, j in zip(allSwitch, index):
    mc.setAttr(i + '.index', j)
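Equivalently, the asker's original random.sample idea works once the sampling happens a single time, outside the node loop (this sketch assumes there are at most six aiSwitch nodes, since only six distinct values are available):
import random
import maya.cmds as mc

all_switch = mc.ls('aiSwitch*')
# one draw of len(all_switch) distinct values, instead of re-sampling per node
for node, value in zip(all_switch, random.sample(range(6), len(all_switch))):
    mc.setAttr(node + '.index', value)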

How do I modify multiple columns in a CSV, and then copy them to a new CSV using Ruby?

Out of the 10 columns there in the original CSV, I have 4 columns which I need to make integers (to process with MATLAB later; the other 6 columns already contain integer values). These 4 columns are: (1) platform (2) push (3) timestamp, and (4) udid.
An example input is: #other_column, Android, Y, 10-05-2015 3:59:59 PM, #other_column, d0155049772de9, #other_columns
The corresponding output should be: #other_column, 2, 1, 1431273612198, #other_column, 17923, #other_columns
So, I wrote the following code:
require 'csv'
CSV.open('C:\Users\hp1\Desktop\Datasets\NewColumns2.csv', "wb") do |csv|
  CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true).map do |row|
    if row['platform']=='Android'
      row['platform']=2
    elsif row['platform']=='iPhone'
      row['platform']=1
    end
    if row['push']=='Y'
      row['push']=1
    elsif row['push']=='N'
      row['push']=0
    end
    row['timestamp'].to_time.to_i
    row['udid'].to_i
    csv<<row
  end
end
Now, the first 3 columns, weekday, platform and push, have a small number of unique values across the whole file (7, 2 and 2 respectively), which is why I used the above approach. However, the other 2 columns, timestamp and udid, are different: they have several values, a few of them common to some rows in the CSV, but there are thousands of unique values. Hence I thought of converting them to integers in the manner shown above.
Anyhow, none of the columns are getting converted at all. Plus, there is another problem with the datetime column, as it is in a format which Ruby apparently does not recognize as a legitimate time format (a sample looks like this: 10-05-2015 3:59:59 PM). So, what should I do? Thanks.
Edit - Redo, I missed part of the problem with the udids
Problems
You are using map when you don't need to, CSV#foreach already iterates through all of the rows - remove this
Date - include the ruby standard Time library
Unique ids - it sounds like you want to convert the udid into a shorter unique id, since there may be more than one entry per mobile device. Use an array to collect the udids without repeats, and use the index of each device's udid in that array as your new shorter unique id
I used this as my input csv:
othercol1,platform,push,timestamp,othercol2,udid,othercol3,othercol4,othercol5,othercol6
11,Android, N, 10-05-2015 3:59:59 PM,22, d0155049772de9,33,44,55,66
11,iPhone, N, 10-05-2015 5:59:59 PM,22, d0155044772de9,33,44,55,66
11,iPhone, Y, 10-06-2015 3:59:59 PM,22, d0155049772de9,33,44,55,66
11,Android, Y, 11-05-2015 3:59:59 PM,22, d0155249772de9,33,44,55,66
Here is my output csv:
11,2,0,1431298799,22,1,33,44,55,66
11,1,0,1431305999,22,2,33,44,55,66
11,1,1,1433977199,22,1,33,44,55,66
11,2,1,1431385199,22,3,33,44,55,66
Here is the script I used:
require 'time' # use ruby standard time library to parse for you
require 'csv'

udids = [] # turn the udid in to a shorter unique id

CSV.open('new.csv', "wb") do |csv|
  CSV.foreach('old.csv', headers: true) do |row|
    if row['platform']=='Android'
      row['platform']=2
    elsif row['platform']=='iPhone'
      row['platform']=1
    end
    if row['push'].strip =='Y'
      row['push']=1
    elsif row['push'].strip =='N'
      row['push']=0
    end
    row['timestamp'] = Time.parse(row['timestamp']).to_i
    # turn the udid in to a shorter unique id
    unless udids.include?(row['udid'])
      udids << row['udid']
    end
    row['udid'] = udids.index(row['udid']) + 1
    csv << row
  end
end
This is a wrong usage of map; it is not the function you need. map is for applying a function to all values in an array and returning the transformed array. What you are doing is iterating, making some changes, then pushing the modified row into a new array - you can just iterate, no need for the map function to be there:
CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true) instead of CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true).map
About the date, you can use strptime to transform a string into a date: DateTime.strptime("10-05-2015 3:59:59 PM", "%d-%m-%Y %l:%M:%S %p"). Here are the docs: http://ruby-doc.org/stdlib-1.9.3/libdoc/date/rdoc/DateTime.html
Add :converters => :all to your options, so that the dates and numbers are automatically converted. Then, instead of
row['timestamp'].to_time.to_i
which does the conversion but doesn't put it anywhere (it is not in-place), do this:
row['timestamp'] = row['timestamp'].to_time.to_i
Note that this only works with converters; otherwise row['timestamp'] is a string and there is no .to_time method.

What is the pythonic way to detect the last element in a 'for' loop?

How can I treat the last element of the input specially, when iterating with a for loop? In particular, if there is code that should only occur "between" elements (and not "after" the last one), how can I structure the code?
Currently, I write code like so:
for i, data in enumerate(data_list):
    code_that_is_done_for_every_element
    if i != len(data_list) - 1:
        code_that_is_done_between_elements
How can I simplify or improve this?
Most of the times it is easier (and cheaper) to make the first iteration the special case instead of the last one:
first = True
for data in data_list:
    if first:
        first = False
    else:
        between_items()
    item()
This will work for any iterable, even for those that have no len():
file = open('/path/to/file')
for line in file:
    process_line(line)
    # No way of telling if this is the last line!
Apart from that, I don't think there is a generally superior solution as it depends on what you are trying to do. For example, if you are building a string from a list, it's naturally better to use str.join() than using a for loop “with special case”.
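For example, when joining, the "between" logic disappears entirely:
data_list = [1, 2, 3]
print(' & '.join(str(d) for d in data_list))  # 1 & 2 & 3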
Using the same principle but more compact:
for i, line in enumerate(data_list):
    if i > 0:
        between_items()
    item()
Looks familiar, doesn't it? :)
For #ofko, and others who really need to find out whether the current value of an iterable without len() is the last one, you will need to look ahead:
def lookahead(iterable):
    """Pass through all values from the given iterable, augmented by the
    information if there are more values to come after the current one
    (True), or if it is the last value (False).
    """
    # Get an iterator and pull the first value.
    it = iter(iterable)
    last = next(it)
    # Run the iterator to exhaustion (starting from the second value).
    for val in it:
        # Report the *previous* value (more to come).
        yield last, True
        last = val
    # Report the last value.
    yield last, False
Then you can use it like this:
>>> for i, has_more in lookahead(range(3)):
...     print(i, has_more)
0 True
1 True
2 False
Although that question is pretty old, I came here via google and I found a quite simple way: List slicing. Let's say you want to put an '&' between all list entries.
s = ""
l = [1, 2, 3]
for i in l[:-1]:
    s = s + str(i) + ' & '
s = s + str(l[-1])
This returns '1 & 2 & 3'.
if the items are unique:

for x in list:
    #code
    if x == list[-1]:
        #code

other options:

pos = -1
for x in list:
    pos += 1
    #code
    if pos == len(list) - 1:
        #code

for x in list:
    #code
#code - e.g. print x

if len(list) > 0:
    for x in list[:-1]:
        #process everything except the last element
    for x in list[-1:]:
        #process only last element
The 'code between' is an example of the Head-Tail pattern.
You have an item, which is followed by a sequence of ( between, item ) pairs. You can also view this as a sequence of (item, between) pairs followed by an item. It's generally simpler to take the first element as special and all the others as the "standard" case.
Further, to avoid repeating code, you have to provide a function or other object to contain the code you don't want to repeat. Embedding an if statement in a loop which is always false except one time is kind of silly.
def item_processing( item ):
    # *the common processing*

head_tail_iter = iter( someSequence )
head = next(head_tail_iter)
item_processing( head )
for item in head_tail_iter:
    # *the between processing*
    item_processing( item )
This is more reliable because it's slightly easier to prove. It doesn't create an extra data structure (i.e., a copy of a list) and doesn't waste a lot of execution on an if condition which is always false except once.
If you're simply looking to modify the last element in data_list then you can simply use the notation:
L[-1]
However, it looks like you're doing more than that. There is nothing really wrong with your way. I even took a quick glance at some Django code for their template tags and they do basically what you're doing.
You can determine the last element with this code:
for i, element in enumerate(list):
    if i == len(list) - 1:
        print("last element is " + element)
This is similar to Ants Aasma's approach but without using the itertools module. It's also a lagging iterator which looks ahead a single element in the iterator stream:
def last_iter(it):
    # Ensure it's an iterator and get the first field
    it = iter(it)
    prev = next(it)
    for item in it:
        # Lag by one item so I know I'm not at the end
        yield 0, prev
        prev = item
    # Last item
    yield 1, prev

def test(data):
    result = list(last_iter(data))
    if not result:
        return
    if len(result) > 1:
        assert set(x[0] for x in result[:-1]) == set([0]), result
    assert result[-1][0] == 1

test([])
test([1])
test([1, 2])
test(range(5))
test(xrange(4))

for is_last, item in last_iter("Hi!"):
    print is_last, item
We can achieve that using for-else
cities = [
    'Jakarta',
    'Surabaya',
    'Semarang'
]

for city in cities[:-1]:
    print(city)
else:
    print(' '.join(cities[-1].upper()))
output:
Jakarta
Surabaya
S E M A R A N G
The idea is that we loop only up to index n-1; after the for is exhausted, the else clause runs, and we access the last element directly using [-1].
You can use a sliding window over the input data to get a peek at the next value and use a sentinel to detect the last value. This works on any iterable, so you don't need to know the length beforehand. The pairwise implementation is from itertools recipes.
from itertools import tee, izip, chain

def pairwise(seq):
    a, b = tee(seq)
    next(b, None)
    return izip(a, b)

def annotated_last(seq):
    """Returns an iterable of pairs of input item and a boolean that show if
    the current item is the last item in the sequence."""
    MISSING = object()
    for current_item, next_item in pairwise(chain(seq, [MISSING])):
        yield current_item, next_item is MISSING

for item, is_last_item in annotated_last(data_list):
    if is_last_item:
        # current item is the last item
Is there no possibility to iterate over all but the last element, and treat the last one outside of the loop? After all, a loop is created to do something similar to all elements you loop over; if one element needs something special, it shouldn't be in the loop.
(see also this question: does-the-last-element-in-a-loop-deserve-a-separate-treatment)
EDIT: since the question is more about the "in between", either the first element is the special one in that it has no predecessor, or the last element is special in that it has no successor.
I like the approach of #ethan-t, but while True is dangerous from my point of view.
data_list = [1, 2, 3, 2, 1]  # sample data
L = list(data_list)  # destroy L instead of data_list
while L:
    e = L.pop(0)
    if L:
        print(f'process element {e}')
    else:
        print(f'process last element {e}')
del L
Here, data_list is such that the last element is equal in value to the first one of the list. L can be exchanged with data_list, but in this case it ends up empty after the loop. Using while True is also possible if you check that the list is not empty before processing, or if the check is not needed (ouch!).
data_list = [1, 2, 3, 2, 1]
if data_list:
    while True:
        e = data_list.pop(0)
        if data_list:
            print(f'process element {e}')
        else:
            print(f'process last element {e}')
            break
else:
    print('list is empty')
The good part is that it is fast. The bad - it is destructible (data_list becomes empty).
Most intuitive solution:
data_list = [1, 2, 3, 2, 1]  # sample data
for i, e in enumerate(data_list):
    if i != len(data_list) - 1:
        print(f'process element {e}')
    else:
        print(f'process last element {e}')
Oh yes, you have already proposed it!
There is nothing wrong with your way, unless you have 100 000 iterations and want to save 100 000 "if" checks. In that case, you can go this way:
iterable = [1,2,3] # Your data
iterator = iter(iterable) # get the data iterator

try : # wrap all in a try / except
    while 1 :
        item = iterator.next()
        print item # put the "for loop" code here
except StopIteration, e : # make the process on the last element here
    print item
Outputs :
1
2
3
3
But really, in your case I feel like it's overkill.
In any case, you will probably be luckier with slicing :
for item in iterable[:-1] :
    print item
print "last :", iterable[-1]
#outputs
1
2
last : 3
or just :
for item in iterable :
    print item
print "last :", iterable[-1]
#outputs
1
2
3
last : 3
Eventually, a KISS way to do your stuff, and one that would work with any iterable, including the ones without __len__ :

item = ''
for item in iterable :
    print item
print item

Outputs:
1
2
3
3
I feel like I would do it that way; it seems simple to me.
Use slicing and is to check for the last element:
for data in data_list:
    <code_that_is_done_for_every_element>
    if data is not data_list[-1]:
        <code_that_is_done_between_elements>
Caveat emptor: This only works if all elements in the list are actually different (have different locations in memory). Under the hood, Python may detect equal elements and reuse the same objects for them. For instance, for strings of the same value and common integers.
Google brought me to this old question and I think I could add a different approach to this problem.
Most of the answers here would deal with a proper treatment of a for loop control as it was asked, but if the data_list is destructible, I would suggest that you pop the items from the list until you end up with an empty list:
while True:
    element = element_list.pop(0)
    do_this_for_all_elements()
    if not element_list:
        do_this_only_for_last_element()
        break
    do_this_for_all_elements_but_last()
You could even use while len(element_list) if you don't need to do anything with the last element. I find this solution more elegant than dealing with next().
For me the most simple and pythonic way to handle a special case at the end of a list is:
for data in data_list[:-1]:
    handle_element(data)
handle_special_element(data_list[-1])
Of course this can also be used to treat the first element in a special way.
Better late than never. Your original code used enumerate(), but you only used the i index to check if it's the last item in the list. Here's a simpler alternative (if you don't need enumerate()) using negative indexing:
for data in data_list:
    code_that_is_done_for_every_element
    if data != data_list[-1]:
        code_that_is_done_between_elements
if data != data_list[-1] checks if the current item in the iteration is NOT the last item in the list.
Hope this helps, even nearly 11 years later.
If you are going through the list, for me this worked too:
for j in range(0, len(Array)):
    if len(Array) - j > 1:
        notLast()
Instead of counting up, you can also count down:
nrToProcess = len(list)
for s in list:
    s.doStuff()
    nrToProcess -= 1
    if nrToProcess == 0:  # this is the last one
        s.doSpecialStuff()
I will provide a more elegant and robust way, using unpacking:
def mark_last(iterable):
    try:
        *init, last = iterable
    except ValueError:  # if iterable is empty
        return
    for e in init:
        yield e, True
    yield last, False
Test:
for a, b in mark_last([1, 2, 3]):
    print(a, b)
The result is:
1 True
2 True
3 False
If you are looping over the list, using the enumerate function is one of the best options.
for index, element in enumerate(ListObj):
    # print(index, ListObj[index], len(ListObj))
    if index != len(ListObj) - 1:
        # Do things to the element which is not the last one
    else:
        # Do things to the element which is the last one
Delay the special handling of the last item until after the loop.
>>> for i in (1, 2, 3):
...     pass
...
>>> i
3
There can be multiple ways; slicing will be fastest. Here is one more, which uses the .index() method:
>>> l1 = [1,5,2,3,5,1,7,43]
>>> [i for i in l1 if l1.index(i)+1==len(l1)]
[43]
If you are happy to be destructive with the list, then there's the following.
We are going to reverse the list in order to speed up the process from O(n^2) to O(n), because pop(0) shifts the whole list on every call - cf. Nicholas Pipitone's comment below
data_list.reverse()
while data_list:
    value = data_list.pop()
    code_that_is_done_for_every_element(value)
    if data_list:
        code_that_is_done_between_elements(value)
    else:
        code_that_is_done_for_last_element(value)
This works well with empty lists, and lists of non-unique items.
Since it's often the case that lists are transitory, this works pretty well ... at the cost of destroying the list.
Assuming input as an iterator, here's a way using tee and izip from itertools:
from itertools import tee, izip

items, between = tee(input_iterator, 2)  # Input must be an iterator.
first = items.next()
do_to_every_item(first)  # All "do to every" operations done to first item go here.
for i, b in izip(items, between):
    do_between_items(b)  # All "between" operations go here.
    do_to_every_item(i)  # All "do to every" operations go here.
Demo:
>>> def do_every(x): print "E", x
...
>>> def do_between(x): print "B", x
...
>>> test_input = iter(range(5))
>>>
>>> from itertools import tee, izip
>>>
>>> items, between = tee(test_input, 2)
>>> first = items.next()
>>> do_every(first)
E 0
>>> for i,b in izip(items, between):
...     do_between(b)
...     do_every(i)
...
B 0
E 1
B 1
E 2
B 2
E 3
B 3
E 4
>>>
The simplest solution coming to my mind is:
for item in data_list:
    try:
        print(new)
    except NameError:
        pass
    new = item
print('The last item: ' + str(new))
So we always look ahead one item by delaying the processing by one iteration. To skip doing something during the first iteration I simply catch the error.
Of course you need to think a bit, in order for the NameError to be raised when you want it.
Also keep in mind the construct
try:
    new
except NameError:
    pass
else:
    # continue here if no error was raised
This relies on the name new not having been previously defined. If you are paranoid you can ensure that new doesn't exist using:
try:
    del new
except NameError:
    pass
Alternatively you can of course also use an if statement (if notfirst: print(new) else: notfirst = True). But as far as I know the overhead is bigger.
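Spelled out, that if-based variant of the same delayed-processing idea would look like this:
notfirst = False
for item in data_list:
    if notfirst:
        print(new)  # process the previous item
    else:
        notfirst = True
    new = item
print('The last item: ' + str(new))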
Using `timeit` yields:
...: try: new = 'test'
...: except NameError: pass
...:
100000000 loops, best of 3: 16.2 ns per loop
so I expect the overhead to be undetectable.
Count the items once and keep up with the number of items remaining:
remaining = len(data_list)
for data in data_list:
    code_that_is_done_for_every_element
    remaining -= 1
    if remaining:
        code_that_is_done_between_elements
This way you only evaluate the length of the list once. Many of the solutions on this page seem to assume the length is unavailable in advance, but that is not part of your question. If you have the length, use it.
One simple solution that comes to mind would be:
for i in MyList:
    # Check if 'i' is the last element in the list
    if i == MyList[-1]:
        # Do something different for the last
    else:
        # Do something for all other elements
A second equally simple solution could be achieved by using a counter:
# Count the no. of elements in the list
ListLength = len(MyList)
# Initialize a counter
count = 0

for i in MyList:
    # increment counter
    count += 1
    # Check if 'i' is the last element in the list
    # by using the counter
    if count == ListLength:
        # Do something different for the last
    else:
        # Do something for all other elements
Just check if data is not the same as the last data in data_list (data_list[-1]).
for data in data_list:
    code_that_is_done_for_every_element
    if data != data_list[-1]:
        code_that_is_done_between_elements
So, this is definitely not the "shorter" version, and one might digress on whether "shortest" and "Pythonic" are actually compatible.
But if one needs this pattern often, just put the logic into a 10-liner generator, and get any metadata related to an element's position directly in the for call. Another advantage here is that it will work with an arbitrary iterable, not only sequences.
_sentinel = object()

def iter_check_last(iterable):
    iterable = iter(iterable)
    current_element = next(iterable, _sentinel)
    while current_element is not _sentinel:
        next_element = next(iterable, _sentinel)
        yield (next_element is _sentinel, current_element)
        current_element = next_element

In [107]: for is_last, el in iter_check_last(range(3)):
     ...:     print(is_last, el)
     ...:
False 0
False 1
True 2
This is an old question, and there's already lots of great responses, but I felt like this was pretty Pythonic:
def rev_enumerate(lst):
    """
    Similar to enumerate(), but counts DOWN to the last element being the
    zeroth, rather than counting UP from the first element being the zeroth.

    Since the length has to be determined up-front, this is not suitable for
    open-ended iterators.

    Parameters
    ----------
    lst : Iterable
        An iterable with a length (list, tuple, dict, set).

    Yields
    ------
    tuple
        A tuple with the reverse cardinal number of the element, followed by
        the element of the iterable.
    """
    length = len(lst) - 1
    for i, element in enumerate(lst):
        yield length - i, element
Used like this:
for num_remaining, item in rev_enumerate(['a', 'b', 'c']):
    if not num_remaining:
        print(f'This is the last item in the list: {item}')
Or perhaps you'd like to do the opposite:
for num_remaining, item in rev_enumerate(['a', 'b', 'c']):
    if num_remaining:
        print(f'This is NOT the last item in the list: {item}')
Or, just to know how many remain as you go...
for num_remaining, item in rev_enumerate(['a', 'b', 'c']):
    print(f'After {item}, there are {num_remaining} items.')
I think the versatility and familiarity with the existing enumerate makes it most Pythonic.
Caveat: unlike enumerate(), rev_enumerate() requires that the input implement __len__, but this includes lists, tuples, dicts and sets just fine.
