Sorting lines by number of patterns in each line

I have the following file.
How do I sort its lines by the number of "/" characters in each line?
a/b/c/d/e/f
a/k/l
a/m/m/p
b/h/s/i/l/m
b/h
b/e/f/g/p/l/i
a/p/t
a/k/s
b/p/t
b/k/s
The lines should be sorted as follows:
b/e/f/g/p/l/i
a/b/c/d/e/f
b/h/s/i/l/m
a/m/m/p
a/k/l
a/p/t
a/k/s
b/p/t
b/k/s
b/h
What is the best way to achieve this?
Thanks,
Ravi

One approach is to read a line at a time and count how many '/' are in that line.
Then insert an element into a hashmap where the key is the index of the line (a sequential number) and the value is the count, and sort the hashmap entries by value. Iterating through the sorted entries in order gives you the index of each corresponding line.
Example in Python that prints out the lines in sorted order:
import operator

def sort_by_char(lines, ch):
    """Given a list of strings and a character, print the lines sorted by how many times ch occurs in each, most first."""
    hashmap = {}
    for idx in range(len(lines)):
        line = lines[idx]
        count = line.count(ch)
        hashmap[idx] = count
    # reverse=True puts the highest counts first, as the question asks
    sorted_lines = sorted(hashmap.items(), key=operator.itemgetter(1), reverse=True)
    for i in range(len(lines)):
        print(lines[sorted_lines[i][0]])
Alternatively, you could use the count as the key and the line as the value.
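One catch with that alternative: several lines share the same count, so a dictionary keyed by count would keep only one line per count. A minimal sketch that avoids the dictionary altogether, assuming the lines are already in a list, passes the count directly as the sort key. The name `sort_lines_by_count` is a hypothetical helper, not part of the answer above.

```python
def sort_lines_by_count(lines, ch="/"):
    """Return lines ordered from most to fewest occurrences of ch."""
    # sorted() is stable, so lines with equal counts keep their input order.
    return sorted(lines, key=lambda line: line.count(ch), reverse=True)


lines = ["a/b/c/d/e/f", "a/k/l", "a/m/m/p", "b/h/s/i/l/m", "b/h",
         "b/e/f/g/p/l/i", "a/p/t", "a/k/s", "b/p/t", "b/k/s"]
for line in sort_lines_by_count(lines):
    print(line)
```

Because the sort is stable, ties come out in the order they appear in the file, which matches the ordering requested in the question.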

Related

How to print top 5 scores from a text file

I have an external text file where I store names and scores in the form:
(name) has (score - integer) points
An example would be:
Bob has 25 points
I would like to print the lines as they are but only in descending order from highest score.
In other words, I would like to print the same lines as they are in the text file, but sorted from the highest to the lowest integer (score) in the line. I would also like to limit the number of printed lines, so that only the top 5 scores are printed.
I have tried many ways, but all I could end up with is separated names and scores surrounded by quotes and parentheses, whereas what I am aiming for is to print the lines as they are.
Could someone please help me with this?
Directly sort the lines, using the score as a key for sort:
path = "leaderboard.txt"
with open(path, 'r') as f:
    file_lines = f.readlines()
file_lines.sort(key=lambda line: int(line.split()[2]), reverse=True)
print(''.join(file_lines[:5]))
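If only the top few lines are needed, a full sort is not strictly necessary. The sketch below uses `heapq.nlargest` from the standard library. It assumes every non-blank line ends with `(score) points`, so the score is taken as the second-to-last word; `top_scores` and the sample data are hypothetical and only there for illustration.

```python
import heapq


def top_scores(lines, n=5):
    """Return the n lines with the highest scores, highest first."""
    # Ignore blank lines so int() is only applied to real entries.
    scored = [line for line in lines if line.strip()]
    # The score is the second-to-last word, so multi-word names also work.
    return heapq.nlargest(n, scored, key=lambda line: int(line.split()[-2]))


sample = [
    "Bob has 25 points\n",
    "Alice has 40 points\n",
    "Mary Ann has 31 points\n",
    "Tom has 7 points\n",
    "Eve has 18 points\n",
    "Dan has 12 points\n",
]
print("".join(top_scores(sample)), end="")
```

With a real file, the list returned by `f.readlines()` can be passed in place of `sample`.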

How do I make my code read random lines 37 different times?

def pick_random_line
  chosen_line = nil
  File.foreach("id'sForCascade.txt").each_with_index do |line, id|
    chosen_line = line if rand < 1.0/(id+1)
  end
  return chosen_line
end
Hey, I'm trying to make that code pick 37 different lines. How would I do that? I'm stuck and confused.
Assuming you don't want the same line to repeat more than once, I would do it in one line like this:
File.read("test.txt").split("\n").shuffle.first(37)
File.read("test.txt") reads the entire file.
split("\n") splits the file into lines based on the \n delimiter (I assume your file is a text file whose lines are separated by a newline character).
shuffle is a very convenient method of Array that shuffles the lines randomly. You can read about it here:
http://docs.ruby-lang.org/en/2.0.0/Array.html#method-i-shuffle
Finally, first(37) gives you the first 37 lines of the shuffled array. Because of the shuffle, these are a random selection with no line repeated.
You can do something like this:
input_lines = File.foreach("test.txt").map(&:to_s)
output_lines = []
37.times do
  output_lines << input_lines.delete_at(rand(input_lines.length))
end
puts output_lines
This ensures that you aren't grabbing duplicate lines, and you don't need to do any fancy checking.
However, if your file has fewer than 37 lines this may cause a problem. It also assumes that your file exists.
EDIT:
What is happening is that the range passed to rand now depends on the current size of input_lines. Since each chosen line is deleted from the array at its index, the length shrinks and you do not risk picking the same line twice.
If you want to save relatively few lines from a large file, reading the entire file into an array (and then randomly selecting lines) could be costly. It might be better to count the number of lines in the file, randomly select line offsets and then save the lines at those offsets to an array. This approach is no more difficult to implement than the former one, but makes the method more robust, even if the files in the current application are not overly large. [1]
Suppose your filename were given by FName. Here are three ways to count the number of lines in the file:
Count lines, literally
cnt = File.foreach(FName).reduce(0) { |c,_| c+1 }
Use $.
File.foreach(FName) {}
cnt = $.
On Unix-family computers, shell-out to the operating system
cnt = %x{wc -l #{FName}}.split.first.to_i
The third option is very fast.
Random offsets (base 1) for n lines to be saved could be computed as follows:
lines = (1..cnt).to_a.sample(n).sort
Saving the lines at those offsets to an array is straightforward; for example:
File.foreach(FName).with_object([]) do |line,a|
  if lines.first == $.
    a << line
    lines.shift
    break a if lines.empty?
  end
end
Note that $. #=> 1 after the first line is read, and $. is incremented by 1 after each successive line is read. (Hence base 1 for line offsets.)
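For comparison, the same two-pass idea (count the lines, choose random 1-based offsets, then collect only those lines) can be sketched in Python. This is a minimal illustration, not a translation of the Ruby above: `sample_lines` and its `seed` parameter are hypothetical names, and the file is assumed to be a plain text file that can be read twice.

```python
import random


def sample_lines(path, n, seed=None):
    """Return n distinct lines from the file at path, in file order."""
    rng = random.Random(seed)
    # First pass: count lines without keeping them in memory.
    with open(path) as f:
        count = sum(1 for _ in f)
    # Choose n distinct 1-based line numbers (raises ValueError if n > count).
    wanted = set(rng.sample(range(1, count + 1), n))
    picked = []
    # Second pass: keep only the chosen lines and stop early once done.
    with open(path) as f:
        for number, line in enumerate(f, start=1):
            if number in wanted:
                picked.append(line)
                if len(picked) == n:
                    break
    return picked
```

A call such as `sample_lines("test.txt", 37)` would then return 37 distinct lines while holding only those 37 in memory.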
[1] Moreover, many programmers, not just Rubyists, are repelled by the idea of amassing large numbers of anything and then discarding all but a few.

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
It looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line up the whitespace prettily like in your example (if you care about that, you'll have to mess around with formatting), but it gets the rest of the job done:
#Create a dictionary to accumulate full sequences
full_sequences = {}
#Loop through original file (replace test.txt with your file name)
#and add each line to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        line = [element.strip() for element in line.split()]
        if len(line) < 2:
            continue
        full_sequences[line[0]] = full_sequences.get(line[0], "") + line[1]
#Now loop through the dictionary and write each entry as a single line
#(note: this overwrites test.txt; use another name to keep the original)
outstr = ""
with open("test.txt", "w") as outfile:
    for seq in full_sequences:
        outstr += seq + "\t\t" + full_sequences[seq] + "\n"
    outfile.write(outstr)
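If aligned columns matter, one possible variation pads each identifier to the width of the longest one. This is a sketch under the assumption that the input contains only identifier/sequence lines like those in the question (no header or conservation lines); `merge_alignment` is a hypothetical helper name.

```python
def merge_alignment(lines):
    """Join the blocks of an interleaved alignment into one line per sequence."""
    sequences = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and lines without a sequence part
        sequences[parts[0]] = sequences.get(parts[0], "") + parts[1]
    # Pad every identifier to the longest one so the sequences line up.
    width = max((len(name) for name in sequences), default=0)
    return [f"{name.ljust(width)} {seq}" for name, seq in sequences.items()]


example = [
    "TGFb3_human_used_for_docking ALDTNYCF",
    "tr|B3KVH9|B3KVH9_HUMAN ALDTNYCF",
    "",
    "TGFb3_human_used_for_docking LRSADTTHST-",
    "tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-",
]
print("\n".join(merge_alignment(example)))
```

Dictionaries preserve insertion order in Python 3.7 and later, so the sequences come out in the order in which they first appear in the file.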

Copy the lines of a file into a hashmap in Ruby

I have a file with multiple lines. Each line contains two words and a number, separated by commas, for example a, b, 1. This means that strings a and b have the key 1. I wrote the piece of code below:
File.open(ARGV[0], 'r') do |f1|
  while line = f1.gets
    puts line
  end
end
I'm looking for a way to split each line and store its parts so that the first two words are stored in the hashmap with the last number as their key.
Does this work for you?
hash = {}
File.readlines(ARGV[0]).each do |line|
  # remove all whitespace, including the trailing newline, before splitting
  var = line.gsub(/\s+/, '').split(',')
  hash[var[2]] = [var[0], var[1]]
end
This would give:
hash['1'] = ['a','b']
I don't know whether you want to store the number as an integer or a string. If it's an integer you're looking for, just call var[2].to_i before storing.
I modified your code a little; I think it's shorter this way. If I'm wrong in any way, do let me know.

String that can contain multiple numbers - how do I extract the longest number?

I have a string that:
- contains at least one number
- can contain multiple numbers
Some examples are:
https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384
https://www.facebook.com/username_13/posts/101505775425651120
https://www.facebook.com/username/posts/101505775425699820
I need a way to extract the longest number from the string. So for the 3 strings above, it would extract
53199604568
101505775425651120
101505775425699820
How can I do this?
#get the lines first
text = <<ENDTEXT
https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384
https://www.facebook.com/username_13/posts/101505775425651120
https://www.facebook.com/username/posts/101505775425699820
ENDTEXT
lines = text.split("\n")
#this bit is the actual answer to your question
lines.collect{|line| line.scan(/\d+/).sort_by(&:length).last}
Note that I'm returning the numbers as strings here. You could convert them to numbers with to_i.
Scan the string to get an array of the numbers it contains, then use the max method with a block that compares their lengths:
s = "https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384"
s.scan(/\d+/).max{|a,b| a.length <=> b.length}.to_i
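For comparison, the same scan-then-pick-the-longest technique can be sketched in Python with the standard `re` module. `longest_number` is a hypothetical helper name; it returns the digits as a string, and returns `None` for input without any digits (a case the question rules out).

```python
import re


def longest_number(text):
    """Return the longest run of digits in text as a string, or None."""
    runs = re.findall(r"\d+", text)
    # max() returns the first longest run when several share the same length.
    return max(runs, key=len) if runs else None


urls = [
    "https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384",
    "https://www.facebook.com/username_13/posts/101505775425651120",
    "https://www.facebook.com/username/posts/101505775425699820",
]
for url in urls:
    print(longest_number(url))
```

Wrapping the result in `int(...)` converts it to a number, much like `to_i` in the Ruby answers.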

Resources