sorting numerically by first column - bash

I have a file with almost 900 lines in Excel that I've saved as a tab-delimited .txt file. I'd like to sort the text file by the numbers given in the first column (they range between 0 and 2250). The other columns contain both numbers and letters of varying length, e.g.
myfile.txt:
0251 abcd 1234,24 bcde
2240 efgh 2345,98 ikgpppm
0001 lkjsi 879,09 ikol
I've tried
sort -k1 -n myfile.txt > myfile_num.txt
but I just get an identical file with a new name. I'd like to get:
myfile_num.txt
0001 lkjsi 879,09 ikol
0251 abcd 1234,24 bcde
2240 efgh 2345,98 ikgpppm
What am I doing wrong? I'm guessing that it's quite simple, but I'd appreciate any help I can get! I only know a little bash scripting, so it'd be nice if the script is a very simple one-liner that I can understand :)
Thanks :)

If the file uses old Mac OS carriage returns (\r) as line endings, sort sees the whole file as one long line, which would explain the unchanged output. Use this to convert the carriage returns to newlines:
tr '\r' '\n' < myfile.txt | sort
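If you want to check the line endings first and get the numeric sort the question asked for, a minimal sketch (the file utility usually reports "with CR line terminators" for old Mac files):
file myfile.txt
tr '\r' '\n' < myfile.txt | sort -n > myfile_num.txt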

As stated here, you can run into problems with this (and in the other pseudo-follow-up duplicate question you asked, yes, you did):
tr '\r' '\n' < myfile.txt | sort -n
It works fine here on MSYS, but on some platforms you may have to add:
export LC_CTYPE=C
otherwise tr will treat the input according to the current locale and will probably flag it as corrupt after hitting the maximum line length.
Obviously I could not test it, but I'm confident it will solve the problem given what I read on the linked answer.
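If you'd rather not export the variable for your whole session, you can set it for just the one command (it only needs to affect tr); a sketch:
LC_CTYPE=C tr '\r' '\n' < myfile.txt | sort -n > myfile_num.txt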

A python approach (python 2 & 3 compatible), immune to shell quirks and portable. I noticed that the input file has some '0x8C' chars (exotic dots), which probably confuse the tr command.
That is handled properly below:
import csv, sys
# read the file as binary, as it is not really text
with open("Proteins.txt", "rb") as f:
    data = bytearray(f.read())
# replace non-ASCII chars (e.g. the 0x8C "exotic dots") by classical dots
for i, c in enumerate(data):
    if c > 0x7F:  # non-ascii: replace by dot
        data[i] = ord(".")
# convert to a list of ASCII strings (split on the old Mac separator \r)
lines = "".join(map(chr, data)).split("\r")
# treat our lines as input for the CSV reader
cr = csv.reader(lines, delimiter='\t', quotechar='"')
# read all the lines into a list
rows = list(cr)
# sort on the first column, numerically; int() copes with leading zeros,
# and non-numeric values sort to the top because False == 0
rows = sorted(rows, key=lambda x: x[0].isdigit() and int(x[0]))
# write the file back as a nice, legal, ASCII tsv file
if sys.version_info < (3,):
    f = open("Proteins_sorted_2.txt", "wb")
else:
    f = open("Proteins_sorted_2.txt", "w", newline='')
cw = csv.writer(f, delimiter='\t', quotechar='"')
cw.writerows(rows)
f.close()

Related

Sorting the contents within a column using Shell Script Line by Line in a File

I am sorting a file by a column using the command:
cat myFile | sort -u -k3
Now I want to sort data within a column of the file. Can anyone please help and tell me how I can achieve it?
My data looks like this in the file named Student.csv:
Name,Age,Marks,Grades
Sam,21,"34,56,21,67","C,B,D,A"
Josh,25,"90,89,78,45","A,A,B,C"
Output-
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
Will Appreciate the help, Thanks
You should export your CSV with a field separator that does not occur within the values themselves. Otherwise it becomes hugely cumbersome to deal with this.
Afterwards you can easily sort by specifying the separator and the field.
For example, if you used | as the separator:
Name|Age|Marks|Grades
Sam|21|"34,56,21,67"|"C,B,D,A"
Josh|25|"90,89,78,45"|"A,A,B,C"
Then execute:
cat myFile | sort -u -k3 -t\|
or:
sort -u -k3 -t\| <myFile
Afterwards you could put your commas back:
sort -u -k3 -t\| <myFile | sed 's/|/,/g'
Did it; the full step-by-step explanation is in the edit below.
cat Student.csv | head -n1 && cat Student.csv | tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}'
Alternatively:
cat Student.csv | tee >(head -n1) >(tail -n+2 | awk -F \" '{split($2,a,",");asort(a);b="";for(i in a)b=b a[i] ",";split($4,c,",");asort(c);d="";for(i in c)d=d c[i] ",";printf "%s\"%s\",\"%s\"\n",$1,substr(b,1,length(b)-1),substr(d,1,length(d)-1)}') >/dev/null ; sleep 0.1
Output:
Name,Age,Marks,Grades
Sam,21,"21,34,56,67","A,B,C,D"
Josh,25,"45,78,89,90","A,A,B,C"
https://www.tutorialspoint.com/awk/index.htm
Edit -- 'kay, the explanation:
cat concatenates (glues) files together, but when you just give it one arg, then that's what it prints out.
You can do the next part in one or two steps; I'll explain the first method. The | pipe directs the output to another command. We all know this, or we wouldn't be here right now... but someday someone will come across this post and wonder what it does.
head prints out the first few lines of what you give it. Here, I specified -n1 number of lines = one, so it would print out the header:
Name,Age,Marks,Grades
&& continues to the next command, so long as that initial instruction was a success.
cat Student.csv again, but this time piped into tail, which prints the last few lines, of whatever you give it. -n+2 specifies to spit out everything from line number 2, and beyond.
We then pipe those contents into AWK https://en.wikipedia.org/wiki/AWK ...I'm sure you could do it with sed https://en.wikipedia.org/wiki/Sed, and I started with that, but sed tends to be simpler than awk, so you'd need far more chained commands to achieve the same thing. Lisp might be able to do it more concisely, but it sounded like you were asking for shell builtins. Python's also decent with strings, but again, sh.
-F \" delegates a literal " as the field separator, so that we can group the contents into 3 categories:
Sam,21, " 34,56,21,67 " , "C,B,D,A"
$1 = Sam,21,
$2 = 34,56,21,67
$3 = ,
$4 = C,B,D,A
You actually get 4, but I'm throwing out that comma in the third position. It's easy enough to put it back in.
We now need to sort those numbers, so split($2,a,",") fills an array, in this case named a, with the contents of $2 split on the , symbol.
a = [ 34, 56, 21, 67 ]
; separates AWK commands, you can mostly ignore those. If there were simply a space, awk would try to concatenate items together, and we don't want that yet.
Next, asort( a ) sorts the contents of a in place -- https://www.tutorialspoint.com/awk/awk_string_functions.htm
a = [ 21, 34, 56, 67 ]
Here would be a perfect time for Python's string .join() method https://www.w3schools.com/python/ref_string_join.asp
However, we don't have that available to us, and AWK doesn't seem to have it as far as I know, so we have to roll our own here. So we construct a string, b, to which each item in a gets appended. Single-quotes often won't do on the command line, so you'll see double-quotes.
b=""
for( i in a ) b=b a[i] ","
b begins empty. Iterating a for-loop over a's contents, we append each item followed by a comma. Leave the trailing comma for now; it'll get trimmed off in a bit.
21,34,56,67,
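As an aside, you can try this split/asort/join step in isolation at the shell; asort is a gawk extension, so this is a gawk-only sketch:
echo "34,56,21,67" | awk '{n=split($0,a,","); asort(a); s=a[1]; for(i=2;i<=n;i++) s=s "," a[i]; print s}'
21,34,56,67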
Exact same procedure for $4, but we name the array c this time, and the string in which those contents are concatenated with commas, d -- split( $4, c, "," ) ; asort( c ) ; d="" ; for( i in c ) d=d c[i] "," You can name them anything you like; I just happened to have ABCD staring me in the face from those grade listings, so that's what I went with.
OK, now we have everything we need.
$1 = Sam,21,
b = 21,34,56,67,
d = A,B,C,D,
Let's format a string so they're all together.
printf "%s\"%s\",\"%s\"\n"
This will print $1 in the first %s string position, then a literal double-quote,
b into the second %s string position, next ",",
followed by d in the third %s position,
all wrapped up with a final double-quote and a newline.
However, b and d both have trailing commas, so we trim those off with AWK's substr() command. -- https://www.tutorialspoint.com/awk/awk_string_functions.htm Knowing where to begin is easy enough, but we need to chop those at one-from-the-end.
substr( b, 1, length(b) -1 )
substr( d, 1, length(d) -1 )
It'd be nice if you could just specify -2, and have it count backwards, like you can in Lua, Python, et al... but that doesn't seem to work in AWK, so whatevs. Ya live, ya learn. And there you have it, all your ducks in a row.
Sam,21,"21,34,56,67","A,B,C,D"
This does the job, maybe not elegantly, but it's within the required guidelines. I'm sure there are code-golfing possibilities in there somewhere, but it's solid logic you can follow.

Transliteration in sed

I'm trying to convert Arabic numerals to Roman numerals using sed (just as a learning exercise), but I'm not getting the expected output.
The sed manual says
y/source/dest/
Transliterate the characters in the pattern space which appear in source to the corresponding character in dest.
Input
echo "1 5 15 20" | sed 'y/151520/IVXVXX/'
Output
I V IV XX
Expected output
I V XV XX
I've tried replacing the first X with other characters, and the output is the same for each, so I gather that 1 is mapped to I by sed. However, according to the description of the y command, shouldn't the program be transliterating character by character? How would I do this?
I think you misunderstand the description. What y does is whenever any character from the left-hand side occurs in the input, replace it with the corresponding character from the right-hand side.
Specifying one character multiple times doesn't really make sense and I'm not sure the behavior of sed is defined in this case, although your version apparently takes the first occurrence and uses that.
To illustrate:
$ echo HELLO WORLD | sed 'y/L/x/'
HExxO WORxD
$ echo HELLO WORLD | sed 'y/LL/xy/'
HExxO WORxD
Fundamentally, your problem is that it's impossible to accomplish this task with just transliteration.
Your case illustrates that quite nicely: 15 is really 1 and 5 and sed has no way of distinguishing between the two.
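If you want to stay in sed, one workaround is to use ordinary s/// substitutions instead of y///, replacing the longer tokens first so that 1 and 5 don't clobber 15; a sketch (the \b word boundary is a GNU sed extension):
echo "1 5 15 20" | sed 's/\b15\b/XV/g; s/\b20\b/XX/g; s/\b5\b/V/g; s/\b1\b/I/g'
I V XV XX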
You already got the "why" answer, but FYI here's how to do what you wanted using a standard UNIX tool:
$ echo "1 5 15 20" | awk '
BEGIN { split("1 5 15 20",a); split("I V XV XX",r); for (i in a) map[a[i]]=r[i] }
{ for (i=1; i<=NF; i++) $i=map[$i] }
1'
I V XV XX

awk sed backreference csv file

A question to extend previous one here. (I prefer asking new question rather editing first one. I may be wrong)
EDIT: OK, I was wrong, I should have edited my first question. My bad (the SO question is an art, difficult to master).
I have a csv file with a semicolon as the field delimiter. Here is an extract of the csv file:
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output:
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I'm looking for a solution to swap 2 patterns in each field.
pattern 1: any number, with an optional decimal mark (.) and optional decimal digits
e.g.: 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2: any string that begins with a left parenthesis, followed by one or more characters, and ending with a right parenthesis
e.g.: (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
The solutions provided for the first question are right for a simple file that contains only one column. If I try them within the csv file, I face multiple failures.
So I tried a similar awk solution, which is (I think) more "column-oriented".
I have tried
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by fixing the field delimiter (;), my "regex swap" would succeed in every field. It was a mistake.
Here is an example of a failure:
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail here when it succeeds with a one-column file? And what is the best tool for this challenge?
sed with a very long regex?
awk with a very long regex?
a for loop?
other tools?
PS: I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, awk usually comes to the rescue:
awk -vFS=';' -vOFS=';' '{
    for (i = 1; i < NF; i++) {
        split($i, t, "(")
        if (length(t[1]) != 0 && length(t[2]) != 0) {
            $i = "(" t[2] " " t[1]
        }
    }
    print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However, this will fail if fields are quoted, i.e. the separator ; appears inside the values...
First we set the input and output separator to ;
We iterate through all the fields in the line: for (i = 1; i < NF; i++)
We split each field on the ( character
If the first part of the split has nonzero length and the second part also has nonzero length,
we swap the two parts of the field and add a space (remembering to put back the ( that the split removed from the beginning).
And then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
    sed 's/;/\n/g' |
    sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
    xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; I insert a newline.
For each line, I swap the string of at least one character before the ( with the parenthesised group.
I then merge 7 lines back together, using ; as the separator, with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of numbers (possibly with a decimal point) followed by a parenthesised group and rearrange them in the desired fashion, globally throughout each line.

Slow bash script to execute sed expression on each line of an input file

I have a simple bash script as follows
#!/bin/bash
#This script reads a file of row identifiers separated by new lines
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt; > filtered.fa
while read SEQID; do
    sed -n -e "/$SEQID/,/>/ p" $1 | head -n -1
done
A fasta file has the following format:
> HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
Edit: changed the header names to better show their actual structure in the files.
The script I made above does filter the file correctly, but it is very slow. My input file has ~ 20,000,000 lines containing ~ 4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like python or perl?) Any ideas why the script above is taking hours to complete?
You're scanning the large file 80k times. I'll suggest a different approach with a different tool: awk. Load the selection list into a hashmap (awk array) and, while scanning the large file, print any sequence that matches.
For example
$ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
$1 in a' headers fasta
The -F"\n" flag sets the field separator in the input file to be a new line. -v RS=">" sets the record separator to be a ">"
Sequence ID 1
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
Sequence ID 4
GGGTACAGACCTACAT
CAACTAGGGGACCAAT
the headers file contains
$ cat headers
1
4
and the fasta file includes some more records in the same format.
If your headers already include the "Sequence ID" prefix, adjust the code as such. I didn't test this for large files, but it should be dramatically faster than your code as long as you don't have memory restrictions on holding an 80K-entry array. In that case, splitting the headers into multiple sections and combining the results should be trivial, as the sketch below shows.
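A rough sketch of that chunking idea, assuming the headers already match $1 of each record and using a hypothetical headers_part_ prefix (this rescans the fasta file once per chunk, trading time for memory):
split -l 20000 headers headers_part_
for h in headers_part_*; do
    awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' "$h" fasta
done > out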
To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:
awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' headers fasta > out
The ORS and OFS flags set the output field and record separators, in this case to be the same as the input fasta file.
You should take advantage of the fact (which you haven't explicitly stated, but I assume) that the huge fasta file contains the sequences in order (sorted by ID).
I'm also assuming the headers file is sorted by ID. If it isn't, make it so - sorting 80k integers is not costly.
When both are sorted, it boils down to a single simultaneous linear scan through both files. And since it runs in constant memory, it can work with any file size, unlike the other awk example. I give an example in python since I'm not comfortable with manual iteration in awk.
import sys
fneedles = open(sys.argv[1])
fhaystack = open(sys.argv[2])
def get_next_id():
    while True:
        line = next(fhaystack)
        if line.startswith(">Sequence ID "):
            return int(line[len(">Sequence ID "):])
def get_next_needle():
    return int(next(fneedles))
try:
    i = get_next_id()
    j = get_next_needle()
    while True:
        if i == j:
            print(i)
        while i <= j:
            i = get_next_id()
        while i > j:
            j = get_next_needle()
except StopIteration:
    pass
Sure it's a bit verbose, but it finds 80k of 4M sequences (339M of input) in about 10 seconds on my old machine. (It could also be rewritten in awk which would probably be much faster). I created the fasta file this way:
for i in range(4000000):
    print(">Sequence ID {}".format(i))
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
And the headers ("needles") this way:
import random
ids = list(range(4000000))
random.shuffle(ids)
ids = ids[:80000]
ids.sort()
for i in ids:
    print(i)
It's slow because you are reading the same file several times, when you could have sed read it once and process all the patterns. So generate a sed script with a statement for each ID, using />/b to replace your head -n -1.
while read ID; do
    printf '/%s/,/>/ { />/b; p }\n' $ID;
done | sed -n -f - data.fa
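For example, with the two headers from the sample file above in seq_ids.txt, the script that sed reads via -f - would contain:
/HeadER23217/,/>/ { />/b; p }
/ANotherName/,/>/ { />/b; p }
so sed makes a single pass over data.fa, printing each matched record and branching past (b) the following > header instead of printing it.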

Getting one line in a huge file with bash

How can I get a particular line in a 3 GB text file? All the lines have:
the same length, and
are delimited by \n.
And I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned.
If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.
Let the block size be the length of each line (including the newline), then you can do:
$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null
The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.
That should be much more efficient than streaming all the lines before the one you want through a program and then throwing them away, as dd seeks directly to the position you want in the file and reads only one line of data.
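For example, if each line is 16 characters plus the newline (17 bytes in total) and you want line 1000, the call would look something like:
dd if=filename bs=17 skip=999 count=1 2>/dev/null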
head -10 file | tail -1 returns line 10; probably slow, though.
from here
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
An awk alternative, where 3 is the line number.
awk 'NR == 3 {print; exit}' file.txt
If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:
head -n N filespec | tail -1
where N is the line number you want.
This isn't going to be the best-performing piece of code for a 3 GB file, unfortunately, but there are ways to make it better.
If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed length records.
So the file:
0000000000
0000000017
0000000092
0000001023
would give you a fast way to locate each line. Just multiply the desired line number (minus one) by the index record size and seek to that position in the index file.
Then use the value at that location to seek in the main file so you can read until the next newline character.
So for line 3, you would seek to 22 in the index file ((3 - 1) * 11, the index record length being 10 characters plus one more for the newline). Reading the value there, 0000000092, would give you the offset to use into the main file.
Of course, that's not so useful if the file changes frequently although, if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
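A sketch of building such an index with awk, assuming \n line endings and single-byte characters (bigfile and bigfile.idx are hypothetical names):
awk '{ printf "%010d\n", offset; offset += length($0) + 1 }' bigfile > bigfile.idx
Each output line is the zero-padded byte offset of the corresponding line of bigfile, matching the fixed-length index format shown above.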
And, based on your update:
Update: If it matters, all the lines have the same length.
With that extra piece of information, you don't need the index - you can just seek immediately to the right location in the main file by multiplying the record length by the record number (assuming the values fit into your data types).
So something like the pseudo-code:
def getline(fhandle,reclen,recnum):
seek to position reclen*recnum for file fhandle.
read reclen characters into buffer.
return buffer.
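A minimal shell version of that pseudo-code, using dd as in the answer above and assuming reclen includes the newline and recnum is 1-based:
getline() { dd if="$1" bs="$2" skip="$(( $3 - 1 ))" count=1 2>/dev/null; }
getline myfile.txt 17 3    # file, record length, record number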
Use q with sed to make the search stop after the line has been printed.
sed -n '11723{p;q}' filename
Python (minimal error checking):
#!/usr/bin/env python
import sys
# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash
# seeks the requested line in a file with a fixed line length
# Usage: ./lineseek.py LINE FILE
# Example: ./lineseek 11723 data.txt
EXIT_SUCCESS = 0
EXIT_NOT_FOUND = 1
EXIT_OPT_ERR = 2
EXIT_FILE_ERR = 3
EXIT_DATA_ERR = 4
# could use a try block here
seekline = int(sys.argv[1])
file = sys.argv[2]
try:
    if file == '-':
        handle = sys.stdin
        size = 0
    else:
        handle = open(file,'r')
except IOError as e:
    print >> sys.stderr, ("File Open Error")
    exit(EXIT_FILE_ERR)
try:
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except IOError as e:
    print >> sys.stderr, ("File I/O Error")
    exit(EXIT_FILE_ERR)
# it would be really weird if this happened
if lineend != linelen:
    print >> sys.stderr, ("Line length inconsistent")
    exit(EXIT_DATA_ERR)
handle.seek(linelen * (seekline - 1))
try:
    line = handle.readline()
except IOError as e:
    print >> sys.stderr, ("File I/O Error")
    exit(EXIT_FILE_ERR)
if len(line) != linelen:
    print >> sys.stderr, ("Line length inconsistent")
    exit(EXIT_DATA_ERR)
print(line)
Argument validation should be a lot better and there is room for many other improvements.
A quick perl one-liner would work well for this too...
$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file
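For example, to print line 11723 (the same line number used in the sed answer above):
perl -ne 'if (11723..11723) {print $_; last;}' /path/to/your/file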
