Append count after each match - bash

Sample input:
>Sample GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample GJVT7LS03DJOYJ
AAACTCC
>Sample GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG
What I want out:
>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG
In other words, I want to append a count (preceded by "_") to each line that matches a pattern ("Sample" in this case). Any sed/awk/etc. one-liners for this task?

One way:
$ awk '/^>/{$1=$1"_"++i}1' file
>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG

One possible attempt is as follows:
$ awk 'BEGIN{a=1}/Sample/ {$1=$1"_"a; a++}1' file
>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG
For each line containing "Sample", we update the first field by appending "_" followed by the value of a. This variable is initially set to 1 and is incremented by one after each match.
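The logic is easy to sanity-check from the shell; here is a trimmed-down reproduction (the record IDs are shortened placeholders):

```shell
# The counter i starts unset (0), is pre-incremented on each header line,
# and reassigning $1 makes awk rebuild the record with the default OFS (a space).
out=$(printf '>Sample AAA\nACGT\n>Sample BBB\nTTTT\n' | awk '/^>/{$1=$1"_"++i}1')
expected='>Sample_1 AAA
ACGT
>Sample_2 BBB
TTTT'
```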

How to extract a string following a pattern using unix(mac OSX)

I have a file with these tab-separated columns.
Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer 12.88% 4926.5 9.08%
Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer 52.08% 25510.3 47.00%
Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer 10.81% 4377 8.06%
Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer 28.73% 13346.9 24.59%
TEAD4(TEA)/Tropoblast-Tead4-ChIP-Seq(GSE37350)/Homer 40.43% 19549.3 36.01%
In the first column, I want to extract the string up to the first bracket and keep the rest of the columns the same.
For instance, I need the output shown below.
Jun-AP1 12.88% 4926.5 9.08%
Maz 52.08% 25510.3 47.00%
Bach2 10.81% 4377 8.06%
Atf3 28.73% 13346.9 24.59%
TEAD4 40.43% 19549.3 36.01%
Thank you.
I would start with
sed 's/([^ ]*//'
where that's an actual tab character in [^ ].
awk '{sub(/\(.*Homer/,"")}{print $1,$2,$3,$4}' file
Jun-AP1 12.88% 4926.5 9.08%
Maz 52.08% 25510.3 47.00%
Bach2 10.81% 4377 8.06%
Atf3 28.73% 13346.9 24.59%
TEAD4 40.43% 19549.3 36.01%
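If the file really is tab-separated, another possible variant (a sketch, not from the answers above) confines the substitution to the first field so the other columns cannot be touched; the row below is a shortened stand-in for the real data:

```shell
# Keep FS/OFS as tabs and strip everything from the first "(" onward,
# but only inside field 1, leaving the remaining columns untouched.
out=$(printf 'Jun-AP1(bZIP)/Homer\t12.88%%\t4926.5\t9.08%%\n' |
    awk 'BEGIN{FS=OFS="\t"}{sub(/\(.*/,"",$1)}1')
expected=$(printf 'Jun-AP1\t12.88%%\t4926.5\t9.08%%')
```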

Parsing CSV file with \n in double quoted fields

I'm parsing a CSV file that has line breaks inside double-quoted fields. I'm reading the file line by line with a Groovy script, but I get an ArrayIndexOutOfBoundsException when I try to access the missing tokens.
I was trying to pre-process the file to remove those characters and I was thinking to do that with some bash script or with groovy itself.
Could you, please suggest any approach that I can use to resolve the problem?
This is what the CSV looks like:
header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"
This is the groovy script I'm using
def csv = new File(args[0]).text
def retString = ""
def parsedFile = new File("Parsed_" + args[0])
csv.eachLine { line, lineNumber ->
    def splittedLine = line.split(',')
    retString += new Date(splittedLine[0]) + ",${splittedLine[1]},${splittedLine[2]},${splittedLine[3]}\n"
    if (lineNumber % 1000 == 0) {
        parsedFile.append(retString)
        retString = ""
    }
}
parsedFile.append(retString)
UPDATE:
Finally I did this and it works (I needed to format the first column from a timestamp to a human-readable date):
gawk -F',' '{print strftime("%Y-%m-%d %H:%M:%S", substr( $1, 0, length($1)-3 ) )","($2)","($3)","($4)}' TobeParsed.csv > Parsed.csv
Thank you @karakfa
If you use a proper CSV parser rather than trying to do it with split (which as you can see doesn't work with any form of quoting), then it works fine:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"'''
def data = parseCsv(csv)
data.eachWithIndex { line, index ->
    println """Line $index:
              | 1:$line.header1
              | 2:$line.header2
              | 3:$line.header3
              | 4:$line.header4""".stripMargin()
}
Which prints:
Line 0:
1:timestamp
2:abcdefghi
3:abcdefghi
4:sdsd
Line 1:
1:timestamp
2:zxcvb
fffffgfg
3:asdasdasadsd
4:sdsdsd
awk to the rescue!
This will merge the newline-split fields back together; your process can take it from there:
$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}1' splitted.csv
header1,header2,header3
xxxxxx, "abcdefghi", "abcdefghi"
yyyyyy, "zxcvb fffffgfg","asdasdasadsd"
This assumes that an odd number of quotes means a split field, and it replaces the newline with OFS. If you want to simply delete the newline (the split parts will be joined directly), remove OFS.
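The quote-counting trick can be checked on a miniature version of a broken record (placeholder values):

```shell
# With FS='"', an even field count means an odd number of quotes, i.e. an
# unterminated quoted field; getline pulls in the next line and joins it.
out=$(printf 'a,"x\ny","z"\n' |
    awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}1')
expected='a,"x y","z"'
```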

error in writing to a file

I have written a Python script that calls unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table; yet when I checked the corresponding entry in the input file given to sort, the line looks fine. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in a sorted manner:
def custom_sorter(first, second):
    """ A custom sort function which compares items
    based on the values in the 2nd and 6th columns. """
    # First, we break each line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!
with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly (Python 2 cmp)
        dst_sorted_file.write(line + "\n")  # splitlines() strips newlines, so add one back
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call
with open("sort_blast.txt",'wb') as output_file:
check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python e.g., for a small input file:
def custom_key(line):
    fields = line.split()  # split line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)  # sort it
    output_file.write("\n".join(L))  # write to the output file
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.
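The shell command itself is also easy to sanity-check; this minimal reproduction (with made-up six-column rows) confirms that -k6,6n compares the sixth field numerically, so 9 sorts before 10:

```shell
# Sort on field 2 as a string, then field 6 as a number.
out=$(printf 'x b 0 0 0 2\nx a 0 0 0 10\nx a 0 0 0 9\n' | sort -k2,2 -k6,6n)
expected='x a 0 0 0 9
x a 0 0 0 10
x b 0 0 0 2'
```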

alphanumeric sort in VIM

Suppose I have a list in a text file which is as follows -
TaskB_115
TaskB_19
TaskB_105
TaskB_13
TaskB_10
TaskB_0_A_1
TaskB_17
TaskB_114
TaskB_110
TaskB_0_A_5
TaskB_16
TaskB_12
TaskB_113
TaskB_15
TaskB_103
TaskB_2
TaskB_18
TaskB_106
TaskB_11
TaskB_14
TaskB_104
TaskB_112
TaskB_107
TaskB_0_A_4
TaskB_102
TaskB_100
TaskB_109
TaskB_101
TaskB_0_A_2
TaskB_0_A_3
TaskB_116
TaskB_1_A_0
TaskB_111
TaskB_108
If I sort in vim with command %sort, it gives me output as -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_10
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_11
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_1_A_0
TaskB_2
But I would like to have the output as follows -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_1_A_0
TaskB_2
TaskB_10
TaskB_11
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
Note: I just wrote this list to demonstrate the problem. I could generate the list in order in Vim, but I want to sort lists like this in Vim for other things as well.
With [n] sorting is done on the first decimal number
in the line (after or inside a {pattern} match).
One leading '-' is included in the number.
try this command:
sor n
You don't need the %: sort sorts all lines if no range is given.
EDIT
as commented by OP, if you have:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_4
TaskB_0_A_3
TaskB_0_A_5
TaskB_1_A_0
you could try:
sor n /.*_\ze\d*/
or
sor nr /\d*$/
EDIT2
For the newly edited question, this line may give you the expected output based on your example data:
sor nr /\d*$/|sor n
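Outside Vim, GNU sort's version sort (-V) appears to give the same ordering for data like this; from Vim you could filter the buffer with :%!sort -V, assuming GNU coreutils is available:

```shell
# Version sort compares digit runs numerically, so TaskB_2 sorts
# before TaskB_10, and TaskB_10 before TaskB_100.
out=$(printf 'TaskB_10\nTaskB_2\nTaskB_1_A_0\nTaskB_0_A_5\nTaskB_100\n' | sort -V)
expected='TaskB_0_A_5
TaskB_1_A_0
TaskB_2
TaskB_10
TaskB_100'
```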

how to replace last comma in a line with a string in unix

I am trying to insert a string into every line except the first and last lines of a file, but I am not able to get it done. Can anyone give me a clue how to achieve this? Thanks in advance.
How to replace the last comma in a line with the string xxxxx (except for the first and last rows) using unix.
Original File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
99,SRI,FF,28
Expected File
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more-or-less what you want. It prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.
$ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
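If you read the requirement literally as "insert xxxxx after the last comma, except on the first and last rows", a sed sketch is also possible (written against GNU sed; the four-line input below is a hypothetical miniature of the real file):

```shell
# 1! skips the first line, the nested $! skips the last, and the
# substitution prefixes xxxxx to whatever follows the line's last comma.
out=$(printf 'h1,h2\na,b,c\nd,e,f\n99,end\n' |
    sed '1!{$!s/,\([^,]*\)$/,xxxxx\1/;}')
expected='h1,h2
a,b,xxxxxc
d,e,xxxxxf
99,end'
```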
