Bash: Split with 2 factors / variables

I want to split a file with the following algorithm.
The CSV has 3600 lines, previously ordered alphabetically by name ( sort -k2 -n file.csv ).
Currently I can run this command to split the file into equal numbers of lines:
split -l ${MAX_NUMBER_OF_LINES} filename.csv ${new_file_pattern}.
But the original requirement is:
Split into chunks of ${MAX_NUMBER_OF_LINES} lines UNLESS more records whose column 2 starts with the same first letter remain.
For example:
if I have ${MAX_NUMBER_OF_LINES} = 300, I can split the file into chunks of 300 lines as long as no more occurrences of the first letter of the last record's column 2 follow.
If LINE 301 has a record with "Arboreal Peaches", the script has to add it to the current chunk no matter that ${MAX_NUMBER_OF_LINES} was already reached.
It's a sort of confusing explanation... I hope some of you can help me (I have already spent 2 days on this algorithm).
UPDATE
${MAX_NUMBER_OF_LINES} = 3
Example CSV (with fewer lines for example purposes).
The split command reaches ${MAX_NUMBER_OF_LINES}, but line 4 still has a record starting with the letter A:
'Aberdeen Research", 'Los Angeles', 'California'
'Aplueyo Labs", 'Los Angeles', 'US'
'Acar Media Group", 'Los Angeles', 'US'
'Aberdeen Research", 'San Jose', 'US'
'Beethoven Inc", 'San Jose', 'US'
EXPECTED RESULT
Split files
1
'Aberdeen Research", 'Los Angeles', 'California'
'Aplueyo Labs", 'Los Angeles', 'US'
'Acar Media Group", 'Los Angeles', 'US'
'Aberdeen Research", 'San Jose', 'US'
2
'Beethoven Inc", 'San Jose', 'US'

Something like this? In awk:
$ cat split.awk
BEGIN {
    if (max == "") {                          # exit if no max given
        print "Invalid number of lines"
        exit
    }
}
(a=substr($0,2,1)) && ++c>=max && prev!=a {   # a = first letter of column 2;
    c=0                                       # once count >= max and the letter
    fc++                                      # changes, reset count and bump the file counter
}
{
    print $0 > ((mask==""?"x":mask) (fc==""?0:fc))   # write to file, default mask "x"
    prev=a                                           # remember the previous first letter
}
Run it:
$ awk -v max=3 -v mask="file" -f split.awk file.csv
$ cat file0
'Aberdeen Research", 'Los Angeles', 'California'
'Aplueyo Labs", 'Los Angeles', 'US'
'Acar Media Group", 'Los Angeles', 'US'
'Aberdeen Research", 'San Jose', 'US'
$ cat file1
'Beethoven Inc", 'San Jose', 'US'
mask is the filename prefix (your $new_file_pattern) and max is $MAX_NUMBER_OF_LINES, i.e. on the command line set -v max=$MAX_NUMBER_OF_LINES -v mask=$new_file_pattern.
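For example, a minimal wrapper using the variables from the question might look like this (the values shown are illustrative):
#!/bin/bash
# illustrative values; use whatever your existing script defines
MAX_NUMBER_OF_LINES=300
new_file_pattern="chunk"
awk -v max="$MAX_NUMBER_OF_LINES" -v mask="$new_file_pattern" -f split.awk filename.csv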

Related

Remove duplicated fasta sequences (bash or biopython method)

I have a fasta file such as:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
And in this file I would like to remove duplicated sequences and get:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
Here, as you can see, the content after the > name for sequence1_CP, sequence2 and sequence3 is the same, so I want to keep only one of the 3. But if one of the 3 sequences has a _CP in its name, then I want to keep that one in particular. If there is no _CP in any of them, it does not matter which one I keep.
So for the first duplicates between sequence1_CP, sequence2 and sequence3 I keep sequence1_CP.
For the second duplicates between sequence4_CP and sequence5 I keep sequence4_CP.
And for the third duplicates between sequence6 and sequence7 I keep the first one, sequence6.
Does someone have an idea using biopython or a bash method?
In a fasta file, identical sequences are not necessarily wrapped at the same position, so it is paramount to merge the sequence lines before comparing. Furthermore, sequences can be upper case or lower case, but are in the end case-insensitive.
The following awk will do exactly that:
$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"}
{seq="";for(i=2;i<=NF;++i) seq=seq toupper($i)}
!(seq in a){print; a[seq]}' file.fasta
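The bare a[seq] reference is enough to create the key in the array, so any later record whose merged sequence matches fails the !(seq in a) test and is skipped.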
There is actually a version of awk that has been extended to process fasta files:
$ bioawk -c fastx '!(seq in a){print; a[seq]}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is POSIX-compatible.
You could use this awk one-liner:
$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file
Output:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
The above relies on the observation that the sequences are ordered so that the _CP records come before the others, as in the sample. If that is not actually the case, use the following. It stores the first instance of each sequence, which is overwritten if a _CP sequence is found:
$ awk 'BEGIN{FS="\n";RS=""}{if(!(($2,$3) in seen)||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file
Or in pretty-print:
$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    if (!(($2,$3) in seen) || $1~/^[^ ]+_CP /)
        seen[$2,$3]=$0
}
END {
    for (i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
The output order is awk's default, i.e. it appears random.
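If you need a deterministic order, GNU awk can sort the for-in traversal; a minimal, gawk-only tweak to the END block would be:
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk extension: iterate keys in sorted order
    for (i in seen)
        print (++j>1?ORS:"") seen[i]
}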
Update: If both of @kvantour's comments apply in this case, use this awk:
$ awk '
BEGIN {
    FS="\n"
    RS=""
}
{
    for (i=2; i<=NF; i++)
        k = (i==2 ? "" : k) $i
    if (!(k in seen) || $1~/^[^ ]+_CP /)
        seen[k]=$0
}
END {
    for (i in seen)
        print (++j>1?ORS:"") seen[i]
}' file
Output now:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
Or a pure-bash solution (following the same logic as the separate perl solution below):
#! /bin/bash
declare -A p
# Read input into associative array 'p' (id line, two sequence lines,
# and a blank separator line per record)
while read -r id ; do
    read -r s1 ; read -r s2 ; read -r sep
    key="$s1:$s2"
    prev=${p[$key]}
    # keep the first id seen, but let an id containing _CP override it
    if [[ -z "$prev" || "$id" == *_CP* ]] ; then p[$key]=$id ; fi
done
# Print all data
for k in "${!p[@]}" ; do
    echo -e "${p[$k]}\n${k/:/\\n}\n"
done
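Assuming the script is saved as dedupe.sh (the name is illustrative) and the records are separated by blank lines as in the perl version below, run it as:
bash dedupe.sh < file.fasta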
Here's a python program that will provide the results you are looking for:
import fileinput
import re

seq = ""
id = None
seqnames = {}

for line in fileinput.input():
    line = line.rstrip()
    if re.search("^>", line):
        if seq:
            # keep the first id seen for a sequence, but prefer a _CP id
            if seq not in seqnames or re.search("_CP", id):
                seqnames[seq] = id
        seq = ""
        id = line
        continue
    seq += line

# flush the last record
if seq and (seq not in seqnames or re.search("_CP", id)):
    seqnames[seq] = id

for k, v in seqnames.items():
    print(v)
    print(k)
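Since the program uses fileinput, it reads file arguments or standard input; assuming it is saved as dedupe.py (the name is illustrative):
python3 dedupe.py file.fasta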
Or with perl. Assuming the code is in m.pl, it can be wrapped into a bash script.
Hopefully the code will help find medicines, and not develop new viruses :-)
perl m.pl < input-file
#!/usr/bin/perl
use strict ;
my %to_id ;
local $/ = "\n\n" ;             # paragraph mode: one record per read
while ( <> ) {
    chomp ;
    my ($id, $s1, $s2) = split("\n") ;
    my $key = "$s1\n$s2" ;
    my $prev_id = $to_id{$key} ;
    # keep the first id seen, but let an id containing _CP override it
    $to_id{$key} = $id if !defined($prev_id) || $id =~ /_CP/ ;
}
print "$to_id{$_}\n$_\n\n" foreach (keys(%to_id)) ;
It's not clear what the expected order is; the perl code prints directly from the hash, which can be customized if needed.
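For instance, to print in a stable order sorted by the kept sequence id rather than hash order, the last line could become:
print "$to_id{$_}\n$_\n\n" foreach (sort { $to_id{$a} cmp $to_id{$b} } keys(%to_id)) ;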
Here's a Biopython answer. Be aware that you only have two unique sequences in your example (sequences 6 and 7 merely wrap the lines at a different position but are essentially the same protein sequence as sequence 1).
from Bio import SeqIO

seen = []
records = []

# examples are in sequences.fasta
for record in SeqIO.parse("sequences.fasta", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# printing to console
for record in records:
    print(record.name)
    print(record.seq)

# writing to a fasta file
SeqIO.write(records, "unique_sequences.fasta", "fasta")
For more info you can consult the Biopython cookbook.
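If the _CP preference from the question also matters, a sketch along the same lines (still assuming the input is in sequences.fasta) could keep the preferred record per unique sequence in a dict:
from Bio import SeqIO

best = {}  # upper-cased sequence string -> preferred record
for record in SeqIO.parse("sequences.fasta", "fasta"):
    key = str(record.seq).upper()
    # keep the first record seen, but let a _CP id override it
    if key not in best or "_CP" in record.id:
        best[key] = record

SeqIO.write(best.values(), "unique_sequences.fasta", "fasta")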

Concatenate string in .csv after x commas using shell/bash

I have several .csv files containing data. The data vendor created the files with the years indicated only once in the first line, with empty fields in between, and the variable names in the second line. Data follows from the third line onward.
"year 1", , , "year 2", , ,"year 3", , ,
"Var1", "Var2", "Var3", "Var1", "Var2", "Var3", "Var1", "Var2", "Var3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
I am new to shell programming, but it shouldn't be too complicated to write a script that outputs the following:
"Var1_year1", "Var2_year1", "Var3_year1", "Var1_year2", "Var2_year2", "Var3_year2", "Var1_year3", "Var2_year3", "Var3_year3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
Something like
#!/bin/bash
FILES=/Users/pathTo.csvfiles/*.csv
for f in $FILES
do
echo "Processing $f file..."
# 1. Replace the second line with 'Varname_YearX' where YearX comes from the first line
cat ????
# 2. Delete first line
sed -i '' 1d $f
done
echo "Processing complete."
Update: The .csv files vary in their number of lines. Only the first two lines need to be edited; the following lines are data.
If you want to merge the first and the second line of each CSV, try this.
# No point in using a variable for the wildcard
for f in /Users/pathTo.csvfiles/*.csv
do
awk -F , -v OFS=, 'NR==1 { # Collect first line
    # Squash quotes and spaces
    gsub(/"| /, "")
    for(i=1;i<=NF;++i)
        y[i] = ($i != "" ? $i : y[i-1])
    next # Do not fall through to print
}
NR==2 { # Combine collected with current
    gsub(/"| /, "")
    for(i=1;i<=NF;++i)
        $i = $i "_" y[i]
}
# Print everything (except first)
1' "$f" > "$f.tmp"
mv "$f.tmp" "$f"
done
The first loop simply copies the previous field's value to y[i] when the i-th field is empty.
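To illustrate with the sample header (quotes and spaces squashed), y ends up holding:
y[1]="year1" y[2]="year1" y[3]="year1"
y[4]="year2" y[5]="year2" y[6]="year2"
y[7]="year3" y[8]="year3" y[9]="year3"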
Ugly code using csvtool, various standard tools, and bash:
i=file.csv
paste -d_ <(head -2 $i | tail -1 | csvtool transpose -) \
<(head -1 $i | csvtool transpose - |
sed '$d;s/ //;/^$/{g;b};h') |
csvtool transpose - | sed 's/[^,]*/"&"/g' | cat - <(tail +3 $i)
Output:
"Var1_year1","Var2_year1","Var3_year1","Var1_year2","Var2_year2","Var3_year2","Var1_year2","Var2_year2","Var3_year2"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"

Bash - How to replace strings starting from the second column and second row of a file?

I have a csv file like this
time my_account form_token address City
13:19:43 username1 aa57d1cd3d5d8845 Street name 1 City 1
13:19:43 username2 aa57d1cd3d5d8846 Street name 2 City 2
13:19:43 username3 aa57d1cd3d5d8847 Street name 3 City 3
13:19:43 username4 aa57d1cd3d5d8848 Street name 4 City 4
and I want to replace the values below my_account and form_token only; the rest of the columns should remain unchanged. The end result should be like this:
time my_account form_token address City
13:19:43 OhYeah12345 xxxaaaasssss1 Street name 1 City 1
13:19:43 OhYeah12346 xxxaaaasssss2 Street name 2 City 2
13:19:43 OhYeah12347 xxxaaaasssss3 Street name 3 City 3
13:19:43 OhYeah12348 xxxaaaasssss4 Street name 4 City 4
Here's the file if you wanna download https://www.dropbox.com/s/t9damejyrlccyam/demo.csv?dl=0
How do I do this with bash?
Here's what I have done:
awk -F ',' -v OFS=',' '$1 {$3="Another Street"; print}' /tmp/demo.csv
But this command also replaces the address on the first row; I want it to start from the second row and below.
You could use the following awk command and redirect the output to a new file:
$ awk -v var="OhYeah123" -v var2="xxxaaaasssss" 'NR>1{$2=var substr($3,length($3)-1,length($3)); $3=var2 (NR-1); print}NR==1' file.in
time my_account form_token address City
13:19:43 OhYeah12345 xxxaaaasssss1 Street name 1 City 1
13:19:43 OhYeah12346 xxxaaaasssss2 Street name 2 City 2
13:19:43 OhYeah12347 xxxaaaasssss3 Street name 3 City 3
13:19:43 OhYeah12348 xxxaaaasssss4 Street name 4 City 4
I have defined 2 variables that you can use within awk to replace the fields with whatever you want.
NR==1 will print the first line unchanged.
NR>1{$2=var substr($3,length($3)-1,length($3)); $3=var2 (NR-1); print} — from the 2nd line on, the 2nd and 3rd fields are replaced by the variable contents, concatenated respectively with substr($3,length($3)-1,length($3)) (the last two characters of the original token) and (NR-1), to produce the output shown.
If you do not need this logic, just use:
$ awk -v var="OhYeah123" -v var2="xxxaaaasssss" 'NR>1{$2=var; $3=var2;}1' file.in
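If the file is genuinely comma-separated rather than whitespace-aligned (the question says .csv), the same approach would presumably need explicit separators:
$ awk -F',' -v OFS=',' -v var="OhYeah123" -v var2="xxxaaaasssss" 'NR>1{$2=var; $3=var2}1' file.in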

Delete an entire row if its date is less than 50 days old

I need help deleting rows whose date in a specified column is within n days of the current date. My file contains the following. From the file below, I need to find the entries in column 4 that are less than 50 days old relative to the current date and delete those entire rows.
ABC, 2017-02-03, 123, 2012-09-08
BDC, 2017-01-01, 456, 2015-09-05
Test, 2017-01-05, 789, 2017-02-03
My desired output is as follows.
ABC, 2017-02-03, 123, 2012-09-08
BDC, 2017-01-01, 456, 2015-09-05
Note: I have an existing script and need to integrate this into it.
You can leverage the date command for this task, which will simplify the script:
$ awk -v t=$(date -d"-50 day" +%Y-%m-%d) '$4<t' input > output
which will have this content in the output file
ABC, 2017-02-03, 123, 2012-09-08
BDC, 2017-01-01, 456, 2015-09-05
Replace input/output with your file names. (The plain string comparison $4<t works because ISO YYYY-MM-DD dates sort lexicographically in chronological order; note that -d is a GNU date extension.)
You can use gawk logic like the one below:
gawk '
BEGIN {FS=OFS=","; date=strftime("%Y %m %d %H %M %S")}
{
    split($4, d, "-")
    epoch = mktime(d[1] " " d[2] " " d[3] " 00 00 00")
    if ( ((mktime(date) - epoch)/86400) > 50 ) print
}' file
The idea is to use the GNU Awk time functions strftime() and mktime() for date conversion. The former produces a timestamp in YYYY MM DD HH MM SS format, which mktime converts to epoch time.
Once the two times, i.e. the current timestamp (date) and the epoch from $4 in the file, are converted, the difference is divided by 86400 to get the difference in days, and only those lines whose difference is greater than 50 are printed.
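As a quick sanity check of the arithmetic, you can compute the age in days of a single date from the sample (the result depends on the current date, of course):
$ gawk 'BEGIN {
    split("2017-02-03", d, "-")
    epoch = mktime(d[1] " " d[2] " " d[3] " 00 00 00")
    now   = mktime(strftime("%Y %m %d %H %M %S"))
    print int((now - epoch) / 86400)   # age of 2017-02-03 in days
}'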

Remove linefeed from csv preserving rows

I have a CSV that was exported; some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace it with a space, but preserve the newline at the end of each record so the file can be loaded.
Most of the lines are fine; however, a good few look like this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    FS = "[,~]"
}
NF < 21 {                       # partial record: accumulate the pieces
    line = (line ? line OFS : line) $0
    fields = fields + NF
}
fields >= 21 {                  # accumulated pieces now form a whole record
    print line
    line = ""
    fields = 0
}
NF == 21 {                      # complete record: print as-is
    print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: each line should contain exactly 21 fields. Therefore, if a line contains fewer than 21 fields, store it and add up its field count. When we move on to the next line, it is joined to the stored line with a space and the field counts are totaled. If the total is greater than or equal to 21, print the stored line (the parts of a broken line sum to 22, because the field that was split in two is counted twice). Otherwise, if the line contains exactly 21 fields (NF == 21), print it as-is. HTH.
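To verify the 21-field assumption on your own data before relying on it, you could count the fields per line; broken rows show up as pairs of counts that sum to 22:
$ awk -F '[,~]' '{print NF}' file.txt | sort -n | uniq -c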
I think sed is your choice. I assume all complete records end with a double quote ("), so if a line does not end with a quote, it is recognized as an incomplete record and should be concatenated with the next line.
Here is the code:
sed -e '/[^"]$/N' -e 's/\n/ /g' data
The first expression, -e '/[^"]$/N', matches the abnormal case and appends the next line to the pattern space. The second, -e 's/\n/ /g', then replaces the embedded newline with a space, as required.
try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
The idea:
Count the number of " characters in a line. If the count is odd, the record continues, so join the following line; otherwise the current line is considered a complete record.
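The same one-liner in expanded form (printf "%s ", $0 is used here instead of printf $0" " so that lines containing % are printed safely):
awk '
t {                        # the previous line had an odd quote count:
    print                  # this line completes the record
    t = 0
    next
}
{
    x = $0
    n = gsub(/"/, "", x)   # count the double quotes without touching $0
    if (n % 2) {           # odd count: the record continues on the next line
        printf "%s ", $0   # print without a newline, joining with a space
        t = 1
    } else
        print
}' file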
