Merging specific lines in a CSV file if specific trend is found out - shell

I have been stuck with creating the script for below mentioned scenario:
I have a file a.csv with content as
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt
wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet
What happens is, due to some application, the CSV content of one line gets printed on the second line.
My task is to find such occurrence and merge the such lines in a new file.
so new files will contains only required CSV records I tried few things using sed, but couldn't succeed.

$ awk -F, '!length $4 && length $3 {printf "%s,", $0;next}1' file
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt,wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet

All previous answers seem great, but I wanted to add a sed answer as well because sed is awesome! (And sed was added as a tag so we were missing a sed answer.)
This answer should work on multiple lines, provided that the cut always happen on a separator and that that separator is omitted (see input example for these assumptions).
sed ':l;/\([^,]*,\)\{3\}[^,]*/!{;N;s/\n/,/g;bl;}' <file_in >file_out
What it does is :
defines a label (:l)
tests if there are four fields (/\([^,]*,\)\{3\}[^,]*/)
If there isn't (!), execute the block ({;N;s/\n/,/g;bl;})
The block :
reads the next line into the buffer (N)
replaces the newline with the separator (s/\n/,/g)
loops by branching to our :llabel (bl)
Proof :
$ sed ':l;/\([^,]*,\)\{3\}[^,]*/!{;N;s/\n/,/g;bl;}' <<EOF
> 123,fsfs,4124124,412412
> 1314,fasfwe,42145,rwr
> 1234,fwtrwqt,twt
> wqrfsdgaseg
> 12424,23532,fafwe,gewgt
> 14214,wet,wertwtw,wet
> EOF
123,fsfs,4124124,412412
1314,fasfwe,42145,rwr
1234,fwtrwqt,twt,wqrfsdgaseg
12424,23532,fafwe,gewgt
14214,wet,wertwtw,wet

Related

How to add an empty line at the end of these commands?

I am in a situation where I have so many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^#/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output files is correctly a fasta file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line I found that a file header was put at the end of the previous file sequence. And it turns out like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
So that when I merge they are placed on top of each other correctly and not at the end of the sequence of the previous file.
So how can I add a blank line to the end of the file within the
scripts that convert fastq to fasta?
I would use GNU sed following replace
cat $1 >> final.fasta
using
sed '$a\\n' $1 >> final.fasta
Explanation: meaning of expression for sed is at last line ($) append newline (\n) - this action is undertaken before default one of printing. If you prefer GNU AWK then you might same behavior following way
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test any of solution as you doesnot provide enough information to this. I assume above line is somewhere inside loop and $1 is always name of file existing in current working directory.
if the only thing you need is extra blank line, and the input files are within 1.5 GB in size, then just directly do :
awk NF=NF RS='^$' FS='\n' OFS='\n'
Should work for mawk 1/2, gawk, and nawk, maybe others as well. This works despite appearing not to do anything special is that the extra \n comes from ORS.

Remove multiple sequences from fasta file

I have a text file of character sequences that consist of two lines: a header, and the sequence itself in the following line. The structure of the file is as follow:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In an other file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, so all these headers+the following line. I did it using sed like the following,
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long since I am loading the whole file several times with sed, and it is quite big. Any idea on how I could speed up this process?
The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with <greater-then> character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-fasta with multi-line sequences
The following will work always:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines each record is long while the 2nd one relies on every record starting with >. If you know both then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that file to the first
sed -f commands.sed firstFile.txt
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed $sedcmd first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as #daniu suggests) might be better if you have thousands of files, as you risk hitting the command-line maximum count with this method.
try gnu sed,
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
prepend time command to both scripts to compare the speed,
look time while read line;do... and time sed -.... result in my test this is done in less than half time of OP's
This can easily be done with bbtools. The seqs2remove.txt file should be one header per line exactly as they appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

How to detect some pattern with grep -f on a file in terminal, and extract those lines without the pattern

I'm on mac terminal.
I have a txt file with one column with 9 IDs, allofthem.txt, where every ID starts with ¨rs¨:
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with those IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with less IDS, only 5.
rs9349407
rs10948363
rs9271192
rs11
Problem:I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want to use some systematic way of deleting all the IDs in useful and obtain a file with the remaining ones. Maybe with awk or sed, but I can´t see it. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
-v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
-x option matches the whole line in allofthem.txt, not parts.
As #hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
I recommend to use awk:
awk 'FNR==NR{patterns[$0];next} $0 in patterns' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an index in patterns for every line of useful.txt. next stops further processing.
$0 in patterns runs, because of the previous next statement, on every line of allofthem.txt. It checks for every line of that file if it is a key in patterns. If that checks evaluates to true awk will print that line.

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files such that certain information is written into every row for a all of a file's entries. For instance this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (�C)",20.04
"Low temperature limit (�C)",-0.02
"Date - Time","Temperature (�C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
Is converted into this format
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(�C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(�C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(�C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use a sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. Any lines that say "serial no." and "location" are important and require information. The other lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove anything
with less than 3 columns in your final file:
awk 'NF>=3' file

sed/awk - print text between patterns spanned across multiple lines

I am new to scripting and was trying to learn how to extract any text that exists between two different patterns. However, I am still not able to figure out how to extract text between two patterns in the following scenario:
If I have my input file reading:
Hi I would like
to print text
between these
patterns
and my expected output is like:
I would like
to print text
between these
i.e. my first search pattern is "Hi' and skip this pattern, but print everything that exists in the same line following that matched pattern. My second search pattern is "patterns" and I would like to completely avoid printing this line or any lines beyond that.
I tried the following:
sed -n '/Hi/,/patterns/p' test.txt
[output]
Hi I would like
to print text
between these
patterns
Next, I tried:
`awk ' /'"Hi"'/ {flag=1;next} /'"pattern"'/{flag=0} flag { print }'` test.txt
[output]
to print text
between these
Can someone help me out in identifying how to achieve this?
Thanks in advance
You have the right idea, a mini-state-machine in awk but you need some slight mods as per the following transcript:
pax> echo 'Hi I would like
to print text
between these
patterns ' | awk '
/patterns/ { echo = 0 }
/Hi / { gsub("^.*Hi ", "", $0); echo = 1 }
{ if (echo == 1) { print } }'
Or, in compressed form:
awk '/patterns/{e=0}/Hi /{gsub("^.*Hi ","",$0);e=1}{if(e==1){print}}'
The output of that is:
I would like
to print text
between these
as requested.
The way this works is as follows. The echo variable is initially 0 meaning that no echoing will take place.
Each line is checked in turn. If it contains patterns, echoing is disabled.
If it contains Hi followed by a space, echoing is turned on and gsub is used to modify the line to get rid of everything up to the Hi.
Then, regardless, the line (possibly modified) is echoed when the echo flag is on.
Now, there's going to be edge cases such as:
lines containing two occurrences of Hi; or
lines containing something before the patterns.
You haven't specified how they should be handled so I didn't bother, but the basic concept should be the same.
Updated the solution to remove the line "patterns" :
$ sed -n '/^Hi/,/patterns/{s/^Hi //;/^patterns/d;p;}' file
I would like
to print text
between these
This might work for you (GNU sed):
sed '/Hi /!d;s//\n/;s/.*\n//;ta;:a;s/patterns.*$//;tb;$!{n;ba};:b;/^$/d' file
Just set a flag (f) when you find+replace Hi at the start of a line, clear it when you find patterns, then invoke the default print when the flag is set:
$ awk 'sub(/^Hi /,""){f=1} /patterns/{f=0} f' file
I would like
to print text
between these

Resources