Adding a variable number of lines to a file in Bash

I have a file with lines of a format XXXXXX_N where N is some number. For example:
41010401_1
42023920_3
45788_1
I would like to add N-1 lines before every line where N>1, so that for each XXXXXX value there are lines with every N value up to and including the original N:
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
I thought about doing it with sed, but I'm not sure how to conditionally append a different number of lines with different values based on what sed reads.
Is sed even the correct command to deal with this problem?
Any help would be appreciated.

One way in awk is to set the field separator to underscore and, whenever the 2nd field is greater than 1, print all the missing records in a loop, like below.
$ awk 'BEGIN{FS=OFS="_"} $2>1{for(i=1;i<$2;i++) print $1,i} 1' file
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
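For readability, the same logic can be written as a stand-alone awk script (a sketch equivalent to the one-liner above; the file name fill.awk is only illustrative):

# fill.awk: emit the missing XXXXXX_1 .. XXXXXX_(N-1) lines before each XXXXXX_N line
BEGIN { FS = OFS = "_" }      # split and join fields on underscore
$2 > 1 {                      # only lines whose N is greater than 1 need filling
    for (i = 1; i < $2; i++)
        print $1, i           # print XXXXXX_i for i = 1 .. N-1
}
{ print }                     # always print the original line

Run it with awk -f fill.awk file.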

Related

If Partial Duplicate on Line, Remove Line

I have a file with 400+ lines, but some of the lines have partial duplicates. Below is a simplified version.
file.txt:
A_12_23 A_12_34 B_12_23 B_12_34
A_1_34 A_23_34 B_1_12 B_1_23
The fields are whitespace-separated where the letter before the first underscore is an identifier and the values after the first underscore are its values. A partial duplicate is one where one of the fields for A has the same values after the underscore as one of the B fields. The lines are sorted so that the A fields are always before the B fields. There are no other identifiers.
What I would like to do is remove any line with a partial duplicate.
output.txt:
A_1_34 A_23_34 B_1_12 B_1_23
How would I go about doing this? I know how to remove exact duplicates on a line by:
awk '$1!=$2' file.txt > output.txt # Can use various combinations if needed
I am not sure how to handle partial duplicates. For example, 12_23 is repeated twice on the first line, so I want that line deleted. Treating any repeated partial string as a duplicate is fine, since that also covers values that are repeated more than twice.
Please let me know how I can improve this question. Thanks in advance!
Slightly generalizing the answer by malarres, here is a regex which looks for any value after A which also occurs after B, followed by space or newline. The number of digit groups in each field is arbitrary, but this does assume that all A values are before all B values, and that these tokens only occur at the beginning of a field.
grep -Ev 'A_([^_ ]+(_[^ _]+)*) (.* )?B_\1( |$)'
Rather than awk you can use grep for that
$ grep -v -E '._(.._..).*\1' file.txt
-v to print lines NOT matching
'._(.._..).*\1' looks for repetitions of the pattern .._..
Exclude the first two characters of each field and check for duplicates; if there are none, print the line. You can modify the last argument of substr to exclude any number of initial characters.
awk '{delete a; for (i=1;i<=NF;i++) if (a[substr($i,3)]++) next} 1' file
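Applied to the sample file.txt above, this keeps only the line without repeated value parts:
$ awk '{delete a; for (i=1;i<=NF;i++) if (a[substr($i,3)]++) next} 1' file.txt
A_1_34 A_23_34 B_1_12 B_1_23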

Remove duplicated entries in a table based on first column (which consists of two values separated by a colon)

I need to sort and remove duplicated entries in my large table (space separated), based on values on the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk: split each line on either a space or : as the field separator and group the lines by the word after the colon.
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ builds a hash table keyed on the value of $2. The count is incremented every time a value is seen in $2, so on the next occurrence the negation ! makes the condition false and prevents the line from being printed again.
Defining the field separator as a regex with the -F flag might not be supported on all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The command above assumes you want to de-duplicate the file based on the word after the :, but to de-duplicate on the whole first column, just do
awk '!unique[$1]++' file
Since your input data is pretty simple, the command can be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. The data in the first 7 characters is numeric; if alphabetic characters can appear there, add -i to the command to ignore case.
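For completeness, the original sort approach can also work if uniqueness is restricted to the first key only; with -u, lines that compare equal on that key are collapsed to a single line (a sketch, assuming space-separated input as above):
$ sort -t' ' -u -k1,1 input > output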

Remove multiple sequences from fasta file

I have a text file of character sequences that consist of two lines: a header, and the sequence itself in the following line. The structure of the file is as follow:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In an other file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, so all these headers+the following line. I did it using sed like the following,
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long since I am loading the whole file several times with sed, and it is quite big. Any idea on how I could speed up this process?
The question you have is easy to answer but will not help you when you handle generic FASTA files. FASTA files have a sequence header followed by one or more lines which are concatenated to represent the sequence. The FASTA file format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with a greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself as a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-FASTA file with multi-line sequences.
The following will work always:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for (h in toRemove) if ($0 ~ h) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines long each record is, while the 2nd one relies on every record starting with >. If you know both, then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that file to the first
sed -f commands.sed firstFile.txt
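With the sample header list above, commands.sed contains one delete command per header, along the lines of:
/>header1/,+1d
/>header5/,+1d
/>header12/,+1d
/>header145/,+1d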
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as @daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both approaches to compare their speed,
i.e. time while read line; do ... and time sed .... In my test this runs in less than half the time of the OP's version.
This can easily be done with BBTools. The seqs2remove.txt file should contain one header per line, exactly as the headers appear in large.fasta.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

Remove every n lines to remove data blocks using sed or awk

I have a big file made up of 316125000 lines. This file is made up of 112500 data blocks, and each data block has 2810 lines.
I need to reduce the size of the file, so I want to keep the 1st, 10th, 20th, ..., and 112490th data blocks and remove all the other data blocks. This will give me 11250 data blocks as a result.
In other words, I want to remove lines 2811 to 28100, keep lines 1 to 2810 and 28101 to 30910, and so on.
I was thinking of awk, sed or grep, but which one is faster, and how can I achieve this? I know how to remove every 2nd or 3rd line with awk and NR, but I don't know how to remove a big chunk of lines repetitively.
Thanks
Best,
Something along these lines might work:
awk 'int((NR - 1) / 2810) % 10 == 0' <infile >outfile
That is, int((NR - 1) / 2810) gives the (zero-based) number of the 2810-line block the current line (NR) belongs to, and if the remainder of that block number divided by ten is 0 (% 10 == 0), the line is printed. This results in every 10th block being printed, including the first (block number 0).
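For example, line NR=2811 (the first line of the 2nd block) gives int(2810/2810) = 1, and 1 % 10 != 0, so it is dropped; line NR=28101 (the first line of the 11th block) gives int(28100/2810) = 10, and 10 % 10 == 0, so it is kept.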
I wouldn't guess which is fastest, but I can provide a GNU sed recipe for your benchmarking:
sed -e '2811~28100,+25289d' <input >output
This says: starting at line 2811 and every 28100 lines thereafter, delete it and the next 25289 lines.
Equivalently, we can use sed -n and print lines 1-2810 every 28100 lines:
sed -ne '1~28100,+2809p' <input >output
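Either way, a quick sanity check on the result size confirms the expectation: 11250 blocks * 2810 lines = 31,612,500 lines (assuming the output file is named output):
$ wc -l output
31612500 output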

How to insert a new line character after a fixed number of characters in a file

I am looking for a bash or sed script (preferably a one-liner) with which I can insert a new line character after a fixed number of characters in huge text file.
How about something like this? Change 20 to the number of characters after which you want the newline; temp.txt is the file to operate on.
sed -e "s/.\{20\}/&\n/g" < temp.txt
Let N be a shell variable representing the count of characters after which you want a newline. If you want to continue the count accross lines:
perl -0xff -pe 's/(.{'$N'})/$1\n/sg' input
If you want to restart the count for each line, omit the -0xff argument.
Because I can't comment directly (too little reputation), a hint on the answers above:
I prefer the sed command (it does exactly what I want) and also tested the POSIX command fold. But there is a small difference between the two commands for the original problem:
If you have a flat file with n fixed-size records (and no linefeed characters) and use the sed command (with the record size as the number, 20 in the answer by @Kristian), you get n lines when counting with wc. If you use the fold command you only get n-1 lines with wc!
This difference is sometimes important to know: if your input file doesn't contain any newline character, you get one after the last line with sed, but none with fold.
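A rough way to see that difference (a sketch, assuming GNU sed and a 40-character input with no trailing newline):
$ printf '%040d' 0 | sed 's/.\{20\}/&\n/g' | wc -l
2
$ printf '%040d' 0 | fold -w 20 | wc -l
1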
If you want to insert the newline after a number of characters counted across the whole file, e.g. after every 30th character (existing line breaks are kept):
gawk 'BEGIN{ FS=""; ch=30 }      # FS="" makes every character its own field (gawk)
{
  for (i=1; i<=NF; i++) {
    c += 1
    printf "%s", $i              # print the character itself
    if (c == ch) {               # after every ch-th character ...
      print ""                   # ... insert a newline
      c = 0
    }
  }
  print ""                       # keep the original line break
}' file
If you mean inserting it at a specific position in each line, e.g. after the 5th character of each line:
gawk 'BEGIN{ FS=""; ch=5 }
{
  print substr($0,1,ch) "\n" substr($0,ch+1)   # split each line after the ch-th character
}' file
Append an empty line after a line with exactly 42 characters
sed -i -e '/^.\{42\}$/a\
' huge_text_file
This might work for you:
echo aaaaaaaaaaaaaaaaaaaax | sed 's/./&\n/20'
aaaaaaaaaaaaaaaaaaaa
x
