Repeating characters when attempting to convert man pages to plain text files - bash

I tried converting some man pages to plain text files. But when I open the file, many of the words have unnecessary repeating characters.
For example, doing man awk > awk.txt changes the section headings in awk.txt from:
NAME to NNAAMMEE
SYNOPSIS to SSYYNNOOPPSSIISS
DESCRIPTION to DDEESSCCRRIIPPTTIIOONN
I thought this would be a simple task. Why does this happen?

Man pages contain formatting information (for instance, to indicate that some words should be bold). Consequently, some characters may appear repeated when redirecting the output to a file.
You may want to try:
man awk | col -b > awk.txt
What col is doing:
col — filter reverse line feeds from input
SYNOPSIS
col [-bfhpx] [-l num]
DESCRIPTION
The col utility filters out reverse (and half-reverse) line feeds so that the output is in the correct order, with only forward and half-forward line feeds, and replaces white-space characters with tabs where possible. This can be useful in processing the output of nroff(1) and tbl(1).
The col utility reads from the standard input and writes to the standard output.
The options are as follows:
-b Do not output any backspaces, printing only the last character written to each column position.
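
The doubled letters come from the way nroff marks emphasis for old line printers: a bold character is emitted as the character, a backspace, and the character again (underlining is an underscore, a backspace, then the character). col -b strips the backspaces and keeps only the last character written to each column. You can inspect the raw bytes yourself (a quick check; the exact output depends on your man implementation and pager):
$ man awk | od -c | head
Look for sequences like N \b N A \b A in the dump; those are the overstrikes that show up as NNAAMMEE after a plain redirect.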

Related

Bash script which adds space inside long words in Pages file

I like to convert documents to EPUB format because it is easier for me to read. However, if I do this for, for example, some code documentation, some really long lines of code are not readable in the EPUB, because they trail off-screen. I would like to automatically insert spaces into any words in a text file (specifically, a Pages document) over a certain length, so they are reduced to, say, 10-character words at maximum. Then, I will convert that Pages document to an EPUB.
How can I write a bash script which goes through a Pages document and inserts spaces into any word longer than, perhaps, 10 characters?
sed is your friend:
$ cat input.txt
a file with a
verylongwordinit to test with.
$ sed 's/[^[:space:]]\{10\}/& /g' input.txt
a file with a
verylongwo rdinit to test with.
For every sequence of 10 non-whitespace characters in each line, this adds a space after it (the & in the replacement text is itself replaced with the matched text).
If you want to change the file inline instead of making a copy, ed comes into play:
ed input.txt <<'EOF'
s/[^[:space:]]\{10\}/& /g
w
EOF
(Or some versions of sed take an -i switch for in-place editing.)
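For example (a minimal sketch; note that GNU sed and BSD/macOS sed disagree on how -i takes its argument):
sed -i 's/[^[:space:]]\{10\}/& /g' input.txt       # GNU sed: edit in place
sed -i '' 's/[^[:space:]]\{10\}/& /g' input.txt    # BSD/macOS sed: -i needs a (possibly empty) backup suffix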

Shell scripting cut -d " " -f4 file.txt command

I have a file with words separated by only a single space.
I want to read the 4th word from each line of the file using the command:
cut -d " " -f4 file.txt
It works fine, but I don't understand its behavior.
If a line contains 4 or more words, then it prints the 4th word.
If a line contains only 1 word, then it prints that word.
If a line contains 2 or 3 words, then it prints nothing.
I want to know how this works.
From man cut:
-f, --fields=LIST
select only these fields; also print any line that contains no delimiter character, unless the -s option is specified
If a line contains 1 word, then it does not contain the delimiter and therefore cut prints the whole line (which is exactly that one word).
Other cases are obvious: the line contains at least one delimiter, therefore it prints the fourth word, if available.
If you add the -s parameter, it will print the fourth field only for lines that contain the delimiter (and thus ignore one-word lines that have no delimiter).
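A quick demonstration of all three cases, using a made-up sample file:
$ printf 'one\ntwo words\na b c d\n' > file.txt
$ cut -d " " -f4 file.txt
one

d
$ cut -s -d " " -f4 file.txt

d
The one-word line is echoed unchanged, the two-word line yields an empty 4th field, and only the four-word line actually has a 4th field; with -s the one-word line is dropped entirely.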
By default, cut expects each input line to contain the delimiter (a space in the OP's example). Lines that do not contain the delimiter are printed as-is.
This default behavior can be changed with -s, which suppresses lines that do not contain the delimiter (the one-word case above), so the 4th field is printed only when the delimiter is present. Use
cut -s -d " " -f4 file.txt
As to why this is the default behavior, there is no clear answer. Probably it was meant to let some lines pass through the filtering untouched. Early Unix systems had a lot of semi-structured files, where this functionality could have been used to process man pages, nroff sources and the like.
From the man page:
-f list
Cut based on a list of fields, assumed to be separated in the file by
a delimiter character (see -d). Each selected field shall be output.
Output fields shall be separated by a single occurrence of the field
delimiter character. Lines with no field delimiters shall be passed
through intact, unless -s is specified. It shall not be an error to
select fields not present in the input line.
-s, --only-delimited
do not print lines not containing delimiters
See also: https://unix.stackexchange.com/questions/157677/does-cut-return-any-fields-if-separator-does-not-exist

Is there a way to create a bash script that prints out specific paragraphs, i.e. prints a specific block of text between empty lines?

I'm working on a script that reads a text file and redirects a paragraph to another file based on an input. Let's say the input is 2; it would redirect the second paragraph in the text file to another file. The text files wouldn't have headers; they would be plain-text paragraphs separated by empty lines. I've been looking at egrep, but I'm not very familiar with regex, so I'm not sure where to start. Any help would be appreciated.
With GNU awk, this function
print_nth_paragraph() {
awk -v RS= -v p="$1" 'NR == p'
}
will print the Nth paragraph of its standard input, N being the first and only parameter. A paragraph is delimited by two or more consecutive newlines. Adapt it to your needs.
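Usage, assuming the function has been defined in the current shell (file names here are just placeholders):
print_nth_paragraph 2 < input.txt > second_paragraph.txt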

Remove multiple sequences from fasta file

I have a text file of character sequences in which each entry consists of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e. each of these headers plus the following line. I did it using sed like this:
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works, but it takes quite a long time, since sed reads the whole file once per header and the file is quite big. Any idea how I could speed up this process?
The question you have is easy to answer, but it will not help you when you handle generic FASTA files. FASTA files have a sequence header followed by one or more lines which are concatenated to represent the sequence. The FASTA file format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).
The sequence can span multiple lines.
A multiple-sequence FASTA file is obtained by concatenating several single-sequence FASTA files into a common file, generally leaving an empty line between two subsequent sequences.
Most of the presented methods will fail on a multi-FASTA file with multi-line sequences.
The following will always work:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for (h in toRemove) if ($0 ~ h) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
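For example, if each record were three lines long (a header plus two sequence lines), you would set c=3 instead:
awk 'NR==FNR{a[$0];next} $0 in a{c=3} !(c&&c--)' list file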
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines long each record is, while the 2nd one relies on every record starting with >. If you know both, then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that script to the first file:
sed -f commands.sed firstFile.txt
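With bash process substitution, the two steps can be combined into one (the ,+1 address form used here requires GNU sed):
sed -f <(sed 's#\(.*\)#/\1/,+1d#' secondFile.txt) firstFile.txt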
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt
This will read the file only once. Note that I added ^ and $ to the sed pattern (so >header1 doesn't match >header123...).
Using a file (as #daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both approaches to compare their speed, i.e. time while read line; do ... and time sed .... In my test this finishes in less than half the time of the OP's version.
This can easily be done with bbtools. The seqs2remove.txt file should contain one header per line, exactly as the headers appear in large.fasta.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

use sed to merge lines and add comma

I found several related questions, but none of them fits what I need, and since I am a real beginner, I can't figure it out.
I have a text file with entries like this, separated by a blank line:
example entry &with/ special characters
next line (any characters)
next %*entry
more words
I would like the output to merge the lines, put a comma between them, and delete empty lines. I.e., the example should look like this:
example entry &with/ special characters, next line (any characters)
next %*entry, more words
I would prefer sed, because I know it a little, but I am also happy with any other solution on the Linux command line.
Improved per Kent's elegant suggestion:
awk 'BEGIN{RS="";FS="\n";OFS=","}{$1=$1}7' file
which allows any number of lines per block, rather than the rigid 2 lines per block I had. Thank you, Kent. Note: the 7 is Kent's trademark... any non-zero expression will cause awk to print the entire record, and he likes 7.
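Running it on the sample input from the question (note that OFS="," joins fields with a bare comma; use OFS=", " if you want the comma-plus-space shown in the question):
$ awk 'BEGIN{RS="";FS="\n";OFS=", "}{$1=$1}7' file
example entry &with/ special characters, next line (any characters)
next %*entry, more words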
You can do this with awk:
awk 'BEGIN{RS="";FS="\n";OFS=","}{print $1,$2}' file
That sets the record separator to blank lines, the field separator to newlines and the output field separator to a comma.
Output:
example entry &with/ special characters,next line (any characters)
next %*entry,more words
A simple sed command:
sed ':a;N;$!ba;s/\n/, /g;s/, , /\n/g' file
:a;N;$!ba;s/\n/, /g -> According to this answer, this code replaces all the newlines with , (comma and space).
So after running only the first command, the output would be:
example entry &with/ special characters, next line (any characters), , next %*entry, more words
s/, , /\n/g -> Replacing , , with a newline in the above output gives you the desired result.
example entry &with/ special characters, next line (any characters)
next %*entry, more words
This might work for you (GNU sed):
sed ':a;$!N;/.\n./s/\n/, /;ta;/^[^\n]/P;D' file
Append the next line to the current line, and if there are characters on either side of the newline, substitute the newline with a comma and a space, then repeat. Eventually an empty line or the end of file is reached; then the line is printed only if it is not empty.
Another, slightly more sophisticated version (allowing for white space in the empty line) would be:
sed ':a;$!N;/^\s*$/M!s/\n/, /;ta;/\`\s*$/M!P;D' file
sed -n '1h;1!H
$ {x
s/\([^[:cntrl:]]\)\n\([^[:cntrl:]]\)/\1, \2/g
s/\(\n\)\n\{1,\}/\1/g
p
}' YourFile
This changes everything after loading the whole file into the buffer. It could also be done "on the fly" while reading the file, deciding based on whether each line is empty or not.
Use -e on GNU sed.
