Replace multiple newlines with just 2 newlines using unix utilities - bash

I have tried to look for the correct way to implement this, reading from stdin and printing to stdout. I know that I can use squeeze (-s) to collapse a run of repeated blank lines down to a single one, but I want to leave two newlines in the place of many, not just one. I have looked into using uniq as well, but am unsure how to apply it here. I know that fold can also be used, but I cannot find any information on the fold version I want, fold(1p).
So, if I have the text as input:
A B C D




B C D E
I would want the output to instead be
A B C D

B C D E

You can use awk like this:
awk 'BEGIN{RS="";ORS="\n\n"}1' file
RS is the input record separator, ORS is the output record separator.
From the awk manual:
If RS is null, then records are separated by sequences consisting of a newline plus one or more blank lines
That means that the above command splits the input into records at every run of blank lines (i.e. two or more consecutive newlines) and joins the records back together with exactly two newlines, leaving a single blank line between them.
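A quick demonstration (added here), reading from stdin and writing to stdout:
$ printf 'A B C D\n\n\n\n\nB C D E\n' | awk 'BEGIN{RS="";ORS="\n\n"}1'
A B C D

B C D E

(A trailing blank line is printed as well, since the final record is also followed by ORS.)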

The following awk commands may also help with the same:
awk -v lines=$(wc -l < Input_file) 'FNR!=lines && NF{print $0 ORS ORS;next} NF' Input_file
OR
awk -v lines=$(wc -l < Input_file) 'FNR!=lines && NF{$0=$0 ORS ORS} NF' Input_file

Related

Bash: how to put each line of a column below the same-row-line of another column?

I'm working with some data using bash and I need this kind of input file:
Col1 Col2
A B
C D
E F
G H
to turn into this output file:
Col1
A
B
C
D
E
F
G
H
I tried some commands but they didn't work. Any suggestions would be very much appreciated!
As with many problems, there are many solutions. Here is one using awk:
awk 'NR > 1 {print $1; print $2}' inputfile.txt
The NR > 1 expression says to execute the following block for all line numbers greater than one. (NR is the current record number which is the same as line number by default.)
The {print $1; print $2} code block says to print the first field, then print the second field. The advantage of using awk in this case is that it doesn't matter if the fields are separated by space characters, tabs, or a combination; the fields just have to be separated by some number of whitespace characters.
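For illustration, running it on the sample input:
$ printf 'Col1 Col2\nA B\nC D\nE F\nG H\n' | awk 'NR > 1 {print $1; print $2}'
A
B
C
D
E
F
G
H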
If the field values on each line are only separated by a single space character, then this would work:
tail -n +2 inputfile.txt | tr ' ' '\n'
In this solution, tail -n +2 is used to print all lines starting with the second line and tr ' ' '\n' is used to replace all the space characters with newlines, as suggested previously.
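Again for illustration, on a shortened sample:
$ printf 'Col1 Col2\nA B\nC D\n' | tail -n +2 | tr ' ' '\n'
A
B
C
D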

Keeping the last two fields in an input line in linux

I have the following problem:
I need to process lines structured as follows:
<e_1> <e_2> ... <e_n-1> <e_n>
where each <e_i> (except for <e_n>) is separated from the next by a single space character. The actual number of <e_i> elements in each line is always at least two, but otherwise unpredictable: one line might consist of five such elements, while the next might have twelve.
For each such line I must remove all the elements, except for the last two - e.g. if the input line is
a b c d e
after processing I should end up with the line
d e
What tool accessible from a bash script would allow me to pull this off?
Just use awk to filter the last two columns:
awk '{print $(NF-1), $NF}'
eg:
$ printf 'a b c d e\nf g\na b c\n' | awk '{print $(NF-1), $NF}'
d e
f g
b c
Actually, immediately after posting this I noticed that a combination of rev and cut will do the trick.
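Something along these lines should do it (my sketch, not the poster's exact command), assuming the fields are separated by single spaces:
$ printf 'a b c d e\n' | rev | cut -d' ' -f1,2 | rev
d e
rev reverses each line character by character, so the last two fields come first, cut keeps them, and the second rev restores their original order.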
A sed one-liner:
sed 's/.* \(.* .*\)$/\1/'
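For comparison with the awk output above (demonstration added here):
$ printf 'a b c d e\nf g\na b c\n' | sed 's/.* \(.* .*\)$/\1/'
d e
f g
b c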

AWK print if found three matches, one false

There are several lines in a file that look like:
A B C H
A B C D
and I want to print all lines that match this RE:
/A\tB/
But if the line contains an H in the fourth field, it should not be printed. The output would be:
A B C D
Could this be written as a one-liner in sed, awk or grep?
The only thing that I know is:
awk '/^A\tB/'
This will work:
awk '$1$2 == "AB" && $4 != "H"' file
If all entries are single characters this will also work:
awk '$1$2$3$4 ~ /^AB.[^H]/' file
With an awk one-liner:
awk -F'\t' '$1=="A" && $2=="B" && $4!="H"' file
-F'\t' - the tab character \t is treated as the field separator
The output:
A B C D
This might work for you (GNU sed):
sed '/^A\tB\t.\t[^H]/!d' file
If a line does not contain A, B, any character, and a character other than H, all separated by tabs, delete it.
Could be written:
sed -n '/^A\tB\t.\t[^H]/p' file
Use this.
awk '/^A\tB/ { if ( $4 != "H" ) print }'
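A quick check on tab-separated sample lines (added for illustration):
$ printf 'A\tB\tC\tH\nA\tB\tC\tD\n' | awk -F'\t' '$1=="A" && $2=="B" && $4!="H"'
A B C D
(the output line keeps its tab separators)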

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
Or here's a solution using awk:
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
awk uses -F (or FS) to set the input field separator, and you must use OFS to set the output field separator.
sub(/srchtarget/, "replacementstring", stringToFix) uses a regular expression to look for four 0s at the front (^) of the third field ($3) and replaces them with an empty string.
The 1 is a shorthand for the print statement. A longhand version of the script would be
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
It's all related to awk's /pattern/{action} idiom.
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data | sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/'
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin:
    print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awk's substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just takes 6 characters starting at position 5 of field 3 and sets that back as field 3.
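If the third field is not always exactly 10 characters wide, a small variation (my assumption, not part of the original answer) keeps just the last 6 characters of field 3 whatever its length:
awk -F, -v OFS=, '{$3=substr($3,length($3)-5)}1' file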

Split a big txt file to do grep - unix

I work (unix, shell scripts) with txt files that contain millions of fields separated by pipes and are not broken up by \n or \r.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All the text is on a single line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), a command like grep doesn't work well (everything is on the same line).
I think a good solution could be a script that splits the line every 6 fields with a "\n" and then runs grep. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf $(i+j)"|"; printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily adjust how many fields end up on each line.
Hope this helps !
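One small caveat (added here): the inner printf uses the field itself as the format string, which can misbehave if a field ever contains a % character. A safer, assumed-equivalent form is:
awk -F"|" '{for (i=1;i<NF;i+=6) {for (j=0; j<6; j++) printf "%s|", $(i+j); print ""}}' a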
You can use sed to split the line into multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated multiple times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
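With extended regular expressions (GNU sed's -E) the same substitution needs far less escaping; a sketch, assuming GNU sed (the \n in the replacement is a GNU extension):
sed -E 's/(([^|]*[|]){6})/\1\n/g' input.txt > output.txt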
You can use sed to insert a newline after every 6th |.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
consider this (using \{2\} here so the short sample line splits):
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| characters.
[^|]\+| is a sequence of at least one non-| characters followed by a |.
\([^|]\+|\) is a sequence of at least one non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf $0 (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust the NR%7 to however many fields per record suits your data.
What about printing the fields in blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} set input and output field separator as |.
{for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6, printing six of them on each pass. Since print ends its output with a newline, you are done.
If you want to treat the file as having multiple lines, turn the | field separator into a newline. For example, to get the 2nd column, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex
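To round things off with the approach the question itself proposes (split every 6 fields, then grep), one possible combination, assuming exactly 6 fields per record; the search string here is just taken from the sample data:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt | grep 'field2d'
This prints the whole 6-field record containing the match rather than a single field.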
