Keeping the last two fields in an input line in Linux - bash

I have the following problem:
I need to process lines structured as follows:
<e_1> <e_2> ... <e_n-1> <e_n>
where each <e_i> (except for <e_n>) is separated from the next by a single space character. The actual number of <e_i> elements in each line is always at least two, but otherwise unpredictable: one line might consist of five such elements, while the next might have twelve.
For each such line I must remove all the elements except the last two, e.g. if the input line is
a b c d e
after processing I should end up with the line
d e
What tool accessible from a bash script would allow me to pull this off?

Just use awk to filter the last two columns:
awk '{print $(NF-1), $NF}'
eg:
$ printf 'a b c d e\nf g\na b c\n' | awk '{print $(NF-1), $NF}'
d e
f g
b c

Actually, immediately after posting this I noticed that a combination of rev and cut will do the trick.
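For reference, that pipeline could look like this: rev reverses each line character-wise, so the last two fields become the first two for cut, and a second rev restores them:
$ printf 'a b c d e\n' | rev | cut -d' ' -f1,2 | rev
d e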

A sed one-liner:
sed 's/.* \(.* .*\)$/\1/'
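The greedy leading .* consumes everything up to the space before the final two fields, so only the captured group survives. On the example line:
$ printf 'a b c d e\n' | sed 's/.* \(.* .*\)$/\1/'
d e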

Related

Bash: how to put each line of a column below the same-row-line of another column?

I'm working with some data using bash and I need this kind of input file:
Col1 Col2
A B
C D
E F
G H
to turn into this output file:
Col1
A
B
C
D
E
F
G
H
I tried some commands but they didn't work. Any suggestions would be much appreciated!
As with many problems, there are many solutions. Here is one using awk:
awk 'NR > 1 {print $1; print $2}' inputfile.txt
The NR > 1 expression says to execute the following block for all line numbers greater than one. (NR is the current record number which is the same as line number by default.)
The {print $1; print $2} code block says to print the first field, then print the second field. The advantage of using awk in this case is that it doesn't matter if the fields are separated by space characters, tabs, or a combination; the fields just have to be separated by some number of whitespace characters.
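Run against the sample input above, it should produce:
$ awk 'NR > 1 {print $1; print $2}' inputfile.txt
A
B
C
D
E
F
G
H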
If the field values on each line are only separated by a single space character, then this would work:
tail -n +2 inputfile.txt | tr ' ' '\n'
In this solution, tail -n +2 is used to print all lines starting with the second line, and tr ' ' '\n' is used to replace all the space characters with newlines, as suggested previously.
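For example, on a shortened version of the sample input:
$ printf 'Col1 Col2\nA B\nC D\n' | tail -n +2 | tr ' ' '\n'
A
B
C
D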

Bash: Separating a file by blank lines and assigning to a list

So I have a file, for example:
a

b
c

d
I'd like to make a list of the lines with data out of this. The empty line would be the separator. So the above file's list would be
First element = a
Second element = b
c
Third element = d
Replace blank lines with a comma, then remove the newline characters:
cat <file> | sed 's/^$/, /' | tr -d '\n'
The following awk would do:
awk 'BEGIN{RS="";ORS=",";FS="\n";OFS=""}($1=$1)' file
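On the sample file this should print everything on one line, with b and c joined together (OFS is empty) and a trailing comma:
$ awk 'BEGIN{RS="";ORS=",";FS="\n";OFS=""}($1=$1)' file
a,bc,d,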
This adds an extra , at the end. You can get rid of that in the following way:
awk 'BEGIN{RS="";ORS=",";FS="\n";OFS=""}
{$1=$1;s=s $0 ORS}END{sub(ORS"$","",s); print s}' file
But note what happened: by making this slight modification to eliminate the last ORS (i.e. the trailing comma), you now have to store the full output in memory. So you could just as well do it in a more boring and less elegant way by storing the full file in memory:
awk '{s=s $0}END{gsub(/\n\n/,",",s);gsub(/\n/,"",s); print s}' file
The following sed does exactly the same. Store the full file in memory and process it.
sed ':a;N;$!ba;s/\n\n/,/g;s/\n//g' <file>
There is, however, a way to play it a bit more cleverly with awk.
awk 'BEGIN{RS=OFS="";FS="\n"}{$1=$1; print (NR>1?",":"")$0}' file
It depends on what you need to do with that data.
With perl, you have a one-liner:
$ perl -00 -lnE 'say "element $. = $_"' file.txt
element 1 = a
element 2 = b
c
element 3 = d
But clearly you need to process the elements in some way, and I suspect Perl is not your cup of tea.
With bash you could do:
elements=()
n=0
while IFS= read -r line; do
[[ $line ]] && elements[n]+="$line"$'\n' || ((n++))
done < file.txt
# strip the trailing newline from each element
elements=("${elements[#]/%$'\n'/}")
# and show what's in the array
declare -p elements
declare -a elements='([0]="a" [1]="b
c" [2]="d")'
$ awk -v RS= '{print "Element " NR " = " $0}' file
Element 1 = a
Element 2 = b
c
Element 3 = d
If you really want to say First Element instead of Element 1 then enjoy the exercise :-).

Replace multiple newlines with just 2 newlines using unix utilities

I have tried to look for the correct way to implement this, reading from stdin and printing to stdout. I know that I can use squeeze (-s) to delete multiple lines of the same type, but I want to leave two newlines in the place of many, not just one. I have looked into using uniq as well, but am unsure of how to. I know that fold can also be used, but I cannot find any information on the fold version I want, fold (1p).
So, if I have the text as input:
A B C D



B C D E
I would want the output to instead be
A B C D

B C D E
You can use awk like this:
awk 'BEGIN{RS="";ORS="\n\n"}1' file
RS is the input record separator, ORS is the output record separator.
From the awk manual:
If RS is null, then records are separated by sequences consisting of a newline plus one or more blank lines
That means that the above command splits the input text by two or more blank lines and concatenates them again with exactly two newlines.
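For example, on the sample input with extra blank lines:
$ printf 'A B C D\n\n\n\nB C D E\n' | awk 'BEGIN{RS="";ORS="\n\n"}1'
A B C D

B C D E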
The following awk may help you with the same.
awk -v lines=$(wc -l < Input_file) 'FNR!=lines && NF{print $0 ORS ORS;next} NF' Input_file
OR
awk -v lines=$(wc -l < Input_file) 'FNR!=lines && NF{$0=$0 ORS ORS} NF' Input_file

analyze two fields at a time with awk

I have a file with approximately 1 000 000 fields (tab delimited), but I need to incrementally look at fields in pairs to see if they are identical or different.
Here is 1 line of the file (abbreviated to 6 fields):
C G G G T A
I essentially need to print 1 if the pairs are identical and 2 if the pairs are different, so the output should be:
2 1 2
Is this possible with an awk for loop? Using awk '{ if ($1==$2) print "1"; else print "2" }' is simply not viable for the number of fields I have.
Thank you!
You can try:
echo "C G G G T A" |
awk '{
for(i=1; i<=NF; i+=2){
printf (i<NF-1?"%s ":"%s\n"), ($i==$(i+1)?1:2)
}
}'
You get:
2 1 2
I would do it with sed instead, probably much quicker (no splitting):
sed -r 's/(^\S|\s\S)\s/\1/g; s/(\S)\1/1/g; s/\S\S/2/g'
The first s/ groups pairs by removing the space between them.
The second s/ finds the matches.
The third s/ converts the leftovers (mismatches).
Or the equivalent, if your sed does not have -r:
sed 's/^\(\S\)\s/\1/; s/\(\s\S\)\s/\1/g; s/\(\S\)\1/1/g; s/\S\S/2/g'
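Both variants should behave the same; on the sample line:
$ echo 'C G G G T A' | sed -r 's/(^\S|\s\S)\s/\1/g; s/(\S)\1/1/g; s/\S\S/2/g'
2 1 2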

How to grep the last occurrence of a line pattern

I have a file with contents
x
a
x
b
x
c
I want to grep the last occurrence,
x
c
when I try
sed -n "/x/,/b/p" file
it lists all the lines, from the first x to c.
I'm not sure if I got your question right, so here are some shots in the dark:
Print last occurrence of x (regex):
grep x file | tail -1
Alternatively:
tac file | grep -m1 x
Print file from first matching line to end:
awk '/x/{flag = 1}; flag' file
Print file from last matching line to end (prints all lines in case of no match):
tac file | awk '!flag; /x/{flag = 1};' | tac
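With the question's sample file this prints the block from the last match onwards:
$ tac file | awk '!flag; /x/{flag = 1};' | tac
x
c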
grep -A 1 x file | tail -n 2
-A 1 tells grep to print one line after each matching line,
and with tail you get the last two lines.
or in a reversed way:
tac file | grep -B 1 x -m1 | tac
Note: You should make sure your pattern is "strong" enough so it gets you the right lines. i.e. by enclosing it with ^ at the start and $ at the end.
This might work for you (GNU sed):
sed 'H;/x/h;$!d;x' file
Saves the last x and what follows in the hold space and prints it out at end-of-file.
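On the question's sample file:
$ sed 'H;/x/h;$!d;x' file
x
c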
Not sure how to do it using sed, but you can try awk:
awk '{a=a"\n"$0; if ($0 == "x"){ a=$0}} END{print a}' file
POSIX vi (or ex or ed), in case it is useful to someone
Done in Command mode, of course
:set wrapscan
Go to the first line and just search Backwards!
1G?pattern
Slower way, without :set wrapscan
G$?pattern
Explanation:
G: go to the last line
$: move to the end of that line
?: search Backwards for pattern
The first backwards match will be the same as the last forward match
Either way, you may now delete all lines above current (match)
:1,.-1d
or
kd1G
You could also delete to the beginning of the matched line prior to the line deletions with d0 in case there were multiple matches on the same line.
POSIX awk, as suggested at
get last line from grep search on multiple files
awk '(FNR==1)&&s{print s; s=""}/PATTERN/{s=$0}END{if(s) print s}'
If you want to do awk in truly hideous one-liner fashion, getting awk to look closer to functional programming paradigm syntax, without having to keep track of when the last occurrence is:
mawk/mawk2/gawk 'BEGIN { FS = "=7713[0-9]+="; RS = "^$";
} END { print ar0[split($(0 * sub(/\n.+$/,"",$NF)), ar0, ORS)] }'
Here i'm employing multiple awk short-hands :
sub(/\n.+$/, "", $NF) # trimming all extra rows after pattern
g/sub() returns # of substitutions made, so multiplying that by 0 forces the split() to be splitting $0, the full file, instead.
split() returns # of items in the array (which is another way of saying the position of the last element), so even though I've already trimmed out the trailing \n, I can still directly print ar0[split()], knowing that ORS will fill in the missing trailing \n.
That's why this code looks like I'm trying to extract array items before the array itself is defined, but due to the flow of logic needed, the array will be defined by the time it reaches print.
Now if you want something simpler, these 2 also work
mawk/gawk 'BEGIN { FS="=7713[0-9]+="; RS = "^$"
} END { $NF = substr($NF, 1, index($NF, ORS));
FS = ORS; $0 = $0; print $(NF-1) }'
or
mawk/gawk '/=7713[0-9]+=/ { lst = $0 } END { print lst }'
I didn't use the OP's exact x/c requirements, just to showcase that these work regardless of whether you need fixed-string or regex-based matches.
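Applied to the OP's sample file, the simple last-match form would be as follows; note that it prints only the matching line itself, not the line that follows it:
$ awk '/x/ { lst = $0 } END { print lst }' file
x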
The above solutions only work on a single file; to print the last occurrence for many files (say, with suffix .txt), use the following bash script:
#!/bin/bash
for fn in *.txt
do
    result=$(grep 'pattern' "$fn" | tail -n 1)
    echo "$result"
done
where 'pattern' is what you would like to grep.
