awk to print all columns from the nth to the last with spaces - bash

I have the following input file:
a 1 o p
b 2 o p p
c 3 o p p p
In the last line there is a double space between the last p's, and the columns have different spacing.
I have used the solution from: Using awk to print all columns from the nth to the last.
awk '{for(i=2;i<=NF;i++){printf "%s ", $i}; printf "\n"}'
and it works fine until it reaches the double space in the last column, where it collapses it to a single space.
How can I avoid that while still using awk?

Since you want to preserve spaces, let's just use cut:
$ cut -d' ' -f2- file
1 o p
2 o p p
3 o p p p
Or, for example, to start at column 4:
$ cut -d' ' -f4- file
p
p p
p p p
This will work as long as the columns you are removing are separated by single spaces.
If the columns you are removing themselves contain varying amounts of whitespace, you can use the beautiful solution by Ed Morton in Print all but the first three columns:
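To see cut's limitation concretely: with a single-space delimiter, every extra space produces an empty field, so runs of spaces in the removed columns shift the field numbering (a quick sketch):

```shell
# the double space between 'a' and '1' creates an empty field 2,
# so -f2- starts with that empty field (note the leading space in the output)
printf 'a  1 o p\n' | cut -d' ' -f2-
```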
awk '{sub(/[[:space:]]*([^[:space:]]+[[:space:]]+){1}/,"")}1'
(the {1} is the number of columns to remove)
Test
$ cat a
a 1 o p
b 2 o p p
c 3 o p p p
$ awk '{sub(/[[:space:]]*([^[:space:]]+[[:space:]]+){2}/,"")}1' a
o p
o p p
o p p p

GNU sed
remove first n fields
sed -r 's/([^ ]+ +){2}//' file
GNU awk 4.0+
awk '{sub("([^"FS"]"FS"){2}","")}1' file
GNU awk <4.0
awk --re-interval '{sub("([^"FS"]"FS"){2}","")}1' file
In case the FS one doesn't work (Ed's suggestion):
awk '{sub(/([^ ] ){2}/,"")}1' file
Replace 2 with number of fields you wish to remove
EDIT
Another way (doesn't require re-interval):
awk '{for(i=0;i<2;i++)sub($1"[[:space:]]*","")}1' file
Further edit
As advised by Ed Morton, it is bad to use fields in sub() since they may contain regex metacharacters, so here is an alternative (again!):
awk '{for(i=0;i<2;i++)sub(/[^[:space:]]+[[:space:]]*/,"")}1' file
Output
o p
o p p
o p p p
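To avoid editing the count inside the script each time, the loop variant above can take it from a variable via -v (a sketch; the leading [[:space:]]* also eats any indentation before the first field, and the spacing between the fields that remain is preserved):

```shell
# remove the first n whitespace-separated fields
n=2
printf 'a 1 o p\nb 2 o p  p\n' |
awk -v n="$n" '{ for (i = 0; i < n; i++) sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+/, "") } 1'
```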

In Perl, you can use split with capturing to keep the delimiters:
perl -ne '@f = split /( +)/; print @f[ 1 * 2 .. $#f ]'
# the "1" in "1 * 2" is the column number (starting from 0);
# split with a capturing group keeps the delimiters, hence the * 2

If you want to preserve all spaces after the start of the second column, this will do the trick:
{
match($0, ($1 "[ \\t]+"))
print substr($0, RSTART+RLENGTH)
}
The call to match locates the start of the first token on the line; RLENGTH covers that token plus the whitespace that follows it. Then you just print everything on the line after that point.
You could generalize it somewhat to ignore the first N tokens this way:
BEGIN {
N = 2
}
{
r = ""
for (i=1; i<=N; i++) {
r = (r $i "[ \\t]+")
}
match($0, r)
print substr($0, RSTART+RLENGTH)
}
Applying the above script to your example input yields:
o p
o p p
o p p p
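As a quick check, the single-field case of the match/substr idea works as a one-liner; using a literal regex rather than interpolating $1 also sidesteps any regex metacharacters in the data, and the double space survives intact:

```shell
# strip the first field and the whitespace after it; everything
# from RSTART+RLENGTH onward is printed byte-for-byte
printf 'a 1 o p\nb 2 o p  p\n' |
awk '{ match($0, /^[^ \t]+[ \t]+/); print substr($0, RSTART + RLENGTH) }'
```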

Related

How to remove columns from a file by given the columns in anther file in Linux?

Suppose I have a file fileA that contains the column numbers to be removed (my real input file fileB has over 500 columns):
fileA:
2
5
And I want to remove those columns(2 and 5) from fileB:
a b c d e f
g h i j k l
in Linux to get:
a c d f
g i j l
what should I do? I found that I could blank out those columns with:
awk '{$2=$5="";print $0}' fileB
However, there are two problems with this approach: first, it does not really remove the columns, it just replaces them with empty strings; second, instead of typing the column numbers in by hand, how can I read them from the other file?
Original Question:
Suppose I have a file A contains the column numbers need to be removed,
file A:
223
345
346
567
And I want to remove those columns (223, 345, 346, 567) from file B in Linux. What should I do?
If your cut has the --complement option then you can do (cut needs a comma-separated field list, hence the tr):
cut --complement -d' ' -f"$(echo $(<fileA) | tr ' ' ',')" fileB
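A quick check of the cut approach (GNU cut; --complement is not POSIX). Note that cut wants a comma-separated field list, which is why the numbers from fileA need joining, here with paste -s:

```shell
# build the sample files, then cut away the listed columns
printf '2\n5\n' > fileA
printf 'a b c d e f\ng h i j k l\n' > fileB
cut --complement -d' ' -f"$(paste -s -d, fileA)" fileB
```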
$ cat tst.awk
NR==FNR {
badFldNrs[$1]
next
}
FNR == 1 {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
if ( !(inFldNr in badFldNrs) ) {
out2in[++numOutFlds] = inFldNr
}
}
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = out2in[outFldNr]
printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
}
}
$ awk -f tst.awk fileA fileB
a c d f
g i j l
One awk idea:
awk '
FNR==NR { skip[$1] ; next } # store field #s to be skipped
{ line="" # initialize output variable
pfx="" # first prefix will be ""
for (i=1;i<=NF;i++) # loop through the fields in this input line ...
if ( !(i in skip) ) { # if field # not mentioned in the skip[] array then ...
line=line pfx $i # add to our output variable
pfx=OFS # prefix = OFS for 2nd-nth fields to be added to output variable
}
if ( pfx == OFS ) # if we have something to print ...
print line # print output variable to stdout
}
' fileA fileB
NOTE: OP hasn't provided the input/output field delimiters; OP can add the appropriate FS/OFS assignments as needed
This generates:
a c d f
g i j l
Using awk
$ awk 'NR==FNR {col[$1]=$1;next} {for(i=1;i<=NF;++i) if (i != col[i]) printf("%s ", $i); printf("\n")}' fileA fileB
a c d f
g i j l

If there's match append text to the beginning of the next line

I have a file like this
from a
b
to c
d
from e
f
from g
h
to i
j
If there's a match for from, prepend to to the beginning of the next line. If there's a match for to, prepend from to the beginning of the next line. The output should be like this:
from a
to b
to c
from d
from e
to f
from g
to h
to i
from j
Can this be done using any unix commands?
I have tried using paste to merge every 2 lines and then sed, something like this, but it's definitely wrong, and I don't know how to split the lines back again afterwards.
paste -d - - <file> | sed "s/\(^from.*\)/\1 to/" | sed "s/\(^to.*\)/\1 from/"
I think there should be an easier solution to this compared to what I'm doing.
Using sed :
sed '/^from/{n;s/^/to /;b};/^to/{n;s/^/from /}'
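Spelled out: n prints the current line and loads the next one into the pattern space, s/^/to / prepends the opposite keyword, and b ends the cycle so the second address can't also fire (GNU sed; a quick run on part of the sample):

```shell
printf 'from a\nb\nto c\nd\n' |
sed '/^from/{n;s/^/to /;b};/^to/{n;s/^/from /}'
```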
$ awk '{if ($1 ~ /^(from|to)$/) dir=$1; else $0=(dir=="from" ? "to" : "from") OFS $0} 1' file
from a
to b
to c
from d
from e
to f
from g
to h
to i
from j
Could you please try the following:
awk '
{
val=prev=="from" && $0 !~ /to/?"to "$0:prev=="to" && $0 !~/from/?"from "$0:$0
prev=$1
$0=val
}
1
' Input_file
Something like this should work :
awk '
#Before reading the file I build a dictionary that maps the "from" keyword to the "to" value and vice versa
BEGIN{kw["from"]="to"; kw["to"]="from"}
#If the first word of the line is a key of my dictionary ("to" or "from"), I save it in the variable k and print the line
$1 in kw{k=$1;print;next}
#Otherwise I prepend the "opposite" of k to the line
{print kw[k], $0}
' <input>

Matching contents of one file with another and returning second column

So I have two txt files
file1.txt
s
j
z
z
e
and file2.txt
s h
f a
j e
k m
z l
d p
e o
and what I want to do is match the first letter of file1 with the first letter of file2 and return the second column of file2. So, for example, the expected output would be:
h
e
l
l
o
I'm trying to use join file1.txt file2.txt but that just prints out the entire second file. not sure how to fix this. Thank you.
This is an awk classic:
$ awk 'NR==FNR{a[$1]=$2;next}{print a[$1]}' file2 file1
h
e
l
l
o
Explained:
$ awk '
NR==FNR {      # processing file2
  a[$1]=$2     # hash the records: first field is the key, second is the value
  next
}
{              # processing file1
  print a[$1]  # look up and print the value stored for this key
}' file2 file1
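For completeness, join itself can do the lookup, but it requires both inputs sorted on the key, which is why the unsorted call printed the whole file. Sorting also loses the original line order, so the awk hash above is usually the better fit (a sketch using the sample data):

```shell
printf 's\nj\nz\nz\ne\n' > file1.txt
printf 's h\nf a\nj e\nk m\nz l\nd p\ne o\n' > file2.txt
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted
# prints the matched second column, but in key-sorted order: o e h l l
join file1.sorted file2.sorted | awk '{print $2}'
```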

How to repeat lines in bash and paste with different columns?

Is there a short way in bash to repeat each line of one file as often as needed to paste it with another file, in a Kronecker-product style (for the mathematicians among you)?
What I mean is, I have a file A:
a
b
c
and a file B:
x
y
z
and I want to merge them as follows:
a x
a y
a z
b x
b y
b z
c x
c y
c z
I could probably write a script, read the files line by line and loop over them, but I am wondering if there a short one-line command that could do the same job. I can't think of one and as you can see, I am also lacking some keywords to search for. :-D
Thanks in advance.
You can use this one-liner awk command:
awk 'FNR==NR{a[++n]=$0; next} {for(i=1; i<=n; i++) print $0, a[i]}' file2 file1
a x
a y
a z
b x
b y
b z
c x
c y
c z
Breakdown:
NR == FNR { # While processing the first file in the list
a[++n]=$0 # store the row in array 'a' under an incrementing index
next # move to next record
}
{ # while processing the second file
for(i=1; i<=n; i++) # iterate over the array a
print $0, a[i] # print current row and array element
}
alternative to awk
join <(sed 's/^/_\t/' file1) <(sed 's/^/_\t/' file2) | cut -d' ' -f2-
Add a fake key so that every record of file1 joins with every record of file2, then trim the key off afterwards.
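The same cross product can be written without awk or join as a plain shell loop; it's slower on large files but easy to follow. A sketch (the names a.txt and b.txt are placeholders for your two files):

```shell
printf 'a\nb\nc\n' > a.txt    # the "outer" file (file A)
printf 'x\ny\nz\n' > b.txt    # the "inner" file (file B)
# for each line of a.txt, re-read b.txt and print the pair
while IFS= read -r outer; do
  while IFS= read -r inner; do
    printf '%s %s\n' "$outer" "$inner"
  done < b.txt
done < a.txt
```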

Grab nth occurrence in between two patterns using awk or sed

I have an issue where I want to parse the output from a file and grab the nth occurrence of text between two patterns, preferably using awk or sed:
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done, essentially the output would be
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing with the -n option. Gather up the lines between category and done. Keep a counter in the hold space; when it reaches 3, print the collection from the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big:
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
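A quick check of the counter approach; note that it prints from the nth 'category' up to (but not including) the next one, so it assumes nothing of interest sits between 'done' and the following 'category':

```shell
printf 'category\n1\ns\ndone\ncategory\n2\nn\ndone\ncategory\n3\nr\ndone\n' |
awk -v n=2 '/^category/{l++} l>n{exit} l==n'
```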
If your file doesn't contain any null characters, here's one way using GNU sed. This finds the third occurrence of a pattern range, but you can easily modify it to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. On each line matching 'category' at the start, swap in the hold space and append a null character to the front of it. When it has accumulated three null characters (i.e. the third occurrence), swap back and enter a loop that prints the pattern space line by line; when the final 'done' pattern is found, sed quits. If it's not found, sed keeps reading input and continues its loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record number relative to n is off by one, since the first record is whatever precedes the first RS match.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not verify the ending pattern. One way to do that, which hasn't been covered by the other answers, is with getline (note that getline comes with some caveats):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3
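A quick run of the one-liner on a sample with stray lines between the blocks (the /^done$/ check happens before each getline, so a record missing its 'done' simply ends the scan at EOF instead of being printed; n is passed with -v here for clarity):

```shell
printf 'category\n1\ndone\nfoo\ncategory\n2\ndone\ncategory\n3\ndone\n' |
awk -v n=2 '/^category$/ { v = $0; while (!/^done$/) { if (!getline) exit; v = v ORS $0 } if (++nr == n) print v }'
```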
