shortening headers using awk - bash

I have headers like
>XX|6226515|new|xx_000000.1| XXXXXXX
in a text file which I am trying shorten to
>XX6226515
using awk. I tried
awk -F"|" '/>/{$0=">"$1}1' input.txt > output.txt
but it yields the following instead
>XX|6226515|new|

awk -F"|" '{print $1$2}' input.txt > output.txt
Output:
>XX6226515

sed solution:
sed -e 's/|//' -e 's/|.*//'
The first substitution removes the first vertical bar; the second removes the next vertical bar and everything after it.
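As a minimal sketch, here are the two substitutions applied to the sample header from the question. The `/^>/` address, which restricts the edit to header lines, and the variable names are assumptions added for illustration; they are not part of the answer above.

```shell
# Sample header taken from the question.
header='>XX|6226515|new|xx_000000.1| XXXXXXX'

# On lines starting with ">", delete the first "|" and then
# delete the next "|" together with everything after it.
short=$(printf '%s\n' "$header" | sed -e '/^>/s/|//' -e '/^>/s/|.*//')

echo "$short"
```

Lines that do not start with `>` pass through unchanged, which may matter if the file mixes headers with other data.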

$ awk -F'|' '$0=$1$2' <<< ">XX|6226515|new|xx_000000.1| XXXXXXX"
>XX6226515

This cut can also make it:
cut -d"|" --output-delimiter="" -f-2
See output:
$ echo ">XX|6226515|new|xx_000000.1| XXXXXXX" | cut -d"|" --output-delimiter="" -f-2
>XX6226515
-d"|" sets | as field delimiter.
--output-delimiter="" indicates that the output delimiter has to be empty.
-f-2 indicates that it has to print all fields up to the 2nd (inclusive).
Also with just bash:
while IFS="|" read a b _
do
echo "$a$b"
done <<< ">XX|6226515|new|xx_000000.1| XXXXXXX"
See output:
$ while IFS="|" read a b _; do echo "$a$b"; done <<< ">XX|6226515|new|xx_000000.1| XXXXXXX"
>XX6226515

Related

Print only the contents after a certain pattern match

I have a string like this:
query:schema:query_result{cell=ab}: <timestamp>
I'd like to just print the ab and assign it to a variable. How can I do this with grep/sed?
You may try this,
$ var=$(grep -oP '=\K\w+' <<< "$str")
or
$ sed 's/.*=\(\w\+\).*/\1/' <<< "$str"
ab
You can also use awk:
s='query:schema:query_result{cell=ab}: <timestamp>'
awk -F '[=}]' '{print $2}' <<< "$s"
ab
To assign it to a variable:
var="$(awk -F '[=}]' '{print $2}' <<< "$s")"
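If starting an external process is not wanted, the same value can probably be extracted with shell parameter expansion alone. This is a different technique from the grep, sed and awk answers above, sketched under the assumption that the string contains exactly one `=` before the value; `tmp` is an illustrative helper variable.

```shell
s='query:schema:query_result{cell=ab}: <timestamp>'

# Remove the shortest prefix ending in "=", leaving "ab}: <timestamp>".
tmp=${s#*=}

# Remove the longest suffix starting at "}", leaving "ab".
var=${tmp%%\}*}

echo "$var"
```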

behavior of awk in read line

$ cat file
11 asasaw121
12 saasks122
13 sasjaks22
$ cat no
while read line
do
var=$(awk '{print $1}' $line)
echo $var
done<file
$ cat yes
while read line
do
var=$(echo $line | awk '{print $1}')
echo $var
done<file
$ sh no
awk: can't open file 11
source line number 1
awk: can't open file 12
source line number 1
awk: can't open file 13
source line number 1
$ sh yes
11
12
13
Why doesn't the first one work? What does awk expect to find in $1 in it? I think understanding this will help me avoid numerous scripting problems.
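awk treats every argument after the program text as the name of an input file; it reads standard input only when no file is given.
In the following, $line is a string, not a file:
var=$(awk '{print $1}' $line)
You could say this instead (note the double quotes around the variable):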
var=$(awk '{print $1}' <<<"$line")
Why doesn't the first one work?
Because of this line:
var=$(awk '{print $1}' $line)
Which assumes $line is a file.
You can make it:
var=$(echo "$line" | awk '{print $1}')
OR
var=$(awk '{print $1}' <<< "$line")
awk '{print $1}' $line
                 ^^ awk expects to see a file path or list of file paths here;
                    what it is getting from you is the actual file line
What you want to do is pipe the line into awk as you do in your second example.
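The difference can be seen in a small sketch. The scratch file created with `mktemp` and the variable names are made up for illustration: an operand after the awk program is opened as a file, while text arriving on standard input is read directly.

```shell
# Hypothetical scratch file holding the sample data from the question.
tmpfile=$(mktemp)
printf '%s\n' '11 asasaw121' '12 saasks122' '13 sasjaks22' > "$tmpfile"

# An operand after the program is a file name, so awk opens that file.
from_file=$(awk '{print $1}' "$tmpfile")

# With no file operand, awk reads standard input instead.
from_stdin=$(printf '%s\n' '11 asasaw121' | awk '{print $1}')

echo "$from_file"
echo "$from_stdin"
rm -f "$tmpfile"
```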
You got the answers to your specific questions but I'm not sure it's clear that you would never actually do any of the above.
To print the first field from a file you'd either do this:
while read -r first rest
do
printf "%s\n" "$first"
done < file
or this:
awk '{print $1}' file
or this:
cut -d ' ' -f1 <file
The shell loop would NOT be recommended.

How to print all columns but last 2?

How to print all columns but last 2?
e.g
input :echo FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008
output : FB_SYS_0032_I03_LTO3_idaen02r_02
delimiter : _ (underscore)
for your example, this awk one liner should do:
awk -F'_' -v OFS='_' 'NF-=2' file
test:
kent$  awk -F'_' -v OFS='_' 'NF-=2' <<< "FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008"
FB_SYS_0032_I03_LTO3_idaen02r_02
Just use an RE that describes the last 2 fields:
awk '{sub(/_[^_]*_[^_]*$/,"")}1'
or:
sed 's/_[^_]*_[^_]*$//'
e.g.:
$ echo FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008 | awk '{sub(/_[^_]*_[^_]*$/,"")}1'
FB_SYS_0032_I03_LTO3_idaen02r_02
$ echo FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008 | sed 's/_[^_]*_[^_]*$//'
FB_SYS_0032_I03_LTO3_idaen02r_02
The above will work with any modern awk and any sed on any system.
use this awk command:
awk -F "_" '{for (i=1; i<=NF-2; i++) {printf ("%s", $i); if (i<NF-2) printf "_"} print ""}'
FB_SYS_0032_I03_LTO3_idaen02r_02
Using sed:
sed -r 's/(_[^_]*){2}$//'
For example,
$ echo 1_2_3_4_5 | sed -r 's/(_[^_]*){2}$//'
1_2_3
$ echo 1_2_3_4 | sed -r 's/(_[^_]*){2}$//'
1_2
$ echo FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008 | sed -r 's/(_[^_]*){2}$//'
FB_SYS_0032_I03_LTO3_idaen02r_02
Probably this is the simplest way:
$ input="FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008"
$ echo "${input%_*_*}"
FB_SYS_0032_I03_LTO3_idaen02r_02
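To spell out how that expansion works: `%` removes the shortest suffix that matches the pattern, and `_*_*` matches the last two underscores plus whatever follows each. A minimal sketch, where `drop_last_two` is a hypothetical helper name:

```shell
# Hypothetical helper: strip the last two "_"-separated fields.
drop_last_two() {
    # "%" removes the shortest suffix matching the pattern "_*_*",
    # i.e. the final two underscores and the text after each of them.
    printf '%s\n' "${1%_*_*}"
}

drop_last_two "FB_SYS_0032_I03_LTO3_idaen02r_02_20130820_181008"
drop_last_two "1_2_3_4"
```

Note that a value with fewer than two underscores does not match the pattern and is returned unchanged.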

Awk: Drop last record separator in one-liner

I have a simple command (part of a bash script) that I'm piping through awk but can't seem to suppress the final record separator without then piping to sed. (Yes, I have many choices and mine is sed.) Is there a simpler way without needing the last pipe?
dolls=$(egrep -o 'alpha|echo|november|sierra|victor|whiskey' /etc/passwd \
| uniq | awk '{IRS="\n"; ORS=","; print}'| sed s/,$//);
Without the sed, this produces output like echo,sierra,victor, and I'm just trying to drop the last comma.
You don't need awk, try:
egrep -o ....uniq|paste -d, -s
Here is another example:
kent$ echo "a
b
c"|paste -d, -s
a,b,c
Also, I think your chained command could be simplified. awk could do everything in a one-liner.
Instead of egrep, uniq, awk, sed etc, all this can be done in one single awk command:
awk -F":" '!($1 in a){l=l $1 ","; a[$1]} END{sub(/,$/, "", l); print l}' /etc/passwd
Here is a small and quite straightforward one-liner in awk that suppresses the final record separator:
echo -e "alpha\necho\nnovember" | awk 'y {print s} {s=$0;y=1} END {ORS=""; print s}' ORS=","
Gives:
alpha,echo,november
So, your example becomes:
dolls=$(egrep -o 'alpha|echo|november|sierra|victor|whiskey' /etc/passwd | uniq | awk 'y {print s} {s=$0;y=1} END {ORS=""; print s}' ORS=",");
The benefit of using awk over paste or tr is that this also works with a multi-character ORS.
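To illustrate the multi-character case, here is a rough sketch. It is a variant that uses `printf` inside awk rather than the ORS trick above, and `sep` is an illustrative variable name, not part of the original answer.

```shell
# Join input lines with a multi-character separator and no trailing one.
# The separator is printed before every line except the first.
joined=$(printf '%s\n' alpha echo november |
    awk -v sep=' | ' 'NR > 1 { printf "%s", sep } { printf "%s", $0 } END { print "" }')

echo "$joined"
```

Printing the separator before each line after the first avoids having to remember the previous line, at the cost of not using ORS at all.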
Since you tagged it bash here is one way of doing it:
#!/bin/bash
# Read the /etc/passwd file in to an array called names
while IFS=':' read -r name _; do
names+=("$name");
done < /etc/passwd
# Assign the content of the array to a variable
dolls=$( IFS=, ; echo "${names[*]}")
# Display the value of the variable
echo "$dolls"
echo "a
b
c" |
mawk 'NF-= _==$NF' FS='\n' OFS=, RS=
a,b,c

get the file name from the path

I have a file file.txt having the following structure:-
./a/b/c/sdsd.c
./sdf/sdf/wer/saf/poi.c
./asd/wer/asdf/kljl.c
./wer/asdfo/wer/asf/asdf/hj.c
How can I get only the c file names from the path.
i.e., my output will be
sdsd.c
poi.c
kljl.c
hj.c
You can do this simply using awk.
Set the field separator FS="/"; $NF will then print the last field of every record.
awk 'BEGIN{FS="/"} {print $NF}' file.txt
or
awk -F/ '{print $NF}' file.txt
Or you can do it with cut and the Unix command rev, like this:
rev file.txt | cut -d '/' -f1 | rev
You can use basename command:
basename /a/b/c/sdsd.c
will give you sdsd.c
For a list of files in file.txt, this will do:
while IFS= read -r line; do basename "$line"; done < file.txt
Using sed:
$ sed 's|.*/||g' file
sdsd.c
poi.c
kljl.c
hj.c
The simplest one ($NF is the last column of the current line):
awk -F/ '{print $NF}' file.txt
or using bash & parameter expansion:
while IFS= read -r file; do echo "${file##*/}"; done < file.txt
or bash with basename:
while IFS= read -r file; do basename "$file"; done < file.txt
OUTPUT
sdsd.c
poi.c
kljl.c
hj.c
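For readers unfamiliar with the `${file##*/}` form used above, a short sketch on one path from the question may help. The `%/*` counterpart that yields the directory part is an addition for comparison, and the variable names are illustrative.

```shell
path='./a/b/c/sdsd.c'

# "##*/" removes the longest prefix ending in "/", leaving the file name.
name=${path##*/}

# "%/*" removes the shortest suffix starting at "/", leaving the directory.
dir=${path%/*}

echo "$name"
echo "$dir"
```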
Perl solution:
perl -F/ -ane 'print $F[$#F]' your_file
Also you can use sed:
sed 's/.*[/]//g' your_file
