How to remove all but the last 3 parts of FQDN? - bash

I have a list of IP lookups and I wish to remove all but the last 3 parts, so:
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
would become
163data.com.cn
I have spent hours searching for clues, including parameter substitution, but the closest I got was:
$ string="98.254.237.114.broad.lyg.js.dynamic.163data.com.cn"
$ string1=${string%.*.*.*}
$ echo $string1
Which gives me the inverted answer of:
98.254.237.114.broad.lyg.js.dynamic
which is everything but the last 3 parts.
A script to do a list would be better than just the static example I have here.
I'm on CentOS 6 and don't mind whether it uses sed, cut, awk, or anything else.
Any help appreciated.
Thanks. Now that I have working answers, may I ask a follow-up: how can I then process the resulting list so that, if the last part (after the last '.') is 3 characters long, e.g. .com or .net, only the last 2 parts are kept?
If this is against protocol, please advise how to post a follow-up question.

If your shell supports a parameter expansion nested inside another parameter expansion, you can use this:
$ s='98.254.237.114.broad.lyg.js.dynamic.163data.com.cn'
$ # removing last three fields
$ echo "${s%.*.*.*}"
98.254.237.114.broad.lyg.js.dynamic
$ # pass output of ${s%.*.*.*} plus the extra . to be removed
$ echo "${s#${s%.*.*.*}.}"
163data.com.cn
You can also reverse the line, cut the required fields, and then reverse the result again. This makes it easy to change the number of fields:
$ echo "$s" | rev | cut -d. -f1-3 | rev
163data.com.cn
$ echo "$s" | rev | cut -d. -f1-4 | rev
dynamic.163data.com.cn
$ # and easy to use with file input
$ cat ip.txt
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
foo.bar.123.baz.xyz
a.b.c.d.e.f
$ rev ip.txt | cut -d. -f1-3 | rev
163data.com.cn
123.baz.xyz
d.e.f
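The parameter-expansion approach above can also be wrapped in a small function and applied line by line. This is only a sketch: the function name last_labels and its optional count argument are made up for illustration, and names with too few labels are assumed to be best left unchanged.

```shell
# last_labels NAME [N]: print the last N dot-separated labels of NAME (N defaults to 3).
# Hypothetical helper; names with N or fewer labels are printed unchanged.
last_labels() {
    name=$1
    n=${2:-3}
    prefix=$name
    i=0
    while [ "$i" -lt "$n" ]; do
        case $prefix in
            *.*) prefix=${prefix%.*} ;;             # strip one trailing label
            *)   printf '%s\n' "$name"; return ;;   # ran out of labels: keep everything
        esac
        i=$((i + 1))
    done
    # Remove the leading part, plus the dot that follows it, from the original name.
    printf '%s\n' "${name#"$prefix".}"
}

# Apply it to each line of some input (a here-document stands in for a file).
while IFS= read -r line; do
    last_labels "$line"
done <<'EOF'
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
a.b.c.d.e.f
EOF
```

For long lists the rev | cut | rev pipeline is likely to be faster, since it does not loop in the shell.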

echo "$string" | awk -F. '{ if (NF < 3) { print $0 } else { print $(NF-2)"."$(NF-1)"."$NF } }'
NF is the total number of fields separated by ".", so we want the last field ($NF), the last but one ($(NF-1)) and the last but two ($(NF-2)). Lines with fewer than three fields are printed unchanged.

$ echo $string | awk -F'.' '{printf "%s.%s.%s\n",$(NF-2),$(NF-1),$NF}'
163data.com.cn
Brief explanation:
Set the field separator to "." with -F'.'.
Print only the last 3 fields using $(NF-2), $(NF-1), and $NF.
There is also another option you may try (FPAT is specific to GNU awk 4.0 or later; [.] matches a literal dot):
$ echo $string | awk -v FPAT='[^.]+[.][^.]+[.][^.]+$' '{print $NF}'
163data.com.cn

It sounds like this is what you need:
awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
e.g.
$ echo "$string" | awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
163data.com.cn
but with just 1 sample input/output it's just a guess.
Regarding your follow-up question, this might be what you're asking for:
$ echo "$string" | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
163data.com.cn
$ echo 'www.google.com' | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
google.com
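The same follow-up rule can be sketched without an interval expression ({n}), which some older awk versions do not enable by default. The function name trim_fqdn is hypothetical, and the rule (keep 2 parts when the last part is 3 characters long, otherwise 3) is taken directly from the question.

```shell
# trim_fqdn [FILE...]: keep the last 2 labels if the final label has 3 characters,
# otherwise keep the last 3. Reads standard input when no file is given.
trim_fqdn() {
    awk -F. '{
        n = (length($NF) == 3 ? 2 : 3)      # how many trailing labels to keep
        if (NF <= n) { print; next }        # already short enough: print unchanged
        out = $(NF - n + 1)
        for (i = NF - n + 2; i <= NF; i++)
            out = out "." $i                # rebuild the kept labels with dots
        print out
    }' "$@"
}

printf '%s\n' 98.254.237.114.broad.lyg.js.dynamic.163data.com.cn www.google.com | trim_fqdn
```

Joining the fields in a loop is more verbose than the sub() call above, but it leaves short lines alone instead of building a regex with a negative repeat count.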

A version that uses expr (an external command, not a bash builtin):
expr "$string" : '.*\.\(.*\..*\..*\)'
To use it with a file you can iterate with xargs:
File:
head list.dat
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
98.254.34.56.broad.kkk.76onepi.co.cn
98.254.237.114.polst.a65dal.com.cn
Iterating over the whole file:
cat list.dat | xargs -I^ -L1 expr "^" : '.*\.\(.*\..*\..*\)'
Note: this starts one expr process per line, so it will not be efficient on large files. You need to decide for yourself whether that is good enough.
Regexp explanation:
.*\.            the leading part and the dot that ends it, which we are not interested in (outside the brackets)
\( ... \)       the brackets mark the part that is extracted
.*\..*\..*      inside the brackets: three words separated by two literal dots (\.)
details:
http://tldp.org/LDP/abs/html/string-manipulation.html -> Substring Extraction
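If the per-line expr call is too slow, the same kind of basic regular expression can be handed to sed, which processes the whole file in a single process. This is a sketch rather than part of the answer above; the function name last3_sed is hypothetical, and [^.]* is used inside the brackets so that each word cannot itself contain a dot.

```shell
# last3_sed [FILE...]: keep only the last three dot-separated words of each line.
# Lines with fewer than four words do not match and pass through unchanged.
last3_sed() {
    sed 's/.*\.\([^.]*\.[^.]*\.[^.]*\)$/\1/' "$@"
}

printf '%s\n' \
    98.254.237.114.broad.lyg.js.dynamic.163data.com.cn \
    98.254.34.56.broad.kkk.76onepi.co.cn \
    98.254.237.114.polst.a65dal.com.cn | last3_sed
```

With a file, it would be called as: last3_sed list.dat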

Related

How to use awk to select text from a file starting from a line number until a certain string

I have a file that I want to read starting from a certain line number and ending at a certain string. I already used
awk "NR>=$LINE && NR<=$((LINE + 121)) {print}" db_000022_model1.dlg
to read from a specific line up to an incremented line number, but now I need it to stop by itself at a certain string so that I can use it on other files.
DOCKED: ENDBRANCH 7 22
DOCKED: TORSDOF 3
DOCKED: TER
DOCKED: ENDMDL
I want it to stop after it reaches
DOCKED: ENDMDL
#!/bin/bash
# This script is for extracting the pdb files from a sorted list of scored
# ligands
mkdir top_poses
for d in $(head -20 summary_2.0.sort | cut -d, -f1 | cut -d/ -f1)
do
cd "$d"||continue
# find the cluster with the highest population within the dlg
RUN=$(grep '###*' "$d.dlg" | sort -k10 -r | head -1 | cut -d\| -f3 | sed 's/ //g')
LINE=$(grep -ni "BEGINNING GENETIC ALGORITHM DOCKING $RUN of 100" "$d.dlg" | cut -d: -f1)
echo "$LINE"
# extract the best pose and correct the format
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
# convert the pdbqt file into pdb
#obabel -ipdbqt $d.pdbqt -opdb -O../top_poses/$d.pdb
cd ..
done
When I run
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
directly in the terminal, it works, but in the script it produces an empty file.
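Which command you want depends on how DOCKED: ENDMDL lines that occur before your target line should be handled. This one exits at the first DOCKED: ENDMDL anywhere in the file, even one before the target line (in which case nothing is printed):
awk -v line="$LINE" 'NR>=line; /DOCKED: ENDMDL/{exit}' db_000022_model1.dlg
This one ignores matches before the target line and stops at the first one at or after it:
awk -v line="$LINE" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' db_000022_model1.dlg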
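The difference between the two commands is easiest to see on a tiny input where the marker also occurs before the target line. The sample text and the function names below are invented for illustration; only the two awk programs come from the answer.

```shell
# Six-line stand-in for a .dlg file: the marker appears on lines 2 and 5.
sample() {
    printf '%s\n' \
        'DOCKED: MODEL 1' \
        'DOCKED: ENDMDL' \
        'DOCKED: MODEL 2' \
        'DOCKED: ATOM a' \
        'DOCKED: ENDMDL' \
        'DOCKED: MODEL 3'
}

# First variant: exits at the first marker anywhere, even before the target line.
exit_on_any_marker() {
    awk -v line="$1" 'NR>=line; /DOCKED: ENDMDL/{exit}'
}

# Second variant: only looks for the marker once the target line has been reached.
exit_on_marker_after_line() {
    awk -v line="$1" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}'
}

echo "first variant, starting at line 3:"
sample | exit_on_any_marker 3            # prints nothing: awk exits on line 2
echo "second variant, starting at line 3:"
sample | exit_on_marker_after_line 3     # prints lines 3 to 5
```

An earlier DOCKED: ENDMDL in the real file would explain an empty output file from the first form.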

Count number of Special Character in Unix Shell

I have a delimited file that is separated by octal \036 or Hexadecimal value 1e.
I need to count the number of delimiters on each line using a bash shell script.
I was trying to use awk, not sure if this is the best way.
Sample Input (| is a representation of \036)
Example|Running|123|
Expected output:
3
awk -F'|' '{print NF-1}' file
Change | to whatever separator you like. If your file can have empty lines then you need to tweak it to:
awk -F'|' '{print (NF ? NF-1 : 0)}' file
You can try
awk '{print gsub(/\|/,"")}'
Simply try
awk -F"|" '{print substr($3,length($3))}' OFS="|" Input_file
Explanation: -F sets the field separator to |, and substr($3,length($3)) prints the last character of the third field; OFS is set but has no effect here. For the sample line that character happens to be 3, but it is not a count of delimiters, so this only fits input shaped exactly like the sample.
This will work as far as I know
echo "Example|Running|123|" | tr -cd '|' | wc -c
Output
3
This should work for you:
awk -F '\036' '{print NF-1}' file
3
-F '\036' sets input field delimiter as octal value 036
Awk may not be the best tool for this. GNU grep has a handy -o option that prints each match on a separate line. You can then count how many matches are printed for each input line, and that is the count of your delimiters. For example (where ^^ in the file is actually hex 1e):
$ cat -v i
a^^b^^c
d^^e^^f^^g
$ grep -n -o $'\x1e' i | uniq -c
2 1:
3 2:
If you remove the uniq -c you can see how it works: "1:" is printed twice because there are two matches on the first line. Trying it with some regular ASCII characters makes it clearer what the -o and -n options do.
If you want to print the line number followed by the field count for that line, I'd do something like:
$ grep -n -o $'\x1e' i | tr -d ':' | uniq -c | awk '{print $2 " " $1}'
1 2
2 3
This assumes that every line in the file contains at least one delimiter. If that's not the case, here's another approach that's probably faster too:
$ tr -d -c $'\x1e\n' < i | awk '{print length}'
2
3
0
0
0
This uses tr to delete (-d) all characters that are not (-c) 1e or \n. It then pipes that stream of data to awk which just counts how many characters are left on each line. If you want the line number, add " | cat -n" to the end.
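The tr approach can be tried without a prepared file by generating the \036 delimiter with printf. This is a minimal sketch; the function name count_delims is hypothetical, and it spells the delimiter as an octal escape so that it also works in shells without $'...' quoting.

```shell
# count_delims: for each input line, print how many \036 (hex 1e) characters it contains.
count_delims() {
    # Keep only the delimiter and newlines, then report what is left on each line.
    tr -cd '\036\n' | awk '{ print length($0) }'
}

# Three test lines: three delimiters, an empty line, and a line with none.
printf 'Example\036Running\036123\036\n\nno delimiters here\n' | count_delims
```

Lines without any delimiter are reported as 0 rather than being skipped.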

How to process a string using awk in a shell script

I am very new to shell scripting and have many tasks to do with it. I am trying to learn as fast as possible, but sometimes shell scripting makes a task look very easy and at other times it just toys with me. I am facing such a situation now.
I have a command which gives me an output like this.
File Dependents
----------------------------------------------------------------------------
<File> is a requisite of <Dependents>
Path: /usr/lib/obj
Java 1.0.0.0 analysis 0.0.0.2
runtime 1.2.0.0
client 1.2.0.0
framework 6.1.9.100
sguide 1.9.10.0
sysmgt 6.1.9.100
dsm 6.1.9.200
Path: /etc/obj
Java 1.0.0.0 analysis 1.2.0.2
runtime 2.0.0.0
client3 6.1.9.0
sysmgt 6.1.9.0
dsm2 6.1.9.0
Now I want to get the list of dependencies into an array for further processing. This is what I am able to do so far:
<command> | cut -f1 | grep '[a-z]' | grep -v File | grep -v : | awk '{ print $1}'
output is:
Java<<< I want this to be analysis
runtime
client
framework
sguide
sysmgt
dsm
Java<<< want this to be analysis
runtime
client3
sysmgt
dsm2
I have to capture these two lists in two separate arrays.
Can someone please help me achieve this output in an elegant way? I don't want to butcher this code with a brute-force method involving lots of conditions and comparisons.
awk to the rescue!
$ arr1=$(command ... | awk -v c=1 '!NF{f=0} f && s==c{print $1} /Java/{f=1; s++; if(s==c) print $(NF-1)}')
$ arr2=$(command ... | awk -v c=2 '!NF{f=0} f && s==c{print $1} /Java/{f=1; s++; if(s==c) print $(NF-1)}')
$ echo $arr1
analysis runtime client framework sguide sysmgt dsm
$ echo $arr2
analysis runtime client3 sysmgt dsm2
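It would perhaps be better to run the command once and split the results into two arrays. Note that arr1 and arr2 above are plain strings holding one name per line, not bash arrays.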
Explanation
awk -v c=1                      set awk variable c to 1 (the group instance number)
'!NF{f=0}                       if there are no fields (empty line), reset f
f && s==c{print $1}             if f is set and the counter equals c, print the first field
/Java/{f=1; s++;                when the line matches Java, set f and increment the counter, and
...if(s==c) print $(NF-1)}'     if the counter equals c, print the penultimate field
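To end up with real bash arrays from a single run of the command, one option is to let awk tag each name with its group number and let the shell sort the names into arrays. The sketch below makes several assumptions: sample_output is a hypothetical stand-in for the real command, every group starts with a "Path:" line, and the wanted name is always the field just before the final version number.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real command: prints the sample output from the question.
sample_output() {
    cat <<'EOF'
File Dependents
----------------------------------------------------------------------------
<File> is a requisite of <Dependents>
Path: /usr/lib/obj
Java 1.0.0.0 analysis 0.0.0.2
runtime 1.2.0.0
client 1.2.0.0
framework 6.1.9.100
sguide 1.9.10.0
sysmgt 6.1.9.100
dsm 6.1.9.200
Path: /etc/obj
Java 1.0.0.0 analysis 1.2.0.2
runtime 2.0.0.0
client3 6.1.9.0
sysmgt 6.1.9.0
dsm2 6.1.9.0
EOF
}

arr1=()
arr2=()
# awk prints "<group number> <name>" for every line that ends in a version number.
while read -r group name; do
    case $group in
        1) arr1+=("$name") ;;
        2) arr2+=("$name") ;;
    esac
done < <(sample_output | awk '$1 == "Path:" { g++; next }
                              /[0-9]+\.[0-9]+/ { print g, $(NF-1) }')

printf 'group 1: %s\n' "${arr1[*]}"
printf 'group 2: %s\n' "${arr2[*]}"
```

Process substitution (< <(...)) keeps the while loop in the current shell, so the arrays are still set after the loop ends.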
You can fix your solution by removing the substring with Java first:
command | sed 's/Java [^ ]*//' | cut -f1 | grep '[a-z]' | grep -v File | grep -v : | awk '{ print $1}'
When you use awk, you may as well use its full strength. Just say that you want to print the second-to-last field of any line containing a number:
command | awk '/[0-9]/ { print $(NF-1) }'
This is better than trying to use sed (do you have tabs or spaces?)
command | sed -n '/[0-9].[0-9]/ s/^.* \([^ ]*\) .*/\1/p'
A fun solution is to use rev to reverse your text, so that cut can find the second field.
command | grep '[0-9].[0-9]' | rev | cut -d " " -f2 | rev
For people who only read the last line, I will repeat the awk solution:
command | awk '/[0-9]/ { print $(NF-1) }'

How can I get the index of a given occurrence of a character that is repeated several times in a text line, using a shell (bash) script

I have a Text string like below
"/path/to/log/file/LOG_FILE.log.2013-10-02-15:2013-10-02 15:46:57.809 INFO - TTT005|Receive|0000293|N~0000284~YOS~TTT005~ ~000~YC~|YOS TYOS-YCUPDT1-H 20131002154657669284YCARR TTT005 Y0TD04 |1|0150520106050|001|051052020603|003|015030010101502702060510520101|000||000|| "
Here "|" is repeated several times within the string and I need to get the index of 4th occurrence of "|" character using shell-script (BASH) command. I tried to find a way using grep command's options.
Thanks.
Using awk you can do:
awk -F '|' '{print index($0, $5)-1}' file
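This prints the 1-based character position of the fourth pipe on each line of the file. It relies on the text of the fifth field not occurring earlier in the line, because index() returns the first match.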
grep can print the byte-offset; when used with -o it prints the byte-offset of the matching part.
$ string="/path/to/log/file/LOG_FILE.log.2013-10-02-15:2013-10-02 15:46:57.809 INFO - TTT005|Receive|0000293|N~0000284~YOS~TTT005~ ~000~YC~|YOS TYOS-YCUPDT1-H 20131002154657669284YCARR TTT005 Y0TD04 |1|0150520106050|001|051052020603|003|015030010101502702060510520101|000||000||"
$ grep -ob "[^|]*" <<< "${string}" | sed '5!d' | cut -d: -f1
132
Alternatively, without using grep:
$ newstring=$(echo "${string}" | cut -d\| -f5-)
$ echo $(( ${#string} - ${#newstring} ))
132
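The length-difference idea above generalises to any occurrence number using only parameter expansion. This is a sketch with a hypothetical function name, nth_index; it reports a 1-based position counted in characters, and it returns a non-zero status when the string has fewer occurrences than requested.

```shell
# nth_index STRING CHAR N: print the 1-based position of the Nth CHAR in STRING.
nth_index() {
    str=$1
    ch=$2
    n=$3
    rest=$str
    i=0
    while [ "$i" -lt "$n" ]; do
        case $rest in
            *"$ch"*) rest=${rest#*"$ch"} ;;   # drop everything up to and including the next CHAR
            *)       return 1 ;;              # fewer than N occurrences
        esac
        i=$((i + 1))
    done
    # Whatever was dropped ends exactly at the Nth occurrence.
    echo $(( ${#str} - ${#rest} ))
}

nth_index 'a|bc|d|efg|h' '|' 4
```

For the short example string the fourth pipe is the 11th character, so the call prints 11.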

Bash: creating a pipeline to list top 100 words

Ok, so I need to create a command that lists the 100 most frequent words in any given file, in a block of text.
What I have at the moment:
$ alias words='tr " " "\012" <hamlet.txt | sort -n | uniq -c | sort -r | head -n 10'
outputs
$ words
14 the
14 of
8 to
7 and
5 To
5 The
5 And
5 a
4 we
4 that
I need it to output in the following format:
the of to and To The And a we that
((On that note, how would I tell it to print the output in all caps?))
And I need to change it so that I can pipe 'words' to any file, so instead of having the file specified within the pipe, the initial input would name the file & the pipe would do the rest.
Okay, taking your points one by one, though not necessarily in order.
You can change words to use standard input just by removing the <hamlet.txt bit since tr will take its input from standard input by default. Then, if you want to process a specific file, use:
cat hamlet.txt | words
or:
words <hamlet.txt
You can remove the effects of capital letters by making the first part of the pipeline:
tr '[A-Z]' '[a-z]'
which will lower-case your input before doing anything else.
Lastly, take that entire pipeline (with the modifications suggested above) and pass its output through one more command:
| awk '{printf "%s ", $2}END{print ""}'
This prints the second field of each line (the word) followed by a space, then prints an empty string with a terminating newline at the end.
For example, the following script words.sh will give you what you need:
tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort | uniq -c | sort -rn
| head -n 3 | awk '{printf "%s ", $2}END{print ""}'
(on one line: I've split it for readability) as per the following transcript:
pax> echo One Two two Three three three Four four four four | ./words.sh
four three two
You can achieve the same end with the following alias:
alias words="tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort | uniq -c | sort -rn
| head -n 3 | awk '{printf \"%s \", \$2}END{print \"\"}'"
(again, one line) but, when things get this complex, I prefer a script, if only to avoid interminable escape characters :-)
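A shell function is a middle ground between the alias and a separate script: it needs no escaping and can take the word count as an argument. The sketch below is one possible shape, not the answer's own code. The name top_words is hypothetical, the counts are sorted numerically, and all whitespace (not just spaces) is treated as a word separator.

```shell
# top_words [N]: read text on standard input and print the N most frequent
# words (default 100), lower-cased, on a single line.
top_words() {
    n=${1:-100}
    tr '[:upper:]' '[:lower:]' |
        tr -s '[:space:]' '\n' |     # one word per line, squeezing repeated whitespace
        sort | uniq -c |             # count each distinct word
        sort -k1,1nr -k2 |           # highest count first, ties in alphabetical order
        head -n "$n" |
        awk '{ printf "%s%s", (NR > 1 ? " " : ""), $2 } END { print "" }'
}

echo One Two two Three three three Four four four four | top_words 3
```

A file would be processed with: top_words 100 <hamlet.txt. Swapping the two character classes in the first tr would print the words in upper case instead.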
