Reading numbers from a text line in bash shell - bash

I'm trying to write a bash shell script that reads a file CATALOG.dat containing the following lines, made up of both characters and numbers:
event_0133_pk.gz
event_0291_pk.gz
event_0298_pk.gz
event_0356_pk.gz
event_0501_pk.gz
What I want to do is print the numbers (only the numbers) into a new file NUMBERS.dat, using something like > ./NUMBERS.dat, to get:
0133
0291
0298
0356
0501
My problem is: how do I extract the numbers from the text lines? Is there something that lets the script read just the number into a variable, like a scanf-style event_0%d_pk.gz in C/C++?

A grep solution:
grep -oP '[0-9]+' CATALOG.dat >NUMBERS.dat
A sed solution:
sed 's/[^0-9]//g' CATALOG.dat >NUMBERS.dat
And an awk solution:
awk -F"[^0-9]+" '{print $2}' CATALOG.dat >NUMBERS.dat

There are many ways that you can achieve your result. One way would be to use awk:
awk -F_ '{print $2}' CATALOG.dat > NUMBERS.dat
This sets the field separator to an underscore, then prints the second field which contains the numbers.

Awk
awk 'gsub(/[^[:digit:]]/,"")' infile
Bash
while read -r line; do echo "${line//[!0-9]}"; done < infile
tr
tr -cd '[:digit:]\n' <infile

You can use the grep command to extract the number part.
grep -oP '(?<=_)\d+(?=_)' CATALOG.dat
gives this output:
0133
0291
0298
0356
0501
Or, more simply:
grep -oP '\d+' CATALOG.dat

You don't need Perl mode in grep for this; a BRE can do it:
grep -o '[[:digit:]]\+' CATALOG.dat > NUMBERS.dat

Related

extract string between '$$' characters - $$extractabc$$

I am working on a shell script and am new to it. I want to extract the string between double $$ characters, for example:
input:
$$extractabc$$
output
extractabc
I tried grep and sed but couldn't get it working. Any suggestions are welcome!
You could do
awk -F"$" '{print $3}' file.txt
assuming the file contains input:$$extractabc$$ output:extractabc on one line. awk splits the data into pieces using $ as the delimiter: the first field will be input:, the next will be empty, and the next will be extractabc.
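A quick throwaway check (not part of the original answer) that prints each field with its index makes the splitting visible:
echo 'input:$$extractabc$$' | awk -F'$' '{ for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }'
field 1: [input:]
field 2: []
field 3: [extractabc]
field 4: []
field 5: []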
You could use sed like so to get the same info.
sed -e 's/.*$$\(.*\)$$.*/\1/' file.txt
sed looks for text between the $$ markers and outputs it. The pattern is essentially .*$$(.*)$$.*; it is greedy, but stay with me:
.* matches any characters (zero or more) before the first $$
then the string should contain $$
after that $$ there can again be any characters (zero or more)
then the string should contain another $$
and possibly some more characters after it
between the two $$ markers is \(.*\); whatever is captured there is given the placeholder \1
sed finds that text and prints it
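A quick check (my own example, not from the original answer) on a line with extra text around the markers shows the greedy match still isolates the middle part:
echo 'before $$extractabc$$ after' | sed 's/.*$$\(.*\)$$.*/\1/'
extractabc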
Using grep PCRE (where available) and look-around:
$ echo '$$extractabc$$' | grep -oP "(?<=\\$\\$).*(?=\\$\\$)"
extractabc
echo '$$extractabc$$' | awk '{gsub(/\$\$/,"")}1'
extractabc
Here is another variation:
echo '$$extractabc$$' | awk -F'[$][$]' 'NF==3 {print $2}'
It tests whether there are exactly two $$ pairs (NF==3) and only then prints what is between them; the single quotes keep the shell from expanding $$ to its own PID, and [$][$] makes the dollar signs literal in the field-separator regex.
It also works for input like blabla$$some_data$$moreblabla
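For example, a quick check of that second case:
echo 'blabla$$some_data$$moreblabla' | awk -F'[$][$]' 'NF==3 {print $2}'
some_data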
How about removing all the $ characters from the input?
$ echo '$$extractabc$$' | sed 's/\$//g'
extractabc
Same with tr
$ echo '$$extractabc$$' | tr -d '$'
extractabc

Using grep to pull a series of random numbers from a known line

I have a SimpleScalar output file producing lines like:
bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)
Once I use grep to find this line in output.txt, is there a way I can directly grab the "0.9413" portion? I am attempting to build a CSV file and just need whatever value is generated.
Thanks in advance.
There are several ways to combine finding and extracting into a single command:
awk (POSIX-compliant)
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' file
sed (GNU sed or BSD/OSX sed)
sed -En 's/^bpred_2lev\.ras_rate\.PP +([^ ]+).*$/\1/p' file
GNU grep
grep -Po '^bpred_2lev\.ras_rate\.PP +\K[^ ]+' file
You can use awk like this:
grep <your_search_criteria> output.txt | awk '{ print $2 }'
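Since the goal is a CSV file, a small sketch building on the awk command shown above (the results.csv name and the ras_rate label are just placeholders for illustration):
rate=$(awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' output.txt)
printf 'ras_rate,%s\n' "$rate" >> results.csv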

How to extract specific lines from a file in bash?

In a shell script, I want to extract a string from the lines of a file that start with a specific pattern.
For example: I want the strings from lines that start with hello:
hi to_RAm
hello to_Hari
hello to_kumar
bye to_lilly
output should be
to_Hari
to_kumar
Can anyone help me?
sed is the most appropriate tool:
sed -n 's/^hello //p'
Use grep:
grep ^hello file | awk '{print $2}'
^ anchors the match to lines that start with "hello". This assumes you want to print the second word.
If you want to print all words except the first then:
grep ^hello file | awk '{$1=""; print $0}'
You could use GNU grep's perl-compatible regexes and use a lookbehind:
grep -oP '(?<=hello ).*'
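If you would rather avoid external tools entirely, a plain-bash sketch of the same idea (assuming the keyword and the rest of the line are separated by whitespace, as in the example):
while read -r first rest; do
    [ "$first" = hello ] && printf '%s\n' "$rest"
done < file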

Basic stream/sed? bash script, perform substring on each line

I know this is basic, but I couldn't find the simplest way to iterate through a file with hundreds of lines and extract a substring.
If I have a file:
ABCY uuuu
UNUY uuuu
...
I want to end up with:
uuuu
uuuu
....
Ideally I'd do a substring operation: start at character 5 and output everything from there to the end of the line.
You don't need sed for this:
cut -c5-9 yourfile
It would be easier to use cut or awk. Assuming that your fields are separated by a space and you want the second field, you can use:
cut -d' ' -f2 file.txt
awk '{print $2}' file.txt
You can also use cut and awk to extract substrings:
cut -c6- file.txt
awk '{print substr($0,6);}' file.txt
However, if you really want to iterate through the file and extract substrings, you can use a while loop:
while IFS= read -r line
do
echo "${line:5}"
done < file.txt
If you really love sed, you could try:
sed -r 's/^.{5}//' file

Bash: sort text file by last field value

I have a text file containing ~300k rows. Each row has a varying number of comma-delimited fields, the last of which is guaranteed numerical. I want to sort the file by this last numerical field. I can't do:
sort -t, -n -k 2 file.in > file.out
as the number of fields in each row is not constant. I think sed or awk may be the answer, but I'm not sure how. E.g.:
awk -F, '{print $NF}' file.in
gives me the last column value, but how to use this to sort the file?
Use awk to put the numeric key up front. $NF is the last field of the current record. Sort. Use sed to remove the duplicate key.
awk -F, '{ print $NF, $0 }' yourfile | sort -n -k1 | sed 's/^[0-9][0-9]* //'
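To see what the intermediate, decorated lines look like (sample rows invented for illustration):
printf 'a,b,c,42\nx,y,7\n' | awk -F, '{ print $NF, $0 }'
42 a,b,c,42
7 x,y,7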
vim file.in -c '%sort n /.*,\zs/' -c 'saveas file.out' -c 'q'
Maybe reverse the fields of each line in the file before sorting? Something like
perl -ne 'chomp; print(join(",",reverse(split(","))),"\n")' |
sort -t, -n -k1 |
perl -ne 'chomp; print(join(",",reverse(split(","))),"\n")'
should do it, as long as commas are never quoted in any way. If this is a full-fledged CSV file (in which fields containing commas can be quoted or escaped) then you need a real CSV parser.
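For the quoted-CSV case, one hedged sketch of the same put-the-key-up-front idea from the awk answer above, using GNU awk's --csv mode (this assumes gawk 5.3 or newer is available):
gawk --csv '{ print $NF "\t" $0 }' file.in | sort -t $'\t' -k1,1n | cut -f2-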
Perl one-liner:
perl -e '@lines=<STDIN>; foreach (sort { ($a=~/.*,(\d+)/)[0] <=> ($b=~/.*,(\d+)/)[0] } @lines) { print; }' < file.in
I'm going to throw mine in here as an alternative (and I couldn't get awk to work) :)
sample file:
Call of Doody 1322
Seam the Ripper 1329
Mafia Bots 1 1109
Chicken Fingers 1243
Batup Light 1221
Hunter F Tomcat 1140
Tober 0833
code:
for i in `sed -e 's/.* \([0-9]*\)$/\1/' file.txt | sort`; do grep "$i" file.txt; done > file_sort.txt
Python one-liner:
python -c "print ''.join(sorted(open('filename'), key=lambda l: int(l.split(',')[-1])))"
