Removing characters from each line in shellscript - shell

I have a text file like below
IMPALA COUNT :941 MONGO COUNT : 980
IMPALA COUNT :78 MONGO COUNT : 78
IMPALA COUNT :252 MONGO COUNT : 258
IMPALA COUNT :3008 MONGO COUNT : 3064
I want to remove everything except the numbers, like below
941 980
78 78
252 258
3008 3064
Can anybody suggest a shell script for this?

One way:
cut -d':' -f2,3 file.txt | cut -d' ' -f1,5
Another:
awk '{print substr($3, 2) " " $7}' file.txt

A sed solution extracting the two numbers:
sed -r 's/[^0-9]*([0-9]+)[^0-9]*([0-9]+).*/\1 \2/g' file

Here are a few options:
grep -Eo '[0-9]+' file | paste -d " " - -
awk -F'[ :]+' '{print $3, $6}' file
awk -F: '{print $2+0, $3+0}' file
perl -lne '@matches = /(?<=:) *(\S+)/g; print join " ", @matches' file

sed -e 's/[^:]*: *\([0-9]*\) */\1 /g;s/ $//'
that is: Replace any sequence of non-colons [^:]*, followed by a colon and possibly spaces : *, followed by a sequence of digits and possibly spaces \([0-9]*\) *, by the digit sequence \1 plus one space; afterwards delete the final space in the line.
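To check the substitution, it can be run against the sample lines from the question. This is only a sketch: the temporary file and the variable name extracted are illustrative and not part of the original answer.

```shell
# Sample input taken from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
IMPALA COUNT :941 MONGO COUNT : 980
IMPALA COUNT :78 MONGO COUNT : 78
IMPALA COUNT :252 MONGO COUNT : 258
IMPALA COUNT :3008 MONGO COUNT : 3064
EOF

# Same substitution as above: keep each digit run that follows a colon,
# then strip the trailing space left at the end of the line.
extracted=$(sed -e 's/[^:]*: *\([0-9]*\) */\1 /g;s/ $//' "$tmp")
printf '%s\n' "$extracted"

rm -f "$tmp"
```

The expression uses only basic regular expressions, so it should not need the -r option.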

sed -r 's/[^0-9]+://g' file
It matches every run of non-digit characters [^0-9]+ that is followed by a : and removes it.
Example
$ cat file
IMPALA COUNT :941 MONGO COUNT : 980
IMPALA COUNT :78 MONGO COUNT : 78
IMPALA COUNT :252 MONGO COUNT : 258
IMPALA COUNT :3008 MONGO COUNT : 3064
$ sed -r 's/[^0-9]+://g' file
941 980
78 78
252 258
3008 3064

Related

Is there a way to use the cut command in BASH to print specific columns but with characters?

I know I can use -f1 to print a column, but is there a way for the cut to look through columns for a specific string and print out that column?
Not entirely clear if this is what you're looking for, but:
$ cat input
Length,Color,Height,Weight,Size
1,2,1,4,5
7,7,1,7,7
$ awk 'NR==1{for(i=1;i<=NF+1;i++) if($i==h) break; next} {print $i}' h=Color FS=, input
2
7
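A slightly more defensive variant of the same idea, as a sketch: it records the matching column number from the header line and stops with an error if the header is not found. The function name column_by_name and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not from the answer above): print the column whose
# header equals $2 in file $1; $3 is the delimiter and defaults to a comma.
column_by_name() {
    awk -F "${3:-,}" -v h="$2" '
        NR == 1 {
            # Scan the header line for the requested name.
            for (i = 1; i <= NF; i++)
                if ($i == h) { col = i; break }
            if (!col) { print "no such column: " h > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }
    ' "$1"
}

# Sample data from the example above, written to a temporary file.
input=$(mktemp)
printf '%s\n' 'Length,Color,Height,Weight,Size' '1,2,1,4,5' '7,7,1,7,7' > "$input"

colors=$(column_by_name "$input" Color)
printf '%s\n' "$colors"

rm -f "$input"
```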
You can work out the column number with a small function like this:
function select_column() {
    file="$1"
    sep="$2"
    col_name="$3"
    # get the separators before the field:
    separators=$(head -n 1 "${file}" | sed -e "s/\(.*${sep}\|^\)${col_name}\(${sep}.*\|$\)/\1/g" | tr -d -c "${sep}")
    # add one, because before the n-th field there are n-1 separators
    ((field_no=${#separators}+1))
    # now just call cut and skip the first row by using tail -n +2
    cut -d "${sep}" -f "${field_no}" "${file}" | tail -n +2
}
When called with:
select_column testfile.csv "," subno
it outputs:
10
76
55
83
30
53
67
25
52
16
57
86
2
75
28
on the following testfile.csv:
rand2,no,subno,rand1
john,8017610,10,96
ringo,5673276,76,42
ringo,9260555,55,19
john,7565683,83,72
ringo,8833230,30,35
paul,1571553,53,55
john,9972467,67,80
ringo,922025,25,88
paul,9908052,52,1
john,6264216,16,19
paul,4350857,57,3
paul,7253386,86,50
john,3426002,2,57
ringo,1437775,75,85
paul,4384228,28,77

Creating a CSV file from a text file

I have a text file in the format below
2015-04-21
00:21:00
5637
5694
12
2015-04-21
00:23:00
5637
5694
12
I want to create a CSV file like this:
2015-04-21,00:21:00,5637,5694,12
2015-04-21,00:23:00,5637,5694,12
I used tr and sed like this:
cat file | tr '\n' ',' | sed 's/,$//'
It produces this:
2015-04-21,00:21:00,5637,5694,12,2015-04-21,00:23:00,5637,5694,12
but there is no newline after every fifth column.
Please suggest a solution.
Use awk like so:
awk 'ORS=NR%5 ? "," : "\n"'
$ cat test.txt
2015-04-21
00:21:00
5637
5694
12
2015-04-21
00:23:00
5637
5694
12
$ awk 'ORS=NR%5 ? "," : "\n"' test.txt
2015-04-21,00:21:00,5637,5694,12
2015-04-21,00:23:00,5637,5694,12
Explanation:
ORS stands for output record separator
NR is the number of the current record (line)
NR % 5 - % is the modulo operator. If the result is zero (every 5th record), use a line feed. Otherwise, use a comma
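The same modulo logic can be written more explicitly, with the group size passed in as a variable. This is a sketch only; the variable name n and the temporary file are assumptions made for illustration.

```shell
# Sample input from the question: one value per line, five values per record.
tmp=$(mktemp)
printf '%s\n' 2015-04-21 00:21:00 5637 5694 12 \
              2015-04-21 00:23:00 5637 5694 12 > "$tmp"

# Print each line followed by a comma, except every n-th line,
# which is followed by a newline instead.
csv=$(awk -v n=5 '{ printf "%s%s", $0, (NR % n ? "," : "\n") }' "$tmp")
printf '%s\n' "$csv"

rm -f "$tmp"
```

Passing a different value for n groups the input into rows of that many columns.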
a simple solution in python
fin = open('file','r')
fout = open('outputfile','w')
a=[]
i=0
for line in fin:
    a.append(line.rstrip())
    i+=1
    if i==5:
        fout.write(','.join(a)+'\n')
        a=[]
        i=0
fin.close()
fout.close()

awk: find minimum and maximum in column

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(Similar commands return the correct minimum and maximum for the second column.)
Could someone tell me where I went wrong? Thank you!
Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`
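The difference between the two kinds of comparison can be seen in isolation with a small sketch. The variable names as_string and as_number are illustrative only.

```shell
# String constants are compared character by character, so "10" < "4" is true.
as_string=$(awk 'BEGIN { a = "10"; b = "4"; print ((a < b) ? "10 < 4" : "10 >= 4") }')

# Adding 0 converts both operands to numbers, so the comparison is numeric.
as_number=$(awk 'BEGIN { a = "10"; b = "4"; print ((a + 0 < b + 0) ? "10 < 4" : "10 >= 4") }')

echo "string comparison:  $as_string"
echo "numeric comparison: $as_number"
```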
a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit too clever. tee duplicates its stdin stream to the files named as arguments, and it also streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be achieved (with less flourish) by extracting the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
    read line
    echo "min=$line"
    while read line; do max=$line; done
    echo "max=$max"
}
Your problem was simply that your script had:
if ($1<a) a=$1 fi
That final fi is not part of awk syntax, so awk treats it as an (unset) variable, and a=$1 fi becomes a string concatenation. You are therefore TELLING awk that a contains a string, not a number, which is why $1<a is compared as strings instead of numerically.
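The effect of the stray fi can be reproduced on three values. This is a minimal sketch; the variable names data, with_fi and without_fi are illustrative.

```shell
# Three values in an order similar to the question's first column.
data=$(printf '9\n10\n4\n')

# Original script: "a=$1 fi" concatenates $1 with the unset variable fi,
# so a becomes a string and later comparisons are string comparisons.
with_fi=$(printf '%s\n' "$data" | awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}')

# Same script without the stray fi: the comparisons stay numeric.
without_fi=$(printf '%s\n' "$data" | awk 'BEGIN{a=1000}{if ($1<a) a=$1} END{print a}')

echo "with fi:    min=$with_fi"
echo "without fi: min=$without_fi"
```

With the fi in place the script reports 10 as the minimum, as in the question; without it the result is 4.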
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.
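The same seeding idea can be wrapped in a shell function that takes the column number as an argument. This is a sketch under assumptions: the name col_min_max and its argument order are made up for illustration, and adding 0 forces each field to be treated as a number.

```shell
# Hypothetical helper: print "min max" for column $2 (default 1) of file $1.
col_min_max() {
    awk -v c="${2:-1}" '
        # Seed both values from the first record, then update as needed.
        NR == 1 || $c + 0 < min { min = $c + 0 }
        NR == 1 || $c + 0 > max { max = $c + 0 }
        END { if (NR) print min, max; else print "NaN", "NaN" }
    ' "$1"
}

# A few made-up rows in the same shape as the question's data.
sample=$(mktemp)
printf '%s\n' '9 30 8.5' '10 31 -2.7' '4 30 47.1' '12 31 9.6' > "$sample"

first_col=$(col_min_max "$sample" 1)
third_col=$(col_min_max "$sample" 3)
echo "column 1: $first_col"
echo "column 3: $third_col"

rm -f "$sample"
```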
Late, but here is a shorter command that seeds from the first value instead of an assumed initial value:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d\n",Min,Max}' FileName.dat
A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, tail to get the desired output as shown above.
(PS: If you want to extract the first column, or any other column, you can use the cut command, e.g. cut -d " " -f 1 sample.dat for the first column.)
# minimum
cat your_data_file.dat | sort -nk3,3 | head -1
# this prints the line with the minimum of column 3
# maximum
cat your_data_file.dat | sort -nk3,3 | tail -1
# this prints the line with the maximum of column 3
# for column 2, use -nk2,2
# assign to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

How to get the second column from command output?

My command's output is something like:
1540 "A B"
6 "C"
119 "D"
The first column is always a number, followed by a space, then a double-quoted string.
My purpose is to get the second column only, like:
"A B"
"C"
"D"
I intended to use <some_command> | awk '{print $2}' to accomplish this. But the question is, some values in the second column contain space(s), which happens to be the default delimiter for awk to separate the fields. Therefore, the output is messed up:
"A
"C"
"D"
How do I get the second column's value (with paired quotes) cleanly?
Use -F [field separator] to split the lines on "s:
awk -F '"' '{print $2}' your_input_file
or for input from pipe
<some_command> | awk -F '"' '{print $2}'
output:
A B
C
D
If you could use something other than 'awk' , then try this instead
echo '1540 "A B"' | cut -d' ' -f2-
-d sets the delimiter and -f selects the fields; -f2- selects everything from the 2nd field to the end of the line.
This should work to get a specific column out of the command output "docker images":
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 16.04 12543ced0f6f 10 months ago 122 MB
ubuntu latest 12543ced0f6f 10 months ago 122 MB
selenium/standalone-firefox-debug 2.53.0 9f3bab6e046f 12 months ago 613 MB
selenium/node-firefox-debug 2.53.0 d82f2ab74db7 12 months ago 613 MB
docker images | awk '{print $3}'
IMAGE
12543ced0f6f
12543ced0f6f
9f3bab6e046f
d82f2ab74db7
This is going to print the third column
Or use sed & regex.
<some_command> | sed 's/^.* \(".*"$\)/\1/'
You don't need awk for that. Using read in the Bash shell should be enough, e.g.
some_command | while read -r c1 c2; do echo "$c2"; done
or:
while read -r c1 c2; do echo "$c2"; done < in.txt
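This works because read assigns the whole remainder of the line, spaces included, to the last variable named. A small sketch on the question's sample output (the variable names num, rest and second_col are illustrative):

```shell
# Feed the three sample lines through the loop; "num" takes the first word
# and "rest" takes everything after it, so embedded spaces survive.
second_col=$(printf '%s\n' '1540 "A B"' '6 "C"' '119 "D"' |
    while read -r num rest; do
        printf '%s\n' "$rest"
    done)

printf '%s\n' "$second_col"
```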
If you have GNU awk this is the solution you want:
$ awk '{print $1}' FPAT='"[^"]+"' file
"A B"
"C"
"D"
awk -F"|" '{gsub(/\"/,"|");print "\""$2"\""}' your_file
#!/usr/bin/python
import sys
col = int(sys.argv[1]) - 1
for line in sys.stdin:
    columns = line.split()
    try:
        print(columns[col])
    except IndexError:
        # ignore
        pass
Then, supposing you name the script co, run something like this to get the sizes of files (the example assumes Linux, but the script itself is OS-independent):
ls -lh | co 5

bash process data from two files

file1:
456
445
2323
file2:
433
456
323
I want to get the difference between the values in the two files, line by line, and write it to output.txt, that is:
23
-11
2000
How do I do this? Thank you.
$ paste file1 file2 | awk '{ print $1 - $2 }'
23
-11
2000
Use paste to create the formulae, and use bc to perform the calculations:
paste -d - file1 file2 | bc
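To see what bc receives, the output of paste can be inspected on its own. This is a sketch using temporary copies of the two sample files; the variable name expressions is illustrative.

```shell
# Recreate the two sample files from the question.
file1=$(mktemp); file2=$(mktemp)
printf '%s\n' 456 445 2323 > "$file1"
printf '%s\n' 433 456 323  > "$file2"

# paste joins corresponding lines with "-" in between, producing one
# subtraction expression per line; these are what bc would evaluate.
expressions=$(paste -d - "$file1" "$file2")
printf '%s\n' "$expressions"

rm -f "$file1" "$file2"
```

One thing to watch for: a negative number in the second file would produce a doubled minus sign, which bc may not parse as a subtraction.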
In pure bash, with no external tools:
while read -u 4 line1 && read -u 5 line2; do
    printf '%s\n' "$(( line1 - line2 ))"
done 4<file1 5<file2
This works by opening both files (attaching them to file descriptors 4 and 5); going into a loop in which we read one line from each descriptor per iteration (exiting the loop if either has no value), and calculate and print the result.
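For shells without read -u, the same loop can be sketched with plain descriptor redirection on each read. The temporary files and the variable names a, b and differences are assumptions made for illustration.

```shell
# Recreate the two sample files from the question.
file1=$(mktemp); file2=$(mktemp)
printf '%s\n' 456 445 2323 > "$file1"
printf '%s\n' 433 456 323  > "$file2"

# Read one line from each descriptor per iteration; stop when either runs out.
differences=$(
    while read -r a <&4 && read -r b <&5; do
        printf '%s\n' "$((a - b))"
    done 4<"$file1" 5<"$file2"
)
printf '%s\n' "$differences"

rm -f "$file1" "$file2"
```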
You could use paste and awk to operate between columns:
paste -d" " file1 file2 | awk -F" " '{print ($1-$2)}'
Or even pipe to a file:
paste -d" " file1 file2 | awk -F" " '{print ($1-$2)}' > output.txt
Hope it helps!
