adding numbers without grep -c option - bash

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.

A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
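To illustrate the field splitting in isolation, here is the first sample line run through the same switch (output shown beneath the command):
$ echo 'Peugeot:406:1999:Silver:1' | awk -F: '{ print $1, $4, $5 }'
Peugeot Silver 1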
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.

Methods:
1.) Use some scripting language for the counting, like awk or Perl. An awk solution is already posted; here is a Perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print:
4
2.) The second method is based on shell pipelining:
filter out the right rows
extract the column with the count
sum the numbers
Some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
grep filters out the matching rows
cut extracts the last column
paste joins the numbers into a line like 2+2, which
bc then evaluates
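To illustrate, here is the intermediate output of each stage with the sample file from the question:
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5
2
2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+
2+2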
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
here sed both filters and extracts in one step
Another example, with a different way of counting:
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
The numbers are summed by the RPN calculator dc. In RPN the values come first and the operation last, e.g. 0 2 +.
the first echo pushes 0 onto the stack
the sed creates a stream of numbers like 2+ 2+
the final echo p prints the result
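A quick illustration of the RPN idea on its own (assuming dc is available):
$ echo '0 2 + 2 + p' | dc
4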
Many other possibilities exist for summing a stream of numbers.
E.g. summing in bash:
sum=0
while read -r num
do
sum=$(( sum + num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
echo $sum
And in pure bash:
sum=0
while IFS=: read -r maker model year color count
do
if [[ "$maker" == "Ford" && "$color" == "Red" ]]
then
(( sum += $count ))
fi
done < carfile
echo $sum

Related

Combine expressions and parameter expansion in bash

Is it possible to combine parameter expansion with arithmetic expressions in bash? For example, could I do a one-liner to evaluate lineNum or numChar here?
echo "Some lines here
Here is another
Oh look! Yet another" > $1
lineNum=$( grep -n -m1 'Oh look!' $1 | cut -d : -f 1 ) #Get line number of "Oh look!"
(( lineNum-- )) # Correct for array indexing
readarray -t lines < $1
substr=${lines[lineNum]%%Y*} # Get the substring "Oh look! "
numChar=${#substr} # Get the number of characters in the substring
(( numChar -= 2 )) # Get the position of "!" based on the position of "Y"
echo $lineNum
echo $numChar
2
8
In other words, can I get the position of one character in a string based on the position of another in a one-line expression?
As far as for getting position of ! in a line that matches Oh look! regex, just:
awk -F'!' '/Oh look!/{ print length($1) + 1; exit }' "$file"
You can also adjust the calculation to your liking; with your original code I think that would be:
awk -F':' '/^[[:space:]][A-Z]/{ print length($1) - 2; exit }' "$file"
Is it possible to combine parameter expansion with arithmetic expressions in bash?
For computing ${#substr} you have to have the substring. So you could:
substr=${lines[lineNum-1]%%Y*}; numChar=$((${#substr} - 2))
You could also edit your grep and have the filtering from Y done by bash, but awk is going to be magnitudes faster:
IFS=Y read -r line _ < <(grep -m1 'Oh look!' "$file")
numChar=$((${#line} - 2))
Still you could merge the 3 lines into just:
numChar=$(( $(<<<${lines[lineNum - 1]%%Y*} wc -c) - 1))
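To see the expansion-inside-arithmetic idiom in isolation, a minimal sketch using the matched line from the question:
line="Oh look! Yet another"
substr=${line%%Y*}          # parameter expansion: strip everything from the first "Y"
echo $(( ${#substr} - 1 ))  # arithmetic on the expansion; prints 8, the 1-based position of "!"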

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the file with ranges and then, for every line, look for all values that fall into the range in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
cat ~/blahblah/${line[0]}_file.val | while read number;
#get tag name and cat the appropriate file
do
if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
#check if current value fall into range
then
echo $number >> ${line[0]}.output
#toss the value that fall into interval to another file
elif [[ "$number" -gt "${line[2]}" ]]
then break
fi
done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
<${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version is slightly different in two ways: the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
Another, not so efficient, implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
{for(k in t)
if(t[k]==FILENAME) {
inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
print > inout;
next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names; otherwise you would have to add tag extraction from the filenames. The suffix ".in" marks values that fall in a range and ".out" those that don't. The result depends on the sorted order of the files. If you have thousands of tag files, adding another layer to filter out the ranges per tag will speed it up; as written, it iterates over all ranges.
I'd write
while read -u3 -r tag start end; do
f="${tag}_file.val"
if [[ -r $f ]]; then
while read -u4 -r num; do
(( start <= num && num <= end )) && echo "$num"
done 4< "$f"
fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
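For what it's worth, the pitfall the explicit descriptors avoid looks like this (a minimal sketch):
while read -r tag start end; do
# this bare read shares the outer loop's stdin, so it consumes
# the next line of ranges.val rather than a line of the tag file
read -r num
done < ranges.val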
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution.
Here's a GNU awk answer (it requires, I believe, a fairly recent version):
gawk '
#load "filefuncs"
function read_file(tag, start, end, file, number, statdata) {
file = tag "_file.val"
if (stat(file, statdata) != -1) {
while ((getline number < file) > 0) {
if (start <= number && number <= end) print number
}
}
}
{read_file($1, $2, $3)}
' ranges.val
And in Perl:
perl -Mautodie -ane '
$file = $F[0] . "_file.val";
next unless -r $file;
open $fh, "<", $file;
while ($num = <$fh>) {
print $num if $F[1] <= $num and $num <= $F[2]
}
close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format, called .bed, is used to describe ranges on chromosomes, but it should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool that might help you is intersect.
With it installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
#reformatting your positions to .bed format:
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer one-liners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

Cut column by column name in bash

I want to specify a column by name (e.g. 102), find the position of this column and then use something like cut -f1-5,7- with the found position to delete the specified column.
This is my file header (delim = "\t"):
#CHROM POS 1 100 101 102 103 107 108
This awk should work:
awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' file
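To then actually delete the found column, one way is to capture the position first and hand it to cut (a sketch assuming GNU cut, which supports --complement):
p=$(awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c) {print i; exit}}' file)
cut -f"$p" --complement file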
Here's one possible solution without the restriction that only one column is to be removed. It is written as a bash function, where the first argument is the filename, and the remaining arguments are the columns to exclude.
rmcol() {
local file=$1
shift
cut -f$(head -n1 "$file" | tr \\t \\n | grep -vFxn "${@/#/-e}" |
cut -d: -f1 | paste -sd,) "$file"
}
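Hypothetical usage, with the header from the question:
rmcol file 102 107    # prints every column except 102 and 107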
If you want to select rather than exclude the named columns, then change -vFxn to -Fxn.
That almost certainly requires some sort of explanation. The first two lines of the function just removes the filename from the arguments and stores it for later use. The cut command will then select the appropriate columns; the column numbers are computed with the complicated pipeline which follows:
head -n1 "$file" | # Take the first line of the file
tr \\t \\n | # Change all the tabs to newlines [ Note 1]
grep # Select all lines (i.e. column names) which
-v # don't match
-F # the literal string
-x # which is the complete line
-n # and include the line number in the output
"${@/#/-e}" | # Put -e at the beginning of each command line argument,
# converting the arguments into grep pattern arguments (-e)
cut -d: -f1 | # Select only the line number from each matching line
paste -sd, # Paste together all the line numbers, separated with commas.
Using a for loop in bash:
C=1; for i in $(head file -n 1) ; do if [ $i == "102" ] ; then break ; else C=$(( $C + 1 )) ; fi ; done ; echo $C
And a full script
C=1
for i in $(head in_file -n 1) ; do
echo $i
if [ $i == "102" ] ; then
break ;
else
echo $C
C=$(( $C + 1 ))
fi
done
cut -f1-$(($C-1)),$(($C+1))- in_file
Trying a solution without looping through columns, I get:
#!/bin/bash
pick="$1"
titles="pos 1 100 102 105"
tmp=" $titles "
tmp="${tmp%% $pick* }"
tmp=($tmp)
echo "column ${#tmp[#]}"
It suffers from incorrectly reporting last column if column name can't be found.
Try this small awk utility to cut specific headers - https://github.com/rohitprajapati/toyeca-cutter
Example usage -
awk -f toyeca-cutter.awk -v c="col1, col2, col3, col4" my_file.csv

Replacing numbers with SED

I'm trying to replace numbers from -20 to 30 using sed, but it adds a "v" character. What's wrong?
For example: SINR=-18, output must be "c", but output is "vc".
I tried to delete the 1st character, but it returns 1 instead of j.
SINR=`curl -s http://10.0.0.1/status | awk '/3GPP.SINR=/ {print $0}' | awk -F "3GPP.SINR=" '{print $2}'` # returns number
echo $SINR | sed "s/-20/a/;s/-19/b/;s/-18/c/;s/-17/d/;s/-16/e/;s/-15/f/;s/-14/g/;s/-13/h/;s/-12/i/;s/-11/j/;s/-10/k/;s/-9/l/;s/-8/m/;s/-7/n/;s/-6/o/;s/-5/p/;s/-4/q/;s/-3/r/;s/-2/s/;s/-1/t/;s/0/u/;s/1/v/;s/2/w/;s/3/x/;s/4/y/;s/5/z/;s/6/A/;s/7/B/;s/8/C/;s/9/D/;s/10/E/;s/11/F/;s/12/G/;s/13/H/;s/14/I/;s/15/J/;s/16/K/;s/17/L/;s/18/M/;s/19/N/;s/20/O/;s/21/P/;s/22/Q/;s/23/R/;s/24/S/;s/25/T/;s/26/U/;s/27/V/;s/28/W/;s/29/X/;s/30/Y/"
This way would be more elegant and less error-prone:
echo $SINR | awk 'BEGIN { chars="abcdefg" } { print substr(chars, $1 + 21, 1) }'
Of course, chars should contain all the letters you need for the mapping, that is, all the way until ...VWXY as in your example; I just wrote up to g to keep it short and sweet.
With this solution your problem disappears.
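For completeness, here is a sketch with the full mapping string for -20..30 (51 letters: a-z, then A-Y):
echo $SINR | awk 'BEGIN { chars="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY" } { print substr(chars, $1 + 21, 1) }'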
You don't really need sed or awk if you have bash like you say you do. You can use arrays, which is maybe even less error-prone ;-)
map=({a..z} {A..Z}) # Create map of your characters
SINR=-18 # Set your SINR number to something
SINR=$(($SINR+20)) # Add an offset to get to right place
result=${map[$SINR]} # Lookup your result
echo $result # Print it
c
If you have a mapping process, you're surely better off building a case statement, a couple of ifs, or even using bash associative arrays (bash >= 4.0). For example, you could tackle your problem with the following snippet:
function mapper() {
if [[ $1 -ge -20 && $1 -le 5 ]]; then
printf \\$(printf '%03o' $(( $1 + 117 )) )
elif [[ $1 -ge 6 && $1 -le 30 ]]; then
printf \\$(printf '%03o' $(( $1 + 59 )) )
else
echo ""; return 1
fi
return 0
}
And use like below:
$ mapper -20
a
$ mapper 5
z
$ mapper 6
A
$ mapper 30
Y
$ mapper $SINR
c
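And a sketch of the associative-array route mentioned above (assumes bash >= 4.0):
declare -A map
i=-20
for ch in {a..z} {A..Y}; do   # 51 letters cover -20..30
map[$i]=$ch
i=$(( i + 1 ))
done
echo "${map[-18]}"   # prints c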
echo "${SINR}" | sed 's/-20/a/;t;s/-19/b/;t;s/-18/c/;t;s/-17/d/;t;s/-16/e/;t;s/-15/f/;t;s/-14/g/;t;s/-13/h/;t;s/-12/i/;t;s/-11/j/;t;s/-10/k/;t;s/-9/l/;t;s/-8/m/;t;s/-7/n/;t;s/-6/o/;t;s/-5/p/;t;s/-4/q/;t;s/-3/r/;t;s/-2/s/;t;s/-1/t/;t;s/0/u/;t;s/1/v/;t;s/2/w/;t;s/3/x/;t;s/4/y/;t;s/5/z/;t;s/6/A/;t;s/7/B/;t;s/8/C/;t;s/9/D/;t;s/10/E/;t;s/11/F/;t;s/12/G/;t;s/13/H/;t;s/14/I/;t;s/15/J/;t;s/16/K/;t;s/17/L/;t;s/18/M/;t;s/19/N/;t;s/20/O/;t;s/21/P/;t;s/22/Q/;t;s/23/R/;t;s/24/S/;t;s/25/T/;t;s/26/U/;t;s/27/V/;t;s/28/W/;t;s/29/X/;t;s/30/Y/'
The t after each s// branches to the end of the script as soon as a substitution has succeeded, so later rules cannot rewrite an already-substituted result; it also speeds things up a bit.
vc should normally not occur if SINR is just a number as specified.

BASH script - print sorted contents from all files in directory with no rep's

In the current directory there are files with names of the form "gradesXXX" (where XXX is a course number) which look like this:
ID GRADE (this line is not contained in the files)
123456789 56
213495873 84
098342362 77
. .
. .
. .
I want to write a BASH script that prints all the IDs that have a grade above a certain number, which is given as the first parameter to said script.
The requirements are that an ID must be printed once at most, and that no intermediate files are used.
I was guided to use two scripts: the first one line long, and the second up to six lines long (not including the "#!" line).
I'm quite lost with this one so any suggestions will be appreciated.
Cheers.
The answer I was looking for was
The internal script:
#!/bin/bash
while read line; do
line_split=( $line )
if (( ${line_split[1]} > $1 )); then
echo ${line_split[0]}
fi
done
The external script:
#!/bin/bash
cat grades* | sort -r -n -k 1 | ./internalScript "$1" | cut -f1 -d" " | uniq
OK, a simple solution.
cat grades[0-9][0-9][0-9] | sort -nrk 2 | while read ID GRADE ; do if [ $GRADE -lt 60 ] ; then break ; fi ; echo $ID ; done | sort -u
I'm not sure why two scripts should be necessary. All in a script:
#!/bin/bash
threshold=$1
cat grades[0-9][0-9][0-9] | sort -nrk 2 | while read ID GRADE ; do if [ $GRADE -lt $threshold ] ; then break ; fi ; echo $ID ; done | sort -u
We first cat all the grade files, then sort them by grade in reverse order. The while loop breaks as soon as a grade is below the threshold, so that only lines with higher grades get their ID printed. sort -u makes sure that every ID is sent only once.
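With the three sample rows in a single grades file, the sorted stream the loop consumes would look like this (a sketch, threshold 60):
$ cat grades[0-9][0-9][0-9] | sort -nrk 2
213495873 84
098342362 77
123456789 56
The loop echoes the first two IDs and breaks as soon as it sees 56.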
You can use awk:
awk '{ if ($2 > 70) print $1 }' grades777
It prints the first column of every line whose second column is greater than 70. If you need to change the threshold:
N=71
awk '{ if ($2 > '$N') print $1 }' grades777
The single quotes are required to splice shell variables into the awk program. To work with all grades??? files in the current directory and remove duplicated lines:
awk '{ if ($2 > '$N') print $1 }' grades??? | sort -u
A simple one-line solution.
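For what it's worth, awk's -v option passes the shell variable in without the quote splicing:
awk -v n="$N" '$2 > n { print $1 }' grades??? | sort -u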
Yet another solution:
cat grades[0-9][0-9][0-9] | awk -v MAX=70 '{ if ($2 > MAX) foo[$1]=1 }END{for (id in foo) print id }'
Append | sort -n after that if you want the IDs in sorted order.
In pure bash:
N=60
for file in /path/*; do
while read id grade; do ((grade > N)) && echo "$id"; done < "$file"
done
OUTPUT
213495873
098342362
