Cut column by column name in bash

I want to specify a column by name (e.g. 102), find the position of this column, and then use something like cut -f-5,7- with the found position to delete the specified column.
This is my file header (delim = "\t"):
#CHROM POS 1 100 101 102 103 107 108

This awk should work:
awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' file
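If the goal is to drop the found column rather than print it, the same header scan can emit just the position, which you then hand to cut. A minimal sketch, assuming GNU cut for the --complement option (with a non-GNU cut you would build a -f-5,7- style list from the position instead):

pos=$(awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c) {print i; exit}}' file)
cut --complement -f"$pos" file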

Here's one possible solution without the restriction that only one column is to be removed. It is written as a bash function, where the first argument is the filename, and the remaining arguments are the columns to exclude.
rmcol() {
local file=$1
shift
cut -f$(head -n1 "$file" | tr \\t \\n | grep -vFxn "${@/#/-e}" |
cut -d: -f1 | paste -sd,) "$file"
}
If you want to select rather than exclude the named columns, then change -vFxn to -Fxn.
That almost certainly requires some sort of explanation. The first two lines of the function just remove the filename from the arguments and store it for later use. The cut command then selects the appropriate columns; the column numbers are computed with the complicated pipeline which follows:
head -n1 "$file" | # Take the first line of the file
tr \\t \\n | # Change all the tabs to newlines (the shell reduces \\t to \t for tr)
grep # Select all lines (i.e. column names) which
-v # don't match
F # the literal string
x # which is the complete line
n # and include the line number in the output
"${#/#/-e}" | # Put -e at the beginning of each command line argument,
# converting the arguments into grep pattern arguments (-e)
cut -d: -f1 | # Select only the line numbers from those matches
paste -sd, # Paste together all the line numbers, separated with commas.
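For example, assuming the data with the header from the question is in a tab-separated file called file, removing the columns named 102 and 107 would be:

rmcol file 102 107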

Using a for loop in bash:
C=1; for i in $(head file -n 1) ; do if [ $i == "102" ] ; then break ; else C=$(( $C + 1 )) ; fi ; done ; echo $C
And as a full script:
C=1
for i in $(head in_file -n 1) ; do
echo $i
if [ $i == "102" ] ; then
break ;
else
echo $C
C=$(( $C + 1 ))
fi
done
cut -f1-$(($C-1)),$(($C+1))- in_file

Trying a solution without looping through the columns, I get:
#!/bin/bash
pick="$1"
titles="pos 1 100 102 105"
tmp=" $titles "
tmp="${tmp%% $pick* }"
tmp=($tmp)
echo "column ${#tmp[#]}"
It suffers from incorrectly reporting the last column if the column name can't be found.

Try this small awk utility to cut specific headers - https://github.com/rohitprajapati/toyeca-cutter
Example usage -
awk -f toyeca-cutter.awk -v c="col1, col2, col3, col4" my_file.csv

Related

Accept filename as argument and calculate repeated words along with count

I need to find the number of repeated characters in a text file, and I need to pass the filename as an argument.
Example:
test.txt contains:
Zoom
Output should be like:
z 1
o 2
m 1
I need a command that accepts a filename as an argument and then lists the character counts for that file. In my example I have test.txt, which contains the word "Zoom", so the output should show how many times each letter is repeated.
My attempt:
vi test.sh
#!/bin/bash
FILE="$1" --to pass filename as argument
sort file1.txt | uniq -c --to count the number of letters
Just a guess?
cat test.txt |
tr '[:upper:]' '[:lower:]' |
fold -w 1 |
sort |
uniq -c |
awk '{print $2, $1}'
m 1
o 2
z 1
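Here tr lowercases the input, fold -w 1 puts one character per line, sort groups identical characters together, uniq -c counts each group, and the final awk just swaps the count and the character.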
Here is an awk script that counts all kinds of characters:
awk '
BEGIN{FS = ""} # make each char a field
{
for (i = 1; i <= NF; i++) { # iterate over all fields in the line
++charsArr[$i]; # count each field occurrence in the array
}
}
END {
for (char in charsArr) { # iterate over the chars array
printf("%3d %s\n", charsArr[char], char); # print the count and the char
}
}' input.1.txt | sort -n
Or in one line:
awk '{for(i=1;i<=NF;i++)++arr[$i]}END{for(char in arr)printf("%3d %s\n",arr[char],char)}' FS="" input.1.txt|sort -n
#!/bin/bash
#get the argument for further processing
inputfile="$1"
#check if file exists
if [ -f "$inputfile" ]
then
#convert file to a usable format
#convert all characters to lowercase
#put each character on a new line
#output to temporary file
cat "$inputfile" | tr '[:upper:]' '[:lower:]' | sed -e 's/\(.\)/\1\n/g' > tmp.txt
#loop over every character from a-z
for char in {a..z}
do
#count how many times a character occurs
count=$(grep -c "$char" tmp.txt)
#print if count > 0
if [ "$count" -gt "0" ]
then
echo -e "$char" "$count"
fi
done
rm tmp.txt
else
echo "file not found!"
exit 1
fi
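Saved as test.sh (the name used in the question) and run as bash test.sh test.txt, this prints the non-zero counts for the example input:
m 1
o 2
z 1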

Bash : How to check in a file if there are any word duplicates

I have a file with 6 character words in every line and I want to check if there are any duplicate words. I did the following but something isn't right:
#!/bin/bash
while read line
do
name=$line
d=$( grep '$name' chain.txt | wc -w )
if [ $d -gt '1' ]; then
echo $d $name
fi
done <$1
Assuming each word is on a new line, you can achieve this without looping:
$ cat chain.txt | sort | uniq -c | grep -v " 1 " | cut -c9-
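Here uniq -c prefixes every distinct line with its count, grep -v " 1 " drops the words that occur exactly once, and cut -c9- strips the count column (GNU uniq pads the count to 7 characters plus a space, so the word itself starts at column 9).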
You can use awk for that:
awk -F'\n' 'found[$1] {print}; {found[$1]++}' chain.txt
Set the field separator to newline, so that we look at the whole line. Then, if the line already exists in the array found, print the line. Finally, add the line to the found array.
Note: a line is only suppressed once, so if the same line appears, say, 6 times, it will be printed 5 times.
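If you would rather report each duplicated word only once, together with how often it occurs (closer to what the original loop was aiming for), a small sketch along the same lines:

awk -F'\n' '{count[$1]++} END {for (w in count) if (count[w] > 1) print count[w], w}' chain.txt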

Reading a file in a shell script and selecting a section of the line

This is probably pretty basic. I want to read in an occurrence file.
Then the program should find all occurrences of "CallTilEdb" in the file Hendelse.logg:
CallTilEdb 8
CallCustomer 9
CallTilEdb 4
CustomerChk 10
CustomerChk 15
CallTilEdb 16
and sum up the right column. For this case it would be 8 + 4 + 16, so the output I would want would be 28.
I'm not sure how to do this, and this is as far as I have gotten with vistid.sh:
#!/bin/bash
declare -t filename=hendelse.logg
declare -t occurance="$1"
declare -i sumTime=0
while read -r line
do
if [ "$occurance" = $(cut -f1 line) ] #line 10
then
sumTime+=$(cut -f2 line)
fi
done < "$filename"
so the execution in terminal would be
vistid.sh CallTilEdb
but the error I get now is:
/home/user/bin/vistid.sh: line 10: [: unary operator expected
You have a nice approach, but maybe you could use awk to do the same thing... much faster!
$ awk -v par="CallTilEdb" '$1==par {sum+=$2} END {print sum+0}' hendelse.logg
28
It may look a bit weird if you haven't used awk so far, but here is what it does:
-v par="CallTilEdb" provides an argument to awk, so that we can use par as a variable in the script. You could also do -v par="$1" if you want to use a variable provided to the script as a parameter.
$1==par {sum+=$2} this means: if the first field is the same as the content of the variable par, then add the second column's value into the counter sum.
END {print sum+0} this means: once you are done from processing the file, print the content of sum. The +0 makes awk print 0 in case sum was not set... that is, if nothing was found.
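Put together as a script, a minimal sketch of the -v par="$1" variant mentioned above (with the file name hard-coded, as in the question):

#!/bin/bash
# vistid.sh - sum the second column for the name given as the first argument
awk -v par="$1" '$1==par {sum+=$2} END {print sum+0}' hendelse.logg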
In case you really want to do it in bash, you can use read with two variables, so that you don't need cut to handle the values, together with some arithmetic to sum them:
#!/bin/bash
declare -t filename=hendelse.logg
declare -t occurance="$1"
declare -i sumTime=0
while read -r name value # read both values with -r for safety
do
if [ "$occurance" == "$name" ]; then # string comparison
((sumTime+=$value)) # sum
fi
done < "$filename"
echo "sum: $sumTime"
So that it works like this:
$ ./vistid.sh CallTilEdb
sum: 28
$ ./vistid.sh CustomerChk
sum: 25
First of all, you need to change the way you call cut:
$( echo $line | cut -f1 )
In line 10 you are missing the evaluation:
if [ "$occurance" = $( echo $line | cut -f1 ) ]
you can then sum by doing:
sumTime=$[ $sumTime + $( echo $line | cut -f2 ) ]
But you can also use a different approach and put the line values in an array; the final script will look like:
#!/bin/bash
declare -t filename=prova
declare -t occurance="$1"
declare -i sumTime=0
while read -a line
do
if [ "$occurance" = ${line[0]} ]
then
sumTime=$[ $sumTime + ${line[1]} ]
fi
done < "$filename"
echo $sumTime
For reference,
id="CallTilEdb"
file="Hendelse.logg"
sum=$(echo "0 $(sed -n "s/^$id[^0-9]*\([0-9]*\)/\1 +/p" < "$file") p" | dc)
echo SUM: $sum
prints
SUM: 28
the sed extracts the numbers from the lines containing the given id, such as CallTilEdb,
and prints them in the format number +
the echo prepares a string such as 0 8 + 4 + 16 + p, which is the calculation in RPN format
the dc does the calculation
another variant:
sum=$(sed -n "s/^$id[^0-9]*\([0-9]*\)/\1/p" < "$file" | paste -sd+ - | bc)
#or
sum=$(grep -oP "^$id\D*\K\d+" < "$file" | paste -sd+ - | bc)
the sed (or the grep) extracts and prints only the numbers
the paste make a string like number + number + number (-d+ is a delimiter)
the bc do the calculation
or perl
sum=$(perl -slanE '$s+=$F[1] if /^$id/}{say $s' -- -id="$id" "$file")
sum=$(ID="CallTilEdb" perl -lanE '$s+=$F[1] if /^$ENV{ID}/}{say $s' "$file")
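(The }{ in these one-liners is a golfing idiom: -n wraps the code in a while(<>){...} loop, so }{ closes that loop and opens a block that runs once after all input has been read, which is where $s is printed.)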
The awk approach translated into a script:
#!/bin/bash
declare -t filename=hendelse.logg
declare -t occurance="$1"
declare -i sumTime=0
sumTime=$(awk -v entry="$occurance" '
$1==entry{time+=$NF+0}
END{print time+0}' "$filename")
echo "sum: $sumTime"

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use a scripting language for counting, like awk or perl. An awk solution is already posted; here is a perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell pipelining:
filter out the right rows
extract the column with the count
sum the numbers
Some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
the grep filters out the right rows
the cut gets the last column
the paste creates a line like 2+2, which is then evaluated by
the bc
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
the sed filters and extracts in a single step
another example - different way of counting
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
the numbers are summed by the RPN calculator called dc, e.g. it works like 0 2 + - the values come first and the operation comes last
the first echo pushes 0 onto the stack
the sed creates a stream of numbers like 2+ 2+
the last echo p prints the top of the stack (the sum)
There are many other possibilities for summing a stream of numbers,
e.g. summing in bash:
sum=0
while read -r num
do
sum=$(( $sum + $num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
echo $sum
and pure bash:
while IFS=: read -r maker model year color count
do
if [[ "$maker" == "Ford" && "$color" == "Red" ]]
then
(( sum += $count ))
fi
done < carfile
echo $sum

BASH script - print sorted contents from all files in directory with no rep's

In the current directory there are files with names of the form "gradesXXX" (where XXX is a course number) which look like this:
ID GRADE (this line is not contained in the files)
123456789 56
213495873 84
098342362 77
. .
. .
. .
I want to write a BASH script that prints all the IDs that have a grade above a certain number, which is given as the first parameter to said script.
The requirements are that an ID must be printed once at most, and that no intermediate files are used.
I was guided to use two scripts: the first one line long, and the second up to six lines long (not including the "#!" line).
I'm quite lost with this one so any suggestions will be appreciated.
Cheers.
The answer I was looking for was:
# internal script
#!/bin/bash
while read line; do
line_split=( $line )
if (( ${line_split[1]} > $1 )); then
echo ${line_split[0]}
fi
done
# external script
#!/bin/bash
cat grades* | sort -r -n -k 1 | internalScript $1 | cut -f1 -d" " | uniq
OK, a simple solution.
cat grades[0-9][0-9][0-9] | sort -nurk 2 | while read ID GRADE ; do if [ $GRADE -lt 60 ] ; then break ; fi ; echo $ID ; done | sort -u
I'm not sure why two scripts should be necessary. All in a script:
#!/bin/bash
threshold=$1
cat grades[0-9][0-9][0-9] | sort -nurk 2 | while read ID GRADE ; do if [ $GRADE -lt $threshold ] ; then break ; fi ; echo $ID ; done | sort -u
We first cat all the grade files, then sort them by grade in reverse order. The while loop breaks if the grade is below the threshold, so that only lines with higher grades get their ID printed. sort -u makes sure that every ID is sent only once.
You can use awk:
awk '{ if ($2 > 70) print $1 }' grades777
It prints the first column of every line whose second column is greater than 70. If you need to change the threshold:
N=71
awk '{ if ($2 > '$N') print $1 }' grades777
The single quotes are required to splice shell variables into the awk program. To work with all grades??? files in the current directory and remove duplicated lines:
awk '{ if ($2 > '$N') print $1 }' grades??? | sort -u
A simple one-line solution.
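The same thing can also be written with awk's -v option (as used in the other awk answers here), which avoids splicing the shell variable into the program text:

awk -v n="$N" '$2 > n { print $1 }' grades??? | sort -u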
Yet another solution:
cat grades[0-9][0-9][0-9] | awk -v MAX=70 '{ if ($2 > MAX) foo[$1]=1 }END{for (id in foo) print id }'
Append | sort -n after that if you want the IDs in sorted order.
In pure bash:
N=60
for file in /path/*; do
while read id grade; do ((grade > N)) && echo "$id"; done < "$file"
done
OUTPUT
213495873
098342362
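Note that this prints an ID once per qualifying line, so an ID that appears in more than one file can be printed more than once. Piping the whole loop through sort -u, as the other answers do, satisfies the "printed once at most" requirement:

N=60
for file in /path/*; do
while read id grade; do ((grade > N)) && echo "$id"; done < "$file"
done | sort -u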
