Bash : How to check in a file if there are any word duplicates - bash

I have a file with 6 character words in every line and I want to check if there are any duplicate words. I did the following but something isn't right:
#!/bin/bash
while read line
do
name=$line
d=$( grep '$name' chain.txt | wc -w )
if [ $d -gt '1' ]; then
echo $d $name
fi
done <$1

Assuming each word is on a new line, you can achieve this without looping:
$ cat chain.txt | sort | uniq -c | grep -v " 1 " | cut -c9-

You can use awk for that:
awk -F'\n' 'found[$1] {print}; {found[$1]++}' chain.txt
Set the field separator to newline, so that we look at the whole line. Then, if the line already exists in the array found, print the line. Finally, add the line to the found array.
Note: If a line will only be suppressed once, so if the same line appears, say, 6 times, it will be printed 5 times.

Related

Removing current line of a file

I'm facing something that looks easy, but can't find the answer :
The goal of this function is to remove all the line that contains 3 commas ',' :
while read line; do
COUNT=$(echo $line | grep -o "\," | wc -)
if [ $COUNT -ne 3 ]; then
remove line
fi
done < tmp.txt
I dont find how to remove current line, can you help me ?
I extract this tmp.txt from a larger with grep, if it was in a variable instead of a tmp.txt will it be the same ?
while read line; do
COUNT=$(echo $line | grep -o "\," | wc -)
COUNT=$(echo $line | grep -o "\," | wc -)
if [ $COUNT -ne 3 ]; then
remove line
fi
done <<< "$toto"
Thanks in advance
Using sed command only solution.
sed '/^\([^,]*,\)\{3\}[^,]*$/d' infile
Delete all those line which character comma , occurred exactly 3 times.
Or using awk:
awk -F, 'NF!=4' infile
Or both read from a variable.
sed '/^\([^,]*,\)\{3\}[^,]*$/d' <<<"$variable"
awk -F, 'NF!=4' <<<"$variable"
A simple awk solution
awk 'gsub(/,/,",")!=3' file
gsub replaces the pattern with the specified string and it returns the number of substitutions/replacements made.
We are replacing , with , here and thus gsub will return us the number of , in the string.
Example :
Input file
hello this line has 1 ,
This line, has, 3 ,
This line, has, 4 , commas , Thanks
Output
$ awk 'gsub(/,/,",")!=3' file
hello this line has 1 ,
This line, has, 4 , commas , Thanks
I would have done it in the other way :
while read line; do
COUNT=$(echo $line | grep -o "\," | wc -)
if [ $COUNT -eq 3 ]; then
echo $line >> $tempofile
fi
done < tmp.txt
If the line is matched, keep it, otherwise get to next line.
This simple command can remove all the lines that contains 3
$ awk '!/3/' file_name

Bash Shell: Infinite Loop

The problem is the following I have a file that each line has this form:
id|lastName|firstName|gender|birthday|joinDate|IP|browser
i want to sort alphabetically all the firstnames in that file and print them one on each line but each name only once
i have created the following program but for some reason it creates an infinite loop:
array1=()
while read LINE
do
if [ ${LINE:0:1} != '#' ]
then
IFS="|"
array=($LINE)
if [[ "${array1[#]}" != "${array[2]}" ]]
then
array1+=("${array[2]}")
fi
fi
done < $3
echo ${array1[#]} | awk 'BEGIN{RS=" ";} {print $1}' | sort
NOTES
if [ ${LINE:0:1} != '#' ] : this command is used because there are comments in the file that i dont want to print
$3 : filename
array1 : is used for all the seperate names
Wow, there's a MUCH simpler and cleaner way to achieve this, without having to mess with the IFS variable or using arrays. You can use "for" to do this:
First I created a file with the same structure as yours:
$ cat file
id|lastName|Douglas|gender|birthday|joinDate|IP|browser
id|lastName|Tim|gender|birthday|joinDate|IP|browser
id|lastName|Andrew|gender|birthday|joinDate|IP|browser
id|lastName|Sasha|gender|birthday|joinDate|IP|browser
#id|lastName|Carly|gender|birthday|joinDate|IP|browser
id|lastName|Madson|gender|birthday|joinDate|IP|browser
Here's the script I wrote using "for":
#!/bin/bash
for LINE in `cat file | grep -v "^#" | awk -F'|' '{print$3}' | sort -u`
do
echo $LINE
done
And here's the output of this script:
$ ./script.sh
Andrew
Douglas
Madson
Sasha
Tim
Explanation:
for LINE in `cat file`
Creates a loop that reads each line of "file". The commands between ` are run by linux, for example, if you wanted to store the date inside of a variable you could use "VARDATE=`date`".
grep -v "^#"
The option -v is used to exclude results matching the pattern, in this case the pattern is "^#". The "^" character means "line begins with". So grep -v "^#" means "exclude lines beginning with #".
awk -F'|' '{print$3}'
The -F option switches the column delimiter from the default (the default is a space) to whatever you put between ' after it, in this case the "|" character.
The '{print$3}' prints the 3rd column.
sort -u
And the "sort -u" command to sort the names alphabetically.

Cut column by column name in bash

I want to specify a column by name (i.e. 102), find the position of this column and then use something like cut -5,7- with the found position to delete the specified column.
This is my file header (delim = "\t"):
#CHROM POS 1 100 101 102 103 107 108
This awk should work:
awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' file
Here's one possible solution without the restriction that only one column is to be removed. It is written as a bash function, where the first argument is the filename, and the remaining arguments are the columns to exclude.
rmcol() {
local file=$1
shift
cut -f$(head -n1 "$file" | tr \\t \\n | grep -vFxn "${#/#/-e}" |
cut -d: -f1 | paste -sd,) "$file"
}
If you want to select rather than exclude the named columns, then change -vFxn to -Fxn.
That almost certainly requires some sort of explanation. The first two lines of the function just removes the filename from the arguments and stores it for later use. The cut command will then select the appropriate columns; the column numbers are computed with the complicated pipeline which follows:
head -n1 "$file" | # Take the first line of the file
tr \\t \\n | # Change all the tabs to newlines [ Note 1]
grep # Select all lines (i.e. column names) which
-v # don't match
F # the literal string
x # which is the complete line
n # and include the line number in the output
"${#/#/-e}" | # Put -e at the beginning of each command line argument,
# converting the arguments into grep pattern arguments (-e)
cut -d: -f1 | # Select only the line number from that matches
paste -sd, # Paste together all the line numbers, separated with commas.
Using a for loop in bash:
C=1; for i in $(head file -n 1) ; do if [ $i == "102" ] ; then break ; else C=$(( $C + 1 )) ; fi ; done ; echo $C
And a full script
C=1
for i in $(head in_file -n 1) ; do
echo $i
if [ $i == "102" ] ; then
break ;
else
echo $C
C=$(( $C + 1 ))
fi
done
cut -f1-$(($C-1)),$(($C+1))- in_file
trying a solution without looping through columns, I get:
#!/bin/bash
pick="$1"
titles="pos 1 100 102 105"
tmp=" $titles "
tmp="${tmp%% $pick* }"
tmp=($tmp)
echo "column ${#tmp[#]}"
It suffers from incorrectly reporting last column if column name can't be found.
Try this small awk utility to cut specific headers - https://github.com/rohitprajapati/toyeca-cutter
Example usage -
awk -f toyeca-cutter.awk -v c="col1, col2, col3, col4" my_file.csv

Correctly count number of lines a bash variable

I need to count the number of lines of a given variable. For example I need to find how many lines VAR has, where VAR=$(git log -n 10 --format="%s").
I tried with echo "$VAR" | wc -l), which indeed works, but if VAR is empty, is prints 1, which is wrong. Is there a workaround for this? Something better than using an if clause to check whether the variable is empty...(maybe add a line and subtract 1 from the returned value?).
The wc counts the number of newline chars. You can use grep -c '^' for counting lines.
You can see the difference with:
#!/bin/bash
count_it() {
echo "Variablie contains $2: ==>$1<=="
echo -n 'grep:'; echo -n "$1" | grep -c '^'
echo -n 'wc :'; echo -n "$1" | wc -l
echo
}
VAR=''
count_it "$VAR" "empty variable"
VAR='one line'
count_it "$VAR" "one line without \n at the end"
VAR='line1
'
count_it "$VAR" "one line with \n at the end"
VAR='line1
line2'
count_it "$VAR" "two lines without \n at the end"
VAR='line1
line2
'
count_it "$VAR" "two lines with \n at the end"
what produces:
Variablie contains empty variable: ==><==
grep:0
wc : 0
Variablie contains one line without \n at the end: ==>one line<==
grep:1
wc : 0
Variablie contains one line with \n at the end: ==>line1
<==
grep:1
wc : 1
Variablie contains two lines without \n at the end: ==>line1
line2<==
grep:2
wc : 1
Variablie contains two lines with \n at the end: ==>line1
line2
<==
grep:2
wc : 2
You can always write it conditionally:
[ -n "$VAR" ] && echo "$VAR" | wc -l || echo 0
This will check whether $VAR has contents and act accordingly.
For a pure bash solution: instead of putting the output of the git command into a variable (which, arguably, is ugly), put it in an array, one line per field:
mapfile -t ary < <(git log -n 10 --format="%s")
Then you only need to count the number of fields in the array ary:
echo "${#ary[#]}"
This design will also make your life simpler if, e.g., you need to retrieve the 5th commit message:
echo "${ary[4]}"
try:
echo "$VAR" | grep ^ | wc -l

Count mutiple occurences of a word on the same line using grep

Here I made a small script that take input from user searching some pattern from a file and displays required no of lines from that file where the pattern is found. Although this code is searching the pattern line wise due to standard grep practice. I mean if the pattern occurs twice on the same line, i want the output to print twice. Hope I make some sense.
#!/bin/sh
cat /dev/null>copy.txt
echo "Please enter the sentence you want to search:"
read "inputVar"
echo "Please enter the name of the file in which you want to search:"
read "inputFileName"
echo "Please enter the number of lines you want to copy:"
read "inputLineNumber"
[[-z "$inputLineNumber"]] || inputLineNumber=20
cat /dev/null > copy.txt
for N in `grep -n $inputVar $inputFileName | cut -d ":" -f1`
do
LIMIT=`expr $N + $inputLineNumber`
sed -n $N,${LIMIT}p $inputFileName >> copy.txt
echo "-----------------------" >> copy.txt
done
cat copy.txt
As I understood, the task is to count number of pattern occurrences in line. It can be done like so:
count=$((`echo "$line" | sed -e "s|$pattern|\n|g" | wc -l` - 1))
Suppose you have one file to read. Then, code will be following:
#!/bin/bash
file=$1
pattern="an."
#reading file line by line
cat -n $file | while read input
do
#storing line to $tmp
tmp=`echo $input | grep "$pattern"`
#counting occurrences count
count=$((`echo "$tmp" | sed -e "s|$pattern|\n|g" | wc -l` - 1))
#printing $tmp line $count times
for i in `seq 1 $count`
do
echo $tmp
done
done
I checked this for pattern "an." and input:
I pass here an example of many 'an' letters
an
ananas
an-an-as
Output is:
$ ./test.sh input
1 I pass here an example of many 'an' letters
1 I pass here an example of many 'an' letters
1 I pass here an example of many 'an' letters
3 ananas
4 an-an-as
4 an-an-as
Adapt this to your needs.
How about using awk?
Assume the pattern you are searching for is in variable $pattern and the file you are checking is $file
The
count=`awk 'BEGIN{n=0}{n+=split($0,a,"'$pattern'")-1}END {print n}' $file`
or for a line
count=`echo $line | awk '{n=split($0,a,"'$pattern'")-1;print n}`

Resources