BASH - Tell if duplicate lines exist (y/n)

I am writing a script to manipulate a text file.
The first thing I want to do is check whether duplicate entries exist and, if so, ask the user whether they want to keep or remove them.
I know how to display duplicate lines if they exist, but what I want to learn is just to get a yes/no answer to the question "Do duplicates exist?"
It seems uniq returns 0 whether or not duplicates were found, as long as the command completed without issues.
What is that command that I can put in an if-statement just to tell me if duplicate lines exist?
My file is very simple, it is just values in single column.

I'd probably use awk to do this but, for the sake of variety, here is a brief pipe to accomplish the same thing:
$ { sort | uniq -d | grep . -qc; } < noduplicates.txt; echo $?
1
$ { sort | uniq -d | grep . -qc; } < duplicates.txt; echo $?
0
sort + uniq -d make sure that only duplicate lines (which don't have to be adjacent) get printed to stdout. grep . -c counts those lines, emulating wc -l, with the useful side effect that it returns 1 when it matches nothing (i.e. a zero count). -q silences the output so the line count isn't printed, letting you use it quietly in your script.
has_duplicates()
{
    {
        sort | uniq -d | grep . -qc
    } < "$1"
}
if has_duplicates myfile.txt; then
    echo "myfile.txt has duplicate lines"
else
    echo "myfile.txt has no duplicate lines"
fi

You can use awk combined with the boolean || operator:
# Ask the question if awk found a duplicate
awk 'a[$0]++{exit 1}' test.txt || (
    echo -n "remove duplicates? [y/n] "
    read answer
    # Remove duplicates if the answer was "y". I'm using `[`, the shorthand
    # for the test command. Check `help [`
    [ "$answer" == "y" ] && uniq test.txt > test.uniq.txt
)
The block after the || will only get executed if the awk command returns 1, meaning it found duplicates.
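One caveat worth flagging (my note, not part of the original answer): uniq only removes adjacent duplicates, so the cleanup step above assumes test.txt is sorted. An order-preserving sketch for unsorted input:
# print each line only the first time it is seen, so non-adjacent
# duplicates are dropped while the original order is preserved
awk '!a[$0]++' test.txt > test.uniq.txt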
However, for a basic understanding, I'll also show an example using an if block:
awk 'a[$0]++{exit 1}' test.txt
# $? contains the return value of the last command
if [ $? != 0 ] ; then
    echo -n "remove duplicates? [y/n] "
    read answer
    # check the answer
    if [ "$answer" == "y" ] ; then
        uniq test.txt > test.uniq.txt
    fi
fi
However, the [] are not just brackets as in other programming languages. [ is a synonym for the test bash builtin command, and ] is its last argument. You need to read help [ in order to understand the details.
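To make that concrete (my illustration, not from the original answer), these two lines are equivalent:
# `[` is an ordinary command; `]` is literally its final argument
test "$answer" = "y" && echo "answered yes"
[ "$answer" = "y" ] && echo "answered yes"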

A quick bash solution:
#!/bin/bash
INPUT_FILE=words
declare -A a
while read -r line ; do
    [ "${a[$line]}" = 'nonempty' ] && duplicates=yes && break
    a[$line]=nonempty
done < "$INPUT_FILE"

[ "$duplicates" = yes ] && echo -n "Keep duplicates? [Y/n]" && read keepDuplicates

removeDuplicates() {
    sort -u "$INPUT_FILE" > "$INPUT_FILE.tmp"
    mv "$INPUT_FILE.tmp" "$INPUT_FILE"
}

[ "$keepDuplicates" != "Y" ] && removeDuplicates
The script reads the INPUT_FILE line by line and stores each line as a key in the associative array a, with the string nonempty as its value. Before storing a value, it first checks whether the key is already there; if it is, a duplicate has been found, so it sets the duplicates flag and breaks out of the loop.
Afterwards it checks whether the flag is set and asks the user whether to keep the duplicates. If the answer is anything other than Y, it calls the removeDuplicates function, which uses sort -u to remove the duplicates. ${a[$line]} evaluates to the value of the associative array a for the key $line. [ "$duplicates" = yes ] is the bash builtin test syntax; if the test succeeds, whatever follows && is evaluated.
But note that the awk solutions will likely be faster, so you may want to use them if you expect to process bigger files.

You can do uniq=yes/no using this awk one-liner:
awk '!seen[$0]{seen[$0]++; i++} END{print (NR>i)?"no":"yes"}' file
awk keeps an array of unique lines called seen.
Every time we put a new element in seen we increment a counter, i++.
Finally, in the END block, we compare the total number of records with the number of unique records: (NR>i).
If the condition is true, there are duplicate records and we print no; otherwise it prints yes.
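If you want the answer inside an if statement, a small wrapper works (my sketch, not from the original answer):
# branch on the one-liner's yes/no output
if [ "$(awk '!seen[$0]{seen[$0]++; i++} END{print (NR>i)?"no":"yes"}' file)" = "no" ]; then
    echo "file contains duplicates"
fi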

Related

Search for value and print something if found (BASH)

I have the following list:
COX1
COX1
COX1
COX1
COX1
Cu-oxidase
Cu-oxidase_3
Cu-oxidase_3
Fer4_NifH
and I want to search whether COX1 and Cu-oxidase are in the list; if so, I want to print xyz. If Cu-oxidase_3 and Fer4_NifH are in the list too (independent of whether the first two are in the list), then it should print abc.
This is what I could script so far:
if grep 'COX1' file.txt; then echo xyz; else exit 0; fi
but it is of course incomplete.
Any solution to that?
ideally my output would be:
xyz
abc
Awk lets you easily search for multiple regular expressions and print something other than the matched string itself. (grep can easily search for multiple patterns too, but it will print the match or its line number or file name, not some arbitrary string.)
The following assumes that you have a single token per line. This assumption makes the script really simple, though it would also not be hard to support other scenarios.
awk '{ a[$1]++ }
    END { if (("COX1" in a) && ("Cu-oxidase" in a)) print "xyz";
          if (("Cu-oxidase_3" in a) && ("Fer4_NifH" in a)) print "abc" }' file.txt
This builds an associative array of each token (actually the first whitespace-separated token on each line) and then at the end, when it has read every line in the file, checks whether the sought tokens exist as keys in the array.
Performing a single pass over the input file is a big win, especially if you have a large input file and many patterns. Just for completeness, the syntax for performing multiple passes with grep is very straightforward:
if grep -qx 'COX1' file.txt && grep -qx 'Cu-oxidase' file.txt
then
    echo xyz
fi
which can be further abbreviated to
grep -qx 'COX1' file.txt && grep -qx 'Cu-oxidase' file.txt && echo xyz
Notice the -x switch to require the whole line to match (otherwise the regex 'Cu-oxidase' would also match on the Cu-oxidase_3 lines).
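To see why -x matters with the sample list above (my illustration):
# without -x, the shorter name also matches the Cu-oxidase_3 lines
grep -c 'Cu-oxidase' file.txt     # 3: Cu-oxidase plus both Cu-oxidase_3 lines
grep -cx 'Cu-oxidase' file.txt    # 1: only the exact line Cu-oxidase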
The above is a very verbose way to achieve this. There are ways to write the same thing with fewer ifs and fewer greps, but I really wanted to show you the logic:
you run a grep command, check its return value with $?, and finally act on the conditions.
# default values
HAS_COX1=0
HAS_CUOX=0
HAS_CUO3=0
HAS_FER4=0

# run grep silently
grep -q 'COX1' file.txt
# check the return value and set the variable accordingly
if [ $? -eq 0 ]; then HAS_COX1=1; fi

# same as above
grep -q 'Cu-oxidase' file.txt
if [ $? -eq 0 ]; then HAS_CUOX=1; fi

grep -q 'Cu-oxidase_3' file.txt
if [ $? -eq 0 ]; then HAS_CUO3=1; fi

grep -q 'Fer4_NifH' file.txt
if [ $? -eq 0 ]; then HAS_FER4=1; fi

if [ $HAS_COX1 -eq 1 ]; then
    if [ $HAS_CUOX -eq 1 ]; then
        echo 'xyz'
        exit 0
    fi
fi

if [ $HAS_CUO3 -eq 1 ]; then
    if [ $HAS_FER4 -eq 1 ]; then
        echo 'abc'
        exit 0
    fi
fi

echo 'None of the checks were matched'
exit 1
Beware: this code is untested, so there might be bugs ☺
The code isn't perfect: it cannot print both 'xyz' and 'abc' when both conditions are met (but that would be an easy fix with the syntax I provided). Also, $HAS_CUOX will be set to 1 whenever $HAS_CUO3 is found (there is no boundary checking in the grep regex).
You could take that code further by using a single grep for each set of conditions to check, using something like 'COX1\|Cu-oxidase' as the regex for grep, and also fix the minor issues mentioned above.
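A sketch of that single-grep idea (my addition; it assumes one token per line): alternation matches either pattern, so to require both you can count the distinct exact-line matches.
# both tokens are present exactly when two distinct lines match exactly
[ "$(grep -Ex 'COX1|Cu-oxidase' file.txt | sort -u | wc -l)" -eq 2 ] && echo xyz
[ "$(grep -Ex 'Cu-oxidase_3|Fer4_NifH' file.txt | sort -u | wc -l)" -eq 2 ] && echo abc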
ideally my output would be:
xyz
abc
You added your expected output after I wrote the above script, but given the elements I gave you, you should be able to figure out how to improve it (basically remove the exit 0 lines where I placed them, and exit 1 when no output has been given).
Or just remove all exits as a dirty solution.

Bash Script: While loop and if statement

I am writing a script to filter GIDs greater than 1000 and GIDs less than or equal to 1000. The purpose is to separate local groups from non-local groups (groups coming from AD) in a file.
There is a file called groups.out which contains group names and GIDs, in any order. Below is a sample of the file, which contains local groups, non-local groups, and GIDs:
1098052
1098051
domain users
fuse
gdm
haldaemon
and here is the logic I want to apply
Read line by line from the file,
if the line is a number then check
if number greater than or equal to 1000 then check
if greater than or equal to 1000, append it to the file
else if number less than 1000 then dump it
else if an error occurs append error to file and break the loop and exit
if the line is a string then check the gid of the string/group
if number greater than or equal to 1000 then append to file
else if gid less than 1000 then dump it
else if error occurs append error to file and break the loop and exit
I want to repeat this in the loop line by line, and if an error occurs anywhere, the loop should break and the entire script should exit.
After successful execution of the loop it should print success; if any error occurs, it should exit and append the errors to the file.
Below is my uncooked code with many parts missing. There are many errors in it as well (e.g. with -gt or -eq), so you can ignore those.
fileA="groups.out"
value=1000
re='[a-z]'
num='[0-9]'
while IFS= read lineA
do
    group=$(getent group "$lineA" | awk -F: '{print $3}')
    # ------ Don't know how to check if a number or string -----
    if [ "$group" -gt "$value" ]; then
        echo "$lineA" >> ldapgroups.out 2>> error.out
    elif [ "$group" -lt "$value" ]; then
        echo "$lineA" >> /dev/null 2>> error.out
    else
        echo " FAILED"
        exit 1
    fi
#!/bin/bash
fileA="groups.out"
value=1000
num='^[0-9]+$'

while IFS= read -r lineA
do
    # check if the line is numbers only
    if [[ $lineA =~ $num ]]; then
        echo "This is a number"
        echo "$lineA"
        # check if $lineA is greater than 1000
        if [[ $lineA -gt $value ]]; then
            # write it to a file named numbers.out
            echo "number is greater than 1000, writing to file"
            echo "$lineA" >> numbers.out
        else
            echo "less than, skipping"
        fi
    # if it's not a number, it's a group name, so no need to check with the regex
    else
        # do whatever you want with group names here ...
        echo "string"
        echo "$lineA"
    fi
# This is where you feed the file to the while loop
done < "$fileA"
Here is a corrected version of your script; it should get you going.
chmod +x scriptfile and use bash scriptfile to run it, or schedule it in crontab.
Since your information about how to match group names with GIDs isn't sufficient, I left that out of the script, but you should be able to finish it with the information provided in its other parts.
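For the name-to-GID matching that was left out, something along these lines could slot into the loop's else branch (my sketch; it assumes names resolve via getent, as in the question's own code):
# look up the GID for a group name and test it against the threshold
gid=$(getent group "$lineA" | awk -F: '{print $3}')
if [ -n "$gid" ] && [ "$gid" -ge 1000 ]; then
    echo "$lineA" >> ldapgroups.out
fi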
This really looks like you want two separate scripts. Finding numbers in a particular range is simple with Awk.
awk '!/[^0-9]/ && ($1 >= 1000)' groups.out
The regular expression selects all-numeric input lines (or more properly, it excludes lines which contain a non-numeric character anywhere within them) and the numeric comparison requires the first field to be 1000 or more. (The default action of Awk is to print the entire line when the conditions in your script are true, so we can omit the implicit {print} action).
If you also want to extract the numbers which are less than 1000 to a separate file, the change should be obvious.
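For instance (my addition), flipping the comparison collects the local GIDs instead:
# complementary selection: all-numeric lines below the threshold
awk '!/[^0-9]/ && ($1 < 1000)' groups.out > local_gids.out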
For the non-numeric values, we can do
grep '[^0-9]' groups.out |
xargs getent group |
awk -F : '$3 >= 1000 { print $3 }'
Several of the branches in your pseudocode seem superfluous. It's not clear in what situation you would expect an error to occur, or how the action you specify in the error situation would help you diagnose or recover from the error (write access denied, disk full?), so I have not spent any energy on trying to implement those parts.

using head -n within if condition

I am currently trying to evaluate txt files in a directory using bash. I want to know whether the third line of a txt file matches a certain string. The file starts with two empty lines, then the target string. I tested the following one-liner:
if [[ $(head -n 3 a_txt_file.txt) == "target_string" ]]; then echo yes; else echo no; fi
I can imagine that since head -n 3 also prints out the two empty lines, I have to add them to the if condition. But "\n\ntarget_string" and "\n\ntarget_string\n" also don't work.
How would one do this correctly (And I guess it can be done more elegantly as well)?
Try this instead - it will print only the third line:
sed -n 3p file.txt
If you want to stick with head, take the first three lines and drop the top two with tail:
head -n 3 file.txt | tail -1
You'll want to use sed instead of head. This gets the third line, tests if it matches, and then you can do whatever you want with it if it does match.
if [[ $(sed '3q;d' test_text.txt ) == "target_string" ]]; then echo yes; else echo no; fi
Besides sed, you can try awk to print the 3rd line:
awk 'NR==3' file.txt
A pure bash solution:
if { read; read; read line; } < test_text.txt
   [[ $line = target_string ]]
then
    echo yes
else
    echo no
fi
This takes advantage of the fact that the condition of an if statement can be a sequence of commands. The first two reads discard the empty lines; the third read sets line to the 3rd line. After that, you can test it against the target string.
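An equivalent form using mapfile, assuming bash 4+ (my sketch, not from the original answer):
# read at most 3 lines into an array; index 2 holds the third line
mapfile -t -n 3 lines < test_text.txt
if [[ ${lines[2]} == target_string ]]; then echo yes; else echo no; fi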

How to sort recursively by maximum file size and count files?

I'm a beginner in bash programming. I want to display the first $1 results of sorting the files under /etc/* by size. The problem is that at the end of the search, I must know how many directories and files were processed.
I composed the following code:
#!/bash/bin
let countF=0;
let countD=0;
for file in $(du -sk /etc/* |sort +0n | head $1); do
    if [ -f "file" ] then
        echo $file;
        let countF=countF+1;
    else if [ -d "file" ] then
        let countD=countD+1;
    fi
done
echo $countF
echo $countD
I get errors at execution. How do I use find with du, since I must search recursively?
#!/bin/bash                      # directory and program reversed
let countF=0                     # semicolon not needed (several more places)
let countD=0
while read -r file; do
    if [ -f "$file" ]; then      # missing dollar sign and semicolon
        echo "$file"
        let countF=countF+1      # could also be: let countF++
    elif [ -d "$file" ]; then    # elif, not "else if"; missing dollar sign and semicolon
        let countD=countD+1
    fi
done < <(du -sk /etc/* | sort +0n | head $1)    # see below
echo $countF
echo $countD
Changing the loop from a for to a while allows it to work properly in case filenames contain spaces.
I'm not sure what version of sort you have, but I'll take your word for it that the argument is correct.
It's #!/bin/bash not #!/bash/bin.
I don't know what that argument to sort is supposed to be. Maybe you meant sort -r -n?
Your use of head is wrong. Giving head file arguments causes it to ignore its standard input, so in general it's an error to both pipe something to head and give it a file argument. Besides that, "$1" refers to the script's first argument. Did you mean head -n 1, or were you trying to make the number of lines processed configurable from an argument to the script: head -n"$1"?
In your if tests, you're not referencing your loop variable: it should read "$file", not "file".
Not that the bash parser cares, but you should try to indent sanely.
I tried /etc/* instead of the file variable but I don't see a result. The idea is to sort all files from the directory and its subdirectories by size and display the first $1 results ordered by file size. In the process I must know how many files and directories the searched directory contains.
Ruby (1.9+)
#!/usr/bin/env ruby
fc = 0
dc = 0
a = Dir["/etc/*"].inject([]) do |x, f|
  fc += 1 if File.file?(f)
  dc += 1 if File.directory?(f)
  x << f
end
puts a.sort
puts "number of files: #{fc}"
puts "number of directories: #{dc}"

How can I get my bash script to work?

My bash script doesn't work the way I want it to:
#!/bin/bash
total="0"
count="0"
#FILE="$1"    This is the easier way
for FILE in $*
do
    # Start processing all processable files
    while read line
    do
        if [[ "$line" =~ ^Total ]]; then
            tmp=$(echo $line | cut -d':' -f2)
            count=$(expr $count + 1)
            total=$(expr $total + $tmp)
        fi
    done < $FILE
done
echo "The Total Is: $total"
echo "$FILE"
Is there another way to modify this script so that it reads arguments into $1 instead of $FILE? I've tried using a while loop:
while [ $1 != "" ]
do ....
done
Also, when I implement that, the code repeats itself. Is there a way to fix that as well?
Another problem I'm having is that when I have multiple files matching hi*.txt it gives me duplicates. Why? I have files like hi1.txt and hi1.txt~, but the tilde file is 0 bytes, so my script shouldn't be finding anything in it.
What I have is fine, but it could be improved. I appreciate your awk suggestions, but they're currently beyond my level as a unix programmer.
Strager: The files that my text editor generates automatically contain nothing; they are 0 bytes. But yeah, I went ahead and deleted them just to be sure. But no, my script is in fact reading everything twice. I suppose it's looping again when it really shouldn't. I've tried to silence that action with exit commands, but wasn't successful.
while [ "$1" != "" ]; do
# Code here
# Next argument
shift
done
This code is pretty sweet, but I'm specifying all the possible files at one time. Example: hi[145].txt, if supplied, would read all three files at once.
Suppose the user enters hi*.txt;
I then get all my hi files read twice and then added again.
How can I code it so that it reads my files (just once) upon specification of hi*.txt?
I really think this is because of not having $1.
It looks like you are trying to add up the totals from the lines labelled 'Total:' in the files provided. It is always a good idea to state what you're trying to do - as well as how you're trying to do it (see How to Ask Questions the Smart Way).
If so, then you're doing it in about as complicated a way as I can see. What was wrong with:
grep -h '^Total:' "$@" |
cut -d: -f2 |
awk '{sum += $1}
     END { print sum }'
This doesn't print out "The total is" etc; and it is not clear why you echo $FILE at the end of your version.
You can use Perl or any other suitable program in place of awk; you could do the whole job in Perl or Python - indeed, the cut work could be done by awk:
grep "^Total:" "$#" |
awk -F: '{sum += $2}
END { print sum }'
Taken still further, the whole job could be done by awk:
awk -F: '$1 ~ /^Total/ { sum += $2 }
         END { print sum }' "$@"
The code in Perl wouldn't be much harder and the result might be quicker:
perl -na -F: -e '$sum += $F[1] if m/^Total:/; END { print $sum; }' "$@"
When iterating over the file name arguments provided in a shell script, you should use '"$@"' in place of '$*' as the latter notation does not preserve spaces in file names.
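A quick illustration of the difference (my addition; assume the script is invoked as ./script "a b" c):
for f in "$@"; do echo "$f"; done    # two iterations: 'a b' and 'c'
for f in $*;   do echo "$f"; done    # three iterations: 'a', 'b' and 'c'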
Your comment about '$1' is confusing to me. You could be asking to read from the file whose name is in $1 on each iteration; that is done using:
while [ $# -gt 0 ]
do
    ...process $1...
    shift
done
HTH!
If you define a function, it'll receive the argument as $1. Why is $1 more valuable to you than $FILE, though?
#!/bin/sh
process() {
    echo "doing something with $1"
}

for i in "$@"    # Note use of "$@" to not break on filenames with whitespace
do
    process "$i"
done
while [ "$1" != "" ]; do
# Code here
# Next argument
shift
done
On your problem with tilde files: those are temporary backup files created by your text editor. Delete them if you don't want them to be matched by your glob expression (wildcard). Otherwise, filter them out in your script (not recommended).
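If you do want to filter them in the script anyway, a minimal sketch (my addition):
for FILE in "$@"
do
    case $FILE in
        *~) continue ;;    # skip editor backup files
    esac
    # ... process "$FILE" ...
done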
