I need to filter only duplicated lines from many files using bash [duplicate] - bash

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 2 years ago.
I have the following three files
filea
a
bc
cde
fileb
a
bc
cde
frtdff
filec
a
bc
cddeeer
erer34
I am able to filter by the duplicated lines from these three files.
I am using the following command
ls file* | wc -l
which returns 3. Then, I am launching
sort file* | uniq --count --repeated | awk '{ if ($1 == 3) { print $2} }'
The last command returns precisely what I need, but only as long as I do not create more files starting with "file".
However, thousands of files may be created while the script is running, so I need to retrieve the exact number of files with this command
n=`ls file* | wc -l`
sort file* | uniq --count --repeated | awk '{ if ($1 == $n) { print $2} }'
Unfortunately, the variable n is not accepted inside the if condition of the awk command.
My issue is that I cannot use the value of n as a comparison criterion inside an if condition within the awk command.

You can use:
awk '!line[$0]++' file*
This prints each distinct line only once, even if it appears in several files or several times in the same file.
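If the goal is instead the lines that occur in every file, one option is to pass the shell's file count to awk with -v, as the linked duplicate suggests. The following is only a sketch under stated assumptions: the scratch directory and sample files are made up to mirror the question, no line is repeated within a single file, and lines contain no whitespace (awk prints only $2).

```shell
# Hypothetical scratch directory holding the three sample files from the question.
workdir=$(mktemp -d)
printf '%s\n' a bc cde > "$workdir/filea"
printf '%s\n' a bc cde frtdff > "$workdir/fileb"
printf '%s\n' a bc cddeeer erer34 > "$workdir/filec"

# Count the files through the shell's own glob expansion instead of parsing ls.
set -- "$workdir"/file*
n=$#

# -v copies the shell value into an awk variable that is also called n.
# Inside single quotes, $n would be read by awk as "field number n", not the shell variable.
common=$(sort "$@" | uniq -c | awk -v n="$n" '$1 == n { print $2 }')
printf '%s\n' "$common"
```

With these sample files the command prints a and bc, the two lines present in all three files.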

Related

Bash unable to assign value [duplicate]

This question already has answers here:
How do I set a variable to the output of a command in Bash?
(15 answers)
Closed 1 year ago.
Got a bit of a problem. I'm new to bash and I'm trying my hardest, but I can't figure out how to assign the desired output to the variable. Running this command in the prompt
wc -l data.txt | awk '{ print $1 }'
yields the result 12, which is desired. However if I put it in the test.sh file, it won't work. I've tried different quotations, but all I've managed to get is the entire line as a string...
Test.sh
#! /bin/bash
# Count lines for data.txt files
data1=wc -l data.txt | awk '{ print $1 }'
echo "Lines in data.txt: $data1"
exit
I think you want:
data1="$(wc -l data.txt | awk '{ print $1 }')"
The $() syntax causes bash to execute that expression and replace it with the results.
Actually, powershell does allow you to do a straight = assignment like you did...
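A minimal sketch of the command substitution described above, using a made-up temporary file in place of data.txt. Reading the file through a redirect makes wc print only the number, so the awk step is not needed; this is one possible variant, not the only way to write it.

```shell
# Hypothetical 12-line sample file standing in for data.txt.
datafile=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$datafile"

# $(...) runs the command and substitutes its output into the assignment.
# With input redirection, wc -l prints the count without a file name.
data1=$(wc -l < "$datafile")

# Some wc implementations pad the number with spaces; arithmetic expansion strips them.
data1=$((data1))

echo "Lines in data file: $data1"
```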

Compare column1 in File with column1 in File2, output {Column1 File1} that does not exist in file 2

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare in File 1 based on File 2 is column 1. Based on the files above, the output should only output:
134|rst|uvw|
Line-by-line comparisons are not the answer, since columns 2 and 3 contain different things; only column 1 contains exactly the same values in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp
You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk processes input one line at a time. It also provides two special clauses, BEGIN{} and END{}, which enclose actions to run before and after the input is processed.
So the part BEGIN{FS=OFS="|"} runs before any file is read. FS and OFS are special Awk variables holding the input and output field separators. Since your file is delimited by |, you need FS="|" to parse it. OFS="|" only matters when a line is rebuilt from its fields, so it is harmless here, where the lines are printed unchanged.
The main part of the command comes after the BEGIN clause. The condition FNR==NR is true only while the first file argument is being read, because NR counts lines across all files combined while FNR restarts at 1 for each file. So each $1 in the first file is stored as a key in the array called unique, and next skips to the following line. When the second file is processed, !($1 in unique) prints only those lines whose $1 value is not in the array, dropping the ones whose key also appears in the first file.
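The two-file idiom can be tried on the sample data from the question. This is a small sketch: the temporary files and the array name seen are illustrative choices, and it assumes the first file is not empty (otherwise FNR == NR would also hold for the second file).

```shell
# Hypothetical temporary copies of the two sample files from the question.
file1=$(mktemp)
file2=$(mktemp)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$file2"

# First pass (FNR == NR): remember every key from file1.
# Second pass: print the lines of file2 whose key was never seen.
only_in_file2=$(awk -F'|' 'FNR == NR { seen[$1]; next } !($1 in seen)' "$file1" "$file2")
printf '%s\n' "$only_in_file2"
```

On this data the only line printed is 134|rst|uvw|.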
Here is another one liner that uses join, sort and grep
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep removes the paired lines: they carry fields from both files, so they contain more than three | delimiters, while the unmatched lines from file2 contain exactly three.
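As a possible simplification, join also has a -v option that prints only the unpairable lines of one file, which would make the grep step unnecessary. The sketch below is a variant of the answer above, not the answerer's command; the temporary file names are hypothetical, and the inputs are sorted into files rather than through process substitution so it also runs in plain sh.

```shell
# Hypothetical temporary copies of the sample data, plus sorted versions.
file1=$(mktemp)
file2=$(mktemp)
sorted1=$(mktemp)
sorted2=$(mktemp)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$file2"

# join needs both inputs sorted on the join field (column 1 here).
sort -t'|' -k1,1 "$file1" > "$sorted1"
sort -t'|' -k1,1 "$file2" > "$sorted2"

# -v 2 prints only the lines of the second file that have no match in the first.
only_in_file2=$(join -t'|' -v 2 "$sorted1" "$sorted2")
printf '%s\n' "$only_in_file2"
```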

Unique entry set in the first column of all csv files under directory [duplicate]

This question already has answers here:
Is there a way to 'uniq' by column?
(8 answers)
Closed 7 years ago.
I have a list of comma separated files under the directory. There are no headers, and unfortunately they are not even the same length for each row.
I want to find the unique entry in the first column across all files.
What's the quickest way of doing it in shell programming?
awk -F "," '{print $1}' *.txt | uniq
seems to only get the unique entries of each file. I want them across all files.
The shortest way is still awk (this prints the whole row for the first occurrence of each first-column value)
awk -F, '!a[$1]++' *.txt
to get just the first field
awk -F, '!a[$1]++ {print $1}' *.txt
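To illustrate, here is a sketch on two made-up CSV files (the directory, file names and contents are assumptions for the example). It also shows why the attempt in the question fell short: uniq only collapses adjacent duplicates, so the input has to be sorted first, which sort -u does in one step.

```shell
# Hypothetical sample directory with two header-less, ragged CSV files.
dir=$(mktemp -d)
printf '%s\n' 'x,1,2' 'y,3' 'x,4,5,6' > "$dir/one.txt"
printf '%s\n' 'y,7' 'z,8,9' > "$dir/two.txt"

# a[$1]++ is 0 (false) the first time a key is seen, so !a[$1]++ is true exactly once per key.
first_seen=$(awk -F, '!a[$1]++ { print $1 }' "$dir"/*.txt)

# Alternative: extract column 1 and let sort deduplicate (output is sorted, not in first-seen order).
sorted_unique=$(cut -d, -f1 "$dir"/*.txt | sort -u)

printf '%s\n' "$first_seen"
```

The awk form keeps the order of first appearance; the sort -u form does not, which may or may not matter for the use case.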

How to write a bash and define awk constants in command line [duplicate]

This question already has answers here:
Using awk with variables
(3 answers)
Closed 8 years ago.
As a part of my bash, I want to pass some constant from command line to awk. For example, I want to subtract constant1 from column 1 and constant2 from column 5
$ sh bash.sh infile 0.54 0.32
#!/bin/bash
#infile = $1
#constant1 = $2
#constant2 = $3
cat $1 | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}'
thank you very much for your help
As awk is its own language, it does not share variables with Bash by default. To use Bash variables in an awk command, pass them to awk using the -v option.
#!/bin/bash
awk -v constant1="$2" -v constant2="$3" '{ print ($1 - constant1), ($5 - constant2) }' "$1"
You'll notice I removed cat as there is no need to pipe cat into awk since awk can read from files.
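A self-contained sketch of the same idea, with a made-up one-line input file and the constants from the question's example call. The awk variable names c1 and c2 and the tab output separator are illustrative choices, not part of the original answer.

```shell
# Hypothetical tab-separated input with six columns.
infile=$(mktemp)
printf '1.54\tb\tc\td\t2.32\tf\n' > "$infile"

# Stand-ins for the script arguments $2 and $3.
constant1=0.54
constant2=0.32

# -v makes the shell values available as awk variables; quoting protects them from word splitting.
result=$(awk -v c1="$constant1" -v c2="$constant2" \
    'BEGIN { OFS = "\t" } { print $1 - c1, $5 - c2 }' "$infile")
printf '%s\n' "$result"
```

With this input the output is 1 and 2 separated by a tab, since awk formats the results with its default number format.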
You need to remove the spaces around = when defining variables:
#!/bin/bash
infile=$1
constant1=$2
constant2=$3
cat $1 | awk '{print $1 $2 $3 $4 $5 $6}'

How to pass variable to awk [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using awk with variables
The following command is wrong, the point is I want to use $curLineNumber in awk, how can I do it? Any solution?
curLineNumber = 3
curTime=`ls -l | awk 'NR==$curLineNumber {print $NF}'`
Thanks
curTime=$(ls -l | awk -v line="$curLineNumber" 'NR == line { print $NF }')
The -v option is used to specify variables initialized on the command line. I chose the name line for the awk variable.
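The same pattern can be checked against fixed input instead of ls output. The sketch below uses three made-up lines. It also shows a second option, reading an environment variable through awk's ENVIRON array; that is a different technique from the -v answer above and is included only as an alternative.

```shell
# No spaces around = in a shell assignment.
curLineNumber=3

# Option 1: pass the value with -v (the awk variable name "line" is arbitrary).
picked=$(printf '%s\n' 'first 10:01' 'second 10:02' 'third 10:03' |
    awk -v line="$curLineNumber" 'NR == line { print $NF }')

# Option 2: put the value in awk's environment and read it from ENVIRON.
# Adding 0 forces a numeric comparison with NR.
picked_env=$(printf '%s\n' 'first 10:01' 'second 10:02' 'third 10:03' |
    curLineNumber="$curLineNumber" awk 'NR == ENVIRON["curLineNumber"] + 0 { print $NF }')

printf '%s\n' "$picked" "$picked_env"
```

Both variants select the last field of the third line, 10:03.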
