Comparing output from two greps - bash

I have two C source files with lots of defines and I want to compare them to each other and filter out lines that do not match.
The grep output (grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440) of the first file may look like:
...
...
# if !defined(NO_BCM_5675_A0)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
...
...
where the grep output (grep "define NO_BCM" include/sdk_custom_config.h) of the second file looks like:
...
...
#define NO_BCM_56260_B0
#define NO_BCM_5675_A0
#define NO_BCM_56160_A0
...
...
So now I want to find any type number in the parentheses above that is missing from the #defines below. How do I best go about this?
Thank you

You could use awk with two process substitutions feeding it the two grep outputs:
awk 'FNR==NR{seen[$2]; next}!($2 in seen)' FS=" " <(grep "define NO_BCM" include/sdk_custom_config.h) FS="[()]" <(grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
The idea is that the commands within <() execute and produce the output awk reads. Setting FS just before each input ensures the common token is parsed with the proper delimiter:
FS="[()]" captures the token inside the parentheses as $2 for the second group, while FS=" " keeps the default whitespace delimiting for the first group.
The core awk logic is an anti-join: while FNR==NR, the first group is parsed and each $2 is stored as a key in the seen hash. Once all of those lines are read, !($2 in seen) runs on the second group, printing only the lines whose $2 is not present in that hash.
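A minimal, self-contained sketch of the same FNR==NR anti-join, using made-up inline data instead of the real header files, may make the flow clearer:
$ awk 'FNR==NR{seen[$1]; next} !($1 in seen)' <(printf 'a\nb\n') <(printf 'a\nb\nc\nd\n')
c
d
While the first input is read, FNR==NR holds and every key is stored in seen; for the second input only the lines whose key was never stored are printed.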

Use comm this way:
comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)
This would give tokens unique to include/soc/mcm/allenum.h.
Output:
NO_BCM_2801PM_A0
NO_BCM_88660_A0
If you want the full lines from that file, then you can use fgrep:
fgrep -f <(comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)) include/soc/mcm/allenum.h
Output:
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
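As a side note, fgrep is the same as grep -F: the -f file is read as a list of fixed strings, one per line, and every line of the target file containing any of them is printed. A small sketch of that step on its own, with a hypothetical pattern file:
$ printf 'NO_BCM_88660_A0\nNO_BCM_2801PM_A0\n' > wanted.txt
$ grep -F -f wanted.txt include/soc/mcm/allenum.h
which should print the same two lines as shown above.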
About comm:
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
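A quick sketch of that three-column behaviour on two tiny, made-up sorted inputs shows why -23 keeps only what is unique to the first file (column two is indented by one tab, column three by two):
$ comm <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
a
		b
		c
	d
$ comm -23 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
a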

It's hard to say without the surrounding context from your sample input files and with no expected output, but it sounds like this is all you need:
awk '!/define.*NO_BCM_/{next} NR==FNR{defined[$2];next} !($2 in defined)' include/sdk_custom_config.h FS='[()]' include/soc/mcm/allenum.h

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors, so I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values, you can add wc to the chain:
cut -f 1 input_file | sort | uniq | wc -l
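If, as in the update, you want a count of each value rather than the number of distinct values, uniq -c should do it; a sketch assuming the same tab-delimited input file:
cut -f 1 input_file | sort | uniq -c
This prints every distinct value from field 1 preceded by the number of times it occurs.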
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this; for example, to list all the unique values in the first column:
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something, you can pipe the unique list into wc -l.
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)  # count the tabs in the first line
cols=$((cols + 2))                               # tabs + 1 is the number of columns; add 1 more for the loop bound
for ((i=1; i < cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
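As a usage sketch, assuming the script was saved as colsummary.sh (the sample file and its contents are made up; the exact whitespace in front of the counts may differ):
$ printf 'Red\tBall\nBlue\tBat\nRed\tCar\n' > sample.tsv
$ ./colsummary.sh sample.tsv
Column 1 ::
      1 Blue
      2 Red

Column 2 ::
      1 Ball
      1 Bat
      1 Car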
This script outputs, for each column of a given file, every unique value together with its count. It assumes that the first line of the given file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter.
Code
#!/bin/bash
awk '
(NR==1){                                 # header line: remember the column names
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){                                 # data lines: count each value per column
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;           # arrays of arrays require gawk 4.0+
}
END{
    for(fi=1; fi<=NF; fi++){             # print: column name, then value_count pairs
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing 2 files. These files contain the field separator ":", and only the part before it should be used when comparing lines.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before : should be compared.
Is it possible to complete this task in 1 command?
I tried to complete the task this way, but the field separator does not work in that situation.
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
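For contrast, a quick sketch with the files in their original order: the stored keys then come from file1, and since every key in file2 (orange, banana) also appears in file1, nothing at all is printed:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file1 file2
$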
One comment: awk can be inefficient with large arrays. In real life, with big files, it may be better to use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2

How to count number of unique fields in a csv file with uneven number of columns in each row

I have a csv file containing an extract of variables for files in a specific directory. Thus the number of columns varies per row, like this:
filename1,variable1,variable2,variable3,variable4
filename2,variable1,variable2,variable5
filename3,variable1,variable5,variable6,variable7,variable8
(trailing commas have been removed)
The command:
awk -F ',' "{print NF}" < input.csv
does not really do the trick, since it just displays the number of columns of the "largest" row in the file for all rows.
It would be great if I could get the number of variables of each row, and more importantly, get the count of unique fields in the whole file.
The ideal output for the first request would be something like:
filename1 4
filename2 3
filename3 5
The ideal output for the second request (the number of unique fields in the whole file):
8
Any great ideas on how to approach this?
Thanks,
Best wishes, Birgitte
Your two requirements can be done in one shot:
awk -F, '{for(i=2;i<=NF;i++)a[$i]}{print $1, NF-1}
END{print "total unique vars:"length(a)}' file.csv
With your example data as input, we get:
filename1 4
filename2 3
filename3 5
total unique vars:8
If you want to split it into two commands:
awk -F, '{print $1, NF-1}' file.csv
And
awk -F, '{for(i=2;i<=NF;i++)a[$i]}END{print length(a)}' file.csv
This may be slower than a single awk script, but it's always nice to have an alternative:
Number of unique variables in the whole file
$ cut -d, -f2- file | tr , \\n | sort -u | wc -l
8
Number of variables per line
$ paste \
<(cut -d, -f1 file) \
<(grep -no , file | uniq -c | tr -s ' ' \\t | cut -f2)
filename1 4
filename2 3
filename3 5
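If that second pipeline looks cryptic, running its first two stages on the sample file may help (output sketched for exactly the three lines shown above; the whitespace in front of the counts may differ): grep -no , prints one line-number:comma pair per comma, uniq -c collapses each run into a count, and the trailing tr/cut keep just that count:
$ grep -no , file | uniq -c
      4 1:,
      3 2:,
      5 3:,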

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space-separated substring on its own line.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted, deduplicated list to use with comm, which is given the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
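To see the alternation that is handed to grep -Eiv, the inner command can be run on its own (same file2.txt as in the answer):
$ paste -sd'|' <file2.txt
this|contains|containing|text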
I would try something more direct:
for A in $(cat file1 | tr -d '[:punct:]'); do grep -wq "$A" file2 || echo "$A"; done
flags used for grep: -q for quiet (we don't need the output), -w for word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

Bash: Formatting multi-line function results alongside each other

I have three functions that digest an access.log file on my server.
hitsbyip() {
cat $ACCESSLOG | awk '{ print $1 }' | uniq -c | sort -nk1 | uniq
}
hitsbyhour() {
cat $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
}
hitsbymin() {
hr=$1
grep "2015:${hr}" $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c
}
They all work fine when used on their own. All three output 2 small columns of data.
Now I am looking to create another function called report which will simply use printf and its formatting possibilities to print 3 columns of data with headers, each of them the result of my first three functions. Something like this:
report() {
printf "%-30b %-30b %-30b\n" `hitsbyip` `hitsbyhour` `hitsbymin 10`
}
The thing is that the format is not what I want; the outputs are not lined up side by side in three columns.
Any help would be greatly appreciated.
Once you use paste to combine the output of the three commands into a single stream, then you can operate line-by-line to format those outputs.
while IFS=$'\t' read -r by_ip by_hour by_min; do
printf '%-30b %-30b %-30b\n' "$by_ip" "$by_hour" "$by_min"
done < <(paste <(hitsbyip) <(hitsbyhour) <(hitsbymin 10))
Elements to note:
<() syntax is process substitution; it generates a filename (typically of the form /dev/fd/## on platforms with such support) which will, when read, yield the output of the command given.
paste takes a series of filenames and puts the output of each alongside the others.
Setting IFS=$'\t' while reading ensures that we read content as tab-separated values (the format paste creates). See BashFAQ #1 for details on using read.
Putting quotes on the arguments to printf ensures that we pass each value assembled by read as a single value to printf, rather than letting them be subject to string-splitting and glob-expansion as unquoted values.
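A minimal, standalone sketch of the paste/process-substitution part in isolation (the three inputs are just made-up stand-ins for the functions; the output columns are tab-separated):
$ paste <(printf 'a\nb\n') <(printf '1\n2\n') <(printf 'x\ny\n')
a	1	x
b	2	y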
