Bash script to efficiently return two file names that both contain a string found in a list - bash

I'm trying to find duplicates of a string ID across files. Each of these IDs is unique and should be used in only one file. I am trying to verify that each ID is only used once, and the script should tell me which ID is duplicated and in which files.
This is an example of the set.csv file
"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","
This is the very inefficient code I wrote
for ID in $(grep 'ID\"\,\"[TF]' set.csv | cut -c 23-31); do
    for FILE1 in *.txt; do
        for FILE2 in *.txt; do
            if [[ $FILE1 -nt $FILE2 && `grep -E "$ID" $FILE1 $FILE2` ]]; then
                echo $ID + $FILE1 + $FILE2
            fi
        done
    done
done
Essentially I'm only interested in ID#s that are identified as "ID" in the CSV which would be 7294, 8641, 8642, 9209, 3759 but not the others. If File1 and File2 both contain the same ID from this set then it would print out the duplicated ID and each file that it is found in.
There might be thousands of IDs, and files so my exponential approach isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic searching algorithm in another language... but if the shell can do it I'd like to know how.
Thanks!
Edit: Bonus would be to find which IDs from the set .csv aren't used at all. A pseudo code for another language might be create a set for all the IDs in the csv, then make another set and add to it IDs found in the files, then compare the sets. Can bash accomplish something like this?

A linear option would be to use awk to store each discovered identifier with the file it was first found in, then report when an identifier turns up again. Assuming the *.txt files share the format of set.csv shown above:
awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
    id = substr($4, 4, 4)
    if (ids[id]) {
        print id " is in " ids[id] " and " FILENAME
    } else {
        ids[id] = FILENAME
    }
}' *.txt
The awk script looks through every *.txt file; it splits the fields based on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", then it extracts the numeric ID from field 4. If that ID has been seen before, it reports the previous file and the current filename; otherwise, it saves the id with an association to the current filename.
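For the bonus question (finding IDs that are never used), the same kind of awk pass can collect the wanted IDs from the CSV first and then cross them off as the text files are scanned. This is a minimal sketch with inline sample data standing in for the real files; the substr offsets and file names are assumptions based on the sample rows above, so adjust them to the real layout:

```shell
# Tiny stand-ins for set.csv and two *.txt files.
printf '%s\n' \
  '"Read-only","ID","T","ID7294","a )","Text","?"' \
  '"Read-only","ID","F","ID8641","b )","Text","?"' >set.csv
echo 'some text mentioning ID7294' >a.txt
echo 'unrelated line' >b.txt

unused=$(awk -F, '
  # First file: remember IDs flagged "ID" with "T" or "F" (quotes stripped).
  NR == FNR && $2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
    wanted[substr($4, 2, 6)]
    next
  }
  # Later files: mark any wanted ID that occurs on a line.
  NR != FNR { for (id in wanted) if (index($0, id)) seen[id] }
  END { for (id in wanted) if (!(id in seen)) print id " is unused" }
' set.csv a.txt b.txt)   # with the real data: set.csv *.txt
echo "$unused"
```

With the sample above this reports ID8641 as unused, since only ID7294 appears in a text file.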

Related

bash csv file column extraction and deduplication

I have a .csv file I am working with and I need to output another csv file that contains a de-duplicated list of columns 2 and 6 from the first csv, with some caveats.
This is a bit difficult to explain in words but here is an example of what my input is:
"customer_name","cid”,”boolean_status”,”type”,”number”
“conotoso, inc.”,”123456”,”TRUE”,”Inline”,”210”
"conotoso, inc.","123456”,”FALSE”,”Inline”,”411"
“afakename”,”654321”,”TRUE","Inline”,”253”
“bfakename”,”909090”,”FALSE”,”Inline”,”321”
“cfakename”,”121212”,”TRUE","Inline","145”
what I need for this to do is create a new .csv file containing only "customer_name" column and "boolean_status" column.
Now, I also need there to be only one line for "customer_name" and to show "TRUE" if ANY of the customer_name matches a "true" value in the boolean column.
The output from the above input should be this:
"customer_name",”boolean_status”
“conotoso, inc.”,”TRUE”
“afakename”,”TRUE"
“cfakename”,”TRUE"
So far I tried
awk -F "\"*\",\"*\"" '{print $1","$6}' data1.csv >data1out.csv
to give me the output file, but then I attempted to cat data1out.csv | grep 'TRUE' with no good luck
Can someone help me out on what I should do to manipulate this properly?
I'm also running into issues with the awk printing out the leading commas
All I really need at the end is a number of "how many unique 'customer_names' have at least 1 'True' in the "boolean" column?"
You will get your de-duplicated file by using
sort -u -t, -k2,2 -k6,6 filename >sortedfile
After this you can write a script to extract the required columns:
while read -r line
do
    if echo "$line" | grep -q "TRUE"
    then
        a=$(echo "$line" | cut -d',' -f1,3)
        echo "$a" >>outputfile
    fi
done <sortedfile
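For the final question in the post, the count of unique customer_names with at least one TRUE, a single awk pass may be simpler than sort plus grep. Splitting on "," (the quote-comma-quote between fields) keeps the comma inside "conotoso, inc." intact; this assumes every field is quoted, as in the sample. The data1.csv below is a stand-in built from the sample rows:

```shell
# Stand-in for data1.csv, built from the sample rows in the question.
cat >data1.csv <<'EOF'
"customer_name","cid","boolean_status","type","number"
"conotoso, inc.","123456","TRUE","Inline","210"
"conotoso, inc.","123456","FALSE","Inline","411"
"afakename","654321","TRUE","Inline","253"
"bfakename","909090","FALSE","Inline","321"
"cfakename","121212","TRUE","Inline","145"
EOF

# Split on the quote-comma-quote between fields; skip the header line;
# collect each customer_name seen with TRUE, then count distinct names.
true_count=$(awk -F'","' 'NR > 1 && $3 == "TRUE" { seen[$1] }
    END { n = 0; for (k in seen) n++; print n }' data1.csv)
echo "$true_count"
```

For the sample data this counts 3 customers (conotoso, afakename, cfakename).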

To extract a string from filename and insert it into the file

I want to write a bash script for extracting a string from the file name and insert that string into a specific location in the same file.
For example:
Under /root dir there are different date directories 20160201, 20160202, 20160203 and under each directory there is a file abc20160201.dat, abc20160202.dat, abc20160203.dat.
My requirement is that I need to extract the date from each file name first, and then insert that date into the second column of each record in the file.
For extracting the date I am using
f=abc20160201.dat
s=`echo $f | cut -c 4-11`
echo "$f -> $s"
and for inserting the date I am using
awk 'BEGIN { OFS = "~"; ORS = "\n" ; date="20160201" ; IFS = "~"} { $1=date"~"$1 ; print } ' file > tempdate
But in my awk command the date is coming in the first column. Please let me know what I am doing wrong here.
The file on which this operation is being done is a delimited file with fields separated by ~ characters.
Or if anybody has a better solution for this, please let me know.
The variable for the input field separator is FS, not IFS. Consequently, the input line is not being split at all, hence when you add the date after field 1, it appears at the end of the line.
You should be able to use:
f=abc20160201.dat
s=$(echo $f | cut -c 4-11)
awk -v date="$s" 'BEGIN { FS = OFS = "~" } { $1 = $1 OFS date; print }' $f
That generates the modified output to standard output. AFAIK, awk doesn't have an overwrite option, so if you want to modify the files 'in place', you'll write the output of the script to a temporary file, and then copy or move the temporary file over the original (removing the temporary if you copied). Copying preserves both hard links and symbolic links (and owner, group, permissions); moving doesn't. If the file names are neither symlinks nor linked files, moving is simpler. (Copying always 'works', but the copy takes longer than a move, requires the remove, and there's a longer window while the over-writing copy could leave you with an incomplete file if interrupted.)
Generalizing a bit:
for file in /root/2016????/*.dat
do
tmp=$(mktemp "$(dirname "$file")/tmp.XXXXXX")
awk -v date="$(basename "$file" | cut -c 4-11)" \
'BEGIN { FS = OFS = "~" } { $1 = $1 OFS date; print }' "$file" >"$tmp"
mv "$tmp" "$file"
done
One of the reasons for preferring $(…) over back-quotes is that it is much easier to manage nested operations and quoting using $(…). The mktemp command creates the temporary file in the same directory as the source file; you can legitimately decide to use mktemp "${TMPDIR:-/tmp}/tmp.XXXXXX" instead. A still more general script would iterate over "$@" (the arguments it is passed), but it might need to validate that the base name of each file matches the format you require/expect.
Adding code to deal with cleaning up on interrupts, or selecting between copy and move, is left as an exercise for the reader. Note that the script makes no attempt to detect whether it has been run on a file before. If you run it three times on the same file, you'll end up with columns 2-4 all containing the date.

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the files in parse.txt based on the order specified in refer.txt
ex of output.txt should be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
i have tried the following code:
sort -nru refer.txt parse.txt
but no luck.
Please assist me. TIA.
You can do that using gnu-awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/          # use / as the field separator
-v RS=',|\n' # use comma or newline as the record separator
NR == FNR {                         # while processing parse.txt
    a[$1] = (a[$1]) ? a[$1] "," $0 : $0 # build an array keyed on the 1st field, collecting
                                        # all the records for julie, remo, rob, etc.
}
{                                   # while processing the second file, refer.txt
    s = (s) ? s "," a[$1] : a[$1]   # aggregate all values, reading keys from the 2nd file
}
END { print s }                     # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt
# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[#]}"; do
key=${value%%/*}
if [[ ${kv[$key]} ]]; then
kv[$key]+=",$value" # already exists, comma-separate
else
kv[$key]="$value"
fi
done
# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[#]}"; do
out+=( "${kv[$value]}" )
done
# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
tr , "\n" refer.txt | cat -n >person_id.txt # 'cut -n' not posix, use sed and paste
cat person_id.txt | while read person_id person_key
do
print "$person_id" > $person_key
done
tr , "\n" parse.txt | sed 's/(^[^\/]*)(\/.*)$/\1 \1\2/' >person_data.txt
cat person_data.txt | while read foreign_key person_data
do
person_id="$(<$foreign_key)"
print "$person_id" " " "$person_data" >>merge.txt
done
sort merge.txt >output.txt
A text book data processing approach, a person id table, a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
This actually adds a step to the otherwise simple but not very efficient task of grepping parse.txt with each value in refer.txt - which is more efficient I'm not sure.
NB: The above code is very unlikely to work out of the box.
NBB: On reflection, probably a better way of doing this would be to use the file system to create a random access file of parse.txt (essentially an index), and to then consider refer.txt as a batch file, submitting it as a job as such, printing out from the parse.txt random access file the data for each of the names read in from refer.txt in turn:
# 1) index data file on required field
mkdir -p ./person_data
while read -r data
do
    key="${data%%/*}" # alt. `cut -d'/' -f1`
    printf '%s\n' "$data" >>./person_data/"$key"
done <person_data.txt
# 2) run batch job
while read -r key
do
    cat ./person_data/"$key"
done <refer_data.txt
However having said that, using egrep is probably just as rigorous a solution or at least for small datasets, I would most certainly use this approach given the specific question posed. (Or maybe not! The above could well prove faster as well as being more robust.)
Command
while read line; do
grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, newlines are translated to commas using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that your records are separated by newlines instead of commas.
while read will read in every name from refer.txt, (i.e julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed from the start of the string and not in the middle (thanks to @CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.

Bash script processing too slow

I have the following script where I'm parsing 2 CSV files to find a MATCH. The files have 10000 lines each. But the processing is taking a long time!!! Is this normal?
My script:
#!/bin/bash
IFS=$'\n'
CSV_FILE1=$1;
CSV_FILE2=$2;
sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2
echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2" >> 'mapping.csv'
while read lineCSV1 # parse 1st CSV file
do
    PATH1=`echo $lineCSV1 | awk '{print $1}'`
    NAME1=`echo $lineCSV1 | awk '{print $3}'`
    SIZE1=`echo $lineCSV1 | awk '{print $7}'`
    CKSUM1=`echo $lineCSV1 | awk '{print $9}'`
    while read lineCSV2 # parse 2nd CSV file
    do
        PATH2=`echo $lineCSV2 | awk '{print $1}'`
        NAME2=`echo $lineCSV2 | awk '{print $3}'`
        SIZE2=`echo $lineCSV2 | awk '{print $7}'`
        CKSUM2=`echo $lineCSV2 | awk '{print $9}'`
        # Test if NAME1 matches NAME2
        if [[ $NAME1 == $NAME2 ]]; then
            # Test checksum of the matching name
            if [[ $CKSUM1 != $CKSUM2 ]]; then
                # Mapping of the matching lines
                echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
            fi
            break # when it's a match, break the inner loop and go to the next row of the 1st CSV file
        fi
    done < Sorted_CSV2 # done CSV2
done < Sorted_CSV1 # done CSV1
This is quadratic, O(N^2). Also, see Tom Fenech's comment: you are calling awk several times inside a loop inside another loop. Instead of using awk to extract the fields of every line, try setting the IFS shell variable to ";" and reading the fields directly with read:
IFS=";"
while read FIELD11 FIELD12 FIELD13; do
while read FIELD21 FIELD22 FIELD23; do
...
done <Sorted_CSV2
done <Sorted_CSV1
Though, this would be still O(N^2) and very inefficient. It seems you are matching 2 fields by a coincident field. This task is easier and faster to accomplish by using join command line utility, and would reduce order from O(N^2) to O(N).
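As a rough sketch of the join suggestion: sort both files on the name field, join on it, and filter rows whose checksums differ. The layout below (name in field 2, checksum in field 4 of ';'-separated files) is hypothetical, so adjust the sort keys, the join field numbers, and the awk columns to the real files:

```shell
# Hypothetical ';'-separated inputs: path;name;size;checksum
cat >f1.csv <<'EOF'
/a/x;fileA;100;aaa
/a/y;fileB;200;bbb
EOF
cat >f2.csv <<'EOF'
/b/x;fileA;100;ccc
/b/z;fileC;300;ddd
EOF

# join needs both inputs sorted on the join field (the name).
sort -t';' -k2,2 f1.csv >s1.csv
sort -t';' -k2,2 f2.csv >s2.csv

# Join on field 2 of each file; the joined line is
# name;path1;size1;cksum1;path2;size2;cksum2, so the two
# checksums are fields 4 and 7. Keep only rows where they differ.
mismatches=$(join -t';' -1 2 -2 2 s1.csv s2.csv | awk -F';' '$4 != $7')
echo "$mismatches"
```

For the sample data, only fileA appears in both files, and its checksums differ, so one joined line comes out. This does one sort per file plus a single linear merge instead of the N^2 nested loops.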
Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).
An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.
In your loop of a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times. Your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 x 10,000 times you go through that inner loop: potentially 100 million iterations. Think you can see why your program might be a little slow?
In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on a MS-DOS system with 256K. It just wouldn't work). So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.
Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.
Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk might be faster, but many versions don't have associative arrays. This is really a job for a higher level scripting language like Python or Perl.
Since I can't do a Bash answer, here's a Perl answer. Maybe this will help. Or, maybe this will inspire someone who has Bash 4.x can give an answer in Bash.
I Basically open the first file and create an associative array keyed by the checksum. If this is a sha1 checksum, it should be unique for all files (unless they're an exact match). If you don't have a sha1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.
Once I have the associative array figured out, I then open file #2 and simply see if the checksum already exists in the file. If it does, I know I have a matching line, and print out the two matches.
I have to loop 10,000 times through the first file, and 10,000 times through the second. That's only 20,000 loop iterations instead of 100 million, roughly 5,000 times less looping, which means the program will run thousands of times faster. So, if it takes 2 full days for your program to run with a double loop, an associative array solution could finish in under a minute.
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE1 => "file1.txt",
FILE2 => "file2.txt",
MATCHING => "csv_matches.txt",
};
#
# Open the first file and create the associative array
#
my %file_data;
open my $fh1, "<", FILE1;
while ( my $line = <$fh1> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# The main key is "check_sum" which **should** be unique, especially if it's a sha1
#
$file_data{$check_sum}->{PATH} = $path;
$file_data{$check_sum}->{NAME} = $name;
$file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;
#
# Now, we have the associative array keyed by the data we want to match, read file 2
#
open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# If there is a matching checksum in file1, we know we have a matching entry
#
if ( exists $file_data{$check_sum} ) {
printf {$csv_fh} "%s;%s:%s:%s:%s:%s\n",
$file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
$path, $name, $size;
}
}
close $fh2;
close $csv_fh;
BUGS
(A good manpage always lists issues!)
This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
This assumes a sha1, sha256, or equivalent checksum. With such a checksum, it is extremely unlikely that two files will have the same checksum unless they match. A 16-bit checksum from the historic sum command may have collisions.
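For anyone the Perl answer was meant to inspire, here is a hedged Bash 4.x sketch of the same associative-array idea, assuming the same whitespace-separated nine-field layout the Perl split uses (path in field 1, name in field 3, size in field 7, checksum in field 9). Tiny inline samples stand in for file1.txt and file2.txt:

```shell
#!/usr/bin/env bash
# Requires Bash 4+ for declare -A. Sample layout:
# path junk name junk junk junk size junk checksum
cat >file1.txt <<'EOF'
/a/x j fileA k l m 100 n aaa
/a/y j fileB k l m 200 n bbb
EOF
cat >file2.txt <<'EOF'
/b/x j fileA k l m 100 n aaa
/b/z j fileC k l m 300 n ddd
EOF

declare -A paths names sizes
# Pass 1: key every record in file1 by its checksum.
while read -r path _ name _ _ _ size _ cksum; do
    paths[$cksum]=$path
    names[$cksum]=$name
    sizes[$cksum]=$size
done <file1.txt

# Pass 2: a checksum already in the array means a matching entry.
while read -r path _ name _ _ _ size _ cksum; do
    if [[ -n ${paths[$cksum]+set} ]]; then
        printf '%s;%s;%s;%s;%s;%s\n' \
            "${paths[$cksum]}" "${names[$cksum]}" "${sizes[$cksum]}" \
            "$path" "$name" "$size"
    fi
done <file2.txt >csv_matches.txt
cat csv_matches.txt
```

Like the Perl version, this keeps only the last entry per checksum from file1, so duplicates within one file are not reported.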
Although a proper database engine would make a much better tool for this, it is still very well possible to do it with awk.
The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.
In detail:
Insert two columns in both CSV files
Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.
awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2
Concatenate the files, then sort the lines
sort tmpfile1 tmpfile2 > tmpfile3
Scan the result, report the mismatches
awk -F';' -f scan.awk tmpfile3
Where scan.awk contains:
BEGIN {
    origin = 3;
}
$1 == name && $2 > origin && $6 != checksum {
    print record;
}
{
    name = $1;
    origin = $2;
    checksum = $6;
    sub(/^[^;]*;.;/, "");
    record = $0;
}
Putting it all together
Crammed together into a Bash oneliner, without explicit temporary files:
(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'
Notes:
If the same name appears more than once in csvfile1, then all but the last one are ignored.
If the same name appears more than once in csvfile2, then all but the first one are ignored.

Adding file information to an AWK comparison

I'm using awk to perform a file comparison against a file listing in found.txt
while read line; do
awk 'FNR==NR{a[$1]++;next}$1 in a' $line compare.txt >> $CHECKFILE
done < found.txt
found.txt contains full path information to a number of files that may contain the data. While I am able to determine that data exists in both files and output that data to $CHECKFILE, I wanted to be able to put the line from found.txt (the filename) where the line was found.
In other words I end up with something like:
File " /xxxx/yyy/zzz/data.txt "contains the following lines in found.txt $line
just not sure how to get the /xxxx/yyy/zzz/data.txt information into the stream.
Appended for clarification:
The file found.txt contains the full path information to several files on the system
/path/to/data/directory1/file.txt
/path/to/data/directory2/file2.txt
/path/to/data/directory3/file3.txt
each of the files has a list of parameters that need to be checked for existence before appending additional information to them later in the script.
so for example, file.txt contains the following fields
parameter1 = true
parameter2 = false
...
parameter35 = true
the compare.txt file contains a number of parameters as well.
So if parameter35 (or any other parameter) shows up in one of the three files, I get its output dropped to the Checkfile.
Both of the scripts (yours and the one I posted) will give me that output but I would also like to echo in the line that is being read at that point in the loop. Sounds like I would just be able to somehow pipe it in, but my awk expertise is limited.
It's not really clear what you want but try this (no shell loop required):
awk '
ARGIND==1 { ARGV[ARGC] = $0; ARGC++; next }
ARGIND==2 { keys[$1]; next }
$1 in keys { print FILENAME, $1 }
' found.txt compare.txt > "$CHECKFILE"
ARGIND is gawk-specific, if you don't have it add FNR==1{ARGIND++}.
Pass the name into awk inside a variable like this:
awk -v file="$line" '{... print "File: " file }'
