File Name comparision in Bash - bash

I have two files containing list of files. I need to check what files are missing in the list of second file. Problem is that I do not have to match full name, but only need to match last 19 Characters of the file names.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.

awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} #save last 19 characters for every line in file2
{if(!a[substr($0, start)]) print $0;} #If that is not present in file1, print that line.
' file2.list file.list

First you can use comm to match the exact file names and obtain a list of files not matchig. Then you can use agrep. I've never used it, but you might find it useful.
Or, as last option, you can do a brute force and for every line in the first file search into the second:
#!/bin/bash
# Iterate through the first file
while read LINE; do
# Find the section of the filename that has to match in the other file
CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
# Create a regex to match the filenames in the second file
SEARCH_REGEX="^.*$CHECK_SECTION$"
# Search...
egrep "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext

So, if the first file is FILE1 and the second file is FILE2 then if the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you what time-stamped-grouped groups of files exist in FILE1 but not FILE2--as I understand it, this is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev's and the cut commands can just be amended to refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" ${tmp1}
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.

Or you could use your standard coreutil tools:
for i in $(cat file1 file2 | sort | uniq -u); do
grep -q "$i" f1.txt && \
echo "f2 missing '$i'" || \
echo "f1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.

Related

How do I find the "full symmetric difference" of several files in bash?

I have five files that each list full file paths like so:
File one
/full/file/path/one.xlsx
/full/file/path/two.txt
/full/file/path/three.pdf
....
File two
/a/b/c/d/r.txt
/full/file/path/two.txt
....
File three
/obe/two/three/graph.m
/full/file/path/two.txt
....
File four
.....
File five
.....
All five may contain the same exact full file paths. However, I want to filter out paths that are common to each file. In other words, I want the total intersection of all files removed. Below is a visual aid describing what I want with a smaller example of three files (excuse my poor mouse drawing skills):
The page on the symmetric difference did not describe exactly what I wanted, hence the visual aid and the quotes around the phrase full symmetric difference.
Question
How do I filter lines of text in several files to get the situation I want above?
Assuming that that each file is free of duplicates you could
Concat all files (cat file1 file2 ... file5)
Count how often each line appears (sort | uniq -c)
And keep only lines which appeared less than five times (sed -En 's/^ *[1-4] //p')
sort file1 ... file5 | uniq -c | sed -En 's/^ *[1-4] //p'
However, if some file may contain the same line multiple times than you would have to remove these duplicates first.
f() { sort -u "$1"; }
sort <(f file1) ... <(f file5) | uniq -c | sed -En 's/^ *[1-4] //p'
or (a bit slower but easier to edit)
for i in file1 ... file5; do sort -u "$i"; done |
sort | uniq -c | sed -En 's/^ *[1-4] //p'
If for some reason you want to keep duplicates from individual files and also want to retain the original order of lines, then you can invert the above command to only print lines which appeared in every file and remove these lines using grep:
f() { sort -u "$1"; }
grep -Fxvhf <(sort <(f file1) ... <(f file5) |
uniq -c | sed -En 's/^ *5 //p') file1 ... file5
or (a bit slower but easier to edit)
files=(file1 ... file5)
grep -Fxvhf <(for i in "${files[#]}"; do sort -u "$i"; done |
sort | uniq -c | sed -En 's/^ *5 //p') "${files[#]}"

Shell: Counting lines per column while ignoring empty ones

I am trying to simply count the lines in the .CSV per column, while at the same time ignoring empty lines.
I use below and it works for the 1st column:
cat /path/test.csv | cut -d, -f1 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 8
And below for the 2nd column:
cat /path/test.csv | cut -d, -f2 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 6
But when I try to count 3rd column, it simply Outputs the Total number of lines in the whole .CSV.
cat /path/test.csv | cut -d, -f3 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 33
#Should be: 19?
I've also tried to use awk instead of cut, but get the same issue.
I have tried creating new file thinking maybe it had some spaces in the lines, still the same.
Can someone clarify what is the difference? Betwen reading 1-2 column and the rest?
20355570_01.tif,,
20355570_02.tif,,
21377804_01.tif,,
21377804_02.tif,,
21404518_01.tif,,
21404518_02.tif,,
21404521_01.tif,,
21404521_02.tif,,
,22043764_01.tif,
,22043764_02.tif,
,22095060_01.tif,
,22095060_02.tif,
,23507574_01.tif,
,23507574_02.tif,
,,23507574_03.tif
,,23507804_01.tif
,,23507804_02.tif
,,23507804_03.tif
,,23509247_01.tif
,,23509247_02.tif
,,23509247_03.tif
,,23527663_01.tif
,,23527663_02.tif
,,23527663_03.tif
,,23527908_01.tif
,,23527908_02.tif
,,23527908_03.tif
,,23535506_01.tif
,,23535506_02.tif
,,23535562_01.tif
,,23535562_02.tif
,,23535636_01.tif
,,23535636_02.tif
That happens when input file has DOS line endings (\r\n). Fix your file using dos2unix and your command will work for 3rd column too.
dos2unix /path/test.csv
Or, you can remove the \r at the end while counting non-empty columns using awk:
awk -F, '{sub(/\r/,"")} $3!=""{n++} END{print n}' /path/test.csv
The problem is in the grep command: the way you wrote it will return 33 lines when you count the 3rd column.
It's better instead to use the following command to count number of lines in .CSV for each column (example below is for the 3rd column):
cat /path/test.csv | cut -d , -f3 | grep -cve '^\s*$'
This will return the exact number of lines for each column and avoid of piping into wc.
See previous post here:
count (non-blank) lines-of-code in bash
edit: I think oguz ismail found the actual reason in their answer. If they are right and your file has windows line endings you can use one of the following commands without having to convert the file.
cut -d, -f3 yourFile.csv cut | tr -d \\r | grep -c .
cut -d, -f3 yourFile.csv | grep -c $'[^\r]' # bash only
old answer: Since I cannot reproduce your problem with the provided input I take a wild guess:
The "empty" fields in the last column contain spaces. A field containing a space is not empty altough it looks like it is empty as you cannot see spaces.
To count only fields that contain something other than a space adapt your regex from . (any symbol) to [^ ] (any symbol other than space).
cut -d, -f3 yourFile.csv | grep -c '[^ ]'

How to use lines in a file as keyword for grep?

I've search lots of questions on here and other sites, and people have suggested things that should fix my problem, but I think there's something wrong with my code that I just don't recognize.
I have 24 .fasta files from NGS sequencing that are 150bp long. There's approximately 1M reads for each file. The reads are from targeted sequencing where we electroplated vectors with cDNA for genes of interest, and a unique barcode sequence. I need to look through the sequencing files for the presence or absence of the barcode sequence which corresponds to a specific gene.
I have a .txt list of the barcodeSequences that I want to pass to grep to look for the barcode in the .fasta file. I've tried so many variations of this command. I can give grep each barcode individually but that's so time consuming, I know it's possible to give it the list of barcode sequences and search each .fasta for each of the barcodes and record how many times each barcode is found in each file.
Here's my code where I give it each barcode individually:
# Barcode 33
mkdir --mode 755 $dir/BC33
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep 'TATTAGAGTTTGAGAATAAGTAGT' > $dir/BC33/"$f"
done
I tried to adapt it so that I don't have to feed every barcode sequence in individually:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep -c -f BarcodeScreenSeq.txt | sort > $dir/Results/"$f"
echo "Finished $f"
done
But it is not searching for the barcode sequences. With this iteration it is just returning new files in the /Results directory that are empty. I also tried a nest loop, where I tried to make the barcode sequence a variable that changed like the $FILES, but that just gave me a new file with the names of my .fasta files:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
for b in `cat /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt`; do
cat "$f" | grep -c "$b" | sort > $dir/"$f"_Barcode
done ;
done
I want a output .txt file that has:
<barcode sequence>: <# of times that bc was found>
for each .fasta file because I want to put all the samples together to make one large excel sheet which shows each barcode and how many times it was found in each sample.
Please help, I've tried everything I can think of.
EDIT
Here is what the BarcodeScreenSeq.txt file would look like. It's just a txt file where each line is a barcode sequence:
head BarcodeScreenSeq.txt
TATTATGAGAAAGTTGAATAGTAG
ATGAAAGTTAGAGTTTATGATAAG
AATAGATAAGATTGATTGTGTTTG
TGTTAAATGTATGTAGTAATTGAG
ATAGATTTAAGTGAAGAGAGTTAT
GAATGTTTGTAAATGTATAGATAG
AAATTGTGAAAGATTGTTTGTGTA
TGTAAGTGAAATAGTGAGTTATTT
GAATTGTATAAAGTATTAGATGTG
AGTGAGATTATGAGTATTGATTTA
EDIT
lozzib#gliaserver:~/AG_Barcode_Seq$ file BarcodeScreenSeq.txt
BarcodeScreenSeq.txt: ASCII text, with CRLF line terminators
Windows Line Endings
Your BarcodeScreenSeq.txt has windows line endings. Each line ends with the special characters \r\n. Linux tools such as grep only deal with linux line endings \r and interpret your file ...
TATTATG\r\n
ATGAAAG\r\n
...
to look for the patterns TATTATG\r, ATGAAAG\r, ... (note the \r at the end). Because of the \r there is no match.
Either: Convert your file once bye running dos2unix BarcodeScreenSeq.txt or sed -i 's/\r//g' BarcodeScreenSeq.txt. This will change your file.
Or: replace every BarcodeScreenSeq.txt in the following scripts by <(tr -d '\r' < BarcodeScreenSeq.txt). This won't change the file, but creates more overhead as the file is converted over and over again.
Command
grep -c has only one counter. If you pass multiple search patterns at once (for instance using -f BarcodeScreenSeq.txt) you still get only one number for all patterns together.
To count the occurrences of each pattern individually you can use the following trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
sort | uniq -c |
awk '{print $2 ": " $1 }' > "Results/$file"
done
grep -o will print each match as a single line.
sort | uniq -c will count how often each line occurs.
awk is only there to change the format from #matches pattern to pattern: #matches.
Benefit: The command should be fairly fast.
Drawback: Patterns from BarcodeScreenSeq.txt that are not found in $file won't be listed at all. Your result will leave out lines of the form pattern: 0.
If you really need the lines of the form pattern: 0 you could use another trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
cat - BarcodeScreenSeq.txt |
sort | uniq -c |
awk '{print $2 ": " ($1 - 1) }' > "Results/$file"
done
cat - BarcodeScreenSeq.txt will insert the content of BarcodeScreenSeq.txt at the end of grep's output such that #matches is one bigger than it should be. The number is corrected by awk.
You can read a text file one line at a time and process each line separately using a redirect, like so:
for f in *.fasta; do
while read -r seq; do
grep -c "${seq}" "${f}" > "${dir}"/"${f}"_Barcode
done < /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt
done

Append xargs argument number as prefix

I want to analyze the most frequentry occuring entries in (column of) a logfile. To write the detail results, I am creating new directories from the output of something along the lines of
cat logs| cut -d',' -f 6 | sort | uniq -c | sort -rn | head -10 | \
awk '{print $2}' |xargs mkdir -p
Is there a way to create the directories with the sequence number of the argument as processed by xargs as a prefix? For e.g. For e.g. "oranges" is the most frequent entry (of the column) the directory created should be named "1.oranges" and so on.
A quick (and dirty?) solution could be to pipe your directory names through cat -n in their proper order and then remove the whitespace separating the line number from the directory name, before passing them to xargs.
A better solution would be to modify your awk command:
... | awk '{ print NR "." $2 }' | xargs mkdir -p
The NR variable contains the record (i.e. line) number.

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in a file and output to an output file with part of the corresponding input file as part of the output line. So for example we were looking at file LOG_Yellow and it had 28 lines, the the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for you name extraction, I still leave the full name. However, unlike other answers I'd prefer to use head rather then grep, which not only should be slightly faster, but also avoids the case of filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l *| grep -v " total"
send
28 Yellow
You can reverse it if you want (awk, if you don't have space in file names)
wc -l *| egrep -v " total$" | sed s/[prefix]//
| awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file
'wc -l' for finding out the line count
And dont forget to redirect
('>' or '>>') your results to your
output file

Resources