Finding pairs of files with identical names in two directories - bash

I need to search two directories for pairs of files that have identical names (but not extensions!) and use the matched names in some new command.
First, how do I print only the name of a file?
1) Typically I use the following command within a for loop to select the full name of the file being looped over:
for file in ./files/* do;
title=$(base name "file")
print title
done
What should I change in the above script so that title contains only the name of the file, without its extension?
2) How is it possible to add a condition that checks whether two files have the same name, doing a double loop over them, e.g.
# counter for the detected equal files
i=0
for file in ./files1/* do;
title=$(base name "file") #change it to avoid extension within the title
for file2 in ./files2/* do;
title2=$(basename "file2") #change it to avoid extension within the title2
if title1==title2
echo $title1 and $title2 'has been found!'
i=i+1
done
Thanks for the help!
Gleb

You could start by fixing the syntax errors in your script, such as do followed by ; when it should be the other way round.
Then, the shell has operators to remove substrings from the start (##, #) and end (%%, %) of a variable's value. Here's how to list files without extensions, i.e. removing the shortest part that matches the glob .* from the right:
for file in *; do
printf '%s\n' "${file%.*}"
done
Read your shell manual to find out about these operators. It will pay for itself many times over in your programming career :-)
Do not believe anyone telling you to use ugly and expensive piping and forking with basename, cut, awk and such. That's all overkill.
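For instance, an illustrative sketch of all four operators on a made-up file name:
f=archive.tar.gz
echo "${f%.*}"     # shortest .* stripped from the end:   archive.tar
echo "${f%%.*}"    # longest .* stripped from the end:    archive
echo "${f#*.}"     # shortest *. stripped from the start: tar.gz
echo "${f##*.}"    # longest *. stripped from the start:  gz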
On the other hand, maybe there's a better way to achieve your goal. Suppose you have files like this:
$ find files1 files2
files1
files1/file1.x
files1/file3.z
files1/file2.y
files2
files2/file1.x
files2/file4.b
files2/file3.a
Now create two lists of file names, extensions stripped:
ls files1 | sed -e 's/\.[^.]*$//' | sort > f1
ls files2 | sed -e 's/\.[^.]*$//' | sort > f2
The comm utility compares two sorted files and reports the lines they have in common:
$ comm f1 f2
                file1
file2
                file3
        file4
The first column lists lines only in f1, the second lines only in f2, and the third lines common to both. Using the -1, -2 and -3 options you can suppress unwanted columns. If you need only to count the common files (third column), run
$ comm -1 -2 f1 f2 | wc -l
2
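Putting it together for the original two-directory problem, the temporary files can be skipped entirely with process substitution; a sketch, assuming file names without newlines:
comm -12 <(ls files1 | sed -e 's/\.[^.]*$//' | sort) \
         <(ls files2 | sed -e 's/\.[^.]*$//' | sort)
Here -12 suppresses the first two columns, leaving only the names present in both directories.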


Is it possible to work with 'for loop grep' commands?

I have lots of files, one directory per year, and each file contains long stretches of text, for example:
home/2001/2001ab.txt
the AAAS kill every one not but me and you and etc
the A1CF maybe color of full fill zombie
home/2002/2002ab.txt
we maybe know some how what
home/2003/2003ab.txt
Mr, Miss boston, whatever
aaas will will will long long
and in the home directory I have home/reference.txt (a list of words):
A1BG
A1CF
A2M
AAAS
I'd like to count how many of the words in reference.txt appear in each year file.
This is my code, which I run inside every year directory (home/2001/, home/2002/, home/2003/):
# awk
function search () {
  awk -v pattern="$1" '$0 ~ pattern {print}' *.txt > "$1"
}
# load reference.txt
for i in $(cat reference.txt)
do
  search "$i"
done
# word count
wc -l * > line-count.txt
This is my result:
home/2001/A1BG
$cat A1BG
0
home/2001/A1CF
$cat A1CF
1
home/2001/A2M
$cat A2M
0
home/2001/AAAS
$cat AAAS
1
home/2001/line-count.txt
$cat line-count.txt
2001ab.txt 2
A1BG 0
A1CF 1
A2M 0
AAAS 1
The resulting line-count.txt file has all the information I want, but I have to repeat this work manually:
cd into a directory,
run my code,
then cd into the next directory.
I have around 500 directories and files, so it is not easy.
The second problem is the wasteful bunch of files: the script creates lots of files and takes too much time.
Because of this, at first I wanted to use a grep command, but I didn't know how to use a list of words from a file instead of a single word; that is why I used awk.
How can I do it more simply?
at first I wanted to use a grep command, but I didn't know how to use a list of words from a file instead of a single word
You might use the --file=FILE option for that purpose; the given file should hold one pattern per line.
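For instance, with the reference.txt shown above (an illustrative sketch; the paths assume you run it from the directory containing home):
grep --file=home/reference.txt home/2001/2001ab.txt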
How can I do it more simply?
You might use the --count option to avoid the need for wc -l. Consider the following simple example: let file.txt content be
123
456
789
and file1.txt content be
abc123
def456
and file2.txt content be
ghi789
xyz000
and file3.txt content be
xyz000
xyz000
then
grep --count --file=file.txt file1.txt file2.txt file3.txt
gives output
file1.txt:2
file2.txt:1
file3.txt:0
Observe that no files are created, and a file without matches still appears in the output (with a count of 0). Disclaimer: this solution assumes file.txt does not contain characters with special meaning for GNU grep; if that does not hold, do not use this solution.
(tested in GNU grep 3.4)
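To also avoid cd-ing into each of the ~500 directories by hand (the first part of the question), a loop over the year directories might work; a sketch assuming the home/<year>/ layout shown above, with -H forcing grep to print the file name even when a directory holds a single .txt file:
for dir in home/*/; do
  grep -H --count --file=home/reference.txt "$dir"*.txt
done > line-count.txt
The same disclaimer about special characters in reference.txt applies here.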

How do I get the list of all items in dir1 which don't exist in dir2?

I want to compute the difference between two directories - but not in the sense of diff, i.e. not of file and subdirectory contents, but rather just in terms of the list of items. Thus if the directories have the following files:
dir1: f1 f2 f4
dir2: f2 f3
I want to get f1 and f4.
You can use comm to compare two listings:
comm -23 <(ls dir1) <(ls dir2)
Process substitution with <(cmd) passes the output of cmd as if it were a file name. It's similar to $(cmd), but instead of capturing the output as a string it generates a dynamic file name (usually /dev/fd/###).
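A quick way to see the mechanism:
echo <(true)    # prints something like /dev/fd/63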
comm prints three columns of information: lines unique to file 1, lines unique to file 2, and lines that appear in both. -23 hides the second and third columns and shows only lines unique to file 1.
You could extend this to do a recursive diff using find. If you do that you'll need to suppress the leading directories from the output, which can be done with a couple of strategic cds.
comm -23 <(cd dir1; find) <(cd dir2; find)
Edit: a naive diff-based solution, plus an improvement due to @JohnKugelman:
diff --suppress-common-lines <(\ls dir1) <(\ls dir2) | egrep "^<" | cut -c3-
Instead of working on the directories themselves, we work on their listings as files; then we use regular diff, taking only the lines appearing in the first file, which diff marks with <, and finally removing that marking.
Naturally one could beautify the above by checking for errors, verifying we've gotten two arguments, printing usage information otherwise etc.
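For example, a minimal sketch of such a wrapper (the function name dirdiff is made up here):
dirdiff() {
    # require exactly two arguments, both existing directories
    if [ "$#" -ne 2 ] || [ ! -d "$1" ] || [ ! -d "$2" ]; then
        echo "usage: dirdiff DIR1 DIR2" >&2
        return 1
    fi
    # names present in the first listing but not the second
    comm -23 <(ls "$1") <(ls "$2")
}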

Extracting a value from a same file from multiple directories

Directory names: F1, F2, F3, …, F120.
Inside each directory there is a file with a common name, xyz.txt, holding a single value.
Example:
F1
Xyz.txt
3.345e-2
F2
Xyz.txt
2.345e-2
F3
Xyz.txt
1.345e-2
--
F120
Xyz.txt
0.345e-2
I want to extract these values and paste them into a single file, say new.txt, as one column like:
new.txt
3.345e-2
2.345e-2
1.345e-2
---
0.345e-2
Any help please? Thank you so much.
If your files look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general example, as each digit position may hold any digit. The regular expression says that the whole line must consist of: any digit [0-9], a dot [.], three digits [0-9]{3}, the letter e, a minus sign, and a final digit [0-9].
If your data is more regular, you can also try a simpler solution:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can vary.
If your files might contain lines other than the one you want, but that line can be unambiguously matched with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to separately tell it to not print the file name. The -x option can be used as a convenience to simplify the regex.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If you have some files which match the wildcard which should be excluded, try a more complex wildcard, or multiple wildcards which only match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
Process files in parallel but output results in order.
If the files contain just one line, and you want the whole thing, you can use bash range expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines, and you need to actually extract the value, use grep -o together with -h to suppress the file names (-o is not POSIX, but your grep probably has it).
grep -oh '[0-9]\.345e-2' /path/to/F{1..120}/Xyz.txt > output.txt

How can I delete the lines in a text file that exist in another text file [duplicate]

I have a large file A (consisting of emails), one line for each address. I also have another file B that contains another set of emails.
Which command would I use to remove from file A all the addresses that appear in file B?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this question may have been asked before, but I only found one command online, and it gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first...
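For instance, with process substitution:
comm -23 <(sort file1) <(sort file2)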
See the comm man page for details.
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
NR==FNR{a[$0];next} idiom is for storing the first file in an associative array as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the per-file line counter (FNR).
a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)
!($0 in a) we're now in the next file(s), in is a contains test, here it's checking whether current line is in the set we populated in the first step from the first file, ! negates the condition. What is missing here is the action, which by default is {print} and usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can do this even if your files are not sorted:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > tmp && mv tmp file-a
(Redirecting straight back to file-a would truncate it before diff reads it, hence the temporary file.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N=$N -v lookup="$LOOKUP" '
BEGIN { while ( getline < lookup ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
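For instance, a sketch of that modification with a small trim helper added (the helper name is made up here):
awk -v N=$N -v lookup="$LOOKUP" '
  function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
  # store the trimmed lookup lines as dictionary keys
  BEGIN { while ( getline line < lookup ) { dictionary[trim(line)] } }
  # print input lines whose trimmed column $N is not in the dictionary
  !(trim($N) in dictionary) { print }'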
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use -
diff fileA fileB | grep "^<" | cut -c3- > fileA.new && mv fileA.new fileA
This will work for files that are not sorted as well.
To add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Raising the power of set subtraction. (Note that a set preserves neither the original line order nor duplicate lines, unlike the grep approach.)
To get the file left after removing the lines which appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website through grep to strip the navigation elements, using lynx! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join.
grep is only practical when the file of patterns is small. Use -v along with -f:
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress printing of any column with the -1, -2 or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality join on the specified files. Its -v option also allows removing common lines between two files.
join -v1 -v2 file1 file2

How to print all lines of a file that do not contain a *partial* pattern

We know grep -v pattern file prints lines that do not contain pattern.
My file to search is a table:
Sample File, Sample Name, Panel, Marker, Allele 1, Allele 2, GQ,
M090972.s-206_B01.fsa, M090972-206, Sample ID-1, SNPchr1, C, T,0.9933,
I want to weed out the lines that contain "M090972-206" and some more patterns like that.
My search patterns come from a directory of text files:
$ ls 20170227_snap_genotypes_1_VCF
M070370-208_S1.genome.vcf M170276-201_S20.genome.vcf
M170308-201_S5.genome.vcf
Only the part of these filenames up to the first "_" appears in my table (or up to the first "." if I remove the ".s" in the example). It is not a constant number of characters. I wanted to remove the characters after the first "." but could not find a way in the sed and awk documentation.
Alternatively I tried using agrep 3.441 with the "-f" option for reading the patterns from a temporary file made with
$ ls "directory" > temp.txt
$ ./agrep -v -f temp.txt $infile >> $outfile
But agrep -f does not find any match (or everything with -v).
What am I missing? Is there a better way, perhaps with sed or awk?
If you are deriving your patterns from the names of the files (up to the first _) that exist in the 20170227_snap_genotypes_1_VCF directory, then you could do this:
# run from the parent of 20170227_snap_genotypes_1_VCF directory
grep -vf <(cd 20170227_snap_genotypes_1_VCF; ls | cut -f1 -d_) file
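If you also want the stripping the question asked about (removing everything after the first "." or "_"), cut or sed can do it; a couple of illustrative one-liners:
echo "M090972.s-206_B01.fsa"     | cut -d. -f1    # -> M090972
echo "M070370-208_S1.genome.vcf" | cut -d_ -f1    # -> M070370-208
echo "M070370-208_S1.genome.vcf" | sed 's/_.*//'  # -> M070370-208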
