I have 100 files in a directory, and a text file that lists out 35 of these files.
#### Directory
apple carrot orange pears bananas
### Text file
apple
carrot
orange
I would like to use this text file of filenames, compare it against the directory, and write the unmatched filenames to a separate file. The result will be a file that lists them like below:
## Unmatched text file
pears
bananas
I know how to do this with find when the search term is a particular string, but I could not figure out this case.
Assume that the text file contains a subset of the files in the directory. Also assume that the file is called list.txt and the directory is called dir1. Then the following will work:
(cat list.txt; ls -1 dir1) | sort | uniq -u
Explanation
The command (cat list.txt; ls -1 dir1) starts a subshell and executes the cat and ls commands.
The combined output is then sorted, and uniq -u picks out the lines that are unique (not duplicated).
I believe this is what you want. If that works, you can redirect into another file:
(cat list.txt; ls -1 dir1) | sort | uniq -u > list2.txt
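An equivalent sketch under the same assumption (list.txt is a subset of dir1's listing) uses comm, which compares two sorted inputs; comm -13 keeps the lines that appear only in the second one:

comm -13 <(sort list.txt) <(ls -1 dir1 | sort) > list2.txt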
I have lots of files in year directories, and each file contains long lines like this, for example:
home/2001/2001ab.txt
the AAAS kill every one not but me and you and etc
the A1CF maybe color of full fill zombie
home/2002/2002ab.txt
we maybe know some how what
home/2003/2003ab.txt
Mr, Miss boston, whatever
aaas will will will long long
and in the home directory I have home/reference.txt (a list of words):
A1BG
A1CF
A2M
AAAS
I'd like to count how many of the words in the file reference.txt appear in every single year file.
This is my code, which I run in each year directory
(home/2001/, home/2002/, home/2003/):
# awk
function search () {
    awk -v pattern="$1" '$0 ~ pattern {print}' *.txt > "$1"
}
# load reference.txt
for i in $(cat reference.txt)
do
    search "$i"
done
# word count
wc -l * > line-count.txt
This is my result:
home/2001/A1BG
$ cat A1BG
0
home/2001/A1CF
$ cat A1CF
1
home/2001/A2M
$ cat A2M
0
home/2001/AAAS
$ cat AAAS
1
home/2001/line-count.txt
$ cat line-count.txt
2001ab.txt 2
A1BG 0
A1CF 1
A2M 0
AAAS 1
The resulting line-count.txt file has all the information I want,
but I have to repeat this work manually:
cd into a directory,
run my code,
and then cd into the next directory.
I have around 500 directories and files, so it is not easy.
The second problem is the wasteful bunch of files:
the script creates lots of files and takes too much time.
Because of this, at first I wanted to use the grep command,
but I didn't know how to use a list of words instead of a single word;
that is why I used awk.
How can I do this more simply?
at first I wanted to use the grep command but I didn't know how to use a
list of words instead of a single word
You might use the --file=FILE option for that purpose; the selected file should hold one pattern per line.
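For instance, with the question's layout, a hedged one-liner, assuming it is run inside a year directory such as home/2001 (so reference.txt sits one level up), prints every line that matches any word from the list:

grep --file=../reference.txt *.txt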
How can I do this more simply?
You might use the --count option to avoid the need for wc -l. Consider the following simple example: let file.txt content be
123
456
789
and file1.txt content be
abc123
def456
and file2.txt content be
ghi789
xyz000
and file3.txt content be
xyz000
xyz000
then
grep --count --file=file.txt file1.txt file2.txt file3.txt
gives output
file1.txt:2
file2.txt:1
file3.txt:0
Observe that no files are created and that a file without matches still appears in the output. Disclaimer: this solution assumes file.txt does not contain characters with special meaning for GNU grep; if this does not hold, do not use this solution.
(tested in GNU grep 3.4)
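To avoid cd-ing into each of the roughly 500 directories by hand, the same command can be driven by a loop from the top level. A minimal sketch, assuming the home/YEAR/*.txt layout from the question, started from the directory that contains home/; the per-year output file names are an illustrative choice:

#!/bin/bash
cd home || exit 1
for dir in */; do
    # -H forces the file name prefix even when a directory holds a single file;
    # --count with --file prints one "file:count" line per *.txt file
    grep -H --count --file=reference.txt "$dir"*.txt > "${dir%/}-line-count.txt"
done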
I have a list of files in a folder that need to be piped through to more commands. If I know the position of the files when using ls -v file_*.nc, is it possible to remove/ignore files based on their position? So if ls -v file_*.nc returns 300 files, and I want files 8, 73, and 151 removed from the pipe, I could do something like ls -v file_*.nc | {remove 8,73,151} | do other stuff.
I don't want to delete/move the files, I just don't want them piped through to the next command.
If you want to filter files out of the input, as you said ("is it possible to remove/ignore files"),
you can use grep -v <PATTERN>, where the -v option inverts the match (it selects non-matching lines).
Input files:
ls -v1 txt*
txt
txt-1
txt-2
txt-3
txt-4
txt-5
txt-6
txt-7
txt-8
txt-9
txt-10
Then ignore any file whose name contains a 7, 8, or 9:
ls -v txt* | grep -v '[789]'
txt
txt-1
txt-2
txt-3
txt-4
txt-5
txt-6
txt-10
Removed/ignored:
txt-7
txt-8
txt-9
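If the filtering really has to be positional rather than name-based (as the 8, 73, 151 example in the question suggests), a minimal sketch with awk's line counter does it without touching the file names; sed '8d;73d;151d' in place of the awk would behave the same:

# pass through every line of the listing except lines 8, 73 and 151
ls -v file_*.nc | awk 'NR != 8 && NR != 73 && NR != 151'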
I'm modifying the output of the ls command. I have put together a long line of code that does all I want, but I feel it could be improved a lot, in many ways, so I am asking for your opinion on it. I will explain each command so you know exactly what I wanted to do with it.
Here is my long line of code:
ls -l |
sed 's/total/                                            C FILES.c/' |
cut -b 45-100 |
grep -e "\.c" |
sed 's/C FILES.c/\nC FILES\n/' |
sed 's/[0-9]//' |
sed 's/[0-9]//' |
sed 's/[0-9]//'
ls -l displays a "total" line with a number, and then displays one file per line.
sed 's/total/ ... C FILES.c/': this long part is supposed to replace the "total" line with a title. As you will see in the next command, I had to indent it this way so that the title is in the same column as the names of the files displayed by ls -l. The number after "total" will be taken care of in a later command.
cut -b 45-100 cuts all the stuff that comes before the names of the files. As I padded "C FILES.c" with spaces in the previous command, the cut removes the padding and shows only the following characters (the 56-byte range is large enough to contain any file name I use).
grep -e "\.c" deletes the files that don't end with .c. As the title, C FILES, is not a .c file, I had to write it with a .c suffix so that it doesn't disappear when the grep command runs.
sed 's/C FILES.c/\nC FILES\n/' deletes the .c suffix and adds some newline characters to separate the command line, the title, and the list of files.
The second command replaced "total" with the "C FILES" title, but it didn't replace the number that follows "total", so that number is still there and I want to get rid of it. The command doesn't know what number it will have to delete; it could be 1, 45 or even 666. I used the command sed 's/[0-9]//' three times to delete a digit between 0 and 9 three times (if there is one; otherwise it does nothing). If the number is longer than three digits, some digits will remain, but I don't think I'll ever have a total of more than 999, so repeating this command three times should be enough. I tried 's/[0-9][0-9][0-9]//' but it only works if the number is at least three digits long; otherwise, it does nothing.
That's the best I could do. I tried using the ls command with no options, but none of the following commands worked well after that, and I didn't know how to make it output a title since there was no line to replace.
Any idea or advice on how to make it better and easier to read is welcome.
Without knowing your exact expected output (please provide that), my feeling is that you could replace all that with:
echo C FILES; ls -1 *.c
For me that gives:
C FILES
foo.c
norm.c
readfile.c
size.c
thr.c
Is this the output you're looking for? If not, what is?
The way I would do what it looks like you're doing feels rather simple. As far as I can tell, you want to simply list .c files under the title "C Files"?
If I'm right, then why not just do something like this:
echo "C Files" && ls *.c
or if you want it in a more "pipe"-able format:
ls *.c | cat <(echo "C Files") -
This second variant can then be sent to a file in the exact same format like this:
ls *.c | cat <(echo "C Files") - > output.txt
To get the same output but with the command input reordered, both parts can be passed as process substitutions:
cat <(echo "C Files") <(ls *.c)
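As one more sketch: parsing ls output is often discouraged, and the shell's own globbing can produce the same listing, assuming at least one .c file exists (otherwise the literal pattern *.c is printed unless nullglob is set):

# print the title, then one .c file name per line, without ls
printf '%s\n' "C FILES" *.c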
I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (meaning the file content). The numbers here represent branch lengths and will not be the same in any of the files, so I would like to classify the files based on the letters A to H. Say, for instance, all the files that have the letters A to H arranged in the same order should be sorted into the same folder. For example:
For File 1, the pattern, ignoring the numbers (branch lengths), will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know how to sort files based on a particular pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can get the content patterns and sort on them:
$ for f in file{1..2};
do printf "%s\t" "$f"; tr -d '[ 0-9.]' <"$f";
done |
sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Same patterns will be consecutive. This assumes you have one record (one tree) per file.
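To turn the sorted listing into actual folders, here is a hedged sketch that buckets each file by its branch-length-free pattern, using an md5sum of the pattern as the folder name; the checksum-as-name scheme and the availability of GNU md5sum are assumptions for illustration:

#!/bin/bash
for f in file*; do
    pattern=$(tr -d '[ 0-9.]' < "$f")   # strip branch lengths, as above
    dir=$(printf '%s' "$pattern" | md5sum | cut -d' ' -f1)
    mkdir -p "$dir"                     # one folder per distinct pattern
    mv -- "$f" "$dir"/
done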
I'm new to Bash and I need help.
I need to create a shell script that compares two gzipped tar archives. For each file or directory in each archive (including those inside archived subdirectories), the script shall verify whether a file/directory of the same name exists in the other archive. In the case of a missing directory, missing files or subdirectories within that directory shall be ignored. The script shall list the names of all files which do not have a matching equivalent in the other archive.
The expected output when comparing archives arch1.tar.gz and arch2.tar.gz, with differing files aa/a.txt and bb/b.txt in arch1.tar.gz and c.txt in arch2.tar.gz:
arch1.tar.gz:aa/a.txt
arch1.tar.gz:bb/b.txt
arch2.tar.gz:c.txt
Here is what I have:
#!/bin/bash
# list the contents of both archives
tar tf "$1" >> list1.txt
tar tf "$2" >> list2.txt
# entries present only in the first archive
comm -23 <(sort list1.txt | uniq) <(sort list2.txt | uniq)
diff list1.txt list2.txt >> contestboth
The thing is that I can't figure out how to produce that output.
Try this:
diff <(sort -u list1.txt) <(sort -u list2.txt)
This starts two subprocesses (the two sort commands) and associates their output with file descriptors. The syntax <(...) expands to a file name representing such a file descriptor (something like /dev/fd/63). So in the end, diff is called with two files which, when read, (seem to) contain the output of the two processes.
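You can see the expansion directly; for example, in bash

echo <(true)

prints something like /dev/fd/63.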
This method works fine for programs which read a file strictly linearly. Seeking in the "file" is not possible, of course.
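Putting the pieces together, here is a minimal sketch of the whole task that also produces the archive:file output format asked for in the question. It assumes GNU tar, the script name is illustrative, and the "ignore the contents of a missing directory" rule is deliberately not handled:

#!/bin/bash
# usage: ./compare-archives.sh arch1.tar.gz arch2.tar.gz
# comm -23 keeps lines only in the first input, comm -13 only in the second
comm -23 <(tar tzf "$1" | sort -u) <(tar tzf "$2" | sort -u) | sed "s|^|$1:|"
comm -13 <(tar tzf "$1" | sort -u) <(tar tzf "$2" | sort -u) | sed "s|^|$2:|"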