Grep-ing a list of filenames against a CSV list of names - shell

I have a CSV file containing a list of IDs (numbers), one per row. Let's call that file ids.csv.
In a directory I have a large number of files named "file_123456_smth.csv", where 123456 is an ID that may appear in ids.csv.
What I am trying to achieve: compare the names of the files with the IDs stored in ids.csv. If 123456 is found in ids.csv, then the filename should be displayed.
What I've tried:
ls -a | xargs grep -L cat ../../ids.csv
Of course, this does not work, but gives an idea of my direction.

Let's see if I understood you correctly...
$ cat ids.csv
123
456
789
$ ls *.csv
file_123_smth.csv file_321_smth.csv file_789_smth.csv ids.csv
$ ./c.sh
123 found in file_123_smth.csv
789 found in file_789_smth.csv
where c.sh looks like this:
#!/bin/bash
ID="ids.csv"
for file in *.csv
do
    if [[ $file =~ file ]]   # only process files whose name contains "file"
    then
        id=$(cut -d_ -f2 <<< "$file")   # grab the ID between the underscores
        grep -qx "$id" "$ID" && echo "$id found in $file"   # -x: whole-line match, so 123 cannot match 1234
    fi
done
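If you'd rather avoid the loop, here is a minimal sketch using grep -f (assuming the IDs really are one per line in ids.csv and the filenames follow the file_<id>_ pattern):
# Turn each ID into the literal string "file_<id>_", then match the
# file listing against those fixed strings.
sed 's/.*/file_&_/' ids.csv | grep -Ff - <(printf '%s\n' file_*.csv)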

Related

Why is this bash loop failing to concatenate the files?

I am at my wits' end as to why this loop is failing to concatenate the files the way I need it to. Basically, let's say we have the following files:
AB124661.lane3.R1.fastq.gz
AB124661.lane4.R1.fastq.gz
AB124661.lane3.R2.fastq.gz
AB124661.lane4.R2.fastq.gz
What we want is:
cat AB124661.lane3.R1.fastq.gz AB124661.lane4.R1.fastq.gz > AB124661.R1.fastq.gz
cat AB124661.lane3.R2.fastq.gz AB124661.lane4.R2.fastq.gz > AB124661.R2.fastq.gz
What I tried (and what didn't work):
Create and save the file names (AB124661) to an ID file:
ls -1 *R1*.gz | awk -F '.' '{print $1}' | sort | uniq > ID
This creates an ID file that stores the sample/file names.
Run the following loop:
for i in `cat ./ID`; do cat $i\.lane3.R1.fastq.gz $i\.lane4.R1.fastq.gz \> out/$i\.R1.fastq.gz; done
for i in `cat ./ID`; do cat $i\.lane3.R2.fastq.gz $i\.lane4.R2.fastq.gz \> out/$i\.R2.fastq.gz; done
The loop fails and concatenates into empty files.
Things I checked:
Yes, the ID file is definitely in the folder.
When I run with echo, it shows the cat commands correctly.
Any help will be very much appreciated.
Why are you escaping the >? That's going to result in cat: '>': No such file or directory instead of a redirection.
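To see the difference (a quick illustration):
$ echo hi \> out    # escaped: '>' is passed to echo as a literal argument
hi > out
$ echo hi > out     # unescaped: the shell performs the redirection and creates "out"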
Don't read lines with for:
while IFS= read -r id; do
    cat "${id}.lane3.R1.fastq.gz" "${id}.lane4.R1.fastq.gz" > "out/${id}.R1.fastq.gz"
    cat "${id}.lane3.R2.fastq.gz" "${id}.lane4.R2.fastq.gz" > "out/${id}.R2.fastq.gz"
done < ./ID
Let's say you have one ID per line stored in the file ./ID:
while read -r line; do
    cat "$line".lane3.R1.fastq.gz "$line".lane4.R1.fastq.gz > "$line".R1.fastq.gz
    cat "$line".lane3.R2.fastq.gz "$line".lane4.R2.fastq.gz > "$line".R2.fastq.gz
done < ./ID
A pure shell solution could look like this:
for file in *.fastq.gz; do
    id=${file%%.*}   # strip everything from the first dot, leaving the sample name
    [ -e "$id".R1.fastq.gz ] || cat "$id".*.R1.fastq.gz > "$id".R1.fastq.gz
    [ -e "$id".R2.fastq.gz ] || cat "$id".*.R2.fastq.gz > "$id".R2.fastq.gz
done
Alternatively:
printf '%s\n' *.fastq.gz | cut -d. -f1 | sort -u |
while IFS= read -r id; do
    cat "$id".*.R1.fastq.gz > "$id".R1.fastq.gz
    cat "$id".*.R2.fastq.gz > "$id".R2.fastq.gz
done
This solution assumes filenames of interest don't contain newline characters.

Occurrence of a string in all the file names within a folder in Bash

I am trying to write a script which allows me to select files with 2 or more occurrences of a string in their name.
Example:
test.txt // 1 occurrence only, not ok
test-test.txt // 2 occurrences OK
test.test.txt // 2 occurrences OK
I want the script to return only files 2 and 3. I tried this, but it didn't work:
rows=$(ls | grep "test")
for file in $rows
do
    if [[ $(wc -w $file) == 2 ]]; then
        echo "the file $file matches"
    fi
done
grep and wc are overkill. A simple glob will suffice:
*test*test*
You can use this like so:
ls *test*test*
or
for file in *test*test*; do
    echo "$file"
done
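One caveat: if nothing matches, bash leaves the pattern *test*test* unexpanded and the loop runs once with that literal string; running shopt -s nullglob first makes the loop body simply not run in that case.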
You can use:
result=$(grep -o "test" <<< "$filename" | wc -l)
grep -o prints each match on its own line, and wc -l counts those lines, so $result is the number of occurrences of "test" in the name.
In your shell script, if $result > 1, do stuff...
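Putting it together, a minimal sketch (assuming the filenames contain no newlines):
# Count occurrences of "test" in each name; keep names with 2 or more.
for file in *test*; do
    count=$(grep -o "test" <<< "$file" | wc -l)
    if [ "$count" -ge 2 ]; then
        echo "the file $file matches"
    fi
done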

Is it possible to generate a checksum (MD5) of a string in a shell?

I would like to have a unique ID per filename, so I can iterate over the IDs and compare the checksums of the files.
Is it possible to generate a checksum from the name of a file, so I have a unique ID per filename?
I would welcome other ideas.
Is this what you want?
Plain string:
serce#unit:~$ echo "Hello, checksum!" | md5sum
9f898618b071286a14d1937f9db13b8f -
And file content:
serce#unit:~$ md5sum agent.yml
3ed53c48f073bd321339cd6a4c716c17  agent.yml
Yes, it is possible using md5sum; basename "$0" gives the name of the current script.
Assuming I have the script below, named md5Gen.sh:
#!/bin/bash
md5string=$(basename "$0" | md5sum)   # note: the newline basename appends is part of the hashed data
echo "$(basename "$0") $md5string"
Running the script would give me
md5Gen.sh 911949bd2ab8467162e27c1b6b5633c0 -
Yes, it is possible to obtain the MD5 of a string:
$ printf '%s' "This-Filename" | md5sum
dd829ba5a7ba7bdf7a391f2e0bd7cd1f -
It is important to understand that there is no newline at the end of the printed string. An equivalent in bash would be to use echo -n:
$ echo -n "This-Filename" | md5sum
dd829ba5a7ba7bdf7a391f2e0bd7cd1f -
The -n (valid in bash) is important because otherwise your hash would change with the inclusion of a newline that is not part of the text:
$ echo "This-Filename" | md5sum
7ccba9dffa4baf9ca0e56c078aa09a07 -
That also applies to file contents:
$ echo -n "This-Filename" > infile
$ md5sum infile
dd829ba5a7ba7bdf7a391f2e0bd7cd1f infile
$ echo "This-Filename" > infile
$ md5sum infile
7ccba9dffa4baf9ca0e56c078aa09a07 infile
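Tying this back to the original goal, a minimal sketch (the *.csv glob is only an example) that derives an ID from each filename and pairs it with the checksum of that file's contents:
for f in *.csv; do
    name_id=$(printf '%s' "$f" | md5sum | cut -d' ' -f1)   # ID derived from the name alone
    content_sum=$(md5sum "$f" | cut -d' ' -f1)             # checksum of the contents
    echo "$name_id $content_sum $f"
done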

Compare File headers using Awk or Cmp

I have many flat files in one directory.
Each file has a header and some data in it.
I want to compare the header of one file with the headers of all the other files in that directory.
This can be achieved with shell scripting, but I want to do it in a single line of code.
I tried the following, but it compares the whole file, not just the header:
for i in `ls -1 *a*` ; do cmp a.dat $i ; done
Can someone please let me know how I can do that?
Can it also be achieved using awk?
I just need to check whether the headers match or not.
I would try this: grab the first line of every file, extract the unique lines, and count them. The result should be one.
number_uniq=$( awk 'FNR == 1' * | sort -u | wc -l )   # FNR == 1 prints the first line of every file
That won't tell you which file is different.
files=(*)
reference_header=$( sed '1q' "${files[0]}" )
for file in "${files[@]:1}"; do
    if [[ "$reference_header" != "$( sed '1q' "$file" )" ]]; then
        echo "wrong header: $file"
    fi
done
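Since the question also asks about awk, here is a one-liner sketch that takes the first file on the command line as the reference:
awk 'FNR == 1 { if (NR == 1) ref = $0; else if ($0 != ref) print FILENAME ": header differs" }' *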
For what you describe, you can use md5 or cksum to take a signature of the bytes in the header.
Given 5 files (note that File 4.txt does not match):
$ for fn in *.txt; do echo "$fn:"; cat "$fn"; printf "\n\n"; done
File 1.txt:
what a great ride! it is a lovely day
/tmp/files/File 1.txt
File 2.txt:
what a great ride! it is a lovely day
/tmp/files/File 2.txt
File 3.txt:
what a great ride! it is a lovely day
/tmp/files/File 3.txt
File 4.txt:
what an awful ride! it is a horrible day
/tmp/files/File 4.txt
reference.txt:
what a great ride! it is a lovely day
/tmp/files/reference.txt
You can use md5 to get a signature and then check whether the other files are the same.
First get the reference signature:
$ sig=$(head -1 reference.txt | md5)
$ echo $sig
549560de062a87ec69afff37abe18d8f
Then loop through the files:
for fn in *.txt; do
    if [[ "$sig" != "$(head -1 "$fn" | md5)" ]]; then
        echo "header of \"$fn\" does not match"
    fi
done
Prints:
header of "File 4.txt" does not match

head output till a specific line

I have a bunch of files in the following format.
A.txt:
some text1
more text2
XXX
more text
....
XXX
.
.
XXX
still more text
text again
Each file has at least 3 lines that start with XXX. Now, for each file A.txt, I want to write all the lines up to and including the 3rd occurrence of XXX (in the above example, everything up to the line before "still more text") to a file A_modified.txt.
I want to do this in bash, and came up with grep -n -m 3 -w "^XXX$" * | cut -d: -f2 to get the corresponding line number in each file.
Is it possible to use head along with these line numbers to generate the required output?
PS: I know a simple Python script would do the job, but I am trying to do this in bash for no specific reason.
A simpler method would be to use awk. Assuming there's nothing but files of interest in your present working directory, try:
for i in *; do awk 'c < 3; /^XXX$/ { c++ }' "$i" > "$i.modified"; done
Or if your files are very big:
for i in *; do awk '{ print } /^XXX$/ && ++c == 3 { exit }' "$i" > "$i.modified"; done
head -n will print out the first 'n' lines of the file
#!/bin/sh
for f in *.txt; do
    echo "searching $f"
    line_number=$(grep -n -m 3 -w "^XXX$" "$f" | cut -d: -f1 | tail -1)
    # line_number now stores the line number of the 3rd XXX
    # dump the first line_number lines of this file into A_modified.txt
    head -n "$line_number" "$f" > "${f%.txt}_modified.txt"
done
