How to use lines in a file as keyword for grep? - bash

I've searched lots of questions on here and other sites, and people have suggested things that should fix my problem, but I think there's something wrong with my code that I just don't recognize.
I have 24 .fasta files from NGS sequencing with reads that are 150bp long. There are approximately 1M reads in each file. The reads are from targeted sequencing where we electroporated vectors carrying cDNA for genes of interest and a unique barcode sequence. I need to look through the sequencing files for the presence or absence of each barcode sequence, which corresponds to a specific gene.
I have a .txt list of the barcode sequences that I want to pass to grep to look for the barcodes in the .fasta files. I've tried so many variations of this command. I can give grep each barcode individually, but that's very time consuming; I know it's possible to give it the list of barcode sequences, search each .fasta for each of the barcodes, and record how many times each barcode is found in each file.
Here's my code where I give it each barcode individually:
# Barcode 33
mkdir --mode 755 $dir/BC33
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep 'TATTAGAGTTTGAGAATAAGTAGT' > $dir/BC33/"$f"
done
I tried to adapt it so that I don't have to feed every barcode sequence in individually:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep -c -f BarcodeScreenSeq.txt | sort > $dir/Results/"$f"
echo "Finished $f"
done
But it is not searching for the barcode sequences. With this iteration it is just returning new files in the /Results directory that are empty. I also tried a nest loop, where I tried to make the barcode sequence a variable that changed like the $FILES, but that just gave me a new file with the names of my .fasta files:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
for b in `cat /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt`; do
cat "$f" | grep -c "$b" | sort > $dir/"$f"_Barcode
done ;
done
I want an output .txt file that has:
<barcode sequence>: <# of times that bc was found>
for each .fasta file, because I want to put all the samples together into one large Excel sheet showing each barcode and how many times it was found in each sample.
Please help, I've tried everything I can think of.
EDIT
Here is what the BarcodeScreenSeq.txt file would look like. It's just a txt file where each line is a barcode sequence:
head BarcodeScreenSeq.txt
TATTATGAGAAAGTTGAATAGTAG
ATGAAAGTTAGAGTTTATGATAAG
AATAGATAAGATTGATTGTGTTTG
TGTTAAATGTATGTAGTAATTGAG
ATAGATTTAAGTGAAGAGAGTTAT
GAATGTTTGTAAATGTATAGATAG
AAATTGTGAAAGATTGTTTGTGTA
TGTAAGTGAAATAGTGAGTTATTT
GAATTGTATAAAGTATTAGATGTG
AGTGAGATTATGAGTATTGATTTA
EDIT
lozzib@gliaserver:~/AG_Barcode_Seq$ file BarcodeScreenSeq.txt
BarcodeScreenSeq.txt: ASCII text, with CRLF line terminators

Windows Line Endings
Your BarcodeScreenSeq.txt has Windows line endings. Each line ends with the special characters \r\n. Linux tools such as grep expect Linux line endings (\n only) and therefore interpret your file ...
TATTATG\r\n
ATGAAAG\r\n
...
to look for the patterns TATTATG\r, ATGAAAG\r, ... (note the \r at the end). Because of the \r there is no match.
Either: Convert your file once by running dos2unix BarcodeScreenSeq.txt or sed -i 's/\r//g' BarcodeScreenSeq.txt. This will change your file.
Or: replace every BarcodeScreenSeq.txt in the following scripts by <(tr -d '\r' < BarcodeScreenSeq.txt). This won't change the file, but creates more overhead as the file is converted over and over again.
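For example, after the one-off conversion you can confirm the CRLF endings are gone by re-running file (assuming dos2unix is installed):
dos2unix BarcodeScreenSeq.txt
file BarcodeScreenSeq.txt
# now reports "BarcodeScreenSeq.txt: ASCII text" without the CRLF note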
Command
grep -c has only one counter. If you pass multiple search patterns at once (for instance using -f BarcodeScreenSeq.txt) you still get only one number for all patterns together.
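A toy illustration of that single-counter behaviour (the three input lines are made up):
printf 'AAA\nBBB\nAAA\n' | grep -c -e AAA -e BBB
# prints 3: one combined count of matching lines, not 2 and 1 separately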
To count the occurrences of each pattern individually you can use the following trick:
for file in *.fasta; do
    grep -oFf BarcodeScreenSeq.txt "$file" |
        sort | uniq -c |
        awk '{print $2 ": " $1 }' > "Results/$file"
done
grep -o will print each match as a single line.
sort | uniq -c will count how often each line occurs.
awk is only there to change the format from #matches pattern to pattern: #matches.
Benefit: The command should be fairly fast.
Drawback: Patterns from BarcodeScreenSeq.txt that are not found in $file won't be listed at all. Your result will leave out lines of the form pattern: 0.
If you really need the lines of the form pattern: 0 you could use another trick:
for file in *.fasta; do
    grep -oFf BarcodeScreenSeq.txt "$file" |
        cat - BarcodeScreenSeq.txt |
        sort | uniq -c |
        awk '{print $2 ": " ($1 - 1) }' > "Results/$file"
done
cat - BarcodeScreenSeq.txt will insert the content of BarcodeScreenSeq.txt at the end of grep's output such that #matches is one bigger than it should be. The number is corrected by awk.
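Since the stated goal is one big Excel sheet across all samples, a possible follow-up step (a rough sketch; the all_samples.tsv name and the sample/barcode/count column layout are just illustrative) is to flatten the per-sample result files into a single tab-separated table that Excel can open and pivot:
for f in Results/*.fasta; do
    sample=$(basename "$f" .fasta)
    # turn "barcode: count" lines into "sample<TAB>barcode<TAB>count"
    awk -v s="$sample" -F': ' '{print s "\t" $1 "\t" $2}' "$f"
done > all_samples.tsv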

You can read a text file one line at a time and process each line separately using a redirect, like so:
for f in *.fasta; do
    while read -r seq; do
        # label each count with its barcode so the output is "<barcode>: <count>"
        printf '%s: %s\n' "${seq}" "$(grep -c "${seq}" "${f}")"
    done < /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt > "${dir}"/"${f}"_Barcode
done

Related

Concatenate files based on numeric sort of name substring in awk w/o header

I am interested in concatenating many files together based on their numeric order, and also removing the first line of each.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When I run the two commands awk and ls -v1 separately, each does its job properly, but when I put them together, it doesn't work. Please help, thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile
The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
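Spelled out, that variant is simply:
awk 'FNR > 1' chr{1..99}_smallfiles > bigfile
Brace expansion generates the names in numeric order (1 through 99), so no extra sorting is needed.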
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do # 99 is the maximum file number
[ -f "$file" ] || continue # skip missing files
awk 'FNR > 1' "$file"
done > bigfile
You can also use tail to concatenate all the files without header
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in a natural sort order as described in your question, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy) xargs -d $'\n' sets the delimiter to a newline \n in case the filename contains white spaces or quote characters
Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
#   files=( [1]=chr1_smallfiles.txt
#           [10]=chr10_smallfiles.txt
#           [9]=chr9_smallfiles.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
    awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[@]}" | sort -n -z) >bigfile
You can do with a for loop like below, which is working for me:-
for file in chr*_smallfiles
do
    tail -n +2 "$file" >> bigfile
done
How will it work? The for loop reads all the files in the current directory matching the wildcard pattern chr*_smallfiles and assigns each file name to the variable file; tail -n +2 "$file" then outputs every line of that file except the first and appends it to bigfile. So finally all the files are merged (except the first line of each) into one file, bigfile.
Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
    sed -n '2,$p' "$file" >> bigfile
done
Hope it helps!

Removing the first line of each file from a wildcard?

I am trying to copy about 100 CSVs into a PostgreSQL database. The CSVs aren't formed perfectly for the database, so I have to do some editing, which I am trying to do on the fly with piping.
Because each CSV file has a header, I need to remove the first line to prevent the headers from being copied into the database as an entity.
My attempt at this was the following:
sed -e "s:\.00::g" -e "s/\"\"//g" *.csv | tail -n +2 | cut -d "," -f1-109 |
psql -d intelliflight_pg -U intelliflight -c "\COPY flights FROM stdin WITH DELIMITER ',' CSV"
The problem I'm having with this is that it treats *.csv as a single file, and only removes the first line of the first file it sees, and leaves the rest of the files alone.
How can I get this to remove the first line of each individual file retrieved by the *.csv wildcard?
You can combine the sed and tail steps and use find to have per-file processing, then pipe the output of that to cut and psql:
find -name '*.csv' -exec sed '1d;s/\.00//g;s/""//g' {} \; | cut ...
This uses sed to delete the first line of each file and perform the substitutions on the remaining lines. Each file is processed separately, and the combined output is piped to cut and the rest of your commands.
Notice the single quotes around the sed argument, simplifying things somewhat with the quoting.
This also processes .csv files in subdirectories; if you don't want that, you have to limit recursion depth with
find -maxdepth 1 -name etc.
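Putting it together for the non-recursive case, the full pipeline might look roughly like this (the cut and psql stages are carried over unchanged from the question):
find . -maxdepth 1 -name '*.csv' -exec sed '1d;s/\.00//g;s/""//g' {} \; |
    cut -d "," -f1-109 |
    psql -d intelliflight_pg -U intelliflight -c "\COPY flights FROM stdin WITH DELIMITER ',' CSV"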
Can't test it right now but this should do :
awk -F, -v OFS=, '
FNR == 1 {next}
{
    gsub(/\.00/, "")
    gsub(/""/, "")
    NF = 109
    print
}
' *.csv | psql ..
The NF = 109 assignment drops any fields after the 109th; setting OFS to a comma keeps the rebuilt record comma-separated for psql.
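A small illustration of that field-dropping behaviour with GNU awk (toy input, keeping only the first two fields):
echo 'a,b,c,d' | awk -F, -v OFS=, '{NF = 2; print}'
# prints: a,b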

File Name comparison in Bash

I have two files, each containing a list of file names. I need to check which files are missing from the list in the second file. The problem is that I do not have to match the full name, but only the last 19 characters of the file names.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are the same file.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} #save last 19 characters for every line in file2
{if(!a[substr($0, start)]) print $0;} #If that suffix is not present in file2, print that line.
' file2.list file.list
First you can use comm to match the exact file names and obtain a list of files not matching. Then you can use agrep. I've never used it, but you might find it useful.
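A minimal sketch of the comm step, assuming the two lists live in file.list and file2.list as in the awk answer above (comm needs sorted input):
comm -23 <(sort file.list) <(sort file2.list)   # exact names present in file.list but missing from file2.list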
Or, as last option, you can do a brute force and for every line in the first file search into the second:
#!/bin/bash
# Iterate through the first file
while read LINE; do
# Find the section of the filename that has to match in the other file
CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
# Create a regex to match the filenames in the second file
SEARCH_REGEX="^.*$CHECK_SECTION$"
# Search...
egrep "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2 then if the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp-grouped sets of files exist in FILE1 but not FILE2; as I understand it, that is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev calls, and the cut commands can just be amended to refer to fixed character positions.
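For example, with the 31-character names shown in the question, the last 19 characters always occupy positions 13 to 31, so the extraction could be written without rev (illustrative only, assuming every name really is that length):
cut -c 13-31 "$FILE1" | sort | uniq > "${tmp1}"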
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" ${tmp1}
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
Or you could use your standard coreutil tools:
for i in $(cat file1 file2 | sort | uniq -u); do
    grep -q "$i" file1 && \
        echo "file2 missing '$i'" || \
        echo "file1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
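For instance, bash substring expansion with a negative offset pulls out the last 19 characters of a name held in a variable (the example filename is taken from the question; note the space before -19):
name="MyFile12343220150510230000.xlsx"
echo "${name: -19}"    # prints 20150510230000.xlsx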

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string="line of input from CSV file"   # placeholder: one line read from the CSV
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
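A quick illustration of that caveat (made-up line):
echo '"Doe, John",42' | awk -F "," '{print NF-1}'
# prints 2, even though the line only has two CSV fields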
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[#]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since it's going to be installed on most modern shells) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from those counts (so if your lines all contain the same number of commas, you'll get a set with just that single number in it).
Just remove all of the carriage returns:
tr -d "\r" old_file > new_file

"while read LINE do" and grep problems

I have two files.
file1.txt:
Afghans
Africans
Alaskans
...
where file2.txt contains the output from a wget on a webpage, so it's a big sloppy mess, but does contain many of the words from the first list.
Bashscript:
cat file1.txt | while read LINE; do grep $LINE file2.txt; done
This did not work as expected. I wondered why, so I echoed out the $LINE variable inside the loop and added a sleep 1, so I could see what was happening:
cat file1.txt | while read LINE; do echo $LINE; sleep 1; grep $LINE file2.txt; done
The output in the terminal looks something like this:
Afghans
Africans
Alaskans
Albanians
Americans
grep: Chinese: No such file or directory
: No such file or directory
Arabians
Arabs
Arabs/East Indians
: No such file or directory
Argentinans
Armenians
Asian
Asian Indians
: No such file or directory
file2.txt: Asian Naruto
...
So you can see it did finally find the word "Asian". But why does it say:
No such file or directory
?
Is there something weird going on or am I missing something here?
What about
grep -f file1.txt file2.txt
@OP, first use dos2unix as advised. Then use awk:
awk 'FNR==NR{a[$1];next}{ for(i=1;i<=NF;i++){ if($i in a) {print $i} } } ' file1 file2_wget
Note: using while loop and grep inside the loop is not efficient, since for every iteration, you need to invoke grep on the file2.
@OP, crude explanation:
For the meaning of FNR and NR, please refer to the gawk manual. FNR==NR{a[$1];next} means reading the contents of file1 into array a. When FNR is not equal to NR (which means the 2nd file is being read now), it checks whether each word in the file is in array a; if it is, it is printed (the for loop is used to iterate over the words of each line).
Use more quotes and use less cat
while IFS= read -r LINE; do
grep "$LINE" file2.txt
done < file1.txt
As well as the quoting issue, the file you've downloaded contains CRLF line endings which are throwing read off. Use dos2unix to convert file1.txt before iterating over it.
Although using awk is faster, grep produces a lot more detail with less effort. So, after running dos2unix, use:
grep -F -i -n -f <file_containing_pattern> <file_containing_data_blob>
You will have all the matches + line numbers (case insensitive)
At minimum this will suffice to find all the words from file_containing_pattern:
grep -F -f <file_containing_pattern> <file_containing_data_blob>
