Trying to create a script that counts the length of a all the reads in a fastq file but getting no return - bash

I am trying go count the length of each read in a fastq file from illumina sequencing and outputting this to a tsv or any sort of file so I can then later also look at this and count the number of reads per file. So I need to cycle down the file and eactract each line that has a read on it (every 4th line) then get its length and store this as an output
num=2
for file in *.fastq
do
echo "counting $file"
function file_length(){
wc -l $file | awk '{print$FNR}'
}
for line in $file_length
do
awk 'NR==$num' $file | chrlen > ${file}read_length.tsv
num=$((num + 4))
done
done
Currently all I get the counting $file and no other output but also no errors

Your script contains a lot of errors in both syntax and algorithm. Please try shellcheck to see what is the problem. The most issue will be the $file_length part.
You may want to call a function file_length() here but it is just
an undefined variable which is evaluated as null in the for loop.
If you just want to count the length of the 4th line of *.fastq files,
please try something like:
for file in *.fastq; do
awk 'NR==4 {print length}' "$file" > "${file}_length.tsv"
done
Or if you want to put the results together in a single tsv file, try:
tsvfile="read_lenth.tsv"
for file in *.fastq; do
echo -n -e "$file\t" >> "$tsvfile"
awk 'NR==4 {print length}' "$file" >> "$tsvfile"
done
Hope this helps.

Related

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
(
#find duplicated read names and write to file $f.txt
sort $f | uniq -d | grep ^# > "$f".txt
#loop over each duplicated readname
while read in; do
rname=$in
i=1
#while this readname still exists in the file increment and replace
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
rm "$f".txt
rm "$f".bu
done
echo "done" >> progress.txt
)&
background=( $(jobs -p) )
if (( ${#background[#]} ==40)); then
wait -n
fi
done
The problem with it is that its impractically slow. I ran it on a 48 core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
rname=$in
i=1
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
To give you an idea of what the files look like below are a few lines from one of them. The thing is that even though it works for the one file, it's slow. Like multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs I have no idea what was happening Maybe it was running into memory collisions or something when they were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
Again, it is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
#dbg if (dbg) print "#dbg:prev=" prev
if (/^#/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
if (/^#/ && (prev==$1 || NR==1) ) {
prev=$1
n=split($1,tmpArr,":") ; n++
#dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
fixNum++;tmpArr[6]=fixNum;
# magic to rebuild $1 here
for (i=1;i<n;i++) {
tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
}
$1=tmpFix ; $0=$0
print $0
}
else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file and then try to write, SO you can lose all the data of a file trying to use the same name.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^# > dups.txt
while read in; do
if [ $((i%4))=0 ] && grep -q "$in" dups.txt; then
x="$in"
x=${x/"0:0 "/$i":0 "}
echo "$x" >> $1"fixed.txt"
else
echo "$in" >> $1"fixed.txt"
fi
let "i+=1"
done < $1

Output matching lines in linux

I want to match the numbers in the first file with the 2nd column of second file and get the matching lines in a separate output file. Kindly let me know what is wrong with the code?
I have a list of numbers in a file IDS.txt
10028615
1003
10096344
10100
10107393
10113978
10163178
118747520
I have a second File called src1src22.txt
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL1189709 11740251
CHEMBL405440 44297517
CHEMBL310280 10335685
expected newoutput.txt
CHEMBL3549542 118747520
I have written this code
while read line; do cat src1src22.txt | grep -i -w "$line" >> newoutput.txt done<IDS.txt
Your command line works - except you're missing a semicolon:
while read line; do grep -i -w "$line" src1src22.txt; done < IDS.txt >> newoutput.txt
I have found an efficient way to perform the task. Instead of a loop try this -f gives the pattern in the file next to it and searches in the next file. The chance of invalid character length which can occur with grep is reduced and looping slows the process down.
grep -iw -f IDS.txt src1src22.tx >>newoutput.txt
Try this -
awk 'NR==FNR{a[$2]=$1;next} $1 in a{print a[$1],$0}' f2 f1
CHEMBL3549542 118747520
Where f2 is src1src22.txt

Getting different output files

I'm doing a test with these files:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R2_001.fastq
I want to get the files that have the same code until the first _ (underscore) and have the code R1 in different output files. The output files should be called according with the code until the first _ (underscore).
-This is my code, but I'm having trouble on making the output files.
#!/bin/bash
for i in {900..995}; do
if [[ ${i} -eq ${i} ]]; then
cat comp${i}_*_R1_001.fastq
fi
done
-I want to have two outputs:
One output will have all lines from:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and its name should be comp900_R1.out
The other output will have lines from:
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
and its name should be comp995_R1.out
Finally, as I said, this is a small test. I want my script to work with a lot of files that have the same characteristics.
Using awk:
ls -1 *.fastq | awk -F_ '$8 == "R1" {system("cat " $0 ">>" $1 "_R1.out")}'
List all files *.fastq into awk, splitting on _. Check if 8:th part $8 is R1, then append cat >> the file into first part $1 + _R1.out, which will be comp900_R1.out or comp995_R1.out. It is assumed that no filenames contain spaces or other special characters.
Result:
File comp900_R1.out containing all lines from
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and file comp995_R1.out containing all lines from
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
My stab at a general solution:
#!/bin/bash
for f in *_R1_*; do
code=$(echo $f | cut -d _ -f 1)
cat $f >> ${code}_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
done
Iterates over files with _R1_ in it, then appends its output to a file based on code.
cut pulls out the code by splitting the filename (-d _) and returning the first field (-f 1).

How to convert HHMMSS to HH:MM:SS Unix?

I tried to convert the HHMMSS to HH:MM:SS and I am able to convert it successfully but my script takes 2 hours to complete because of the file size. Is there any better way (fastest way) to complete this task
Data File
data.txt
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,
My code : script.sh
#!/bin/bash
awk -F"," '{print $5}' Data.txt > tmp.txt # print first line first string before , to tmp.txt i.e. all Numbers will be placed into tmp.txt
sort tmp.txt | uniq -d > Uniqe_number.txt # unique values be stored to Uniqe_number.txt
rm tmp.txt # removes tmp file
while read line; do
echo $line
cat Data.txt | grep ",$line," > Numbers/All/$line.txt # grep Number and creats files induvidtually
awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
done < Uniqe_number.txt
ls Numbers/Final > files.txt
dos2unix files.txt
bash time_replace.sh
when you execute above script it will call time_replace.sh script
My Code for time_replace.sh
#!/bin/bash
for i in `cat files.txt`
do
while read aline
do
TimeDep=`echo $aline | awk -F"," '{print $6}'`
#echo $TimeDep
finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeDep
##########
TimeAri=`echo $aline | awk -F"," '{print $7}'`
#echo $TimeAri
finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeAri
sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
############################
done < Numbers/Final/$i
done
Any better solution?
Appreciate any help.
Thanks
Sri
If there's a large quantity of files, then the pipelines are probably what are going to impact performance more than anything else - although processes can be cheap, if you're doing a huge amount of processing then cutting down the amount of time you do pass data through a pipeline can reap dividends.
So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitary file, so the while lop in your first file could be replaced with an awk script that does this. You also don't need to use a temporary file.
I assume the sorting is just for tracking progress easily as you know how many numbers there are. But if you don't care for the sorting, you can simply do this:
#!/bin/sh
awk -F ',' '
{
print $5","$4","$7","$8","$9","$10","$11 > Numbers/Final/Final_$line.txt
}' datafile.txt
ls Numbers/Final > files.txt
Alternatively, if you need to sort you can do sort -t, -k5,4,10 (or whichever field your sort keys actually need to be).
As for formatting the datetime, awk also does functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis) ... (Note! Untested, so may contain vauge syntax errors):
#!/usr/bin/awk
BEGIN {
FS=","
}
function formattime (t)
{
return substr(t,1,2)":"substr(t,3,2)":"substr(t,5,2)
}
{
print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > Numbers/Final/Final_$line.txt
}
which you can save, chmod 700, and call directly as:
dostuff.awk filename
Other awk options include changing fields in-situ, so if you want to maintain the entire original file but with formatted datetimes, you can do a modification of the above. Change the print block to:
{
$10=formattime($10)
$11=formattime($11)
print $0
}
If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS.
while IFS=, read -a line ; do
echo -n ${line[0]},${line[1]},${line[2]},${line[3]},
echo -n ${line[4]},${line[5]},${line[6]},${line[7]},
echo -n ${line[8]},${line[9]},
if [ -n "${line[10]}" ]; then
echo -n ${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}
fi
echo -n ,
if [ -n "${line[11]}" ]; then
echo -n ${line[11]:0:2}:${line[11]:2:2}:${line[11]:4:2}
fi
echo ""
done < data.txt
The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
In Perl, that's close to child's play:
#!/usr/bin/env perl
use strict;
use warnings;
use English( -no_match_vars );
local($OFS) = ",";
while (<>)
{
my(#F) = split /,/;
$F[9] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
$F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
print #F;
}
If you don't want to use English, you can write local($,) = ","; instead; it controls the output field separator, choosing to use comma. The code reads each line in the file, splits it up on the commas, takes the last two fields, counting from zero, and (if they're not empty) inserts colons in between the pairs of digits. I'm sure a 'Code Golf' solution would be made a lot shorter, but this is semi-legible if you know any Perl.
This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
The output on the sample data you gave is:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,

search a pattern in file and output each pattern result in its own file using awk, sed

I have a file of numbers in each new line:
$cat test
700320947
700509217
701113187
701435748
701435889
701667717
701668467
702119126
702306577
702914910
that I want to search details of from another larger file with several comma separated fields and out put results in
700320947.csv
700509217.csv
701113187.csv
701435748.csv
701435889.csv
701667717.csv
701668467.csv
702119126.csv
702306577.csv
702914910.csv
Logic:
ls test | while read file; do zgrep $line *large*file*gz >> $line.csv ; done
Please assist.
Thanks
Since nothing said about the structure of the large file, I'll just assume that the numbers in test are to be found in the second column of the large file; generalize as needed.
This can be done in a single pass through each of the files by using output redirection in awk:
awk -F"," 'FILENAME == "test" { num[$1]=1; next }
num[$2] { print > $2".csv" }' test bigfile
Unzip the large file first; using zgrep means unzipping on-the-fly for every line of the number file... very inefficient. After unzipping the big file, this will do it:
for number in `cat test`; do grep $number bigfile > $number.csv; done
Edited:
To limit hits to whole words only (eg 702119126 won't match 1702119126), add word boundaries to the regex:
for number in `cat test`; do grep \\b$number\\b bigfile > $number.csv; done

Resources