When I have a single textfile that I want to read line-by-line with bash, the command looks like:
while IFS='' read -r line || [[ -n "${line}" ]];
do
[code goes here]
done < "${filename}"
Now, I have several files (named 1.txt through 10.txt), all of which have the same number of lines (~1600). Processing the while loop through each file individually takes a long time. Is there a way to read and process everything in parallel (i.e., all 10 files will be read at the same time, but processed separately) with the while syntax? For example:
while IFS='' read -r line || [[ -n "${line}" ]];
do
[code goes here]
done <(1.txt; 2.txt; 3.txt; ...)
Or might there be a better method of achieving the desired multi-text processing other than creating 10 separate scripts to do this?
The overarching objective is that the files 1.txt - 10.txt consist of ~1600 separate IDs, and the [code goes here] section will:
1) read the ID line-by-line
2) based on the ID, look up a master file that contains information about that ID, such as the time at which the event for that ID occurred, and extract this time
3) based on this extracted time, build file names from 1 hour before to 1 hour after at 2-minute increments (a sketch of this timestamp arithmetic follows below), then reference each of these 60 files, open them, extract a line from each, and finally dump it to a new file.
Therefore, the process consists of opening multiple different files for referencing.
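For step 3, the timestamp arithmetic could be done with GNU date; this is only a hypothetical sketch (event_time and the output format are placeholders, not taken from your actual master file):
# list the 60 two-minute offsets spanning one hour before to one hour after the event
event_time="2021-01-01 12:00"                       # placeholder for the time extracted in step 2
base=$(date -d "$event_time" +%s)                   # convert the event time to epoch seconds
for (( offset = -3600; offset < 3600; offset += 120 )); do
    date -d "@$(( base + offset ))" '+%Y%m%d_%H%M'  # one filename-friendly timestamp per file
done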
You can modify the existing script to take the filename as a command-line argument.
E.g., if the script is named process_file.sh: ./process_file.sh <file_name>
You could then write one more support script that holds the list of files, loops over them, and calls this script, pushing each call into the background with "&".
E.g.:
declare -a arr=("1.txt" "2.txt" "3.txt")
for i in "${arr[@]}"
do
    ./process_file.sh "$i" &
done
This might be one approach you could try and check.
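If the calling script should also wait until all ten jobs have finished before continuing, a minimal sketch of such a support script (assuming process_file.sh contains your existing while loop and takes the file name as its first argument):
#!/bin/bash
# launch one background worker per input file, then wait for all of them
for f in {1..10}.txt; do
    ./process_file.sh "$f" &
done
wait   # block here until every background job has finished
echo "all files processed"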
I have 20 files from which I want to grep all the lines that contain a given ID (id123) and save them in a new text file. So, in the end, I would have as many text files as there are IDs.
If you have a small number of IDs, you can create a script with the list inside. E.g.:
list=("id123" "id124" "id125" "id126")
for i in "${list[@]}"
do
    zgrep -Hx "$i" *.vcf.gz > "/home/Roy/$i.txt"
done
This would give us 4 txt files (id123.txt...) etc.
However, this list is around 500 ids, so it's much easier to read the txt file that stores the ids and iterate through it.
I was trying to do something like:
list = `cat some_data.txt`
for i in "${list[@]}"
do
zgrep -Hx $i *.vcf.gz > /home/Roy/$i.txt
done
However, this only provides the last id of the file.
If each id in the file is on a distinct line, you can do
while read i; do ...; done < panel_genes_cns.txt
If that is not the case, you can simply massage the file to make it so:
tr -s '[:space:]' '\n' < panel_genes_cns.txt | while read i; do ...; done
There are a few caveats to be aware of. In both versions, the commands inside the loop read from the same input stream that while reads from, which may consume IDs unexpectedly. In the second, the pipeline will (depending on the shell) run in a subshell, so any variables defined in the loop go out of scope after the loop ends. But for your simple case, either of these should work without worrying too much about those issues.
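Applied to the zgrep example from the question, the first form might look like this (a sketch, assuming one ID per line in the file):
# one ID per line in; one output file per ID out
while IFS= read -r i; do
    zgrep -Hx "$i" *.vcf.gz > "/home/Roy/$i.txt"
done < panel_genes_cns.txt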
I did not check the whole code, but at first glance I can see you are using the wrong redirection.
You have to use >> instead of >.
> overwrites the file, while >> appends to it.
list=$(cat pannel_genes_cns.txt)
for i in $list
do
    zgrep -Hx "$i" *.vcf.gz >> "/home/Roy/$i.txt"
done
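If you would rather keep the array-style loop from the question, a small sketch that first reads the IDs into a real array (mapfile needs bash 4 or later):
# read one ID per line into an array, then loop over the array
mapfile -t list < pannel_genes_cns.txt
for i in "${list[@]}"
do
    zgrep -Hx "$i" *.vcf.gz >> "/home/Roy/$i.txt"
done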
I have 40 csv files that I need to edit. 20 have matching format and the names only differ by one character, e.g., docA.csv, docB.csv, etc. The other 20 also match and are named pair_docA.csv, pair_docB.csv, etc.
I have the code written to edit and combine docA.csv and pair_docA.csv, but I'm struggling to write a loop that calls both of the above files, edits them, combines them under the name combinedA.csv, and then goes on to the next pair.
Can anyone help my rudimentary bash scripting? Here's what I have thus far. I've tried in a single for loop, and now I'm trying in 2 (probably 3) for loops. I'd prefer to keep it in a single loop.
set -x
DIR=/path/to/file/location
for file in `ls $DIR/doc?.csv`
do
#code to edit the doc*.csv files ie $file
done
for pairdoc in `ls $DIR/pair_doc?.csv`
do
#code to edit the pair_doc*.csv files ie $pairdoc
done
#still need to combine the files. I have the join written for a single iteration,
#but how do I loop the code to save each join as a different file corresponding
#to combined*.csv
Something along these lines:
#!/bin/bash
dir=/path/to/file/location
cd "$dir" || exit
for file in doc?.csv; do
    pair=pair_$file
    # "${file#doc}" deletes the prefix "doc"
    combined=combined_${file#doc}
    cat "$file" "$pair" >> "$combined"
done
ls, on principle, shouldn't be used in a shell script to iterate over files. It is intended for interactive use and is almost never needed within a script. Also, all-capitalized variable names shouldn't be used as ordinary variables, since they may collide with internal shell variables or environment variables.
Below is a version without changing the directory.
#!/bin/bash
dir=/path/to/file/location
for file in "$dir/"doc?.csv; do
basename=${file#"$dir/"}
pair=$dir/pair_$basename
combined=$dir/combined_${basename#doc}
cat "$file" "$pair" >> "$combined"
done
This might work for you (GNU parallel):
parallel cat {1} {2} \> join_{1}_{2} ::: doc{A..T}.csv :::+ pair_doc{A..T}.csv
Change the cat command to your chosen command, where {1} represents a docX.csv file and {2} represents the matching pair_docX.csv file.
N.B. X represents the letters A through T.
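To preview the generated commands without running anything, GNU parallel's --dry-run option prints each command instead of executing it; for example:
# print the commands that would be run, but do not execute them
parallel --dry-run cat {1} {2} \> join_{1}_{2} ::: doc{A..T}.csv :::+ pair_doc{A..T}.csv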
I want to rename multiple individual entries in a long file based on a comma-delimited table. I figured out a way to do it, but I feel it's highly inefficient and I'm wondering if there's a better way.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
#script to revert dammit name changes
while read line; do
IFS="," read -r -a contig <<< "$line"
sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed makes a full pass over the large file once per entry in the renaming table.
I imagine there is a way to read each line only once and iterate over the name list for that line, but I'm not sure how to tackle it. I chose bash because it is the language I'm most fluent in, but I'm not averse to using Perl or Python if they offer an easier solution.
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are good with bash, you can do more, no problem.
Divide and conquer.
I have done this many times, as you can reduce the total work time closer to the time it takes to process a single part.
Take this pseudo code: I call a method that cuts up the 30K file into, say, X parts, then I call a worker for each part in a loop with & so they run as background jobs.
declare -a file_part_names

# cut the big namemap file into parts under /tmp/folder and record their names
function cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    lines_per_part=$(( ( $(wc -l < "$orig_file") + number_parts - 1 ) / number_parts ))
    split -l "$lines_per_part" "$orig_file" /tmp/folder/part_
    file_part_names=( /tmp/folder/part_* )
}

# handle renaming for one part of the namemap
function rename_fields_in_file() {
    file_part="$1"
    while read line; do
        IFS="," read -r -a contig <<< "$line"
        # $tmp_file: this worker's own temp copy of the file being renamed
        # (left abstract here, as in the original pseudo code)
        sed -i "s|${contig[1]}|${contig[0]}|g" "$tmp_file"
    done < "$file_part"
}

# main
cut_file_into_parts "Fmerg_final.fasta.dammit.namemap.csv" 500
for file_part in "${file_part_names[@]}"; do
    # keep the pipe full: wait while 100 or more workers are already running
    while (( $(jobs -rp | wc -l) >= 100 )); do
        sleep 10
    done
    rename_fields_in_file "$file_part" &
done
wait

# Now that you have a pile of temp files processed, combine them all.
for tmp in temp_*.txt; do
    cat "$tmp" >> final_result.txt
done
In summary, cut the big file into, say, 500 tmp files labeled file1, file2, etc. in, say, /tmp/folder. Then go through them one at a time, but launch each as a child process, with up to, say, 100 running at the same time. Keep the pipe full by checking the count: if 100 or more are running, do nothing (sleep 10); if fewer, add more. When done, one more loop combines file1_finish.txt, file2_finish.txt, etc., which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script X times, once per file, instead of using background jobs.
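For reference, the single-pass idea mentioned in the question can also be sketched with awk rather than bash: load the table once into memory, then stream the big file once. This is only a sketch; it assumes the renamed names appear verbatim in the FASTA headers, treats them as regular expressions (so the dots match any character), and reverted.pep is just an illustrative output name.
# file 1 (the namemap): build map[renamed] = original, skipping the CSV header line
# file 2 (the peptides): on each ">" header line, substitute every renamed name back
awk -F, '
    NR == FNR { if (FNR > 1) map[$2] = $1; next }
    /^>/      { for (r in map) gsub(r, map[r]) }
    { print }
' Fmerg_final.fasta.dammit.namemap.csv \
  Fmerg_final.fasta.transdecoder_test.pep > reverted.pep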
I have a number (say, 100) of CSV files, out of which some (say, 20) are empty (i.e., 0-byte files). I would like to concatenate the files into one single CSV file (say, assorted.csv), with the following requirement met:
For each empty file, there must be a blank line in assorted.csv.
It appears that simply doing cat *.csv >> assorted.csv skips the empty files completely in the sense that they do not have any lines and hence there is nothing to concatenate.
Though I can solve this problem using any high-level programming language, I would like to know if and how to make it possible using Bash.
Just make a loop and detect whether the file is empty. If it's empty, just echo the file name plus a comma: that will create a near-blank line. Otherwise, prefix each line with the file name plus a comma.
#!/bin/bash
out=assorted.csv
# delete the output file prior to doing the concatenation,
# or if the script is run twice it would be counted among the input files!
rm -f "$out"
for f in *.csv
do
    if [ -s "$f" ] ; then
        #cat "$f" | sed 's/^/$f,/'   # cat+sed is too much here
        sed "s/^/$f,/" "$f"
    else
        echo "$f,"
    fi
done > "$out"
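If you only need the literal requirement from the question (one blank line per empty file, without the filename prefix), a stripped-down sketch of the same loop, with a guard so the output file is never read as input:
#!/bin/bash
# concatenate all CSV files, emitting one blank line for each empty file
out=assorted.csv
for f in *.csv; do
    [ "$f" = "$out" ] && continue   # never treat the output file as input
    if [ -s "$f" ]; then
        cat "$f"
    else
        echo                        # empty file: contribute a single blank line
    fi
done > "$out"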
How do I merge ls with wc -l to get the name of a file, modification time and number of rows in a file?
thanks!
There are a number of ways you can approach this from the shell or your programming language of choice, but there's really no "right" way to do this, since you need to both stat and read each file in order to form your custom output. You can do this without pipelines inside a basic for-loop by using command substitution:
custom_ls () {
    for file in "$@"; do
        echo "$file, $(date -r "$file" '+%T'), $(wc -l < "$file")"
    done
}
This will generate output like this:
$ custom_ls .git*
.gitconfig, 14:02:56, 44
.gitignore, 17:07:13, 21
There are certainly other ways to do it, but command substitution allows the intent of the format string to remain short and clear, without complex pipelines or temporary variables.
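Another sketch of the same idea uses GNU stat for the timestamp (stat -c is GNU coreutils; on BSD/macOS you would use stat -f with a different format string); custom_ls2 is just an illustrative name:
custom_ls2 () {
    for file in "$@"; do
        # %y is the full last-modification time in GNU stat
        printf '%s, %s, %s\n' "$file" "$(stat -c '%y' "$file")" "$(wc -l < "$file")"
    done
}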