I want to rename multiple individual entries in a long file based on a comma-delimited table. I figured out a way to do it, but I feel it's highly inefficient and I'm wondering if there's a better way.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
#script to revert dammit name changes
while read line; do
IFS="," read -r -a contig <<< "$line"
sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed scans the whole file once for every entry in the renaming table.
I could imagine there is a way to read each line only once and iterate over the name list for that line instead, but I'm not sure how to tackle this. I chose bash because it's the language I'm most fluent in, but I'm not averse to using perl or python if they offer an easier solution.
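For reference, the single pass imagined above can be expressed in awk: load the CSV into a lookup table, then rewrite each "::"-separated header field by exact match. A rough sketch (untested; the output name renamed.pep is a placeholder, and it writes a new file rather than editing in place):
awk -F',' '
    NR == FNR { map[$2] = $1; next }      # first file: build renamed -> original table
    /^>/ {                                # header lines only
        n = split($0, parts, "::")
        for (i = 1; i <= n; i++)
            if (parts[i] in map) parts[i] = map[parts[i]]
        line = parts[1]
        for (i = 2; i <= n; i++) line = line "::" parts[i]
        $0 = line
    }
    { print }
' Fmerg_final.fasta.dammit.namemap.csv Fmerg_final.fasta.transdecoder_test.pep > renamed.pep
Matching whole fields also avoids the partial-match risk the sed approach carries (e.g. Fmerg_Transcript_1 matching inside Fmerg_Transcript_12).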
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are comfortable with bash, you can do more with it, no problem.
Divide and conquer.
I have done this many times, as you can reduce the total run time to something closer to the time it takes to process a single part.
Take this pseudo code: I call a function that cuts the 30K file into, say, X parts, then call the renaming function on each part in a loop with & so the parts run as parallel background jobs.
declare -a file_part_names

# cut the big file into parts
function cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    mkdir -p /tmp/folder
    split -n "l/$number_parts" "$orig_file" /tmp/folder/part_   # GNU split; keeps lines intact
    file_part_names=(/tmp/folder/part_*)
}

# call method to handle renaming one file part
function rename_fields_in_file() {
    file_part="$1"
    while IFS="," read -r -a contig; do
        sed -i "s|${contig[1]}|${contig[0]}|g" "$file_part"
    done < Fmerg_final.fasta.dammit.namemap.csv
}

# main
cut_file_into_parts "Fmerg_final.fasta.transdecoder_test.pep" 500
for each in "${file_part_names[@]}"; do
    while (( $(jobs -rp | wc -l) >= 100 )); do
        sleep 10        # over 100 jobs running, wait before adding more
    done
    rename_fields_in_file "$each" &
done
wait

# Now that you have a pile of temp files processed, combine them all.
cat /tmp/folder/part_* > final_result.txt
In summary: cut the big file into, say, 500 tmp files labelled part_1, part_2, etc. in somewhere like /tmp/folder. Then go through them one at a time, but launch each as a child process with up to 100 running at the same time; keep the pipeline full by checking the job count: if over 100, do nothing (sleep 10), if under, add more. When everything has finished, one more pass combines the finished parts into a single result, which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script X times, once per part, instead of using background jobs.
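A minimal sketch of that simpler variant, assuming the original script has been changed to take the file it edits as its first argument and saved as rename.sh (a hypothetical name):
# split the large file into 4 line-aligned parts (GNU split), run the renaming
# script on each part in turn, then stitch the results back together
split -n l/4 Fmerg_final.fasta.transdecoder_test.pep part_
for p in part_*; do
    ./rename.sh "$p"        # assumed to apply all CSV renames to "$p" in place
done
cat part_* > final_result.txt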
I have 20 files from which I want to grep all the lines that contain a given id (id123) and save them in a new text file. So, in the end, I would have several txt files, one per id.
If you have a small number of IDs, you can create a script with the list inside, e.g.:
list=("id123" "id124" "id125" "id126")
for i in "${list[@]}"
do
zgrep -Hx "$i" *.vcf.gz > "/home/Roy/$i.txt"
done
This would give us 4 txt files (id123.txt...) etc.
However, this list is around 500 ids, so it's much easier to read the txt file that stores the ids and iterate through it.
I was trying to do something like:
list = `cat some_data.txt`
for i in "${list[@]}"
do
zgrep -Hx $i *.vcf.gz > /home/Roy/$i.txt
done
However, this only provides the last id of the file.
If each id in the file is on a distinct line, you can do
while read i; do ...; done < panel_genes_cns.txt
If that is not the case, you can simply massage the file to make it so:
tr -s '[[:space:]]' \\n < panel_genes_cns.txt | while read i; do ...; done
There are a few caveats to be aware of. In both forms, the commands inside the loop also read from the same input stream that while reads from, which may consume ids unexpectedly. In the second, the pipeline will (depending on the shell) run in a subshell, so any variables defined in the loop will be out of scope after the loop ends. But for your simple case, either of these should work without worrying too much about those issues.
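Putting that together with the zgrep from the question, a minimal sketch (assuming one id per line in panel_genes_cns.txt):
while read -r i; do
    zgrep -Hx "$i" *.vcf.gz > "/home/Roy/$i.txt"
done < panel_genes_cns.txt
Since zgrep is given explicit files here, it won't consume lines from the loop's standard input.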
I did not check the whole code, but I can see right away that you are using the wrong redirection.
You have to use >> instead of >.
> overwrites and >> appends.
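A quick way to see the difference (with a throwaway file):
echo "first"  >  demo.txt    # demo.txt contains only "first"
echo "second" >  demo.txt    # > truncates first: demo.txt contains only "second"
echo "third"  >> demo.txt    # >> appends: demo.txt now contains "second" and "third"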
list = `cat pannel_genes_cns.txt`
for i in "${list[@]}"
do
zgrep -Hx $i *.vcf.gz >> /home/Roy/$i.txt
done
I have 40 csv files that I need to edit. 20 have matching format and the names only differ by one character, e.g., docA.csv, docB.csv, etc. The other 20 also match and are named pair_docA.csv, pair_docB.csv, etc.
I have the code written to edit and combine docA.csv and pair_docA.csv, but I'm struggling to write a loop that calls both of the above files, edits them, combines them under the name combinedA.csv, and then goes on to the next pair.
Can anyone help my rudimentary bash scripting? Here's what I have thus far. I've tried in a single for loop, and now I'm trying in 2 (probably 3) for loops. I'd prefer to keep it in a single loop.
set -x
DIR=/path/to/file/location
for file in `ls $DIR/doc?.csv`
do
#code to edit the doc*.csv files ie $file
done
for pairdoc in `ls $DIR/pair_doc?.csv`
do
#code to edit the piar_doc*.csv files ie $pairdoc
done
#still need to combine the files. I have the join written for a single iteration,
#but how do I loop the code to save each join as a different file corresponding
#to combined*.csv
Something along these lines:
#!/bin/bash
dir=/path/to/file/location
cd "$dir" || exit
for file in doc?.csv; do
pair=pair_$file
# "${file#doc}" deletes the prefix "doc"
combined=combined_${file#doc}
cat "$file" "$pair" >> "$combined"
done
ls, on principle, shouldn't be used in a shell script to iterate over files; it is intended for interactive use and is almost never needed within a script. Also, all-capitalized variable names shouldn't be used as ordinary variables, since they may collide with internal shell variables or environment variables.
Below is a version without changing the directory.
#!/bin/bash
dir=/path/to/file/location
for file in "$dir/"doc?.csv; do
basename=${file#"$dir/"}
pair=$dir/pair_$basename
combined=$dir/combined_${basename#doc}
cat "$file" "$pair" >> "$combined"
done
This might work for you (GNU parallel):
parallel cat {1} {2} \> join_{1}_{2} ::: doc{A..T}.csv :::+ pair_doc{A..T}.csv
Change the cat command to your chosen command(s), where {1} represents the docX.csv files and {2} represents the pair_docX.csv files.
N.B. X represents the letters A through T.
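If you want to check what would be run first, GNU parallel's --dry-run option prints the generated commands instead of executing them:
parallel --dry-run cat {1} {2} \> join_{1}_{2} ::: doc{A..T}.csv :::+ pair_doc{A..T}.csv
# first generated command looks like: cat docA.csv pair_docA.csv > join_docA.csv_pair_docA.csv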
When I have a single textfile that I want to read line-by-line with bash, the command looks like:
while IFS='' read -r line || [[ -n "${line}" ]];
do
[code goes here]
done < "${filename}"
Now, I have several files (named 1.txt through 10.txt), all of which have the same number of lines ( ~ 1600). Processing the while loop through each file individually takes a long time, is there a way to read and process everything in parallel (i.e., all 10 files will be read at the same time, but processed separately) with the while syntax? For example:
while IFS='' read -r line || [[ -n "${line}" ]];
do
[code goes here]
done <(1.txt; 2.txt; 3.txt; ...)
Or might there be a better method of achieving the desired multi-text processing other than creating 10 separate scripts to do this?
The overarching objective is that the files 1.txt - 10.txt consist of ~ 1600 separate ID's, in which the [code goes here] section will first:
1) read the ID line-by-line
2) based on the ID, reference a master file which contains information about that ID, such as the time at which it occurred. Extract this time.
3) Based on this extracted time, build a set of file names covering 1 hour before and 1 hour after at 2-minute increments. We then reference each of these 60 files, open them, extract a line from each, and finally dump it to a new file.
Therefore, the process consists of opening multiple different files for referencing.
You can modify the existing script to take the filename as a command-line argument,
e.g., if the script name is process_file.sh: $ ./process_file.sh <file_name>
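A minimal sketch of what process_file.sh might look like (hypothetical; it just wraps the existing while loop and takes the input file as its first argument):
#!/bin/bash
# process_file.sh -- process one ID file, passed as the first argument
filename="$1"
while IFS='' read -r line || [[ -n "${line}" ]]; do
    # [code goes here] -- the per-ID work from the original loop
    echo "processing ID: ${line}"
done < "${filename}"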
You could then write one more support script which holds the list of files, loops over it, calls this script for each file, and pushes each call to the background using "&",
e.g.:
declare -a arr=("1.txt" "2.txt" "3.txt")
for i in "${arr[@]}"
do
./process_file.sh "$i" &
done
This might be one approach you could try and check.
so I'm trying to get a simple bash script to continuously read a directory and update a list of files to play through a command. However, I'm having some trouble thinking out the logic in it. What I need to do is put the current items in the directory into the list, have each item in the directory run through a program, and when a new item comes in, just append it to the list. I'm attempting to use inotifywait but can't seem to think of the proper logic. I may need it to run in the background, as the process that is running on these files will run before inotifywait is read again, at which point it will not pick up any new files that have been added as it only checks when it runs. Here's the code so hopefully it makes more sense.
#!/bin/bash
#Initial check to see if files are converted.
if [ ! -d "/home/pi/rpitx/converted" ]; then
echo "Converted directory does not exist, cannot play!"
exit 1
fi
CYAN='\e[36m'
NC='\e[39m'
LGREEN='\e[92m'
#iterate through directory first and act upon each item
for f in $FILES
do
echo -e "${CYAN}Now playing ${f##*/}...${NC}"
#Figure out a way to always watch directory even when it is playing
inotifywait -m /home/pi/rpitx/converted -e create -e moved_to |
while read path action file; do
echo -e "${LGREEN}New file found: ${CYAN}${file}${NC}"
FILES+=($file)
done
# take action on each file. $f store current file name
sudo ./rpitx -m RF -i "${f}" -f 101100
done
exit 0
So for example. if rpitx is currently playing something, and a file is converted, it won't pick up the latest file and add it to the list, nor will it make it since it's always reading. Is there a way to get inotifywait to run in the background of this script somehow? Thanks.
This is actually quite a difficult problem to get 100% perfect, but it is possible to get pretty close.
It is easy to get all the files in a directory, and it is easy to use inotifywait to get iteratively informed of new files being placed into the directory. The issue is getting the two to be consistent. If inotifywait isn't started until all the files have been processed (or even just listed), then you might miss new files created between the listing and the invocation of inotifywait. If, on the other hand, you start inotifywait first, then a file created after the invocation of inotifywait and the extraction of the current file list will be listed twice.
Since it is easier to filter duplicates than notice orphans, the recommended approach is the second one.
As a first approximation, we could ignore the duplicate problem on the assumption that the window of vulnerability is pretty short and so it is probably unlikely to happen. This simplifies the code, but it's not that difficult to track and eliminate duplicates: we could, for example, store each filename as the key in an associative array, ignoring the file if the key already exists.
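A minimal sketch of that idea, applied to the same read loop that appears below (seen is a hypothetical name):
declare -A seen                              # filename -> already handled?
while read -r action file; do
    [[ -n "${seen[$file]}" ]] && continue    # duplicate notification, skip it
    seen[$file]=1
    handle "$action" "$file"
done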
We need three processes: one to execute inotifywait; one to produce the list of initial files; and one to handle each file as it is identified. So the basic structure of the code will be:
list_new_files |
{ list_existing_files; pass_through; } |
while read -r action file; do
handle "$action" "$file"
done
Note that the second process first produces the existing files, and then calls pass_through, which reads from standard input and writes to standard output, thus passing through the files being discovered by list_new_files. Since pipes have a finite capacity, it is possible that the execution of list_existing_files will block a few times (if there are lots of existing files and handling them takes a long time), so when pass_through finally gets executed, it could have quite a bit of queued-up input to pass through. That doesn't matter, unless the first pipe also fills up, which will happen if a large number of new files are created. And that still won't matter as long as inotifywait doesn't lose notifications while it is blocked on a write. (This may actually be a problem, since the manpage for inotifywait on my system includes in the "BUGS" section the note, "It is assumed the inotify event queue will never overflow." We could fix the problem by inserting another process which carefully buffers inotifywait's output, but that shouldn't be necessary unless you intend to flood the directory with lots of files.)
Now, let's examine each of the functions in turn.
list_new_files could be just the call to inotifywait from your original script:
inotifywait -m /home/pi/rpitx/converted -e create -e moved_to
Listing existing files is also easy. Here's one simple solution:
printf "%s\n" /home/pi/rpitx/converted/*
However, that will print out the full file path, which is different from the output from inotifywait. To make them the same, we cd into the directory in order to do the listing. Since we might not actually want to change the working directory, we use a subshell by surrounding the commands inside parentheses:
( cd /home/pi/rpitx/converted; printf "%s\n" *; )
The printf just prints its arguments each on a separate line. Since glob-expansions are not word-split or recursively glob-expanded, this is safe against whitespace or metacharacters in filenames, except newline characters. Filenames with newline characters are pretty rare; for now, I'll ignore the issue but I'll indicate how to handle it at the end.
Even with the change indicated above, the output from these two commands is not compatible: the first one outputs three things on each line (directory, action, filename), and the second one just one thing (the filename). In the listing below, you'll see how we modify the format to printf and introduce a format for inotifywait in order to make the outputs fully compatible, with the "action" for existing files set to EXISTING.
pass_through could, in theory, just be cat, and that's how I've coded it below. However, it is important that it operate in line-buffered mode; otherwise, nothing will happen until "enough" files have been written by list_existing_files. On my system, cat in this configuration works perfectly; if that doesn't work for you or you don't want to count on it, you could write it explicitly as a while read loop:
pass_through() {
while read -r line; do echo "$line"; done
}
Finally, handle is essentially the code from the original post, but modified a bit to take the new format into account, and to do the right thing with action EXISTING.
# Colours. Note the use of `$'...'` to actually store the code,
# thereby avoiding the need to later reinterpret backslash sequences
CYAN=$'\e[36m'
NC=$'\e[39m'
LGREEN=$'\e[92m'
converted=/home/pi/rpitx/converted
list_new_files() {
inotifywait -m "$converted" -e create -e moved_to --format "%e %f"
}
# Note the use of ( ) around the body instead of { }
# This is the same as `{( ... )}'; it makes the `cd` local to the function.
list_existing_files() (
cd "$converted"
printf "EXISTING %s\n" *
)
# Invoked as `handle action filename`
handle() {
case "$1" in
EXISTING)
echo "${CYAN}Now playing ${2}...${NC}"
;;
*)
echo "${LGREEN}New file found: ${CYAN}${file}${NC}"
;;
esac
sudo ./rpitx -m RF -i "${2}" -f 101100
}
# Put everything together
list_new_files |
{ list_existing_files; cat; } |
while read -r action file; do handle "$action" "$file"; done
What if we thought a filename might have a newline character in it? There are two "safe" characters which could be used to delimit the filenames, in the sense that they cannot appear inside a filename. One is /, which can obviously appear in a path, but cannot appear in a simple filename, which is what we're working with here. The other one is the NUL character, which cannot appear inside a filename at all, but can sometimes be a bit annoying to deal with.
Normally, faced with this problem, we would use a NUL, but that depends on the various utilities we're using allowing the separation of data with NUL instead of newline. That's not the case for inotifywait, which always outputs a newline after a notification line. So in this case it seems simpler to use a /. First we modify the formats:
inotifywait -m "$converted" -e create -e moved_to --format "%e %f/"
printf "%s/\n" *
Now, when we're reading the lines, we need to read until we find a line ending with / (and remember to remove it). read doesn't allow two-character line terminators, so we need to accumulate the lines ourselves:
while read -r action file; do
# If file doesn't end with a slash, we need to read another line
while [[ $file != */ ]] && read -r line; do
file+=$'\n'"$line"
done
# Remember to remove the trailing slash
handle "$action" "${file%/}"
done
How do I merge ls with wc -l to get the name of a file, its modification time, and the number of rows in the file?
Thanks!
There are a number of ways you can approach this from the shell or your programming language of choice, but there's really no "right" way to do this, since you need to both stat and read each file in order to form your custom output. You can do this without pipelines inside a basic for-loop by using command substitution:
custom_ls () {
for file in "$@"; do
echo "$file, $(date -r "$file" '+%T'), $(wc -l < "$file")"
done
}
This will generate output like this:
$ custom_ls .git*
.gitconfig, 14:02:56, 44
.gitignore, 17:07:13, 21
There are certainly other ways to do it, but command substitution allows the intent of the format string to remain short and clear, without complex pipelines or temporary variables.
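One such alternative, for example, swaps the date call for stat (this assumes GNU stat; %y prints the full modification timestamp, so the output format differs slightly):
custom_ls_stat () {
    for file in "$@"; do
        # stat -c '%y' prints the full modification time, e.g. "2023-05-01 14:02:56.123456789 -0400"
        printf '%s, %s, %s\n' "$file" "$(stat -c '%y' "$file")" "$(wc -l < "$file")"
    done
}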