How to name hundreds of files in increasing order in bash? - bash

I need to download 300 files from the cloud and name them one by one in increasing order. I can do one at a time by running the following command. The pathname before '>' is the location of the original file, and the pathname after '>' is where I want to save the output.
/Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S1_dedup.bam > /Volumes/lab/students/test1.txt
My question is how to change the numbers in 'S1_dedup.bam' and 'test1.txt' from 1 to 300 in a loop (or something), instead of hardcoding the numbers 300 times by hand.

for ((i=1; i<=300; i++))
do
    /Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S${i}_dedup.bam > /Volumes/lab/students/test${i}.txt
done

You can use a for loop:
for i in {1..300}
do
    /Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S${i}_dedup.bam > /Volumes/lab/students/test${i}.txt
done
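If the BAM files happen to use zero-padded numbers (for example S001_dedup.bam, which is an assumption rather than something stated in the question), you can format the counter with printf inside the loop:
for i in {1..300}
do
    # hypothetical: pad the counter to three digits if the filenames require it
    n=$(printf '%03d' "$i")
    /Applications/samtools-1.14/samtools depth -r dna /Volumes/lab/plants/aligned_data/S${n}_dedup.bam > /Volumes/lab/students/test${n}.txt
done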

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
    [[ ! -f ${FILENAME}.${i}.xyz ]] && break
    cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
    mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
    let "n = 2 + (${j} * ${LINES_PER_CONF})"
    let "m = ${j} + 1"
    ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
    sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line for that conformer in the combined file (i.e. the energy must become the title of the specific conformer it belongs to).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files under new names into another directory does require a loop, or one of the less-than-highly-portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
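For instance, here is a minimal sketch of that idea under the same assumptions as the question's Step 2 (title lines at 2, 2 + LINES_PER_CONF, 2 + 2*LINES_PER_CONF, and so on; energies in column 2 of $ACTFILE; LINES_PER_CONF at least 3); the temporary file name is only an illustration:
awk -v lines="$LINES_PER_CONF" '
    NR == FNR { energy[FNR] = $2; next }                      # first file: store energies by line number
    FNR % lines == 2 { $0 = energy[int(FNR / lines) + 1] }    # title lines: swap in the matching energy
    { print }
' "$ACTFILE" "$COMBINED_FILE" > "${COMBINED_FILE}.new" && mv "${COMBINED_FILE}.new" "$COMBINED_FILE"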
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
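As an illustration only (a sketch of those extensions, not a program from the question), a per-file summary could look like this:
gawk '
    BEGINFILE { atoms = 0 }                                   # reset per-file state
    FNR == 1  { atoms = $1 }                                  # first record of each .xyz file is the atom count
    ENDFILE   { printf "%s: %s atoms, %d lines\n", FILENAME, atoms, FNR }
' "${FILENAME}".[0-9][0-9][0-9].xyz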
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, except it keeps more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' _=17 __=91   # extract fixed-interval samples without modulo (%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Merging CSVs into one sees exponentially bigger size

I have 600 CSV files of about 1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db. So my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into a SQL db.
However, when I run the following bash command (to merge all files keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv has a size of 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put the data into a sqlite3 db with a reasonable size?
I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can't you just load them one after the other?
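For what it's worth, sqlite3's shell can import each CSV directly, so the merge step can be skipped; here is a hedged sketch that assumes a database named mydb.sqlite, an existing table named data, chunk files without spaces in their names, and sqlite3 3.32+ for the --skip option:
for f in file-chunk*.csv; do
    # .import --csv --skip 1 drops each file's header row before loading it into the "data" table
    sqlite3 mydb.sqlite ".import --csv --skip 1 $f data"
done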
But your problem is an infinite loop. Your wildcard (*.csv) includes the file you're writing to. You could put your output file in a different directory or make sure your file glob does not include the output file (for f in file-*.csv maybe).
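In other words, something along these lines avoids re-reading the output file (the names file-chunk*.csv and merged.csv are just examples):
head -n1 file-chunk0001.csv > merged.csv               # keep a single header line
for f in file-chunk*.csv; do
    tail -n +2 "$f" >> merged.csv                      # append each file minus its header
done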

find missing files in a very long list where files are numbered sequentially

I have a directory which has more than 330000 files and unfortunately I cannot use ls.
In order to list them I use find, and I have printed the output to a file (a list of the files).
These files are named sequentially, therefore there is a long list that goes Blast0_1.txt.gz Blast0_2.txt.gz Blast0_3.txt.gz....
and these numbers go up to 587, hence the total number of files should be 588x588 = 345744 (because the numbering both before and after the underscore starts at 0).
Some combinations are missing, because the total should be 345744 but unfortunately it is 331357.
Is there an easy way to find the missing combinations through bash? I saw that there are available some solutions online but they do not work for me and I cannot figure how to adapt any of them in my dataset.
Any help is greatly appreciated.
You could iterate through all possible filenames and check whether the file exists. On my laptop, this took around 8 seconds for 588x588 combinations.
for i in {0..587}; do
    for j in {0..587}; do
        file_name="Blast${i}_${j}.txt.gz"
        [ ! -f "$file_name" ] && echo "$file_name"
    done
done
This will go through all possible combinations, check whether the file exists and if not, print its filename to the console.
Depending on your naming scheme, you might have to zero pad the numbers.
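For example, a hypothetical zero-padded variant (bash 4+ zero-pads a brace expansion when an endpoint is written padded) could look like:
for i in {000..587}; do
    for j in {000..587}; do
        # assumes names such as Blast007_012.txt.gz, which is not what the question shows
        file_name="Blast${i}_${j}.txt.gz"
        [ ! -f "$file_name" ] && echo "$file_name"
    done
done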

Bash script that creates files of a set size

I'm trying to set up a script that will create empty .txt files with the size of 24MB in the /tmp/ directory. The idea behind this script is that Zabbix, a monitoring service, will notice that the directory is full and wipe it completely with the usage of a recovery expression.
However, I'm new to Linux and seem to be stuck on the script that generates the files. This is what I've currently written out.
today="$( date +"%Y%m%d" )"
number=0
while test -e "$today$suffix.txt"; do
    (( ++number ))
    suffix="$( printf -- %02d "$number" )"
done
fname="$today$suffix.txt"
printf 'Will use "%s" as filename\n' "$fname"
printf -c 24m /tmp/testf > "$fname"
I'm thinking what I'm doing wrong has to do with the printf command. But some input, advice and/or directions to a guide to scripting are very welcome.
Many thanks,
Melanchole
I guess that it doesn't matter what bytes are actually in that file, as long as it fills up the temp dir. For that reason, the right tool to create the file is dd, which is available in every Linux distribution, often installed by default.
Check the manpage for different options, but the most important ones are
if: the input file; /dev/zero is probably what you want, since it is just an endless stream of zero-valued bytes
of: the output file, you can keep the code you have to generate it
count: number of blocks to copy, just use 24 here
bs: size of each block, use 1MB for that
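Putting those options together, a minimal sketch (reusing the $fname variable from the question's script; note that dd spells a 1 MiB block size as 1M, while 1MB would mean 1000000 bytes):
# create a 24 MiB file of zero bytes
dd if=/dev/zero of="$fname" bs=1M count=24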

Bash/batch multiple file, single folder, incremental rename script; user-provided filename prefix parameter

I have a folder of files which need to be renamed.
Instead of a simple incremental numeric rename function, I need to first provide a naming convention which will then increment in order to ensure file name integrity within the folder.
Say I have files:
wei12346.txt
wifr5678.txt
dkgj5678.txt
which need to be renamed to:
Eac-345-018.txt
Eac-345-019.txt
Eac-345-020.txt
Each time I run the script the naming could be different, and the numeric increment to go along with it may also be different:
Ebc-345-010.pdf
Ebc-345-011.pdf
Ebc-345-012.pdf
So I need to ask for a parameter from the user; I was thinking this might usefully be the previous file name in the list of files to be indexed, e.g. Eac-345-017.txt.
The other thing I am unsure about with the increment is how the script would deal with incrementing 099 to 100, or 999 to 1000, as I am not aware of how this process is carried out.
I have been told that this is an easy script in Perl; however, I am running Cygwin on a Windows machine at work and have access only to bash and Windows shells in order to execute the script.
Any pointers to get me going would be greatly appreciated; I have some experience programming, but scripting is almost entirely new to me.
Thanks,
Craig
(I understand there are a lot of posts on this type of thing already, but none seem to offer any concise answer, hence my question.)
#!/bin/bash
prefix="$1"
shift
base_n="$1"
shift
step="$1"
shift
n=$base_n
for file in "$@" ; do
    formatted_n=$(printf "%03d" "$n")
    # re-use the original file extension while we're at it.
    mv "$file" "${prefix}-${formatted_n}.${file##*.}"
    let n=n+$step
done
Save the file, invoke it like this:
bash fancy_rename.sh Ebc-345 10 1 /path/to/files/*
Note: In your example you "renamed" a .txt to a .pdf, but above I presumed the extension would stay the same. If you really wanted to just change the extension then it would be a trivial change. If you wanted to actually convert the file format then it would be a little more complex.
Note also that I have formatted the incrementing number with %03d. This means that your number sequence will be e.g.
010
011
012
...
099
100
101
...
999
1000
Meaning that it will be zero padded to three places but will automatically overflow if the number is larger. If you prefer consistency (always 4 digits) you should change the padding to %04d.
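A quick way to see that behaviour for yourself (printf reuses the format string for each argument):
printf '%03d\n' 99 100 999 1000    # prints 099, 100, 999, 1000 on separate lines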
OK, you can do the following: ask the user first for the prefix and then for the starting sequence number. Then use bash's built-in printf to do the correct formatting of the numbers, but you may want to choose a number width wide enough to hold the whole sequence, because that results in more homogeneous names. You can use read to read the user input:
echo -n "Insert the prefix: "
read prefix
echo -n "Insert the sequence number: "
read sn
for i in * ; do
    fp=`printf %04d $sn`
    mv "$i" "$prefix-$fp.txt"
    sn=`expr $sn + 1`
done
Note: You could extract the extension as well; that wouldn't be a problem. Also, here I selected a width of 4 digits for the sequence number, calculated into the variable $fp.
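If you do want to keep each file's original extension rather than hardcoding .txt, a hypothetical tweak to the mv line above would be:
mv "$i" "$prefix-$fp.${i##*.}"    # keep the original extension instead of forcing .txt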
