Handling empty files when concatenating files in bash

I have a number (say, 100) of CSV files, out of which some (say, 20) are empty (i.e., 0-byte files). I would like to concatenate the files into one single CSV file (say, assorted.csv), with the following requirement met:
For each empty file, there must be a blank line in assorted.csv.
It appears that simply doing cat *.csv >> assorted.csv skips the empty files completely in the sense that they do not have any lines and hence there is nothing to concatenate.
Though I can solve this problem using any high-level programming language, I would like to know if and how to make it possible using Bash.

Just make a loop and detect whether each file is non-empty. If it's empty, just echo the file name plus a comma into the output: that creates a nearly blank line. Otherwise, prefix each of its lines with the file name plus a comma.
#!/bin/bash
out=assorted.csv
# Delete the output file before concatenating,
# or if run twice it would be counted among the input files!
rm -f "$out"
for f in *.csv
do
    if [ -s "$f" ] ; then
        #cat "$f" | sed 's/^/$f,/' # cat+sed is too much here
        sed "s/^/$f,/" "$f"
    else
        echo "$f,"
    fi
done > "$out"
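If you only need the literal requirement (a blank line for each empty file, with no filename prefix), a minimal sketch could be:
#!/bin/bash
out=assorted.csv
rm -f "$out"       # avoid re-reading a previous run's output
for f in *.csv
do
    if [ -s "$f" ] ; then
        cat "$f"   # non-empty: copy its lines through
    else
        echo       # empty: emit one blank line in its place
    fi
done > "$out"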

Related

Loop through all the files with .txt extension in bash [duplicate]

I am trying to loop over files in a folder and test for .txt extensions.
But I get the following error: "awk: cannot open = (No such file or directory)".
Here's my code:
!/bin/bash
files=$(ls);
for file in $files
do
    # extension=$($file | awk -F . '{ print $NF }');
    if [ $file | awk -F . "{ print $NF }" = txt ]
    then
        echo $file;
    else
        echo "Not a .txt file";
    fi;
done;
The way you are doing this is wrong in many ways.
You should never parse the output of ls. It does not handle filenames containing special characters intuitively. See Why you shouldn't parse the output of ls(1).
Don't use variables to store multi-line data. The output of ls stored in a variable undergoes word splitting when expanded. In your case files is referenced as a plain variable, and without a delimiter set you can't safely step through the multiple files stored in it.
Using awk is absolutely unnecessary here, and the part $file | awk -F . "{ print $NF }" = txt is totally wrong: you are not passing the name of the file to the pipe, you are trying to run the variable $file as a command. It should have been echo "$file" | awk -F. '{ print $NF }'.
The right interpreter she-bang should have been set as #!/bin/bash in your script if you were planning to run it as an executable, i.e. ./script.sh (note the missing # in your script). The more recommended way is #!/usr/bin/env bash, which lets env locate the bash installed on the system.
As such your requirement could be simply reduced to
for file in *.txt; do
    [ -f "$file" ] || continue
    echo "$file"
done
This is a simple example using the glob pattern *.txt, which performs pathname expansion over all the files ending in .txt. Before the loop runs, the glob is expanded into the list of matching files; assuming the folder has the files 1.txt, 2.txt and foo.txt, the loop effectively becomes
for file in 1.txt 2.txt foo.txt; do
Even when no files match, i.e. when the glob finds no text files, the condition [ -f "$file" ] || continue ensures the loop exits gracefully by checking whether the glob produced a valid file or just the un-expanded pattern string. The test [ -f "$file" ] fails for anything except a valid regular file.
Or, since you are targeting the Bourne Again shell (bash), enable the nullglob option so that non-matching globs are removed rather than preserved as literal text:
shopt -s nullglob
for file in *.txt; do
    echo "$file"
done
Another way is to use a shell array to store the glob results and iterate over them later to perform a specific action on each file. This approach is useful when passing a list of files as an argument list to another command. A properly quoted expansion "${filesList[@]}" preserves spaces, tabs, newlines and other metacharacters in filenames.
shopt -s nullglob
filesList=(*.txt)
for file in "${filesList[@]}"; do
    echo "$file"
done
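For instance (a sketch, assuming you want line counts), the whole array can be handed to a single command as its argument list:
shopt -s nullglob
filesList=(*.txt)
# Hand every matched file to one wc invocation; the quoted
# expansion keeps filenames with spaces intact.
(( ${#filesList[@]} )) && wc -l "${filesList[@]}"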

Counter in bash script

I have a script that extracts filenames from an input file and is supposed to read each line (filename) and unzip the specified file, saving the unzipped content as individual files. However, I can't get my counter to work and just get all the unzipped files in one large file.
Input file contains a list:
ens/484/59/traj.pdb 0.001353
ens/263/39/traj.pdb 0.004178
ens/400/35/traj.pdb 0.004191
I'm using the regex /.*?/.*?/ to extract the path fragment of each file I'd like to unzip, and I want to name the output files output{1..40}.pdb. Instead I get one output file, output1.pdb, which contains the contents of all 40 unzipped files.
My question is: how do I correct my counter in order to achieve the desired naming scheme?
#!/bin/bash
file="/home/input.txt"
grep -Po '/.*?/.*?/' $file > filenames.txt
i=$((i+1))
structures='filenames.txt'
while IFS= read line
do
    gunzip -c 'ens'$line'traj.pdb.gz' >> 'output'$i'.pdb'
done <"$structures"
rm "$structures"
file="/home/input.txt"
grep -Po '/.*?/.*?/' "$file" > filenames.txt
structures='filenames.txt'
i=1
while IFS= read -r line
do
    gunzip -c "ens${line}traj.pdb.gz" >> "output${i}.pdb"
    i=$(expr "$i" + 1)
done < "$structures"
rm "$structures"
There are a couple of logical mistakes: the counter has to be defined once, outside the while loop, and the counter+1 step has to be inside the loop. For the increment you can use expr as above (or bash's own arithmetic); here I made the counter start from 1, so the first entry gets that value. As for the redirection feeding the while loop, I don't really understand what you are doing there; if it works as you have it then cool, or else use a test statement after while and before the parameters.
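For reference, the same loop written with bash's built-in arithmetic instead of forking expr (a sketch, assuming filenames.txt as produced above):
#!/bin/bash
i=1
while IFS= read -r line
do
    gunzip -c "ens${line}traj.pdb.gz" >> "output${i}.pdb"
    ((i++))   # arithmetic expansion; no external expr process needed
done < filenames.txt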

Redirecting the result files to different variable file names

I have a folder with, say, ten data files I01.txt, ..., I10.txt. Each file, when processed with the command ./a.out, gives me five output files, namely f1.txt, f2.txt, ..., f5.txt.
I have written a simple bash program to execute all the files and save the output printed on the screen to a variable file using the command
./cosima_peaks_444_temp_muuttuva -args > $counter-res.txt.
Using this, I am able to save the on-screen output to a file. But the five files f1 to f5 are overwritten on each run, so they only hold the results of the last file run, in this case I10, and the results of the first nine files are lost.
So I want to save the output files (f1 ... f5) of each I*.txt run to differently named files such that, when the program processes I01.txt with ./a.out, it stores the output as
f1>var1-f1.txt , f2>var1-f2.txt... f5 > var1-f5.txt
and then repeats the same for I02 (f1>var2-f1.txt ...).
#!/bin/bash
# echo "for looping over all the .txt files"
echo -e "Enter the name of the file or q to quit "
read dir
if [[ $dir = q ]]
then
    exit
fi
filename="$dir*.txt"
counter=0
if [[ $dir == I ]]
then
    for f in $filename ; do
        echo "output of $filename"
        ((counter ++))
        ./cosima_peaks_444_temp_muuttuva $f -m202.75 -c1 -ng0.5 -a0.0 -b1.0 -e1.0 -lg > $counter-res.txt
        echo "counter $counter"
    done
fi
If I understand correctly, you want to pass files l01.txt, l02.txt, ... to a.out and save the output of each execution of a.out to a separate file like f01.txt, f02.txt, .... You could use a short script that reads each file named l*.txt in the directory and passes the name to a.out, redirecting the output to a file fN.txt (where N is the same number as in the lN.txt filename). This presumes you are passing each filename to a.out and that a.out is not reading the entire directory automatically.
for i in l*.txt; do
    num=$(sed 's/^l\(.*\)[.]txt/\1/' <<<"$i")
    ./a.out "$i" > "f${num}.txt"
done
(note: that is a lowercase 'l' at the start of the sed pattern, and a backslash-one, \1, in the replacement)
note: if you do not want the same N from the filename (with its leading '0'), then you can trim the leading '0' from the N value for the output filename.
(you can use a counter as you have shown in your edited post, but you have no guarantee in sort order of the filenames used by the loop unless you explicitly sort them)
note: this presumes NO spaces, embedded newlines or other odd characters in the filenames. If your lN.txt names can contain odd characters or spaces, feeding a while loop with find avoids those issues, as sketched below.
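A sketch of that find/while variant (assuming the same lN.txt naming as above):
# NUL-delimited filenames survive spaces and newlines intact.
while IFS= read -r -d '' i; do
    num=$(basename "$i" .txt)   # e.g. ./l01.txt -> l01
    num=${num#l}                # strip the leading lowercase 'l' -> 01
    ./a.out "$i" > "f${num}.txt"
done < <(find . -maxdepth 1 -name 'l*.txt' -print0)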
With f1 - f5 Created Each Run
You know the format for the output file name, so you can test for the existence of an existing file name and set a prefix or suffix to provide unique names. For example, if your first pass creates filenames 'pass1-f01.txt', 'pass1-f02.txt', then you can check for that pattern (in several ways) and increment your 'passN' prefix as required:
for f in $filename; do   # unquoted so the glob stored in $filename expands
    num=$(sed 's/l*\(.*\)[.]txt/\1/' <<<"$f")
    count=$(sed 's/^0*//' <<<"$num")
    while [ -f "pass${count}-f${num}.txt" ]; do
        ((count++))
    done
    ./a.out "$f" > "pass${count}-f${num}.txt"
done
Give that a try and let me know if that isn't closer to what you need.
(note: the use of the herestring (<<<) is bash-only; if you need a portable solution, pipe the output of echo "$var" to sed, e.g. count=$(echo "$num" | sed 's/^0*//') )
I replaced your cosima_peaks_444_temp_muuttuva with a function myprog.
The OP asked for more explanation, so I put in a lot of comments:
# This function makes 5 output files for testing the construction
function myprog {
    # Fill the test output file f1.txt with the input filename and a datestamp
    echo "Output run $1 on $(date)" > f1.txt
    # The original prog makes 5 output files, so I copy the new testfile 4 times
    cp f1.txt f2.txt
    cp f1.txt f3.txt
    cp f1.txt f4.txt
    cp f1.txt f5.txt
}

# Use the number in the input filename for making a unique name and move the output
function move_output {
    # The parameter ${1} is filled with something like I03.txt
    # You can get the number with a sed action, but it is more efficient to use
    # bash parameter expansion, even in 2 steps.
    # First step: cut off from the end as much as possible (%%) starting with a dot.
    Inumber=${1%%.*}
    # Step 2: remove the I from the Inumber (which now holds something like "I03").
    number=${Inumber#I}
    # Move all output files from the last run
    for outputfile in f*txt; do
        # Put the number in front of the original name
        mv "${outputfile}" "${number}_${outputfile}"
    done
}

# Start the main processing. You will perform the same logic for all input files,
# so make a loop over all of them. I guess all input files start with an "I",
# followed by 2 characters (a number), and .txt. No need to use ls for listing those.
for input in I??.txt; do
    # Call the dummy prog above with the name of the input file as a parameter
    myprog "${input}"
    # Now finally the show starts.
    # Call the function for moving the 5 output files to another name.
    move_output "${input}"
done
I guess you have the source code of this a.out binary. If so, I would modify it so that it writes to several file descriptors instead of several files. Then you can solve this very cleanly using redirects:
./a.out 3> fileX.1 4> fileX.2 5> fileX.3
and so on for every file you want to output. Writing to a file or to a (redirected) fd is equivalent in most programs (notable exception: memory-mapped I/O, but that is not commonly used in such scripts; look for mmap calls).
Note that this is not very esoteric, but a very well known technique that is regularly used to separate output (stdout, fd=1) from errors (stderr, fd=2).
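As a quick illustration of the technique (a hypothetical writer.sh standing in for the modified a.out):
#!/bin/bash
# writer.sh: writes its two result streams to fds 3 and 4;
# the caller decides which files those fds point at.
echo "first result stream"  >&3
echo "second result stream" >&4
Invoked as ./writer.sh 3> run1-f1.txt 4> run1-f2.txt, each stream lands in its own file, and the caller can vary the filenames per run.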

Execute Script to Run on Multiple Files

I have a script that I need to run on a large number of files.
This is the script and how it is run:
./tag-lbk.sh test.txt > output.txt
It takes a file as input and creates an output file. I need to run this on several input files, and I want a different output file for each input file.
How would I go about doing this? Can I make a script (I have not much experience writing bash scripts).
[edits]:
@fedorqui asked: Where are the names of the input files and output files stored?
There are several thousand files, each with a unique name. I was thinking maybe there is a way to recursively iterate through all the files (they are all .txt files). The output files should have names that are generated recursively, but in a random fashion.
Simple solution: Use two folders.
for input in /path/to/folder/*.txt ; do
    name=$(basename "$input")
    ./tag-lbk.sh "$input" > "/path/to/output-folder/$name"
done
or, if you want everything in the same folder:
for input in *.txt ; do
    if [[ "$input" = *-tagged.txt ]]; then
        continue # skip output
    fi
    name=$(basename "$input" .txt)-tagged.txt
    ./tag-lbk.sh "$input" > "$name"
done
Try this with a small set of inputs somewhere where it doesn't matter if files get deleted, corrupted or overwritten.
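One way to rehearse it safely first (a sketch: print the commands instead of running them):
for input in *.txt ; do
    if [[ "$input" = *-tagged.txt ]]; then
        continue   # skip files that are already outputs
    fi
    name=$(basename "$input" .txt)-tagged.txt
    echo "./tag-lbk.sh $input > $name"   # dry run: show what would happen
done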
The script below finds the files with extension .txt and redirects the output of the tag-lbk script to a randomly generated log file such as log.123.
#!/bin/bash
declare -a ar
# Find the files and store them in an array.
# This way you don't iterate over the output files
# generated by this script.
ar=($(find . -iname "*.txt"))
# Now iterate over the files and run your script
for i in "${ar[@]}"
do
    # Create a random file in the format log.123, log.345
    tmp_f=$(mktemp log.XXX)
    # Redirect your output to the log file
    ./tag-lbk.sh "$i" > "$tmp_f"
done
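Note that ar=($(find ...)) word-splits on whitespace, so filenames containing spaces would break it. A variant that avoids this (a sketch using find -print0 and a NUL-delimited read):
#!/bin/bash
# Read NUL-delimited names so spaces and newlines in filenames are safe.
while IFS= read -r -d '' f; do
    tmp_f=$(mktemp log.XXX)
    ./tag-lbk.sh "$f" > "$tmp_f"
done < <(find . -iname '*.txt' -print0)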

In shell, how do I delete numbered duplicate files?

I've got a directory with a few thousand files in it, named things like:
filename.ext
filename (1).ext
filename (2).ext
otherfile.ext
otherfile (1).ext
etc.
Most of the files with bracketed numbers are duplicates of the original, but in some cases they're not.
How can I keep my original files, delete the duplicates, but not lose the files that are different?
I know that I could rm *\).ext, but that obviously doesn't make sure that files match the original.
I'm using OS X, so I have an md5 program that functions somewhat like md5sum on Linux, though it puts the hash at the end of the line instead of the beginning. I was thinking I could use an awk script to take the output of md5 *.ext | awk 'some script', find duplicates by MD5, and delete them, but the command line is too long (bash: /sbin/md5: Argument list too long).
And I don't know what to write in the script. I was thinking of storing things in an array with this:
awk '{a[$NF]++} a[$NF]>1{sub(/\).*/,""); sub(/.*\(/,""); system("rm " $0);}'
But that always seems to delete my original.
What am I doing wrong? How do I do it right?
Thanks.
Your awk script deletes original files because when your files are sorted, . (period) sorts after ' ' (space). So the first file that's seen is a numbered one, not the original, and subsequent checks (including the one against the original) compare files to that first numbered one.
Not only does rm *\).ext fail to match against the original, it loses files that may not have an original in the first place.
I wouldn't do this quite this way. Rather than checking every numbered file and verifying whether it matches an original, you can go through your list of originals, then delete the numbered files that match them.
Instead:
$ for file in *[^\)].ext; do echo "-- Found: $file"; rm -v "$(basename "$file" .ext)"\ \(*\).ext; done
You can expand this to check MD5's along the way. But it's more code, so I'll break it into multiple lines, in a script:
#!/bin/bash
shopt -s nullglob # Expand to nothing if a fileglob matches no files
for file in *[^\)].ext; do
    md5=$(md5 -q "$file") # The -q option gives you only the message digest
    echo "-- Found: $file ($md5)"
    for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
        if [[ "$md5" = "$(md5 -q "$duplicate")" ]]; then
            rm -v "$duplicate"
        fi
    done
done
As an alternative, you can probably get away with doing this a little more simply, with less CPU overhead than calculating MD5 digests. Unix and Linux systems have a tool called cmp, which is like diff but only reports whether the files differ. So:
#!/bin/bash
shopt -s nullglob
for file in *[^\)].ext; do
    for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
        if cmp -s "$file" "$duplicate"; then # -s: compare silently, exit status only
            rm -v "$duplicate"               # delete the numbered copy, keep the original
        fi
    done
done
If you don't need to use AWK, you could maybe do something simpler in bash (note: [0-9][0-9]* instead of [0-9]\+ keeps the regex portable to the BSD sed that ships with OS X):
for file in *\([0-9]*\)*; do
    [ -e "$(echo "$file" | sed -e 's/ ([0-9][0-9]*)//')" ] && rm "$file"
done
Hope this helps a little =)
