Paste files conditionally with bash if and awk loop - bash

I have a list of files that I want to paste to a master file (bar) if some awk condition is fulfilled.
for foo in *;
do
if awk '*condition* {exit 1}' $foo
then
:
else
paste $foo > bar
fi
done
However, it looks like only the last pasted file is in bar. Shouldn't paste add new columns to bar every time, without overwriting all the data completely?
File1    File2    Expected_Output    Actual_Output
1 4      1 NaN    1 4 1 NaN          1 NaN
2 5      2 7      2 5 2 7            2 7
3 6      3 8      3 6 3 8            3 8

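Your paste command overwrites bar at each iteration of the loop: the > redirection truncates the file before paste writes to it, and paste only combines the files named as its arguments (it never reads bar). That is why you end up with only the last file.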
declare -a FILES=()
for foo in *
do
    if awk '*condition* {exit 1}' "$foo"
    then
        :
    else
        FILES+=("$foo")
    fi
done
paste "${FILES[@]}" > bar
This code accumulates all filenames that match your condition in an array named FILES, and calls paste only once, expanding all filenames into individual, quoted arguments (this is what "${FILES[@]}" does) and redirecting output to the bar file.
Additionally, you can replace the whole if/then/else block with:
awk '*condition* {exit 1}' "$foo" || FILES+=("$foo")
The || operator short-circuits: Bash evaluates the right-hand command only if the left-hand one fails, so FILES+=("$foo") runs only when awk returns a non-zero exit code.
Please note I quoted "$foo" (when passing it to awk) for the cases the name of your files would contain special characters.
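To see the array approach end to end, here is a minimal, self-contained sketch. The temporary directory, the .dat file names, and the awk condition (first line is a number) are all made up for the demo; substitute your own condition.

```bash
#!/bin/bash
# Sketch only: the sample files and the awk condition are hypothetical.
workdir=$(mktemp -d)
printf '1\n2\n3\n'    > "$workdir/a.dat"
printf '4\n5\n6\n'    > "$workdir/b.dat"
printf 'skip\n7\n8\n' > "$workdir/c.dat"   # first line is not a number

files=()
for foo in "$workdir"/*.dat; do
    # awk exits with status 1 when the first line is a number;
    # only then does the right-hand side of || run.
    awk 'NR == 1 && /^[0-9]+$/ { exit 1 }' "$foo" || files+=("$foo")
done

# Call paste once, with every selected file as its own argument.
paste "${files[@]}" > "$workdir/bar"
cat "$workdir/bar"
```

Here a.dat and b.dat are selected and c.dat is left out, so bar ends up with two tab-separated columns.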

Related

Convert range to string

If I run the
echo {0..9}
command, then I get the following output:
0 1 2 3 4 5 6 7 8 9
Can I somehow put the string "0 1 2 3 4 5 6 7 8 9" into a variable inside bash script? I only found a way using echo:
x=`echo {0..9}`
But this method implies the execution of an external program. Is it possible to somehow manage only with bash?
I am interested not only in converting a range to a string, but also in concatenating each element with a string, for example:
datafiles=`echo data{0..9}.txt`
First of all,
x=`echo {0..9}`
doesn't call an external program (echo is a built-in) but creates a subshell. If it isn't desired you can use printf (a built-in as well) with -v option:
printf -v x ' %s' {0..9}
x=${x:1} # strip off the leading space
or
printf -v datafiles ' data%s.txt' {0..9}
datafiles=${datafiles:1}
or you may want to store them in an array:
datafiles=(data{0..9}.txt)
echo "${datafiles[@]}"
This last method will work correctly even if filenames contain whitespace characters:
datafiles=(data\ {0..9}\ .txt)
printf '%s\n' "${datafiles[@]}"
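One more subshell-free variant, which is not part of the answer above: once the values are in an array, "${array[*]}" joins the elements with the first character of IFS (a space by default). The variable names range and names below are chosen only for this sketch.

```bash
#!/bin/bash
# Sketch: "${arr[*]}" joins array elements with the first character of IFS
# (a space by default), without a subshell or printf.
range=({0..9})
x="${range[*]}"

names=(data{0..9}.txt)
datafiles="${names[*]}"

echo "$x"
echo "$datafiles"
```

This keeps both forms available: the array for safe iteration and the joined string for display.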

Make cat command to operate recursively looping through a directory

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the world's simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is stitch file2 onto the previous file3.
The files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code, in fact it would be probably best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about and needs further explanation please let me know.
"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
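The difference from the shell can be checked with a small experiment. In this sketch the scratch file and its contents are made up: within a single awk run, print > file truncates the file only when it is first opened, and later prints to the same name append.

```bash
#!/bin/bash
# Sketch: inside one awk run, "print > file" truncates the file only when
# it is first opened; later prints to the same name append to it.
out=$(mktemp)                  # hypothetical scratch file
echo 'stale content' > "$out"

awk -v f="$out" 'BEGIN { print "first" > f; print "second" > f }'

cat "$out"
```

The stale line is gone and both new lines survive, which is why print>f is enough in the command above.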
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion with substring removal, say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
      |  d1  |        |  d2  |
then you simply subtract 1 from d1 into say date0 or d0 and then construct a previous filename out of d0 and d1 using _snip instead of _end. Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file to the previous _snip file. e.g.
#!/bin/bash
for i in *end; do                  ## find all _end files
    d1="${i#*stuff_}"              ## isolate first date in filename
    d1="${d1%%T*}"
    d2="${i%T*}"                   ## isolate second date
    d2="${d2##*_}"
    d0=$((d1 - 1))                 ## subtract 1 from first, get snip d1
                                   ## (plain integer math: wrong on the 1st of a month)
    prev="${i/$d1/$d0}"            ## create previous 'snip' filename
    prev="${prev/$d2/$d1}"
    prev="${prev%end}snip"
    if [ -f "$prev" ]              ## test that prev snip file exists
    then
        printf "paste to : %s\n" "$prev"
        printf " from : %s\n\n" "$i"
    fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
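The script above only reports which files belong together. A minimal sketch of the append itself could look like the following; the scratch directory, file names, and file contents are made up for the demo, and the d1 - 1 step is still plain integer arithmetic, so it does not roll over month boundaries.

```bash
#!/bin/bash
# Sketch: same filename logic as above, but performing the append.
# File names and contents are made up for the demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo 'body of the 14th' > stuff_20090413T235945_20090414T235944_snip
echo 'head of the 14th' > stuff_20090413T235945_20090414T235944_end
echo 'body of the 15th' > stuff_20090414T235945_20090415T235944_snip
echo 'head of the 15th' > stuff_20090414T235945_20090415T235944_end

for i in *_end; do
    d1="${i#*stuff_}"; d1="${d1%%T*}"    # first date in the name
    d2="${i%T*}";      d2="${d2##*_}"    # second date in the name
    d0=$((d1 - 1))                       # integer math: no month rollover
    prev="${i/$d1/$d0}"
    prev="${prev/$d2/$d1}"
    prev="${prev%end}snip"
    if [ -f "$prev" ]; then
        cat "$i" >> "$prev"              # append current head to previous body
    fi
done

cat stuff_20090413T235945_20090414T235944_snip
```

Appending with >> modifies the _snip files in place, so it is worth running this on a copy of the data first.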
You could store the previous $file3 value in a variable (and use a -n check to skip the first iteration, when there is no previous file yet):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in "$destination"*.ascii
do
    echo "$file1"
    file2="${file1}.end"
    file3="${file1}.snip"
    sed -e '16,$d' "$file1" > "$file2"
    sed -e '1,15d' "$file1" > "$file3"
    if [ -n "$prev" ]; then
        cat "$prev" "$file2" > outfile
    fi
    prev=$file3
done

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
What I'm hoping to achieve is to sort the numbers in the first column in numerical order so that I can pull all the similar lines (e.g. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of the "cluster${i}.txt" files in numerical order. After sorting, I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort supports -g (general numeric sort, available in GNU sort), which handles exponential notation:
sort -k1,1n -k4,4g <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -k1,1n -k4,4g <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
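As a quick check of the pipeline on a handful of rows, here is a self-contained sketch. The sample rows and the scratch directory are made up, and it assumes a sort that supports -g (general numeric, a GNU/BSD extension) so that column 4 is compared with its exponent.

```bash
#!/bin/bash
# Sketch: sample data and the output directory are made up for the demo.
# sort -g (general numeric) is a GNU/BSD extension, assumed available here.
outdir=$(mktemp -d)
cat > "$outdir/in.txt" <<'EOF'
9 1 1 0.6132E+02
10 3 1 0.5874E+02
1 7 1 0.5450E+02
1 6 1 0.5004E+02
9 2 1 0.5314E+02
10 4 1 0.5266E+02
EOF

# Sort by column 1 numerically, then by column 4 (exponent-aware).
sort -k1,1n -k4,4g "$outdir/in.txt" | awk -v dir="$outdir" '
    NR == 1 || $1 != last {            # first row of each new key
        print > (dir "/output.txt")
        last = $1
    }
    { print > (dir "/cluster" $1 ".txt") }
'

cat "$outdir/output.txt"
```

The parentheses around the output file names matter: without them, awk implementations disagree on how a concatenation after > is parsed.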
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

setting awk variables through inlining

I've got this:
./awktest -v fields=`cat testfile`
which ought to set fields variable to '1 2 3 4 5' which is all that testfile contains
It returns:
gawk: ./awktest:9: fatal: cannot open file `2' for reading (No such file or directory)
When I do this it works fine.
./awktest -v fields='1 2 3 4 5'
printing fields at the time of error yields:
1
printing fields in the second instance yields:
1 2 3 4 5
When I try it with 12345 instead of 1 2 3 4 5 it works fine for both, so it's a problem with the white space. What is this problem, and how do I fix it?
This is most likely not an awk question. Most likely, it is your shell that is the culprit.
For example, if awktest is:
#!/bin/bash
i=1
for arg in "$@"; do
    printf "%d\t%s\n" $i "$arg"
    ((i++))
done
Then you get:
$ ./awktest -v fields=`cat testfile`
1 -v
2 fields=1
3 2
4 3
5 4
6 5
You see that the file contents are not being handled as a single word.
Simple solution: use double quotes on the command line:
$ ./awktest -v fields="$(< testfile)"
1 -v
2 fields=1 2 3 4 5
The $(< file) construct is a bash shortcut for `cat file` that does not need to spawn an external process.
Or, read the first line of the file in the awk BEGIN block
awk '
BEGIN {getline fields < "testfile"}
rest of awk program ...
'
./awktest -v fields="`cat testfile`"
#note that:
#./awktest -v fields='`cat testfile`'
#does not work
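The effect of the quotes can be measured directly. In this sketch, count_args is a hypothetical helper that just reports how many arguments it received, and the test file is created in a temporary location.

```bash
#!/bin/bash
# Sketch: count how many arguments a command receives with and without quotes.
count_args() { echo "$#"; }

testfile=$(mktemp)             # hypothetical stand-in for testfile
echo '1 2 3 4 5' > "$testfile"

unquoted=$(count_args -v fields=$(< "$testfile"))    # word-split by the shell
quoted=$(count_args -v fields="$(< "$testfile")")    # kept as one word

echo "unquoted: $unquoted arguments"
echo "quoted:   $quoted arguments"

# With quotes, awk receives the whole string in a single variable.
nfields=$(awk -v fields="$(< "$testfile")" 'BEGIN { print split(fields, parts, " ") }')
echo "awk sees $nfields values in fields"
```

Without quotes the shell hands over six separate arguments, which is why gawk tried to open a file named 2.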

BASH: Iterating two variables in a for loop

I am having two files numbers.txt(1 \n 2 \n 3 \n 4 \n 5 \n) and alpha.txt (a \n n \n c \n d \n e \n)
Now I want to iterate both the files at the same time something like.
for num in `cat numbers.txt` && alpha in `cat alpha.txt`
do
echo $num "blah" $alpha
done
Or other idea I was having is
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
break
done
done
but this kind of code always take the first value of $alpha.
I hope my problem is clear enough.
Thanks in advance.
Here is what I actually intended to do. (It's just an example.)
I have one more file, say template.txt, with this content:
variable1= NUMBER
variable2= ALPHA
I want to take one line at a time from each of the two files, numbers.txt and alpha.txt, and replace NUMBER and ALPHA with the respective content from those files.
Here is what I did once I learned how to iterate over both files together:
paste number.txt alpha.txt | while read num alpha
do
cp template.txt temp.txt
sed -i "{s/NUMBER/$num/g}" temp.txt
sed -i "{s/ALPHA/$alpha/g}" temp.txt
cat temp.txt >> final.txt
done
Now what i am having in final.txt is:
variable1= 1
variable2= a
variable1= 2
variable2= b
variable1= 3
variable2= c
variable1= 4
variable2= d
variable1= 5
variable2= e
variable1= 6
variable2= f
variable1= 7
variable2= g
variable1= 8
variable2= h
variable1= 9
variable2= i
variable1= 10
variable2= j
It's a very simple and stupid approach. Is there any other way to do this?
Any suggestion will be appreciated.
No, your question isn't clear enough. Specifically, the way you wish to iterate through your files is unclear, but assuming you want to have an output such as:
1 blah a
2 blah b
3 blah c
4 blah d
5 blah e
you can use the paste utility, like this:
paste number.txt alpha.txt | while read num alpha ; do
    echo "$num blah $alpha"
done
or even:
paste -d'#' number.txt alpha.txt | sed 's/#/ blah /'
Your first loop is impossible in bash. Your second one, without the break, would combine each line from numbers.txt with each line from alpha.txt, like this:
1 AND a
1 AND n
1 AND c
...
2 AND a
...
3 AND a
...
4 AND a
...
Your break makes it skip all lines from the alpha.txt, except the 1st one (bmk has already explained it in his answer)
It should be possible to organize the correct loop using the while loop construction, but it would be rather ugly.
There are lots of easier alternatives which may be a better choice, depending on the specifics of your task. For example, you could try this:
paste numbers.txt alpha.txt
or, if you really want your "AND"s, then, something like this:
paste numbers.txt alpha.txt | sed 's/\t/ AND /'
And if your numbers are really sequential (and you can live without 'AND'), you can simply do:
cat -n alpha.txt
Here is an alternate solution according to the first model you suggested:
while read -u 5 a && read -u 6 b
do
echo $a $b
done 5<numbers.txt 6<alpha.txt
The notation 5<numbers.txt tells the shell to open numbers.txt using file descriptor 5. read -u 5 a means read a value for a from file descriptor 5, which has been associated with numbers.txt.
The advantage of this approach over paste is that it gives you fine-grain control over how you merge the two files. For example you could read one line from the first file and twice from the second file.
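As an illustration of that finer control, this sketch reads one line from the first file and two lines from the second on every iteration. The sample files and the variable names n, a1, a2, and result are made up for the demo.

```bash
#!/bin/bash
# Sketch: one line from descriptor 5, two lines from descriptor 6 per pass.
nums=$(mktemp); alphas=$(mktemp)       # hypothetical sample files
printf '1\n2\n'       > "$nums"
printf 'a\nb\nc\nd\n' > "$alphas"

result=()
while read -r -u 5 n && read -r -u 6 a1 && read -r -u 6 a2; do
    result+=("$n $a1 $a2")
done 5<"$nums" 6<"$alphas"

printf '%s\n' "${result[@]}"
```

The loop stops as soon as any of the three reads hits end of file, so a leftover odd line in the second file is silently dropped.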
In your second example the inner loop is executed only once because of the break. It will simply jump out of the loop, i.e. you will always only get the first element of alpha.txt. Therefore I think you should remove it:
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
done
done
If a nested loop isn't specifically your requirement, and you only need the corresponding lines from each file, you may try the following code:
for line in `cat numbers.txt`
do
echo $line "and" $(cat alpha.txt| head -n$line | tail -n1 )
done
The head gets you the number of lines equal to the value of line and tail gets you the last element.
@tollboy, I think the answer you are looking for is this:
count=1
for item in $(paste number.txt alpha.txt); do
if [[ "${item}" =~ [a-zA-Z] ]]; then
echo "variable${count}= ${item}" >> final.txt
elif [[ "${item}" =~ [0-9] ]]; then
echo "variable${count}= ${item}" >> final.txt
fi
count=$((count+1))
done
When you type paste number.txt alpha.txt in your console, you see:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
From bash's point of view $(paste number.txt alpha.txt) it looks like this:
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 10 j
So for each item in that list, figure out if it is alpha or numeric, and print it to the output file.
Lastly, increment the count.
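For the template-filling loop from the question, a shorter variant is also possible. This is only a sketch: it creates small sample files in a scratch directory, applies both substitutions in one sed call, and redirects the whole loop once instead of copying the template on each pass. It assumes the substituted values contain no "/" or "&" characters.

```bash
#!/bin/bash
# Sketch: fill the template once per line pair, without a temporary copy.
# Sample files are created in a scratch directory for the demo.
workdir=$(mktemp -d)
printf '1\n2\n' > "$workdir/numbers.txt"
printf 'a\nb\n' > "$workdir/alpha.txt"
printf 'variable1= NUMBER\nvariable2= ALPHA\n' > "$workdir/template.txt"

paste "$workdir/numbers.txt" "$workdir/alpha.txt" |
while read -r num alpha; do
    # Two substitutions in one sed call; values must not contain "/" or "&".
    sed -e "s/NUMBER/$num/g" -e "s/ALPHA/$alpha/g" "$workdir/template.txt"
done > "$workdir/final.txt"

cat "$workdir/final.txt"
```

Because sed writes to standard output here, the template itself is never modified.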
