add a column to multiple text files with bash - for-loop

I am trying to add a formatted column between columns of multiple text files. By using
awk 'NR==FNR{c[NR]=$3;l=NR;next}{$2=($3+c[1])*c[l]" "$2}7' file file
I can convert a file that has a form
1 2 3 4
10 20 30 40
100 200 300 400
to
1 1800 2 3 4
10 9900 20 30 40
100 90900 200 300 400
How can I do the above operation on multiple .dat files?

Wrap your awk command in a shell loop over the .dat files, writing to a temp file and moving it back over the original:
tmp="/usr/tmp/tmp$$"
for file in *.dat
do
    awk '...' "$file" "$file" > "$tmp" && mv "$tmp" "$file"
done
Regarding your script, though:
awk 'NR==FNR{c[NR]=$3;l=NR;next}{$2=($3+c[1])*c[l]" "$2}7' file file
Never use the letter l (el) as a variable name as it looks far too much like the number 1 (one). I'd actually write it as:
awk 'NR==FNR{c[++n]=$3;next}{$2=($3+c[1])*c[n]" "$2}7' file file
or if memory is a concern for a large file:
awk 'NR==FNR{c[NR==1]=$3;next}{$2=($3+c[1])*c[0]" "$2}7' file file
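Putting the loop and the rewritten one-liner together for the .dat files in the question, a minimal sketch (using mktemp rather than a fixed temp path) could be:
tmp=$(mktemp) || exit 1
for file in *.dat
do
    # two passes over the same file: collect column 3, then rewrite column 2
    # (the trailing 7 is just a true condition, i.e. print the modified line)
    awk 'NR==FNR{c[++n]=$3;next}{$2=($3+c[1])*c[n]" "$2}7' "$file" "$file" > "$tmp" &&
        mv "$tmp" "$file"
done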

Related

bash + awk: extract specific information from ensemble of files

I am using a bash script to extract some information from log files located within a directory and save the summary in a separate file.
In the bottom of each log file, there is a table like:
mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b. | rmsd u.b.
-----+------------+-----------+-----------
1 -6.961 0 0
2 -6.797 2.908 4.673
3 -6.639 27.93 30.19
4 -6.204 2.949 6.422
5 -6.111 24.92 28.55
6 -6.058 2.836 7.608
7 -5.986 6.448 10.53
8 -5.95 19.32 23.99
9 -5.927 27.63 30.04
10 -5.916 27.17 31.29
11 -5.895 25.88 30.23
12 -5.835 26.24 30.36
From this I need to take only the value from the second column of the first line (-6.961) and combine it with the name of the log as one string in a new file, ranking_${output}.log:
log_name -6.961
so for 5 processed logs it should be something like:
# ranking_${output}.log
log_name1 -X.XXX
log_name2 -X.XXX
log_name3 -X.XXX
log_name4 -X.XXX
log_name5 -X.XXX
Here is a simple bash workflow, which takes ALL THE LINES from the ranking table and saves them together with the name of the LOG file:
#!/bin/bash
home="$PWD"
# folder containing all *.log files
results="${home}/results"
# loop over each log file and append its name + the whole ranking table
for log in "${results}"/*.log; do
    log_name=$(basename "$log" .log)
    echo "$log_name" >> "${results}/ranking_${output}.log"
    tail -n 12 "$log" >> "${results}/ranking_${output}.log"
done
Could you suggest an AWK routine that would select only the top value, located on the first line of each table?
This is an AWK example that I had used for another format, which does not work here:
awk -F', *' 'FNR==2 {f=FILENAME;
             sub(/.*\//,"",f);
             sub(/_.*/ ,"",f);
             printf("%s: %s\n", f, $5)}' ${results}/*.log >> ${results}/ranking_${output}.log
With awk. If the first column contains 1, print the filename and the second column to the file output:
awk '$1=="1"{print FILENAME, $2}' *.log > output
Update to remove the path and the suffix (.log):
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log$/,"",FILENAME); print FILENAME, $2}' *.log > output

Split large csv file into multiple files and keep header in each part

How to split a large csv file (1GB) into multiple files (say one part with 1000 rows, a 2nd part with 10000 rows, a 3rd part with 100000, etc.) and preserve the header in each part?
How can I turn this
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 { # first record:
n=1000 # set first output file size and
h=$0 # store the header
}
{
print > n # output to file
}
NR==n+c { # once target NR has been reached. close(n) goes here if needed
n*=10 # grow target magnitude
c=NR-1 # set the correction factor.
print h > n # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
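If the number of output files grows large enough to hit awk's limit on simultaneously open files, close each file as the comment in the explained version suggests; a variant with close(n) added:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{close(n);n*=10;c=NR-1;print h>n}' file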
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){header=$0;next}
    (file=="" || n==l) {
        if (file!="") { close(file); l*=10 }
        c=sprintf("%0.5d",c+1)
        file=FILENAME; sub(/csv$/,c".csv",file)
        print header > file
        n=0
    }
    {print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){header=$0;next}: if the record/line is the first line, save that line as the header.
(file=="" || n==l){...}: on the first data line (no output file is open yet), and every time we have written the requested number of records/lines, we start writing to a new file by performing the following actions:
if (file!="") { close(file); l*=10 }: close the file we just wrote to and increase the maximum record count for the next file; both steps are skipped for the very first file.
c=sprintf("%0.5d",c+1): increase the counter by one, and print it as 000xx
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the file and write the header to that file.
n=0: reset the current record count
{print $0 > file; n++}: write the entries to the file and increment the record count
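For a hypothetical file.csv with enough rows, the result is a series of growing pieces next to the input:
$ ls file.*.csv
file.00001.csv  file.00002.csv  file.00003.csv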
Hacky, but utilizes the split utility, which does most of the heavy lifting for splitting the files. Then, with the split files following a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and then move that file back over the original filename.
# Use the `split` utility to split the csv file, 5000 lines per piece,
# adding numerical suffixes, and adding the additional suffix '.split' to
# help identify the pieces.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers:
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies the file that DOES have the header:
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# the file that has it with the rest of the body, output to tmp.txt, then mv
# tmp.txt back over the original filename.
header_file=$(ls -1 *.split | egrep -e 'x0+\.split')
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
    cat <(head -1 "$header_file") "$f" > tmp.txt
    mv tmp.txt "$f"
done
Here's a first approach:
#!/bin/bash
head -1 "$1" > header
split "$1" y
for f in y*; do
    cp header "h$f"
    cat "$f" >> "h$f"
done
rm -f header
rm -f y*
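Assuming that script is saved as headersplit.sh (a hypothetical name), it takes the csv as its first argument and leaves the header-carrying pieces in the hy* files:
bash headersplit.sh file.csv
ls hy*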
The following bash solution should work nicely:
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}
We have a first iteration level which produces the number of rows we're going to write to each successive file. It generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input. If it works, we write it to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of loops.
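A hypothetical invocation, assuming the script is saved as growing_split.sh and the input is file.csv:
bash growing_split.sh < file.csv
ls file_with_*_rows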

using bash scripting for extracting data to various files

awk '{for(i=1; i<=10; i++){if($1== 2**($i)){getline; print}}}' test.csv>> test/test_$i.csv
Description: I want to extract data to multiple files where column 1 of the input file has sizes that are powers of 2. I want to extract rows having the same size into a different file.
input file:
4 10.06 9.64 10.36 1000
8 10.16 9.79 10.48 1000
16 10.49 10.02 10.86 1000
32 10.54 10.13 10.91 1000
4 10.76 9.64 10.36 1000
8 10.90 9.79 10.48 1000
awk 'log($1)/log(2) == int(log($1)/log(2)) { out="pow-" $1; print >out }' file.in
This will, for the given data, create the files pow-N for N equal to 4, 8, 16 and 32.
It will skip lines that do not have a number that is a power of 2 in their first column.
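If comparing floating-point logarithms for equality feels fragile, an integer-only sketch using repeated halving does the same classification:
awk '{ n = $1; while (n > 1 && n % 2 == 0) n /= 2 }   # halve while even
     n == 1 { out = "pow-" $1; print > out }          # only powers of 2 reach 1
' file.in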
Thanks for your help.
I figured out a possible solution:
for i in $(seq 0 "$numline")
do
    if [ -e "$InDir/$file" ]
    then
        awk -v itr="$i" '{if ($2 == 2**itr) {print $0}}' "$file" >> "$OutDir/$(awk "BEGIN{print 2**$i}")mb_$file"
    fi
done

Bash script cut at specific ranges

I have a log file with plenty of collected logs. I already made a grep command with a regex that outputs the line numbers of the lines matching it.
This is the grep command I'm using to output the matched line numbers:
grep -n -E 'START_REGEX|END_REGEX' Example.log | cut -d ':' -f 1 > ranges.txt
The regex has two alternatives: it can match the beginning of a specific log or its end, thus the output is something like:
12
45
128
136
...
The idea is to use this as a source of ranges to make specific cuts on the log file, from the first number to the second, and save them in another file.
The ranges are made of couples from the output; according to the example, the first range is 12,45 and the second 128,136.
I expect to see in the final file all the text from line 12 to 45 and then from 128 to 136.
The problem I'm facing is that the sed command seems to work with only one range at a time.
sed -E -iTMP "$START_RANGE,$END_RANGE! d;$END_RANGEq" $FILE_NAME
Is there any way (maybe with awk) to do that just in one "cycle"?
Constraints: I can only use supported bash commands.
You can use an awk statement, too
awk '(NR>=12 && NR<=45) || (NR>=128 && NR<=136)' file
where NR is a special variable in Awk which keeps track of the line number as it processes the file.
An example,
seq 1 10 > file
cat file
1
2
3
4
5
6
7
8
9
10
awk '(NR>=1 && NR<=3) || (NR>=8 && NR<=10)' file
1
2
3
8
9
10
You can also avoid hard-coding the line numbers by using the -v variable option:
awk -v start1=1 -v end1=3 -v start2=8 -v end2=10 '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2)' file
1
2
3
8
9
10
With sed you can do multiple ranges of lines like so:
sed -n '12,45p;128,136p'
This would output lines 12-45, then 128-136.
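To avoid hard-coding line numbers entirely, the pairs already collected in ranges.txt can drive awk directly; a sketch, assuming the file holds an even count of numbers alternating start/end:
awk 'NR==FNR { r[NR] = $1; n = NR; next }        # first pass: load the boundaries
     { for (i = 1; i < n; i += 2)                # second pass: check each start/end pair
           if (FNR >= r[i] && FNR <= r[i+1]) { print; next } }
' ranges.txt Example.log > extracted.log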

How to extract 100 numbers at a time from a text file?

I am absolutely new to bash scripting but I need to perform some task with it. I have a file with just one column of numbers (6250000 lines). I need to extract 100 at a time, put them into a new file and submit each 100 to another program. I think this should be some kind of loop going through my file 100 numbers at a time and submitting them to the program.
Let's say the numbers in my file look like this:
1.6435
-1.2903
1.1782
-0.7192
-0.4098
-1.7354
-0.4194
0.2427
0.2852
I need to feed each of those 62500 output files to a program which has a parameter file. I was doing something like this:
lossopt()
{
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
}
for i in {1..62500}
do
    sed -n 1,100p ./rearnum.out > ./lossin.out
    echo temp.par | ./lossopt >> lossopt.out
    rm lossin.out
    cut -d " " -f 101- rearnum.out > rearnum.out
done
rearnum.out is my big initial file.
If you need to split it into files containing 100 lines each, I'd use split -l 100 <source>, which will create a lot of files named like xaa, xab, xac, ..., each of which contains at most 100 lines of the source file (the last file may contain fewer). If you want the names to start with something other than x, you can give the prefix those names should use as the last argument to split, as in split -l 100 <source> OUT, which will give files like OUTaa, OUTab, ...
Then you can loop over those files and process them however you like. If you need to run a script with them you could do something like
for file in OUT*; do
<other_script> "$file"
done
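Tied back to the question's workflow, a sketch (assuming ./lossopt reads lossin.out, as named in the parameter file) could be:
split -l 100 rearnum.out chunk_          # 62500 pieces of 100 numbers each
for file in chunk_*; do
    cp "$file" lossin.out                # the .par file points lossopt at lossin.out
    echo temp.par | ./lossopt >> lossopt.out
done
rm -f chunk_* lossin.out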
You can still use a read loop and redirection:
#!/bin/bash

fnbase=${1:-file}
increment=${2:-100}
declare -i count=0
declare -i fcount=1
fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"

while read -r line; do
    ((count == 0)) && :> "$fname"
    ((count++))
    echo "$line" >> "$fname"
    ((count % increment == 0)) && {
        count=0
        ((fcount++))
        fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"
    }
done

exit 0
use/output
$ bash script.sh yourprefix <yourfile
Which will take yourfile with many thousands of lines and write every 100 lines out to yourprefix_00000001 -> yourprefix_99999999 (the default is file_00000001, etc.). Each new filename is truncated to 0 lines before writing begins.
Again you can specify on the command line the number of lines to write to each file. E.g.:
$ bash script.sh yourprefix 20 <yourfile
Which will write 20 lines per file to yourprefix_00000001 -> yourprefix_99999999
Even though it may seem stupid to a bash professional, I will take the risk and post my own answer to my question.
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
for i in {1..62500}
do
    sed -n 1,100p ./rearnum.out >> ./lossin.out    # take the first 100 numbers
    echo temp.par | ./lossopt >> sdis.out          # feed them to the program
    rm lossin.out
    tail -n +101 rearnum.out > temp                # drop the consumed portion
    tail -n +1 temp > rearnum.out                  # copy the remainder back
    rm temp
done
This script successively "eats" the big initial file and puts the "pieces" into the external program. After it takes one portion of 100 numbers, it deletes that portion from the big file. Then the process repeats until the big file is empty. It is not an elegant solution, but it worked for me.
