Problems extracting data from multiple files with awk - bash

If I have 300 files:
f_1.dat, f_2.dat, ..., f_300.dat
For example, the file f_114.dat has a structure like:
atom(index):
114
Ave: cosp1 cosp2 cosp3:
-0.74 -0.54 -0.37
...
I want to extract the 2nd line (e.g., the number 114) and the 4th line (the three numbers) from certain files (f_114.dat, f_182.dat, ..., f_249.dat) among the 300 files and merge them into the same file, e.g.:
114 -0.74 -0.54 -0.37
182 -0.72 -0.59 -0.37
…
I tried a for loop; the command is
for i in `114,182,251,131,183,257,140,191,31,148,192,48,151,195,51,92,177,249`; do awk -v num=$i 'NR=2&&NR=4{print $0}' f_$i.dat > $i.dat; done
But I get a syntax error.
Could you suggest a solution to the problem?

I think you want something like:
for i in {114,182,251,131,183,257,140,191,31,148,192,48,151,195,51,92,177,249}; do
    awk 'BEGIN {ORS=" "} NR==2 || NR==4' "f_$i.dat"
    echo ""
done > merged-file.dat
There were a few problems with your for loop:
for loops are structured like for i in LIST ; do ... ; done, and your LIST was invalid: wrapping 114,182,... in backticks makes the shell try to execute it as a command, and a single comma-separated word wouldn't iterate anyway; brace expansion ({114,182,...}) produces the list you want;
NR=2&&NR=4 assigns to NR instead of comparing it; equality in awk is spelled ==, and since no line is both the 2nd and the 4th, you want NR==2 || NR==4;
-v num=$i was not needed, as the num variable wasn't even used in the awk expression;
you wanted to merge the output into the same file, so putting > $i.dat in the for loop wouldn't work, as each iteration writes to a separate file;
you need to place > merged-file.dat after done, outside the for loop, or
append (>>) the output of each iteration of the for loop to the same file with something like: for i in ... ; do ... >> merged-file.dat ; done
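If you'd rather avoid the shell loop entirely, the whole job can be done in a single awk invocation (a sketch; brace expansion builds the file list, and FNR restarts at 1 for each input file, so line 2 of each file is printed with a trailing space and line 4 finishes the row):
awk 'FNR==2 { printf "%s ", $0 } FNR==4 { print }' f_{114,182,251,131,183,257,140,191,31,148,192,48,151,195,51,92,177,249}.dat > merged-file.dat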


Make the cat command operate recursively, looping through a directory

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
    echo $file1
    file2="${file1}.end"
    file3="${file1}.snip"
    sed -e '16,$d' $file1 > $file2
    sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the world's simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is stitch file2 to the previous file3.
The files in the directory are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code; in fact, it would probably be best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about that needs further explanation, please let me know.
"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16 {close(f); f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new appended. The close(f) closes the file we were previously writing to, so awk never holds more than one output file open at a time.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this were a shell script, we would need >>. This is not a shell script; this is awk, where a file opened with > stays open and subsequent prints append to it.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
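If the .ascii files are spread across nested subdirectories, the glob can be swapped for find. A sketch, assuming the file names contain no whitespace (xargs splits on it), and piping through sort because the stitching depends on the files being processed in order:
find "$destination" -name '*.ascii' | sort | xargs awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}'
One caveat: if the list is long enough for xargs to split it across several awk invocations, the first file of each batch keeps its first 15 lines, so for very large trees pass the whole sorted list to a single awk run.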
Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping over the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion with substring removal, say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
      |  d1  |        |  d2  |
then you simply subtract 1 from d1 (into, say, d0) and construct the previous filename out of d0 and d1, using _snip instead of _end. (Plain integer subtraction only works within a month; crossing a month boundary would need real date arithmetic, e.g. GNU date -d.) Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file onto the previous _snip file. e.g.
#!/bin/bash

for i in *end; do                ## find all _end files
    d1="${i#*stuff_}"            ## isolate first date in filename
    d1="${d1%%T*}"
    d2="${i%T*}"                 ## isolate second date
    d2="${d2##*_}"
    d0=$((d1 - 1))               ## subtract 1: first date of the previous file
    prev="${i/$d1/$d0}"          ## create previous 'snip' filename
    prev="${prev/$d2/$d1}"
    prev="${prev%end}snip"
    if [ -f "$prev" ]; then      ## test that the prev snip file exists
        printf "paste to : %s\n" "$prev"
        printf "    from : %s\n\n" "$i"
    fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
    from : stuff_20090414T235945_20090415T235944_end

paste to : stuff_20090414T235945_20090415T235944_snip
    from : stuff_20090415T235945_20090416T235944_end

paste to : stuff_20090415T235945_20090416T235944_snip
    from : stuff_20090416T235945_20090417T235944_end

paste to : stuff_20090416T235945_20090417T235944_snip
    from : stuff_20090417T235945_20090418T235944_end

paste to : stuff_20090417T235945_20090418T235944_snip
    from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
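Once the pairing output looks right, the two printf lines can be replaced by the actual append, modifying the _snip files in place:
cat "$i" >> "$prev"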
You could store the previous $file3 value in a variable (skipping the first iteration with a -n check, since there is no previous file yet):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in "$destination"*.ascii
do
    echo "$file1"
    file2="${file1}.end"
    file3="${file1}.snip"
    sed -e '16,$d' "$file1" > "$file2"
    sed -e '1,15d' "$file1" > "$file3"
    if [ -n "$prev" ]; then
        # the .joined suffix is just an example name; one output per pair
        cat "$prev" "$file2" > "${prev%.snip}.joined"
    fi
    prev=$file3
done

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column into numerical order, so that I can pull all lines sharing the same first number into new text files "cluster${i}.txt". From there I want to sort the fourth column of each "cluster${i}.txt" file into numerical order, and after sorting, write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
    awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
    until [[$i -lt 20]]; do
        i=$((i+1))
    done
done
for f in *.txt; do
    sort -nk 4 > temp2.txt
    head -1 temp2.txt
    rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
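If your awk doesn't cache descriptors, a portable sketch is to close each cluster file as soon as its key changes, which is safe here because the input arrives sorted on the first column:
sort -nk 1,4 <in.txt | awk '
$1 != last {
    if (last != "") close("cluster" last ".txt")
    last = $1
    print $0 > "output.txt"
}
{ print $0 > ("cluster" $1 ".txt") }'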
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
    awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
    head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Getting specific lines of a file

I have a file with 25 million rows, and I want to extract a specific 10 million lines from it.
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
    while ((current < wanted))
    do
        if read -u 3 line
        then ((current++))
        else break 2
        fi
    done
    echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the lines wanted. I am not sure if this is efficient enough, but it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-justified with leading zeroes at width 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with line numbers in exactly the same format as the preprocessed index file, then joins the two to keep only the wanted rows (cut removes the line numbers).
Since you said the file with the line numbers you're looking for is sorted, you can step through the two files together in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
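For example, with a tiny data file and a sorted list of line numbers:
$ printf '2\n4\n' > line_numbers.txt
$ printf 'a\nb\nc\nd\ne\n' > big_file.txt
$ awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
b
d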
Say your index file is index.txt and your data file is data.txt; then you can do it with sed as follows:
#!/bin/bash
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt
Note that this rescans data.txt for every index, so it will be slow for 10 million lines.
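If you want to stay with sed but make only one pass over data.txt, you can instead generate a sed script from the index file (a sketch; the process substitution assumes bash):
sed -n -f <(sed 's/$/p/' index.txt) data.txt
The inner sed turns each index into a print command (e.g. 114p), and sed -n then runs them all in a single pass. With 10 million indices the generated script is large, though, so the awk approach above may still be faster.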
You could run a loop that reads the 25-million-line file and, when the loop counter reaches a line number that you want, writes that line out. For example, in Java (a sketch; br is a BufferedReader over the data file and wanted is a hypothetical Set<Integer> of the indices):
String line;
int count = 0;
while ((line = br.readLine()) != null) {
    count++;
    if (wanted.contains(count)) {
        System.out.println(line); // or write to a file
    }
}

Bash - select lines of a file based on values in another file

I have 2 files; let's call them file1 and file2. file1 contains a start and an end coordinate in each row, e.g.:
start end
2000 2696
3465 3688
8904 9546
etc.
file2 has several columns, of which the first is the most relevant for the question:
position v2 v3 v4
3546 value12 value13 value14
9847 value22 value23 value24
12000 value32 value33 value34
Now, I need to output a new file that contains only the lines of file2 for which the 'position' value (1st column) is between the 'start' and 'end' values of any row of file1. In R I'd just write a double loop, but that takes too much time (the files are large), so I need to do it in bash. In case the question is unclear, here's the R loop that would do the job:
file2$select = 0
for(i in 1:dim(file1)[1]){
  for(j in 1:dim(file2)[1]){
    if(file2[j,1] > file1$start[i] & file2[j,1] < file1$end[i]) file2$select[j] = 1
  }
}
Very sure there's a simple way of doing this using bash / awk...
The awk command will look like this, but you'll need to remove the header line from file1 and file2 first:
awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
The part in curly braces after FNR==NR applies only while processing file1; it stores field 1 in array x[] and field 2 in array y[], so we have the lower and upper bounds of each range. The second set of curly braces applies only while processing file2: it iterates through all the bounds in arrays x[] and y[], checks whether field 1 is between the bounds, and prints the whole record if it is.
If you don't want to remove the header line at the start, you can make the awk a little more complicated and ignore it like this:
awk 'FNR==1{next}FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
EDITED
Ok, I have added code to check "chromosome" (whatever that is!), assuming it is the first field in both files, like this:
File1
x 2000 2696
x 3465 3688
x 8904 9546
File2
x 3546 value12 value13 value14
y 3467 value12 value13 value14
x 9847 value22 value23 value24
x 12000 value32 value33 value34
So the code now stores the chromosome in array c[] as well and checks they are equal before outputting.
awk 'BEGIN{i=0}FNR==NR{c[i]=$1;x[i]=$2;y[i++]=$3;next}{for(j=0;j<i;j++){if(c[j]==$1&&$2>=x[j]&&$2<=y[j]){print $0;next}}}' file1 file2
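With the sample File1 and File2 above (saved as file1 and file2), only the first row of File2 has a position inside an x range:
$ awk 'BEGIN{i=0}FNR==NR{c[i]=$1;x[i]=$2;y[i++]=$3;next}{for(j=0;j<i;j++){if(c[j]==$1&&$2>=x[j]&&$2<=y[j]){print $0;next}}}' file1 file2
x 3546 value12 value13 value14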
Don't know how to do this in bash...
I would try a Perl script, reading the first file and storing it in memory (if possible; that depends on its size) and then going through the second file line by line, doing the comparisons to decide whether to output each line.
I think you can do this in R too, the same way: store the first file, then loop over each line of the second file.
Moreover, if the intervals don't overlap, you can sort the files to speed up your algorithm.
This should be faster than the for loop (sapply over the positions in file2, with each comparison vectorized over all of file1's ranges):
res <- sapply(file2$position, function(p) any(p > file1$start & p < file1$end))
Assuming the delimiter for the files is a space (if not, change the -d setting):
the script uses cut to extract the first field of file2,
then a simple grep searches for that field in file1; if it is present, the line from file2 is printed.
#!/bin/bash
while read -r line
do
    word=$(echo "$line" | cut -f1 -d" ")
    if grep -q "$word" file1; then
        echo "$line"
    fi
done < file2

Using awk with Operations on Variables

I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by the corresponding value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84), etc.
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
    set -- $(awk '/genome/ {print $2,$3}' $file.hist)
    var1=$1
    var2=$2
    var3=$((var1*var2))
    total=$((total+var3))
    echo var1 \= $var1
    echo var2 \= $var2
    echo var3 \= $var3
    echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that everything after the first pair of numbers goes unused.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern that means "run this next block at the end of each input file, not at each line."
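Note that ENDFILE is a GNU awk (gawk) extension. A portable sketch that prints the same one total per file using only FNR, NR, and END:
awk 'FNR==1 && NR>1 { print total; total=0 } { total += $2 * $3 } END { print total }' fileone filetwo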
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
    total=0
    # split each line on whitespace: discard field 1, keep fields 2 and 3
    # as x and y, and let the trailing _ soak up any remaining fields
    while read -r _ x y _; do
        ((total += x * y))
    done < "$f"
    echo "$total"
done
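With the sample file from the question, this prints 333 (30 + 54 + 249). The header line contributes nothing here: its third field, Column, is not a number, and bash arithmetic treats it as an (unset) variable name that evaluates to 0.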
