Apply a gawk script to multiple files in a folder - bash

I would like to use the following awk line to remove every even line (and keep the odd lines) in a text file.
awk 'NR%2==1' filename.txt > output
The problem is that I struggle to either loop properly in awk or build a shell script to apply this to all *.txt files in a folder. I tried this one-liner
gawk 'FNR==1{if(o)close(o);o=FILENAME;
sub(/\.txt/,"_oddlines.txt",o)}{NR%2==1; print>o}'
but that didn't remove the even lines. And I am even less familiar with shell scripting. I use gawk under Win7 or Cygwin with bash. Many thanks for any kind of idea.

Your existing gawk one-liner is really close. Here it is formatted as a more readable script:
FNR == 1 {
    if (o)
        close(o)
    o = FILENAME
    sub(/\.txt/, "_oddlines.txt", o)
}
{
    NR % 2 == 1
    print > o
}
This should make the error obvious.[1] So now we remove that error:
FNR == 1 {
    if (o)
        close(o)
    o = FILENAME
    sub(/\.txt/, "_oddlines.txt", o)
}
NR % 2 == 1 {
    print > o
}
$ awk -f foo.awk *.txt
and it works (and of course you can re-one-line-ize this).
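For reference, the corrected script squeezed back into a one-liner might look like this (a sketch, not separately tested, same filenames assumed):
gawk 'FNR==1{if(o)close(o);o=FILENAME;sub(/\.txt/,"_oddlines.txt",o)}
NR%2==1{print>o}' *.txt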
(Normally I would do this with a for like the other answers, but I wanted to show you how close you were!)
[1] Per comment, maybe not quite so obvious?
Awk's basic language construct is the "pattern-action" statement. An awk program is just a list of such statements. The "pattern" is so named because originally they were mostly grep-like regular expression patterns:
$ awk '/^be.*st$/' < /usr/share/dict/web2
beanfeast
beast
[snip]
(Except for the slashes, this is basically just running grep, since it uses the default action, print.)
Patterns can actually contain two addresses, but it's more typical to use one, as in these cases. Patterns not enclosed within slashes allow tests like FNR == 1 (File-specific Number of this Record equals 1) or NR % 2 == 1 (Number of this Record—cumulative across all files!—mod 2 equals 1).
Once you hit the open brace, though, you're into the "action" part. Now NR % 2 == 1 simply calculates the result (true or false) and then throws it away. If you leave out the "pattern" part entirely, the "action" part is run on every input line. So this prints every line.
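A minimal illustration of the difference, using seq to generate four input lines:
$ seq 4 | awk 'NR % 2 == 1'            # as a pattern: the default action prints lines 1 and 3
1
3
$ seq 4 | awk '{ NR % 2 == 1; print }' # inside the action: the test is computed and discarded, every line prints
1
2
3
4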
Note that the test NR % 2 == 1 is testing the cumulative record-number. So if some file has an odd number of lines ("records"), the next file will print out every even-numbered line (and this will persist until you hit another file with an odd number of lines).
For instance, suppose the two input files are A.txt and B.txt. Awk starts reading A.txt and has both FNR and NR set to 1 for the first line, which might be, e.g., file A, line 1. Since FNR == 1 the first "action" is done, setting o. Then awk tests the second pattern. NR is 1, so NR % 2 is 1, so the second "action" is done, printing that line to A_oddlines.txt.
Now suppose file A.txt contains only that one line. Awk now goes on to file B.txt, resetting FNR but leaving NR cumulative. The first line of B might be file B, line 1. Awk tries the first "pattern", and indeed, FNR == 1 so this closes the old o and sets up the new one.
But NR is 2, because NR is cumulative across all input files. So the second pattern (NR % 2 == 1) computes 2 % 2 (which is 0) and compares == 1 which is false, and thus awk skips the second "action" for line 1 of file B.txt. Line 2, if it exists, will have FNR == 2 and NR == 3, so that line will be copied out.
(I originally assumed, since your script was close to working, that you intended this and were just stuck a bit on syntax.)
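If the intent really is odd lines per file rather than odd lines overall, swapping NR for FNR in the second pattern makes the test restart with each file; a sketch of that change:
FNR % 2 == 1 {
    print > o
}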

With GNU awk you could just do:
$ awk 'FNR%2{print > (FILENAME".odd")}' *.txt
This will create a .odd file for every .txt file in the current directory containing only the odd lines.
However, sed has the upper hand on conciseness here. The following GNU sed command will remove all even lines and keep a backup of each original file with the extension .bck, for all .txt files in the current directory:
$ sed -ni.bck '1~2p' *txt
Demo:
$ ls
f1.txt f2.txt
$ cat f1.txt
1
2
3
4
5
$ cat f2.txt
6
7
8
9
10
$ sed -ni.bck '1~2p' *txt
$ ls
f1.txt f1.txt.bck f2.txt f2.txt.bck
$ cat f1.txt
1
3
5
$ cat f1.txt.bck
1
2
3
4
5
$ cat f2.txt
6
8
10
$ cat f2.txt.bck
6
7
8
9
10
If you don't want the backup files, then simply:
$ sed -ni '1~2p' *txt
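Equivalently, you could delete the even lines in place instead of printing the odd ones; a sketch, still GNU sed:
$ sed -i '2~2d' *.txt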

Personally, I'd use
for filename in *.txt; do
awk 'NR%2==1' "$filename" > "oddlines-$filename"
done
EDIT: quote filenames

You can try a for loop :
#!/bin/bash
for file in dir/*.txt
do
oddfile=$(echo "$file" | sed -e 's|\.txt|_odd\.txt|g') #This will create file_odd.txt
awk 'NR%2==1' "$file" > "$oddfile" # This will output it in the same dir.
done
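A variant of the same loop that derives the output name with bash parameter expansion instead of spawning a sed process per file; a sketch, assuming the files really end in .txt:
#!/bin/bash
for file in dir/*.txt
do
    awk 'NR%2==1' "$file" > "${file%.txt}_odd.txt"
done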

Your problem is that NR%2==1 is inside the {NR%2==1; print>o} 'action block' and is not kicking in as a 'condition'. Use this instead:
gawk 'FNR==1{if(o)close(o);o=FILENAME;sub(/\.txt/,"_oddlines.txt",o)};
FNR%2==1{print > o}' *.txt

Related

Adding the last number in each file to the numbers in the following file

I have some directories, each of which contains a file with a list of integers 1-N; they are not necessarily consecutive and the lists may be different lengths. What I want to achieve is a single file with a list of all those integers as though they had been generated in one list.
What I am trying to do is to add the final value N from file 1 to all the values in file 2, then take the new final value of file 2 and add it to all the values in file 3 etc.
I have tried this by setting a counter and looping over the files, resetting the counter when I get to the end of each file. The problem is that p=0 keeps resetting, which is kind of obvious from the code, but I am not sure how else to do it.
What I tried:
p=0
for i in dirx/dir_*; do
(cd "$i" || exit;
awk -v p=$p 'NR>1{print last+p} {last=$0} END{$0=last; p=last; print}' file >> /someplace/bigfile)
done
Which is similar to the answer suggested in this question Replacing value in column with another value in txt file using awk
Now I'm wondering whether I need an if/else: if it's the first dir then p=0, otherwise p is the last value from the previous file, though I'm not sure about that or how I'd get it to take the last value. I used awk because that's what I understand a small amount of and would usually use.
With GNU awk
gawk '{print $1 + last} ENDFILE {last = last + $1}' file ...
Demo:
$ cat a
1
2
4
6
8
$ cat b
2
3
5
7
$ cat c
1
2
3
$ gawk '{print $1 + last} ENDFILE {last = last + $1}' a b c
1
2
4
6
8
10
11
13
15
16
17
18
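ENDFILE is a gawk extension; with a POSIX awk or mawk, a sketch along the same lines could carry the previous file's last value forward at the start of each file instead (only checked against the sample files above):
$ awk 'FNR==1{last+=prev} {print $1+last; prev=$1}' a b c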

Delete lines from a text file except the first and every nth

I have a long text file comprised of numbers, such as:
1
2
9.252
9.252
9.272
1
1
6.11
6.11
6.129
I would like to keep the first line, delete the subsequent three and then keep the next one. I would like to do this for the whole file. Following that logic, considering the input above, I would like to have the following output:
1
9.272
1
6.129
Using GNU sed (needed for the ~ extension):
sed -n '1~5p;5~5p' file
Saving your numbers in a "textfile.txt" I can use the following with sed:
sed -n 'p;n;n;n;n;p;' textfile.txt
Sed prints the first line of each group, reads the next four, and prints the last of those five.
Or the following using while read in bash:
while read -r firstline && read -r nextone1 && read -r nextone2 && read -r nextone3 && read -r lastone; do
printf "%s\n" "$firstline" "$lastone";
done < textfile.txt
This just reads 5 lines at a time and prints only the first and 5th lines.
You can simply say:
awk 'NR%5<2' input.txt
Explanation: since the pattern repeats every five lines, start by taking the line number NR modulo five. The 1st line of each five-line block then yields 1 and the 5th line yields 0, while lines 2-4 yield 2, 3 and 4. Comparing the result against two therefore separates the 1st and 5th lines from the others.
To print the 1st and 5th line of every block of 5 lines (remember that 5%5 = 0):
$ awk '(NR%5) ~ /[10]/' file
1
9.272
1
6.129
If you want to print the 2nd, 3rd, and 4th line of every block of 5 lines instead of the 1st and 5th:
$ awk '(NR%5) ~ /[234]/' file
2
9.252
9.252
1
6.11
6.11
If you wanted to print the 27th and 53rd line of every block of 100:
awk '(NR%100) ~ /^(27|53)$/' file
We couldn't use a bracket expression there since we're now beyond single-character numbers.
This might work for you (GNU sed):
sed '2~5,+2d' file
Starting from line 2 and then every 5 lines, delete that line and the following two (three lines in total).
An alternative:
sed -n '1p;5~5,+1p' file
Considering your groups are packed as 5 lines, you could use awk with a mod 5 operation.
awk '{i=(NR-1)%5;if(i==0||i==4)print $0}' input.txt
With indentation it looks like this:
{
    i = (NR-1) % 5;
    if (i == 0 || i == 4)
        print $0;
}
i=(NR-1)%5 takes the line number and computes its modulo with 5, but since line numbers start at 1 (instead of 0), you need to subtract 1 from it before computing the modulo.
This leaves you with an integer i that ranges from 0 to 4. You want to print the first line (index 0), skip the next three lines (indexes 1-3) and print the last line (index 4), which is exactly what if (i==0||i==4) print $0 does.
Alternately you can do the same thing with a shorter (and probably slightly more optimized version):
awk '((NR-1)%5==0||(NR-1)%5==4)' input.txt
This tells awk to do something for the 1st and 5th line of every group of 5 lines. Since the "something" is not defined, it outputs the current line by default. If it helps, this is strictly equivalent to:
awk '((NR-1)%5==0||(NR-1)%5==4){print $0}' input.txt
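The same idea with the block size pulled out into a variable, so it can be reused for other group sizes (a sketch):
awk -v n=5 '(NR-1)%n==0 || NR%n==0' input.txt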

Delete range of lines when line number is known or not, in Unix, using head and tail?

This is my sample file.
I want to do this.
I have a fixed requirement to delete the 2nd and 3rd lines while keeping the 1st line.
From the bottom, I want to delete the 2 lines above the last line, excluding the last line itself, since I wouldn't know the last line number as it depends on the file.
Once I delete the 2nd and 3rd lines, the 4th line should ideally move up to 2nd and so on; the same applies to the bottom after the deletion.
I want to use the head/tail commands and modify the existing file only, i.e. write the changes back to the same file.
Sample file text format.
Input File
> This is First Line
> Delete Delete Delete This Line
> Delete Delete Delete This Line
> ..
> ..
> ..
> ..
> Delete Delete Delete This Line
> Delete Delete Delete This Line
> This is Last Line, should not be deleted. It could come at any line number (variable)
Output file (same file modified)
This is First Line
..
..
..
..
This is Last Line, should not be deleted. It could come at any line number (variable)
Edit: because of compatibility issues (using HP-UX with the ksh shell) I want to implement this using head/tail/awk, not sed.
Adding a solution as per the OP's request to make it a more general solution.
Approach: with this solution the OP can provide line numbers counted from the start and from the end of any Input_file, and those lines will be skipped.
What the code will do: it generates an awk command from the lines you ask to be skipped and then runs it.
cat print_lines.ksh
start_line="2,3"
end_line="2,3"
total_lines=$(wc -l<Input_file)
awk -v len="$total_lines" -v OFS="||" -v s1="'" -v start="$start_line" -v end="$end_line" -v lines=$(wc -l <Input_file) '
BEGIN{
num_start=split(start, a,",");
num_end=split(end, b,",");
for(i=1;i<=num_start;i++){
val=val?val OFS "FNR=="a[i]:"FNR=="a[i]};
for(j=1;j<=num_end;j++){
b[j]=b[j]>1?len-(b[j]-1):b[j];
val=val?val OFS "FNR=="b[j]:"FNR=="b[j]};
print "awk " s1 val "{next} 1" s1" Input_file"}
' | sh
Change Input_file name to your actual file name and let me know how it goes then.
The following awk may help you with the same (since I don't have an HP system I didn't test it).
awk -v lines=$(wc -l <Input_file) 'FNR==2 || FNR==3 || FNR==(lines-1) || FNR==(lines-2){next} 1' Input_file
EDIT: Adding a non-one-liner form of the solution too.
awk -v lines=$(wc -l <Input_file) '
FNR==2 || FNR==3 || FNR==(lines-1) || FNR==(lines-2){
next}
1
' Input_file
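Since the question asks for head/tail rather than sed, a sketch built only on head, tail and wc (not tested on HP-UX; it assumes the same Input_file name and at least five lines in the file) could be:
total=$(wc -l < Input_file)
{
  head -n 1 Input_file
  head -n $((total-3)) Input_file | tail -n +4
  tail -n 1 Input_file
} > tmp_f && mv tmp_f Input_file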
wc + sed solution:
len=$(wc -l inpfile | cut -d' ' -f1)
sed "$(echo "$((len-2)),$((len-1))")d; 2,3d" inpfile > tmp_f && mv tmp_f inpfile
$ cat inputfile
> This is First Line
> ..
> ..
> ..
> ..
> This is Last Line, should not be deleted. It could come at any line number (variable)
Perl suggestion... read the whole file into array @L, get the index of the last line. Delete the 2nd-last, 3rd-last, 3rd and 2nd lines. Print what's left.
perl -e '@L=<>; $p=$#L; delete $L[$p-1]; delete $L[$p-2]; delete $L[2]; delete $L[1]; print @L' file.txt
Or, maybe a little more succinctly with splice:
perl -e '@L=<>; splice @L,1,2; splice @L,$#L-2,2; print @L' file.txt
If you wish to have some flexibility, a ksh script approach may work, though it is a little expensive in terms of resources:
#!/bin/ksh
[ -f "$1" ] || echo "Input is not a file" || exit 1
total=$(wc -l "$1" | cut -d' ' -f1 )
echo "How many lines to delete at the end?"
read no
[ -z "$no" ] && echo "Not sure how many lines to delete, aborting" && exit 1
sed "2,3d;$((total-no)),$((total-1))d" "$1" >tempfile && mv tempfile "$1"
And feed the file as argument to the script.
Notes
This deletes the second and third lines.
It also deletes no lines counted from the end, excluding the last line, where no is the number read from the user.
Note: My ksh version is 93u+ 2012-08-01
awk '{printf "%d\t%s\n", NR, $0}' < file | sed '2,3d;N;$!P;D' file
The awk here serves the purpose of providing line numbers and then passing the output to the sed which uses the line numbers to do the required operations.
%d : Used to print the numbers. You can also use '%i'
'\t' : used to place a tab between the number and string
%s : to print the string of characters
'\n' : To create a new line
NR : to print lines numbers starting from 1
For sed
N: Read/append the next line of input into the pattern space.
$! : is for not deleting the last line
D : If the pattern space contains no newline, start a new cycle as if the d command was issued. Otherwise, delete the text in the pattern space up to the first newline, and restart the cycle with the resultant pattern space, without reading a new line of input.
P : Print up to the first embedded newline of the current pattern space. This prints the lines after the targeted lines have been removed.
I enjoyed this task and wrote an awk script for the more scalable case (huge files).
It reads/scans the input file once (no need to know the line count) and does not store the whole file in memory.
script.awk
BEGIN { range = 3} # define sliding window range
{lines[NR] = $0} # capture each line in array
NR == 1 {print} # print 1st line
NR > range * 2 {                 # for lines past the sliding window range
    print lines[NR - range];     # print the sliding window top line
    delete lines[NR - range];    # delete the sliding window top line
}
END {print} # print last line
running:
awk -f script.awk input.txt
input.txt
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
output:
line 1
line 4
line 5
line 6
line 7
line 10

SED: copy lines from a file to specific line in another file

I can do this using the following example. The 1st command will output lines 16-80 from file1 to patch, while the 2nd will insert the contents of patch after line 18 of file2:
sed -n 16,80p file1>patch
sed -i 18rpatch file2
However, I would like to copy directly from one file to another without using a temporary file in-between, in one command using sed (not awk, etc.). I'm pretty sure this is possible, just don't know how.
Doing this with sed requires some additional shell trickery. Assuming bash, you could use
sed -i 18r<(sed '16,80!d' file1) file2
Where <(sed '16,80!d' file1) is substituted with the name of a pipe from which the output of sed '16,80!d' file1 can be read.
Generally, I feel that it is nicer to do this with awk (if a little longer), because awk is better equipped to handle multiple input files. For example:
awk 'NR == FNR { if(FNR >= 16 && FNR <= 80) { patch = patch $0 ORS }; next } FNR == 18 { $0 = patch $0 } 1' file1 file2
This works as follows:
NR == FNR {                        # While processing the first file
    if(FNR >= 16 && FNR <= 80) {   # remember the patch lines
        patch = patch $0 ORS
    }
    next                           # and do nothing else
}
FNR == 18 {                        # after that, while processing the second file:
    $0 = patch $0                  # prepend the patch to line 18
}
1                                  # and print regardless of whether the current
                                   # line was patched.
However, this approach does not lend itself to in-place editing of files. This is not usually a problem; I'd simply use
cp file2 file2~
awk ... file1 file2~ > file2
with the added advantage of having a backup in case things go pear-shaped, but in the end it's up to you.
I have done something similar using:
head -80 file | tail -16 > patch
Check the documentation for your local versions of head and tail, and change the two integers to suit your requirements. (Note that head -80 file | tail -16 extracts the last 16 of the first 80 lines, i.e. lines 65-80; for lines 16-80 you would need tail -65.)
sed -i '1,15 d
34 r patch
81,$ d' YourFile
# oneliner version
sed -i -e '1,15 d' -e '34 r patch' -e '81,$ d' YourFile
The order of the lines is not important.
You can adapt it a bit, or script it with variables like this:
sed -i "1,16 d
$(( 16 + 18 )) r patch
81,$ d" YourFile
but add some sanity checks on the line count in this case.
If the file given to r contains more than one line, the following lines are still counted from their original places and the final file ends up bigger than 80 - 16 lines.
I didn't test exactly which lines are taken, excluded or modified (e.g. whether 34 really is the 18th line of the cropped file), but the principle is the same.
Explanation of the line index references used in this sample:
1,15 are the heading lines to remove, so the file keeps lines from 16 onward in this case.
34 is the line where the content is inserted; it is the 18th line AFTER the first kept line (line 16 in our case), so 16 + 18 = 34.
81,$ are the trailing lines to remove; $ means the last line and 81 is the first unwanted trailing line (line 80 being the last one kept).
I had this problem and did it in 2 steps (1: tail, 2: head). For example, in a text file with 20 lines (test.txt), to copy lines 13 to 17 into another file (final.txt):
tail -8 test.txt > temp.txt
head -5 temp.txt > final.txt
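A variant that needs neither the total line count nor a temporary file, using tail's +N form (a sketch for the same 13-17 range):
tail -n +13 test.txt | head -5 > final.txt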

Extract specified lines from a file

I have a file and I want to extract specific lines from it, like lines 2, 10, 15, 21, ... and so on. There are around 200 thousand lines to be extracted from the file. How can I do it efficiently in bash?
Maybe you're looking for:
sed -n -e 1p -e 4p afile
Put the linenumbers of the lines you want in a file called "wanted", like this:
2
10
15
21
Then run this script:
#!/bin/bash
while read w
do
sed -n ${w}p yourfile
done < wanted
TOTALLY ALTERNATIVE METHOD
Or you could let awk do it all for you, like this, which is probably miles faster since you won't have to create 200,000 sed processes:
awk 'FNR==NR{a[$1]=1;next}{if(FNR in a){print;}}' wanted yourfile
The FNR==NR portion detects when awk is reading the file called "wanted" and if so, it sets element "$1" of array "a" to "1" so we know that this line number is wanted. The stuff in the second set of curly braces is active when processing your bigger file only and it prints the current line if its linenumber is in the array "a" we created when reading the "wanted" file.
$ gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
The wanted line numbers have to be stored in lines, one per line, and they may safely be in random order. It is almost exactly the same as @Mark Setchell's second method, but uses a slightly clearer way to determine which file is current. ARGIND is a GNU extension though, hence gawk. If you are limited to the original AWK or mawk, you can write it as:
$ awk 'FILENAME==ARGV[1] { L[$0]++ }; FILENAME==ARGV[2] && FNR in L' lines file > file.lines
Efficiency test:
$ awk 'BEGIN { for (i=1; i<=1000000; i++) print i }' > file
$ shuf -i 1-1000000 -n 200000 > lines
$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
real 0m1.734s
user 0m1.460s
sys 0m0.052s
UPD: As @Costi Ciudatu pointed out, there is room for improvement for the case when all the wanted lines are near the head of the file.
#!/usr/bin/gawk -f
ARGIND==1 { L[$0]++ }
ENDFILE { L_COUNT = FNR }
ARGIND==2 && FNR in L { L_PRINTED++; print }
ARGIND==2 && L_PRINTED == L_COUNT { exit 0 }
The script exits as soon as the last wanted line has been printed, so now it takes a few milliseconds to filter out 2000 random lines from the first 1% of a one-million-line file.
$ time ./getlines.awk lines file > file.lines
real 0m0.016s
user 0m0.012s
sys 0m0.000s
Reading the whole file, by contrast, still takes about a second.
$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
real 0m0.780s
user 0m0.756s
sys 0m0.016s
Provided your system supports sed -f - (i.e. for sed to read its script on standard input; it works on Linux, but not on some other platforms) you can turn the file of line numbers into a sed script, naturally using sed:
sed 's/$/p/' lines | sed -n -f - inputfile >output
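For instance, if lines contains the numbers from the question (2, 10, 15, 21), the intermediate script piped into sed -n -f - would be:
$ sed 's/$/p/' lines
2p
10p
15p
21p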
If the lines you're interested in are close to the beginning of the file, you can make use of head and tail to efficiently extract specific lines.
For your example line numbers (assuming the list doesn't go on until close to 200,000), a naive but still efficient approach to read those lines would be the following:
for n in 2 10 15 21; do
head -n $n /your/large/file | tail -1
done
sed Example
sed -n '2p' file
awk Example
awk 'NR==2' file
This will print the 2nd line of the file.
Use the same logic in a loop and try it,
say with a for loop:
for VARIABLE in 2 10 15 21
do
awk "NR==$VARIABLE" file
done
Give your line numbers this way.
