Delete lines from a text file except the first and every nth - bash

I have a long text file composed of numbers, such as:
1
2
9.252
9.252
9.272
1
1
6.11
6.11
6.129
I would like to keep the first line, delete the subsequent three, and then keep the next one, repeating this process for the whole file. Following that logic, considering the input above, I would like to have the following output:
1
9.272
1
6.129

Using GNU sed (needed for the ~ extension):
sed -n '1~5p;5~5p' file
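Run against the sample input, 1~5 selects lines 1, 6, 11, … and 5~5 selects lines 5, 10, 15, …, so this prints:
$ sed -n '1~5p;5~5p' file
1
9.272
1
6.129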

Saving your numbers in "textfile.txt", I can use the following with sed:
sed -n 'p;n;n;n;n;p;' textfile.txt
sed prints the first line, reads the next four lines, and prints the last of them.
Or the following using while read in bash:
while read -r firstline && read -r nextone1 && read -r nextone2 &&
      read -r nextone3 && read -r lastone; do
    printf "%s\n" "$firstline" "$lastone"
done < textfile.txt
This just reads 5 lines at a time and prints only the first and 5th lines.
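Note that if the file's length is not a multiple of five, the && chain fails partway through the last group and its first line is silently dropped. A possible variant (my sketch, not part of the original answer) that still prints the first line of a short trailing group:
while read -r firstline; do
    printf "%s\n" "$firstline"
    read -r skip1 && read -r skip2 && read -r skip3 &&
        read -r lastone && printf "%s\n" "$lastone"
done < textfile.txt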

You can simply say:
awk 'NR%5<2' input.txt
Explanation: Since the entire pattern repeats every five lines, start by applying the modulo operation NR%5 to the line number. The 1st line of each five-line block yields 1 and the 5th line yields 0, while the 2nd through 4th yield 2, 3, and 4. The two lines to keep are therefore exactly those where NR%5 is less than 2.
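You can see the modulo values directly:
$ awk '{print NR, NR%5, $0}' input.txt
1 1 1
2 2 2
3 3 9.252
4 4 9.252
5 0 9.272
6 1 1
7 2 1
8 3 6.11
9 4 6.11
10 0 6.129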

To print the 1st and 5th line of every block of 5 lines (remember that 5%5 = 0):
$ awk '(NR%5) ~ /[10]/' file
1
9.272
1
6.129
If you want to print the 2nd, 3rd, and 4th line of every block of 5 lines instead of the 1st and 5th:
$ awk '(NR%5) ~ /[234]/' file
2
9.252
9.252
1
6.11
6.11
If you wanted to print the 27th and 53rd line of every block of 100:
awk '(NR%100) ~ /^(27|53)$/' file
We can't use a bracket expression there, as we're now beyond single-character numbers.

This might work for you (GNU sed):
sed '2~5,+2d' file
Starting at line 2 and at every fifth line thereafter (2~5), delete that line and the two lines that follow (+2).
An alternative:
sed -n '1p;5~5,+1p' file
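Both commands produce the desired output on the sample input:
$ sed '2~5,+2d' file
1
9.272
1
6.129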

Considering your groups come packed as blocks of 5 lines, you could use awk with a mod 5 operation.
awk '{i=(NR-1)%5;if(i==0||i==4)print $0}' input.txt
With indentation it looks like this:
{
    i = (NR - 1) % 5;
    if (i == 0 || i == 4)
        print $0;
}
i=(NR-1)%5 takes the line number and computes its modulo with 5; but since line numbers start at 1 (instead of 0), you need to subtract 1 from it before computing the modulo.
This leaves you with an integer i that ranges from 0 to 4. You want to print the first line (index 0), skip the next three lines (indexes 1-3) and print the last line (index 4), which is exactly what if (i==0||i==4) print $0 does.
Alternatively, you can do the same thing with a shorter (and probably slightly more optimized) version:
awk '((NR-1)%5==0||(NR-1)%5==4)' input.txt
This tells awk to do something for the 1st and the 5th line of every group of 5. Since the "something" is not defined, it defaults to printing the current line. If it helps, this is strictly equivalent to:
awk '((NR-1)%5==0||(NR-1)%5==4){print $0}' input.txt

Related

Check if the lines before and after a string are empty

I need to delete a certain number of lines before a desired text, but only if the lines before and after the searched string are empty.
E.g. (line number, content):
1
2
3 Hello
4
5 yellow
In this case, since the lines before and after the line containing Hello are empty (lines 2 and 4), I have to delete lines 3 down to 1.
I can delete lines 3 down to 1 using tac and sed, but I'm having difficulty expressing that if condition.
tac file1|sed -e '/Hello/,+3d'|tac
This might work for you (GNU sed):
sed ':a;N;s/\n/&/3;Ta;/\n\n.*Hello.*\n$/s/.*\n//;ta;P;D' file
Gather up 4 lines in the pattern space; if the 2nd and the 4th are empty and the 3rd contains Hello, delete the first three lines and repeat. Otherwise print the first line and repeat.
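For readability, here is the same logic spelled out as a sed script file (run with sed -f script.sed file); the comments are mine, not part of the original answer:
:a
# append the next input line to the pattern space
N
# loop until the pattern space holds four lines (three newlines)
s/\n/&/3
Ta
# if line 2 is empty, line 3 contains Hello and line 4 is empty,
# delete the first three lines of the window
/\n\n.*Hello.*\n$/s/.*\n//
# if that substitution happened, go back and refill the window
ta
# otherwise print the first line, delete it and start over
P
D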
Could you please try the following if you are OK with awk. The file is read twice: the first pass stores every line in the array a, so the second pass can inspect the lines before and after a match (a[FNR-1] and a[FNR+1]) before flagging them for deletion.
awk -v string="Hello" '
FNR==NR{
  a[FNR]=$0
  next
}
($0==string) && a[FNR-1]=="" && a[FNR+1]==""{
  a[FNR-1]=a[FNR]=a[FNR-2]="del_flag"
}
END{
  for(i=1;i<=length(a);i++){
    if(a[i]!="del_flag"){
      print a[i]
    }
  }
}
' Input_file Input_file
With GNU sed option -z you can match
some_line
empty line
line with Hello
empty line
and replace this with an empty line.
sed -rz 's/(^|\n)[^\n]*\n\nHello\n\n/\1\n/g' file1
EDIT: added g for multiple segments.

Converting file with single field into multiple comma separated fields

I have a .dat file in which there is no delimiter between fields.
Eg: 2014HELLO2500
I have to convert the file into a comma separated file with commas at specific positions
i.e 2014,HELLO,2500
I could convert the file using a for loop, but can it be done using a single command?
I tried using the --output-delimiter option of the cut command, but it does not work.
I am using AIX OS.
Thanks
Assuming your field widths are known, you can use gawk like this:
awk -v FIELDWIDTHS="4 5 4 ..." -v OFS=, '{print $1,$2,$3,$4,$5...}' file
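For the sample record the widths would be 4, 5 and 4, so concretely (FIELDWIDTHS is a gawk extension):
$ echo '2014HELLO2500' | gawk -v FIELDWIDTHS='4 5 4' -v OFS=, '{print $1,$2,$3}'
2014,HELLO,2500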
Using awk
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ awk -v s='4 5' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' file
2014,HELLO,2500
As an example of the exact same code but applied with many fields, consider this test file:
$ cat alphabet
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Let's divide it up:
$ awk -v s='1 2 3 2 1 2 3 2 1 2 3 2' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' alphabet
A,BC,DEF,GH,I,JK,LMN,OP,Q,RS,TUV,WX,YZ
How it works:
-v s='1 2 3 2 1 2 3 2 1 2 3 2'
This creates a variable s which defines the lengths of all but the last field. (There is no need to specify the length of the last field.)
BEGIN{n=split(s,a)}
This converts the string variable s to an array with each number as an element of the array.
pos=1
At the beginning of each line, we initialize the position variable, pos, to the value 1.
for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}
For each element in array a, we print the required number of characters starting at position pos followed by a comma. After each print, we increment position pos so that the next print will start with the next character.
print substr($0,pos)
We print the last field on the line using however many characters are left after position pos.
Using sed
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ sed -E 's/(.{4})(.{5})/\1,\2,/' file
2014,HELLO,2500
This approach can be used for up to nine fields at a time, since sed only supports the backreferences \1 through \9. To get 15 fields, two passes would be needed.
Assuming you always want a delimiter between letters and numbers, you can use this:
$ sed -r -e 's/([A-Za-z])([0-9])/\1,\2/g' -e 's/([0-9])([A-Za-z])/\1,\2/g' <<< "2014HELLO2500"
2014,HELLO,2500
When numbers and strings alternate, you can use
echo "2014HELLO2500other_string121312Other_word10" |
sed 's/\([A-Za-z]\)\([0-9]\)/\1,\2/g; s/\([0-9]\)\([A-Za-z]\)/\1,\2/g'
echo TEP_CHECK.20180627023645.txt | cut -d'.' -f2 |
  awk 'BEGIN{OFS="_"} {print substr($1,1,4),substr($1,5,2),substr($1,7,2),substr($1,9,2),substr($1,11,2),substr($1,13,2)}'
Output:
2018_06_27_02_36_45

Getting repeated lines with awk in Bash

I'm trying to find out which lines are repeated X times in a text file. I'm using awk, but I see that my command does not work with lines that begin with the same characters or words; that is, it does not treat each full line individually.
Using this command I try to get the lines that are repeated 3 times:
awk '++A[$1]==3' ./textfile > ./log
This is what you need hopefully:
awk '{a[$0]++}END{for(i in a){if(a[i]==3)print i}}' File
Increment array a with the line ($0) as index for each line. In the END block, for each index i (an original line), check whether the count a[i] equals 3; if so, print the line i. Hope it's clear.
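A quick demonstration of why keying on $1 (as in the question) goes wrong when lines share their first word, while keying on $0 does not:
$ printf 'foo bar\nfoo baz\nfoo bar\n' | awk '++A[$1]==3'
foo bar
$ printf 'foo bar\nfoo baz\nfoo bar\n' | awk '{a[$0]++} END{for(i in a) if(a[i]==3) print i}'
The first command wrongly reports a line as occurring 3 times; the second prints nothing, because no full line occurs three times.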
This returns lines repeated 3 times but adds a space at the beginning of each 3x-repeated line:
sort ./textfile | uniq -c | awk '$1 == 3 {$1 = ""; print}' > ./log
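If the leading space is a problem, one possible fix (my sketch, not part of the original answer) is to strip the count from the whole line instead of blanking $1:
sort ./textfile | uniq -c | awk '$1 == 3 { sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print }' > ./log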

Apply a gawk script to multiple files in a folder

I would like to use the following awk line to remove every even line (and keep the odd lines) in a text file.
awk 'NR%2==1' filename.txt > output
The problem is that I struggle to either loop properly in awk or build a shell script to apply this to all *.txt files in a folder. I tried to use this one-liner
gawk 'FNR==1{if(o)close(o);o=FILENAME;
sub(/\.txt/,"_oddlines.txt",o)}{NR%2==1; print>o}'
but that didn't remove the even lines. And I am even less familiar with shell scripting. I use gawk under Win7 or Cygwin with bash. Many thanks for any kind of idea.
Your existing gawk one-liner is really close. Here it is formatted as a more readable script:
FNR == 1 {
    if (o)
        close(o)
    o = FILENAME
    sub(/\.txt/, "_oddlines.txt", o)
}
{
    NR % 2 == 1
    print > o
}
This should make the error obvious¹. So now we remove that error:
FNR == 1 {
    if (o)
        close(o)
    o = FILENAME
    sub(/\.txt/, "_oddlines.txt", o)
}
NR % 2 == 1 {
    print > o
}
$ awk -f foo.awk *.txt
and it works (and of course you can re-one-line-ize this).
(Normally I would do this with a for like the other answers, but I wanted to show you how close you were!)
¹Per comment, maybe not quite so obvious?
Awk's basic language construct is the "pattern-action" statement. An awk program is just a list of such statements. The "pattern" is so named because originally they were mostly grep-like regular expression patterns:
$ awk '/^be.*st$/' < /usr/share/dict/web2
beanfeast
beast
[snip]
(Except for the slashes, this is basically just running grep, since it uses the default action, print.)
Patterns can actually contain two addresses, but it's more typical to use one, as in these cases. Patterns not enclosed within slashes allow tests like FNR == 1 (File-specific Number of this Record equals 1) or NR % 2 == 1 (Number of this Record—cumulative across all files!—mod 2 equals 1).
Once you hit the open brace, though, you're into the "action" part. Now NR % 2 == 1 simply calculates the result (true or false) and then throws it away. If you leave out the "pattern" part entirely, the "action" part is run on every input line. So this prints every line.
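A minimal illustration of the difference:
$ printf '1\n2\n3\n4\n' | awk 'NR % 2 == 1'
1
3
$ printf '1\n2\n3\n4\n' | awk '{ NR % 2 == 1; print }'
1
2
3
4
The first uses the test as a pattern; in the second, the test is just an expression whose result is discarded, so every line prints.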
Note that the test NR % 2 == 1 is testing the cumulative record-number. So if some file has an odd number of lines ("records"), the next file will print out every even-numbered line (and this will persist until you hit another file with an odd number of lines).
For instance, suppose the two input files are A.txt and B.txt. Awk starts reading A.txt and has both FNR and NR set to 1 for the first line, which might be, e.g., file A, line 1. Since FNR == 1 the first "action" is done, setting o. Then awk tests the second pattern. NR is 1, so NR % 2 is 1, so the second "action" is done, printing that line to A_oddlines.txt.
Now suppose file A.txt contains only that one line. Awk now goes on to file B.txt, resetting FNR but leaving NR cumulative. The first line of B might be file B, line 1. Awk tries the first "pattern", and indeed, FNR == 1 so this closes the old o and sets up the new one.
But NR is 2, because NR is cumulative across all input files. So the second pattern (NR % 2 == 1) computes 2 % 2 (which is 0) and compares == 1 which is false, and thus awk skips the second "action" for line 1 of file B.txt. Line 2, if it exists, will have FNR == 2 and NR == 3, so that line will be copied out.
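You can watch both counters directly:
$ printf 'a\n' > A.txt; printf 'b\nc\n' > B.txt
$ awk '{ print FILENAME, FNR, NR }' A.txt B.txt
A.txt 1 1
B.txt 1 2
B.txt 2 3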
(I originally assumed, since your script was close to working, that you intended this and were just stuck a bit on syntax.)
With GNU awk you could just do:
$ awk 'FNR%2{print > (FILENAME".odd")}' *.txt
This will create a .odd file for every .txt file in the current directory containing only the odd lines.
However sed has the upper hand on conciseness here. The following GNU sed command will remove all even lines and store the old file with the extension .bck for all .txt files in the current directory:
$ sed -ni.bck '1~2p' *txt
Demo:
$ ls
f1.txt f2.txt
$ cat f1.txt
1
2
3
4
5
$ cat f2.txt
6
7
8
9
10
$ sed -ni.bck '1~2p' *txt
$ ls
f1.txt f1.txt.bck f2.txt f2.txt.bck
$ cat f1.txt
1
3
5
$ cat f1.txt.bck
1
2
3
4
5
$ cat f2.txt
6
8
10
$ cat f2.txt.bck
6
7
8
9
10
If you don't want the backup files, then simply:
$ sed -ni '1~2p' *txt
Personally, I'd use
for filename in *.txt; do
awk 'NR%2==1' "$filename" > "oddlines-$filename"
done
EDIT: quote filenames
You can try a for loop:
#!/bin/bash
for file in dir/*.txt
do
oddfile=$(echo "$file" | sed -e 's|\.txt|_odd\.txt|g') #This will create file_odd.txt
awk 'NR%2==1' "$file" > "$oddfile" # This will output it in the same dir.
done
Your problem is that NR%2==1 is inside the {NR%2==1; print>o} 'action block' and is not kicking in as a 'condition'. Use this instead:
gawk 'FNR==1{if(o)close(o);o=FILENAME;sub(/\.txt/,"_oddlines.txt",o)};
FNR%2==1{print > o}' *.txt

Delete line and update order

I have a file that contains lines starting with a number, for example
1 This is the first line
2 this is the second line
3 this is the third line
4 This is the fourth line
What I want to do is delete a line, for example line 2, and update the numbering so the file would look like the following. I want to do this in a bash script.
1 This is the first line
2 this is the third line
3 This is the fourth line
Thanks
IMO it might be a little easier with awk:
awk '!/regex/ {$1=++x; print}' inputFile
Between the slashes (/.../), put a regex that matches the line that needs to be deleted.
Test:
$ cat inputFile
1 This is the first line
2 this is the second line
3 this is the third line
4 This is the fourth line
$ awk '!/second/ {$1=++x; print}' inputFile
1 This is the first line
2 this is the third line
3 This is the fourth line
$ awk '!/third/ {$1=++x; print}' inputFile
1 This is the first line
2 this is the second line
3 This is the fourth line
$ awk '!/first/ {$1=++x; print}' inputFile
1 this is the second line
2 this is the third line
3 This is the fourth line
Note: since assigning to $1 makes awk rebuild the record, any whitespace sequences on the line will be collapsed to single spaces.
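If the original spacing matters, a possible variant (my sketch, not part of the answer) rewrites only the leading number in place:
awk '!/second/ { sub(/^[0-9]+/, ++x); print }' inputFile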
You can use this set of commands:
grep -v '^2 ' file | cut -d' ' -f2- | nl -w1 -s' '
Using grep with the -v option removes line #2.
cut removes the first column, which holds the old line number.
Finally, nl renumbers the remaining lines (-w1 gives single-width numbers, -s' ' uses a space as the separator).
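On the sample file this gives:
$ grep -v '^2 ' file | cut -d' ' -f2- | nl -w1 -s' '
1 This is the first line
2 this is the third line
3 This is the fourth line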
