grep: remove lines with the same number twice - bash

I have a .txt file and on each line is some amount of numbers. What I need is to filtrate these which does not contain the same number. So I want the output to be only the lines which have all the numbers different. I have to use command grep!
Example:
File_input:
1 1 2 3 4 5
1 2 3 4 5 6
6 6 6 6 6 6
What I want
File_output:
1 2 3 4 5 6
First and third lines contains same numbers so these has to be filtrated out.

This should work for your example:
grep -v "\([0-9]\).*\1" myfile
Idea is to catch any single digit [0-9] and store it \(\) and search for the existing same pattern \1 on the same line. You can easily extend to any word made of digits.

With the given input you can use
sed -r '/([0-9]+).+\1/d' File_input
You will have problems with suubstrings: 1 matches 12 and 12 matches 1.
ou can add word boundaries \b with
sed -r '/\b([0-9]+)\b.*\b\1\b/d' File_input

Related

Replace the nth field of every mth line using awk or bash

For a file that contains entries similar to as follows:
foo 1 6 0
fam 5 11 3
wam 7 23 8
woo 2 8 4
kaz 6 4 9
faz 5 8 8
How would you replace the nth field of every mth line with the same element using bash or awk?
For example, if n = 1 and m = 3 and the element = wot, the output would be:
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
I understand you can call / print every mth line using e.g.
awk 'NR%7==0' file
So far I have tried to keep this in memory but to no avail... I need to keep the rest of the file as well.
I would prefer answers using bash or awk, but sed solutions would also be helpful. I'm a beginner in all three. Please explain your solution.
awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1' file
Note, however, that the inter-field whitespace is not guaranteed to be preserved as-is, because awk splits a line into fields by any run of whitespace; as written, the output fields of modified lines will be separated by a single space.
If your input fields are consistently separated by 2 spaces, however, you can effectively preserve the input whitespace by adding -F' ' -v OFS=' ' to the awk invocation.
-v m=3 -v n=1 -v el='wot' defines Awk variables m, n, and el
NR % m == 0 is a pattern (condition) that evaluates to true for every m-th line.
{ $n = el } is the associated action that replaces the nth field of the input line with variable el, causing the line to be rebuilt, implicitly using OFS, the output-field separator, which defaults to a space.
1 is a common Awk shorthand for printing the (possibly modified) input line at hand.
Great little exercise. While I would probably lean toward an awk solution, in bash you can also rely on parameter expansion with substring replacement to replace the nth field of every mth line. Essentially, you can read every line, preserving whitespace, then check your line count, e.g. if c is your line counter and m your variable for mth line, you could use:
if (( $((c % m )) == 0)) ## test for mth line
If the line is a replacement line, you can read each word into an array after restoring default word-splitting and then use your array element index n-1 to provide the replacement (e.g. ${line/find/replace} with ${line/"${array[$((n-1))]}"/replace}).
If it isn't a replacement line, simply output the line unchanged. A short example could be similar to the following (to which you can add additional validations as required)
#!/bin/bash
[ -n "$1" -a -r "$1" ] || { ## filename given an readable
printf "error: insufficient or unreadable input.\n"
exit 1
}
n=${2:-1} ## variables with default n=1, m=3, e=wot
m=${3:-3}
e=${4:-wot}
c=1 ## line count
while IFS= read -r line; do
if (( $((c % m )) == 0)) ## test for mth line
then
IFS=$' \t\n'
a=( $line ) ## split into array
IFS=
echo "${line/"${a[$((n-1))]}"/$e}" ## nth replaced with e
else
echo "$line" ## otherwise just output line
fi
((c++)) ## advance counter
done <"$1"
Example Use/Output
n=1, m=3, e=wot
$ bash replmn.sh dat/repl.txt
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
n=1, m=2, e=baz
$ bash replmn.sh dat/repl.txt 1 2 baz
foo 1 6 0
baz 5 11 3
wam 7 23 8
baz 2 8 4
kaz 6 4 9
baz 5 8 8
n=3, m=2, e=99
$ bash replmn.sh dat/repl.txt 3 2 99
foo 1 6 0
fam 5 99 3
wam 7 23 8
woo 2 99 4
kaz 6 4 9
faz 5 99 8
An awk solution is shorter (and avoids problems with duplicate occurrences of the replacement string in $line), but both would need similar validation of field existence, etc.. Learn from both and let me know if you have any questions.

How does negative matching work in extglob in parameter expansion

Problem
The behaviour of
!(pattern-list)
does not work the way I would expect when used in parameter expansion, specifically
${parameter/pattern/string}
Input
a="1 2 3 4 5 6 7 8 9 10"
Test cases
$ printf "%s\n" "${a/!([0-9])/}"
[blank]
#expected 12 3 4 5 6 7 8 9 10
$ printf "%s\n" "${a/!(2)/}"
[blank]
#expected 2 3 4 5 6 7 8 9 10
$ printf "%s\n" "${a/!(*2*)/}"
2 3 4 5 6 7 8 9 10
#Produces the behaviour expected in previous one, not sure why though
$ printf "%s\n" "${a/!(*2*)/,}"
,2 3 4 5 6 7 8 9 10
#Expected after previous worked
$ printf "%s\n" "${a//!(*2*)/}"
2
#Expected again previous worked
$ printf "%s\n" "${a//!(*2*)/,}"
,,2,
#Why are there 3 commas???
Specs
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)
Notes
These are very basic examples, so if it is possible to include more complex examples with explanations in the answer then please do.
Any more info or examples needed let me know in the comments.
Have already looked at How does extglob work with shell parameter expansion?, and have even commented on what the problem is with that particular problem, so please don't mark as a dupe.
Parameter expansion of the form ${parameter/pattern/string} (where pattern doesn't start with a /) works by finding the leftmost longest substring in the value of the variable parameter that matches the pattern pattern and replacing it with string. In other words, $parameter is decomposed into three parts prefix,match, and suffix such that
$parameter == "${prefix}${match}${suffix}"
$prefix is the shortest possible string enabling the other requirements to be fulfilled (i.e. the match, if at all possible, occurs in the leftmost position)
$match matches pattern and is as long as possible
any of $prefix, $match and/or $suffix can be empty
and the result of ${parameter/pattern/string} is "${prefix}string${suffix}".
For the global replacement form (${parameter//pattern/string}) of this type of parameter expansion, the same process is recursively performed for the suffix part, however a zero-length match is handled as a special case (in order to prevent infinite recursion):
if "${prefix}${match}" != ""
"${parameter//pattern/string}" = "${prefix}string${suffix//pattern/string}"
else suffix=${parameter:1} and
"${parameter//pattern/string}" = "string${parameter:0:1}${suffix}//pattern/string}"
Now let's analyze the cases individually:
"${a/!([0-9])/}" --> prefix='' match='1 2 3 4 5 6 7 8 9 10' suffix=''. Indeed, '1 2 3 4 5 6 7 8 9 10' is not a string consisting of a single digit, and therefore it matches the pattern !([0-9]). Hence the empty result of expansion.
"${a/!(2)/}" --> prefix='' match='1 2 3 4 5 6 7 8 9 10' suffix=''. Similar to the above, '1 2 3 4 5 6 7 8 9 10' is not a string consisting of the single character '2', and therefore it matches the pattern !(2). Hence the empty result of expansion.
"${a/!(*2*)/}" --> prefix='' match='1 ' suffix='2 3 4 5 6 7 8 9 10'. The substring '1 ' doesn't match the pattern *2*, and therefore it matches the pattern !(*2*).
"${a/!(*2*)/,}". There were no surprises here, so no need to elaborate.
"${a//!(*2*)/}". There were no surprises here, so no need to elaborate.
"${a//!(*2*)/,}" --> prefix='' match='1 ' suffix='2 3 4 5 6 7 8 9 10'. Then ${suffix//!(*2*)/,} expands to ",2," as follows. The empty string in the beginning of suffix matches the pattern !(*2*), producing an extra comma in the result. Since the zero-length match special case (described above) was triggered, the first character of suffix is forcibly consumed, leaving us with ' 3 4 5 6 7 8 9 10', which matches the !(*2*) pattern in its entirety and is replaced with the last comma that we see in the final result of the expansion.

Process a line based on lines before and after in bash

I am trying to figure out how to write a bash script which uses the lines immediately before and after a line as a condition. I will give an example in a python-like pseudocode which makes sense to me.
Basically:
for line in FILE:
if line_minus_1 == line_plus_one:
line = line_minus_1
What would be the best way to do this?
So if I have an input file that reads:
3
1
1
1
2
2
1
2
1
1
1
2
2
1
2
my output would be:
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
Notice that it starts from the first line until the last line and respects changes made in earlier lines so if I have:
2
1
2
1
2
2
I would get:
2
2
2
2
2
2
and not:
2
1
1
1
2
2
$ awk 'minus2==$0{minus1=$0} NR>1{print minus1} {minus2=minus1; minus1=$0} END{print minus1}' file
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
How it works
minus2==$0{minus1=$0}
If the line from 2 lines ago is the same as the current line, then set the line from 1 line ago equal to the current line.
NR>1{print minus1}
If we are past the first line, then print the line from 1 line ago.
minus2=minus1; minus1=$0
Update the variables.
END{print minus1}
After we have finished reading the file, print the last line.
Multiple line version
For those who like their code spread over multiple lines:
awk '
minus2==$0{
minus1=$0
}
NR>1{
print minus1
}
{
minus2=minus1
minus1=$0
}
END{
print minus1
}
' file
Here is a (GNU) sed solution:
$ sed -r '1N;N;/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/;P;D' infile
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
This works with a moving three line window. A bit more readable:
sed -r ' # -r for extended regular expressions: () instead of \(\)
1N # On first line, append second line to pattern space
N # On all lines, append third line to pattern space
/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/ # See below
P # Print first line of pattern space
D # Delete first line of pattern space
' infile
N;P;D is the idiomatic way to get a moving two line window: append a line, print first line, delete first line of pattern space. To get a moving three line window, we read an additional line, but only once, namely when processing the first line (1N).
The complicated bit is checking if the first and third line of the pattern space are identical, and if they are, replacing the second line with the first line. To check if we have to make the substitution, we use the address
/^(.*)\n.*\n\1$/
The anchors ^ and $ are not really required as we'll always have exactly to newlines in the pattern space, but it makes it more clear that we want to match the complete pattern space. We put the first line into a capture group and see if it is repeated on the third line by using a backreference.
Then, if this is the case, we perform the substitution
s/^(.*\n).*\n/\1\1/
This captures the first line including the newline, matches the second line including the newline, and substitutes with twice the first line. P and D then print and remove the first line.
When reaching the end, the whole pattern space is printed so we're not swallowing any lines.
This also works with the second input example:
$ sed -r '1N;N;/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/;P;D' infile2
2
2
2
2
2
2
To use with BSD sed (as found in OS X), you'd either have to use the -E instead of the -r option, or use no option, i.e., basic regular expressions and escape all parentheses (\(\)) in the capture groups. The newline matching should work, but I didn't test it. If in doubt, check this great answer lining out all the differences.

How to sequence lines in files if some lines are strings

I encountered a problem with bash, I started using it recently.
I realize that lot of magic stuff can be done with just one line, as my previous question was solved by it.
This time question is simple:
I have a file which has this format
2 2 10
custom
8 10
3 5 18
custom
1 5
some of the lines equal to string custom (it can be any line!) and other lines have 2 or 3 numbers in it.
I want a file which will sequence the line with numbers but keep the lines with custom (order also must be the same), so desired output is
2 4 6 8 10
custom
8 9 10
3 8 13 18
custom
1 2 3 4 5
I also wish to overwrite input file with this one.
I know that with seq I can do the sequencing, but I wish elegant way to do it on file.
You can use awk like this:
awk '/^([[:blank:]]*[[:digit:]]+){2,3}[[:blank:]]*$/ {
j = (NF==3) ? $2 : 1
s=""
for(i=$1; i<=$NF; i+=j)
s = sprintf("%s%s%s", s, (i==$1)?"":OFS, i)
$0=s
} 1' file
2 4 6 8 10
custom
8 9 10
3 8 13 18
custom
1 2 3 4 5
Explanation:
/^([[:blank:]]*[[:digit:]]+){2,3}[[:blank:]]*$/ - match only lines with 2 or 3 numbers.
j = (NF==3) ? $2 : 1 - set variable j to $2 if there are 3 columns otherwise set j to 1
for(i=$1; i<=$NF; i+=j) run a loop from 1st col to last col, increment by j
sprintf is used for formatting the generated sequence
1 is default awk action to print each line
This might work for you (GNU sed, seq and paste):
sed '/^[0-9]/s/.*/seq & | paste -sd\\ /e' file
If a line begins with a digit use the lines values as parameters for the seq command which is then piped to paste command. The RHS of the substitute command is evaluated using the e flag (GNU sed specific).

How to delete leading newline in a string in bash?

I'm having the following issue. I have an array of numbers:
text="\n1\t2\t3\t4\t5\n6\t7\t8\t9\t0"
And I'd like to delete the leading newline.
I've tried
sed 's/.//' <<< "$text"
cut -c 1- <<< "$text"
and some iterations. But the issue is that both of those delete the first character AFTER EVERY newline. Resulting in this:
text="\n\t2\t3\t4\t5\n\t7\t8\t9\t0"
This is not what I want and there doesn't seem to be an answer to this case.
Is there a way to tell either of those commands to treat newlines like characters and the entire string as one entity?
awk to the rescue!
awk 'NR>1'
of course you can do the same with tail -n +2 or sed 1d as well.
You can probably use the substitution modifier (see parameter expansion and ANSI C quoting in the Bash manual):
$ text=$'\n1\t2\t3\t4\t5\n6\t7\t8\t9\t0'
$ echo "$text"
1 2 3 4 5
6 7 8 9 0
$ echo "${text/$'\n'/}"
1 2 3 4 5
6 7 8 9 0
$
It replaces the first newline with nothing, as requested. However, note that it is not anchored to the first character:
$ alt="${text/$'\n'/}"
$ echo "${alt/$'\n'/}"
1 2 3 4 56 7 8 9 0
$
Using a caret ^ before the newline doesn't help — it just means there's no match.
As pointed out by rici in the comments, if you read the manual page I referenced, you can find how to anchor the pattern at the start with a # prefix:
$ echo "${text/#$'\n'/}"
1 2 3 4 5
6 7 8 9 0
$ echo "${alt/#$'\n'/}"
1 2 3 4 5
6 7 8 9 0
$
The notation bears no obvious resemblance to other regex systems; you just have to know it.

Resources