Bash sed deleting lines with words existing in another pattern - bash

I've got console output, sth like:
SECTION/foo
SECTION/fo1
SECTION/fo3
Foo = N
Fo1 = N
Fo2 = N
Fo3 = N
Bar = Y
as an output, I want to have:
Foo = N
Fo1 = N
Fo3 = N
Any (simple) solution?
Thanks in advance!

Using awk you can do:
awk -F' *[/=] *' '$1 == "SECTION" {a[tolower($2)]} tolower($1) in a' file
Foo = N
Fo1 = N
Fo3 = N
Description:
We split each line using custom field separator as ' *[/=] *' which means / or = surrounded with 0 or more spaces on each side.
When first field is SECTION then we store each lowercase column 2 into an array a
Later when lowercase first column is found in array a then we print each line (default action).

Perl to the rescue!
perl -ne ' $h{ ucfirst $1 } = 1 if m(SECTION/(.*));
print if /(.*) = / && $h{$1};
' < input
A hash table is created from lines containing SECTION/. If the line contains = and its left hand side is stored in the hash, it gets printed.

This might work for you (GNU sed):
sed -nr '/SECTION/H;s/.*/&\n&/;G;s/\n.*/\L&/;/\n(.*) .*\n.*\/\1/P' file
Collect all SECTION lines in the hold space (HS). Double the line and delimit by a newline. Append the collected lines from the HS and convert everything from the first newline to the end to lowercase. Using a backreference match the variable to the section suffix and if so print only the first line i.e. the original line unadulterated.
N.B. the -n invokes the grep-like nature of sed and the -r reduces the number of backslashes needed to write a regexp.

awk '$1 ~ /Foo|Fo1|Fo3/' file
Foo = N
Fo1 = N
Fo3 = N

Related

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y YYY. Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In that example string it starts at character 29 and ends at character 39 within the string. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for first field, next 10 characters for second field and the rest for third field. OFS is set to empty, otherwise you'll get space added between the fields.
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK following way, for simplicity sake say we have file.txt content
S o m e s t r i n g
and want to change spaces from 5 (inclusive) to 10 (inclusive) position then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set field pattern (FPAT) to any single character and output field seperator (OFS) to empty string, thus every field is populated by single characters and I do not get superfluous space when print-ing. I use for loop to access desired fields and for every one I check if it is space, if it is I assign B here otherwise I assign original value, finally I print whole changed line.
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt)); # Extract the 29th to the 39th characters of the line and read into variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end)) # Rebuild the line incorporating raw and printing the result
}'file
This is certainly a suitable task for perl, and saddens me that my perl has become so rusty that this is the best I can come up with at the moment:
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # couldve taken this route, too
# $i="B"
}1' file # implicit output
With sed :
sed '
H
s/\(.\{236\}\)\(.\{11\}\).*/\2/
s/ /B/g
H
g
s/\n//g
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236/;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in seds pattern space.

Display column from empty column (fixed width and space delimited) in bash

I have log file (in txt) with the following text
UNIT PHYS STATE LOCATION INFO
TCSM-1098 SE-NH -
ETPE-5-0 1403 SE-OU BCSU-1 ACTV FLTY
ETIP-6 1402 SE-NH -
They r delimited by space...
How am I acquired the output like below?
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
Thank in advance
This is what I've tried so far
cat file.txt | awk 'BEGIN { FS = "[[:space:]][[:space:]]+" } {print $1,$2,$3,$4}' | sed 's/ /|/g'
It produces output like this
|UNIT|PHYS|STATE|LOCATION|INFO|
|TCSM-1098|SE-NH|-|
|ETPE-5-0|1403|SE-OU|BCSU-1|ACTV|FLTY
|ETIP-6|1402|SE-NH|-|
The column isn't excatly like what I hope for
It seems it's not delimited but fixed-width format.
$ perl -ple '
$_ = join "|",
map {s/^\s+|\s+$//g;$_}
unpack ("a11 a5 a6 a22 a30",$_);
' <file.txt
how it works
-p switch : loop over input lines (default var: $_) and print it
-l switch : chomp line ending (\n) and add it to output
-e : inline command
unpack function : takes defined format and input line and returns an array
map function : apply block to each element of array: regex to remove heading trailing spaces
join function : takes delimiter and array and gives string
$_ = : affects the string to default var for output
Perl to the rescue!
perl -wE 'my #lengths;
$_ = <>;
push #lengths, length $1 while /(\S+\s*)/g;
$lengths[-1] = "*";
my $f;
say join "|",
map s/^\s+|\s+$//gr,
unpack "A" . join("A", #lengths), $_
while (!$f++ or $_ = <>);' -- infile
The format is not whitespace separated, it's a fixed-width.
The #lengths array will be populated by the widths of the columns taken from the first line of the input. The last column width is replaced with *, as its width can't be deduced from the header.
Then, an unpack template is created from the lengths that's used to parse the file.
$f is just a flag that makes it possible to apply the template to the header line itself.
With GNU awk for FIELDWITDHS to handle fixed-width fields:
awk -v FIELDWIDTHS='11 5 6 22 99' -v OFS='|' '{$1=$1; gsub(/ *\| */,"|"); sub(/ +$/,"")}1' file
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
I think it's pretty clear and self-explanatory but let me know if you have any questions.
Manually, in awk:
$ awk 'BEGIN{split("11 5 6 23 99", cols); }
{s=0;
for (i in cols) {
field = substr($0, s, cols[i]);
s += cols[i];
sub(/^ */, "", field);
sub(/ *$/, "", field);
printf "%s|", field;
};
printf "\n" } ' file
UNIT|PHYS|STATE|LOCATION|INFO|
TCSM-1098||SE-NH||-|
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY|
ETIP-6|1402|SE-NH||-|
The widths of the columns are set in the BEGIN block, then for each line we take substrings of the line of the required length. s counts the starting position of the current column, the sub() calls remove leading and trailing spaces. The code as such prints a trailing | on each line, but that can be worked around by making the first or last column a special case.
Note that the last field is not like in your output, it's hard to tell where the split between ACTV and FLTY should be. Is that fixed width too, or is the space a separator there?

how to sort the line of a file by to the second word from the end

I want to sort the line accorfing to the last number just before the space. And this is a simplified example:
c3_abl_eerf_14 sasw
a.bla_haha_2 dnkww
s.hey_3 ddd
And this is the results I want:
a.bla_haha_2 dnkww
s.hey_3 ddd
c3_abl_eerf_14 sasw
I don't know how to do this, maybe by the command sort? And, sometimes I used the sort command, it may wrongly treat the 14 less than 2, I don't want this to happen.
this command chain works for your example:
sed -r 's/.*_([0-9]+) .*/\1 &/' file|sort -n|sed 's/[^ ]* //'
The idea is
extract the number first, add to the beginning of the line
sort all lines by this number
remove the number
update
sort by last number in the line, no matter where the number is:
awk -F'[^0-9]+' '{$0=(length($NF)?$NF:$(NF-1)) OFS $0}7' file|sort -n|sed 's/[^ ]* //'
If you want to do it with GNU awk, try this:
BEGIN { FS = "[ _]+" }
{ data[$(NF-1)] = data[$(NF-1)] "\n" $0}
END {
n = asorti(data, sorted, "#val_num_asc");
for (i = 1; i <= n; i++) {
print substr(data[sorted[i]], 2);
}
}
This works as follows: The BEGIN rule sets the field separator (you could also do this on the command line). The second rule applies to all lines of the input and puts them into an associative array indexed by the number in the second but last field. The END rule sorts the indices of this array into a second array, and the following loops prints the values, now sorted.

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
if myfile(myline,2) ~= myfile(myline+1,2)
mynewfile(mynewline,:) = myfile(myline,:);
mynewline = mynewline+1;
myline = myline+1;
else
myline = myline+1;
end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for each line. BEGIN condition is executed before the first line. Each line is split to fields, which are accessible with $_number_. The full line is in $0.
Here I compare the second field to the previous value, if it does not match I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, careful with the float comparisons - something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure if awk converts to number for equality testing, if not you're safe with the string comparisons.
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves previous line and second field, comparing in next loop with current line values. The && field check is useful to avoid a blank line at the beginning of file, when $2 != field would match because variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

Deleting characters from a column if they appear fewer than 20 times

I have a CSV file with two columns:
cat # c a t
dog # d o g
bat # b a t
To simplify communication, I've used English letters for this example, but I'm dealing with CJK in UTF-8.
I would like to delete any character appearing in the second column, which appears on fewer than 20 lines within the first column (characters could be anything from numbers, letters, to Chinese characters, and punctuation, but not spaces).
For e.g., if "o" appears on 15 lines in the first column, all appearances of "o" are deleted from the second column. If "a" appears on 35 lines in the first column, no change is made.
The first column must not be changed.
I don't need to count multiple appearances of a letter on a single line. For e.g. "robot" has 2 o's, but this detail is not important, only that "robot" has an "o", so that is counted as one line.
How can I delete the characters that appear less than 20 times?
Here is a script using awk. Change the var num to be your frequency cutoff point. I've set it to 1 to show how it works against a small sample file. Note how f is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
awk -v num=1 '
BEGIN { OFS=FS="#" }
FNR==NR{
split($1,a,"")
for (x in a)
if(a[x] != " " && !c[a[x]]++)
l[a[x]]++
delete c
next
}
!flag++{
for (x in l)
if (l[x] <= num)
cclass = cclass x
}
{
gsub("["cclass"]", " " , $2)
}1' ./infile.csv ./infile.csv
Sample Input
$ cat ./infile
fff # f f f
cat # c a t
dog # d o g
bat # b a t
Output
$ ./delchar.sh
fff #
cat # a t
dog #
bat # a t
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<:utf8', $ARGV[0] or die $!;
my %chars;
while (<$IN>) {
chomp;
my #cols = split /#/;
my %linechars;
undef #linechars{ split //, $cols[0] };
$chars{$_}++ for keys %linechars;
}
seek $IN, 0, 0;
my #remove = grep $chars{$_} < 20, keys %chars;
my $remove_reg = '[' . join(q{}, #remove) . ']';
warn $remove_reg;
while (<$IN>) {
my #cols = split /#/;
$cols[1] =~ s/$remove_reg//g;
print join '#', #cols;
}
I am not sure how whitespace should be handled, so you might need to adjust the script.
the answer is:
cut -d " " -f #column $file | sed -e 's/\.//g' -e 's/\,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
where $file is your text file and $column is the column you need to look for its frequency. It gives you out the list of their frequency
then you can go on looping on those results which have the first digit greater than your treshold and grepping on the whole lines.

Resources