How to replace the last n characters in the kth occurrence of a line containing a certain substring using sed or awk? - bash

Suppose I have a file that resembles the following format:
\\ Random other lines \\
...
27861NA+ NA+89122 13.480 11.554 10.082
27862NA+ NA+89123 2.166 5.896 10.108
27863NA+ NA+89124 8.289 6.843 3.090
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
I am trying to find a way of replacing the last 24 characters of a line with a string that I have prepared, for the first 3 lines in the file that contain the string "NA+".
For example, my output would ideally look like:
\\ Random other lines \\
...
27861NA+ NA+89122 my first string hello
27862NA+ NA+89123 my second string foo
27863NA+ NA+89124 my final string bar $$
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
So far, I have found a sed command that will remove the last 24 characters from every line in the file:
sed 's/.\{24\}$//' myfile.txt
And also an awk command that will return the kth line that contains the desired substring:
awk '/NA+/{i++}i==1' myfile.txt
Does anyone have an idea about how I could replace the last 24 characters in the 1st, 2nd, and 3rd lines of my file that each contain a certain substring?

With a single awk:
awk -v str="my string" '!f && /NA\+/{ f=1; n=NR+3 }n && n>NR{ $4=$5=""; $3=str }1' myfile.txt
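When the first NA+ line is seen, f is set and n marks three lines ahead; for that line and the next two, fields 4 and 5 are emptied, field 3 is replaced with the string, and the trailing 1 prints every line. Note that this rewrites the three consecutive lines starting at the first match, which is fine here because the matching lines are adjacent.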

string="my first string hello"
awk -v string="$string" '{ if ($0 ~ "NA") cnt++; if ($0 ~ "NA" && cnt < 4) { print substr($0, 1, length($0)-23) string } else { print } }' myfile.txt
Using awk, set a string and pass it to awk with -v. For lines containing NA, increment the variable cnt. While cnt is less than 4, print everything but the last 23 characters of the matching line, appending the passed string to the end. Otherwise print the line unchanged.

This might work for you (GNU sed):
sed '/NA+/{x;s/\n/&/3;x;ta;H;s/.\{24\}$/some string/;b;:a;n;ba}' file
This uses the hold space (HS) to keep a count of the lines seen that contain the required string (NA+). Once it has seen n such lines (here n=3), it just prints the remainder of the file.
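For readability, here is (what I believe is) the same script spread over multiple lines with comments; GNU sed accepts # comments on their own lines inside a script:
sed '/NA+/{
# swap in the hold space, which stores one newline per replaced line so far
x
# if it already holds 3 newlines, three replacements are done: branch to :a
s/\n/&/3
x
ta
# otherwise count this line in the hold space and replace its last 24 characters
H
s/.\{24\}$/some string/
b
# after the third replacement, print the rest of the file unchanged
:a
n
ba
}' file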

Related

Prepending letter to field value

I have a file 0.txt containing the following comma-separated values in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brandc,),
I've looked on SO and elsewhere for how to prepend the letter x to each value between the commas, so that my output file becomes (using bash on unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrandc,),
The only thing I've really tried (though it isn't enough) is:
awk '{gsub(/,/,",x");print}' 0.txt
Note that the prefix should not be applied after the trailing commas at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiter
BEGIN{
FS=OFS=","
}
# The first field is special, the x has to be inserted
# after the opening (
$1="(x"substr($1,2)
# Prepend 'x' from field 2 until the previous to last field
for(i=2;i<=NF-2;i++){
$i="x"$i
}
# 1 is always true. awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma, while excluding the "special" characters in the syntax (the parentheses).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y YYY. Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In the example string above, that chunk starts at character 29 and ends at character 39. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for first field, next 10 characters for second field and the rest for third field. OFS is set to empty, otherwise you'll get space added between the fields.
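Note that the * sentinel (take the rest of the record) in FIELDWIDTHS needs a reasonably recent gawk; if I recall correctly, it was added in gawk 4.2.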
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK in the following way. For simplicity's sake, say we have file.txt with the content
S o m e s t r i n g
and we want to change spaces from position 5 (inclusive) to position 10 (inclusive); then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set the field pattern (FPAT) to any single character and the output field separator (OFS) to the empty string, so every field holds a single character and no superfluous space is added when print-ing. I use a for loop to access the desired fields; for each one I check if it is a space and assign B if so, otherwise I keep the original value. Finally I print the whole changed line.
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt)); # Extract (end-strt) characters starting at position strt into variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end)) # Rebuild the line incorporating ram and print the result
}' file
This is certainly a suitable task for perl, and it saddens me that my perl has become so rusty that this is the best I can come up with at the moment:
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
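Setting $/ to a reference to an integer ($/ = \1 here) makes perl read fixed-length records of one character, so $. counts characters instead of lines. One caveat: $. keeps counting across the whole input rather than resetting per line, so this lines up with the 236-246 window only on the first line of the file.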
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # could've taken this route, too
# $i="B"
}1' file # implicit output
With sed:
sed '
# append the current line to the hold space
H
# keep only characters 237 through 247 of the line
s/\(.\{236\}\)\(.\{11\}\).*/\2/
# replace the spaces in that chunk with B
s/ /B/g
# append the fixed chunk to the hold space as well
H
# copy the hold space (line + fixed chunk) back and join it
g
s/\n//g
# splice the fixed chunk (now the last 11 characters) in place of the original ones
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
# swap the result into the hold space, clear the leftovers,
# then swap back so the fixed line is printed and the hold space is empty
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236/;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in sed's pattern space.

How can I retrieve matching records from the mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to find a matching record. For example, match a number (7789) on a line starting with XYZ; once matched, look for a matching number (7345) in the lines below starting with 1, until a line starting with 9 is reached, then retrieve the entire line record. How can I accomplish this using shell script, awk, sed or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disables automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^9$/ { } # up to a line that is just 9
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which comes
# after the previous match
range { stuff } will execute stuff while inside the range; in this case the range starts at /^XYZ.*7789/ and ends with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
An extendable awk script can be:
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
Set flag s when the line starts with XYZ and contains 7789; reset it when the line is just 9; print when the flag is set and the line contains the pattern 7345.
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

Remove special characters in a csv (unix) and fix the newlines

Below is my sample data in the csv:
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the above, one field contains good data along with junk data, and the line is split onto a new line.
I want to remove the special characters (because of the special characters and spaces, part of the line was moved to the next line) and merge the split line back into a single line.
Currently I am using something like the following, which is taking a lot of time:
tr -cd '\11\12\15\40-\176' < MY_FILE.csv | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' > MY_FILE.csv.tmp
I've attached a screenshot of the original data in the file.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (though it would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
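For example, awk sees four fields in this hypothetical line because the comma inside the quotes also counts as a separator:
$ echo '1,"hello, world",3' | awk -F, '{ print NF }'
4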
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break falls between the last comma and the content of the last field, because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, ' # Set comma as the field separator...
NF<10 { # For any lines that have fewer than 10 fields...
$0=last $0 # Insert the last "saved" line here,
last=$0 # and save the newly joined line for the next round.
}
NF<10 { # If we still have fewer than 10 fields,
next # repeat.
}
{
gsub(/[^[:print:]]/,"") # finally, substitute an empty string
} # for all non-printables,
1' inputfile # And print the current line.

Fill lines with a certain character until they contain a given number of them

I have various flat files containing roughly 30k records each that I need to reformat with a script to ensure they all have the same number of ~'s.
For example, below are two sample records from the flat file: the first record contains 8 ~'s and the second contains 10 ~'s.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~
I need both records to contain 12 ~'s, so I need code that will loop through the file and pad out each line to contain the correct number of ~'s. The desired result would be as follows.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
I have the following bit of code which will display the number of ~'s in each line, but I'm not sure how to proceed from here.
sed 's/[^~]//g' inputfile.text| awk '{ print length }'
Set the field separator to ~ and keep adding ~s until you have enough:
$ awk -F"~" -v cols=12 'NF<=cols{for (i=NF;i<=cols;i++) $0=$0 FS}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
awk -v FS='~' -v count=12 '{ line = $1; for (i = 2; i <= count + 1; i++) { line = line FS $i } print line }' tildes.txt
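Since 12 tildes mean 13 fields, the loop joins fields 1 through count+1; referring to a field past NF simply yields an empty string, which supplies the padding.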
Here's a bash builtin-only solution:
a=~~~~~~~~~~~~;
while read -r line; do
n=${line//[!~]};
echo "$line${a/$n}";
done < file
Explanation
n=${line//[!~]} makes a string from all the ~ characters in $line.
${a/$n} takes the string of 12 ~ characters and deletes the first match of $n, so when we append it to $line, there will be exactly 12 tildes in the echoed line.
Warning: If there are more than 12 tildes in one line in file, the corresponding outputted line will have more than 24 tildes in it, as there are no checks to make sure ${#n} is less than ${#a}.
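If that corner case matters, here is a minimal sketch of the same loop with the missing length check added (assuming lines only ever need padding, never truncating):
a='~~~~~~~~~~~~'
while read -r line; do
    n=${line//[!~]}              # all the tildes already on the line
    if (( ${#n} < ${#a} )); then
        echo "$line${a:${#n}}"   # append only the missing tildes
    else
        echo "$line"             # already has 12 or more; leave as-is
    fi
done < file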
$ awk -F"~" -v c=12 'BEGIN{OFS=FS;c++}{$c=$c}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
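The trick here: c++ in the BEGIN block makes c 13, and the assignment $c=$c forces awk to rebuild the record, creating any missing fields as empty strings joined by OFS. Short lines therefore gain exactly the separators they lack, while lines that already have 13 or more fields pass through unchanged.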