To Remove Comma at the end of each line in a very huge File (5GB) in Unix - bash

Input
"India","Australia",1991-07-03,99,
1991-07-03,99,"India","Australia",
Above is just a sample of lines in the file. Each line is about 1800 characters long and the file is 5 GB. Each line ends with <,CRLF> (Carriage Return Line Feed) characters. I need to remove the trailing comma.
Output
"India","Australia",1991-07-03,99
1991-07-03,99,"India","Australia"
Command I Used
cat InputFile | sed 's~,\r~\r~g' > OutputFile
Problem
The command works fine, but it takes about 15 minutes to finish.
Question
Is there a faster/better way to do this?

If you want a significant speed-up, I'm afraid you're going to need a compiled-code solution: Perl, Java, C. Here is C code that I have tested and that works for your case:
#include <stdio.h>

int main(void) {
    int c, d;
    c = getchar();
    if (c == EOF) return 0; // edge case: empty file
    for (d = getchar(); d != EOF; c = d, d = getchar())
        if (c != ',' || d != '\r') putchar(c);
    putchar(c); // last char in file
    return 0;
}
I guess I should add how to run that code, bare-bones. Of course you'll need a C compiler, cc. Assuming you have one, put the above code into a file comma.c, then:
$ cc comma.c
$ ./a.out <InputFile >OutputFile

If you want to make this faster, you can try using split. https://kb.iu.edu/d/afar
Split the file into numerous smaller files, then do a threaded loop against the resulting smaller files, and output the sed of each smaller file into a new results file.
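A hedged sketch of that split-and-parallelize idea, assuming GNU split; the file names and chunk count are illustrative, and on a real 5 GB file you would pick the chunk count to match your CPU count:

```shell
# Sketch: split the input into line-aligned chunks, run the question's sed on
# each chunk in the background, then concatenate the results in order.
# "InputFile", the chunk prefix, and the chunk count (4) are assumptions.
printf 'a,\r\nb,\r\nc,\r\nd,\r\n' > InputFile   # tiny stand-in for the 5 GB file
split -n l/4 -d InputFile chunk.                # GNU split: chunk.00, chunk.01, ...
for f in chunk.0*; do
    sed 's~,\r~\r~g' "$f" > "$f.out" &          # one sed per chunk, in parallel
done
wait                                            # let all background seds finish
cat chunk.0*.out > OutputFile                   # .out files sort in chunk order
```

Because the chunks are line-aligned (`-n l/4`), no line is cut in half, and the lexicographic order of the numeric suffixes preserves the original line order.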

The simple solution to remove a comma at the end of every line is the sed command:
sed -i 's/,$//' input-file
If you don't want to modify the original file you can create a new output file like this:
sed 's/,$//' input-file > output-file

Remove comma from last element in each block

I've got a file with the following contents, and want to remove the last comma (in this case, the comma after the 'c' and 'f').
heading1(
a,
b,
c,
);
some more text
heading2(
d,
e,
f,
);
This has to be done using bash and not Perl or Python etc., as these are not installed on my target system. I can use sed, awk etc., but I cannot use sed with the -z argument as I'm using an old version of the utility.
So sed -zi 's/,\n);/\n);/g' $file is off the table.
Any help would be greatly appreciated. Thanks
This might work in your version of sed. Then again it might not.
sed 'x;1d;G;/;$/s/,//;$!s/\n.*//' $file
Rough translation: "Swap this line with the hold space. If this is the first line, do no more with it. Append the hold space to the line in the buffer (so that you're looking at the last line and the current one). If what you have ends with a semicolon, delete the comma. If you're not on the last line of the file, delete the second of the two lines you have (i.e. the current line, which we'll deal with after we see the next one)."
Using awk, RS="^$" to read in the whole file and regex to replace parts of the text:
$ awk -v RS=^$ '{gsub(/,\n\);/,"\n);")}1' file
Some output:
heading1(
a,
b,
c
);
...
This should work with GNU sed and BSD sed on the shown input:
sed -e ':a' -e '/,\n);$/!{N' -e 'ba' -e '}' -e 's/,\n);$/\n);/' file.txt
We concatenate lines in the pattern space until it ends with ,\n);. Then we delete the comma, print (the default) and restart the cycle with a new line.
Simpler and more readable version with GNU sed (that you do not have):
sed ':a;/,\n);$/!{N;ba};s/,\n);$/\n);/' file.txt
Using awk:
awk '
$0==");" {sub(/,$/, "", l)}
FNR!=1 {print l}
{l=$0}
END {print l}' file
This might work for you (GNU sed):
sed '/,$/{N;/);$/Ms/,$//M;P;D}' file
If a line ends with a comma, fetch the next line and if this ends in );, remove the comma.
Otherwise, if the following line does not match as above, print/delete the first of the lines and repeat.
Using sed there are broadly two approaches:
Keep multiple lines in the pattern space; or
Keep the previous line in the hold space.
Using just the pattern space means a very concise version:
sed 'N; s/,[[:space:]]*\n*[[:space:]]*)/)/; P; D'
This relies on the pattern space being able to hold multiple lines, and being able to match the newline with \n. Not all versions of sed can do this, but GNU sed can.
This also relies on the implicit behaviours of N, P, and D, which change depending on when end-of-input is reached. Read man sed for the gory details.
Unrolling this to one command per line gets:
sed '
N
s/,[[:space:]]*\n*[[:space:]]*)/)/
P
D
'
If you have only a POSIX version of sed available, you'll need to use the hold space as well. In this case the idea is that when you see the ) in the pattern space, you edit the line that's in the hold space to remove the comma:
sed '1 { h; d; }; /^)/ { x; s/,[[:space:]]*$//; x; }; x; $ { p; x; s/,[[:space:]]*$//; }'
Unrolling that we get:
sed '
1 {
h
d
}
/^)/ {
x
s/,[[:space:]]*$//
x
}
x
$ {
p
x
s/,[[:space:]]*$//
}
'
Breaking that apart: what follows is a "sed script"; so just put '' around it and "sed" in front of it:
sed '
Start by unconditionally copying the first line from the pattern space to the hold space, and then deleting the pattern space (which forces a skip to the next line)
1 {
h
d
}
For each line that starts with ')', swap the pattern space and hold space (so you now have the previous line in the pattern space), remove the trailing comma (if any), and then swap back again:
/^)/ {
x
s/,[[:space:]]*$//
x
}
Now swap the pattern space with the hold space, so that the hold space now holds the current line and the pattern space holds the previous line.
x
Normally contents of the pattern space will be sent to output when the end of the script is reached, but we have one more case to take care of first.
On the last line, print the previous line, then swap to retrieve the last line and then (because we reach the end of the script) print it too. This code will also remove a trailing comma from the last line, but that's optional; you can remove the s command in the following if you don't want that.
$ {
p
x
s/,[[:space:]]*$//
}
Upon reaching the end of the sed script, the pattern space will be printed; so there's no "p" at the end.
As mentioned before, close the quote from the beginning.
'
Note:
If you need to scan ahead more than one line, instead of "x" to swap one line, use "H;g" to append to the hold space and then copy the hold space to the pattern space, then "P;D" to print and remove up to the first newline.

awk sed backreference csv file

A question to extend a previous one here. (I preferred asking a new question rather than editing the first one. I may be wrong.)
EDIT: OK, I was wrong, I should have edited my first question. My bad. (An SO question is an art, difficult to master.)
I have a CSV file with a semicolon as field delimiter. Here is an extract of the CSV file:
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I am looking for a way to swap 2 patterns in each field.
pattern 1: any number, with an optional decimal mark (.) and optional decimal digits
e.g.: 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2: any string that begins with a left parenthesis, followed by one or more characters, ending with a right parenthesis
e.g.: (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
The solutions provided for the first question work for a simple file that contains only one column. When I try them on the CSV file, I face multiple failures.
So I tried a similar awk solution, which is (I think) more "column-oriented".
I tried
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by fixing the field delimiter (;), my regex swap would succeed in every field. It was a mistake.
Here is an example of a failure:
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail here when it succeeds on a one-column file? What is the best tool to face this challenge?
sed with a very long regex?
awk with a very long regex?
a for loop?
other tools?
PS: I know I am not clear. I have 2 problems (the English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and contains too many separate questions to wade through, but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, awk usually comes to the rescue:
awk -vFS=';' -vOFS=';' '{
    for (i = 1; i < NF; i++) {
        split($i, t, "(")
        if (length(t[1]) != 0 && length(t[2]) != 0) {
            $i = "(" t[2] " " t[1]
        }
    }
    print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However this will fail if fields are quoted, i.e. if the separator ; appears inside the values...
First we set the input and output separator to ;.
We iterate through all the fields in the line: for (i = 1; i < NF; i++).
We split the field on the ( character.
If the first part of the split is of nonzero length and the second part is also of nonzero length,
we swap the two parts of the field and add a space (also restoring the ( that the split removed from the beginning).
And then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; I substitute a newline.
For each line I substitute the string of at least one character before a ( followed by a string inside ( and ), swapping the two.
I then merge 7 lines using ; as separator with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of digits (possibly with a decimal point) followed by a pair of parens and rearrange them in the desired fashion, globally throughout each line.

Display data between two fixed patterns

I've got random data coming in from a source into a file. I have to read through the file and extract only the portion of data which falls between particular patterns.
Example: Let's suppose the file myfile.out looks like this.
info-data
some more info-data
=================================================================
some-data
some-data
some-data
=================================================================
======================= CONFIG PARMS : ==========================
some-data
some-data
some-data
=================================================================
======================= REQUEST PARAMS : ========================
some-data
some-data
some-data
=================================================================
===================== REQUEST RESULTS ===========================
some-data
=================================================================
some-data
some-data
=================================================================
Data-I-Need
Data-I-Need
...
...
...
Data-I-Need
==========================F I N I S H============================
some-info-data
I'm looking for the data that matches this particular pattern only
=================================================================
Data-I-Need
Data-I-Need
...
...
...
Data-I-Need
==========================F I N I S H============================
I did try to look around a bit, like
How to select lines between two marker patterns which may occur multiple times with awk/sed
Bash. How to get multiline text between tags
But the awk and sed solutions given there don't seem to work; the commands give no errors and no output.
I tried this
PATTERN1="================================================================="
PATTERN2="==========================F I N I S H============================"
awk -v PAT1="$PATTERN1" -v PAT2="$PATTERN2" 'flag{ if (/PAT2/){printf "%s", buf; flag=0; buf=""} else buf = buf $0 ORS}; /PAT1/{flag=1}' myfile.out
and
PATTERN1="================================================================="
PATTERN2="==========================F I N I S H============================"
awk -v PAT1="$PATTERN1" -v PAT2="$PATTERN2" 'PAT1 {flag=1;next} PAT2 {flag=0} flag { print }' file
Maybe it is due to the pattern? Or am I doing something wrong?
Script will run on RHEL 6.5.
This might work for you (GNU sed):
sed -r '/^=+$/h;//!H;/^=+F I N I S H=+$/!d;x;s/^[^\n]*\n|\n[^\n]*$//g' file
Store a line containing only ='s in the hold space (replacing anything that was there before). Append all other lines to hold space. If the current line is not a line containing ='s followed by F I N I S H followed by ='s, delete it. Otherwise, swap to the hold space, remove the first and last lines and print the remainder.
Assuming you only need the data and not the pattern, using GNU awk:
awk -v RS='\n={26,}[ A-Z]*={28,}\n' 'RT~/F I N I S H/' file
The record separator RS is set to match lines with a series of = and some optional uppercase characters in between.
The only statement checks whether the record terminator RT (of the current record) has the FINISH keyword in it. If so, awk prints the whole record, consisting of multiple lines.
sed can handle this.
Assuming you want to keep the header and footer lines -
$: sed -En '/^=+$/,/^=+F I N I S H=+$/ { /^=+$/ { x; d; }; /^[^=]/ { H; d; }; /^=+F I N I S H=+$/{ H; x; p; q; }; }' infile
=================================================================
Data-I-Need
Data-I-Need
...
...
...
Data-I-Need
==========================F I N I S H============================
If not, use
sed -En '/^=+$/,/^=+F I N I S H=+$/ { /^=+$/ { s/.*//g; x; d; }; /^[^=]/ { H; d; }; /^=+F I N I S H=+$/{ x; p; q; }; }' infile
Note that if you aren't using GNU sed you'll need to insert newlines instead of all those semicolons.
sed -En '
/^=+$/,/^=+F I N I S H=+$/ {
/^=+$/ {
s/.*//g
x
d
}
/^[^=]/ {
H
d
}
/^=+F I N I S H=+$/{
x
p
q
}
}' infile
Data-I-Need
Data-I-Need
...
...
...
Data-I-Need
Breaking it down -
sed -En '...'
The -En says to use extended pattern matching (the -E, which I really only used for the +'s), and not to output anything unless specifically asked (the -n).
/^=+$/,/^=+F I N I S H=+$/ {...}
says to execute these commands only between lines that are all ='s and lines that are all ='s except for F I N I S H in the middle somewhere. All the stuff between the {}'s will be checked on all lines between those. That does mean from the first =+ line, but that's ok, we handle that inside.
(a) /^=+$/ { x; d; };
(b) /^=+$/ { s/.*//g; x; d; };
(a) says on each of the lines that are all ='s, swap (x) the current line (the "pattern space") with the "hold space", then delete (d) the pattern space. That keeps the current line and deletes whatever you might have accumulated above on false starts. (Remember -n keeps anything from printing till we want it.)
(b) says erase the current line first, THEN swap and delete. It will still add a newline. Did you want that removed?
/^[^=]/ { H; d; };
Both versions use this. On any line that does not start with an =, append it to the hold space (H), then delete the pattern space (d). The delete always restarts the cycle, reading the next record.
(a) /^=+F I N I S H=+$/{ H; x; p; q; };
(b) /^=+F I N I S H=+$/{ x; p; q; };
On any line with the sentinel F I N I S H string between all ='s, (a) will first append (H) the pattern to the hold space - (b) will not. Both will then swap the pattern and hold spaces (x), print (p) the pattern space (which is now the value accumulated into the hold space), and then quit (q).
Because of the q, processing stops there and the remaining lines are never read. If you need to handle multiple F I N I S H blocks, drop the q; then another row of all ='s would begin accumulating records again, which would not be printed unless another F I N I S H record is hit.
}' infile
This just closes the script and passes in whatever filename you were using. Note that this is not an in-place edit...
Hope that helps.
Although there is already a sed solution there, I like sed for its simplicity:
sed -n '/^==*\r*$/,/^==*F I N I S H/{H;/^==*[^F=]/h;${g;p}}' file
In this sed command we define a range over which our commands run. The range starts at a line that consists only of = characters and finishes at a line that starts with = characters and leads into F I N I S H. Now our commands:
H appends each line immediately to the hold space. Then /^==*[^F=]/h runs on the header or footer of another section, replacing the hold space with the current pattern space (discarding what was collected so far).
And at the last line we replace the current pattern space with what is in the hold space and then print it using ${g;p}. The whole thing outputs this:
=================================================================
Data-I-Need
Data-I-Need
...
...
...
Data-I-Need
==========================F I N I S H============================

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ASCII file "table.txt" which contains lines of whitespace-separated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1] = $2 } }
{
    line = $0
    for (i in subs) {
        gsub(i, subs[i], line)
    }
    print line
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Can anyone explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ASCII-encoded phonetic symbols which I want to convert into LaTeX code. The ASCII encoding of the symbols contains $, &, -, %, [a-z], [0-9], ...)
Any comments and suggestions are welcome!
PS:
Of course, in this application, for a substitution table.txt:
aa ab
a 1
an original string "aa" should be converted into "ab" and not into "1b". That means a string which was produced by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
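A minimal sketch of that SUBSEP packing idea, in POSIX awk (the table contents come from the question; the array and variable names here are mine, and the patterns are still treated as regexes by gsub):

```shell
# Sketch: keep the rules in ONE numerically indexed array, in file order, with
# pattern and replacement packed together using SUBSEP; split() unpacks them.
printf 'aaa 3\naa 2\na 1\n' > table.txt    # the question's table, in file order
echo aaa | awk '
    NR == FNR { rules[++n] = $1 SUBSEP $2; next }   # pack "pattern SUBSEP replacement"
    {
        for (i = 1; i <= n; i++) {                  # walk the rules in file order
            split(rules[i], pair, SUBSEP)           # pair[1]=pattern, pair[2]=replacement
            gsub(pair[1], pair[2])
        }
        print
    }' table.txt -
```

Because SUBSEP (by default "\034") cannot appear in ordinary text, the split cleanly recovers the pair, and the numeric index guarantees the traversal order that a plain for (i in subs) loop does not.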
Also note the caveats etc. with getline listed on http://awk.info/?tip/getline and consider not using that manually but instead using NR==1{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {
    i = 0
    while (getline < file) { subs[i] = $1; repl[i++] = $2 }
    n = i
}
{
    for (i = 0; i < n; i++) { gsub(subs[i], repl[i]) }
    print tolower($0)
}
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;

my %subs;

BEGIN {
    open my $f, '<', 'table.txt' or die "table.txt: $!";
    while (<$f>) {
        my ($k, $v) = split;
        $subs{$k} = $v;
    }
}

while (<>) {
    while (my ($k, $v) = each %subs) {
        s/\b$k\b/$v/g;
    }
    print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in order of decreasing target-string length (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
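If you want that longest-target-first ordering without relying on how the table happens to be written, one hedged sketch is to pre-sort the rule file by target length before feeding it to whichever substitution script you use (this decorate-sort-undecorate pipeline is my assumption, not taken from the linked answer):

```shell
# Sketch: prefix each rule with the length of its target, sort numerically
# descending on that length, then strip the prefix again.
printf 'a 1\naaa 3\naa 2\n' > table.txt     # deliberately out of order
awk '{ print length($1), $0 }' table.txt |  # decorate: "3 aaa 3" etc.
    sort -k1,1nr |                          # longest target first
    cut -d' ' -f2-                          # undecorate
```

The sorted output can then be redirected into a new table file and used with any of the order-sensitive solutions above.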
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner, then compiles and runs the scanner using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. POSIX-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, GNU make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without the .txt extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
@$(<D)/$(<F)
%.exe: %.txt
@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
sed -r \
's/((\\\\)*)(\\?)$$/\1\3\3/; #\
s/((\\\\)*)\\?"/\1\\"/g; #\
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$<"; \
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement calls for a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Removing control / special characters from log file

I have a log file captured by tclsh which captures all the backspace characters (ctrl-H, shows up as "^H") and color-setting sequences (eg. ^[[32m .... ^[[0m ). What is an efficient way to remove them?
^[...m
This one is easy since I can just do "sed -i 's/^[.*m//g'" (typing ^[ as a literal escape character) to remove them.
^H
Right now I have "sed -i s/.^H//", which "applies" a backspace, but I have to keep looping this until there are no more backspaces.
while [ logfile == `grep -l ^H logfile` ]; do sed -i s/.^H// logfile ; done;
"sed -i s/.^H//g" doesn't work because it would match consecutive backspaces. This process takes 11 mins for my log file with ~6k lines, which is too long.
Any better ways to remove the backspace?
You could always write a simple pipeline command to implement the backspace stripping, something like this:
#include <stdio.h>
#include <stdlib.h>
#define BUFFERSIZE 10240
int main(int argc, char* argv[])
{
    int c;
    int buf[BUFFERSIZE];
    int pos = 0;

    while ((c = getchar()) != EOF)
    {
        switch (c)
        {
        case '\b':
            if (pos > 0)
                pos--;
            break;
        case '\n':
        {
            int i;
            for (i = 0; i < pos; ++i)
                putchar(buf[i]);
            putchar('\n');
            pos = 0;
            break;
        }
        default:
            buf[pos++] = c;
            break;
        }
    }
    return 0;
}
I've only given the code a minimal test and you may need to adjust the buffer size depending on how big your lines are. It might be an idea to assert that pos is < BUFFERSIZE after pos++, just to be safe!
Alternatively you could maybe implement something similar with the Tcl code that captures the log file in the first place; but without knowing how that works it's a bit hard to say.
You could try:
sed -i s/[^^H]^H//g
This might or might not work in one go, but should at least be faster than one at a time as you seem to be doing now.
Did you know that “sed” doesn't just do substitutions? The commands of a sed script have to be on separate lines though (or at least they do on the version of sed I've got on this machine).
sed -i.bak 's/^[[^^]]*m//g
: again
s/[^^H]^H//g
t again' logfile
The : sets up a label (again in this case) and t branches to a label if any substitutions have been performed (since the start/last branch). Wrapping those round a suitable s gets the substitution applied until it can't any more.
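As a quick sanity check of that label/branch loop, a hedged sketch (a shell variable holds a literal backspace, since \x-style escapes vary between sed versions; the sample input is mine):

```shell
# Sketch: the :again / t again loop applies stacked backspaces one pass at a
# time, so "abc<BS><BS>X" reduces to "aX" instead of leaving stray characters.
BS=$(printf '\b')                 # a literal backspace character
printf 'abc%s%sX\n' "$BS" "$BS" | sed ":again
s/[^$BS]$BS//
t again"
```

Each pass deletes one character/backspace pair, and the t branch re-runs the substitution until nothing matches, which is exactly why a single g-flagged substitution is not enough for consecutive backspaces.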
Just to put it out here, I ended up doing this. It's not a pretty solution and not as flexible as Jackson's answer but does what I need in my particular case. I basically use the inner loop to generate the match string for sed.
# "Applies" up to 10 consecutive backspaces
for i in {10..1}; do
match=""
for j in `seq 1 $i`; do
match=".${match}^H"
done;
# Can't put quotes around s//g or else backspaces are evaluated
sed -i s/${match}//g ${file-to-process}
done;
