I have a file that is much larger than the amount of memory available on the server that needs to run this script.
In that file, I need to run a basic regex that does a find and replace across two lines at a time. I've looked at using sed, awk, and perl, but I haven't been able to get any of them to work as I need in this instance.
On a smaller file, the following line does what I need it to:
perl -0777 -i -pe 's/,\s+\)/\n\)/g' inputfile.txt
In essence, any time a line ends in a comma and the next line starts in a closing parenthesis, remove the comma.
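For example, this pair of lines:
(but here's another,
) which has to go
should become:
(but here's another
) which has to go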
When I tried to run that on my production file, I just got the message "Killed" in the terminal after a couple of minutes, and the file contents were completely erased. I was watching memory usage while it ran: as expected, it was at 100% and the swap space was being used extensively.
Is there a way to make that perl command run on two lines at a time instead, or an alternative bash command which might achieve the same result?
If it makes it easier by keeping the file size identical then I also have the option of replacing the comma with a space character.
Fairly direct logic:
print a line unless it ends with a comma (we need to check the next line first, and perhaps remove the comma)
print the previous line ($p) if it ended with a comma, without the comma if the current line starts with )
perl -ne'
if ($p =~ /,$/) { $p =~ s/,$// if /^\s*\)/; print $p };
print unless /,$/;
$p = $_
' file
The efficiency of this can be improved somewhat by dropping one regex (so the engine startup overhead goes away) and some data copying, but at the expense of clumsier code with additional logic and checks.
Tested with file
hello
here's a comma,
which was fine
(but here's another,
) which has to go,
and that was another good one.
end
The above fails to print the last line if it ends in a comma. One fix for that is to check our buffer (the previous line $p) in an END block, so add at the end:
END { print $p if $p =~ /,$/}
This is a fairly usual way to check for trailing buffers or conditions in -n/-p one-liners.
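Putting it together, the complete one-liner with the END-block fix reads:
perl -ne'
if ($p =~ /,$/) { $p =~ s/,$// if /^\s*\)/; print $p };
print unless /,$/;
$p = $_;
END { print $p if $p =~ /,$/ }
' file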
Another fix, less efficient but with perhaps cleaner code, is to replace the statement
print unless /,$/;
with
print if (not /,$/ or eof);
This does run an eof check on every line of the file, while END runs once.
Delay printing the trailing comma and line feed until you know it's OK to print them.
perl -ne'
$_ = $buf . $_;
s/^,(?=\n\))//;
$buf = s/(,\n)\z// ? $1 : "";
print;
END { print $buf; }
'
Faster:
perl -ne'
print /^\)/ ? "\n" : ",\n" if $f;
$f = s/,\n//;
print;
END { print ",\n" if $f; }
'
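As a quick sanity check of the faster variant (the sample input here is made up):
printf 'a,\n)\nb,\nc\n' | perl -ne'
print /^\)/ ? "\n" : ",\n" if $f;
$f = s/,\n//;
print;
END { print ",\n" if $f; }
'
a
)
b,
c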
These one-liners read from standard input; to run them on a file, pass the filename as the last argument (see "Specifying file to process to Perl one-liner").
If using \n newline as a record separator is awkward, use something else. In this case you're specifically interested in the sequence ,\n), and we can let Perl find that for us as we read the file:
perl -pe 'BEGIN{ $/ = ",\n)" } s/,\n\)/\n)/' input.txt >output.txt
This portion, $/ = ",\n)", tells Perl that instead of iterating over lines of the file, it should iterate over records that terminate with the sequence ,\n). That assures that every chunk will have at most one such sequence, and more importantly, that this sequence will not span chunks (or records, or file-reads). Every chunk read will either end in ,\n), or, in the case of the final record, may not have a record terminator at all (by our definition of terminator).
Next we just use substitution to eliminate that comma in our ,\n) record separator sequence.
The key here really is that by setting the record separator to the very sequence we're interested in, we guarantee the sequence will not get broken across file-reads.
As has been mentioned in the comments, this solution is most useful only if the span between ,\n) sequences doesn't exceed the amount of memory you are willing to throw at the problem. Newlines themselves most likely occur in the file more often than ,\n) sequences do, so this will read in larger chunks. You know your data set better than we do, and are thus in a better position to judge whether the simplicity of this solution is outweighed by its memory footprint.
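If you want to gauge that footprint before running this on the big file, here is a small sketch that reports the longest record this separator choice would produce (the same record-separator trick, just measuring instead of substituting):
perl -ne 'BEGIN{ $/ = ",\n)" } $max = length if length > $max; END{ print "$max\n" }' input.txt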
This can be done more simply with just awk.
awk 'BEGIN{RS=".\n."; ORS=""} {gsub(/,\n\)/, "\n)", RT); print $0 RT}'
Explanation:
awk, unlike Perl, allows a regular expression as the Record Separator, here .\n. which "captures" the two characters surrounding each newline.
Setting ORS to empty prevents print from outputting extra newlines. Newlines are all captured in RS/RT.
RT represents the actual text matched by the RS regular expression.
The gsub removes the unwanted comma from RT if present (the closing parenthesis is escaped because the pattern is a regular expression).
Caveat: you need GNU awk (gawk) for this to work; POSIX-only awk lacks the regexp-RS and RT variable features, according to the gawk man page.
Note: gsub is not really needed, sub is good enough and probably should have been used above.
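With that change, the command becomes:
awk 'BEGIN{RS=".\n."; ORS=""} {sub(/,\n\)/, "\n)", RT); print $0 RT}'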
This might work for you (GNU sed):
sed 'N;s/,\n)/\n)/;P;D' file
Keep a moving window of two lines throughout the file and if the first ends in a , and the second begins with ), remove the ,.
If there is white space and it needs to be preserved, use:
sed 'N;s/,\(\s*\n\s*)\)/\1/;P;D' file
I am writing a script to add new dependencies to the watch list. I am putting a placeholder to know where to add the text, e.g.:
assets = [
"../../new_app/assets"
# [[NEW_APP_ADD_ASSETS]]
]
It is simple to replace just the placeholder, but my problem is adding the comma to the previous line.
That could be done if I search for and replace
"
# [[NEW_APP_ADD_ASSETS]]
i.e. "\n # [[NEW_APP_ADD_ASSETS]]
but I am not able to search for the newline.
One of the solutions I found for adding a new line was
sed -i '' 's/newline/line one\
line two/' filename.txt
But when the same thing is done for the search string, it returns: unterminated substitute pattern
sed -i '' s/'assets\"\
#'/'some new text'/ filename.txt
PS: I am writing on macOS.
Sed works on a line-by-line basis, hence it becomes tricky to add the comma to the previous line, as that line has already been processed. It is possible, but the sed syntax quickly becomes messy.
To be a bit more specific:
In default operation, sed cyclically shall append a line of input, less its terminating <newline> character, into the pattern space. Reading from input shall be skipped if a <newline> was in the pattern space prior to a D command ending the previous cycle. The sed utility shall then apply in sequence all commands whose addresses select that pattern space, until a command starts the next cycle or quits. If no commands explicitly started a new cycle, then at the end of the script the pattern space shall be copied to standard output (except when -n is specified) and the pattern space shall be deleted. Whenever the pattern space is written to standard output or a named file, sed shall immediately follow it with a <newline>.
In short, if you do not manipulate the pattern space, you cannot process <newline> characters as they just do not appear!
And even shorter, if you only use the substitute command, sed only processes one line at a time!
This is also why you suffer from the unterminated substitute pattern error. You are searching for a newline character, but as sed just reads one line at a time, it does not find one, nor does it expect one. The error will vanish if you replace your literal newline with the symbols \n.
sed -i '' s/'assets\"\n #'/'some new text'/ filename.txt
A better way to achieve your goals would be to make use of awk. It is a bit more readable:
awk '/# \[\[NEW_APP_ADD_ASSETS\]\]/{ print t","; t="line1\nline2"; next }
{ if (NR>1) print t; t=$0 }
END{ print t }' <file>
(The square brackets must be escaped, since the pattern is a regular expression, and the NR>1 guard avoids printing a blank line before the first line of output.)
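For illustration, running that on the sample file above (leaving the placeholder text "line1\nline2" as-is; in practice you would substitute the real entries to insert) produces:
assets = [
"../../new_app/assets",
line1
line2
]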
Please take a look at the sample file and the desired output below to understand what I am looking for.
It can be done with loops in a shell script but I am struggling to get an awk/sed one liner.
SampleFile.txt
These are leaves.
These are branches.
These are greenery which gives
oxygen, provides control over temperature
and maintains cleans the air.
These are tigers
These are bears
and deer and squirrels and other animals.
These are something you want to kill
Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Desired output
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
With sed:
sed ':a;N;/\nThese/!s/\n/ /;ta;P;D' infile
resulting in
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Here is how it works:
sed '
:a # Label to jump to
N # Append next line to pattern space
/\nThese/!s/\n/ / # If the newline is NOT followed by "These", append
# the line by replacing the newline with a space
ta # If we changed something, jump to label
P # Print part until newline
D # Delete part until newline
' infile
The N;P;D is the idiomatic way of keeping multiple lines in the pattern space; the conditional branching part takes care of the situation where we append more than one line.
This works with GNU sed; for other seds like the one found in Mac OS, the oneliner has to be split up so branching and label are in separate commands, the newlines may have to be escaped, and we need an extra semicolon:
sed -e ':a' -e 'N;/'$'\n''These/!s/'$'\n''/ /;ta' -e 'P;D;' infile
This last command is untested; see this answer for differences between different seds and how to handle them.
Another alternative is to enter the newlines literally:
sed -e ':a' -e 'N;/\
These/!s/\
/ /;ta' -e 'P;D;' infile
But then, by definition, it's no longer a one-liner.
Please try the following:
awk 'BEGIN {accum_line = "";} /^These/{if(length(accum_line)){print accum_line; accum_line = "";}} {accum_line = length(accum_line) ? accum_line " " $0 : $0;} END {if(length(accum_line)){print accum_line;}}' < data.txt
The code consists of three parts:
The block marked by BEGIN is executed before anything else. It's useful for global initialization
The block marked by END is executed when the regular processing has finished. It is good for wrapping things up, like printing the last collected data when no further line starting with "These" follows (as in this case)
The rest is the code performed for each line. First, the pattern is searched for and the relevant things are done. Second, data collection is done regardless of the string contents.
awk '$1=="These"{print row;row=$0}$1!="These"{row=row " " $0}'
You can take it from there: blank lines, separators, and
other unspecified behaviors are left as an exercise (untested)
another awk if you have support for multi-char RS (gawk has)
$ awk -v RS="These" 'NR>1{$1=$1; print RS, $0}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Explanation: Set the record separator to "These" and skip the first (empty) record. Reassigning a field forces awk to rebuild the record, squeezing out the embedded newlines; then print the record separator and the rest of the record.
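To see why the $1=$1 reassignment matters, compare a minimal run without and with it (the sample input here is made up):
$ printf 'These are a\nb\nThese are c' | gawk -v RS="These" 'NR>1{print RS, $0}'
These  are a
b

These  are c
$ printf 'These are a\nb\nThese are c' | gawk -v RS="These" 'NR>1{$1=$1; print RS, $0}'
These are a b
These are c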
$ awk '{printf "%s%s", (NR>1 ? (/^These/?ORS:OFS) : ""), $0} END{print ""}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Not a one-liner (but see end of answer!), but an awk-script:
#!/usr/bin/awk -f
NR == 1 { line = $0 }
/^These/ { print line; line = $0 }
! /^These/ { line = line " " $0 }
END { print line }
Explanation:
I'm accumulating lines: each line that starts with "These" collects the following lines that don't, and the completed line is output whenever the next line beginning with "These" is found.
Store the first line (the first "record").
If the line starts with "These", print the accumulated (previous, now complete) line and replace whatever we have found so far with the current line.
If it doesn't start with "These", accumulate the line (i.e concatenate it with the previously read incomplete lines, with a space in between).
When there's no more input, print the last accumulated (now complete) line.
Run like this:
$ ./script.awk data.in
As a one-liner:
$ awk 'NR==1{c=$0} /^These/{print c;c=$0} !/^These/{c=c" "$0} END{print c}' data.in
... but why you would want to run anything like that on the command line is beyond me.
EDIT Saw that it was the specific string "These" (/^These/) that was what should be looked for. Previously had my code look for uppercase letters at the start of the line (/^[A-Z]/).
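That earlier version, for reference, would have been:
$ awk 'NR==1{c=$0} /^[A-Z]/{print c;c=$0} !/^[A-Z]/{c=c" "$0} END{print c}' data.in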
Here is a sed program which avoids branches. I tested it with the --posix option. The trick is to use an "anchor" (a string which does not occur in the file):
sed --posix -n '/^These/!{;s/^/DOES_NOT_OCCUR/;};H;${;x;s/^\n//;s/\nDOES_NOT_OCCUR/ /g;p;}'
Explanation:
write DOES_NOT_OCCUR at the beginning of lines not starting with "These":
/^These/!{;s/^/DOES_NOT_OCCUR/;};
append the pattern space to the hold space
H;
If the last line is read, exchange pattern space and hold space
${;x;
Remove the newline at the beginning of the pattern space which is added by the H command when it added the first line to the hold space
s/^\n//;
Replace all newlines followed by DOES_NOT_OCCUR with blanks and print the result
s/\nDOES_NOT_OCCUR/ /g;p;}
Note that the whole file is read into sed's process memory, but at only 4GB this should not be a problem.
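If you want to guard the anchor assumption, a cheap pre-check (one extra pass over the file) is:
grep -q DOES_NOT_OCCUR infile && echo 'anchor occurs in input, pick another one'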
I have a file with fields separated by the '`' character, but sometimes the actual data also contains this character. How can I remove all the erroneous rows and retain only the good-quality data?
A sample row is below. Towards the end, 'fff`ff' is the erroneous column; in such a case the row should be eliminated.
xxx`1000165811`2012`2012_q2`05/09/2012 22:02:00`1343`04/07/2004 00:00:00`05/09/2012 00:00:00````F`1`1.000000`9.620000`1.0000````fff`Not`Free`Free`1.000000`9.620000`0.000000`1.0000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`56565666`255.590000`21`0`0.000000```ddd`dddd`FA May 2012 ddd`0.000000`0.000000`0.000000`0.000000`0.000000`05/30/2012 00:00:00`05/30/2012 00:00:00`1.000000`ddd`ddd`OW`DL`dd dd dd`ddd`dd`dd dd`dd dd`0.000000`0.000000``````````0.000000`````````Non_Mobile`9.620000`1.000000`1`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`9.620000`9.620000`0.000000`0.000000`0.000000`0.000000`28.590000`6.990000`**fff`ff**`````````9.620000`1.000000`1
You need to know what the correct number of delimiters in a line is. You need to count the actual number of delimiters in each line, and reject those lines where the actual count is not the correct number.
Assuming the correct number of separators is n=5, then you could try:
n=5
grep -E '^[^`]*(`[^`]*){'"$n"'}$' data
The regex uses extended regular expressions (-E). The regex matches the start of the line, zero or more non-back-ticks, then a sequence of n occurrences of a back tick followed by zero or more non-back-ticks, followed by the end of line. Because the back-tick is a shell metacharacter, it is best to enclose most of the regular expression in single quotes. The variable $n could be used without the double quotes around it, but it's generally best to enclose variables in double quotes. Clearly, you can also use this version too:
grep -E '^([^`]*`){'"$n"'}[^`]*$' data
Given a data file data:
AA`BB`CC`DD`EE`FF
AABB`CC`DD`EE`FF
A`A`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
``````
````
Welcome`to`the`land`of`insanity
The output of the command is:
AA`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
Welcome`to`the`land`of`insanity
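If you don't know the correct n in advance, one way to derive it (assuming the first line of the file is known to be good) is to count its delimiters:
n=$(head -n 1 data | tr -cd '`' | wc -c)
Here tr -cd '`' deletes everything except the backticks, and wc -c counts what remains.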
grep -v "[^`]*`[^`]*`[^`]*`"
Repeat the [^`]*` group one more time than the number of backticks a correct line contains; any line that still matches has too many delimiters and gets filtered out.
In the spirit of "Be careful what you ask for", here is a "one-liner" (spread over three lines for readability) that will do what was asked, using only awk and assuming that $FILE is the relevant filename.
awk -F'`' -v file="$FILE" '
BEGIN{ while ((getline < file) > 0) {if (min==""||NF<min){min=NF}}}
NF==min' "$FILE"
This incantation first determines the minimum number of delimiters per line (without sorting the file), and then rejects all lines with more than that many.
(This is similar to Ed Morton's proposal, but without the bug :-)
I store my SOA data for multiple domains in a single file that gets $INCLUDEd by zone files. I've written a small sed script that is supposed to get the serial number, increment it, then re-save the SOA file. It all works properly as long as the SOA file is in the proper format, with the entire record on one line, but it fails as soon as the record gets split into multiple lines.
For example, this works as input data:
# IN SOA dnsserver. hostmaster.example.net. ( 2013112202 21600 900 691200 86400 )
But this does not:
# IN SOA dnsserver. hostmaster.example.net. (
2013112202 ; Serial number
21600 ; Refresh every day, 86400 is 1 day
900 ; Retry refresh every 15 min
691200 ; Expire every 8 days
86400 ) ; Minimum TTL 1 day
I like comments, and I would like to spread things out. But I need my script to be able to find the serial number so that I can increment it and rewrite the file.
The SED that works on the single line is this:
SOA=$(sed 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' $SOAfile)
But for multi-line ... I'm a bit lost. I know I can join lines with N, but how do I know if I even need to? Do I need to write separate sed scripts based on some other analysis I do of the original file?
Please help! :-)
I wouldn't use sed for this. While you might be able to brute-force something, it would require a large amount of concentration to come up with it, and it would look like line noise, and so be almost unmaintainable afterwards.
What about this in awk?
The easiest way might be to split your records based on the # character, like so:
SOA=$(awk 'BEGIN{RS="#"} NR==2{print $6}' $SOAfile)
But that will break if you have comments containing # before the uncommented line, or if you have any comments between the # and the serial number. You could make a pipe to avoid these issues...
SOA=$(sed 's/;.*//;/^#/p;1,/^#/d' $SOAfile | awk 'BEGIN{RS="#"} NR==2{print $6}')
It may seem redundant to remove comments and strip the top of the file, but there could be other lines like #include which (however unlikely) could contain your record separator.
Or you could do something like this in pure awk:
SOA=$(awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' $SOAfile)
Or, broken out for easier reading:
awk -v field=6 '
/^#/ {
if ($2=="IN") {field++;}
for (i=1;i<field;i++) {
if(i==NF) {field=field-NF;getline;i=1;}
}
print $field; }' $SOAfile
This is flexible enough to handle any line splitting you might have, as it counts to field along multiple lines. It also adjusts the field number based on whether your zone segment contains the optional "IN" keyword.
A pure-sed solution would, instead of counting fields, use the first string of digits after an open bracket after your /^#/, like this:
SOA=$(sed -n '/^#/,/^[^;]*)/H;${;x;s/.*#[^(]*([^0-9]*//;s/[^0-9].*//;p;}' $SOAfile)
Looks like line noise, right? :-) Broken out for easier reading, it looks like this:
/^#/,/^[^;]*)/H # "Hold" the meaningful part of the file...
${ # Once we reach the end...
x # Copy the hold space back to the main buffer
s/.*#[^(]*([^0-9]*// # Remove stuff ahead of the serial
s/[^0-9].*// # Remove stuff after the serial
p # And print.
}
The idea here is that starting from the first line that begins with #, we copy the file into sed's hold space, then at the end of the file, do some substitutions to strip out all the text up to the serial number, and then after the serial number, and print whatever remains.
All of these work on single line and multi line zone SOA records I've tested with.
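To complete the increment-and-rewrite step the question asks about, here is a minimal sketch (assuming GNU sed, and that the serial number does not appear as a standalone word anywhere else in the file):
NEW=$((SOA + 1))
sed -i "s/\b$SOA\b/$NEW/" "$SOAfile"
with $SOA extracted by any of the commands above.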
You can try the following - it's your original sed program preceded by commands to first read all input lines, if applicable:
SOA=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' \
"$SOAfile")
This form will work with both single- and multi-line input files.
Multi-line input files are first read as a whole before applying the substitutions.
Note: The awkward separate -e options are needed to keep FreeBSD happy with respect to labels and branching commands, which need a literal \n for termination - using separate -e options is a more readable alternative to splicing in literal newlines with $'\n'.
Alternative solution, using awk:
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6 }' "$SOAfile")
Again, this will work with both single- and multi-line record definitions.
The only constraint is that comments must not precede the serial number.
Additionally, if a file contained multiple records, the above would collect ALL serial numbers, separated by a newline each.
Why sed? grep is simplest in this case:
grep -A1 -e '#.*SOA' "$SOAfile" | grep -oe '[0-9]*'
or (maybe better):
grep -A1 -e '#.*SOA' "$SOAfile" | grep 'Serial number' | grep -oe '[0-9]*'
This might work for you (GNU sed):
sed -nr '/# IN SOA/{/[0-9]/!N;s/[^0-9]+([0-9]+).*/\1/p}' file
For lines that contain # IN SOA if the line contains no numbers append the next line. Then extract the first sequence of numbers from the line(s).
I am looking for a bash or sed script (preferably a one-liner) with which I can insert a new line character after a fixed number of characters in huge text file.
How about something like this? Change 20 to the number of characters before the newline, and temp.txt is the file to replace in.
sed -e "s/.\{20\}/&\n/g" < temp.txt
Let N be a shell variable representing the count of characters after which you want a newline. If you want to continue the count across lines:
perl -0xff -pe 's/(.{'$N'})/$1\n/sg' input
If you want to restart the count for each line, omit the -0xff argument.
Because I can't comment directly (too little reputation), a new hint regarding the answers above:
I prefer the sed command (it does exactly what I want), and I also tested the POSIX command fold. But there is a little difference between the two commands for the original problem:
If you have a flat file of n fixed-size records (without any linefeed characters) and use the sed command (with the record size as the number, 20 in @Kristian's answer), you get n lines when counting with wc. If you use the fold command, you only get n-1 lines with wc!
This difference is sometimes important to know: if your input file doesn't contain any newline character, you get one after the last line with sed, and none with fold.
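To see the difference the hint describes, a quick demonstration (GNU sed assumed; the input deliberately lacks a trailing newline):
printf 'aaaaabbbbb' | sed -e 's/.\{5\}/&\n/g' | wc -l
2
printf 'aaaaabbbbb' | fold -w 5 | wc -l
1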
If you mean you want to insert the newline after a number of characters counted across the whole file, e.g. after every 30th character:
gawk 'BEGIN{ FS=""; ch=30}
{
  for(i=1;i<=NF;i++){
    c+=1
    if (c==ch){
      printf "%s\n", $i   # print the character itself, then insert the newline
      c=0
    }else{
      printf "%s", $i
    }
  }
  print ""
}' file
If you mean inserting after a specific number of characters in each line, e.g. after the 5th character of every line:
gawk 'BEGIN{ FS=""; ch=5}
{
  print substr($0,1,ch) "\n" substr($0,ch+1)
}' file
(The second substr must start at ch+1, otherwise the character at position ch would be duplicated.)
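For example:
echo 1234567890 | gawk 'BEGIN{ FS=""; ch=5} {print substr($0,1,ch) "\n" substr($0,ch+1)}'
12345
67890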
Append an empty line after a line with exactly 42 characters
sed -i -e '/^.\{42\}$/a\
' huge_text_file
This might work for you:
echo aaaaaaaaaaaaaaaaaaaax | sed 's/./&\n/20'
aaaaaaaaaaaaaaaaaaaa
x