replacing all numbers by a single word in text files [closed] - bash

Is there an easy and efficient one-line solution to replace all the numbers, or sequences that contain numbers and symbols (\ / $ & * # # ) ( - + ! ~ . , : ; " ' ` ^ % _ ] [ { } = ), for example:
1 2 3 4 998898321321
0.2 1.2 32221.111. 1321321321.111
111.11212.21212
212323/12331/321312
121-12123-32131
121+12123+32131
1_212121_2320
12131!~~~323131
etc
with a single token NUMBER in a huge text (100GB) file? Sample input and output:
input:
hello my friend 212323/12331/321312
hope you are fine 12131!~~~323131 in 33-years from now
happy face is important to maintaion by 98987 321321/32131
output:
hello my friend NUMBER
hope you are fine NUMBER in 33-years from now
happy face is important to maintaion by NUMBER NUMBER
Basically, anything between two spaces that consists only of numbers and non-alphabetic symbols must be replaced by NUMBER. The rest of the text should be kept as-is.

Okay, I think I got this:
I need three steps:
Double up the white spaces
Replace all non-letter characters surrounded by a space or newline with NUMBER (while retaining the spaces)
Collapse double white spaces into single ones
This is how it looks now:
$ cat test.txt
hello my friend 212323/12331/321312
hope you are fine 12131!~~~323131 in 33-years from now
happy face is important to maintaion by 98987 321321/32131
123 This is a line
$ sed -r 's/ /  /g;s/(^| )[^[:alpha:] ]+( |$)/\1NUMBER\2/g;s/  / /g' test.txt
hello my friend NUMBER
hope you are fine NUMBER in 33-years from now
happy face is important to maintaion by NUMBER NUMBER
NUMBER This is a line
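To see why the doubling step matters, here is a small sketch that compares a single-pass substitution with the three-step version on part of the sample. The function names are made up for illustration, and -r assumes GNU sed.

```shell
# Hypothetical helper names; -r (extended regex) assumes GNU sed.

# Single pass: each match consumes the space after it, so a token that
# directly follows another match has no leading space left to match on.
replace_single_pass() {
  sed -r 's/(^| )[^[:alpha:] ]+( |$)/\1NUMBER\2/g'
}

# Three steps: double the spaces, replace, then collapse the doubles again.
replace_three_step() {
  sed -r 's/ /  /g;s/(^| )[^[:alpha:] ]+( |$)/\1NUMBER\2/g;s/  / /g'
}

line='by 98987 321321/32131'
printf '%s\n' "$line" | replace_single_pass   # second token is left untouched
printf '%s\n' "$line" | replace_three_step    # both tokens become NUMBER
```

The single-pass version leaves 321321/32131 alone because the space in front of it was already used up by the previous match.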

To complement chw21's helpful solution with a perl solution that can handle not just spaces, but any mix of spaces and tabs between words:
perl -ple 's/(^|(?<=[[:blank:]]))[^[:alpha:][:blank:]]+((?=[[:blank:]])|$)/NUMBER/g' file
The look-behind ((?<=...)) and look-ahead ((?=...)) assertions check the neighboring blanks without consuming them, which obviates the need to capture and re-insert them, and therefore the need for doubling spaces as an intermediate step; using [[:blank:]] (space or tab) in lieu of a literal space makes it work with any mix of spaces and tabs:
(^|(?<=[[:blank:]])) matches at the beginning of a line (^) or at a position preceded by a blank (space or tab)
[^[:alpha:][:blank:]]+ matches any nonempty run of characters composed of only non-letters and non-blanks
((?=[[:blank:]])|$) matches at the end of the line ($) or if the following character is a blank.

Related

Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, second time A00000_00_A_002 etc. The output needs to be written back to the same file. Each file only contains 1 code, but it appears multiple times.
After some digging I have found-
perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml
but the issue is the counter does not reset in each file.
That is, the output of the first file is, say, Q00390_01_A_1 to Q00390_01_A_7, while the second file is Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_3 in the second.
Does anyone have any idea on how to edit the above code to make it do that? I'm a total newbie so ideally an edit to what I have would be brilliant. Thanks
cd /users/documents/
for f in *.xml; do
  perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge' "$f"
done
This matches the string facs= and any character, then "Q" or "M" followed by either four or five digits, then an underscore, then two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A zero padded to four digits. Because the loop starts a new perl process for each file, $A starts from zero in every file.

How can i get only special strings (by condition) from file?

I have a huge text file with strings of a special format. How can I quickly create another file with only the strings that match my condition?
for example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the strings with "myRule" and "cat"?
I think it should be perl, or bash, but it doesn't matter.
Thanks a lot, sorry for noob question.
Is it correct, that each entry is two lines long? Then you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
the first rule appends the next line to the pattern space when myRule matches
the second rule tries to match myRule followed by cat in the pattern space; if found, it prints the pattern space
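Since the command above has no input file and the goal is a new file, here is a small sketch of how it might be used. The function name and the file names are placeholders, not something from the question.

```shell
# Hypothetical helper: print only the two-line entries whose first line
# contains myRule and whose joined text contains cat after it.
filter_entries() {
  # -n suppresses automatic printing; N appends the next line to the
  # pattern space; p prints the pattern space when the second test matches.
  sed -n '/myRule/{N;}; /myRule.*cat/p' "$1"
}

# Usage with placeholder file names:
# filter_entries huge_file.txt > filtered.txt
```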
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern computer systems), then the only way is to either change the record separator or to read the lines in pairs.
This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline:
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"

Understanding sed command syntax and sed commands

Could someone explain what this sed command does here?
sed 's!^M$!!;s!\-!!g;s!\.!!g;s!\(..\)!\1:!g;s!:$!!'
It seems replacing/deleting some characters... But I couldn't figure it out... It's really complicated (I mean all of those s ; / g M ^ . and other characters)
thanx
regards
You can split it up into a series of substitutions:
s!^M$!!
s!\-!!g
s!\.!!g
s!\(..\)!\1:!g
s!:$!!
Each one is using ! as the delimiter, so the patterns are s!match!replacement!. The g on the end means that some of them are global, so will happen as many times as possible rather than only once on each line.
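^ matches the start of the line and $ matches the end, so the first one empties any line that consists of just the letter M. (If the ^M in your script is really a literal carriage return, which many editors and terminals display as ^M, it instead strips a carriage return at the end of each line.)
The next two remove all - and . characters that are found. The . needs a backslash before it so that it matches only a literal . rather than any character. The - doesn't need a backslash before it, but it doesn't do any harm either.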
The fourth one adds a : after every 2 characters, using a capture group and back reference.
Hopefully you can work out what the last one does, based on my explanation of the first one!
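To watch the substitutions work together, here is a sketch that runs the command on a made-up input. The guess that the input looks like a hardware address written with dashes and dots is an assumption, as is the function name.

```shell
# Hypothetical wrapper around the command from the question.
normalize() {
  sed 's!^M$!!;s!\-!!g;s!\.!!g;s!\(..\)!\1:!g;s!:$!!'
}

# Made-up input: hex pairs separated by dashes and dots.
printf '%s\n' '00-1a.2b-3c.4d-5e' | normalize
# prints 00:1a:2b:3c:4d:5e
```

The dashes and dots are stripped first, a colon is added after every pair of characters, and the last substitution takes care of the colon left over at the end of the line.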

How to decrement (subtract) number in file with sed

I've got some source code like the following where I call a function in C:
void myFunction (
&((int) table[1, 0]),
&((int) table[2, 0]),
&((int) table[3, 0])
);
...the only problem is that the function has >300 parameters (it's an auto-generated wrapper for initialising and calling a whole module; it was given to me and I cannot change it). And as you can see: I began accessing the array with a 1 instead of a 0... Great times, modifying all the 300 parameters, i.e. decreasing the x-coordinate of the array 300 times, by hand.
The solution I am looking for is how I could force sed to do the work for me ;)
EDIT: Please note that the syntax above for accessing a two-dimensional array in C is wrong anyway! Of course it should be [1][0]... (so don't just copy-and-paste ;))
Basically, the command I came up with, was the following:
sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge' inputfile.c > outputfile.c
Well, this does not look very intuitive at first sight - and I was missing good explanations for nearly every example I found.
So I will try to give a detailed explanation on this:
sed
--> basic command
-r
--> most examples you find are using -e; however, the -r parameter (only works with GNU sed) enables extended regular expressions and brings support for the + in a regex. The + means "one or more of the preceding item".
's/input/output/ge'
--> this is the basic replacement syntax. It means "replace 'input' by 'output'". The g is the "global" flag, i.e. sed will replace all occurrences on a line and not only the first one. In GNU sed you can add an additional e flag to execute the result as a shell command. This is what we want to do here to handle the calculation.
(.*)
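--> this matches "everything" from the start of the line up to the next part of the pattern; because .* is greedy, that is the last table[ on the line that is followed by digits and a comma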
(table\[)
--> the \ is to escape the bracket. This part of the expression will match strings like table[
([0-9]+)
--> this one matches a run of one or more digits, so numbers with more than one digit are matched as well.
(,)
--> this simply matches the comma ,
(.*)
--> and again: the rest of the line
And now the interesting part:
echo "\1\2$((\3-1))\4\5"
echo is a shell command
the back-references \1 up to \9 are some kind of "variables" for the inputs: \1 will contain the first match, \2 the second match, ... --> this helps you to preserve parts of the input string
the $((1+1)) is a simple bash syntax to calculate the value of the term inside the double brackets (in the complete sed command above, the \3 will of course be automatically replaced by the 3rd match, i.e. the 1st part inside the brackets to access the table's cells)
please note that we use quotation marks around the echo content to also be able to process lines with characters like & which would otherwise not work
The already mentioned e in /ge at the end triggers the execution of the result by the shell. Only lines on which the substitution matches are executed; other lines, such as void myFunction (, pass through unchanged. E.g. the second line of the example source code in the question produces the following shell statement:
echo " &((int) table[$((1-1)), 0]),"
which is executed, so the first two lines of the output are:
void myFunction (
&((int) table[0, 0]),
...which is exactly what I wanted :)
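For anyone who wants to try this without touching real source code, here is a small sketch that pipes a few made-up lines through the same command. It assumes GNU sed (both -r and the e flag are GNU extensions), and the function name is only for illustration.

```shell
# Assumes GNU sed: both -r and the e flag of the s command are GNU extensions.
# decrement_with_sed is a made-up name for illustration.
decrement_with_sed() {
  sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge'
}

# Made-up sample lines in the style of the question.
printf '%s\n' 'void myFunction (' '&((int) table[1, 0]),' '&((int) table[12, 0])' ');' |
  decrement_with_sed
```

The two table lines come back with 0 and 11 as their first index, while the lines without table[ are printed as they are.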
BTW:
text > output.c
is simple bash syntax to output text (or in this case the sed-processed source code) to a file called output.c.
Good links about this topic are:
sed basics
regular expressions basics
Ahh and one more thing: You can also use sed in the git-Bash on Windows - if you are "forced" to use Windows at work like me ;)
PS: In the meantime I could have easily done this by hand but using sed was a lot more fun ;)
Here's another way you could do it, using Perl:
perl -pe 's/(table\[)(\d+)(,)/$1.($2-1).$3/e' file.c
This uses the e modifier to execute an expression in the replacement. The capture groups are concatenated together but the middle group has 1 subtracted from its value.
This will output to standard output so you can check that it does what you want. When you're happy, you can add the -i switch to overwrite the original file.

count quotes in a string that do not have a backslash before them

Hey I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash..
for example the following string:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"')
obviously this is not good enough
When I count the quotes on both these examples I need the result only to be 1
I have been searching here for similar questions and I've tried using lookbehinds but cannot get them to work in Ruby.
I have tried the following regexes on Rubular from this previous question
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results I'm after
Maybe a regex is not the way to do that. Maybe a programmatic approach is the solution
How about string.count('"') - string.scan('\\"').size?
result = subject.scan(
  /(?:        # match either
     ^        # start-of-string\/line
   |          # or
     \G       # the position where the previous match ended
   |          # or
     [^\\]    # one non-backslash character
   )          # then
   (?:\\\\)*  # match an even number of backslashes (0 is even, too)
   "          # match a quote
  /x)
gives you an array of all quote characters (possibly with a preceding non-quote character) except escaped ones, so result.size is the number of unescaped quotes. The backslash group is non-capturing because scan returns only the captures when the pattern contains capture groups.
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).
