The requirement is to mask some sensitive data in a log file. The code below works as expected with awk version 4.0.2.
I will be grepping the log files, masking some data using the pattern in the awk snippet below, and then returning the result.
echo "123-123-432-123-999-889 and 123456 and 1234-1234-4321-1234 and xyz#abc.com" | awk ' gsub (/[0-9]{6,}|([0-9]{3,}.){3,}|\w{2,}#\w{2,}.\w{2,}/, "****") 1'
The same is not working with awk version 3.1.7, which is the version on the production server.
I can only use grep, cat, and awk; perl and sed are not permitted, as they are restricted by the Admin Team.
Expected Output:
****and **** and ****and ****
Solution should also work if the content is in file, for example
sample.log
123-123-432-123-999-889
and
123456
and
1234-1234-4321-1234
and xyz#abc.com
Command:
cat sample.log | awk ' gsub (/[0-9]{6,}|([0-9]{3,}.){3,}|\w{2,}#\w{2,}.\w{2,}/, "****") 1'
Please help me with an awk command that works in awk version 3.1.7.
Activate RE intervals with:
awk --re-interval '...'
You MAY also need to replace each \w with [[:alnum:]_].
The problem you're having is that you're using a very old version of gawk, from before RE intervals (e.g. {1,3}) were enabled by default. In that old gawk, every { and } is just a literal character, for backward compatibility with the 1980s awks (old, broken awk and nawk), so you need to explicitly tell gawk to interpret {1,3} as an RE interval instead of a literal string of 5 characters.
I don't know whether \w was supported back then, so you MAY also need to use the bracket expression I suggested above instead.
Related
I have a file example.txt, I want to delete and replace fields in it.
The following commands work, but in a very messy way; unfortunately I'm a rookie with the sed command.
The commands I used:
sed 's/\-I\.\.\/\.\.\/\.\.//\n/g' example.txt > example.txt1
sed 's/\-I/\n/g' example.txt1 > example.txt2
sed '/^[[:space:]]*$/d' > example.txt2 example.txt3
sed 's/\.\.\/\.\.\/\.\.//g' > example.txt3 example.txt
and then I'm deleting all the unnecessary files.
I'm trying to get the following result:
Common/Components/Component
Common/Components/Component1
Common/Components/Component2
Common/Components/Component3
Common/Components/Component4
Common/Components/Component5
Common/Components/Component6
Comp
App
The file looks like this:
-I../../../Common/Component -I../../../Common/Component1 -I../../../Common/Component2 -I../../../Common/Component3 -I../../../Common/Component4 -I../../../Common/Component5 -I../../../Common/Component6 -IComp -IApp ../../../
I want to know the best way to transform the input format into the output format with a standard text-processing tool, in a single call to sed or awk.
With your shown samples, please try the following awk code, written and tested in GNU awk.
awk -v RS='-I\\S+' 'RT{sub(/^-I.*Common\//,"Common/Components/",RT);sub(/^-I/,"",RT);print RT}' Input_file
The output with the shown samples will be as follows:
Common/Components/Component
Common/Components/Component1
Common/Components/Component2
Common/Components/Component3
Common/Components/Component4
Common/Components/Component5
Common/Components/Component6
Comp
App
Explanation: a simple explanation would be: in GNU awk, set RS (the record separator) to -I\\S+, i.e. -I up to the next space. In the main awk program, check that RT is NOT NULL, substitute everything from the leading -I through Common/ with Common/Components/ in RT, then substitute a leading -I with the null string in RT, and finally print RT.
If you don't REALLY want the string /Components to be added in the middle of some output lines, then this may be what you want, using any awk in any shell on every Unix box:
$ awk -v RS=' ' 'sub("^-I[./]*","")' file
Common/Component
Common/Component1
Common/Component2
Common/Component3
Common/Component4
Common/Component5
Common/Component6
Comp
App
That would fail if any of the paths in your input contained blanks but you don't show that as a possibility in your question so I assume it can't happen.
What about
sed -i 's/-I\.\.\/\.\.\/\.\.\//\n/g
s/-I/\n/g
/^[[:space:]]*$/d
s/\.\.\/\.\.\/\.\.//g' example.txt
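For completeness, here is a portable sketch (any awk; RS is set to a single space, and the sample line is abbreviated) that also inserts the /Components/ part the expected output asks for:

```shell
# Split on blanks by setting RS to a single space, strip the -I prefix,
# and rewrite the ../../../Common/ prefix as Common/Components/.
printf '%s\n' '-I../../../Common/Component -I../../../Common/Component1 -IComp -IApp ../../../' |
awk -v RS=' ' 'sub(/^-I/, "") { sub(/^\.\.\/\.\.\/\.\.\/Common\//, "Common/Components/"); print }'
```

With the full sample line this prints the Component2 through Component6 entries as well; the trailing bare ../../../ record is skipped because it has no -I prefix.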
I have a CSV file containing data like the following, and I need to get all fields as they are, except the last one.
"one","two","this has comment section1"
"one","two","this has comment section2 and ( anything ) can come here ( ok!!!"
gawk 'BEGIN {FS=",";OFS=","}{sub(FS $NF, x)}1'
gives error-
fatal: Unmatched ( or \(:
I know that removing '(' from the second line solves the problem, but I cannot remove anything from the comment section.
With any awk you could try:
awk 'BEGIN{FS=",";OFS=","}{$NF="";sub(/,$/,"")}1' Input_file
Or with GNU awk try:
awk 'BEGIN{FS=",";OFS=","}NF{--NF};1' Input_file
Since you mention that anything can appear there, you might also have a line that looks like:
"one","two","comment with a , comma"
So it is a bit hard to just use the <comma>-character as a field separator.
The following two posts are now very handy:
What's the most robust way to efficiently parse CSV using awk?
[U&L] How to delete the last column of a file in Linux (Note: this is only for GNU awk)
Since you work with GNU awk, you can thus do any of the following:
$ awk -v FPAT='[^,]*|"[^"]+"' -v OFS="," 'NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\"[^\"]+\"";OFS=","}NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\042[^\042]+\042";OFS=","}NF{NF--}1'
Why is your command failing: the sub(ere, repl, in) function of awk assumes that the first argument ere is an extended regular expression, so the parenthesis has a special meaning there. If the field you want to remove is known and unique, you should not use sub, but just redefine the field (the trailing 1 prints the rebuilt record):
$ awk '{$NF=""}1'
If you want to remove a string equal to a field, treating it as a literal string rather than a regex, you can use index and substr, which do plain string operations (note the signatures: index(haystack, needle) and substr(string, start, length)):
s = $n; while (i = index($0, s)) { $0 = substr($0, 1, i-1) "repl" substr($0, i + length(s)) }
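For example, here is a hypothetical sketch of that literal approach, dropping the last comma-separated field as a plain string so characters like ( are harmless (it assumes no commas appear inside the quoted comment):

```shell
echo '"one","two","this has ( comment"' |
awk -F, '{
  s = "," $NF                    # literal text to strip: comma plus last field
  i = index($0, s)               # index() does a plain string search, no regex
  if (i) $0 = substr($0, 1, i-1) substr($0, i + length(s))
} 1'
# prints: "one","two"
```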
I am using Jira to create a bitbucket branch for releases. As part of the build process I need to extract the release number from the branch name.
An earlier part of the build dumps the whole branch name to a text file. The problem I'm having is removing all the text before the build number.
An example branch name would be:
release/some-jira-ticket-343-X.X.X
Where X.X.X is the release number e.g. 1.11.1 (each of X could be any length integer).
My first thought was to literally just select the last 3 characters with sed; however, as X could be any length, this won't work.
Another post (Removing non-alphanumeric characters with sed) suggested using the sed alpha class. However, this won't work either, as the Jira ticket ID also contains numbers.
Any ideas?
You can remove all characters up to last -:
$ sed 's/.*-//' <<< "release/some-jira-ticket-343-1.11.2"
1.11.2
or with grep, to output only digits and dots at the end of the line:
grep -o '[0-9.]*$'
awk solution,
$ awk -F- '{print $NF}' <<< "release/some-jira-ticket-343-1.11.1"
grep solution,
grep -oP '[0-9]-\K.*' <<< "release/some-jira-ticket-343-1.11.1"
Use the shell's string operators (parameter expansion):
var="release/some-jira-ticket-343-2.155.7"
echo ${var##*-}
prints:
2.155.7
Awk solution:
awk -F [-.] '{ print $5"."$6"."$7 }' <<< "release/some-jira-ticket-343-12.4.7"
12.4.7
Set the field delimiter to - and . and then extract the pieces of data we need.
In the Windows command line I am trying to fix broken lines that occur in a certain field of a "|"-separated log. In some business systems, free-text fields allow users to enter a return, and these sometimes break the record line when the transaction is extracted.
I have GAWK(GNU Awk 3.1.0) and SED(GNU sed version 4.2.1) from UnxUtils and GnuWin. My data is as follows:
smith|Login|command line is my friend|2
oliver|Login|I have no idea
why I am here|10
dennis|Payroll|are we there yet?|100
smith|Time|going to have some fun|200
smith|Logout|here I come|10
The second line is broken for the reason explained in the first paragraph. The return at the end of broken line 2 is a regular Windows line ending and looks like 0x0D 0x0A in a hex editor.
When removing it with sed or gawk, instead of \n- or \r-style notation I would like to be able to use hex values (there is more than one case) for flexibility. The code should replace the return with something only if it appears in the third column. Only sed or (x)awk should be used. For gawk, a sed-style on-the-fly replacement method (as with sed's -i parameter) would be helpful, if possible.
I tried the following, but it does not capture anything:
gawk -F "|" "$3 ~ /\x0D\x0A/" data.txt
I also tried replacing with
gawk -F "|" "{gsub(/\x0d\x0a/, \x20, $3); print }" OFS="|" data.txt
or
sed "s/\x0dx0a/\x20/g" data.txt
(I was able to capture x20 (space) with sed, but had no luck with the returns.)
It's not entirely clear what you're trying to do (why would you want to replace line endings with a blank char?) but this might get you on the right path:
awk -v RS='\r\n' -v ORS=' ' '1' file
and if you want inplace editing just add -i inplace up front.
This is all gawk-only for inplace editing and multi-char RS. You may also need to add -v BINMODE=3 (also gawk-only) depending on the platform you're running on to stop the underlying C primitives from stripping the \rs before gawk sees them.
Hang on, I see you're on gawk 3.1.0 - that is 5+ years out of date; upgrade your gawk version to get access to the latest bug fixes and features (including -i inplace).
Hang on 2 - are you actually trying to replace the newlines within records with a blank char? That's even simpler:
awk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/," ")} 1' file
For example (I added a \s* before the \n, as your input has trailing white space that I assume you also want removed):
$ cat -v file
smith|Login|command line is my friend|2^M
oliver|Login|I have no idea
why I am here|10^M
dennis|Payroll|are we there yet?|100^M
smith|Time|going to have some fun|200^M
smith|Logout|here I come|10^M
$ awk 'BEGIN{RS=ORS="\r\n"} {gsub(/\s*\n/," ")} 1' file | cat -v
smith|Login|command line is my friend|2^M
oliver|Login|I have no idea why I am here|10^M
dennis|Payroll|are we there yet?|100^M
smith|Time|going to have some fun|200^M
smith|Logout|here I come|10^M
or to use UNIX line endings in the output instead of DOS just don't set ORS:
$ awk 'BEGIN{RS="\r\n"} {gsub(/\s*\n/," ")} 1' file | cat -v
smith|Login|command line is my friend|2
oliver|Login|I have no idea why I am here|10
dennis|Payroll|are we there yet?|100
smith|Time|going to have some fun|200
smith|Logout|here I come|10
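And if your awk has no multi-char RS support at all, a portable sketch (plain POSIX awk, with hypothetical sample data inlined) is to buffer lines until one ends in \r, which marks the true end of a record:

```shell
# Join broken pieces with a blank until the real CRLF record end appears.
printf 'smith|Login|ok|2\r\noliver|Login|I have no idea\nwhy I am here|10\r\n' |
awk '{
  buf = buf $0               # accumulate the (possibly broken) record
  if (/\r$/) {               # a trailing \r means the record is complete
    sub(/\r$/, "", buf)      # emit with UNIX line endings
    print buf
    buf = ""
  } else {
    buf = buf " "            # broken line: join with a blank
  }
}'
```

This prints the two sample records joined back together, with UNIX line endings on the output.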
Problem - I have a set of strings that essentially look like this:
|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|...|ZZZZZZZZZ|
The '...' denotes omitted fields.
Please note that the fields between the pipes ('|') can appear in ANY ORDER and not all fields are necessarily present. My task is to find the "XXXXXXX" field and extract it from the string; I can specify that field with a regex and find it with grep/awk/etc., but once I have that one line extracted from the file, I am at a loss as to how to extract just that text between the pipes.
My searches have turned up splitting the line into individual fields and then extracting the Nth field; however, I do not know what N is, and that is the trick.
I've thought of splitting the string by the delimiter, substituting the delimiter with a newline, and piping those lines into a grep for the field, but that involves running another program, and this will be run on a production server through near-TB of data, so I wanted to minimize program invocations. I also cannot copy the files to another machine, nor do I have the benefit of languages like Python, Perl, etc.; I'm stuck with the "standard" UNIX commands on SunOS. I think I'm being punished.
Thanks
As an example, let's extract the field that matches MyField:
Using sed
$ s='|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|12MyField34|ZZZZZZZZZ|'
$ sed -E 's/.*[|]([^|]*MyField[^|]*)[|].*/\1/' <<<"$s"
12MyField34
Using awk
$ awk -F\| -v re="MyField" '{for (i=1;i<=NF;i++) if ($i~re) print $i}' <<<"$s"
12MyField34
Using grep -P
$ grep -Po '(?<=\|)[^|]*MyField[^|]*' <<<"$s"
12MyField34
The -P option requires GNU grep.
$ sed -e 's/^.*|\(XXXXXXXXX\)|.*$/\1/'
Naturally, this only makes sense once XXXXXXXXX is replaced with your actual regular expression.
This should be really fast if used in something like:
$ grep '|XXXXXXXXX|' somefile | sed -e ...
One hackish way:
sed 's/^.*|\(<whatever your regex is>\)|.*$/\1/'
but that might be too slow for your production server since it may involve a fair amount of regex backtracking.