modularize awk script to mask sensitive data in delimited file - bash

I have a delimited file in the below format:
text1|12345|email#email.com|01-01-2020|1
Considering all the fields are sensitive data, I wrote the following awk command to mask the first field with random data:
awk -F'|' -v cmd="strings /dev/urandom | tr -dc '0-9' | fold -w 5" 'BEGIN {OFS=FS} {cmd | getline a;$1=a;print}' source.dat > source_masked.dat
If I want to mask additional fields, I add the following:
awk -F'|' -v cmd1="strings /dev/urandom | tr -dc '0-9' | fold -w 5" -v cmd2="strings /dev/urandom | tr -dc 'A-Za-z0-9' | fold -w 7" 'BEGIN {OFS=FS} {cmd1 | getline a; $1=a; cmd2 | getline b; $2=b; print}' source.dat > source_masked.dat
How do I scale this if I want to mask hundreds of columns with different datatypes?
Basically, I want to take the following from a config file:
column number, datatype, length
and use it in awk to generate the masking commands and the replacement script dynamically.
Could you please advise on the same?
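For example, such a config file (read as properties.conf in the rewrite below) could look like this, one line per column to mask in the form column number,datatype,length; the values here are purely illustrative:
1,number,5
2,alphaNumeric,7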
I rewrote the accepted answer in awk, as the bash version took a long time to mask larger files.
The code is:
function mask(datatype, precision) {
    switch (datatype) {
    case "string":
        command = "strings /dev/urandom | tr -dc '[:alpha:]' | fold -w " precision
        break
    case "alphaNumeric":
        command = "strings /dev/urandom | tr -dc '[:alnum:]' | fold -w " precision
        break
    case "number":
        command = "strings /dev/urandom | tr -dc '[:digit:]' | fold -w " precision
        break
    default:
        command = "strings /dev/urandom | tr -dc '[:alnum:]' | fold -w " precision
    }
    command | getline v
    return v
}
BEGIN {
    while ((getline line < "properties.conf") > 0) {
        split(line, a, ",")
        col = a[1]
        type = a[2]
        len = a[3]
        masks[col] = type " " len
    }
    FS = "|"
    OFS = "|"
}
{
    for (i = 1; i <= NF; i++) {
        if (masks[i] != "") {
            split(masks[i], m, " ")
            $i = mask(m[1], m[2])
        }
    }
    print
}
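Assuming the rewrite above is saved as, say, mask.awk (a name chosen here only for illustration) and properties.conf is in the working directory, it can be run as shown below. Note that switch/case is a GNU awk (gawk) extension.
awk -f mask.awk source.dat > source_masked.dat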

One approach is to read the mask configuration file into an array indexed by column number.
Then, read the data file line by line. Put each field in a second array. Then, for each element of the mask array, randomize the corresponding data field. When all fields are updated, output the new line and move on to the next line.
Does this have to be done in awk? It might be easier/quicker to do it in native bash:
#!/bin/bash
declare mask_file=masks.conf
declare input_file=input.dat
declare output_file=output.dat
declare -a masks=()

function create_mask() {
    # ${1} is type, ${2} is length
    case ${1} in
        string ) ;;
        date   ) ;;
        number ) ;;
        *      ) ;;
    esac
}

while read -r column type length; do
    masks[${column}]="${type} ${length}"
done < "${mask_file}"

IFS='|'   # used by read -a to split input fields and by "${data[*]}" to join output fields
while read -r -a data; do
    for column in "${!masks[@]}"; do
        # masks[column] holds "type length"; split it without relying on IFS
        type=${masks[${column}]%% *}
        length=${masks[${column}]##* }
        data[${column}]=$(create_mask "${type}" "${length}")
    done
    echo "${data[*]}" # Uses IFS as output separator.
done < "${input_file}" > "${output_file}"
I have not included the full contents of the create_mask() function, as I do not know what types you plan to support or the format you want for each type.
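For what it's worth, here is a minimal sketch of what the body could look like, reusing the /dev/urandom + tr + fold approach from the question; the per-type patterns and the GNU date call for the date type are only assumptions:
function create_mask() {
    # ${1} is type, ${2} is length
    case ${1} in
        string ) strings /dev/urandom | tr -dc '[:alpha:]' | fold -w "${2}" | head -n 1 ;;
        date   ) date -d "@$(( RANDOM * RANDOM ))" '+%d-%m-%Y' ;;   # GNU date; emits a random-ish dd-mm-yyyy date
        number ) strings /dev/urandom | tr -dc '[:digit:]' | fold -w "${2}" | head -n 1 ;;
        *      ) strings /dev/urandom | tr -dc '[:alnum:]' | fold -w "${2}" | head -n 1 ;;
    esac
}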

You can use the built-in rand function instead to generate a random number.
Define an associative array with the list of fields that you want to mask.
E.g. here is sample code that will mask fields 1 and 4:
awk -F\| '
BEGIN {
    A_mask_field[1]
    A_mask_field[4]
}
{
    for ( i = 1; i <= NF; i++ )
    {
        if ( i in A_mask_field )
            $i = sprintf( "%d", rand() * length($i) * 100000 )
    }
}
1
' OFS=\| file
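One caveat: without a call to srand(), awk's rand() produces the same sequence of numbers on every run, so repeated runs would mask the file identically. Seeding in the BEGIN block avoids that:
BEGIN {
    srand()          # seed the generator (by default from the current time) so each run differs
    A_mask_field[1]
    A_mask_field[4]
}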


end result of bash command with a dot (.)

I have a bash script that greps and sorts information from /etc/passwd, shown here:
export FT_LINE1=13
export FT_LINE2=23
cat /etc/passwd | grep -v "#" | awk 'NR%2==1' | cut -f1 -d":" | rev | sort -r | awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'NR>=l1 && NR<=l2' | tr '\n' ',' | sed 's/, */, /g'
The result is this list
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_,
How can I replace the last comma with a dot (.)? I want it to look like this:
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_.
You can add:
| sed 's/,$/./'
(where $ means "end of line").
There are way too many pipes in your command; some of them can be removed.
As explained in the comments, cat <FILE> | grep is a bad habit! In general, cat <FILE> | cmd should be replaced by cmd <FILE> or cmd < FILE, depending on what type of arguments your command accepts.
On a file of a few GB, you will already feel the difference.
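For example, simply dropping the useless cat from your pipeline gives the same result with one process fewer:
grep -v "#" /etc/passwd | awk 'NR%2==1' | cut -f1 -d":" | rev | sort -r | awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'NR>=l1 && NR<=l2' | tr '\n' ',' | sed 's/, */, /g'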
This being said, you can do the whole processing without using a single pipe by using awk for example:
awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'function reverse(s){p=""; for(i=length(s); i>0; i--){p=p substr(s,i,1);}return p;}BEGIN{cmp=0; FS=":"; ORS=","}!/#/{cmp++;if(cmp%2==1) a[cmp]=reverse($1);}END{asort(a);for(i=length(a);i>0;i--){if((length(a)-i+1)>=l1 && (length(a)-i)<=l2){if(i==1){ORS=".";}print a[i];}}}' /etc/passwd
Explanations:
# BEGIN rule(s)
BEGIN {
    cmp = 0    # used to count the lines, since NR cannot be used directly
    FS = ":"   # field separator :
    ORS = ","  # output record separator ,
}
# Rule(s)
! /#/ {        # for lines that do not contain this char
    cmp++
    if (cmp % 2 == 1) {
        a[cmp] = reverse($1)  # add the reverse of the first field to an array
    }
}
# END rule(s)
END {
    asort(a)   # sort the array and process it in reverse order
    for (i = length(a); i > 0; i--) {
        # apply your range conditions
        if (length(a) - i + 1 >= l1 && length(a) - i <= l2) {
            if (i == 1) {  # when we reach the last element to print, use a dot instead of the comma
                ORS = "."
            }
            print a[i]     # print the array element
        }
    }
}
# Functions, listed alphabetically
# If the reverse operation is necessary, you can use the following function to reverse your strings.
function reverse(s)
{
    p = ""
    for (i = length(s); i > 0; i--) {
        p = p substr(s, i, 1)
    }
    return p
}
If you don't need to reverse part you can just remove it from the awk script.
In the end, not a single pipe is used!!!

printing contents of variable to a specified line in outputfile with sed/awk

I have been working on a script to concatenate multiple csv files into a single, large csv. The csv's contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname"
Example of a single csv file:
49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA
For my current test I have 25 similar files, with different numbers in the first column.
I am trying to get this script to do the following:
Read each csv file
For each Project it sees, scan the output file to check whether that Project was already printed to the file. If not, print the Projectname.
For each file, for each Project, if the Project was found, print the Size to the output csv.
However, I need the Projects to all be on text line 1, comma separated, so I can use this output file as input for a JavaScript graph. The Sizes should be added in the column below their projectname.
My current script:
csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile
for csv_inputfile in $csv_allfiles; do
    while read line && [[ $line != "" ]]; do
        projectname=$(echo $line | sed 's/^\([^,]*\),//')
        projectfound1=$(cat $csv_outputfile | grep -w $projectname)
        if [[ ! $projectfound1 ]]; then
            textline=1
            sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
            for csv_foundfile in $csv_allfiles; do
                textline=$(echo $textline + 1 | bc )
                projectfound2=$(cat $csv_foundfile | grep -w $projectname)
                projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
                if [[ $projectfound2 ]]; then
                    sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
                fi
            done
        fi
    done < $csv_inputfile
done
My current script finds the right information (projectname, projectdata) and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project. I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line.
I was thinking this should be possible with sed or awk. sed should have a way of inserting text to a specific line with
sed '{n}s/search/replace/'
where {n} is the line to insert to
awk should be able to do the same thing with something like
awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile
However, while replacing the sed commands in the script with
echo $projectname
echo $projectdata
spits out the correct information (so I know my variables are filled correctly), the sed and awk commands tend to spit out the entire contents of their current input csv, not just the line that I want them to.
Pastebin outputs per variant of writing to file
https://pastebin.com/XwxiAqvT - sed output
https://pastebin.com/xfLU6wri - echo, plain output (single column)
https://pastebin.com/wP3BhgY8 - echo, detailed output per variable
https://pastebin.com/5wiuq53n - desired output
As you see, the sed output tends to paste the whole contents of the input csv, making the loop stop after one iteration (since it then finds the other Projects after one loop).
So my question is one of these:
How do I make sed/awk behave the way I want it to, i.e. print only the info in my variable to the current text line instead of the whole input csv? Is sed capable of this, printing just one line of variable? Or
Should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them to? (Bear in mind that more .csv files will be added in the future, I can't just make it loop x times to sort the info)
Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?
Any help would be very much appreciated.
A way to accomplish this transposition is to save the data to an associative array.
In the following example, we use a two-dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into our data. We also create a row array which we increment whenever we see new data for the current column. The row number is our second index into data. At the end, we print out all the records.
#! /usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ", "
    rows = 0
    cols = 0
    head = ""
    split("", data)
    split("", row)
    split("", col)
}
!($2 in col) { # new project
    if (head == "")
        head = $2
    else
        head = head OFS $2
    i = col[$2] = cols++
    row[i] = 0
}
{
    i = col[$2]
    j = row[i]++
    data[i,j] = $1
    if (j > rows)
        rows = j
}
END {
    print head
    for (j = 0; j <= rows; ++j) {
        if ((0,j) in data)
            x = data[0,j]
        else
            x = ""
        for (i = 1; i < cols; ++i) {
            if ((i,j) in data)
                x = x OFS data[i,j]
            else
                x = x OFS
        }
        print x
    }
}
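If this script is saved as, say, transpose.awk (a name chosen here only for illustration) and made executable, it can be run over all the csv files at once:
chmod +x transpose.awk
./transpose.awk *.csv > combined.csv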
As a bonus, here is a script to reproduce the detailed output from one of your pastebins.
#! /usr/bin/awk -f
BEGIN {
    FS = ","
    split("", data) # accumulated data for a project
    split("", line) # keep track of textline for data
    split("", idx)  # index into above to maintain input order
    sz = 0
}
$2 in idx { # have seen this projectname
    i = idx[$2]
    x = ORS "textline = " ++line[i]
    x = x ORS "textdata = " $1
    data[i] = data[i] x
    next
}
{ # new projectname
    i = sz++
    idx[$2] = i
    x = "textline = 1"
    x = x ORS "projectname = " $2
    x = x ORS "textline = 2"
    x = x ORS "projectdata = " $1
    data[i] = x
    line[i] = 2
}
END {
    for (i = 0; i < sz; ++i)
        print data[i]
}
Fill parray with project names and array with values, then print them with bash printf. You can choose the column width in the printf command (currently 13 characters: %13s).
#!/bin/bash
declare -i index=0
declare -i pindex=0
declare -a parray=()   # project names
declare -A array=()    # values; associative so the "pindex,index" subscript is a real key, not arithmetic

while read project; do
    parray[$pindex]=$project
    index=0
    while read; do
        array[$pindex,$index]="$REPLY"
        index+=1
    done <<< $(grep -h "$project" *.csv | cut -d, -f1)
    pindex+=1
done <<< $(cat *.csv | cut -d, -f 2 | sort -u)
maxi=$index
maxp=$pindex

for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
    STR="%13s $STR"
    VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL

for (( index=0; $index < $maxi; index+=1 )); do
    STR=""; VAL=""
    for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
        STR="%13s $STR"
        VAL="$VAL ${array[$pindex,$index]}"
    done
    printf "$STR\n" $VAL
done
If you are OK with the output being sorted by name, this one-liner might be of use:
awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq
The files have to be in the same directory; if not, a list of files replaces the *. First it exchanges the two fields. awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.
The simple one-liner above gives you one line for each project. If you really want to do it all in awk and have awk write the two lines, then the following would be needed. There is a second awk at the end that accumulates each column entry in an array and then spits it out at the end:
awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | awk 'BEGIN {n=0}
{p[n]=$1;s[n++]=$2}
END {for (i=0;i<n;i++) printf "%s,",p[i];print "";
for (i=0;i<n;i++) printf "%s,",s[i];print ""}'
If you have the rs utility then this can be simplified to
awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | rs -C',' -T

I'm trying to use tr with multiple sets and not sure how

I have used:
tr -dc [:alpha:] < $fileDoc | wc -c
to count all letters,
tr -dc ' ' < $fileDoc | wc -c
to count all spaces,
tr -dc '\n' < $fileDoc | wc -c
to count all new lines in a text document.
What I would like to do now is count all the other characters in the document, which I will call everything else.
Here is the text from the document:
Hello this is a test text document.
123
!##
Is there a way to delete every [:alpha:], space, and \n found and count the remaining characters?
This should do the trick
tr -d '[:alpha:] \n' < $fileDoc | wc -c
Or perhaps if you want to include tabs in the definition of blanks
tr -d '[:alpha:][:space:]' < $fileDoc | wc -c
Based on the OP's comment, to delete alphabetic characters, spaces, digits, and newlines and count all remaining characters:
tr -d '[:alnum:][:space:]' < $fileDoc | wc -c
[:alnum:] accounts for letters of the alphabet and digits. [:space:] takes care of all whitespace, including newlines.
Just posting here for reference: if you wish to do it all in one shot, then this awk script should work:
awk -v FS='' '
{
for(i=1; i<=NF; i++) {
if($i ~ /[a-zA-Z]/) {alpha++};
if($i == " ") {space++};
if($i !~ /[A-Za-z0-9 ]/) {spl++}
}
}
END {
printf "Space=%s, Alphabets=%s, SplChars=%s, NewLines=%s\n", space, alpha, spl, NR
}' file
$ cat file
This is a text
I want to count
alot of $tuff
in 1 single shot
$ awk -v FS='' '
{
for(i=1; i<=NF; i++) {
if($i ~ /[a-zA-Z]/) {alpha++};
if($i == " ") {space++};
if($i !~ /[A-Za-z0-9 ]/) {spl++}
}
}
END {
printf "Space=%s, Alphabets=%s, SplChars=%s, NewLines=%s\n", space, alpha, spl, NR
}' file
Space=11, Alphabets=45, SplChars=1, NewLines=4

Switching the format of this output?

I have this script written to print the distribution of words in one or more files:
cat "$#" | tr -cs '[:alpha:]' '\n' |
tr '[:upper:]' '[:lower:]' | sort |
uniq -c | sort -n
Which gives me an output such as:
1 the
4 orange
17 cat
However, I would like to change it so that the word is listed first (I'm assuming sort would be involved so it's alphabetical), not the number, like so:
cat 17
orange 4
the 1
Is there just a simple option I would need to switch this? Or is it something more complicated?
Pipe the output to
awk '{print $2, $1}'
or you can use awk for the complete task:
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
usage:
awk -f wordfreq.awk input
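Note that for (word in freq) visits the words in no particular order; to get the alphabetical listing shown in the question, sort the result:
awk -f wordfreq.awk input | sort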

How to sort the columns of a CSV file by the ratio of two columns?

I have a CSV file like this:
bear,1,2
fish,3,4
cats,1,5
mice,3,3
I want to sort it, from highest to lowest, by the ratio of columns 2 and 3. E.g.:
bear,1,2 # 1/2 = 0.5
fish,3,4 # 3/4 = 0.75
cats,1,5 # 1/5 = 0.2
mice,3,3 # 3/3 = 1
This would be sorted like this:
mice,3,3
fish,3,4
bear,1,2
cats,1,5
How can I sort the rows from highest to lowest by the ratio of the two numbers in columns 2 and 3?
awk 'BEGIN { FS = OFS = ","} {$4 = $2/$3; print}' inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
or using GNU AWK (gawk):
awk -F, '{a[$3/$2] = $3/$2; b[$3/$2] = $0} END {c = asort(a); for (i = 1; i <= c; i++) print b[a[i]]}' inputfile
The methods above are better than the following, but this is more efficient than another answer which uses Bash and various utilities:
while IFS=, read animal dividend divisor
do
quotient=$(echo "scale=4; $dividend/$divisor" | bc)
echo "$animal,$dividend,$divisor,$quotient"
done < inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
As a one-liner:
while IFS=, read animal dividend divisor; do quotient=$(echo "scale=4; $dividend/$divisor" | bc); echo "$animal,$dividend,$divisor,$quotient"; done < inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
Why not just create another column that holds the ratio of the second and third columns and then sort on that column?
bash is not meant for stuff like that - pick your own favorite programming language, and do it there.
If you insist... here is an example:
a=( `cut -d "," -f 2 mat.csv` ); b=( `cut -d "," -f 3 mat.csv` );for i in {0..3};do (echo -n `head -n $((i+1)) mat.csv|tail -1`" "; echo "scale=4;${a[i]}/${b[i]}"|bc) ;done|sort -k 2 -r
Modify the filename and the loop length ({0..3}) to match your file.
