script to find similar email users - bash

We have a mail server and I am trying to write a script that will find all users with similar names, to prevent malicious users from impersonating legitimate users. For example, a legit user may have the name james2014@domain.com but a malicious user may register as james20l4@domain.com. The difference, if you look carefully, is that I replaced the number 'one' with the letter 'l' (el). So I am trying to write something that can consult my /var/vmail/domain/* and find similar names and alert me (the administrator). I will then take the necessary steps to do what I need. Really appreciate any help.

One hacky way to do this is to derive "normalized" versions of your usernames, put those in an associative array as keys mapping to the original input, and use those to find problems.
The example I posted below uses bash associative arrays to store the mapping from normalized name to original name, and tr to switch some characters for other characters (and delete other characters entirely).
I'm assuming that your list of users will fit into memory; you'll also need to tweak the mapping of modified and removed characters to hit your favorite balance between effectiveness and false positives. If your list can't fit in memory, you can use a single file or the filesystem to approximate it, but honestly if you're processing that many names you're probably better off with a non-shell programming language.
Input:
doc
dopey
james2014
happy
bashful
grumpy
james20l4
sleepy
james.2014
sneezy
Script:
#!/bin/bash
# stdin: A list of usernames. stdout: Pairs of names that match.
CHARS_TO_REMOVE="._\\- "
CHARS_TO_MAP_FROM="OISZql"
CHARS_TO_MAP_TO="0152g1"
normalize() {
    # stdin: A word. stdout: A modified version of the same word.
    tr "$CHARS_TO_MAP_FROM" "$CHARS_TO_MAP_TO" \
        | tr -d "$CHARS_TO_REMOVE" \
        | tr "A-Z" "a-z"
}
declare -A NORMALIZED_NAMES
# IFS= and -r keep each input line exactly as written.
while IFS= read -r NAME; do
    NORMALIZED_NAME=$(normalize <<< "$NAME")
    # -n tests for non-empty strings, as it would be if the name were set already.
    if [[ -n ${NORMALIZED_NAMES[$NORMALIZED_NAME]} ]]; then
        # This name has been seen before! Print both of them.
        echo "${NORMALIZED_NAMES[$NORMALIZED_NAME]} $NAME"
    else
        # This name has not been seen before. Store it.
        NORMALIZED_NAMES["$NORMALIZED_NAME"]="$NAME"
    fi
done
Output:
james2014 james20l4
james2014 james.2014
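To feed the script from the mail store instead of a hand-written list, the mailbox directory names can be printed one per line. This is only a minimal sketch, assuming each user has one directory directly under /var/vmail/domain; the function name list_mail_users is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the name of every directory directly under
# the given mail root, one per line. Assumes one directory per user.
list_mail_users() {
    local root=$1 dir
    for dir in "$root"/*/; do
        [[ -d $dir ]] || continue   # skip the literal pattern if nothing matched
        dir=${dir%/}                # drop the trailing slash added by the glob
        printf '%s\n' "${dir##*/}"  # keep only the last path component
    done
}
```

If the script above were saved as, say, find_similar.sh (an assumed name), it could then be run as list_mail_users /var/vmail/domain | ./find_similar.sh.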

Related

Sending script and file content via STDIN

I generate (dynamically) a script concatenating the following files:
testscript1
echo Writing File
cat > /tmp/test_file <<EOF
testcontent
line1
second line
testscript2
EOF
echo File is written
And I execute by calling
$ cat testscript1 testcontent testscript2 | ssh remote_host bash -s --
The effect is that the file /tmp/test_file is filled with the desired content.
Is there a variant in which binary files can be supplied in a similar fashion? Instead of cat, dd or other tools could of course be used, but the problem I see is 'telling' them that STDIN has now ended (can I send ^D through that stream?)
I am not able to get my head around that problem, but there is likely no comparable solution. However, I might be wrong, so I'd be happy to hear from you.
Regards,
Mazze
can I send ^D through that stream
Yes but you don't want to.
Control+D, commonly notated ^D, is just a character -- or to be pedantic (as I often am), a codepoint in the usual character code (ASCII or a superset like UTF-8) that we treat as a character. You can send that character/byte by a number of methods, most simply printf '\004', but the receiving system won't treat it as end-of-file; it will instead be stored in the destination file, just like any other data byte, followed by the subsequent data that you meant to be a new command and file etc.
^D only causes end-of-file when input from a terminal (more exactly, a 'tty' device) -- and then only in 'cooked' mode (which is why programs like vi and less can do things very different from ending a file when you type ^D). The form of ssh you used doesn't make the input a 'tty' device. ssh can make the input (and output) a 'tty' (more exactly a subclass of 'tty' called a pseudo-tty or 'pty', but that doesn't matter here) if you add the -t option (in some situations you may need to repeat it as -t -t or -tt). But then if your binary file contains any byte with the value \004 -- or several other special values -- which is quite possible, then your data will be corrupted and garbage commands executed (sometimes), which definitely won't do what you want and may damage your system.
The traditional approach to what you are trying to do, back in the 1980s and 1990s, was 'shar' (shell archive) and the usual solution to handling binary data was 'uuencode', which converts binary data into only printable characters that could safely go through a link like this, matched by 'uudecode' which converts it back. See this surviving example from GNU. uuencode and uudecode themselves were part of a communication protocol 'uucp' used mostly for email and Usenet, which are (all) mostly obsolete and forgotten.
However, nearly all systems today contain a 'base64' program which provides equivalent (though not identical) functionality. Within a single system you can do:
base64 <infile | base64 -d >outfile
to get the same effect as cp infile outfile. In your case you can do something like:
{ echo "base64 -d <<END# >outfile"; base64 <infile; echo "END#"; otherstuff; } | ssh remote bash
You can also try:
cat testscript1 testcontent testscript2 | base64 | ssh <options> "base64 --decode | bash"
Don't worry about ^D, because when your input is exhausted, the next processes of the pipeline will notice that they have reached the end of the input file.
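The here-document variant can be wrapped in a small generator. This is a rough sketch rather than a definitive implementation: the function name emit_file_script and the delimiter END_OF_DATA are made up for illustration, and it assumes a base64 program that accepts -d on the receiving side. The delimiter cannot appear in the payload because base64 output never contains an underscore.

```bash
#!/bin/bash
# Hypothetical helper: print a shell fragment that recreates "infile"
# as "outfile" on whatever shell reads the fragment.
emit_file_script() {
    local infile=$1 outfile=$2
    # %q quotes the target name so the receiving bash reads it back intact.
    printf "base64 -d > %q <<'END_OF_DATA'\n" "$outfile"
    base64 < "$infile"          # the payload: printable characters only
    printf 'END_OF_DATA\n'      # the here-document ends here, no ^D needed
}
```

A possible use, mirroring the original pipeline: { echo 'echo Writing File'; emit_file_script infile /tmp/test_file; echo 'echo File is written'; } | ssh remote_host bash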

(bash) What is the least redundant way to systematically apply changes to an array of variables?

My goal is to check a list of file paths if they end in "/" and remove it if that is the case.
Ideally I would like to change the original FILEPATH variables to reflect this change, and I'd like this to work for a long list without unnecessary redundancy. I tried doing it as a loop, but the changes didn't alter the original variables, it just changed the iterating "EACH_PATH" variable. Can anyone think of a better way to do this?
Here is my code:
FILEPATH1="filepath1/file1"
FILEPATH2="filepath2/file2/"
PATH_ARRAY=(${FILEPATH1} ${FILEPATH2})
echo ${PATH_ARRAY[@]}
for EACH_PATH in ${PATH_ARRAY[@]}
do
if [ "${EACH_PATH:$((${#EACH_PATH}-1)):${#EACH_PATH}}"=="/" ]
then EACH_PATH=${EACH_PATH:0:$((${#EACH_PATH}-1))}
fi
done
edit: I'm happy to do this in a totally different way and scrap the code above, I just want to know the most elegant way to do this.
I'm not entirely clear on the actual goal here, but depending on the situation I can see several possible solutions. The best (if it'll work in the situation) is to dispense with the individual variables, and just use array entries. For example, you could use:
declare -a filepath
filepath[1]="filepath1/file1"
filepath[2]="filepath2/file2/"
for index in "${!filepath[@]}"; do
    if [[ "${filepath[index]}" = *?/ ]]; then
        filepath[index]="${filepath[index]%/}"
    fi
done
...and then use "${filepath[x]}" instead of "$FILEPATHx" throughout. Some notes:
I've used lowercase names. It's generally best to avoid all-caps names, since there are a lot of them with special functions, and accidentally using one of those names can cause trouble.
"${!filepath[@]}" gets a list of the indexes of the array (in this case, "1" "2") rather than their values; this is necessary so we can set the entries rather than just look at them.
I changed the logic of the slash-trimming test -- it uses [[ = ]] to do pattern matching, to see if the entry ends with "/" and has at least one character before that (i.e. it isn't just "/", 'cause you don't want to trim that). Then it uses %/ in the expansion to trim just the "/" from the end of the value.
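The trimming test described in the notes can be isolated in a small function to see how it treats a bare "/". This is an illustrative sketch; the function name trim_trailing_slash is not from the answer.

```bash
#!/bin/bash
# Hypothetical helper: print the argument without a trailing slash,
# leaving a bare "/" untouched.
trim_trailing_slash() {
    local path=$1
    # *?/ requires at least one character before the final slash.
    if [[ $path = *?/ ]]; then
        path=${path%/}   # %/ removes exactly one trailing slash
    fi
    printf '%s\n' "$path"
}

trim_trailing_slash "filepath2/file2/"   # prints filepath2/file2
trim_trailing_slash "/"                  # prints /
```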
If a numerically-indexed array won't work (and you have at least bash version 4), how about a string-indexed ("associative") array? It's very similar, but use declare -A and use $ on variables in the index (and generally quote them). Something like this:
declare -A filepath
filepath["foo"]="filepath1/file1"
filepath["bar"]="filepath2/file2/"
for index in "${!filepath[@]}"; do
    if [[ "${filepath["$index"]}" = *?/ ]]; then
        filepath["$index"]="${filepath["$index"]%/}"
    fi
done
If you really need separate variables instead of array entries, you might be able to use an array of variable names, and indirect variable references. How this works varies quite a bit between different shells, and can easily be unsafe depending on what's in your data (in this case, specifically what's in path_array). Here's a way to do it in bash:
filepath1="filepath1/file1"
filepath2="filepath2/file2/"
path_array=(filepath1 filepath2)
for varname in "${path_array[@]}"; do
    if [[ "${!varname}" = *?/ ]]; then
        declare "$varname=${!varname%/}"
    fi
done
Using sed
PATH_ARRAY=($(echo ${PATH_ARRAY[@]} | sed 's#/ # #g;s#/$##g'))
Demo:
$ FILEPATH1="filepath1/file1"
$ FILEPATH2="filepath2/file2/"
$ PATH_ARRAY=(${FILEPATH1} ${FILEPATH2})
$ echo ${PATH_ARRAY[@]}
filepath1/file1 filepath2/file2/
$ PATH_ARRAY=($(echo ${PATH_ARRAY[@]} | sed 's#/ # #g;s#/$##g'))
$ echo ${PATH_ARRAY[@]}
filepath1/file1 filepath2/file2
$
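Bash can also apply a suffix-removal expansion to every array element at once, without calling sed. A minimal sketch follows; the third path is an invented example to show that elements containing spaces survive. Note that, unlike the *?/ test used earlier, this would also turn a bare "/" into an empty string.

```bash
#!/bin/bash
path_array=( "filepath1/file1" "filepath2/file2/" "dir with space/" )
# %/ is applied to each element in turn; the quotes keep every
# element as a single word even when it contains spaces.
path_array=( "${path_array[@]%/}" )
printf '%s\n' "${path_array[@]}"
```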

How do I move files into folders with similar names in Unix?

I'm sorry if this question has been asked before, I just didn't know how to word it as a search query.
I have a set of folders that look like this:
Brain - Amygdala/
Brain - Caudate (basal ganglia)/
Brain - Spinal cord (cervical c-1)/
Lung/
Skin - Sun Exposed (Lower leg)/
Whole Blood/
I also have a set of files that look like this:
Brain_Amygdala.v7.covariates_output.txt
Brain_Caudate_basal_ganglia.v7.covariates_output.txt
Brain_Spinal_cord_cervical_c-1.v7.covariates_output.txt
Skin_Not_Sun_Exposed_Suprapubic.v7.covariates_output.txt
Skin_Sun_Exposed_Lower_leg.v7.covariates_output.txt
Whole_Blood.v7.covariates_output.txt
As you can see, the files do not perfectly match up with the directories in their names. For example, Brain_Amygdala.v7.covariates_output.txt is not totally identical to Brain - Amygdala/. Even if we were to excise the tissue name from the covariates file, Brain_Amygdala is formatted differently from its corresponding folder.
Same with Whole Blood/. It is different from Whole_Blood.v7.covariates_output.txt, even if you were to isolate the tissue name from the covariates file Whole_Blood.
What I want to do, however, is to move each of these tissue files to its corresponding folder. If you notice, the covariate files are named after the tissue leading up to the first dot . in the file name. The words are separated by underscores _. How I was thinking about approaching this was to break up the first few words leading up to the first . of the file name so that I can easily move it to its corresponding folder.
e.g.
Brain_Amygdala.v7.covariates_output.txt -> Brain*Amygdala [mv]-> Brain*Amygdala/
a) I'm not sure how to isolate the first words of a file name leading up to the first . in a filename
b) if I were to do that, I don't know how to insert a wildcard in between each word and match that to the corresponding folder.
However, I am completely open to other ways of doing something like this.
Not a full answer, but it should address some of your concerns:
a) to isolate the first word of a string, leading up to the first .: use Parameter Expansions
string=Brain_Amygdala.v7.covariates_output.txt
until_dot=${string%%.*}
echo "$until_dot"
will output Brain_Amygdala (which we saved in the variable until_dot).
b) You may want to use the ${parameter/pattern/string} parameter expansion:
# Replace all non-alphabetic characters by the glob *
glob_pattern=${until_dot//[^[:alpha:]]/*}
echo "$glob_pattern"
will output (with the same variables as above) Brain*Amygdala
c) To use all of this: it's probably a good idea to determine the possible targets first, and do some basic checks:
# Use nullglob to have non matching glob expand to nothing
shopt -s nullglob
# DO NOT USE QUOTES IN THE FOLLOWING EXPANSION:
# the variable is actually a glob!
# Could also do dirs=( $glob_pattern*/ ) to check if directory
dirs=( $glob_pattern/ )
# Now check how many matches there are:
if ((${#dirs[@]} == 0)); then
    echo >&2 "No matches for $glob_pattern"
elif ((${#dirs[@]} > 1)); then
    echo >&2 "More than one match for $glob_pattern: ${dirs[@]}"
else
    echo "All good!"
    # Remove the echo to actually perform the move
    echo mv "$string" "${dirs[0]}"
fi
I don't know how your data will effectively conform to these, but I hope this answer actually answers some of your questions! (and to learn more about parameter expansions, do read — and experiment with — the link to the reference I gave you).
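Steps a) to c) can be combined into one function. This is a sketch under stated assumptions: the name find_target_dir is hypothetical, the file name is given without a leading path, and the directories live in the current directory. It uses the $glob_pattern*/ variant so that a trailing ")" in a folder name does not prevent a match.

```bash
#!/bin/bash
shopt -s nullglob   # a pattern that matches nothing expands to nothing

# Hypothetical helper: print the one directory that matches a covariates
# file name; return non-zero when there is no match or more than one.
find_target_dir() {
    local file=$1
    local until_dot=${file%%.*}                      # text before the first dot
    local glob_pattern=${until_dot//[^[:alpha:]]/*}  # non-letters become *
    local -a dirs
    dirs=( $glob_pattern*/ )                         # unquoted on purpose: it is a glob
    (( ${#dirs[@]} == 1 )) || return 1
    printf '%s\n' "${dirs[0]}"
}
```

A loop such as for f in *.covariates_output.txt; do d=$(find_target_dir "$f") && echo mv "$f" "$d"; done would then print the intended moves; removing the echo would perform them.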

Bash scripting print list of files

It's my first time using Bash scripting. I've been looking at some tutorials but can't figure out some of the code. I just want to list all the files in a folder, but I can't do it.
Here's my code so far.
#!/bin/bash
# My first script
echo "Printing files..."
FILES="/Bash/sample/*"
for f in $FILES
do
echo "this is $f"
done
and here is my output..
Printing files...
this is /Bash/sample/*
What is wrong with my code?
You misunderstood what bash means by the word "in". The statement for f in $FILES simply iterates over (space-delimited) words in the string $FILES, whose value is "/Bash/sample/*" (one word). You seemingly want the files that are "in" the named directory, a spatial metaphor that bash's syntax doesn't assume, so you would have to explicitly tell it to list the files.
for f in `ls $FILES` # illustrates the problem - but don't actually do this (see below)
...
might do it. This converts the output of the ls command into a string, "in" which there will be one word per file.
NB: this example is to help understand what "in" means but is not a good general solution. It will run into trouble as soon as one of the files has a space in its name—such files will contribute two or more words to the list, each of which taken alone may not be a valid filename. This highlights (a) that you should always take extra steps to program around the whitespace problem in bash and similar shells, and (b) that you should avoid spaces in your own file and directory names, because you'll come across plenty of otherwise useful third-party scripts and utilities that have not made the effort to comply with (a). Unfortunately, proper compliance can often lead to quite obfuscated syntax in bash.
I think the problem is in the path "/Bash/sample/*".
You need to change this location to an absolute path that actually exists, for example:
/home/username/Bash/sample/*
Or use a path relative to your home directory, for example:
~/Bash/sample/*
On most systems this is fully equivalent to:
/home/username/Bash/sample/*
where username is your current username; use whoami to see your current username.
Best place for learning Bash: http://www.tldp.org/LDP/abs/html/index.html
This should work:
echo "Printing files..."
FILES=(/Bash/sample/*) # create an array.
                       # Works with filenames containing spaces.
                       # String variable does not work for that case.
for f in "${FILES[@]}" # iterate over the array.
do
    echo "this is $f"
done
Also, you should not parse ls output.
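By default bash leaves a pattern that matches nothing unexpanded, which is one way to end up with a literal /Bash/sample/* in the output. The sketch below, with a hypothetical function name list_files, combines the array approach with the nullglob option so that an empty or missing directory is reported explicitly.

```bash
#!/bin/bash
# Print every entry of a directory, or a notice when the glob matches nothing.
list_files() {
    local dir=$1 f
    local -a files
    shopt -s nullglob      # an unmatched pattern now expands to nothing
    files=( "$dir"/* )
    shopt -u nullglob      # switch back to bash's default behaviour
    if (( ${#files[@]} == 0 )); then
        echo "no files match $dir/*"
        return 1
    fi
    for f in "${files[@]}"; do
        echo "this is $f"
    done
}
```

Calling list_files /Bash/sample would then either print one line per file or say that nothing matched.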
Take a list of your files
If you want to list your files and view them:
ls ###Lists files###
ls -sh ###Lists files with their sizes###
...
If you want to save the list of files to a file so you can read and check it later:
ls > FileName.Format ###Lists files and writes the list to a file###
ls -sh > FileName.Format ###Lists files with their sizes and writes the list to a file###

Bash/batch multiple file, single folder, incremental rename script; user provided filename prefix parameter

I have a folder of files which need to be renamed.
Instead of a simple incremental numeric rename function, I need to first provide a naming convention which will then increment, in order to ensure file name integrity within the folder.
Say I have files:
wei12346.txt
wifr5678.txt
dkgj5678.txt
which need to be renamed to:
Eac-345-018.txt
Eac-345-019.txt
Eac-345-020.txt
Each time I run the script the naming could be different, and the numeric increment that goes along with it may also be different:
Ebc-345-010.pdf
Ebc-345-011.pdf
Ebc-345-012.pdf
So I need to ask the user for a parameter. I was thinking it might be useful to supply the previous file name in the list of files to be indexed, e.g. Eac-345-017.txt
The other thing I am unsure about with the increment is how the script would deal with incrementing 099 to 100 or 999 to 1000, as I am not aware of how this process is carried out.
I have been told that this is an easy script in Perl; however, I am running Cygwin on a Windows machine at work and have access only to bash and Windows shells to execute the script.
Any pointers to get me going would be greatly appreciated. I have some experience programming, but scripting is almost entirely new to me.
Thanks,
Craig
(I understand there are a lot of posts on this type of thing already, but none seem to offer a concise answer, hence my question.)
#!/bin/bash
# Usage: fancy_rename.sh PREFIX START STEP FILE...
prefix="$1"
shift
base_n="$1"
shift
step="$1"
shift
n=$base_n
for file in "$@" ; do
    formatted_n=$(printf "%03d" "$n")
    # re-use original file extension while we're at it.
    mv "$file" "${prefix}-${formatted_n}.${file##*.}"
    n=$((n + step))
done
Save the file, invoke it like this:
bash fancy_rename.sh Ebc-345 10 1 /path/to/files/*
Note: In your example you "renamed" a .txt to a .pdf, but above I presumed the extension would stay the same. If you really wanted to just change the extension then it would be a trivial change. If you wanted to actually convert the file format then it would be a little more complex.
Note also that I have formatted the incrementing number with %03d. This means that your number sequence will be e.g.
010
011
012
...
099
100
101
...
999
1000
Meaning that it will be zero padded to three places but will automatically overflow if the number is larger. If you prefer consistency (always 4 digits) you should change the padding to %04d.
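The question also suggests passing the previous file name (e.g. Eac-345-017.txt) instead of a separate prefix and start number. One possible sketch of that idea follows, assuming names of the form PREFIX-NUMBER.EXT; the function name next_name is made up. The 10# prefix forces base 10, since bash arithmetic would otherwise read a number with a leading zero such as 017 as octal.

```bash
#!/bin/bash
# Hypothetical helper: given the previous name in a sequence, print the next.
# Assumes the form PREFIX-NUMBER.EXT, e.g. Eac-345-017.txt
next_name() {
    local prev=$1
    local ext=${prev##*.}     # txt
    local stem=${prev%.*}     # Eac-345-017
    local prefix=${stem%-*}   # Eac-345
    local num=${stem##*-}     # 017
    local width=${#num}       # keep the same zero padding as the input
    printf '%s-%0*d.%s\n' "$prefix" "$width" "$((10#$num + 1))" "$ext"
}

next_name Eac-345-017.txt   # prints Eac-345-018.txt
next_name Eac-345-099.txt   # prints Eac-345-100.txt
next_name Eac-345-999.txt   # prints Eac-345-1000.txt
```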
OK, you can do the following. Ask the user first for the prefix and then for the starting sequence number. Then use bash's built-in printf to format the numbers correctly. You may want to choose a field width large enough to hold the whole sequence, because this will result in more homogeneous names. You can use read to read user input:
echo -n "Insert the prefix: "
read prefix
echo -n "Insert the sequence number: "
read sn
for i in * ; do
    fp=`printf %04d $sn`
    mv "$i" "$prefix-$fp.txt"
    sn=`expr $sn + 1`
done
Note: You can also extract the extension; that wouldn't be a problem. Also, here I chose 4 digits for the sequence number, which is formatted into the variable $fp.

Resources