Generate ID number from a name in bash - bash

Currently I have a bunch of names that are tied to numbers, for example:
Joe Bloggs - 17
John Smith - 23
Paul Smith - 24
Joe Bloggs - 32
Using the name and the number I'd like to generate a random/unique ID made of 4 digits followed by the initial number.
So for example, Joe Bloggs and 17 would make something random/unique like: xxxx17.
Is this possible in bash? Would it be better in some other language?
This would be used on debian and darwin based systems.

It is impossible to ensure that a 4-digit hash (checksum) will be unique for a set of 10-character names.
As an alternative, you can try
file="./somefile"
paste -d"\0\n" <(seq -f "%04g" 9999 | sort -R | head -$(grep -c '' "$file")) <(grep -oP '\d+' "$file")
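or, split across lines for better readability (gsort is GNU sort as installed by coreutils on macOS; plain sort works wherever it supports -R)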
paste -d"\0\n" <(
seq -f "%04g" 9999 | gsort -R | head -$(grep -c '' "$file")
) <(
grep -oP '\d+' "$file"
)
for your input produces something like:
010817
161523
748024
269032
All lines are in the form RRRRXX, where:
the RRRR is a guaranteed-unique random number (in the range 0001 to 9999)
the XX is the number from your input
decomposition:
seq produces 9999 4-digit numbers (ofc, each number is unique)
sort -R sorts the lines in random order (based on their hash, so get unique random numbers)
head - from the random list show only first N lines, where the N is the number of lines in your file,
the number of lines is counted by grep -c '' (better than wc -l)
the grep -oP filters the numbers from your file
finally the paste combines the two inputs to the final output
the <(..) <(..) is process substitution
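Since the question mentions Darwin: grep -P is a GNU extension that the stock BSD grep lacks, so a variant that sticks to -E may travel better. A minimal sketch, assuming every input line ends in its number and that the local sort supports -R; make_ids is a name invented here for illustration:

```shell
#!/usr/bin/env bash
# make_ids FILE: print RRRRXX for every line of FILE (hypothetical helper).
make_ids() {
    local file=$1 n
    n=$(grep -c '' "$file")                 # number of input lines
    paste -d '\0' \
        <(seq -f '%04g' 1 9999 | sort -R | head -n "$n") \
        <(grep -oE '[0-9]+$' "$file")       # trailing number of each line
}
```

It would be called as make_ids ./somefile. The prefixes differ on every run, so the mapping has to be saved if it must stay stable.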

Once you add the number, each name is already unique unless there are two entries for Joe Bloggs 17. In your data there are two Joe Bloggs, one with 17 and one with 32, and "Joe Bloggs 17" and "Joe Bloggs 32" are not the same. So you can simply assign a number to each name + number pair and remember it in an associative array (dictionary). There is no need for it to be random: when you meet a pair that isn't already in the dictionary, increment a counter and associate the new value with that pair. If uniqueness is the only goal, then you are in good shape for up to 10,000 people.
Python is a great language for this, but you can make associative arrays in BASH too.
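A minimal sketch of that counter-plus-dictionary idea in bash. It assumes bash 4 or later (the bash 3.2 shipped with macOS has no associative arrays) and input lines of the form "name - number"; assign_id, assign_all, id_for and assigned_id are names made up for this example:

```shell
#!/usr/bin/env bash
# Requires bash 4+ (associative arrays).
declare -A id_for    # "name - number" -> assigned 4-digit prefix
next=1               # next unused prefix

# assign_id "Joe Bloggs - 17": sets the global assigned_id (hypothetical helper).
assign_id() {
    local key=$1 num=${1##* }
    if [[ -z ${id_for[$key]+set} ]]; then
        id_for[$key]=$(printf '%04d' "$next")   # first time this pair is seen
        next=$((next + 1))
    fi
    assigned_id=${id_for[$key]}$num
}

# assign_all: read "name - number" lines from stdin and print the mapping.
assign_all() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        assign_id "$line"
        printf '%s => %s\n' "$line" "$assigned_id"
    done
}
```

Running assign_all < names.txt (a hypothetical file name) prints one "line => ID" pair per input line. The result is returned through a global rather than via command substitution because a subshell would throw away the updates to id_for and next.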

You can get very close to doing exactly what you want using the random string generated by $(date +%N) and then selecting 4 digits to use as the first four characters in the new ID. You can choose from the beginning if you want IDs that are closer together, or from the middle of the string for more randomness. After selecting your random 4, just keep track of the ones used in an array and check against the array as each new ID is assigned. This overhead is negligible for 10,000 or so IDs:
#!/bin/bash
declare -a used4=0 # array to hold IDs you have assigned
declare -i dupid=0 # a flag to prompt regeneration in case of a dup
while read -r line || [ -n "$line" ]; do
name=${line% -*}
id2=${line##* }
while [ $dupid -eq 0 ]; do
ns=$(date +%N) # fill variable with nanoseconds
fouri=${ns:4:4} # take 4 integers (mid 4 for better randomness)
# test for duplicate (this is BASH only test - use loop if portability needed)
[[ " ${used4[*]} " == *" $fouri "* ]] && continue
newid="${fouri}${id2}" # concatenate 4ints + orig 2 digit id
used4+=( "$fouri" ) # add 4ints to used4 array
dupid=1
done
dupid=0 # reset flag
printf "%s => %s\n" "$line" "$newid"
done<"$1"
output:
$ bash fourid.sh dat/nameid.dat
Joe Bloggs - 17 => 762117
John Smith - 23 => 603623
Paul Smith - 24 => 210424
Joe Bloggs - 32 => 504732
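%N is a GNU date extension, so it may not behave the same on Darwin. Where that is a concern, bash's built-in $RANDOM can supply the digits instead. A rough sketch of the same retry-on-duplicate idea; new_prefix, used and prefix are illustrative names, not part of the script above:

```shell
#!/usr/bin/env bash
used=" "    # space-delimited list of prefixes already handed out

# new_prefix: sets the global prefix to an unused 4-digit string (hypothetical helper).
new_prefix() {
    local p
    while :; do
        # two RANDOM draws combined, then reduced to 0000-9999
        p=$(printf '%04d' $(( (RANDOM * 32768 + RANDOM) % 10000 )))
        [[ $used == *" $p "* ]] || break   # retry on a duplicate
    done
    used+="$p "
    prefix=$p
}
```

Inside the read loop this would replace the date-based block: call new_prefix, then build newid="${prefix}${id2}". The loop never ends once all 10,000 prefixes are taken, so a real script would need a guard for that case.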

Related

Bash: checking substring increments with modular arithmetic

I have a list of files with file names that contain a substring of 6 numbers that represents HHMMSS, HH: 2 digits hour, MM: 2 digits minutes, SS: 2 digits seconds.
If the list of files is ordered, the increments should be in steps of 30 minutes, that is, the first substring should be 000000, followed by 003000, 010000, 013000, ..., 233000.
I want to check that no file is missing iterating the list of files and checking that neither of these substrings is missing. My approach:
string_check=000000
for file in ${file_list[@]}; do
if [[ ${file:22:6} == $string_check ]]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi
string_check=$((string_check+3000)) #this is the key line
done
And the previous to the last line is the key. It should be formatted to 6 digits, I know how to do that, but I want to add time like a clock, or, in more specific words, modular arithmetic modulo 60. How can that be done?
Assumptions:
all 6-digit strings are of the format xx[03]000 (ie, the minutes have to be exactly 00 or 30, with no seconds)
if there are strings like xx1529 ... these will be ignored (see 2nd half of answer - use of comm - to address OP's comment about these types of strings being an error)
Instead of trying to do a bunch of mod 60 math for the MM (minutes) portion of the string, we can use a sequence generator to generate all the desired strings:
$ for string_check in {00..23}{00,30}00; do echo $string_check; done
000000
003000
010000
013000
... snip ...
230000
233000
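If the arithmetic route is still wanted, one way to get clock-like behaviour is to count minutes since midnight and convert back, rather than adding 3000 to the string. A sketch only; next_slot is a hypothetical name, and the 10# prefix is there because bash would otherwise reject 08 and 09 as invalid octal:

```shell
#!/usr/bin/env bash
# next_slot HHMMSS: print the slot 30 minutes later, wrapping at midnight.
next_slot() {
    local hhmmss=$1
    local hh=$((10#${hhmmss:0:2}))
    local mm=$((10#${hhmmss:2:2}))
    local total=$(( (hh * 60 + mm + 30) % 1440 ))   # 1440 minutes per day
    printf '%02d%02d00\n' $((total / 60)) $((total % 60))
}
```

In the original loop the key line would then become string_check=$(next_slot "$string_check").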
While OP should be able to add this to the current code, I'm thinking we might go one step further and look at pre-parsing all of the filenames, pulling the 6-digit strings into an associative array (ie, the 6-digit strings act as the indexes), eg:
unset myarray
declare -A myarray
for file in "${file_list[@]}"
do
myarray[${file:22:6}]+=" ${file}" # in case multiple files have same 6-digit string
done
Using the sequence generator as the driver of our logic, we can pull this together like such:
for string_check in {00..23}{00,30}00
do
[[ -z "${myarray[${string_check}]}" ]] &&
echo "Problem: (file) '${string_check}' is missing"
done
NOTE: OP can decide if the process should finish checking all strings or if it should exit on the first missing string (per OP's current code).
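The two loops can be folded into one function that checks every slot and reports the result through its exit status. This is a sketch under assumptions: bash 4+, the 6-digit string sitting at a fixed offset in each name, and report_missing being a name invented for the example:

```shell
#!/usr/bin/env bash
# report_missing OFFSET FILE...: list the half-hour slots that no file name covers.
# Returns 1 if anything is missing (hypothetical helper; bash 4+).
report_missing() {
    local offset=$1; shift
    local -A seen=()
    local file slot missing=0
    for file in "$@"; do
        slot=${file:offset:6}
        [[ -n $slot ]] && seen[$slot]=1     # skip names shorter than the offset
    done
    for slot in {00..23}{00,30}00; do
        if [[ -z ${seen[$slot]+x} ]]; then
            echo "Problem: (file) '$slot' is missing"
            missing=1
        fi
    done
    return "$missing"
}
```

With the offset from the question this would be report_missing 22 "${file_list[@]}" || exit 99, which lists every gap before exiting instead of stopping at the first one.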
One idea for using comm to compare the 2 lists of strings:
# display sequence generated strings that do not exist in the array:
comm -23 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[@]}" | sort)
# OP has commented that strings not like 'xx[03]000' should generate an error;
# display strings (extracted from file names) that do not exist in the sequence
comm -13 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[@]}" | sort)
Where:
comm -23 - display only the lines from the first 'file' that do not exist in the second 'file' (ie, missing sequences of the format xx[03]000)
comm -13 - display only the lines from the second 'file' that do not exist in the first 'file' (ie, filenames with strings not of the format xx[03]000)
These lists could then be used as input to a loop, or passed to xargs, for additional processing as needed; keeping in mind the comm -13 output will display the indices of the array, while the associated contents of the array will contain the name of the original file(s) from which the 6-digit string was derived.
This is easy with a POSIX shell, using only built-ins:
#!/usr/bin/env sh
# Print an x for each glob matched file, and store result in string_check
string_check=$(printf '%.0sx' ./*[0-2][0-9][03]000*)
# Now string_check length reflects the number of matches
if [ ${#string_check} -eq 48 ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi
Alternatively:
#!/usr/bin/env sh
if [ "$(printf '%.0sx' ./*[0-2][0-9][03]000*)" \
= 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi

Bash - Count frequency of palindromes from text file

This is a follow up from my other post:
Printing all palindromes from text file
I want to be able to print the number of palindromes that I have found in my text file, similar to a frequency table. It should show the count followed by the word, in this format:
100 did
32 sas
17 madam
My code right now is:
#!/usr/bin/env bash
function search
{
grep -oiE '[a-z]{3,}' "$1" | sort -n | tr '[:upper:]' '[:lower:]' | while read -r word; do
[[ $word == $(rev <<< "$word") ]] && echo "$word" | uniq -c
done
}
search "$1"
In comparison to the last post I did: Printing all palindromes from text file . I have added "sort -n" and "uniq -c" which from my knowledge is to sort the palindromes found in alphabetical order, then "uniq -c" is to print the number of occurrences of the words found.
Just to test script I have a testing file named: "testingfile.txt" . This contains:
testing words testing words testing words
palindromes
Sas
Sas
Sas
sas
bob
Sas
Sas
Sas Sas madam
midim poop goog tot sas did i want to go to the movies did
otuikkiuto
pop
poop
This file is just so I can test before trying this script on a much larger file in which it'll take much longer.
When typing in the console: (also to note "palindrome" is the name of my script)
source palindrome testingfile.txt
The output appears like this:
1 bob
1 did
1 did
1 goog
1 madam
1 midim
1 otuikkiuto
1 poop
1 poop
1 pop
1 sas
1 sas
1 sas
1 sas
1 sas
1 sas
1 sas
1 sas
1 sas
1 tot
Is there something I am missing to get the result that I want:
9 sas
2 did
2 poop
1 bob
1 goog
1 madam
1 midim
1 otuikkiuto
1 pop
1 tot
Solutions to this would be greatly appreciated! If there are solutions with other commands that are needed an explanation of the reasoning behind the other commands are also greatly appreciated.
Thank you
You missed two important details:
You need to pass all input at once to uniq -c to count them, not one by one to one uniq each
uniq expects its input to be sorted. The sort you had in the grep pipeline is ineffective, because after the transformation to lowercase, the values would need to be sorted again
You can apply sort | uniq -c to the output of an entire loop,
by piping the loop itself:
grep -oiE '[a-z]{3,}' "$1" | tr '[:upper:]' '[:lower:]' | while read -r word; do
[[ $word == $(rev <<< "$word") ]] && echo "$word"
done | sort | uniq -c
Finally, to get an output sorted in descending order by count,
you need to further pipe the output to sort -nr.
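Put together, the corrected function might look like the sketch below. It keeps the question's approach (rev for the reversal, words of three or more letters); count_palindromes is just an illustrative name:

```shell
#!/usr/bin/env bash
# count_palindromes FILE: print "count word" lines, most frequent first.
count_palindromes() {
    grep -oiE '[a-z]{3,}' "$1" |
        tr '[:upper:]' '[:lower:]' |
        while read -r word; do
            # keep only words that read the same backwards
            [[ $word == "$(rev <<< "$word")" ]] && echo "$word"
        done |
        sort |        # group identical words so uniq can count them
        uniq -c |     # prefix each distinct word with its count
        sort -nr      # highest count first
}
```

Called as count_palindromes testingfile.txt, it counts each distinct palindrome once over the whole input rather than once per loop iteration.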

How can I find the missing integers in a unique and sequential list (one per line) in a unix terminal?

Suppose I have a file as follows (a sorted, unique list of integers, one per line):
1
3
4
5
8
9
10
I would like the following output (i.e. the missing integers in the list):
2
6
7
How can I accomplish this within a bash terminal (using awk or a similar solution, preferably a one-liner)?
Using awk you can do this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file
2
6
7
Explanation:
{p = $1}: Variable p contains value from previous record
{for ...}: We loop from p+1 to the current row's value (excluding current value) and print each value which is basically the missing values
Using seq and grep:
seq $(head -n1 file) $(tail -n1 file) | grep -vwFf file -
seq creates the full sequence, and grep removes from it the lines that exist in the file.
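The four grep flags each do part of the work, which is easier to see spelled out. A small sketch wrapping the same command, assuming the file is sorted as in the question; missing_ints is a hypothetical name:

```shell
#!/usr/bin/env bash
# missing_ints FILE: print the integers between the first and last line of FILE
# that do not appear in it.
missing_ints() {
    local file=$1
    # -f FILE  read the patterns from FILE, one per line
    # -F       treat each pattern as a fixed string, not a regex
    # -w       match whole words only, so pattern 1 does not hit 10
    # -v       invert: keep the lines that match no pattern
    seq "$(head -n1 "$file")" "$(tail -n1 "$file")" | grep -vwFf "$file"
}
```

When nothing is missing grep selects no lines and exits with status 1, which matters if the script runs under set -e.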
Using Perl:
perl -nE 'say for $a+1 .. $_-1; $a=$_' file
Calling no external program (if filein contains the list of numbers):
#!/bin/bash
i=0
while read num; do
while (( ++i<num )); do
echo $i
done
done <filein
To adapt choroba's clever answer for my own use case, I needed my sequence to deal with zero-padded numbers.
The -w switch to seq is the magic here - it automatically pads the first number with the necessary number of zeroes to keep it aligned with the second number:
-w, --equal-width equalize width by padding with leading zeroes
My integers go from 0 to 9999, so I used the following:
seq -w 0 9999 | grep -vwFf "file.txt"
...which finds the missing integers in a sequence from 0000 to 9999. Or to put it back into the more universal solution in choroba's answer:
seq -w $(head -n1 "file.txt") $(tail -n1 "file.txt") | grep -vwFf "file.txt"
I didn't personally find that the - in his answer was necessary, but there may be use cases that require it.
Using Raku (formerly known as Perl_6)
raku -e 'my @a = lines.map: *.Int; say @a.Set (^) @a.minmax.Set;'
Sample Input:
1
3
4
5
8
9
10
Sample Output:
Set(2 6 7)
I'm sure there's a Raku solution similar to @JJoao's clever Perl5 answer, but in thinking about this problem my mind naturally turned to Set operations.
The code above reads lines into the @a array, mapping each line so that elements in the @a array are Ints, not strings. In the second statement, @a.Set converts the array to a Set on the left-hand side of the (^) operator. Also in the second statement, @a.minmax.Set converts the array to a second Set, on the right-hand side of the (^) operator, but this time because the minmax operator is used, all Int elements from the min to max are included. Finally, the (^) symbol is the symmetric set-difference (infix) operator, which finds the difference.
To get an unordered whitespace-separated list of missing integers, replace the above say with put. To get a sequentially-ordered list of missing integers, add the explicit sort below:
~$ raku -e 'my @a = lines.map: *.Int; .put for (@a.Set (^) @a.minmax.Set).sort.map: *.key;' file
2
6
7
The advantage of all Raku code above is that finding "missing integers" doesn't require a "sequential list" as input, nor is the input required to be unique. So hopefully this code will be useful for a wide variety of problems in addition to the explicit problem stated in the Question.
OTOH, Raku is a Perl-family language, so TMTOWTDI. Below, a @a.minmax array is created, and grepped so that none of the elements of @a are returned (none junction):
~$ raku -e 'my @a = lines.map: *.Int; .put for @a.minmax.grep: none @a;' file
2
6
7
https://docs.raku.org/language/setbagmix
https://docs.raku.org/type/Junction
https://raku.org

KornShell Sort Array of Integers

Is there a command in KornShell (ksh) scripting to sort an array of integers? In this specific case, I am interested in simplicity over efficiency. For example if the variable $UNSORTED_ARR contained values "100911, 111228, 090822" and I wanted to store the result in $SORTED_ARR
Is it actually an indexed array or a list in a string?
Array:
UNSORTED_ARR=(100911 111228 090822)
SORTED_ARR=($(printf "%s\n" ${UNSORTED_ARR[@]} | sort -n))
String:
UNSORTED_ARR="100911, 111228, 090822"
SORTED_ARR=$(IFS=, printf "%s\n" ${UNSORTED_ARR[@]} | sort -n | sed ':a;$s/\n/,/g;N;ba')
There are several other ways to do this, but the principle is the same.
Here's another way for a string using a different technique:
set -s -- ${UNSORTED_ARR//,}
SORTED_ARR=$@
SORTED_ARR=${SORTED_ARR// /, }
Note that this is a lexicographic sort so you would see this kind of thing when the numbers don't have leading zeros:
$ set -s -- 10 2 1 100 20
$ echo $@
1 10 100 2 20
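For comparison, the same values come out in numeric order when they go through sort -n, as in the array example earlier. A small sketch that runs in bash and should behave the same in ksh93; the variable names are only illustrative:

```shell
unsorted=(10 2 1 100 20)
# sort -n compares by numeric value rather than character by character
sorted=($(printf '%s\n' "${unsorted[@]}" | sort -n))
echo "${sorted[@]}"    # prints: 1 2 10 20 100
```

Because the result is a real array again, it can be looped over with for n in "${sorted[@]}".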
If I take that out then it works but I can't loop through it (because its a list of strings now) – pws5068 Mar 4 '11 at 21:01
Do this:
# create sorted array
set -s -A $@

Shell script sort list

I have a list with the following content:
VIP NAME DATE ARRIVE_TIME FLIGHT_TIME
1 USER1 11-02 20.00 21.00
3 USER2 11-02 20.45 21.45
4 USER2 11-03 20.00 21.30
2 USER1 11-04 17.20 19.10
I want to sort this and similar lists with a shell script. The result should be a new list with lines that do not collide. VIP 1 is most important, if any VIP with a bigger number has ARRIVE_TIME before FLIGHT_TIME for VIP 1 on the same date this line should be removed, so the VIP number should be used to decide which lines to keep if the ARRIVE_TIME, FLIGHT_TIME and DATE collide. Similarly, VIP 2 is more important than VIP 3 and so on.
This is pretty advanced, and I am totally empty for ideas on how to solve this.
You can use the unix sort command to do this:
There's an example of how to set primary and secondary keys etc:
Example
The uniq command is what you need to remove dupes.
This might get you started:
I'm ignoring the header line. You can get rid of it using head or skip it in the for loop.
Sort the flights by date, arrival, departure and vip number - having the vip number as a sort key simplifies the logic later.
I'm saving the result in an array, but you could redirect it to a temporary file and read it in a line at a time with a while read line; do ...; done <tempfile loop.
I'm using indirection to make things more readable (naming the fields instead of using array indices directly - the exclamation point means indirection here instead of "not")
For each line in the result that occurs on the same date as the most recently printed line, compare its arrival time to the previous flight's departure time
Echo the lines that are appropriate.
save the date and departure time for later comparison.
You should adjust the < comparison to be <= if that works better for your data.
Here is the script:
#!/bin/bash
saveIFS="$IFS"
IFS=$'\n'
flights=($(sort -k3,3 -k4,4n -k5,5n -k1,1n flights ))
IFS="$saveIFS"
date=fields[2]
arrive=fields[3]
depart=fields[4]
for line in "${flights[@]}"
do
fields=($line)
if [[ ${!date} == $prevdate && ${!arrive} < $prevdep ]]
then
echo "deleted: $line" # or you could do something else here
else
echo $line
prevdep=${!depart}
prevdate=${!date}
fi
done
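The header line and the array can both be avoided by letting read split the fields directly. This is a sketch of the same logic, not a different algorithm: it assumes whitespace-separated columns in the order shown in the question and one header line, and filter_flights is a made-up name:

```shell
#!/usr/bin/env bash
# filter_flights FILE: print the flights in date/arrival order and mark the
# ones that arrive before the previously kept flight on that date has left.
filter_flights() {
    local file=$1 prevdate="" prevdep="" vip name date arrive depart
    while read -r vip name date arrive depart; do
        if [[ $date == "$prevdate" && $arrive < $prevdep ]]; then
            echo "deleted: $vip $name $date $arrive $depart"
        else
            echo "$vip $name $date $arrive $depart"
            prevdate=$date      # remember the last kept flight
            prevdep=$depart
        fi
    done < <(tail -n +2 "$file" | sort -k3,3 -k4,4n -k5,5n -k1,1n)   # tail skips the header
}
```

As in the script above, the times are compared as strings, which only works while they all share the same HH.MM width.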
