Extract random substrings of a certain length from a file with Bash

I have multiple text files, and from each file I need to extract random contiguous substrings with a certain length.
For example, I need to extract five random substrings of 3 contiguous characters each, or 4 random substrings of 20 characters each.
In practice, let's assume this is the content of one of the files:
Welcome to stackoverflow the best technical resource ever
So if I want five random substrings of 3 characters each, I expect output that looks, for example, like this:
elc
sta
tec
res
rce
Your help would be much appreciated.

awk to the rescue!
awk -v n=5 -v s=3 'BEGIN {srand()}
{len=length($0);
for(i=1;i<=n;i++)
{k=rand()*(len-s)+1; printf "%s\t", substr($0,k,s)}
print ""}' file
Note that there may be spaces in the extracted substrings.
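If substrings containing spaces are unwanted, a small variation (my own sketch, not part of the original answer) is to keep redrawing until a space-free substring turns up:
awk -v n=5 -v s=3 'BEGIN {srand()}
{len=length($0)
 for(i=1;i<=n;i++) {
   # redraw until the substring has no space; beware this can loop for lines that are mostly spaces
   do { k=int(rand()*(len-s))+1; str=substr($0,k,s) } while (str ~ / /)
   printf "%s\t", str
 }
 print ""}' file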

Create a function to pick a random substring:
random_string() {
    line=$1
    length=$2
    # make sure we start at a random position that still leaves room for a substring of the given length
    start=$((RANDOM % ((${#line} - $length))))
    # use Bash substring (parameter) expansion to extract the substring
    printf '%s' "${line:$start:$length}"
}
Use the function in a loop:
#!/bin/bash
while IFS= read -r line; do
    random1=$(random_string "$line" 3)
    random2=$(random_string "$line" 20)
    printf 'random1=[%s], random2=[%s]\n' "$random1" "$random2"
done < file
Sample output with the content Welcome to stackoverflow the best technical resource ever in file:
random1=[hni], random2=[low the best technic]
random1=[sta], random2=[e best technical res]
random1=[ove], random2=[ackoverflow the best]
random1=[rfl], random2=[echnical resource ev]
random1=[ech], random2=[est technical resour]
random1=[cal], random2=[ome to stackoverflow]
random1=[tec], random2=[o stackoverflow the ]
random1=[l r], random2=[come to stackoverflo]
random1=[erf], random2=[ stackoverflow the b]
random1=[me ], random2=[ the best technical ]
random1=[est], random2=[ckoverflow the best ]
random1=[tac], random2=[tackoverflow the bes]
random1=[e t], random2=[o stackoverflow the ]
random1=[al ], random2=[come to stackoverflo]

Related

Grep list (file) from another file

I'm new to bash and am trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The suggested output is something like:
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from File1.txt.
The code I tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then a list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each line belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a while loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
    if($0 ~ a[j]) {
        print FILENAME ":" FNR ":" $0 >> ("output." a[j])
        next }
}' File1.txt base.csv
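On the anchoring point above: if whole-word matches are all you need, a hedged alternative (an assumption about your requirements, not something stated in the question) is grep's -w option:
# -w matches the patterns only as whole words, so ABC will not hit 123DEABCXYZ
grep -wFf File1.txt base.csv >output.txt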
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt    # cycle through all files containing pattern lists
do
    while read -r q    # read the pattern file line by line
    do
        echo "$q" >>"output.${i}"
        grep -e "${q}" base.csv >>"output.${i}"
        echo
    done < "${i}"
done
Here is one that splits the words out of file2 (with split, comma-separated, quotes and spaces stripped off) into an array (word[]) and stores the matching record names (line 1 etc.) against each word, comma-separated:
awk '
NR==FNR {
    n=split($0,tmp,/[" ]*(,|$)[" ]*/)    # split words
    for(i=2;i<=n;i++)                    # after the first
        if(tmp[i]!="")                   # non-empties
            word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1]    # hash rownames
    record[tmp[1]]=$0                    # store records
    next
}
($1 in word) {                           # word found
    n=split(word[$1],tmp,",")            # get record names
    print $1 ":"                         # output word
    for(i=1;i<=n;i++)                    # and records
        print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ("do" expected) or misbehavior (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
Gave up for a while and then eventually tried another way.
While the base goal was to cycle through the pattern-list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *.txt    # cycle through files with patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1"    # greps all patterns from the current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2"     # cuts the columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern-list names as I initially requested.
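For what it's worth, the intermediate ${i}.out1 file can be dropped by piping grep straight into cut; a sketch of that pipeline variant (same file names as above, my own rewrite rather than tested code):
for i in *.txt    # cycle through files with patterns
do
    grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 >> "${i}.out2"
done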

Efficient substring parsing of large fixed length text file in Bash

I have a large text file (millions of records) of fixed-length data and need to extract unique substrings and create a number of arrays with those values. I have a working version; however, I'm wondering if performance can be improved, since I need to run the script iteratively.
$_file5 looks like:
138000010065011417865201710152017102122
138000010067710416865201710152017102133
138000010131490417865201710152017102124
138000010142349413865201710152017102154
138400010142356417865201710152017102165
130000101694334417865201710152017102176
Here is what I have so far:
while IFS='' read -r line || [[ -n "$line" ]]; do
    _in=0
    _set=${line:15:6}
    _startDate=${line:21:8}
    _id="$_account-$_set-$_startDate"
    for element in "${_subsets[@]}"; do
        if [[ $element == "$_set" ]]; then
            _in=1
            break
        fi
    done
    # If we find a new one and it's not 504721
    if [ $_in -eq 0 ] && [ $_set != "504721" ] ; then
        _subsets=("${_subsets[@]}" "$_set")
        _ids=("${_ids[@]}" "$_id")
    fi
done < $_file5
And this yields:
_subsets=("417865","416865","413865")
_ids=("9899-417865-20171015", "9899-416865-20171015", "9899-413865-20171015")
I'm not sure if sed or awk would be better here and can't find a way to implement either. Thanks.
EDIT: Benchmark Tests
So I benchmarked my original solution against the two provided. Ran this over 10 times and all results were similar to those below.
# Bash read
real 0m8.423s
user 0m8.115s
sys 0m0.307s
# Using sort -u (@randomir)
real 0m0.719s
user 0m0.693s
sys 0m0.041s
# Using awk (@shellter)
real 0m0.159s
user 0m0.152s
sys 0m0.007s
Looks like awk wins this one. Regardless, the performance improvement from my original code is substantial. Thank you both for your contributions.
I don't think you can beat the performance of sort -u with bash loops (except in corner cases, as this one turned out to be, see footnote✻).
To reduce the list of strings you have in file to a list of unique strings (set), based on a substring:
sort -k1.16,1.21 -u file >set
Then, to filter out the unwanted id, 504721, starting at position 16, you can use grep -v:
grep -vE '.{15}504721' set
Finally, reformat the remaining lines and store them in arrays with cut/sed/awk/bash.
So, to populate the _subsets array, for example:
$ _subsets=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | cut -c16-21))
$ printf "%s\n" "${_subsets[#]}"
413865
416865
417865
or, to populate the _ids array:
$ _ids=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/'))
$ printf "%s\n" "${_ids[#]}"
9899-413865-20171015
9899-416865-20171015
9899-417865-20171015
✻ If the input file is huge, but it contains only a small number (~40) of unique elements (for the relevant field), then it makes perfect sense for the awk solution to be faster. sort needs to sort a huge file (O(N*logN)), then filter the dupes (O(N)), all for a large N. On the other hand, awk needs to pass through the large input only once, checking for dupes along the way via set membership testing. Since the set of uniques is small, membership testing takes only O(1) (on average, but for such a small set, practically constant even in worst case), making the overall time O(N).
If there were fewer dupes, awk would have O(N*log(N)) amortized and O(N²) worst case, not to mention the higher constant per-instruction overhead.
In short: you have to know what your data looks like before choosing the right tool for the job.
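To make the footnote concrete, a minimal single-pass dedup on the 6-character field (my own sketch, not the answer that follows) could look like:
# print only the first line seen for each value of characters 16-21, skipping set 504721
awk 'substr($0,16,6) != "504721" && !seen[substr($0,16,6)]++' file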
Here's an awk solution embedded in a bash script:
#!/bin/bash
fn_parser() {
awk '
BEGIN{ _account="9899" }
{ _set=substr($0,16,6)
_startDate=substr($0,22,8)
#dbg print "#dbg:_set=" _set "\t_startDate=" _startDate
if (_set != "504721") {
_id= _account "-" _set"-" _startDate
ids[_id] = _id
sets[_set]=_set
}
}
END {
printf "_subsets=("
for (s in sets) { printf("%s\"%s\"" , (commaCtr++ ? "," : ""), sets[s]) }
print ");"
printf "_ids=("
for (i in ids) { printf("%s\"%s\"" , (commaCtr2++ ? "," : ""), ids[i]) }
print ")"
}
' "${#}"
}
#dbg set -vx
eval $( echo $(fn_parser *.txt) )
echo "_subsets="$_subsets
echo "_ids="$_ids
output
_subsets=413865,417865,416865
_ids=9899-416865-20171015,9899-413865-20171015,9899-417865-20171015
Which I believe would be the same output your script would get if you did an echo on your variable names.
I didn't see _account being extracted from your file, and assume it is passed in from a previous step in your batch. But until I know whether that is a critical piece, I'll have to come back to figuring out how to pass a variable in to a function that calls awk.
People won't like using eval, but hopefully no one will embed /bin/rm -rf / into your data set ;-)
I use the eval so that the data extracted is available via the shell variables. You can uncomment the #dbg before the eval line to see how the code is executing in the "layers" of function, eval, var=value assignments.
Hopefully, you see how the awk script is a transcription of your code into awk.
It does rely on the fact that arrays can contain only 1 copy of a key/value pair.
I'd really appreciate if you post timings for all solutions submitted. (You could reduce the file size by 1/2 and still have a good test). Be sure to run each version several times, and discard the first run.
IHTH

Evaluating overlap of number ranges in bash

Assume a text file file which contains multiple lines of number ranges. The lower and upper bound of each range are separated by a dash, and the individual ranges are sorted (i.e., range 101-297 comes before 1299-1314).
$ cat file
101-297
1299-1314
1301-5266
6898-14503
How can I confirm in bash if one or more of these number ranges are overlapping?
In my opinion, all that seems to be needed is to iteratively perform integer comparisons across adjacent lines. The individual integer comparisons could look something like this:
if [ "$upperbound_range1" -gt "$lowerbound_range2" ]; then
echo "Overlap!"
exit 1
fi
I suspect, however, that this comparison can also be done via awk.
Note: Ideally, the code could not only determine if any of the ranges is overlapping with its immediate successor range, but also which range is the overlapping one.
Try this in awk:
awk -F"-" 'Q>=$1 && Q{print}{Q=$NF}' Input_file
Here - (dash) is made the field separator. Then we check whether a variable named Q is non-null and its value is greater than or equal to the current line's first field ($1); if so, we print that line (if you want to print the previous line instead, we could do that too). Finally, Q is created/re-assigned to the current line's last field's value.
EDIT: As the OP wants to get the previous line, here is a version changed to do that:
awk -F"-" 'Q>=$1 && Q{print val}{Q=$NF;val=$0}' Input_file
You could do:
$ awk -F"-" '$1<last_2 && NR>1 {printf "%s: %s: Overlap\n", last_line, $0}
{last_line=$0; last_2=$2}' file
1299-1314: 1301-5266: Overlap
If ranges are sorted by lower bound, and there's a range which overlaps, then the overlapping range will be the successor.
ranges=( $(<file) )
# or ranges=(101-297 1299-1314 1301-5266 6898-14503)
for ((i=1;i<${#ranges[@]};i+=1)); do
    range=${ranges[i-1]}
    successorRange=${ranges[i]}
    if ((${range#*-}>=${successorRange%-*})); then
        echo "overlap $i $range $successorRange"
    fi
done
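With the sample file above this should report the single overlapping pair, i.e. something like:
overlap 2 1299-1314 1301-5266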

Iterate a user string in bash to add vowels to string

So I have a word list containing over 30,000 words. My goal is to make a script that takes in a word with its vowels removed (example: mbnt), somehow adds vowels back, and compares against the word list to find at least the word "ambient", though it will also find other words that would read as "mbnt" if you were to take out all of their vowels.
So far this is my bash script
f=/wordList
anyVowel=[aAeEiIoOuU]
nonVowel=[^aAeEiIoOuU]
input=$1
for (( i=0; i<${#input}; i++ ));
do
grep "${input:$i:1}$nonVowel" $f | head -10
done
However, this just returns a normal list of words containing some of the characters the user inputs. Any thoughts on what I might be doing wrong?
awk to the rescue!
$ awk -v w=whr '{a=tolower($0);
gsub(/[^a-z]/,"",a);
gsub(/[aeiou]/,"",a)}
a==w' words
where
We look for the vowel-dropped word "whr" in the word list words (a made-up custom dict): convert to lowercase, filter out non-alpha characters and remove the vowels, then look for a match with the given word.
Note that this is very inefficient if you're looking for many words, but perhaps can be a template for your solution.
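If many words have to be looked up, one way around the inefficiency (a sketch under the assumption that the devoweled query words sit one per line in a hypothetical file queries) is to devowel the dictionary in a single pass and test set membership:
awk 'NR==FNR { q[$1]; next }    # load devoweled query words first
     { a=tolower($0); gsub(/[^a-z]/,"",a); gsub(/[aeiou]/,"",a)
       if (a in q) print a ": " $0 }' queries words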
Try
wordsfile=wordList
consonants=$1
# Create a regular expression that matches the input consonants with
# any number of vowels before, after, or between them
regex='^[[:space:]]*[aeiou]*'
for (( i=0; i<${#consonants}; i++ )) ; do
regex+="${consonants:i:1}[aeiou]*"
done
regex+='[[:space:]]*$'
grep -i -- "$regex" "$wordsfile"
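For example, with mbnt as the argument the loop builds this regex (easy to verify by adding an echo "$regex"):
^[[:space:]]*[aeiou]*m[aeiou]*b[aeiou]*n[aeiou]*t[aeiou]*[[:space:]]*$
which matches ambient, along with any other word whose consonant skeleton is mbnt.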

Remove partial duplicates from text file

My bash-foo is a little rusty right now so I wanted to see if there's a clever way to remove partial duplicates from a file. I have a bunch of files containing thousands of lines with the following format:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
Essentially it's a bunch of pipe delimited strings, with the final two columns being a timestamp and x. What I'd like to do is concatenate all of my files and then remove all partial duplicates. I'm defining partial duplicate as a line in the file that matches from String1 up to String22, but the timestamp can be different.
For example, a file containing:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 12:12:12|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
would become:
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
(It doesn't matter which timestamp is chosen).
Any ideas?
Using awk you can do this:
awk '{k=$0; gsub(/(\|[^|]*){2}$/, "", k)} !seen[k]++' file
String1|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|String7|09-Apr-2016 05:28:03|x
The awk command first makes a variable k by removing the last 2 fields from each line. Then it uses an associative array seen with k as the key, printing only the first instance of each key by recording every processed key in the array.
If you have Bash version 4, which supports associative arrays, it can be done fairly efficiently in pure Bash:
declare -A found
while IFS= read -r line || [[ -n $line ]] ; do
    strings=${line%|*|*}
    if (( ! ${found[$strings]-0} )) ; then
        printf '%s\n' "$line"
        found[$strings]=1
    fi
done < "$file"
Same idea as @anubhava's, but I think more idiomatic:
$ awk -F'|' '{line=$0;$NF=$(NF-1)=""} !a[$0]++{print line}' file
String1|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
String124|String2|String3|String4|String5|String6|...|String22|09-Apr-2016 05:28:03|x
