Replace a string with a random number for every line, in every file, in a directory in Bash

#!/bin/bash
for file in ~/tdg/*.TXT
do
    while read p; do
        randvalue=`shuf -i 1-99999 -n 1`
        sed -i -e "s/55555/${randvalue}/" $file
    done < $file
done
This is my script. I'm attempting to replace 55555 with a different random number every time it occurs. The script currently works, but it replaces every instance of 55555 with the same random number. I have attempted to replace $file at the end of the sed command with $p, but that just blows up.
Really, though, even if I only get to the point where each instance on the same line gets the same random number, but a new random number is used for each line, I'll be happy.
EDIT
I should have specified this. I would like to actually save the results of the replace in the file, rather than just printing the results to the console.
EDIT
The final working version of my script after JNevill's fantastic help:
#!/bin/bash
for file in ~/tdg/*.TXT
do
    while read p;
    do
        gawk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file > ${file}.new
    done < $file
    mv -f ${file}.new $file
done

Since doing this in sed gets pretty awful pretty quickly, you may want to switch over to awk to perform this:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file
Using this, you can remove the inner loop, as awk will run across the entire file line-by-line on its own.
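For example, the per-file loop could drop the inner while read entirely; a minimal sketch (the BEGIN srand() seeding, so each run differs, and the .new temp-file name are my additions):
#!/bin/bash
for file in ~/tdg/*.TXT; do
    # one gawk pass rewrites the whole file; gensub picks a new number per line
    gawk 'BEGIN { srand() } { $0 = gensub(/55555/, int(rand()*99999), "g", $0) } 1' \
        "$file" > "${file}.new" && mv -f "${file}.new" "$file"
done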
You could just swap out the entire script and feed the wildcard filename to awk directly too:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' ~/tdg/*.TXT

This is how to REALLY do what you're trying to do with GNU awk:
awk -i inplace '{ while(sub(/55555/,int(rand()*99999)+1)); print }' ~/tdg/*.TXT
No shell loops or temp files required and it WILL replace every 55555 with a different random number within and across all files.
With other awks it'd be the following (the END block prints a fresh random number to stdout, which the shell captures as the seed for the next file, while the modified data lines go to tmp):
seed="$RANDOM"
for file in ~/tdg/*.TXT; do
    seed=$(awk -v seed="$seed" '
        BEGIN { srand(seed) }
        { while(sub(/55555/,int(rand()*99999)+1)); print > "tmp" }
        END { print int(rand()*99999)+1 }
    ' "$file") &&
    mv tmp "$file"
done

A variation on JNevill's solution that generates a different set of random numbers every time you run the script ...
A sample data file:
$ cat grand.dat
abc def 55555
xyz-55555-55555-__+
123-55555-55555-456
987-55555-55555-.2.
.+.-55555-55555-==*
And the script:
$ cat grand.awk
{ $0=gensub(/55555/,int(rand()*seed),"g",$0); print }
gensub(...) : works the same as in JNevill's answer, but we mix up the rand() multiplier by using our seed value [you can throw any numbers in here you wish to help determine the size of the resulting value]
Keep in mind that this will replace all occurrences of 55555 on a single line with the same random value.
Script in action:
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 6939
xyz-8494-8494-__+
123-24685-24685-456
987-4442-4442-.2.
.+.-17088-17088-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4134
xyz-5060-5060-__+
123-14706-14706-456
987-2646-2646-.2.
.+.-10180-10180-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4287
xyz-5248-5248-__+
123-15251-15251-456
987-2744-2744-.2.
.+.-10558-10558-==*
seed=$RANDOM : have the OS generate a random int for us and pass it into the awk script as the seed variable
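If you also need a fresh value for each occurrence within a line, a sketch in the same spirit (untested), swapping gensub for a sub() loop as in the gawk answer above:
$ cat grand2.awk
{ while (sub(/55555/, int(rand()*seed))) ; print }
$ awk -f grand2.awk seed=${RANDOM} grand.dat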

Related

Trying to create a script that counts the length of all the reads in a fastq file but getting no return

I am trying to count the length of each read in a fastq file from Illumina sequencing and output this to a tsv, or any sort of file, so I can later also look at it and count the number of reads per file. So I need to cycle down the file, extract each line that has a read on it (every 4th line), then get its length and store this as the output.
num=2
for file in *.fastq
do
    echo "counting $file"
    function file_length(){
        wc -l $file | awk '{print$FNR}'
    }
    for line in $file_length
    do
        awk 'NR==$num' $file | chrlen > ${file}read_length.tsv
        num=$((num + 4))
    done
done
Currently all I get is the counting $file output and nothing else, but also no errors.
Your script contains a lot of errors in both syntax and algorithm. Please try shellcheck to see what the problems are. The biggest issue is the $file_length part.
You may have meant to call a function file_length() here, but $file_length is just an undefined variable, which is evaluated as null in the for loop.
If you just want to count the length of the 4th line of *.fastq files,
please try something like:
for file in *.fastq; do
    awk 'NR==4 {print length}' "$file" > "${file}_length.tsv"
done
Or if you want to put the results together in a single tsv file, try:
tsvfile="read_length.tsv"
for file in *.fastq; do
    echo -n -e "$file\t" >> "$tsvfile"
    awk 'NR==4 {print length}' "$file" >> "$tsvfile"
done
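If instead you want the length of every read (reads sit on lines 2, 6, 10, ... of a standard 4-line FASTQ record, which matches your num=2 stepping by 4), a sketch along the same lines; the output file name is my choice:
for file in *.fastq; do
    # line 2 of each 4-line record holds the sequence
    awk 'NR % 4 == 2 { print length }' "$file" > "${file}_read_lengths.tsv"
done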
Hope this helps.

Update version number in property file using bash

I am new to bash scripting and I need help with awk. The thing is that I have a property file with a version inside, and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting a strange result: version="1.1.1.0""...1
Could someone please help me with this?
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
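To illustrate: an awk program is a list of condition { action } pairs, and a condition with no action defaults to { print }, so these two commands are equivalent:
awk 1 file
awk '{ print }' file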
You can use the = as the delimiter, like this:
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v}' file.properties
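Note that this prints only the matching line and drops the rest of the file. To rewrite the whole file you would also pass the other lines through; a sketch (the quoting of the value follows the answer above, so adjust if your file has no quotes):
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v; next}1' file.properties > newFile && mv newFile file.properties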

Need To Generate New Random Numbers For Each Sed Substitution Used In "While Loop"

I'm using sed to substitute a random 10 digit string of numbers for a certain field in a file, which I can successfully do. However, the same random 10 digit string of numbers are used for each substitution sed performs which is unacceptable in this case. I need a new random 10 digit string of numbers for every substitution sed performs. Here's what I have so far:
#!/bin/bash
#
#
random_number()
{
    for i in {1}; do tr -c -d 0-9 < /dev/urandom | head -c 10; done
}
while read line
do
    sed -E "s/[<]FITID[>][[:digit:]]+/<FITID>$(random_number)/g"
done < ~/Desktop/FITIDTEST.QFX
Here's a sample of what the original FITIDTEST.QFX file looks like:
<FITID>1266821191
<FITID>1267832241
<FITID>1268070393
<FITID>1268565193
<FITID>1268882385
<FITID>1268882384
And here is the output after executing the script:
<FITID>4270240286
<FITID>4270240286
<FITID>4270240286
<FITID>4270240286
<FITID>4270240286
<FITID>4270240286
I need those 10 digit numbers to be different for each field. I thought the while loop would force sed to call the random_number() function each time, but apparently it's called once and the value is stored and used repeatedly. Is there any way to avoid that? Any help is greatly appreciated!
Your sed is given no input file, so it reads all the remaining lines from stdin and replaces the pattern on every matching line, not just one; hence at the end of the loop you see the same number in each replacement.
You can use:
while read line; do
    sed -E "/<FITID>/s/<FITID>[[:digit:]]+/<FITID>$(random_number)/" <<< "$line"
done < ~/Desktop/FITIDTEST.QFX > _tmp_
Output:
cat _tmp_
<FITID>9974823224
<FITID>1524680591
<FITID>7433495381
<FITID>6642730759
<FITID>9653629434
<FITID>1325816974
Just use awk:
$ cat tst.awk
BEGIN { srand() }
{
sub(/[0-9]+/,sprintf("%010d",rand()*10000000000))
print
}
$ awk -f tst.awk file
<FITID>3730584119
<FITID>1473036092
<FITID>8390375691
<FITID>6700634479
<FITID>8379256766
<FITID>6583696062
$ awk -f tst.awk file
<FITID>7844627153
<FITID>0141034890
<FITID>9714288799
<FITID>0911892354
<FITID>8916456168
<FITID>4187598430

How to convert HHMMSS to HH:MM:SS Unix?

I am trying to convert HHMMSS to HH:MM:SS, and I can do the conversion successfully, but my script takes 2 hours to complete because of the file size. Is there a better (faster) way to complete this task?
Data File
data.txt
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,
My code: script.sh
#!/bin/bash
awk -F"," '{print $5}' Data.txt > tmp.txt # print first line first string before , to tmp.txt i.e. all Numbers will be placed into tmp.txt
sort tmp.txt | uniq -d > Uniqe_number.txt # unique values be stored to Uniqe_number.txt
rm tmp.txt # removes tmp file
while read line; do
    echo $line
    cat Data.txt | grep ",$line," > Numbers/All/$line.txt # grep the number and create files individually
    awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
    mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
done < Uniqe_number.txt
ls Numbers/Final > files.txt
dos2unix files.txt
bash time_replace.sh
When you execute the above script, it will call the time_replace.sh script.
My Code for time_replace.sh
#!/bin/bash
for i in `cat files.txt`
do
    while read aline
    do
        TimeDep=`echo $aline | awk -F"," '{print $6}'`
        #echo $TimeDep
        finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
        #echo $finalTimeDep
        ##########
        TimeAri=`echo $aline | awk -F"," '{print $7}'`
        #echo $TimeAri
        finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
        #echo $finalTimeAri
        sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
        sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
        ############################
    done < Numbers/Final/$i
done
Any better solution?
Appreciate any help.
Thanks
Sri
If there's a large quantity of files, then the pipelines are probably what impact performance more than anything else: although processes can be cheap, if you're doing a huge amount of processing, cutting down the number of times you pass data through a pipeline can reap dividends.
So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitrary file, so the while loop in your first script could be replaced with an awk script that does this. You also don't need to use a temporary file.
I assume the sorting is just for tracking progress easily as you know how many numbers there are. But if you don't care for the sorting, you can simply do this:
#!/bin/sh
awk -F ',' '
{
    print $5","$4","$7","$8","$9","$10","$11 > ("Numbers/Final/Final_" $5 ".txt")
}' datafile.txt
ls Numbers/Final > files.txt
Alternatively, if you need to sort, you can do sort -t, -k5,5 -k4,4 -k10,10 (or whichever fields your sort keys actually need to be).
As for formatting the datetime, awk also supports user-defined functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis) ... (Note! Untested, so it may contain vague syntax errors):
#!/usr/bin/awk -f
BEGIN {
    FS=","
}
function formattime (t)
{
    return substr(t,1,2) ":" substr(t,3,2) ":" substr(t,5,2)
}
{
    print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > ("Numbers/Final/Final_" $5 ".txt")
}
which you can save, chmod 700, and call directly as:
./dostuff.awk filename
Other awk options include changing fields in situ, so if you want to maintain the entire original file but with formatted datetimes, you can do a modification of the above. Add OFS="," to the BEGIN block (so the rebuilt line keeps its commas) and change the print block to:
{
    $10=formattime($10)
    $11=formattime($11)
    print $0
}
If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS.
# fields 10 and 11 (array indexes 9 and 10) hold the HHMMSS times
while IFS=, read -r -a line ; do
    echo -n "${line[0]},${line[1]},${line[2]},${line[3]},"
    echo -n "${line[4]},${line[5]},${line[6]},${line[7]},"
    echo -n "${line[8]},"
    if [ -n "${line[9]}" ]; then
        echo -n "${line[9]:0:2}:${line[9]:2:2}:${line[9]:4:2}"
    fi
    echo -n ,
    if [ -n "${line[10]}" ]; then
        echo -n "${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}"
    fi
    echo ","
done < data.txt
The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
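For example, with an HHMMSS value:
$ t=072200
$ echo "${t:0:2}:${t:2:2}:${t:4:2}"
07:22:00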
In Perl, that's close to child's play:
#!/usr/bin/env perl
use strict;
use warnings;
use English( -no_match_vars );
local($OFS) = ",";
while (<>)
{
my(@F) = split /,/;
$F[9] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
$F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
print @F;
}
If you don't want to use English, you can write local($,) = ","; instead; $, controls the output field separator, here set to a comma. The code reads each line of the file, splits it on the commas, takes fields 9 and 10 (counting from zero), and (if they're not empty) inserts colons between the pairs of digits. I'm sure a 'Code Golf' solution could be made a lot shorter, but this is semi-legible if you know any Perl.
This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
The output on the sample data you gave is:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,

search a pattern in file and output each pattern result in its own file using awk, sed

I have a file of numbers in each new line:
$ cat test
700320947
700509217
701113187
701435748
701435889
701667717
701668467
702119126
702306577
702914910
that I want to search details of in another, larger file with several comma-separated fields, and output the results in
700320947.csv
700509217.csv
701113187.csv
701435748.csv
701435889.csv
701667717.csv
701668467.csv
702119126.csv
702306577.csv
702914910.csv
Logic:
ls test | while read file; do zgrep $line *large*file*gz >> $line.csv ; done
Please assist.
Thanks
Since nothing is said about the structure of the large file, I'll just assume that the numbers in test are to be found in the second column of the large file; generalize as needed.
This can be done in a single pass through each of the files by using output redirection in awk:
awk -F"," 'FILENAME == "test" { num[$1]=1; next }
num[$2] { print > $2".csv" }' test bigfile
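Since the large file is gzipped, a minimal sketch feeding it through zcat (bigfile.gz is a placeholder name; the final - tells awk to read the decompressed stream from stdin):
zcat bigfile.gz | awk -F"," 'FILENAME == "test" { num[$1]=1; next }
                             num[$2]            { print > ($2 ".csv") }' test -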
Unzip the large file first; using zgrep means unzipping on-the-fly for every line of the number file, which is very inefficient. After unzipping the big file, this will do it:
for number in `cat test`; do grep $number bigfile > $number.csv; done
Edited:
To limit hits to whole words only (eg 702119126 won't match 1702119126), add word boundaries to the regex:
for number in `cat test`; do grep \\b$number\\b bigfile > $number.csv; done
