Bash capturing in brace expansion - bash

What would be the best way to use something like a capturing group in regex for brace expansion. For example:
touch {1,2,3,4,5}myfile{1,2,3,4,5}.txt
results in all permutations of the numbers and 25 different files. But in case I just want to have files like 1myfile1.txt, 2myfile2.txt,... with the first and second number the same, this obviously doesn't work. Therefore I'm wondering what would be the best way to do this?
I'm thinking about something like capturing the first number, and using it a second time. Ideally without a trivial loop.
Thanks!

Not using a regex but a for loop and sequence (seq) you get the same result:
for i in $(seq 1 5); do touch ${i}myfile${i}.txt; done
Or tidier:
for i in $(seq 1 5);
do
touch ${i}myfile${i}.txt;
done
As an example, using echo instead of touch:
➜ for i in $(seq 1 5); do echo ${i}myfile${i}.txt; done
1myfile1.txt
2myfile2.txt
3myfile3.txt
4myfile4.txt
5myfile5.txt

Variation on MTwarog's answer with one less pipe/subprocess:
$ echo {1..5} | tr ' ' '\n' | xargs -I '{}' touch {}myfile{}.txt
$ ls -1 *myfile*
1myfile1.txt
2myfile2.txt
3myfile3.txt
4myfile4.txt
5myfile5.txt

You can use AWK to do that:
echo {1..5} | tr ' ' '\n' | awk '{print $1"filename"$1".txt"}' | xargs touch
Explanation:
echo {1..5} - prints range of numbers
tr ' ' '\n' - splits numbers to separate lines
awk '{print $1"filename"$1}' - enables you to format output using previously printed numbers
xargs touch - passes filenames to touch command (creates files)

Related

How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contains a,b,c and the chars appear in the word 3 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${#: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$#"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[#]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[#]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' ' ' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=occur)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

How to append lots of variables to one variable with a simple command

I want to stick all the variables into one variable
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
#^^lets pretend theres 100 more of these ^^
#Variable composition
#after AAA, is AAB then AAC then AAD etc etc, does that 100 times
I want them all placed into this MASTER variable
#MASTER=${A}${AA}${AAA} (<-- insert AAB, AAC and 100 more variables here)
I obviously don't want to type 100 variables in this expression because there's probably an easier way to do this. Plus I'm gonna be doing more of these therefore I need it automated.
I'm relatively new to sed, awk, is there a way to append those 100 variables into the master variable?
For this specific purpose I DO NOT want an array.
You can use a simple one-liner, quite straightforward, though more expensive:
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
set lists all the variables in var=name format
grep filters out the variables we need
sort puts them in the right order (probably optional since set gives a sorted output)
cut extracts the values, removing the variable names
tr removes the newlines
Let's test it.
A=1
AA=2
AAA=3
AAB=4
AAC=5
AAD=6
AAAA=99 # just to make sure we don't pick this one up
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
echo "$master"
Output:
123456
With my best guess, how about:
#!/bin/bash
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
# to be continued ..
for varname in A AA A{A..D}{A..Z}; do
value=${!varname}
if [ -n "$value" ]; then
MASTER+=$value
fi
done
echo $MASTER
which yields:
blahblah2blah3blah4blah5...
Although I'm not sure whether this is what the OP wants.
echo {a..z}{a..z}{a..z} | tr ' ' '\n' | head -n 100 | tail -n 3
adt
adu
adv
tells us, that it would go from AAA to ADV to reach 100, or for ADY for 103.
echo A{A..D}{A..Z} | sed 's/ /}${/g'
AAA}${AAB}${AAC}${AAD}${AAE}${AAF}${AAG}${AAH}${AAI}${AAJ}${AAK}${AAL}${AAM}${AAN}${AAO}${AAP}${AAQ}${AAR}${AAS}${AAT}${AAU}${AAV}${AAW}${AAX}${AAY}${AAZ}${ABA}${ABB}${ABC}${ABD}${ABE}${ABF}${ABG}${ABH}${ABI}${ABJ}${ABK}${ABL}${ABM}${ABN}${ABO}${ABP}${ABQ}${ABR}${ABS}${ABT}${ABU}${ABV}${ABW}${ABX}${ABY}${ABZ}${ACA}${ACB}${ACC}${ACD}${ACE}${ACF}${ACG}${ACH}${ACI}${ACJ}${ACK}${ACL}${ACM}${ACN}${ACO}${ACP}${ACQ}${ACR}${ACS}${ACT}${ACU}${ACV}${ACW}${ACX}${ACY}${ACZ}${ADA}${ADB}${ADC}${ADD}${ADE}${ADF}${ADG}${ADH}${ADI}${ADJ}${ADK}${ADL}${ADM}${ADN}${ADO}${ADP}${ADQ}${ADR}${ADS}${ADT}${ADU}${ADV}${ADW}${ADX}${ADY}${ADZ
The final cosmetics is easily made by hand.
One-liner using a for loop:
for n in A AA A{A..D}{A..Z}; do str+="${!n}"; done; echo ${str}
Output:
blahblah2blah3blah4blah5
Say you have the input file inputfile.txt with arbitrary variable names and values:
name="Joe"
last="Doe"
A="blah"
AA="blah2
then do:
master=$(eval echo $(grep -o "^[^=]\+" inputfile.txt | sed 's/^/\$/;:a;N;$!ba;s/\n/$/g'))
This will concatenate the values of all variables in inputfile.txt into master variable. So you will have:
>echo $master
JoeDoeblahblah2

Alternating output in bash for loop from two grep

I'm trying to search through files and extract two pieces of relevant information every time they appear in the file. The code I currently have:
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3
samples=$(grep $str2 $file | cut -d '/' -f 8
echo $samples $reads >> reads.txt
done
It is doing each line for the file (the files have varying numbers of instances of these phrases) and gives me the output per row for each file:
PopA_15.fq 1081264
PopA_16.fq PopA_17.fq 1008416 554791
PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129
...
I want it to match each instance (i.e. 1st instance of both greps next two each other):
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
...
How do I do this? Thank you
Considering that your Input_file is same as sample shown and number of columns are even on each line with 1 PopA value and other will be with digit values. Following awk may help you in same.
awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}' Input_file
Output will be as follows.
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
In case you want to pass output of a command to awk command then you could do like your command | awk command... no need to add Input_file to above awk command.
This is what ended up working for me...any tips for more efficient code are definitely welcome
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
paste <(echo "$samples" | column -t) <(echo "$reads" | column -t) >> reads.txt
done
This provides the desired output described above.

KSH: Loop performance

I need to process a file with approximately 120k lines that has the following format using ksh:
"[UserId=USER1]";"Client=001";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20120131";"Last_Password_Change=29990"
"[UserId=USER2]";"Client=000";"Locked_Status=0";"TYPE=A";"Last_Logon=20141020";"Valid_To=00000000";"Password_Change=20140620";"Last_Password_Change=9501"
"[UserId=USER3]";"Client=002";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20140304";"Last_Password_Change=9817"
The output should be something like:
[UserId=USER1] Client=001
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20120131
Last_Password_Change=29985
[UserId=USER2]
Client=000
Locked_Status=0
TYPE=A
Last_Logon=20141020
Valid_To=00000000
Password_Change=20140620
Last_Password_Change=9496
[UserId=User3]
Client=002
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20140304
Last_Password_Change=9812
I initially used the following code do process the file:
for a in $(<$1)
do
a=$(echo $a|sed -e 's/;/ /g' -e 's/"//g')
for b in $a
do
print $b
done
done
It was taking around 3hrs to process 120k lines.
Then I tried to improved the code changing it to the following:
for a in $(<$1)
do
printf "\n$(echo $a|sed -e 's/"//g' -e 's/;/\\n/g')"
done
That gave me 2hrs processing time however it still takes too long to process 120k lines
At last I tried this code which processed the 120k lines in 3secs!
perl -ne '
chomp;
s/\"//g;
s/;/\n/g;
print;
' <$1
Is there anyway I can improve the code in KSH to achieve similar performance? I believe that I must be missing something in my KSH code... Help me to find out please.
Thanks in advance
How about:
tr ';' '\n' < file | tr -d '"'
Your code is assigning every whitespace-delimited word to variable "a" in turn, and thus invoking sed once for each word in the file. Clearly a lot of accumulated overhead spawning all those processes. The idiom to iterate over the lines of a file is:
while IFS= read -r line; do ...; done < file
You suggestion worked perfectly
Host> wc -l /tmp/MyTest
114449 /tmp/MyTest
Host> time tr ';' '\n' < /tmp/MyTest | tr -d '"' > /tmp/zuza.out
real 0m1.04s
user 0m1.06s
sys 0m0.08s
Host> time perl -ne '
chomp;
s/\"//g;
s/;/\n/g;
print "\n$_";
' </tmp/MyTest > /tmp/zuza
real 0m1.30s
user 0m0.60s
sys 0m0.08s

Why is this bash for loop slow?

I am trying to this code:
for f in jobs/UPDTEST/apples* ; do
nf=`echo $f | sed s:jobs\/::g`
echo $nf | tr '_' ' '
done > jobs
There are 750 apples* type text files. But as I am only messing with the file name - I would have thought it should be quick - but take about 5 mins.
Is there an alternative way to do this?
You can use parameter expansions like ${parameter/pattern/string} to get rid of the calls to sed and tr. In your case it could look like:
for f in jobs/UPDTEST/apples*; do
f=${f//jobs\//}
echo ${f//_/ }
done > jobs
First, cd jobs would remove the need for the sed
Second, you don't need tr to substitute characters in the value of a bash variable.
Third, with find you don't need a loop at all.
f=$(cd jobs; find UPDTEST -name 'apples*' -depth 1)
echo "${f//_/ }" > jobs.log
By the way, you can't have a jobs directory and a jobs file in the same directory.

Resources