bash/sed/awk: change first letter in string to uppercase - bash

Let's say I have this list:
39dd809b7a36
d83f42ab46a9
9664e29ac67c
66cf165f7e32
51b9394bc3f0
I want to convert the first letter in each string to uppercase, for example
39dd809b7a36 -> 39Dd809b7a36
bash/awk/sed solution should be ok.
Thanks for the help.

GNU sed can do it (\U is a GNU extension):
printf "%s\n" 39dd809b7a36 d83f42ab46a9 9664e29ac67c 66cf165f7e32 51b9394bc3f0 |
sed 's/[[:alpha:]]/\U&/'
gives
39Dd809b7a36
D83f42ab46a9
9664E29ac67c
66Cf165f7e32
51B9394bc3f0
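Without the g flag, sed replaces only the first match on each line, which is exactly what is wanted here. As a cross-check of the expected output, here is the same first-match-only substitution sketched in Python. The function name upper_first_letter is made up for this illustration, and [A-Za-z] stands in for [[:alpha:]] on the assumption that the input is ASCII.

```python
import re

def upper_first_letter(s: str) -> str:
    # count=1 mirrors sed's default of replacing only the first match
    return re.sub(r"[A-Za-z]", lambda m: m.group(0).upper(), s, count=1)

ids = ["39dd809b7a36", "d83f42ab46a9", "9664e29ac67c",
       "66cf165f7e32", "51b9394bc3f0"]
for s in ids:
    print(upper_first_letter(s))
```

It prints the same five lines as the sed command; a string without any letter is returned unchanged.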

Pure Bash 4.0+ using parameter substitution:
string=( "39dd809b7a36" "d83f42ab46a9"
"9664e29ac67c" "66cf165f7e32" "51b9394bc3f0" )
for str in "${string[@]}"; do
# get the leading digits by removing everything
# starting from the first letter:
head="${str%%[a-z]*}"
# and the rest of the string starting with the first letter
tail="${str:${#head}}"
# compose result : head + tail with 1. letter to upper case
result="$head${tail^}"
echo -e "$str\n$result\n"
done
Result:
39dd809b7a36
39Dd809b7a36
d83f42ab46a9
D83f42ab46a9
9664e29ac67c
9664E29ac67c
66cf165f7e32
66Cf165f7e32
51b9394bc3f0
51B9394bc3f0

I can't think of any clever way to do this with the basic software tools, but the brute-force solution isn't too bad.
In the One True awk(1) or in gawk:
{ n = split($0, a, "")
for(i = 1; i <= n; ++i) {
s = a[i]
t = toupper(s)
if (s != t) {
a[i] = t
break
}
}
r = ""
for(i = 1; i <= n; ++i) {
r = r a[i]
}
print r
}
It's not too bad in Ruby:
ruby -pe '$_.sub!(/[a-z]/) { |c| c.upcase }'

Here is my solution. It handles one string at a time from standard input and will not work if the input itself contains a pattern of the form ###.###, but it can be tweaked as needed.
A=$(cat); B=$(echo "$A" | sed 's/\([a-z]\)/###\1###/' | sed 's/.*###\(.\)###.*/\1/' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/') ; C="echo $A | sed 's/[a-z]/$B/'" ; eval $C

Related

How to get n random "paragraphs" (groups of ordered lines) from a file

I have a file (originally compressed) with a known structure - every 4 lines, the first line starts with the character "#" and defines an ordered group of 4 lines. I want to select randomly n groups (half) of lines in the most efficient way (preferably in bash/another Unix tool).
My suggestion in python is:
import random
import subprocess
path = "origin.txt.gz"
unzipped_path = "origin_unzipped.txt"
new_path = "/home/labs/amit/diklag/subset.txt"
subprocess.getoutput("""gunzip -c %s > %s """ % (path, unzipped_path))
with open(unzipped_path) as f:
    lines = f.readlines()
subset_size = round((len(lines)/4) * 0.5)
# pick the index of the first line of each selected 4-line group
l = random.sample(list(range(0, len(lines), 4)),subset_size)
selected_lines = [line for i in l for line in list(range(i,i+4))]
new_lines = [lines[i] for i in selected_lines]
with open(new_path,'w+') as f2:
    f2.writelines(new_lines)
Can you help me find another (and faster) way to do it?
Right now it takes ~10 seconds to run this code
The following scripts might be helpful. They are untested, however, as we do not have an example file:
attempt 1 (awk and shuf) :
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
nrec=$(gunzip -c "$path" | awk '/^#/{c++} END{print c}')
awk '(NR==FNR){a[$1]=1;next}
!/^#/{next}
((++c) in a) { print; for(i=1;i<4;i++) { getline; print } }' \
<(shuf -i 1-"$nrec" -n "$count") <(gunzip -c "$path") > "$new_path"
attempt 2 (sed and shuf) :
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
gunzip -c $path | sed ':a;N;$!ba;s/\n/__END_LINE__/g;s/__END_LINE__#/\n#/g' \
| shuf -n $count | sed 's/__END_LINE__/\n/g' > $new_path
In this example, the sed line will replace all newlines with the string __END_LINE__, except if it is followed by #. The shuf command will then pick $count random samples out of that list. Afterwards we replace the string __END_LINE__ again by \n.
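The idea behind attempt 2 (turn each group into a single record, sample whole records, then expand them again) can be sketched in Python as well, which may matter since the question started from Python. This is a minimal sketch under the assumption that every group has exactly 4 lines; sample_groups and its parameters are illustrative names, not part of any answer here.

```python
import random

def sample_groups(lines, count, group_size=4, rng=random):
    # Cut the flat list of lines into consecutive fixed-size records
    groups = [lines[i:i + group_size]
              for i in range(0, len(lines), group_size)]
    # Sample whole records, so the lines of a group stay together and in order
    picked = rng.sample(groups, count)
    return [line for group in picked for line in group]

lines = ["#r1\n", "a\n", "b\n", "c\n",
         "#r2\n", "d\n", "e\n", "f\n",
         "#r3\n", "g\n", "h\n", "i\n"]
subset = sample_groups(lines, 2)
print("".join(subset), end="")
```

In practice the lines could be read straight from gzip.open(path, "rt"), which avoids writing a temporary unzipped file.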
attempt 3 (awk) :
Create a file called subset.awk containing :
# Uniform(m) :: returns a random integer such that
# 1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }
# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m, i,j,k) {
for (i = 1; i <= m ; i++) { permutation[i] = i }
for (i = m; i > 1; i--) {
j = Uniform(i)
k = permutation[i]
permutation[i] = permutation[j]
permutation[j] = k
}
}
BEGIN{RS="\n#"; srand() }
{a[NR]=$0}
END{ KnuthShuffle(NR);
sub("#","",a[1])
for(r = 1; r <= count; r++) {
print "#"a[permutation[r]]
}
}
And then you can run :
$ gunzip -c <file.gz> | awk -v count=30 -f subset.awk > <output.txt>
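For comparison, this is how the textbook Fisher-Yates (Knuth) shuffle looks in Python: walk from the last position down and swap each element with a uniformly chosen element at or before it. The function name and the optional rng parameter are assumptions made for this sketch.

```python
import random

def knuth_shuffle(m, rng=random):
    # Start from the identity permutation of 1..m
    perm = list(range(1, m + 1))
    # Walk down from the last index; randint is inclusive on both ends,
    # so an element may also be swapped with itself
    for i in range(m - 1, 0, -1):
        j = rng.randint(0, i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

print(knuth_shuffle(10))
```

Taking the first count entries of the result gives count distinct record numbers in random order.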

Substituting variables in a text string

I have a text string in a variable in bash which looks like this:
filename1.txt
filename2.txt
varname1 = v1value
$(varname1)/filename3.txt
$(varname1)/filename4.txt
varname2 = $(varname1)/v2value
$(varname2)/filename5.txt
$(varname2)/filename6.txt
I want to substitute all of the variables in place, producing this:
filename1.txt
filename2.txt
v1value/filename3.txt
v1value/filename4.txt
v1value/v2value/filename5.txt
v1value/v2value/filename6.txt
Can anyone suggest a clean way to do this in the shell?
In awk:
BEGIN {
FS = "[[:space:]]*=[[:space:]]*"
}
NF > 1 {
map[$1] = $2
next;
}
function replace( count)
{
for (key in map) {
count += gsub("\\$\\("key"\\)", map[key])
}
return count
}
{
while (replace() > 0) {}
print
}
In lua:
local map = {}
--for line in io.lines("file.in") do -- To read from a file.
for line in io.stdin:lines() do -- To read from standard input.
local key, value = line:match("^(%w*)%s*=%s*(.*)$")
if key then
map[key] = value
else
local count
while count ~= 0 do
line, count = line:gsub("%$%(([^)]*)%)", map)
end
print(line)
end
end
I found a reasonable solution using m4:
function make_substitutions() {
# first all $(varname)s are replaced with ____varname____
# then each assignment statement is replaced with an m4 define macro
# finally this text is then passed through m4
echo "$1" |\
sed 's/\$(\([[:alnum:]][[:alnum:]]*\))/____\1____/' | \
sed 's/ *\([[:alnum:]][[:alnum:]]*\) *= *\(..*\)/define(____\1____, \2)/' | \
m4
}
Perhaps
echo "$string" | perl -nlE 'm/(\w+)\s*=\s*(.*)(?{$h{$1}=$2})/&&next;while(m/\$\((\w+)\)/){$x=$1;s/\$\($x\)/$h{$x}/e};say$_'
prints
filename1.txt
filename2.txt
v1value/filename3.txt
v1value/filename4.txt
v1value/v2value/filename5.txt
v1value/v2value/filename6.txt
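The answers here share one idea: remember every "name = value" line in a map, and replace each $(name) reference with the stored value. A rough Python sketch of that idea follows; expand, ASSIGN and REF are names invented for this example, and it assumes that a variable is always defined before it is used, so values can be expanded once at assignment time.

```python
import re

ASSIGN = re.compile(r"^\s*(\w+)\s*=\s*(.*)$")
REF = re.compile(r"\$\((\w+)\)")

def expand(text):
    variables = {}
    out = []

    def substitute(line):
        # References to unknown names are left untouched
        return REF.sub(lambda m: variables.get(m.group(1), m.group(0)), line)

    for line in text.splitlines():
        m = ASSIGN.match(line)
        if m:
            # Store the already-expanded value, so one pass is enough later
            variables[m.group(1)] = substitute(m.group(2))
        else:
            out.append(substitute(line))
    return "\n".join(out)

sample = """filename1.txt
filename2.txt
varname1 = v1value
$(varname1)/filename3.txt
$(varname1)/filename4.txt
varname2 = $(varname1)/v2value
$(varname2)/filename5.txt
$(varname2)/filename6.txt"""
print(expand(sample))
```

Leaving unknown references untouched means the function always terminates, even when the input mentions a variable that was never defined.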

How to use awk or anything else to count the number of shared x values between 2 different y values in a CSV file consisting of two columns?

Let me be specific. We have a CSV file consisting of 2 columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
And we want to count how many shared x values two y values have, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this with awk? Or anything else?
Thanks in advance!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS=","}
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1 {last=a[$1]; a[$1]=(last!=""?last",":"")$2}
END {
for(i in a) {
cnt = split(a[i], arr, FS)
if( cnt>1 ) {
for(k=1;k<cnt;k++) {
for(j=k+1;j<=cnt;j++) {
if( arr[k] != arr[j] ) {
key=arr[k] OFS arr[j]
if(out[key]=="") {order[++ocnt]=key}
out[key]++
}
}
}
}
}
for(i=1;i<=ocnt;i++) {
print order[i] OFS out[order[i]]
}
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by y values in the second column, as in the question (after the header). If it works for you, I'll add some explanations tomorrow.
Additionally for anyone who wants more test data, here's a silly executable awk script for generating some data similar to what's in the question. Makes about 10K lines when run like gen.awk.
#!/usr/bin/awk -f
function randInt(max) {
return( int(rand()*max)+1 )
}
BEGIN {
a[1]="a"; a[2]="b"; a[3]="c"; a[4]="d"; a[5]="e"; a[6]="f"
a[7]="g"; a[8]="h"; a[9]="i"; a[10]="j"; a[11]="k"; a[12]="l"
a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
a[25]="y"; a[26]="z"
print "x,y"
for(i=1;i<=26;i++) {
amultiplier = randInt(1000) # vary this to change the output size
r = randInt(amultiplier)
anum = 1
for(j=1;j<=amultiplier;j++) {
if( j == r ) { anum++; r = randInt(amultiplier) }
print a[randInt(26)] randInt(5) "," a[i] anum
}
}
}
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
b6,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want, so the only real work is getting from your input to the first and second forms. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
BEGIN { FS="," }
{
if (x != $1) {
if (x) print values
x = $1
values = $2
} else {
values = values " " $2
}
}
END { print values }
'
Step two:
| awk '
{
for (i = 1; i < NF; ++i) {
for (j = i+1; j <= NF; ++j) {
print $i "," $j
}
}
}
'
Step three:
| sort | awk '
BEGIN {
combination = $0
print "y1,y2,share"
}
{
if (combination == $0) {
count = count + 1
} else {
if (count) print combination "," count
count = 1
combination = $0
}
}
END { print combination "," count }
'
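The three steps boil down to: collect the y values seen for each x, emit every unordered pair of y values within one x, and count how often each pair occurs. Here is a minimal Python sketch of that pipeline, useful for cross-checking the awk output; shared_counts is a name invented here and the rows are the sample data from the question.

```python
from collections import Counter, defaultdict
from itertools import combinations

def shared_counts(rows):
    # Step one: group the y values by x
    ys_by_x = defaultdict(set)
    for x, y in rows:
        ys_by_x[x].add(y)
    # Steps two and three: emit each sorted pair once per x and count them
    counts = Counter()
    for ys in ys_by_x.values():
        counts.update(combinations(sorted(ys), 2))
    return counts

rows = [("1h", "a2"), ("2e", "a2"), ("4f", "a2"), ("7v", "a2"),
        ("1h", "b6"), ("4f", "b6"), ("4f", "c9"), ("7v", "c9")]
print("y1,y2,share")
for (y1, y2), n in sorted(shared_counts(rows).items()):
    print(f"{y1},{y2},{n}")
```

Sorting the y values before pairing is what keeps b6,a2 from showing up alongside a2,b6.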
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1 { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
for (y1 in y) {
for (y2 in y) {
if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
for (i in x) {
if (seen[i,y1] && seen[i,y2]) {
++c[y1,y2]
}
}
}
}
}
for (key in c) {
split(key, a, SUBSEP)
print a[1],a[2],c[key]
}
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
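The pairwise approach described here can also be phrased with sets: collect the x values for each y, then intersect them for every pair of y values. The sketch below assumes the data fits in memory; shared_by_pair is a hypothetical helper name. Iterating over sorted y values guarantees that a2,b6 is reported and b6,a2 is not.

```python
from itertools import combinations

def shared_by_pair(rows):
    # Map each y value to the set of x values it appears with
    xs_by_y = {}
    for x, y in rows:
        xs_by_y.setdefault(y, set()).add(x)
    result = []
    # Each unordered pair of y values is visited exactly once
    for y1, y2 in combinations(sorted(xs_by_y), 2):
        shared = len(xs_by_y[y1] & xs_by_y[y2])
        if shared:
            result.append((y1, y2, shared))
    return result

rows = [("1h", "a2"), ("2e", "a2"), ("4f", "a2"), ("7v", "a2"),
        ("1h", "b6"), ("4f", "b6"), ("4f", "c9"), ("7v", "c9")]
for y1, y2, n in shared_by_pair(rows):
    print(f"{y1},{y2},{n}")
```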
This bash script, which relies on sed, does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x -2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
a=$(echo "$i + 1" | bc)
for j in $(eval echo "{$a..$x}")
do
var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
if [ "$var_x_1" = "$var_x_2" ] ; then
array[$index]=$var_y_1,$var_y_2
index=$(echo "$index + 1" | bc)
fi
done
done
counter=1
for (( k=1; k<$index; k++ ))
do
if [ ${array[k]} = ${array[k-1]} ] ; then
counter=$(echo "$counter + 1" | bc)
else
echo ${array[k-1]},$counter
counter=1
fi
if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
echo ${array[k]},$counter
fi
done

Reduced permutations

Consider the following string
abcd
I can return 2 character permutations
(cartesian product)
like this
$ echo {a,b,c,d}{a,b,c,d}
aa ab ac ad ba bb bc bd ca cb cc cd da db dc dd
However I would like to remove redundant entries such as
ba ca cb da db dc
and invalid entries
aa bb cc dd
so I am left with
ab ac ad bc bd cd
Here's a pure bash one:
#!/bin/bash
pool=( {a..d} )
for((i=0;i<${#pool[@]}-1;++i)); do
for((j=i+1;j<${#pool[@]};++j)); do
printf '%s\n' "${pool[i]}${pool[j]}"
done
done
and another one:
#!/bin/bash
pool=( {a..d} )
while ((${#pool[@]}>1)); do
h=${pool[0]}
pool=("${pool[@]:1}")
printf '%s\n' "${pool[@]/#/$h}"
done
They can be written as functions (or scripts):
get_perms_ordered() {
local i j
for((i=1;i<"$#";++i)); do
for((j=i+1;j<="$#";++j)); do
printf '%s\n' "${!i}${!j}"
done
done
}
or
get_perms_ordered() {
local h
while (("$#">1)); do
h=$1; shift
printf '%s\n' "${@/#/$h}"
done
}
Use as:
$ get_perms_ordered {a..d}
ab
ac
ad
bc
bd
cd
This last one can easily be transformed into a recursive function to obtain ordered permutations of a given length (without replacement—I'm using the silly ball-urn probability vocabulary), e.g.,
get_withdraws_without_replacement() {
# $1=number of balls to withdraw
# $2,... are the ball "colors"
# return is in array gwwr_ret
local n=$1 h r=()
shift
((n>0)) || return
((n==1)) && { gwwr_ret=( "$@" ); return; }
while (("$#">=n)); do
h=$1; shift
get_withdraws_without_replacement "$((n-1))" "$@"
r+=( "${gwwr_ret[@]/#/$h}" )
done
gwwr_ret=( "${r[@]}" )
}
Then:
$ get_withdraws_without_replacement 3 {a..d}
$ echo "${gwwr_ret[@]}"
abc abd acd bcd
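The recursion is easier to see without the shell quoting: take the first ball, combine it with every shorter draw from the balls after it, then drop that ball and repeat. Below is a Python sketch of the same scheme; the function name is an invented counterpart of the bash one, and it returns a list instead of setting a global array.

```python
def withdraws_without_replacement(n, balls):
    if n <= 0:
        return []
    if n == 1:
        return list(balls)
    result = []
    # Stop once fewer than n balls remain, like the bash loop condition
    for i in range(len(balls) - n + 1):
        head, rest = balls[i], balls[i + 1:]
        for tail in withdraws_without_replacement(n - 1, rest):
            result.append(head + tail)
    return result

print(" ".join(withdraws_without_replacement(3, ["a", "b", "c", "d"])))
```

For everyday use, itertools.combinations from the standard library yields the same selections as tuples.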
You can use awk to filter away the entries you don't want:
echo {a,b,c,d}{a,b,c,d} | awk -v FS="" -v RS=" " '$1 == $2 { next } ; $1 > $2 { SEEN[ $2$1 ] = 1 ; next } ; { SEEN[ $1$2 ] =1 } ; END { for ( I in SEEN ) { print I } }'
In details:
echo {a,b,c,d}{a,b,c,d} \
| awk -v FS="" -v RS=" " '
# Ignore identical values
$1 == $2 { next }
# Reorder and record inverted entries
$1 > $2 { SEEN[ $2$1 ] = 1 ; next }
# Record everything else
{ SEEN[ $1$2 ] = 1 }
# Print the final list
END { for ( I in SEEN ) { print I } }
'
FS="" tells awk that each character is a separate field.
RS=" " uses spaces to separate records.
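The filter can be stated more directly: out of the full cartesian product, keep a pair only when its first character sorts before its second. Here is a short Python sketch of that rule, assuming single-character items as in the question; reduced_pairs is a made-up name.

```python
from itertools import product

def reduced_pairs(chars):
    # a < b drops the doubles (aa) and the mirrored duplicates (ba) in one test
    return [a + b for a, b in product(chars, repeat=2) if a < b]

print(" ".join(reduced_pairs("abcd")))
```

This prints ab ac ad bc bd cd, in the order the product generates them.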
I'm sure someone's going to do this in one line of awk, but here is something in bash:
#!/bin/bash
seen=":"
result=""
for i in "$@"
do
for j in "$@"
do
if [ "$i" != "$j" ]
then
if [[ $seen != *":$j$i:"* ]]
then
result="$result $i$j"
seen="$seen$i$j:"
fi
fi
done
done
echo $result
Output:
$ ./prod.sh a b c d
ab ac ad bc bd cd
$ ./prod.sh I have no life
Ihave Ino Ilife haveno havelife nolife
Here is some pseudocode to achieve that, based on your restrictions and
using an array for your characters:
for (i=0;i<array.length;i++)
{
for (j=i+1;j<array.length;j++)
{
print array[i] + array[j]; //concatenation
}
}
I realized that I am not looking for permutations, but the power set. Here
is an implementation in Awk:
{
for (c = 0; c < 2 ^ NF; c++) {
e = 0
for (d = 0; d < NF; d++)
if (int(c / 2 ^ d) % 2) {
printf "%s", $(d + 1)
}
print ""
}
}
Input:
a b c d
Output:
a
b
ab
c
ac
bc
abc
d
ad
bd
abd
cd
acd
bcd
abcd
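The awk script treats the counter c as a bit mask: field d+1 is printed whenever bit d of c is set, so counting from 0 to 2^NF - 1 visits every subset once. Here is the same bit-mask enumeration as a Python sketch; power_set is an illustrative name, and the empty subset (mask 0) is returned as an empty string.

```python
def power_set(items):
    subsets = []
    for mask in range(1 << len(items)):
        # Keep the items whose bit is set in the current mask
        subsets.append("".join(item for bit, item in enumerate(items)
                               if (mask >> bit) & 1))
    return subsets

for subset in power_set("abcd"):
    print(subset)
```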

Extracting multiple parts of a string using bash

I have a caret delimited (key=value) input and would like to extract multiple tokens of interest from it.
For example: Given the following input
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"
1=A00^35=D^150=1^33=1
1=B000^35=D^150=2^33=2
I would like the following output
35=D^150=1^
35=D^150=2^
I have tried the following
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"|egrep -o "35=[^/^]*\^|150=[^/^]*\^"
35=D^
150=1^
35=D^
150=2^
My problem is that egrep returns each match on a separate line. Is it possible to get one line of output for one line of input? Please note that due to the constraints of the larger script, I cannot simply do a blind replace of all the \n characters in the output.
Thank you for any suggestions. This script is for bash 3.2.25. Any egrep alternatives are welcome. Please note that the tokens of interest (35 and 150) may change and I am already generating the egrep pattern in the script. Hence a one-liner (if possible) would be great.
You have two options. Option 1 is to change the "white space character" and use set --:
OFS=$IFS
IFS="^ "
line='1=A00^35=D^150=1^33=1'
set -- $line # No quotes here, so that the shell splits $line on IFS
IFS="$OFS"
Now you have your values in $1, $2, etc.
Or you can use an array:
tmp=$(echo "1=A00^35=D^150=1^33=1" | sed -e 's:\([0-9]\+\)=: [\1]=:g' -e 's:\^ : :g')
eval value=($tmp)
echo "35=${value[35]}^150=${value[150]}"
To get rid of the newline, you can just echo it again:
$ echo $(echo "1=A00^35=D^150=1^33=1"|egrep -o "35=[^/^]*\^|150=[^/^]*\^")
35=D^ 150=1^
If that's not satisfactory (I think it may give you one line for the whole input file), you can use awk:
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=35,150 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
35=D^150=1^
35=d^
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=1,33 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
1=A00^33=1^
1=a00^33=11^
This one allows you to use a single awk script and all you need to do is to provide a comma-separated list of keys to print out.
And here's the one-liner version :-)
echo '1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLST=1,33 -F^ '{s="";split(LST,k,",");for(i=1;i<=NF;i++){for(j in k){split($i,arr,"=");if(arr[1]==k[j]){printf s""arr[1]"="arr[2];s="^";}}}if(s!=""){print s;}}'
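The logic of the awk script is: split each line on ^, keep the fields whose key is in the requested list, and print them on one line with a trailing ^. Here is a Python sketch of the same selection, handy for checking expected output; pick_fields and its parameters are hypothetical names.

```python
def pick_fields(line, keys, sep="^"):
    wanted = set(keys)
    # A field is kept when the part before the first "=" is a wanted key
    kept = [f for f in line.split(sep) if f.split("=", 1)[0] in wanted]
    # Every kept field is followed by the separator, as in the desired output
    return "".join(f + sep for f in kept)

for line in ["1=A00^35=D^150=1^33=1", "1=a00^35=d^157=11^33=11"]:
    print(pick_fields(line, ["35", "150"]))
```

Fields come out in input order, and a line without any wanted key yields an empty string.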
given a file 'in' containing your strings :
$ for i in $(cut -d^ -f2,3 < in);do echo $i^;done
35=D^150=1^
35=D^150=2^
