bash : to keep all values > 3500 with sed

I have a question about sed: how can I keep all values > 3500 in a field?
Here is my problem.
My input (from a .csv file) looks like this:
String1;Val1;String2;Val2
I would like to keep only the lines where Val1 > 3500 and Val2 >= 60,00 (and <= 99,99).
So I tried this:
`sed -nr 's/^(.*);
([^([0-9]|[1-9][0-9]|[1-9][0-9]{2}|[1-2][0-9]{3}|3[0-4][0-9]{2}|3500)]);
(.*);
([6-9][0-9],[0-9]*)$
/Dans la ville de \1, \2 votants avec un pourcentage de \4 pour \3/p'
`
but I get this error:
`sed -e expression #1, char 174: Unmatched ) or \)`
I think the problem comes from the pattern for the second field:
I match all numbers <= 3500 and then negate those tests.
Do you have any idea how I should proceed?
Thanks.
(and sorry for my poor English)

Awk is the right tool in such a case:
awk 'BEGIN{ FS=OFS=";" }$2 > 3500 && ($4 >= 60.00 && $4 <= 99.99)' file
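Because the sample values use a decimal comma (60,00), awk may not treat field 4 as a number. A minimal sketch that converts the comma first, assuming the String1;Val1;String2;Val2 layout from the question; the function name keep_rows and the sample rows are made up for illustration:

```shell
# keep_rows: print lines with Val1 > 3500 and 60 <= Val2 <= 99.99.
# Reads the files given as arguments, or stdin if there are none.
keep_rows() {
  awk -F';' '{
    pct = $4
    sub(/,/, ".", pct)                 # decimal comma -> decimal point
    if ($2 + 0 > 3500 && pct + 0 >= 60 && pct + 0 <= 99.99) print
  }' "$@"
}

# Hypothetical sample rows, not taken from the original data.
printf '%s\n' 'Paris;3500;Dupont;75,10' 'Lyon;4200;Martin;61,50' \
              'Nice;12000;Durand;7,50' 'Lille;3501;Petit;99,99' | keep_rows
```

With these rows only the Lyon and Lille lines should be printed: Paris fails the strict > 3500 test and Nice fails the percentage test.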

The parsing error is in [^([0-9]|[1-9][0-9]|[1-9][0-9]{2}|[1-2][0-9]{3}|3[0-4]. I'm not entirely sure where exactly, but that doesn't matter, because the approach itself is flawed:
Negated character classes [^...] do not work on whole strings. [^ab|xy] matches any single character that is not a, b, |, x, or y.
If you want to say »all strings except 0, 1, 2, ..., 3500«, you have to use something different, probably a positive formulation like »all strings from 3500, 3501, ...«.
The following regex should work for numbers >= 3500. Note that it also matches 3500 itself, whereas the question asks for values strictly greater than 3500.
0*([1-9][0-9]{4,}|[4-9][0-9]{3}|3[5-9][0-9]{2})
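As a rough illustration of the positive formulation, the regex can be anchored to the second field with grep -E. This is only a sketch: the helper name keep_ge3500 and the test rows are hypothetical, and the pattern keeps 3500 itself, as noted for the regex above.

```shell
# Regex for integers >= 3500, copied from the answer above.
ge3500='0*([1-9][0-9]{4,}|[4-9][0-9]{3}|3[5-9][0-9]{2})'

# keep_ge3500: keep lines whose second ';'-separated field matches the regex.
# The surrounding anchors force the number to fill the whole field.
keep_ge3500() {
  grep -E "^[^;]*;${ge3500};" "$@"
}

# Hypothetical rows: A and E are below 3500 and should be dropped.
printf '%s\n' 'A;3499;x;60,00' 'B;3500;x;60,00' 'C;04000;x;60,00' \
              'D;12345;x;60,00' 'E;999;x;60,00' | keep_ge3500
```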

Related

AWK: subset randomly and without replacement a string in every row of a file

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Anyone can help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by clearing the N array after each row (so that it resets before the next row is processed) and by calling srand() only once, in a BEGIN block:
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
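One way to avoid the infinite loop on short strings is a partial Fisher-Yates shuffle, which never retries an index and simply stops when the string runs out. The following is a sketch, not the code from the comments; the function name sample_chars and its argument order are assumptions made for this example.

```shell
# sample_chars SIZE FILE: for each line, print up to SIZE characters of the
# first field, drawn at random without reusing any position.
sample_chars() {
  awk -v size="$1" 'BEGIN { srand() }
  {
    n = length($1)
    for (i = 1; i <= n; i++) ch[i] = substr($1, i, 1)
    m = (size < n) ? size : n            # never draw more than n characters
    out = ""
    for (i = 1; i <= m; i++) {
      j = i + int(rand() * (n - i + 1))  # random index in i..n
      t = ch[i]; ch[i] = ch[j]; ch[j] = t
      out = out ch[i]
    }
    print out
  }' "$2"
}

tmp=$(mktemp)
printf '%s\n' ABCDEFGHIJKLMN CCCCCCCCCCCCGG ABC > "$tmp"
sample_chars 10 "$tmp"
rm -f "$tmp"
```

The third line (ABC) is shorter than the requested size, so only three characters are printed for it.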

I would use GNU AWK for this task in the following way. Let the content of file.txt be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom array traversal control function that randomly decides which element should be considered greater. 0.5 is subtracted because rand() returns values between 0 and 1, so the result is negative or positive with roughly equal probability. For each line, the array arr is populated with the characters of the line and then traversed in random order to build the string s of shuffled characters; substr then takes the first 10 characters. You might want to add a counter that terminates the for loop early if your lines are very long compared to the number of characters to select.
(tested in GNU Awk 5.0.1)

Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive iterative runs remember to check if srand works the way you expect in your version of awk. If in doubt use $RANDOM or, better, /dev/urandom.
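A sketch of seeding from /dev/urandom, as suggested above. With no argument, srand() typically seeds from the time of day, so two runs within the same second can produce the same sequence. The helper name random_seed is made up for this example.

```shell
# random_seed: print an unsigned 32-bit integer read from /dev/urandom.
random_seed() {
  od -An -N4 -tu4 /dev/urandom | tr -dc '0-9'
}

seed=$(random_seed)
# Pass the seed in explicitly; the same seed reproduces the same sequence.
awk -v seed="$seed" 'BEGIN { srand(seed); print rand() }'
```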

If you don't need to stay strictly within awk, then jot makes it super easy:
say you want 20 random characters between
"A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

How to insert a comma after 4 digits for all number with more than 8 digits

I have csv-file that looks like this:
12625,6475,387,-388,-332,-217,-104,17,125,160,121,38,-101,-282,-368
-2675,6475,420,-385,-330,-217,-106,16,124,158,120,37,-104,-281,-365
2725,6475,633,-377,-327,-222,-117,6,113,148,109,26,-114,-282,-359
-12775,6475,927,-367,-324,-229,-133,-9,99,134,95,11,-128,-283,-351
12825,64751200,-357,-320,-236,-147,-23,86,121,82,-3,-140,-283,-344
^ missing comma
In some rows I have the problem shown in the last row of the example, where a comma is missing between the second and third column. I know from the data that the most digits a legitimate entry can have is 5 (in some cases with a - in front) and all entries that have 8 digits originate from missing commas, which should appear after the fourth digit.
I am looking for an expression - presumably with sed - that inserts a comma after the fourth digit of all 8-digit numbers in the file.
What I have so far is
echo "12356" | sed 's/\B[0-9]\{3\}/&,/g'
which will insert a comma after four digits. How can I restrict this so that it only happens for 8-digit numbers, not for 5-digit numbers?
I am also open to any more elegant way that might exist to solve that problem.
Thank you

Try this sed
sed -E 's/([0-9]{4})([0-9]{4})/\1,\2/g'
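A quick check of this expression on two rows shaped like the sample data. The wrapper name fix_missing_comma is only for illustration. Five-digit values are left alone because the pattern needs eight consecutive digits.

```shell
# fix_missing_comma: split every run of eight digits into 4 + 4 with a comma.
fix_missing_comma() {
  sed -E 's/([0-9]{4})([0-9]{4})/\1,\2/g' "$@"
}

printf '%s\n' '12625,6475,387,-388' '12825,64751200,-357' | fix_missing_comma
# expected: the first row is unchanged, the second becomes 12825,6475,1200,-357
```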

Because sed has already been mentioned, here’s some awk…
awk -F, -vOFS=, '{
for (i = 1; i <= NF; ++i)
if (length($i) >= 8)
$i = substr($i, 1, 4) "," substr($i, 5)
} 1' < some_file.csv
…and here’s some pure Bash, for no good reason:
(
IFS=,
while read -ra line; do
for i in "${!line[@]}"; do
((${#line[i]} >= 8)) && line[i]="${line[i]::4},${line[i]:4}"
done
printf '%s\n' "${line[*]}"
done
) < some_file.csv

how to round the output in shell?

For our webshop we get from the manufacturers a csv file (automatically updated) with product data.
Some manufacturers list prices excluding tax and some including it.
I want to change the prices with a shell script that adds 21% tax and rounds the result to the nearest .95 or .50
For example I get a sheet:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900
I use this code:
sed -i "1 s/price/price_excl_vat/" inputfile
awk '{FS="|"; OFS="|"; if (NR<=1) {print $0 "|price"} else {print $0 "|" $5*1.21}}' inputfile > outputfile
the output is:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|30.2379
How do I round it to the correct price like below ?
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|29.95

awk to the rescue!
awk 'BEGIN {FS=OFS="|"}
$NF==$NF+0 {a=$NF*1.21;
r=a-int(a);
if (r<0.225) a=a-r-0.05;
else if (r<0.725) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
Note that in your example the nearest value for 30.2379 is 30.50. Perhaps you want to round down?
To round down instead of to the nearest value, and with a configurable price column (k), use the following. The newly computed value is appended to the end of the row.
awk 'BEGIN {FS=OFS="|"; k=5}
$k==$k+0 {a=$k*1.21;
r=a-int(a);
if (r<0.50) a=a-r-0.05;
else if (r<0.95) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
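The round-down variant can be tried on the sample row from the question. This is a sketch under a few assumptions: the net price is in column 5, the header is passed through untouched, and the result is formatted with two decimals to hide floating-point noise. The function name round_price_down is hypothetical.

```shell
# round_price_down: add 21% VAT to column 5 and round down to x.50 or x.95.
round_price_down() {
  awk -v vat=1.21 'BEGIN { FS = OFS = "|" }
  NR == 1 { print; next }                          # header passes through
  {
    gross = $5 * vat
    whole = int(gross); frac = gross - whole
    if (frac < 0.50)      price = whole - 0.05     # previous x.95
    else if (frac < 0.95) price = whole + 0.50
    else                  price = whole + 0.95
    $6 = sprintf("%.2f", price)
    print
  }' "$@"
}

printf '%s\n' 'sku|ean|name|type|price_excl_vat|price' \
              'EU-123|123123123123|Product name|simple|24.9900' | round_price_down
# the data row should end in |24.9900|29.95
```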

awk '#define field separator in and out
BEGIN{FS=OFS="|"}
# add/modify a 6th field for price label if missing on header only
NR==1 && NF == 6 { $6 = "price"; print; next}
NR==1 && NF == 5 { $6 = "price"; print; next}
# add the price including VAT, truncated to 0.01, if missing
NF == 5 { $6 = int( $5 * 121 ) / 100 }
# print the line (modified or not, e.g. empty lines); 7 is just a non-zero (true) pattern
7
' inputfile \
> outputfile
The script is self-documenting.
I'm not sure about your sed command for the header, because your sample already shows a header with price, so take whichever you need.

Not knowing what your program looks like makes it difficult to give you more information.
However, both awk and bash have a printf command, which can be used for rounding floating point numbers. (Yes, Bash only does integer arithmetic, but its printf can still format a decimal number.)
I gave you the link for the C printf command because the one for Bash doesn't include the formatting codes. Read it and weep because the documentation is a bit dense, and if you've never used printf before, it can be quite difficult to understand. Fortunately, an example will bring things to light:
$ foo="23.42532"
$ printf "%2.2f\n" "$foo"
23.43 #All rounded for you!
The % marks the beginning of a formatting sequence, and the f means the value is a floating point number. In 2.2, the first 2 is the minimum total field width and the .2 is the precision, i.e. two digits after the decimal point. If you said %4.2f, it would make sure the whole number, including the decimal point and the digits after it, takes up at least four characters, left-padding it with spaces if it is shorter. The \n on the end is the newline character.
Fortunately, although printf can be hard to understand at first, it's pretty much the same in almost all programming languages. It's in awk, Perl, Python, C, Java, and many more languages. And, if the information you need isn't in printf, try the documentation on sprintf which is like printf, but prints the formatted text into a string.
The best documentation I've seen is in the Perl sprintf documentation because it gives you plenty of examples.
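A few variations may make the width and precision fields easier to see. This is a small sketch: the brackets are only there to make the padding visible, and the helper name round2 is made up.

```shell
export LC_ALL=C   # assumption: force "." as the decimal separator for this demo

# round2: format a number with two digits after the decimal point.
round2() { printf '%.2f\n' "$1"; }

foo="23.42532"
round2 "$foo"               # 23.43
printf '[%8.2f]\n' "$foo"   # [   23.43]  minimum total width 8, right-aligned
printf '[%-8.2f]\n' "$foo"  # [23.43   ]  the - flag left-aligns instead
```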

Bash: arithmetic addressed by line number and column

I have normally done this with Excel, but as I am trying to learn bash, I'd like to ask for advice here on how to do so. My input file resembles:
# s0 legend "1001"
# s1 legend "1002"
#target G0.S0
#type xy
2.0 -1052.7396157664
2.5 -1052.7330560932
3.0 -1052.7540013664
3.5 -1052.7780321236
4.0 -1052.7948229060
4.5 -1052.8081313831
5.0 -1052.8190310613
&
#target G0.S1
#type xy
2.0 -1052.5384564253
2.5 -1052.7040374678
3.0 -1052.7542803612
3.5 -1052.7781686744
4.0 -1052.7948927247
4.5 -1052.8081704241
5.0 -1052.8190543049
&
where the above only shows two data sets: s0 and s1. In reality I have 17 data sets and will combine them arbitrarily. By combine, I mean I would like to:
For two data sets, extract the second column of each separately.
Subtract these two columns row by row.
Multiply the difference by a constant, $C.
Note: $C multiplies very small numbers and the only way I could get it to not divide by zero was to take a massive scale.
Edit: After requests, I was apparently not entirely clear what I was going for. Take for example:
set0
2 x
3 y
4 z
set1
2 r
3 s
4 t
I also have defined a constant C.
I would like to perform the following operation:
C*(r - x)
C*(s - y)
C*(t - z)
I will be doing this for sets > 1, up to 16, for example (set 10) minus (set 0). Therefore, I need the flexibility to target a value based on its line number and column number, and preferably acting over a range of line numbers to make it efficient.
So far this works:
C=$(echo "scale=45;x=(small numbers)*(small numbers); x" | bc -l)
sed -n '5,11p' input.in | cut -c 5-20 > tmp1.in
sed -n '15,21p' input.in | cut -c 5-20 > tmp2.in
pr -m -t -s tmp1.in tmp2.in > tmp3.in
awk '{printf $2-$1 "\n"}' tmp3.in > tmp4.in
but the multiplication failed:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
returning:
0.00
0.00
0.00
0.00
0.00
0.00
0.00
I have a feeling the whole thing can be accomplished more elegantly with awk. I also tried this:
for (( i=0; i<=6; i++ ))
do
n=5+$i
m=10+n
awk 'NR==n{a=$2};NR==m{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
done
but all I get in temp.in is a long column of 0s.
I also tried
awk 'NR==5,NR==11{a=$2};NR==15,NR==21{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
but got the error
awk: (FILENAME=input.in FNR=20) fatal: attempt to access field -1052
Any idea how to formulate this with awk, and if that doesn't work, then why I cannot multiply with awk above? Thank you!

this does the math in one go
$ awk -v c=1 '/^&/ {s++}
s==1 {a[$1]=$2}
s==3 {print $1,a[$1],$2,c*(a[$1]-$2)}
/#type/ {s++}' file
2.0 -1052.7396157664 -1052.5384564253 -0.201159
2.5 -1052.7330560932 -1052.7040374678 -0.0290186
3.0 -1052.7540013664 -1052.7542803612 0.000278995
3.5 -1052.7780321236 -1052.7781686744 0.000136551
4.0 -1052.7948229060 -1052.7948927247 6.98187e-05
4.5 -1052.8081313831 -1052.8081704241 3.9041e-05
5.0 -1052.8190310613 -1052.8190543049 2.32436e-05
You can easily remove the decorations and add print formatting. The magic numbers 1 and 3 correspond to data groups 1 and 2 in the order they appear in the data file (for group g the value is 2*g-1); they can be converted to awk variables as well.
The counter s keeps track of whether you're in a set or not: odd numbers correspond to sets and even numbers to the gaps between sets. It is incremented at both the start pattern and the end pattern. The increment statements are ordered so that the pattern lines themselves are not processed as data (leave the set first, handle the set's values, enter the set last). You can change the order and observe the effects.

This might be what you're looking for:
$ cat tst.awk
/^[#&]/ { lineNr=0; next }
{
++lineNr
if (lineNr in prev) {
print $1, c * ($2 - prev[lineNr])
}
prev[lineNr] = $2
}
$ awk -v c=100000 -f tst.awk file
2.0 20115.9
2.5 2901.86
3.0 -27.8995
3.5 -13.6551
4.0 -6.98187
4.5 -3.9041
5.0 -2.32436
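Since the question mentions combining arbitrary pairs of data sets, here is a sketch that selects two sets by number. It assumes every set begins with a #target line and that sets are numbered from 0 in order of appearance. The function name diff_sets and its argument order are hypothetical.

```shell
# diff_sets A B C FILE: print x and C * (set_B - set_A) row by row.
diff_sets() {
  awk -v a="$1" -v b="$2" -v c="$3" '
    /^#target/ { set++; row = 0; next }   # a new set starts; first set has set == 1
    /^[#&]/    { next }                   # skip other header and separator lines
    {
      row++
      x[row] = $1
      if (set == a + 1) va[row] = $2
      if (set == b + 1) vb[row] = $2
      if (row > rows) rows = row
    }
    END {
      for (i = 1; i <= rows; i++)
        if ((i in va) && (i in vb))
          printf "%s %.6f\n", x[i], c * (vb[i] - va[i])
    }' "$4"
}

# Usage, with the file name from the question:
#   diff_sets 0 1 "$C" input.in     # C * (set 1 - set 0)
```

Collecting both columns first and subtracting in END means the two sets can appear in either order in the file.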

In your first attempt, you should replace this line:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
with this one:
awk -v C="$C" '{printf "%11.2f\n", C*$1 }' tmp4.in > tmp5.in
You are mixing bash notation with awk notation.
In the shell, you define a variable without $ and use it with $.
Inside an awk script, variables are used without $; the $ is reserved for fields such as $1, $2, ...
You have put single quotes ' around your awk script, so shell variables are not expanded inside it: you wrote $C, but the shell never sees it within the single quotes, and awk treats "$C" as a literal string, which evaluates to 0 in arithmetic. That is why you have to write awk -v C="$C", so that the value of the shell variable $C is passed to an awk variable called C.
Your other awk attempts contain similar errors. Now I think you'll manage.
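The difference can be seen in a small experiment. This is only a sketch: the value of C and the helper name scale are made up for the demonstration.

```shell
# scale FACTOR: multiply the first field of each input line by FACTOR.
scale() {
  awk -v C="$1" '{ printf "%11.2f\n", C * $1 }'
}

C=0.5
printf '%s\n' 2 4 | awk '{ printf "%11.2f\n", "$C" * $1 }'  # 0.00 twice: "$C" is a literal string in awk
printf '%s\n' 2 4 | scale "$C"                              # 1.00 and 2.00
```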

calculate distance; subtract the first column of the second line from the second column of the first line using awk

I have a question. I have a file with coordinates (TAB separated)
2 10
35 50
90 200
400 10000
...
I would like to subtract the first column of the second line from the second column of the first line, i.e. calculate the distance, i.e. I would like a file with
25
40
200
...
How could I do that using awk?
Thank you very much in advance

Here is an awk one-liner that may help you:
kent$ awk 'a{print $1-a}{a=$2}' file
25
40
200
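The one-liner uses the previous value itself as the condition, so it would skip a line whenever that value happens to be 0. A slightly more defensive sketch tests the line number instead; the function name gaps is hypothetical.

```shell
# gaps: print (column 1 of this line) - (column 2 of the previous line).
gaps() {
  awk 'NR > 1 { print $1 - prev } { prev = $2 }' "$@"
}

printf '2\t10\n35\t50\n90\t200\n400\t10000\n' | gaps
# prints 25, 40 and 200, one per line
```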

Here's a pure bash solution:
{
read _ ps
while read f s; do
echo $((f-ps))
((ps=s))
done
} < input_file
This only works if you have (small) integers, as it uses bash's arithmetic. If you want to deal with arbitrary sized integers or floats, you can use bc (with only one fork):
{
read _ ps
while read f s; do
printf '%s-%s\n' "$f" "$ps"
ps=$s
done
} < input_file | bc
Now I'll leave it to others to give an awk answer!
Alright, since nobody wants to upvote my answer, here's a really funny solution that uses bash and bc:
a=( $(<input_file) )
printf -- '-(%s)+(%s);\n' "${a[@]:1:${#a[@]}-2}" | bc
or the same with dc (shorter but doesn't work with negative numbers):
a=( $(<input_file) )
printf '%s %sr-pc' "${a[@]:1:${#a[@]}-2}" | dc

using sed and ksh for evaluation
sed -n "
1x
1!H
$ !b
x
s/^ *[0-9]\{1,\} \(.*\) [0-9]\{1,\} *\n* *$/\1 /
s/\([0-9]\{1,\}\)\(\n\)\([0-9]\{1,\}\) /echo \$((\3 - \1))\2/g
s/\n *$//
w /tmp/Evaluate.me
"
. /tmp/Evaluate.me
rm /tmp/Evaluate.me
