I work in telecoms and regularly need to expand number ranges.
For example, 6121234567X [note that there are 10 digits preceding the X] is shorthand for:
61212345670
61212345671
61212345672 ... etc. (a 10-number range)
and 612123456X [note that there are only 9 digits preceding the X] is shorthand for
61212345600
61212345601 ... etc. (a 100-number range)
So I need a grep command that...
reads how many characters in the line precede the X (to determine how many suffixes)
writes the appropriate number of lines (10, 100, or 1000) with ascending suffixes
hopefully removes the original line
Below is a Python script that does it; the file name is expected as the first argument. Example usage: python script.py file.in > file.out
#!/usr/bin/env python
import sys

def generate(pattern):
    # index of the X; everything before it is the fixed prefix
    p = pattern.lower().find('x')
    lines = []
    for i in range(10**(11 - p)):
        lines.append(pattern[:p] + str(i).zfill(11 - p))
    return "\n".join(lines)

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Filename needed!")
    else:
        with open(sys.argv[1]) as f:
            for ln in f:
                print(generate(ln.rstrip()))
You can do this in awk quite quickly:
awk -v val=$a -v max=11
'BEGIN {
  gsub("X","",val)
  items=max - length(val)
  for (i=0; i<10^items; i++)
    print val*(10^items)+i
}'
This works as an example. To do the same reading from a file, you just need to use $1 (the first field of each line) instead of val and move the code out of the BEGIN block into the main block.
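A sketch of that file-reading variant (assuming one pattern per line in file.in, and 11-digit numbers as in the question):

awk -v max=11 '{
  val=$1
  gsub("X","",val)
  items=max - length(val)
  for (i=0; i<10^items; i++)
    print val*(10^items)+i
}' file.in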
Explanation
-v val=$a -v max=11 pass parameters: $a is the variable containing a string of the form 12345678X, and max contains the maximum number of digits the final numbers will have (11 in your case).
BEGIN {} perform all these actions before reading any input.
gsub("X","",val) remove X from val.
items=max - length(val) count how many digits are missing, i.e. the length of the suffix to generate.
for (i=0; i<10^items; i++) print val*(10^items)+i loop from 0 to 10^items - 1; that is, 10 or 100 lines, depending on how many digits the X stands for.
Test
With 9 as maximum:
$ a=12345678X
$ awk -v val=$a -v max=9 'BEGIN {gsub("X","",val); items=max - length(val); for (i=0; i<10^items; i++) print val*(10^items)+i}'
123456780
123456781
123456782
123456783
123456784
123456785
123456786
123456787
123456788
123456789
echo 6121234567X | perl -nE 'm/(.*)X/;
say $1. $_ foreach (0..10**(11-length $1)-1)'
61212345670
61212345671
61212345672
61212345673
61212345674
61212345675
61212345676
61212345677
61212345678
61212345679
It's a little uglier to get the zero-padded format:
echo 611234567X | perl -wne 'm/(.*)X/; $b=$1; $r=11 - length $b;
$fmt="%0" . $r . "s\n";
printf "$b$fmt", $_ foreach (0..10**$r-1) '
Related
I have the following script that creates multiple objects.
I tried simply running it in my terminal, but it seems to take very long. How can I run this with GNU parallel?
The script below creates an object. It loops through niy = 1 to 800 and, for every increment of niy, loops through njx = 1 to 675.
#!/bin/csh
set njx = 675 ### Number of grids in X
set niy = 800 ### Number of grids in Y
set ll_x = -337500
set ll_y = -400000 ### (63 / 2) * 1000 ### This is the coordinate of the lower-left corner
set del_x = 1000
set del_y = 1000
rm -f out.shp
rm -f out.shx
rm -f out.dbf
rm -f out.prj
shpcreate out polygon
dbfcreate out -n ID1 10 0
@ n = 0 ### initialization of counter (n) to count grid cells in loop
@ iy = 1 ### initialization of counter (iy) to count grid cells along north-south direction
echo ### empty line on screen
while ($iy <= $niy) ### start the loop for north-south direction
echo ' south-north' $iy '/' $niy ### print a notification on screen
@ jx = 1
while ($jx <= $njx) ### start the loop for east-west direction
@ n++
set x = `echo $ll_x $jx $del_x | awk '{print $1 + ($2 - 1) * $3}'`
set y = `echo $ll_y $iy $del_y | awk '{print $1 + ($2 - 1) * $3}'`
set txt = `echo $x $y $del_x $del_y | awk '{print $1, $2, $1, $2 + $4, $1 + $3, $2 + $4, $1 + $3, $2, $1, $2}'`
shpadd out `echo $txt`
dbfadd out $n
@ jx++
end ### close the second loop
@ iy++
end ### close the first loop
echo
### lines below create a projection file for the created shapefile using a here-document
cat > out.prj << eof
PROJCS["Asia_Lambert_Conformal_Conic",GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",120.98],PARAMETER["Standard_Parallel_1",5.0],PARAMETER["Standard_Parallel_2",20.0],PARAMETER["Latitude_Of_Origin",14.59998],UNIT["Meter",1.0]]
eof
The inner part gets executed 540,000 times and on each iteration you invoke 3 awk processes to do 3 simple bits of maths... that's 1.6 million awks.
Rather than that, I have written a single awk script that generates all the loops and does all the maths; its output can then be fed into bash or csh to actually execute it.
I wrote this and ran it completely in the time the original version got to 16% through. I have not checked it extremely thoroughly, but you should be able to readily correct any minor errors:
#!/bin/bash
awk -v njx=675 -v niy=800 -v ll_x=-337500 -v ll_y=-400000 -v del_x=1000 -v del_y=1000 '
BEGIN{
   print "shpcreate out polygon"
   print "dbfcreate out -n ID1 10 0"
   n=0
   for(iy=1;iy<=niy;iy++){
      for(jx=1;jx<=njx;jx++){
         n++
         x = ll_x + (jx-1)*del_x
         y = ll_y + (iy-1)*del_y
         txt = sprintf("%d %d %d %d %d %d %d %d %d %d", x, y, x, y+del_y, x+del_x, y+del_y, x+del_x, y, x, y)
         print "shpadd out", txt
         print "dbfadd out", n
      }
   }
}' /dev/null
If the output looks good, you can then run it through bash or csh like this:
./MyAwk | csh
Note that I don't know anything about these Shapefile tools, shpadd and dbfadd. They may or may not be able to run in parallel - if they are anything like sqlite, running them in parallel will not help you much. I am guessing the changes above are enough to make a massive improvement to your runtime. If not, here are some other things you could think about.
You could append an ampersand (&) to each line that starts with dbfadd or shpadd so that several start in parallel, and then print a wait after every 8 lines so that you run 8 things in parallel in chunks - see the sketch below.
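A minimal sketch of that chunking idea (hypothetical post-processing of the generated commands; it assumes the shpadd/dbfadd calls are safe to run concurrently, which I have not verified):

./MyAwk | awk '/^(shpadd|dbfadd) / { print $0 " &"; if (++c % 8 == 0) print "wait"; next }
               { print }
               END { print "wait" }' | bash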
You could feed the output of the script directly into GNU Parallel, but I have no idea if the ordering of the lines is critical.
I presume this is creating some sort of database. It may be faster if you run it on a RAM-backed filesystem, such as /tmp.
I notice there is a Python module for manipulating Shapefiles here. I can't help thinking that would be many, many times faster still.
I have a large file that contains 2 IPs per line - and there's about 3 million lines total.
Here's an example of the file:
1.32.0.0,1.32.255.255
5.72.0.0,5.75.255.255
5.180.0.0,5.183.255.255
222.127.228.22,222.127.228.23
222.127.228.24,222.127.228.24
I need to convert each IP to an IP Decimal, like this:
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
I'd prefer a way to do this strictly via command line. I'm okay with perl or python being used, as long as it doesn't require extra modules to be installed.
I thought I had come across a way that someone converted IPs like this using sed but can't seem to find that tutorial anymore. Any help would be appreciated.
If you have GNU awk installed (needed for the RT variable), you could use this one-liner:
awk -F. -v RS='[\n,]' '{printf "%d%s", (($1*256+$2)*256+$3)*256+$4, RT}' file
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
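If your awk lacks RT (it is a GNU extension), a sketch of the same conversion that splits on both dots and commas (this assumes exactly two IPs per line, as in the sample):

awk -F'[.,]' '{ printf "%d,%d\n", (($1*256+$2)*256+$3)*256+$4, (($5*256+$6)*256+$7)*256+$8 }' file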
Here is a Python solution that uses only standard modules (re and sys):
import re
import sys

def multiplier_generator():
    """Cyclic generator of powers of 256 (from 256**3 down to 256**0).

    The multipliers tuple could be replaced by inline calculation
    of the power, but this approach has better performance.
    """
    multipliers = (
        256**3,
        256**2,
        256**1,
        256**0,
    )
    idx = 0
    while True:
        yield multipliers[idx]
        idx = (idx + 1) % 4

def replacer(match_object):
    """re.sub replacer for an IP group"""
    multiplier = multiplier_generator()
    res = 0
    for i in range(1, 5):
        res += next(multiplier) * int(match_object.group(i))
    return str(res)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1], 'r') as f:
            std_in = f.read()
    else:
        std_in = sys.stdin.read()
    print(re.sub(r"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)", replacer, std_in))
This solution replaces every IP address found in the text, read either from standard input or from a file passed as the first parameter, e.g.:
python convert.py < input_file.txt, or
python convert.py file.txt, or
echo "1.2.3.4, 5.6.7.8" | python convert.py.
With bash:
ip2dec() {
    set -- ${1//./ } # split $1 on "." into $1 $2 $3 $4
    declare -i dec   # set integer attribute
    dec=$1*256*256*256+$2*256*256+$3*256+$4
    echo -n $dec
}
while IFS=, read -r a b; do ip2dec $a; echo -n ,; ip2dec $b; echo; done < file
Output:
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
With bash, using bit shifts instead of multiplication:
ip2dec() { local IFS=.
    set -- $1 # split $1 on "." into $1 $2 $3 $4
    printf '%s' "$(( ($1<<24) + ($2<<16) + ($3<<8) + $4 ))" # parentheses matter: + binds tighter than << in shell arithmetic
}
while IFS=, read -r a b; do
printf '%s,%s\n' "$(ip2dec $a)" "$(ip2dec $b)"
done < file
Reading a text file into an array, extracting elements and sorting them is taking a very long time.
The text file is ffmpeg console output for R128 audio analysis. I need to get the highest M and S values. Example:
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS LRA: 9.8 LU FTPK: -12.4 dBFS TPK: -9.7 dBFS
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS LRA: 11.3 LU FTPK: -12.7 dBFS TPK: -9.7 dBFS
The text file can be hundreds or thousands of lines long, depending on the duration of the audio file being analysed.
I want to find the highest M (-22.2) and S (-28.6) values and assign them to the variables M and S.
This is what I am using currently:
ARRAY=()
while read LINE
do
ARRAY+=("$LINE")
done < $tempDir/text.txt
for LINE in "${ARRAY[#]}"
do
echo "$LINE" | sed -n ‘/B:/p' | sed 's/S:.*//' | sed -n -e 's/^.*M://p' | sed -n -e 's/-//p' >>/$tempDir/R128M.txt
done
for LINE in "${ARRAY[#]}"
do
echo "$LINE" | sed -n '/M:/p' | sed 's/I:.*//' | sed -n -e 's/^.*S://p' | sed -n -e 's/-//p' >>$tempDir/R128S.txt
done
cat $tempDir/R128M.txt
M=( $(sort $tempDir/R128M.txt) )
cat $tempDir/R128S.txt
S=( $(sort $tempDir/R128S.txt) )
Is there a faster way of doing this?
Rather than reading the whole file into memory, writing bits of it out to separate files, and reading those in again, just parse it and pick out the largest values:
$ awk '$7 > m || m == "" { m = $7 } $9 > s || s == "" { s = $9 } END { print m, s }' data
-22.2 -28.6
In your data, fields 7 and 9 contain the values of M and S. The awk script updates its m and s variables whenever it finds larger values in these fields, and prints the largest found at the end. The m == "" and s == "" tests are needed to trigger initialization of the values when no value has been read yet.
Another way with awk, which may look cleaner:
$ awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { print m, s }' data
To assign them to M and S in the shell:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%f S=%f\n", m, s) }' data )
$ echo $M $S
-22.200000 -28.600000
Adjust the printf() format to use %s instead of %f if you want the original strings instead of float values, or set the number of decimals you might want with, e.g., %.2f in place of %f.
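For example, keeping the original one-decimal strings:

$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%s S=%s\n", m, s) }' data )
$ echo $M $S
-22.2 -28.6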
First of all, a pipeline of several sed processes is overkill for extracting a single value, especially when you instantiate it anew for every line.
Next, you save all the values into a file and then sort that file, while all you need is the maximum value. You can easily find it during the very first (value extraction) loop, in O(N) time, instead of paying the I/O overhead plus O(N log N) for the sort. See ARITHMETIC EXPANSION and conditional expressions in the bash manual; a sketch of the idea follows.
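A minimal sketch, assuming the field layout of the sample above and that every value has exactly one decimal digit (so stripping the dot turns -22.2 into the comparable integer -222); maxM and maxS are illustrative names:

maxM= maxS=
while read -r line; do
    [[ $line =~ M:\ +(-?[0-9.]+)\ +S:\ +(-?[0-9.]+) ]] || continue
    m=${BASH_REMATCH[1]} s=${BASH_REMATCH[2]}
    if [[ -z $maxM ]] || (( ${m/./} > ${maxM/./} )); then maxM=$m; fi
    if [[ -z $maxS ]] || (( ${s/./} > ${maxS/./} )); then maxS=$s; fi
done < "$tempDir/text.txt"
M=$maxM S=$maxS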
I am trying to use bc in an awk script. In the code below, I am trying to convert a hexadecimal number to binary and store it in a variable.
#!/bin/awk -f
{
binary_vector = $(bc <<< "ibase=16;obase=2;FF")
}
Where do I go wrong?
Not saying it's a good idea but:
$ awk 'BEGIN {
    cmd = "bc <<< \"ibase=16;obase=2;FF\""
    rslt = ((cmd | getline line) > 0 ? line : -1)
    close(cmd)
    print rslt
}'
11111111
Also see http://gnu.org/software/gawk/manual/gawk.html#Bitwise-Functions and http://gnu.org/software/gawk/manual/gawk.html#Nondecimal-Data
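As a sketch of what those manual sections enable, GNU awk can do the conversion without bc at all: strtonum() parses the hex string and a short loop builds the binary digits:

gawk 'BEGIN {
    n = strtonum("0xFF")   # hex string -> 255
    b = ""
    while (n > 0) { b = (n % 2) b; n = int(n / 2) }
    print (b == "" ? "0" : b)   # prints 11111111
}'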
The following one-liner Awk script should do what you want:
awk -vVAR=$(read -p "Enter number: " -u 0 num; echo $num) \
'BEGIN{system("echo \"ibase=16;obase=2;"VAR"\"|bc");}'
Explanation:
-vVAR Passes the variable VAR into Awk
-vVAR=$(read -p ... ) Sets the variable VAR from the
shell to the user input.
system("echo ... |bc") Uses the Awk system built in command to execute the shell commands. Notice how the quoting stops at the variable VAR and then continues just after it, thats so that Awk interprets VAR as an Awk variable and not as part of the string put into the system call.
Update - to use it in an Awk variable:
awk -vVAR=$(read -p "Enter number: " -u 0 num; echo $num) \
'BEGIN{s="echo \"ibase=16;obase=2;"VAR"\"|bc"; s | getline awk_var;\
close(s); print awk_var}'
s | getline awk_var will put the output of the command s into the Awk variable awk_var. Note that the string is built before being handed to getline - otherwise (unless you parenthesize the string concatenation) Awk will try to send the separate pieces of the concatenation to getline individually.
The close(s) closes the pipe - although for bc it doesn't matter and Awk automatically closes pipes upon exit - if you put this into a more elaborate Awk script it is best to explicitly close the pipe. According to the Awk documentation some commands such as mail will wait on the pipe to close prior to completion.
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_39.html
By the way you wrote your example, it looks like you want to convert an awk record ( line ) into an associative array. Here's an awk executable script that allows that by running the bc command over values in a split type array:
#!/usr/bin/awk -f
{
    # initialize the a array
    cnt = split($0, a, FS)
    if( convertArrayBase(10, 2, a, cnt) > -1 ) {
        # use the array here
        for(i=1; i<=cnt; i++) {
            print a[i]
        }
    }
}
# Destructively updates input array, converting numbers from ibase to obase
#
# #ibase: ibase value for bc
# #obase: obase value for bc
# #a: a split() type associative array where keys are numeric
# #cnt: size of a ( number of fields )
#
# #return: -1 if there's a getline error, else cnt
#
function convertArrayBase(ibase, obase, a, cnt, i, b, cmd) {
    cmd = sprintf("echo \"ibase=%d;obase=%d", ibase, obase)
    for(i=1; i<=cnt; i++ ) {
        cmd = cmd ";" a[i]
    }
    cmd = cmd "\" | bc"
    i = 0 # reset i
    while( (cmd | getline b) > 0 ) {
        a[++i] = b
    }
    close( cmd )
    return i==cnt ? cnt : -1
}
When used with an input of:
1 2 3
4 s 1234567
this script outputs the following:
1
10
11
100
0
100101101011010000111
The convertArrayBase function operates on split type arrays. So you have to initialize the input array (a here) with the full row (as shown) or a field's subfields (not shown) before calling it. It destructively updates the array.
You could instead call bc directly with some helper files to get similar output. I didn't find that bc supports - (stdin as a file name), so it's a little more involved than I'd like.
Making a start_cmds file like this:
ibase=10;obase=2;
and a quit_cmd like:
;quit
Given an input file (called data.semi) where the data is separated by a ;, like this:
1;2;3
4;s;1234567
you can run bc like:
$ bc -q start_cmds data.semi quit_cmd
1
10
11
100
0
100101101011010000111
which is the same data that the awk script outputs, but calling bc only a single time with all of the inputs. Now, while that data isn't in an awk associative array in a script, the bc output can be piped into awk and reassembled into an array like:
bc -q start_cmds data.semi quit_cmd | awk 'FNR==NR {a[FNR]=$1; next} END { for( k in a ) print k, a[k] }' -
1 1
2 10
3 11
4 100
5 0
6 100101101011010000111
where the final dash is telling awk to treat stdin as an input file and lets you add other files later for processing.
I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
It can use any common command line language like awk, perl, python etc.
To see a frequency count for column two (for example):
awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
fileA.txt
z z a
a b c
w d e
fileB.txt
t r e
z d a
a g c
fileC.txt
z r a
v d c
a m c
Result:
3 d
2 r
1 z
1 m
1 g
1 b
Here is a way to do it in the shell:
FIELD=2
cut -f $FIELD * | sort | uniq -c | sort -nr
This is the sort of thing bash is great at.
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
You can pipe through sort -nr (and reverse word and freq[word]) to see the result in descending order.
If you want a specific column, you can omit the for loop and simply write freq[$3]++ - replace 3 with the column number.
Here goes:
# wordfreq.awk --- print list of word frequencies
{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
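For instance, with the script saved as wordfreq.awk (a hypothetical file name), you can get the most frequent words first without editing the script:

awk -f wordfreq.awk * | sort -t $'\t' -k2,2nr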
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}
Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
column 0:
a 3
z 3
t 1
v 1
w 1
column 1:
d 3
r 2
b 1
g 1
m 1
z 1
column 2:
c 4
a 3
e 2
.csv input
If your input files are .csv, change /\s+/ to /,/
Obfuscation
In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:
perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
Ruby (1.9+)
#!/usr/bin/env ruby
Dir["*"].each do |file|
h=Hash.new(0)
open(file).each do |row|
row.chomp.split("\t").each do |w|
h[ w ] += 1
end
end
h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end
Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on... tee and wc instead!
$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
Pure-Bash version:
FIELD=1
declare -A results
while read -r -a line; do
results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1));
done < file.txt
echo "${results[@]@A}"
The key logic is to fill an associative array which keys are the values found in the file and the array's value is the number of occurrence:
$FIELD is the selected column number
${line[$FIELD]} is the column value from that line in the file
${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)
To have the output sorted in the expected OP format, a little more work is needed:
sort -rn < <(
for k in "${!results[#]}"; do
echo "${results[$k]} $k";
done
)
Warning: it works well for tab-delimited and space-delimited files, but works badly for values with spaces in them; see the tab-only variant sketched below.
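Since the question's files are tab-delimited, splitting on tabs only (a sketch; FIELD and file.txt as above) keeps values with internal spaces intact:

FIELD=1
declare -A results
while IFS=$'\t' read -r -a line; do
    results[${line[$FIELD]:-(empty)}]=$(( ${results[${line[$FIELD]:-(empty)}]:-0} + 1 ))
done < file.txt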