Here is the test file on google drive.
sample :test file
I want to list all bytes non ascii byte which beyond \x00-\x7f with awk in the test file.
There are 12 bytes beyond \x00-\x7f.
It is my try.
awk 'BEGIN{FS=""}{for(i=1;i<=NF;++i)if($i~/[^\x00-\x7f]/)print i,$i}' test
146 “
148 ”
181 “
184 ”
awk 'BEGIN{FS=""}{for(i=1;i<=NF;++i)if($i~/[^\x00-\x7f]/)printf("%d %x \n", i,$i)}' test
146 0
148 0
181 0
184 0
Failed,how to list all the 12 bytes in the file as below format.
146 e2
147 80
148 9c
150 e2
151 80
152 9d
185 e2
186 80
187 9c
190 e2
191 80
192 9d
export LC_ALL=C
awk 'BEGIN{FS=""}{for(i=1;i<=NF;++i)if($i~/[^\x00-\x7f]/)printf("%d %c\n",i,$i)}' test
146
147 �
148 �
150
151 �
152 �
185
186 �
187 �
190
191 �
192 �
How to fix my code?
I'm in a UTF8 shell:
$ locale
LANG=en_US.UTF-8
...
so first:
$ export LC_ALL=C
Then:
$ awk -F '' ' # split record in fields
BEGIN { for(n=0;n<256;n++) # iterate all values
ord[sprintf("%c",n)]=n } # make a hash ord[char]=n
{ for(i=1;i<=NF;i++) # iterate all fields
if(ord[$i]>127) # beyond 7f
print ord[$i] } # print n (value)
' test
Outputs:
226
128
156
226
128
157
226
128
156
226
128
157
which in hex would be:
e2
80
9c
...
Related
I have a tab file with two columns like that
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
and the desired output should be
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 14 27
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 6 209 299 305
2 10 13 23 40 47 58 87 10 23 40 58
I would like to change the numbers in 2nd column for random numbers in 1st column resulting in an output in 2nd column with the same number of numbers. I mean e.g. if there are four numbers in 2nd column for x row, the output must have four random numbers from 1st column for this row, and so on...
I'm try to create two arrays by AWK and split and replace every number in 2nd column for numbers in 1st column but not in a randomly way. I have seen the rand() function but I don't know exactly how joint these two things in a script. Is it possible to do in BASH environment or are there other better ways to do it in BASH environment? Thanks in advance
awk to the rescue!
$ awk -F'\t' 'function shuf(a,n)
{for(i=1;i<n;i++)
{j=i+int(rand()*(n+1-i));
t=a[i]; a[i]=a[j]; a[j]=t}}
function join(a,n,x,s)
{for(i=1;i<=n;i++) {x=x s a[i]; s=" "}
return x}
BEGIN{srand()}
{an=split($1,a," ");
shuf(a,an);
bn=split($2,b," ");
delete m; delete c; j=0;
for(i=1;i<=bn;i++) m[b[i]];
# pull elements from a upto required sample size,
# not intersecting with the previous sample set
for(i=1;i<=an && j<bn;i++) if(!(a[i] in m)) c[++j]=a[i];
cn=asort(c);
print $1 FS join(c,cn)}' file
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 85 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 20 205 294 295
2 10 13 23 40 47 58 87 10 13 47 87
shuffle (standard algorithm) the input array, sample required number of elements, additional requirement is no intersection with the existing sample set. Helper structure map to keep existing sample set and used for in tests. The rest should be easy to read.
Assuming that there is a tab delimiting the two columns, and each column is a space delimited list:
awk 'BEGIN{srand()}
{n=split($1,a," ");
m=split($2,b," ");
printf "%s\t",$1;
for (i=1;i<=m;i++)
printf "%d%c", a[int(rand() * n) +1], (i == m) ? "\n" : " "
}' FS=\\t input
Try this:
# This can be an external file of course
# Note COL1 and COL2 seprated by hard TAB
cat <<EOF > d1.txt
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
EOF
# Loop to read each line, not econvert TAB to:, though could have used IFS
cat d1.txt | sed 's/ /:/' | while read LINE
do
# Get the 1st column data
COL1=$( echo ${LINE} | cut -d':' -f1 )
# Get col1 number of items
NUM_COL1=$( echo ${COL1} | wc -w )
# Get col2 number of items
NUM_COL2=$( echo ${LINE} | cut -d':' -f2 | wc -w )
# Now split col1 items into an array
read -r -a COL1_NUMS <<< "${COL1}"
COL2=" "
# THis loop runs once for each COL2 item
COUNT=0
while [ ${COUNT} -lt ${NUM_COL2} ]
do
# Generate a random number to use as teh random index for COL1
COL1_IDX=${RANDOM}
let "COL1_IDX %= ${NUM_COL1}"
NEW_NUM=${COL1_NUMS[${COL1_IDX}]}
# Check for duplicate
DUP_FOUND=$( echo "${COL2}" | grep ${NEW_NUM} )
if [ -z "${DUP_FOUND}" ]
then
# Not a duplicate, increment loop conter and do next one
let "COUNT = COUNT + 1 "
# Add the random COL1 item to COL2
COL2="${COL2} ${COL1_NUMS[${COL1_IDX}]}"
fi
done
# Sort COL2
COL2=$( echo ${COL2} | tr ' ' '\012' | sort -n | tr '\012' ' ' )
# Print
echo ${COL1} :: ${COL2}
done
Output:
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 :: 88 95
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 :: 20 299 304 305
2 10 13 40 47 58 :: 2 10 40 58
I have a list of more than 100000 records.
per example the values from 21 to 84 are continuous, then it will be 21-84 but if it is not continuous as the case 84 87, then it need to be 84,87 separated by ,
at beginning of each line will be the value 11111.
The values from the list will be in the column range of 21 to 80 with, at last.
The length of each row need to be maximum 80.
here is the input file.
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
87
85
86
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
108
111
109
112
110
113
115
114
117
116
118
124
125
120
122
123
126
132
127
133
128
130
131
135
136
137
138
139
140
141
142
143
144
145
146
148
147
149
150
151
152
153
154
155
156
158
157
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
184
183
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
214
here in the output file desired.
111111 21-84,87,85-86,88-106,108,111,109,112,110,113,115,114,117,
111111 116,118,124-125,120,122-123,126,132,127,133,128,130-131,
111111 135-146,148,147,149-156,158,157,159-182,184,183,185-212,214,
thanks in advance
Presented without explanation: check the man pages for the commands used and come back with questions:
awk '
function printrange() { print start (start == last ? "" : "-" last) }
NR == 1 {start=last=$1; next}
$1 == last+1 {last=$1; next}
{printrange(); start=last=$1}
END {printrange()}
' file | paste -sd" " | fold -sw 60 | tr ' ' ',' | sed 's/^/111111 /'
111111 21-84,87,85-86,88-106,108,111,109,112,110,113,115,114,117,
111111 116,118,124-125,120,122-123,126,132,127,133,128,130-131,
111111 135-146,148,147,149-156,158,157,159-182,184,183,185-212,214
I wrote this in response to Reddit's daily programmer challenge, and I would like to get some of your feedback on it to improve the code (it seems to work). The challenge is as follows:
We are given a list of numbers in a "short-hand" range notation where only the significant part of the next number is written because we know the numbers are always increasing (ex. "1,3,7,2,4,1" represents [1, 3, 7, 12, 14, 21]). Some people use different separators for their ranges (ex. "1-3,1-2", "1:3,1:2", "1..3,1..2" represent the same numbers [1, 2, 3, 11, 12]) and they sometimes specify a third digit for the range step (ex. "1:5:2" represents [1, 3, 5]).
NOTE: For this challenge range limits are always inclusive.
Our job is to return a list of the complete numbers.
The possible separators are: ["-", ":", ".."]
Sample input:
104..02
545,64:11
Sample output:
104 105 106...200 201 202 # truncated for simplicity
545 564 565 566...609 610 611 # truncated for simplicity
My solution:
BEGIN { FS = "," }
function next_value(current_value, previous_value) {
regexp = current_value "$"
while(current_value <= previous_value || !(current_value ~ regexp)) {
current_value += 10
}
return current_value;
}
{
j = 0
delete number_list
for(i = 1; i <= NF; i++) {
# handle fields with ranges
if($i ~ /-|:|\.\./) {
split($i, range, /-|:|\.\./)
if(range[1] > range[2]) {
if(j != 0) {
range[1] = next_value(range[1], number_list[j-1])
range[2] = next_value(range[2], range[1])
}
else
range[2] = next_value(range[2], range[1]);
}
if(range[3] == "")
number_to_iterate_by = 1;
else
number_to_iterate_by = range[3];
range_iterator = range[1]
while(range_iterator <= range[2]) {
number_list[j] = range_iterator
range_iterator += number_to_iterate_by
j++
}
}
else {
number_list[j] = $i
j++
}
}
# apply increasing range logic and print
for(i = 0; i < j; i++ ) {
if(i == 0) {
if(NR != 1) printf "\n"
current_value = number_list[i]
}
else {
previous_value = current_value
current_value = next_value(number_list[i], previous_value)
}
printf "%s ", current_value
}
}
END { printf "\n" }
This is BASH (Not AWK).
I believe it is a valid answer because the original challenge doesn't specify a language.
#!/bin/bash
mkord(){ local v=$1 dig base
max=$2
(( dig=10**${#v} , base=max/dig*dig , v+=base ))
while (( v < max )); do (( v+=dig )); done
max=$v
}
while read line; do
line="${line//[,\"]/ }" line="${line//[:-]/..}"
IFS=' ' read -a arr <<<"$line"
max=0 a='' res=''
for val in "${arr[#]//../ }"; do
IFS=" " read v1 v2 v3 <<<"$val"
(( a==0 )) && max=$v1
[[ $v1 ]] && mkord "$v1" "$max" && v1=$max
[[ $v2 ]] && mkord "$v2" "$max" && v2=$max
res=$res${a:+,}${v2:+\{}$v1${v2:+\.\.}$v2${v3:+\.\.}$v3${v2:+\}}
a=1
done
(( ${#arr[#]} > 1 )) && res={$res}
eval set -- $res
echo "\"$*\""
done <"infile"
If the source of the tests is:
$ cat infile
"1,3,7,2,4,1"
"1-3,1-2"
"1:5:2"
"104-2"
"104..02"
"545,64:11"
The result will be:
"1 3 7 12 14 21"
"1 2 3 11 12"
"1 3 5"
"104 105 106 107 108 109 110 111 112"
"104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202"
"545 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611"
This gets the list done in 7 milliseconds.
My solution using gawk, RT (It contains the input text that matched the text denoted by RS) and next_n function uses modulo operation for to find the next number based on the last
cat range.awk
BEGIN{
RS="\\.\\.|,|:|-"
start = ""
end = 0
temp = ""
}
function next_n(n, last){
mod = last % (10**length(n))
if(mod < n) return last - mod + n
return last + ((10**length(n))-mod) + n
}
{
if(RT==":" || RT==".." || RT=="-"){
if(start=="") start = next_n($1,end)
else temp = $1
}else{
if(start != ""){
if(temp==""){
end = next_n($1,start)
step = 1
}else {
end = next_n(temp,start)
step = $1
}
for(i=start; i<=end; i+=step) printf "%s ", i
start = ""
temp = ""
}else{
end = next_n($1,end)
printf "%s ", end
}
}
}
END{
print ""
}
TEST 1
echo "104..02" | awk -f range.awk
OUTPUT 1
104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202
TEST 2
echo "545,64:11" | awk -f range.awk
OUTPUT 2
545 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611
TEST 3
echo "2..5,7,2-1,2:1,0-3,2-7,8..0,4,4,2..1" | awk -f range.awk
OUTPUT 3
2 3 4 5 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 40 41 42 43 52 53 54 55 56 57 58 59 60 64 74 82 83 84 85 86 87 88 89 90 91
TEST 4 with step
echo "1:5:2,99,88..7..3" | awk -f range.awk"
OUTPUT 4
1 3 5 99 188 191 194 197
I look for common elements in two files or which row of a matrix has the most elements from a given row. what I understood until now is how to compare fields. I receive the lines which hold the same value in the same fieldnumber.
But how can I open the search to the other field numbers?
awk 'NR==FNR{a[$1];next}$1 in a{print $1" "FNR}' file1 file2
104 3
Expected output:
104 3 111 4 117 2 134 2 148 - 156 4 166 4 176 3 186 - 198 1 221 6 236 -
best match row 4 with 3 elements common.
file 1
104 111 117 134 148 156 166 176 186 198 221 236
file 2
102 108 116 124 132 141 151 162 173 185 198 211
103 109 117 125 134 143 153 163 175 187 200 213
104 110 118 126 135 144 154 165 176 188 201 215
105 111 119 127 136 145 156 166 178 190 203 217
106 112 120 128 137 147 157 168 179 192 205 219
107 113 121 130 139 148 158 169 181 193 207 221
108 114 122 131 140 150 160 171 183 195 208 200
This solution assumes 1) that file1 contains unique values as shown in the provided example and 2) there is only one top ranked line in file2.
awk -v string=$(cat file1 | tr " " ",") \
'{split(string,array,","); cnt=0;
for(i in array) {for(j=1;j<=NF;j++) if(array[i]==$j) cnt++};
if(cnt>cntmax) {cntmax=cnt; NRmax=NR}} END{print NRmax}' file2
4
This question continues the discussion started here. I found out that the HTTP response body can't be unmarshaled into JSON object because of deflate compression of the latter. Now I wonder how can I perform decompression with Golang. I will appreciate anyone who can show the errors in my code.
Input data
I've dumped the HTTP response body into the 'test' file. Here is it:
$ cat test
x��PAN�0�
;��NtJ�FӮdU�|"oVR�C%�f�����Z.�^Hs�dW뮑�'��DH�S�SFVC����r)G,�����<���z}�x_g�+�2��sl�r/�Oy>��J3\�G�9���N���#[5M�^v/�2Ҕ��|�h��[�~7�_崛<D*���/��i
Let's make sure that this file can be decompressed and even contains valid JSON:
$ zlib-flate -uncompress < test
{"timestamp":{"tv_sec":1428488670,"tv_usec":197041},"string_timestamp":"2015-04-08 10:24:30.197041","monitor_status":"enabled","commands":{"REVERSE_LOOKUP":{"cache":{"outside":{"successes":0,"failures":0,"size":0,"time":0},"internal":{"successes":0,"failures":0,"size":0,"time":0}},"disk":{"outside":{"successes":0,"failures":0,"size":0,"time":0},"internal":{"successes":13366,"failures":0,"size":0,"time":501808}},"total":{"storage":{"successes":0,"failures":0},"proxy":{"successes":13366,"failures":0}}},"clients":{}}}
$ zlib-flate -uncompress < test | python -m json.tool
{
"commands": {
"REVERSE_LOOKUP": {
"cache": {
....
Source code
package main
import (
"bytes"
"compress/flate"
"fmt"
"io/ioutil"
)
func main() {
fname := "./test"
content, err := ioutil.ReadFile(fname)
if err != nil {
panic(err)
}
fmt.Println("File content:\n", content)
enflated, err := ioutil.ReadAll(flate.NewReader(bytes.NewReader(content)))
if err != nil {
panic(err)
}
fmt.Println("Enflated:\n", enflated)
}
Error
$ go run uncompress.go
File content:
[120 156 181 80 65 78 195 48 16 252 10 242 57 69 118 226 166 38 247 156 64 42 42 130 107 100 156 165 88 196 118 149 93 35 160 234 223 89 183 61 112 42 226 192 109 118 118 102 103 180 123 65 62 0 146 13 59 209 237 5 189 15 8 78 116 74 215 70 27 211 174 100 85 184 124 34 111 86 82 171 67 37 144 102 31 183 195 15 167 168 165 90 46 164 94 72 115 165 100 87 235 174 145 215 39 189 168 68 72 209 83 154 7 22 83 70 86 67 180 207 19 140 188 114 41 4 27 71 44 225 155 254 169 223 60 244 195 221 122 125 251 120 95 24 103 221 43 20 144 50 161 31 143 16 179 115 128 8 108 225 114 47 214 79 121 62 15 232 191 224 8 74 51 6 92 213 71 130 57 218 233 175 78 182 142 30 223 254 35 91 53 77 219 94 118 47 165 50 210 148 18 148 232 124 128 31 104 183 151 91 176 126 55 167 143 207 95 3 15 229 180 155 60 68 42 159 231 241 27 47 165 167 25]
panic: flate: corrupt input before offset 5
goroutine 1 [running]:
runtime.panic(0x4a7180, 0x5)
/usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
main.main()
/home/isaev/side-projects/elliptics-manager/uncompress.go:20 +0x2a3
exit status 2
PS Ubuntu 14.10, Go 1.2.1
Your input is not a simple deflated block, it's a zlib stream.
According to the ZLIB Compressed Data Format Specification 3.3 the first 2 bytes are:
-------------
| CMF | FLG |
-------------
The Compression Method and flags. Your input starts with [120, 156] which is 78 9C in hexa. This is the Default Compression. Also no dictionary follows, so the subsequent data is the compressed data.
Bits 0 to 3 are CM Compression Method and bits 4 to 7 are CINFO Compression Info. In this case CINFO=7 indicates a 32K window size, CM=8 denotes the "deflate" compression method. FLG bit 5 tells if a dictionary is preset, which is in this case. Details of the FLG are also in the linked RFC 1950.
So your input basically tells the rest of the data was constructed using default compression, but the go flate package does not decode this.
Change your decompression to omit the first 2 bytes like this and it will work:
enflated, err := ioutil.ReadAll(flate.NewReader(bytes.NewReader(content[2:])))
Try it on the Go Playground. But...
Use Proper ZLib decompression!
We got lucky this time because the compression level is the default and dictionary was preset. If not, you won't be able to decode it using the flate package. Since the input is a zlib stream, you should use the compress/zlib package to properly decode it and not rely on luck:
r, err := zlib.NewReader(bytes.NewReader(content))
if err != nil {
panic(err)
}
enflated, err := ioutil.ReadAll(r)
if err != nil {
panic(err)
}
fmt.Println(string(enflated))
Try the zlib variant on the Go Playground.