Extract table values from curl output - bash

I'm trying to extract random values that follow a unique string. The layout is:
<tr><td><a>uniquestring"</a></td>
<td>RANDOM NUMBER k/b</td>
<td>RANDOM NUMBER</td>
<td>RANDOM NUMBER</td>
<td>RANDOM NUMBER</tr>
I want to do something like
curl -is http://webpage.com/ |grep uniquestring | echo RANDOM NUMBER k/b
I'd also like to return all values on a single line, i.e. echo
uniquestring RANDOMNUMBER k/b RANDOMNUMBER RANDOMNUMBER RANDOMNUMBER
The page generates multiple 'blocks' of the 5 lines above, and I'm only interested in the values that come after a specific uniquestring.

To return all values on a single line (-A 4 prints the matching line plus the four lines after it, which is the whole five-line block):
curl -s webpage.com | grep -A 4 uniquestring | sed 's/<[^>]\+>//g' | tr '\n' ' '
To return just RANDOM NUMBER k/b:
curl -s webpage.com | grep -A 1 uniquestring | grep -v "uniquestring" | sed 's/<[^>]\+>//g' | tr '\n' ' '
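The same idea can be sketched with a single awk process instead of a chain of grep, sed and tr. This is only an illustration under the layout given in the question (five lines per block, the key on the first line); extract_row is a hypothetical helper name, and the printf below stands in for the real curl output.

```shell
#!/bin/bash
# extract_row KEY: read HTML on stdin and print the five cells of the block
# whose first line contains KEY, tags stripped, on a single line.
# (Hypothetical helper; assumes the five-line layout from the question.)
extract_row() {
    awk -v key="$1" '
        index($0, key) { n = 5 }        # wanted block starts here: take 5 lines
        n > 0 {
            gsub(/<[^>]*>/, "")         # drop every HTML tag
            gsub(/"/, "")               # drop the stray quote after the key
            printf "%s%s", sep, $0
            sep = " "
            if (--n == 0) { print ""; exit }
        }
    '
}

# Sample input standing in for: curl -s http://webpage.com/
printf '%s\n' \
    '<tr><td><a>other"</a></td>' '<td>1 k/b</td>' '<td>2</td>' '<td>3</td>' '<td>4</tr>' \
    '<tr><td><a>uniquestring"</a></td>' '<td>10 k/b</td>' '<td>20</td>' '<td>30</td>' '<td>40</tr>' |
    extract_row uniquestring
# prints: uniquestring 10 k/b 20 30 40
```

Because awk stops after the fifth line of the matching block, a later block is never read.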

Using TXR:
$ txr -c '@(skip)
<tr><td><a>@uniq"</a></td>
<td>@num1 k/b</td>
<td>@num2</td>
<td>@num3</td>
<td>@num4</tr>
@(output)
@uniq @num1 k/b @num2 @num3 @num4
@(end)' - < data
eb09b744e3e914d67f86a1fee82e9002634ac 123098340 k/b 4949848 9648 334938
Here we match as much of the structure as possible, including the static piece of text k/b. It is assumed that the unique string is variable; we don't know what it is and want to extract it.
The sample data file contains:
$ cat data
<tr><td><a>eb09b744e3e914d67f86a1fee82e9002634ac"</a></td>
<td>123098340 k/b</td>
<td>4949848</td>
<td>9648</td>
<td>334938</tr>


Trouble making mathematical operations with big hexadecimal numbers on bash

I'm trying to perform several arithmetic operations on big hexadecimal numbers. In the following code, p and q are each given a big random value and then multiplied together to give n. I then compute n - 2, but the result of the subtraction appears to be wrong.
#!/bin/bash
generate_random() {
head -c 256 /dev/urandom | xxd -p -u -c 256 | tr -d '[:space:]\\'
}
p="$(generate_random)"
q="$(generate_random)"
n=$(echo "${p} * ${q}" | tr -d '[:space:]\\')
echo "obase=16; ${n} - 2" | bc | tr -d '[:space:]\\'
I'm using this website to check the results, and so far I haven't managed to get any correct output from my script.
Value examples:
Generated value for p
28DA11279C9BF6B6A93CF1288C63845503D87C17C93554421C3069A5283547CEE4C127F3289BDC03663143808B7318B321B35B25ECC6D48EF60283DA6111104291070E3FBEDBD27A942D2B3630AA8683C6BB4F6F6EE277A24A4C9B93AEEB61D48AC7657FA6E61DC95A8EF133F97C6ED3B285E275487746F070005B2BCDEA9C7C12F294DFCE55BC7013417F8E47CEC605F13AFD5C54A3578BB041278285511248E975FE29F4013CA059599EE95E43E28B886D0651EFDDFF760DBB298096C7CA1A46FE3D119914C23ABA5543C43BE546FA70D7FA36B22DA17210A6CABDCD299751ADEE381A3230E9978946B193AB02921947887A2FC7A5DC84D2193F9CFC865B52
Generated value for q
54DEBA70F8F052F5120B77254EB999E12180241520DC2A12F62F6773992155AEFC5356E3F9B3271FE5AA9D425F7D2CD8233196C98595F993899C31D4063F75A801D1752AD178663E3CDF3CF38CEE1972C861DC6069B787249963D305E4FC970A48E67D3A680CD58F17229379EAE5445603E50E60CF605F1057766EFEAFAA2299CCBC0C4F161815DBD06294E4BBD43EF55F1E2D7544B39279EA4B9114AB9F9D0FC2B46135911CF62FB4A22A615936FDDAD23131B1F0AD2FB94D44C0879B3289530653C4714B2E3F3F9FFD17E92C44FBCE589982F68985207F788FBD1B531C56224E4EDA1F124E6AEC19C949AB396862F0856C435EBAAAB7FFB1251FBEB338676D
Result of n
28DA11279C9BF6B6A93CF1288C63845503D87C17C93554421C3069A5283547CEE4C127F3289BDC03663143808B7318B321B35B25ECC6D48EF60283DA6111104291070E3FBEDBD27A942D2B3630AA8683C6BB4F6F6EE277A24A4C9B93AEEB61D48AC7657FA6E61DC95A8EF133F97C6ED3B285E275487746F070005B2BCDEA9C7C12F294DFCE55BC7013417F8E47CEC605F13AFD5C54A3578BB041278285511248E975FE29F4013CA059599EE95E43E28B886D0651EFDDFF760DBB298096C7CA1A46FE3D119914C23ABA5543C43BE546FA70D7FA36B22DA17210A6CABDCD299751ADEE381A3230E9978946B193AB02921947887A2FC7A5DC84D2193F9CFC865B52*54DEBA70F8F052F5120B77254EB999E12180241520DC2A12F62F6773992155AEFC5356E3F9B3271FE5AA9D425F7D2CD8233196C98595F993899C31D4063F75A801D1752AD178663E3CDF3CF38CEE1972C861DC6069B787249963D305E4FC970A48E67D3A680CD58F17229379EAE5445603E50E60CF605F1057766EFEAFAA2299CCBC0C4F161815DBD06294E4BBD43EF55F1E2D7544B39279EA4B9114AB9F9D0FC2B46135911CF62FB4A22A615936FDDAD23131B1F0AD2FB94D44C0879B3289530653C4714B2E3F3F9FFD17E92C44FBCE589982F68985207F788FBD1B531C56224E4EDA1F124E6AEC19C949AB396862F0856C435EBAAAB7FFB1251FBEB338676D
Expected value for n
d8b187c5754df9d215796bcae5b552e87d4f7a590bb257a208e327cb1e3bf48cbefba07388a0a49f1782f78a232147e9b1137b4eb54611200eab02faa0edd005ee52a33832d391b0aa766fdca712a441a26586fa9418b791e71056117bbf80b07ab68502491a5b70222ad058fdd733b30701f0b46cf7486d8a412096d4a05f43b5c58052baadd51bb08aadf551366513fde427d4fd1ac35778762cd960d3697f4a661b7dda642e5e71284d9fca947171cba5b5e9387cfa078833c3ace7c42baced889cdda8744b021524aec2ed20ef0e0379cf03b206ed0707917c0f38ef66fb79f4198bb1d25c046901b3f54a2910b90ba3b595511042cd682ba38eab459b77b41a37c05d3cffda58346d8dc7bf180ec72aa7b4c4e0cb532c4f374e7bec1e5e970d129dd755a38ca070dd700b5133a3d0c8ba5fbab7c309b94480aa9996d762c1c3cb87170dda878bc9c51db4573681e8dd57db6f4bb1375f386323f643045fef1391498ca7fcca8fe830f6e268db877d7950861a7ab661cb63ab7831934b5a2bdacf53d3c7eac80e1130f786e1b95b5b3f98374ced618a36bb5b3ceab60861ddbb7ae227e9d2abc26931e55483a6a4e891ef54f786519c7250e70a39b821602eb5820fcc4422215452e3a6355140af77697e752aced92cc778e4c6d4df3d8a230a8a8e4756e5f5e347ab0d9bab51f1dc25b5d6099246ae1e2238be3e2dfea
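Since the input is in hexadecimal, you should also set ibase. Set obase first: once ibase=16 is in effect, bc would read a later obase=16 as hexadecimal 16 (decimal 22).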
#!/bin/bash
generate_random() {
head -c 256 /dev/urandom | xxd -p -u -c 256 | tr -d '[:space:]\\'
}
p="$(generate_random)"
q="$(generate_random)"
n=$(echo "${p} * ${q}" | tr -d '[:space:]\\')
echo "obase=16;ibase=16; ${n} - 2" | bc | tr -d '[:space:]\\'
For the values you posted, this prints the same number as the Python check below, with the hexadecimal digits in uppercase.

This can be confirmed using python
In [1]: hex(0x28DA11279C9BF6B6A93CF1288C63845503D87C17C93554421C3069A5283547CEE4C127F3289BDC03663143808B7318B321B35B25ECC6D48EF60283DA6111104291070E3FBEDBD27A942D2B3630AA8683C6BB4F6F6EE277A24A4C9B93AEEB61D48AC7657F
...: A6E61DC95A8EF133F97C6ED3B285E275487746F070005B2BCDEA9C7C12F294DFCE55BC7013417F8E47CEC605F13AFD5C54A3578BB041278285511248E975FE29F4013CA059599EE95E43E28B886D0651EFDDFF760DBB298096C7CA1A46FE3D119914C23ABA5543
...: C43BE546FA70D7FA36B22DA17210A6CABDCD299751ADEE381A3230E9978946B193AB02921947887A2FC7A5DC84D2193F9CFC865B52*0x54DEBA70F8F052F5120B77254EB999E12180241520DC2A12F62F6773992155AEFC5356E3F9B3271FE5AA9D425F7D2CD82
...: 33196C98595F993899C31D4063F75A801D1752AD178663E3CDF3CF38CEE1972C861DC6069B787249963D305E4FC970A48E67D3A680CD58F17229379EAE5445603E50E60CF605F1057766EFEAFAA2299CCBC0C4F161815DBD06294E4BBD43EF55F1E2D7544B3927
...: 9EA4B9114AB9F9D0FC2B46135911CF62FB4A22A615936FDDAD23131B1F0AD2FB94D44C0879B3289530653C4714B2E3F3F9FFD17E92C44FBCE589982F68985207F788FBD1B531C56224E4EDA1F124E6AEC19C949AB396862F0856C435EBAAAB7FFB1251FBEB3386
...: 76D-2)
Out[1]: '0xd8b187c5754df9d215796bcae5b552e87d4f7a590bb257a208e327cb1e3bf48cbefba07388a0a49f1782f78a232147e9b1137b4eb54611200eab02faa0edd005ee52a33832d391b0aa766fdca712a441a26586fa9418b791e71056117bbf80b07ab68502491a5b70222ad058fdd733b30701f0b46cf7486d8a412096d4a05f43b5c58052baadd51bb08aadf551366513fde427d4fd1ac35778762cd960d3697f4a661b7dda642e5e71284d9fca947171cba5b5e9387cfa078833c3ace7c42baced889cdda8744b021524aec2ed20ef0e0379cf03b206ed0707917c0f38ef66fb79f4198bb1d25c046901b3f54a2910b90ba3b595511042cd682ba38eab459b77b41a37c05d3cffda58346d8dc7bf180ec72aa7b4c4e0cb532c4f374e7bec1e5e970d129dd755a38ca070dd700b5133a3d0c8ba5fbab7c309b94480aa9996d762c1c3cb87170dda878bc9c51db4573681e8dd57db6f4bb1375f386323f643045fef1391498ca7fcca8fe830f6e268db877d7950861a7ab661cb63ab7831934b5a2bdacf53d3c7eac80e1130f786e1b95b5b3f98374ced618a36bb5b3ceab60861ddbb7ae227e9d2abc26931e55483a6a4e891ef54f786519c7250e70a39b821602eb5820fcc4422215452e3a6355140af77697e752aced92cc778e4c6d4df3d8a230a8a8e4756e5f5e347ab0d9bab51f1dc25b5d6099246ae1e2238be3e2dfe8'
I'm not sure which calculator you are using, but the expected value you posted appears to be n itself (it ends in ...dfea); subtracting 2 from it gives the value above, which ends in ...dfe8.

How to sort lines based on specific part of their value?

When I run the following command:
command list -r machine-a-.* | sort -nr
It gives me the following result:
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
I wish to sort these lines based on the number at the end, in descending order.
( Clearly sort -nr doesn't work as expected. )
You just need sort's -t and -k options.
command list -r machine-a-.* | sort -t '-' -k 3 -nr
-t sets the separator used to split each line into fields.
By giving it the value of '-', sort will see the given text as:
Field 1   Field 2   Field 3
machine   a         9
machine   a         8
machine   a         72
machine   a         71
machine   a         70
-k specifies the field that will be used for comparison.
By giving it the value 3, sort compares the lines by their values in Field 3.
Namely, these strings will be compared:
9
8
72
71
70
-n makes sort compare the key as a number instead of as a string.
-r makes sort output the lines in reverse (descending) order.
Therefore, by sorting the numbers from Field 3 in reverse order, this will be the output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
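The options above can be tried without the original command by feeding sort the sample lines directly. In this sketch sort_by_suffix is a hypothetical wrapper name; it also writes the key as -k 3,3 so that the key ends at field 3 rather than running to the end of the line, which makes no difference for this input but is the safer habit.

```shell
#!/bin/bash
# sort_by_suffix: sort stdin by the third '-'-separated field, numerically,
# in descending order. (Hypothetical wrapper name, used only for this sketch.)
sort_by_suffix() {
    # -k 3,3 limits the key to field 3; the n and r flags apply to that key.
    sort -t '-' -k 3,3nr
}

printf '%s\n' machine-a-9 machine-a-8 machine-a-72 machine-a-71 machine-a-70 |
    sort_by_suffix
# prints machine-a-72, machine-a-71, machine-a-70, machine-a-9, machine-a-8
```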
Here is an example of input to sort:
$ cat 1.txt
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
Here is our short program:
$ cat 1.txt | ( IFS=-; while read A B C ; do echo $C $A-$B-$C; done ) | sort -rn | cut -d' ' -f 2
Here is its output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
Explanation:
$ cat 1.txt \ (put the contents of the file into the pipe)
| ( \ (group some commands)
IFS=-; (set the field separator to "-" for the read command)
while read A B C ; (read the fields of every line into the three variables A, B and C)
do echo $C $A-$B-$C; (emit the line with $C prepended)
done
) \ (end of group)
| sort -rn \ (reverse numeric sort)
| cut -d' ' -f 2 (cut off the first field, which is no longer needed)
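The prepend, sort, strip pattern above does not have to go through a shell read loop. As a sketch of the same technique with awk doing the prepending, assuming the number is always the last '-'-separated field (sort_by_last_field is a hypothetical name):

```shell
#!/bin/bash
# sort_by_last_field: order stdin by the number after the last '-', descending.
# (Hypothetical helper; assumes every line ends in -NUMBER.)
sort_by_last_field() {
    awk -F- '{ print $NF "\t" $0 }' |   # prepend the numeric suffix and a tab
        sort -k1,1nr |                  # sort on the prepended number only
        cut -f2-                        # strip the prepended field again
}

printf '%s\n' machine-a-9 big-machine-a-72 m-8 | sort_by_last_field
# prints big-machine-a-72, machine-a-9, m-8
```

Using the last field instead of a fixed field number means the lines may contain different numbers of hyphens.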

Bash: uniq count large dataset

I have a set of CSV files totalling over 70GB, of which about 35GB is the field I'm interested in (around 100 bytes for this field in each row).
The data are highly duplicated (sampling shows that the top 1000 values cover 50%+ of the rows), and I'm interested in getting the total uniq count.
With a not so large data set I would do
cat my.csv | cut -f 5 | sort | uniq -c | sort --numeric and it works fine
However, the problem is that (to my understanding) because of the intermediate sort, this command needs to hold the whole data set in RAM (and then on disk, because it does not fit in my 16GB of RAM) before streaming it to uniq -c.
I would like to know if there is a command or an awk/Python script that does the sort | uniq -c in one step, so that RAM consumption is far lower?
You can try this:
perl -F, -MDigest::MD5=md5 -lanE 'say $F[4] unless $seen{ md5($F[4]) }++' < file.csv >unique_field5.txt
It holds in memory a 16-byte MD5 digest for every unique field 5 (i.e. $F[4]). Or you can use
cut -d, -f5 csv | perl -MDigest::MD5=md5 -lnE 'say unless $seen{md5($_)}++'
for the same result.
Of course, MD5 isn't cryptographically safe these days, but it will probably be good enough for deduplication. It is also possible to use SHA-1 or SHA-256: just use -MDigest::SHA=sha256. The SHA digests are longer, so they need more memory.
It is similar to the awk linked in the comments, with one difference: the hash key here is not the whole input line, but just the 16-byte MD5 digest.
EDIT
Because I was wondering about the performance, I created this test case:
# this perl creates 400,000,000 records
# each 100 bytes + an appended random number,
# total size of data 40GB.
# each invocation generates the same data (srand(1))
# because the random number is between 0 and 50_000_000,
# at most 50,000,000 of the records (approx. 12.5%) are unique.
gendata() {
perl -E '
BEGIN{ srand(1) }
say "x"x100, int(rand()*50_000_000) for 1..400_000_000
'
}
# the unique filtering - by digest
# also using the Devel::Size perl module to get the final size of the data held in memory
# using md5
domd5() {
perl -MDigest::MD5=md5 -MDevel::Size=total_size -lnE '
say unless $seen{md5($_)}++;
END {
warn"total: " . total_size(\%seen);
}'
}
#using sha256
dosha256() {
perl -MDigest::SHA=sha256 -MDevel::Size=total_size -lnE '
say unless $seen{sha256($_)}++;
END {
warn"total: " . total_size(\%seen);
}'
}
#MAIN
time gendata | domd5 | wc -l
time gendata | dosha256 | wc -l
results:
total: 5435239618 at -e line 4, <> line 400000000.
49983353
real 10m12,689s
user 12m43,714s
sys 0m29,069s
total: 6234973266 at -e line 4, <> line 400000000.
49983353
real 15m51,884s
user 18m23,900s
sys 0m29,485s
i.e.:
for the md5
memory usage: 5,435,239,618 bytes, i.e. approx. 5.4 GB
unique records: 49,983,353
time to run: 10 min
for the sha256
memory usage: 6,234,973,266 bytes, i.e. approx. 6.2 GB
unique records: 49,983,353
time to run: 16 min
In contrast, doing the plain-text unique search using the "usual" approach:
doplain() {
perl -MDevel::Size=total_size -lnE '
say unless $seen{$_}++;
END {
warn"total: " . total_size(\%seen);
}'
}
i.e. running:
time gendata | doplain | wc -l
result:
memory usage is much bigger: 10,022,600,682 bytes - my notebook with 16GB RAM starts swapping heavily (it has an SSD, so not a big deal - but still..)
unique records: 49,983,353
time to run: 8:30 min
Result?
Just use
cut -d, -f5 csv | perl -MDigest::MD5=md5 -lnE 'say unless $seen{md5($_)}++'
and you should get the unique lines fast enough.
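If the counts are needed as well (the sort | uniq -c part of the question), the same hash-table idea can be sketched in awk. Note the trade-off: this sketch keys the table on the full field value, like the plain-text variant measured above, so it uses more memory per unique value than the digest version. count_field is a hypothetical helper name, and the comma split is naive (no quoted-field handling), just like cut -d,.

```shell
#!/bin/bash
# count_field N [FILE...]: count occurrences of each distinct value of
# comma-separated field N, most frequent first.
# (Hypothetical helper; reads stdin when no files are given.)
count_field() {
    local field=$1
    shift
    awk -F, -v f="$field" '
        { count[$f]++ }                               # one entry per distinct value
        END { for (v in count) print count[v], v }
    ' "$@" | sort -k1,1nr -k2                         # sorts only the distinct values
}

printf '%s\n' 'r1,x,x,x,apple' 'r2,x,x,x,pear' 'r3,x,x,x,apple' | count_field 5
# prints:
# 2 apple
# 1 pear
```

Only the final sort sees the data, and by then it is one line per distinct value rather than one per row.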
You can try this:
split --filter='sort | uniq -c | sed "s/^\s*//" > $FILE' -C 15G -d "dataset" "dataset-"
At this point you should have around 5 files named dataset-<i>, each of which should be much smaller than 15G. (-C keeps whole lines together in each chunk, whereas -b could cut a line in two at a chunk boundary.)
To merge the files, you can save the following bash script as merge.bash:
#! /bin/bash
#
# Sum the counts of adjacent identical lines in merged "count line" input.
read -r prev_line
prev_count=${prev_line%% *}
prev_line="${prev_line#* }" # strip the count, as is done for every later line
while read -r line; do
  count="${line%% *}"
  line="${line#* }" # This line does not handle blank lines correctly
  if [ "$line" != "$prev_line" ]; then
    echo "$prev_count $prev_line"
    prev_count=$count
    prev_line=$line
  else
    prev_count=$((prev_count + count))
  fi
done
echo "$prev_count $prev_line"
And run the command:
sort -m -k 2 dataset-* | bash merge.bash > final_dataset
Note: blank lines are not handled correctly; if it suits your needs, you can remove them from your dataset or correct merge.bash.
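Since a bash read loop can be slow on input of this size, the merge step could also be sketched in awk. This is an alternative under the same assumptions as the script above (each line is a count, one space, then the value, and equal values are adjacent after sort -m); merge_counts is a hypothetical name.

```shell
#!/bin/bash
# merge_counts: sum the counts of adjacent lines that carry the same value.
# Input lines look like "COUNT VALUE" and must already be ordered by VALUE.
# (Hypothetical helper, sketched as an alternative to a shell read loop.)
merge_counts() {
    awk '
        {
            n = $1                       # remember the count
            sub(/^[0-9]+ /, "")          # the rest of the line is the value
            if (NR > 1 && $0 != prev) {  # value changed: emit the finished group
                print total, prev
                total = 0
            }
            prev = $0
            total += n
        }
        END { if (NR) print total, prev }
    '
}

printf '%s\n' '2 a' '1 a' '3 b' | merge_counts
# prints:
# 3 a
# 3 b
```

It would take the place of bash merge.bash in the pipeline, reading the output of sort -m on stdin.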

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab-delimited file and there are row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew in advance which column number they wanted, and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
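A small standalone illustration of that repetition, using a format with two conversions and an odd number of arguments:

```shell
# One format, five arguments: the format is reused until the arguments run out,
# and a missing argument is treated as an empty string.
printf '%s=%s\n' a 1 b 2 c
# prints:
# a=1
# b=2
# c=
```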
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
We could also use -f columns to read the patterns from the file named columns. Or if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. But for now, we'll use "$(cat columns)", which works in any POSIX shell:
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line number; comma-separate the numbers, and put a 1 at the beginning.
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
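For comparison, the lookup and the selection can also be done in one awk pass, which the answer mentions as a possibility. The following is only a sketch under the same assumptions (tab-separated data, row labels in column 1, one wanted column name per line); select_columns_awk is a hypothetical name, and the temporary files stand in for giga.dat and columns.

```shell
#!/bin/bash
# select_columns_awk WANTED DATA: print column 1 of DATA plus every column
# whose header appears in WANTED (one name per line, must not be empty).
# (Hypothetical helper, sketched as a one-pass alternative.)
select_columns_awk() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { want[$0]; next }           # first file: remember wanted names
        FNR == 1 {                             # header line of the data file
            n = 1; keep[1] = 1                 # always keep the row-label column
            for (i = 2; i <= NF; i++)
                if ($i in want) keep[++n] = i
        }
        {                                      # every data-file line, header included
            line = $(keep[1])
            for (k = 2; k <= n; k++) line = line OFS $(keep[k])
            print line
        }
    ' "$1" "$2"
}

# Tiny stand-ins for the real files.
tmp=$(mktemp -d)
printf '\tcolname_1\tcolname_2\tcolname_3\tcolname_4\n' > "$tmp/giga.dat"
printf 'row_1\t1\t2\t3\t5\nrow_2\t4\t6\t9\t1\nrow_3\t2\t3\t4\t2\n' >> "$tmp/giga.dat"
printf '%s\n' colname_1 colname_3 > "$tmp/columns"
select_columns_awk "$tmp/columns" "$tmp/giga.dat"
rm -r "$tmp"
```

The columns come out in the order they appear in the data file, as with cut.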
You can actually do this by keeping track of the array indexes for the columns that match the column names in your file containing the column list. After you have found the array indexes in the data file for the column names in your column list file, you simply read your data file (beginning at the second line) and output the row_label plus the data for the columns at the array index you determined in matching the column list file to the original columns.
There are probably several ways to approach this and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or other advanced shell supporting arrays) and not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") ) ## fill cols from 1st line of data file
csel=( $(< "$2") ) ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
for ((j = 0; j < ${#cols[@]}; j++)); do
[ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do ## output header row
printf " %s" "${csel[i]}"
done
printf "\n" ## output newline
unset cols ## unset cols to reuse in reading lines below
while read -r line; do ## read each data line in data file
cols=( $line ) ## separate into cols array
printf "%s" "${cols[0]}" ## output row label
for ((j = 0; j < ${#cpos[@]}; j++)); do
[ "$j" -eq "0" ] && { ## handle format for first column
printf "%5s" "${cols[$((${cpos[j]}+1))]}"
continue
} ## output remaining columns
printf "%13s" "${cols[$((${cpos[j]}+1))]}"
done
printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.

Sorting and printing a file in bash UNIX

I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here is what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
  read -r firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
  while read -r num name; do
    [[ $num = "$firstnum" ]] || break
    printf '%s\t%s\n' "$num" "$name"
  done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp
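Putting the pieces together, here is a sketch of the whole pipeline from raw paths to the requested output. It assumes the input is one path per line (not yet counted) and that the file number is the run of digits at the end of the path; top_paths is a hypothetical name.

```shell
#!/bin/bash
# top_paths: read one path per line on stdin and print "COUNT NUMBER" for
# every path that shares the highest occurrence count.
# (Hypothetical helper; NUMBER is the run of digits at the end of the path.)
top_paths() {
    sort | uniq -c | sort -k1,1nr |
        awk '
            NR == 1    { max = $1 }           # first line carries the top count
            $1 != max  { exit }               # stop at the first lower count
            { sub(/.*[^0-9]/, "", $2)         # keep only the trailing digits
              print $1, $2 }
        '
}

printf '%s\n' /usr/file1564 /usr/file2212 /usr/file3542 /usr/file2212 /usr/file1564 |
    top_paths
# prints:
# 2 1564
# 2 2212
```

Reading from stdin and writing to stdout also avoids redirecting the output into the file that is being read.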
