Save all HBase table names to a bash array - bash

I would like to store the names of all my hbase tables in an array inside my bash script.
All sed hotfixes are acceptable.
All better solutions (like reading it with readarray from some ZooKeeper file I am not aware of) are acceptable.
I have two hbase tables called MY_TABLE_NAME_1 and MY_TABLE_NAME_2, so what I want would be:
tables=(
MY_TABLE_NAME_1
MY_TABLE_NAME_2
)
What I tried:
Based on HBase Shell in OS Scripts by Cloudera:
echo "list" | /path/to/hbase/bin/hbase shell -n > /home/me/hbase-tables
readarray -t tables < /home/me/hbase-tables
but inside my /home/me/hbase-tables is:
MY_TABLE_NAME_1
MY_TABLE_NAME_2
2 row(s) in 0.3310 seconds
MY_TABLE_NAME_1
MY_TABLE_NAME_2

You could use readarray/mapfile just fine, but to remove duplicates, skip empty lines, and drop the unwanted status strings, you need a filter such as awk.
Also, you don't need to create a temporary file and then parse it; instead, use process substitution, which makes the output of a command available as if it were in a temporary file:
mapfile -t output < <(echo "list" | /path/to/hbase/bin/hbase shell -n | awk '!unique[$0]++ && !/seconds/ && NF')
Now the array contains only the unique table names from the HBase output. That said, you should really look for a way to suppress that noise in the query output itself rather than post-processing it like this.
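As a quick sanity check, here is a minimal sketch (the /path/to/hbase path is a placeholder, as in the snippets above) that captures the names into the tables array from the question and prints them:
#!/bin/bash
# Capture the filtered table names into an array
mapfile -t tables < <(echo "list" | /path/to/hbase/bin/hbase shell -n | awk '!unique[$0]++ && !/seconds/ && NF')
printf 'Found %d tables:\n' "${#tables[@]}"   # number of elements in the array
printf '  %s\n' "${tables[@]}"                # one table name per line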

Related

Reduce Unix Script execution time for while loop

Have a reference file "names.txt" with data as below:
Tom
Jerry
Mickey
Note: there are 20k lines in the file "names.txt"
There is another delimited file with multiple lines for every key from the reference file "names.txt" as below:
Name~~Id~~Marks~~Column4~~Column5
Note: there are about 30 columns in the delimited file.
The delimited file looks something like this:
Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Tom~~321~~33~~C4~~C5
.
.
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
.
.
I need to extract, for every key in the file "names.txt", the row from the delimited file having the highest value in the "Marks" column.
So there will be one row in the output file for every key from the file "names.txt".
Below is the unix code snippet that I am using; it works perfectly fine but takes around 2 hours to execute:
function getData {
    name=$1
    grep "${name}" "${delimited_file}" | awk -F"~~" '{if($1==name1 && $3>max){op=$0; max=$3}}END{print op}' max=0 name1="${name}" >> output.txt
}

while read -r line; do
    getData "${line// /}"
done < names.txt
Is there any way to parallelize this and reduce the execution time? Only shell scripting can be used.
Rule of thumb for optimizing bash scripts:
The size of the input shouldn't affect how often a program has to run.
Your script is slow because bash has to run the function 20k times, which involves starting grep and awk. Just starting programs takes a hefty amount of time. Therefore, try an approach where the number of program starts is constant.
Here is an approach:
Process the second file, such that for every name only the line with the maximal mark remains.
Can be done with sort and awk, or sort and uniq -f + Schwartzian transform.
Then keep only those lines whose names appear in names.txt.
Easy with grep -f.
sort -t'~' -k1,1 -k5,5nr file2 |
awk -F'~~' '$1!=last{print;last=$1}' |
grep -f <(sed 's/.*/^&~~/' names.txt)
The sed part turns the names into regexes that ensure that only the first field is matched; assuming that names do not contain special symbols like . and *.
Depending on the relative sizes of the two files, it might be faster to swap those two steps; the result will be the same.
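For the sample data above, the pipeline prints the highest-mark row per name (a hypothetical run, assuming the delimited file is called file2 as in the snippet):
$ sort -t'~' -k1,1 -k5,5nr file2 | awk -F'~~' '$1!=last{print;last=$1}' | grep -f <(sed 's/.*/^&~~/' names.txt)
Jerry~~888~~98~~C4~~C5
Tom~~123~~50~~C4~~C5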

Search and write line of a very large file in bash

I have a big csv file containing 60210 lines. Those lines contains hashes, paths and file names, like so:
hash | path | number | hash-2 | name
459asde2c6a221f6... | folder/..| 6 | 1a484efd6.. | file.txt
777abeef659a481f... | folder/..| 1 | 00ab89e6f.. | anotherfile.txt
....
I am filtering this file against a list of hashes, and to facilitate the filtering process, I create and use a reduced version of this file, like so:
hash | path
459asde2c6a221f6... | folder/..
777abeef659a481f... | folder/..
The filtered result contains all the lines that have a hash which is not present in my reference hash base.
But to make a correct analysis of the filtered result, I need the previous data that I removed. So my idea was to read the filtered result file, search for the hash field, and write it in an enhanced result file that will contain all the data.
I use a loop to do so:
getRealNames() {
    originalcontent="$( cat $originalfile)"
    while IFS='' read -r line; do
        hash=$( echo "$line" | cut -f 1 -d " " )
        originalline=$( echo "$originalcontent" | grep "$hash" )
        if [ ! -z "$originalline" ]; then
            echo "$originalline" >> "$resultenhanced"
        fi
    done < "$resultfile"
}
But in real usage it is highly inefficient: for the file above, this loop takes approximately 3 hours to run on a 4 GB RAM, Intel Centrino 2 system, which seems way too long for this kind of operation.
Is there any way I can improve this operation?
Given the nature of your question, it is hard to understand why you would prefer the shell for processing such a huge file when specialized tools like awk or sed exist for exactly that, as Stéphane Chazelas points out in a wonderful answer on Unix.SE.
Your problem becomes easy to solve once you switch to awk/perl, which are much faster at text processing. Also, you are reading the whole file into RAM with originalcontent="$( cat $originalfile)", which is not desirable at all.
Assuming that in both the original and the reference file the hash starts at the first column and the columns are separated by |, you can use awk as follows:
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
The above reads into memory only the first-column entries from your reference file; the original file is not loaded at all. Once we have the entries from $1 (the first column) of the reference file, we filter the original file by selecting those lines whose first column is not in the array (uniqueHash) we created.
You can make it even faster by setting the C locale, as in LC_ALL=C awk ...
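Spelled out with comments, a sketch of the same command (ref_file and orig_file stand for your reference and original files; filtered_orig is a hypothetical output file):
LC_ALL=C awk -v FS="|" '
    FNR==NR { uniqueHash[$1]; next }   # first file: remember every hash from the reference file
    !($1 in uniqueHash)                # second file: print lines whose hash was not seen above
' ref_file orig_file > filtered_orig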
Your explanation of what you are trying to do is unclear because it describes two tasks: filtering data and then adding missing values back to the filtered data. Your sample script addresses the second, so I'll assume that's what you are trying to solve here.
As I read it, you have a filtered result that contains hashes and paths, and you need to lookup those hashes in the original file to get the other field values. Rather than loading the original file into memory, just let grep process the file directly. Assuming a single space (as indicated by cut -d " ") is your field separator, you can extract the hash in your read command, too.
while IFS=' ' read -r hash data; do
    grep "$hash" "$originalfile" >> "$resultenhanced"
done < "$resultfile"
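If that is still too slow, a single-pass sketch (assuming GNU grep, which accepts -f - to read patterns from stdin, and that the hash really is the first space-separated field of $resultfile) avoids starting grep once per line:
# Extract the hash column once, treat the hashes as fixed strings, and scan the original file a single time
cut -d' ' -f1 "$resultfile" | grep -F -f - "$originalfile" > "$resultenhanced"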

SQLite: How to simultaneously read data from stdin and table name from variable?

From bash, I want to pass to sqlite3 my TSV table from stdin while also passing the name of the table to import into from a variable. How can this be accomplished? For example:
#!/bin/bash
output_sql_db="$1"
input_tsv="$2"
table_name="$3"
tail -n +2 "$input_tsv" | sqlite3 "$output_sql_db" '.import "/dev/stdin" ${table_name}'
Of course in this example, ${table_name} does not get expanded due to the single quotes around the sqlite3 command. How should this be done? It seems all the answers I find only show how to handle one or the other (the data import OR passing the table name).
Use a bash heredoc, and decide when/where you escape input: at variable definition or at command creation.
EDIT:
My point is to simplify the syntax/approach you're using, to something like:
output_sql_db="$1"
input_tsv="$2"
table_name="$3"
importfile=/tmp/import_data.tsv              # temporary file holding the header-stripped data
tail -n +2 "$input_tsv" > "$importfile"
sqlite3 "$output_sql_db" ".import ${importfile} ${table_name}"
The main problem in your example is the use of single-quotes around your command, which prevents variable interpolation. If you want to use variables, you need to use double-quotes.
Another (unrelated) point is your use of tail -n +2, which outputs everything from the second line onward, i.e. it skips the header row; that is presumably what you want when importing a TSV with a header. (If you had instead wanted the last 2 lines of $input_tsv, the syntax would be tail -n 2 "$input_tsv".)
Finally, the name of your $input_tsv variable suggests that your input is tab-separated. If that is the case, you need to tell sqlite that you are not using its default separator | but a Tab character instead, which can be written as $'\t' in Bash.
So this rewritten version of your script should work:
#!/bin/bash
output_sql_db="$1"
input_tsv="$2"
table_name="$3"
tail -n +2 "$input_tsv" | sqlite3 -separator $'\t' "$output_sql_db" ".import /dev/stdin $table_name"
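A hypothetical invocation (the script and file names below are made up for illustration):
# Import data.tsv (minus its header row) into table my_table of mydb.sqlite
./import_tsv.sh mydb.sqlite data.tsv my_table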

bash cat behavior on file versus variables storing file contents

I have a file file1 with the following contents:
Z
X
Y
I can use cat to view the file:
$ cat file1
Z
X
Y
I can sort the file:
$ sort -k1,1 file1
X
Y
Z
I can sort it and store the output in a variable:
sorted_file1=$(sort -k1,1 file1)
But when I try to use cat on the variable sorted_file1 I get an error:
$ cat "$sorted_file1"
cat: X
Y
Z: No such file or directory
I can use echo and it looks about right, but it behaves strangely in my scripts:
$ echo "$sorted_file1"
X
Y
Z
Why does this happen? How does storing the output of a command change how cat interprets it?
Is there a better way to store the output of shell commands within variables to avoid issues like this?
cat operates on files. Your invocation of cat (cat "$sorted_file1") expands to the same as cat $'X\nY\nZ', and of course there's no file of that name, hence the error you see.
Shell variables are not files. If you need to make their values available like files, you need to use echo to create a stream:
echo "$sorted_file1" | cat # portable, STDIN
cat <(echo "$sorted_file1") # Bash, file
cat <<<"$sorted_file1" # Bash, STDIN
(obviously cat is pointless here, but the principle applies to other programs that expect their input from files or STDIN).
You're mixing two concepts, files and variables. Both of these hold data, but they do so in different ways.
I will assume you know what a file is. A variable is like a little data store.
You generally use variables to store little bits of data that you may want to change, use immediately and don't mind losing when your script/program ends.
And you generally use files to store large amounts of data that you want to keep around after your script/program ends.
I believe what you want to do here is sort the file and store the output in another file. To do this, you need to use redirection, like this:
sort -k1,1 file1 > sorted_file1
What this does is sort the file and then write the result to a file called "sorted_file1". Then if you do your regular cat sorted_file1 you will see the sorted contents, as you expect.
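For completeness, a small sketch contrasting the two approaches (file names as in the question):
# Keep the sorted output in a variable (fine for small amounts of data)
sorted_file1=$(sort -k1,1 file1)
printf '%s\n' "$sorted_file1"    # print the variable's contents; no file involved

# Or write the sorted output to a real file and use cat as usual
sort -k1,1 file1 > sorted_file1
cat sorted_file1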

Create files using grep and wildcards with input file

This should be a no-brainer, but apparently I have no brain today.
I have 50 20-gig logs that contain entries from multiple apps, one of which adds a transaction ID to its log lines. I have 42 transaction IDs I need to review, and I'd like to parse out the appropriate lines into separate files.
To do a single file, the command would be simply,
grep CDBBDEADBEEF2020X02393 server.log* > CDBBDEADBEEF2020X02393.log
that creates a log isolated to that transaction, from all 50 server.logs.
Now, I have a file with 42 txnIDs (shortening to 4 here):
CDBBDEADBEEF2020X02393
CDBBDEADBEEF6548X02302
CDBBDE15644F2020X02354
ABBDEADBEEF21014777811
And I wrote:
#!/bin/sh
grep $1 server.\* > $1.log
But that is not working. Changing the shebang to #!/bin/bash -xv gives me this weird output (obviously I'm playing with what the correct escape magic must be):
$ ./xtrakt.sh B7F6E465E006B1F1A
#!/bin/bash -xv
grep - ./server\.\*
' grep - './server.*
: No such file or directory
I have also tried the command line
grep - server.* < txids.txt > $1
But OBVIOUSLY that $1 is pointless and I have no idea how to get a file named per txid using the input redirect form of the command.
Thanks in advance for any ideas. I haven't gone the route of doing a foreach in the shell script, because I want grep to put the original filename in the output lines so I can examine context later if I need to.
Also - it would be great to have the server.* files ordered numerically (server.log.1, server.log.2 NOT server.log.1, server.log.10...)
try this:
while read -r txid; do
    grep "$txid" server.* > "$txid.log"
done < txids.txt
and for the file ordering: rename the files that have a one-digit suffix to two digits with leading zeroes, e.g. mv server.log.1 server.log.01.
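A minimal sketch of that bulk rename (assuming the logs are named server.log.1 through server.log.50 as described):
# Pad every single-digit suffix to two digits so lexical order matches numeric order
for f in server.log.?; do
    mv "$f" "server.log.0${f##*.}"
done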
