HappyBase - Is there an equivalent of find_one or scan_one?

All the rows in a particular HBase table that I am making a UI for happen to have the same columns, and will for the foreseeable future. I would like my HTML data-visualizer application to query a single arbitrary row, take note of its column names, and keep that list of column names in a variable to refer to throughout the program.
I didn't see any equivalent to find_one or scan_one in the docs for HappyBase.
What is the best way to accomplish this?

This will fetch only the first row:
row = next(table.scan(limit=1))
Additionally, you can specify a filter string to avoid retrieving the values, which is only worthwhile if your values are large and you perform this query often.
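A minimal sketch of this approach (the host and table names here are hypothetical); KeyOnlyFilter() is the server-side filter that strips cell values:
import happybase

connection = happybase.Connection('hbase-host')  # hypothetical host
table = connection.table('my_table')             # hypothetical table name

# Fetch a single row and record its column names.
row_key, row_data = next(table.scan(limit=1))
column_names = list(row_data.keys())

# Same scan, but with a server-side filter so the (possibly large)
# cell values are never sent over the wire.
row_key, row_data = next(table.scan(limit=1, filter='KeyOnlyFilter()'))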

I use the 'limit' option.
Here is my HappyBase Python code:
import happybase

pool1 = happybase.ConnectionPool(size=2, host='172.00.00.01')
with pool1.connection() as conn1:
    hTable = conn1.table('HBastTableHere')
    # Scan at most one row and print its {column: value} mapping.
    for rowKey, rowData in hTable.scan(limit=1):
        print(rowData)

You can create a scanner without specifying a start row (so that it starts at the first row in the table) and limit the scan to one row. You will then get only the first row.
From the HBase shell command it should look like this:
scan 'table_name', {LIMIT => 1}
I don't know the HappyBase API in detail, but you should be able to do the same there.
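For reference, the HappyBase equivalent of that shell scan would look something like this (a sketch, reusing the table object from the answers above):
# HappyBase equivalent of: scan 'table_name', {LIMIT => 1}
for row_key, row_data in table.scan(limit=1):
    print(row_key, row_data)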

Related

Power Query - conditional replace/clear entire cell in multiple columns

I'm trying to clear the entire cell if it doesn't contain a given keyword.
I've managed to do this for one column:
Table.ReplaceValue(#"PrevStep",each [#"My Column"], each if Text.PositionOf([#"My Column"],"keyword")>-1 then [#"My Column"] else null,Replacer.ReplaceValue,{"My Column"})
The problem is I need to iterate/repeat that step for a number of columns... the number of columns may vary and column names also may be different every time. I can have all those column names put into a list but I'm not able to use it.
The solution I'm looking for may look like this
for each ColNam in MyColumnsList
Table.ReplaceValue(#"PrevStep",each [#"ColNam"], each if Text.PositionOf([#"ColNam"],"keyword")>-1 then [#"ColNam"] else null,Replacer.ReplaceValue,MyColumnsList)
next
but this is Power Query M, not VBA - and of course the problem is with #"PrevStep", since each replacement step would have to feed into the next, like a recursion... again, I do not know how to handle that.
Is the path I am following correct, or should it be done some other way?
Thanks
Andrew
Unpivot your columns to turn all of them into attribute/value pairs, apply your replacement to the single Value column, then pivot it back into the original format.
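A minimal sketch of that approach, assuming the previous step is named #"PrevStep", the target column names are in MyColumnsList, and "keyword" is the search text (all taken from the question); the index column preserves row identity through the round trip:
let
    Source = #"PrevStep",
    // Index each row so it can be reassembled after the pivot
    Indexed = Table.AddIndexColumn(Source, "RowIndex", 0, 1),
    // Turn the target columns into Attribute/Value pairs
    Unpivoted = Table.Unpivot(Indexed, MyColumnsList, "Attribute", "Value"),
    // Clear every value that does not contain the keyword
    Replaced = Table.ReplaceValue(Unpivoted,
        each [Value],
        each if Text.PositionOf([Value], "keyword") > -1 then [Value] else null,
        Replacer.ReplaceValue, {"Value"}),
    // Pivot back to the original shape and drop the helper index
    Pivoted = Table.Pivot(Replaced, List.Distinct(Replaced[Attribute]), "Attribute", "Value"),
    Result = Table.RemoveColumns(Pivoted, {"RowIndex"})
in
    Result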

Combining multiple sheets with different columns using Power Query

I have a workbook with multiple sheets that need to be combined, i.e. stacked, into one table. While they have many similar column names, they do not all have the same columns, and the column order differs. Because of this I cannot use the built-in merge functionality, since it relies on column order. Table.Combine will solve the problem, but I cannot figure out how to write a statement that uses the "each" mechanic to do that.
For each worksheet in x workbook
Table.Combine(prior sheet, next sheet)
return all sheets stacked.
Would someone please help?
If you load your workbook with Excel.Workbook you can choose the Sheet kind (instead of the Table or DefinedName kinds) and ignore the sheet names. Table.Combine then unions the columns by name, so sheets with different or reordered columns are stacked correctly, with nulls filling any gaps.
let
    Source = Excel.Workbook(File.Contents("C:\Path\To\File\FileName.xlsx"), null, true),
    #"Filter Sheets" = Table.SelectRows(Source, each [Kind] = "Sheet"),
    #"Promote Headers" = Table.TransformColumns(#"Filter Sheets", {{"Data", each Table.PromoteHeaders(_, [PromoteAllScalars=true])}}),
    #"Combine Sheets" = Table.Combine(#"Promote Headers"[Data])
in
    #"Combine Sheets"
1. Load each table into Power Query as a separate query.
2. Fix up the column names as needed in each individual query.
3. Save each query as a connection.
4. In one of the queries (or in a separate query), use the Append command to append all the fixed-up queries that now have the same column names, as sketched below.
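The Append command is just UI sugar over Table.Combine; a sketch, with Query1 through Query3 standing in for the fixed-up queries (hypothetical names):
let
    // Equivalent of the Append command for three saved queries
    Appended = Table.Combine({Query1, Query2, Query3})
in
    Appended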

Is there a way to dynamically identify and expand an embedded table's columns?

If I want to expand an embedded table column and I click on the expand button, I'm presented with a dropdown to select which columns I want to expand.
However, if I choose '(Select All Columns)' to expand them all, Power Query turns that into hard-coded column names of all the columns at the time I do that. Like this:
= Table.ExpandTableColumn(Source, "AllData", {"Column1", "Column2", "Column3", "Column4", "Custom"}, {"Column1", "Column2", "Column3", "Column4", "Custom"})
After that, if the underlying embedded table's columns change, the hard-coded column names will no longer be relevant and the query will "break."
So how can I tell it to dynamically identify and extract all of the current columns of the embedded table?
You can do something like this to get the list of column names:
List.Accumulate(Source[AllData], {}, (state, current) => List.Union({state, Table.ColumnNames(current)}))
This goes through each cell in the column, gets the column names from the table in that cell, and adds the new names to the result. It's easier to store this in a new step and then reference that in your next step.
Keep in mind that this method can be much slower than passing in the list of names you know about because it has to scan through the entire table to get the column names. You may also have problems if you use this for the third parameter in Table.ExpandTableColumn because it could use a column name that already exists.
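Putting it together, a sketch using the step and column names from the question (Source and "AllData"); prefixing the new names sidesteps the collision problem just mentioned:
// Dynamic list of every column name found in the embedded tables
ColNames = List.Accumulate(Source[AllData], {},
    (state, current) => List.Union({state, Table.ColumnNames(current)})),
// Expand with the dynamic list; prefixed new names cannot clash
// with columns that already exist in the outer table
Expanded = Table.ExpandTableColumn(Source, "AllData",
    ColNames, List.Transform(ColNames, each "AllData." & _))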
Try using Table.Join which joins and expands the second table in one step.
"Merged Queries" = Table.Join(Source,{"Index.1"},Table2,{"Index.2"},JoinKind.LeftOuter)
You just need to make sure that the columns between the tables are unique.
Use Table.PrefixColumns on the second table beforehand to ensure the column names are unique.
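For example (a sketch of two steps inside the query, using the names from the answer above):
// Prefix Table2's columns so none can collide with Source's
Table2Prefixed = Table.PrefixColumns(Table2, "T2"),
#"Merged Queries" = Table.Join(Source, {"Index.1"}, Table2Prefixed, {"T2.Index.2"}, JoinKind.LeftOuter)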

How can I skip HBase rows that are missing specific column family?

For example, an HBase table has columnFamilyA, columnFamilyB and columnFamilyC, and for some rows columnFamilyA does not contain any columns. I would like to scan the table and return only the rows that have at least one column in columnFamilyA.
What kind of filter should I use? I checked SingleColumnValueFilter, but it seems to work only with a specific column rather than a column family. I need all rows where columnFamilyA contains at least one column - and not just the data in columnFamilyA, but the entire row.
If you need only the data from columnFamilyA, you can use the addFamily method on Get or Scan objects.
Otherwise you can do two passes: first scan restricted to columnFamilyA to collect the keys of rows that have at least one column in it, then fetch those rows in full.
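A sketch of that two-pass approach in HappyBase (host and table names are hypothetical); a scan restricted to one family simply skips rows that have no cells in that family:
import happybase

connection = happybase.Connection('hbase-host')  # hypothetical
table = connection.table('my_table')             # hypothetical

# Pass 1: scan only columnFamilyA; rows with no cell in that
# family are never returned, which filters their keys out here.
keys = [key for key, _ in table.scan(columns=['columnFamilyA'])]

# Pass 2: fetch the entire rows for the surviving keys.
full_rows = table.rows(keys)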

How can I do a double delimiter in Hive?

Let's say I have some sample rows of data:
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with the individual elements separated by pipes, so I can do GROUP BYs on them to show, for example, how many URLs had a second position of "4" (which would be 2 rows in this case). So the "url" field has additional data I'd like to parse out and use in my queries.
The question is, what is the best way to do that in Hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacoll')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break the sub-delimited values out into an array (the pipe must be double-escaped because the pattern is a Java regex):
split(parse_url(url, 'QUERY', 'datacoll'), '\\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
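For instance, to answer the "how many URLs had a 4 in the second position" question (a sketch; Hive arrays are zero-indexed, so [1] is the second element):
SELECT count(*)
FROM clicklogs
WHERE split(parse_url(url, 'QUERY', 'datacoll'), '\\|')[1] = '4';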
This looks very similar to something I did a couple of weeks ago. I think the best approach in your case would be to apply a pre-processing step (possibly with Hadoop streaming) and change the table definition to:
create table clicklogs (sitename string, datacol array<int>)
row format delimited
fields terminated by '^'
collection items terminated by '|';
Once you have that, you can easily manipulate your data in Hive using lateral views and the built-in explode. The following query should get you the counts per value:
select col, count(1)
from clicklogs
lateral view explode(datacol) dataTable as col
group by col;
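The pre-processing step itself is not shown in the answer; here is a sketch of what a streaming mapper could look like, assuming input lines shaped like the samples above:
import sys
from urllib.parse import urlparse, parse_qs

# Turns 'site1^http://...?datacoll=5|4|3|2|1&test=yes'
# into  'site1^5|4|3|2|1' for the table defined above.
for line in sys.stdin:
    sitename, url = line.rstrip('\n').split('^', 1)
    params = parse_qs(urlparse(url).query)
    datacoll = params.get('datacoll', [''])[0]
    if datacoll:
        print('%s^%s' % (sitename, datacoll))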
