Efficient way to delete multiple rows in HBase - hadoop

Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it's not suitable for HBase?
There is a table, say 'chart', which contains items that are in charts. Row keys are in the following format:
chart|date_reversed|ranked_attribute_value_reversed|content_id
Sometimes I want to regenerate the chart for a given date, so I want to delete all rows from 'chart|date_reversed_1' through 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row found by a Scan? All the rows to be deleted are going to be close to each other.
I need to delete the rows, because I don't want one item (one content_id) to have multiple entries which it will have if its ranked_attribute_value had been changed (its change is the reason why chart needs to be regenerated).
Being an HBase beginner, I might be misusing rows for something that columns would handle better -- if you have design suggestions, cool! Or maybe the charts are better generated in a file (i.e. no HBase for output)? I'm using MapReduce.

Firstly, on the point of range deletes: there is no range delete in HBase yet, AFAIK. But there is a way to delete more than one row at a time through the HTableInterface API. Simply build a Delete object for each row key found by the scan, collect them in a List, and pass the list to delete() -- done! To make the scan faster, keep the returned data minimal, since all you need is the row key for deleting whole rows.
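A minimal sketch of that approach, assuming the old HTable/HTableInterface-era client API and using the question's 'chart' table with placeholder key boundaries (a FirstKeyOnlyFilter keeps the scan down to one KeyValue per row):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "chart");

// scan only the key range to be regenerated; only the row key is needed
// to build the Deletes, so return as little as possible per row
Scan scan = new Scan(Bytes.toBytes("chart|date_reversed_1"),
        Bytes.toBytes("chart|date_reversed_2"));
scan.setFilter(new FirstKeyOnlyFilter());

List<Delete> deletes = new ArrayList<Delete>();
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
    deletes.add(new Delete(r.getRow()));
}
scanner.close();

table.delete(deletes);   // one call deletes the whole batch
table.close();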
Secondly, about the design. My understanding of the requirement is: there are contents identified by a content id, charts are generated against each content and that data is stored, and there can be multiple charts per content across dates, depending on the rank. In addition, we want the most recently generated content's chart to show at the top of the table.
Under that assumption I would suggest using three tables: auto_id, content_charts and generated_order. The row key for content_charts would be the content id, and the row key for generated_order would be a long that is auto-decremented using the HTableInterface API: use -1 as the amount to increment by, and initialize the counter value in the auto_id table to Long.MAX_VALUE at the first start-up of the app (or manually). Now, to regenerate chart data, simply clean the column family with a delete, put back the new data, and then put a row into the generated_order table. Because the key decreases on every insertion, the latest insertion sorts to the top of generated_order, which holds the content id as a cell value. If you want to ensure generated_order has only one entry per content, fetch the generated_order id first, save it into content_charts when putting, and before deleting the column family delete the old row from generated_order. This way you can look up the charts for a content with at most 2 gets and no scan is required.
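A rough sketch of the decrementing counter idea (the row key 'chart_order', the family 'c', the qualifier 'next' and the variable contentId are only placeholders; the old HTable/HTableInterface-era API is assumed):

// one-time seeding of the counter, e.g. at first start-up (or do it manually)
HTable autoId = new HTable(conf, "auto_id");
Put seed = new Put(Bytes.toBytes("chart_order"));
seed.add(Bytes.toBytes("c"), Bytes.toBytes("next"), Bytes.toBytes(Long.MAX_VALUE));
autoId.put(seed);

// on every chart regeneration: atomically decrement to get the next (smaller) order id
long orderId = autoId.incrementColumnValue(Bytes.toBytes("chart_order"),
        Bytes.toBytes("c"), Bytes.toBytes("next"), -1L);

// use the decremented long as the row key so newer entries sort to the top
HTable generatedOrder = new HTable(conf, "generated_order");
Put p = new Put(Bytes.toBytes(orderId));
p.add(Bytes.toBytes("c"), Bytes.toBytes("content_id"), Bytes.toBytes(contentId));
generatedOrder.put(p);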
I hope this is helpful.

You can use the BulkDeleteProtocol, which uses a Scan that defines the relevant range (start row, end row, filters).

I ran into the same situation, and this is my code to implement what you want:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("Family"));
scan.setStartRow(structuredKeyMaker.key(startDate));
scan.setStopRow(structuredKeyMaker.key(endDate + 1));
try {
    ResultScanner scanner = table.getScanner(scan);
    // a simple iterator that maps rows to my own entity type, not so important here
    Iterator<Entity> entityIterator = new EntityIteratorWrapper(scanner.iterator(), EntityMapper.create());
    List<Delete> deletes = new ArrayList<Delete>();
    // flush the deletes in batches so I don't run out of memory, as I have a huge amount of data
    int bufferSize = 10000000;
    int counter = 0;
    while (entityIterator.hasNext()) {
        if (counter < bufferSize) {
            // KeyMaker extracts the row key as byte[] from my entity
            deletes.add(new Delete(KeyMaker.key(entityIterator.next())));
            counter++;
        } else {
            table.delete(deletes);
            deletes.clear();
            counter = 0;
        }
    }
    // flush whatever is left in the buffer
    if (deletes.size() > 0) {
        table.delete(deletes);
        deletes.clear();
    }
    scanner.close();
} catch (IOException e) {
    e.printStackTrace();
}

Related

How to use DWitemstatus in Power Builder

I'm learning about PowerBuilder, and I don't know how to use these: DWItemStatus, GetNextModified, ModifiedCount, GetItemStatus, NotModified!, DataModified!, New!, NewModified!.
Please help me.
Thanks for reading!
These relate to the status of rows in a datawindow. Generally the rows are retrieved from a database but this doesn't always have to be the case - data can be imported from a text file, XML, JSON, etc. as well.
DWItemStatus - these values are constants and describe how the data would be changed in the database.
Values are:
NotModified! - data unchanged since retrieved
DataModified! - data in one or more columns has changed
New! - row is new but no values have been assigned
NewModified! - row is new and at least one value has been assigned to a column.
So in terms of SQL, a row which is not modified would not generate any SQL to the DBMS. A DataModified row would typically generate an UPDATE statement. New and NewModified rows would typically generate INSERT statements.
GetNextModified is a method to search a set of rows in a datawindow to find the modified rows within that set. The method takes a row parameter and a buffer parameter. The datawindow buffers are Primary!, Filter!, and Delete!. In general you would only look at the Primary buffer.
ModifiedCount is a method to determine the number of rows which have been modified in a datawindow. Note that deleting a row is not considered a modification. To find the number of rows deleted, use the DeletedCount method.
GetItemStatus is a method to get the status of a column within a row in a datawindow. It takes the parameters row, column (name or number), and DWBuffer.
So now an example of using this:
// loop through rows checking for changes
long ll
datawindow ldw
IF dw_dash.ModifiedCount() > 0 THEN
ll = dw_dash.GetNextModified(0, Primary!)
ldw = dw_dash
DO WHILE ll > 0
// watch value changed
IF ldw.GetItemStatus(ll,'watch',Primary!) = DataModified! THEN
event we_post_item(ll, 'watch', ldw)
END IF
// followup value changed
IF ldw.GetItemStatus(ll,'followupdate',Primary!) = DataModified! THEN
event we_post_item(ll, 'followupdate', ldw)
END IF
ll = ldw.GetNextModified(ll,Primary!)
LOOP
ldw.ResetUpdate() // reset the modified flags
END IF
In this example we first check whether any row in the datawindow has been modified. Then we get the first modified row and check if either the 'watch' or 'followupdate' column was changed. If it was, we trigger an event to do something. We then loop to the next modified row, and so on. Finally, we reset the modified flags so the rows now show as not modified.

I need to find a faster solution to iterate rows in Google App Script

I'm trying to save some row values for multiple columns on multiple tabs in GAS, but it's taking a lot of time and I'd like to find a faster way of doing this, if there is one.
A project, e.g. 'Project1', has -- as a key -- a value associated with it which corresponds to the column where it's stored; the tabs are 600+ rows long.
This script opens up a tab called 'person1' at first and goes through all the rows for the column that corresponds to that project in the 'projects' dictionary (it's the same format for every tab, but more projects will be added in the future).
Right now I'm iterating through the 'members' dictionary (length = m), then through the 'projects' dictionary (length = p) and finally through the rows (length = r); in the meantime it accesses the other spreadsheet where I want to save all those rows.
This means that the current time complexity of my algorithm is O(m * p * r) and it's WAY too slow.
For 15 people with 6 projects each and 600+ rows per tab, that's 15 * 6 * 600 = 54,000 iterations at least (and more people, more projects and more rows will be added).
Is there any way to make my algorithm faster?
const members = {'Person1': 'P1', 'Person2': 'P2'};
const projects = {'Project1': 'L', 'Project2': 'R'};

function saveRowValue() {
  let sourceSpreadsheet = SpreadsheetApp.getActiveSpreadsheet();
  let targetSpreadsheet = SpreadsheetApp.openById('-SPREADSHEET-');
  let targetSheet = targetSpreadsheet.getSheetByName('Tracking time');
  let rowsToWrite = [];
  rowsToWrite.push(['Project', 'Initials', 'Date', 'Tracking time']);
  var rowsToSave = 1;
  for (m in members) {
    Logger.log(m + ' initials:' + members[m]);
    let sourceSheet = sourceSpreadsheet.getSheetByName(m);
    for (p in projects) {
      let values = sourceSheet.getRange(projects[p] + "1:" + projects[p]).getValues();
      Logger.log(values);
      let list = [null, 0, ''];
      for (var i = 0; i < values.length; i++) {
        try {
          // ranges are 1-based while the values array is 0-based
          let date = sourceSheet.getRange('B' + (i + 1)).getValue();
          let val = sourceSheet.getRange(projects[p] + (i + 1));
          val = Utilities.formatDate(val.getValue(), "GMT", val.getNumberFormat());
          Logger.log(val);
          if (!(list.includes(val)) && date instanceof Date) {
            //rowsToWrite.push();
            rowsToSave++;
            targetSheet.getRange(rowsToSave, 1, 1, 4).setValues([[p, members[m], date, val]]);
          }
        } catch (e) {
          Logger.log(e);
        }
      }
    }
  }
  Logger.log(rowsToWrite);
}
Here you can see how much time it takes to iterate 600 rows for a single project and a single member after changing what Yuri Khristich told me to change: https://i.stack.imgur.com/CnRZY.png
The first step is to get rid of getValue() and setValue() calls in loops. All data should be captured at once as a 2D array in one step and put on the sheet in one step as well -- no single-cell or single-row operations. A sketch of that is below.
The next trick depends on your workflow. It's unlikely that all 54,000+ cells need to be checked every time; probably there are ranges with no changes. You could figure out some way to flag the changes and process only the changed ranges. The flagging could be done with an onChange() trigger: for example, add a * to the names of the sheets and columns where changes have occurred and remove these * whenever you run your script.
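A rough sketch of the batched version, reusing the members and projects dictionaries from the question and assuming dates live in column B (the per-cell Utilities.formatDate handling is left out for brevity):

function saveRowValuesBatched() {
  const source = SpreadsheetApp.getActiveSpreadsheet();
  const target = SpreadsheetApp.openById('-SPREADSHEET-').getSheetByName('Tracking time');
  const rowsToWrite = [['Project', 'Initials', 'Date', 'Tracking time']];
  for (const m in members) {
    const sheet = source.getSheetByName(m);
    const data = sheet.getDataRange().getValues();                   // one read per tab
    for (const p in projects) {
      const col = sheet.getRange(projects[p] + '1').getColumn() - 1; // column letter -> 0-based index
      for (let i = 0; i < data.length; i++) {
        const date = data[i][1];                                     // column B
        const val = data[i][col];
        if (val && date instanceof Date) {
          rowsToWrite.push([p, members[m], date, val]);
        }
      }
    }
  }
  // one write for everything instead of one setValues() call per row
  target.getRange(1, 1, rowsToWrite.length, rowsToWrite[0].length).setValues(rowsToWrite);
}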
Reference:
Use batch operations

How to change HBase table scan results order

I am trying to copy specific data from one HBase table to another, which requires scanning the table for only row keys and parsing a specific value from them. It works fine, but I noticed the results seem to be returned in ascending sort order -- in this case alphabetically. Is there a way to specify reverse order, or perhaps ordering by insert timestamp?
Scan scan = new Scan();
scan.setMaxResultSize(1000);
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for (Result r : scanner) {
    System.out.println(Bytes.toString(r.getRow()));
    String rowKey = Bytes.toString(r.getRow());
    if (rowKey.startsWith("dm.") || rowKey.startsWith("bk.") || rowKey.startsWith("rt.")) {
        continue;
    } else if (rowKey.startsWith("yt")) {
        // split on literal dots (the dot has to be escaped in the regex)
        List<String> ytresult = Arrays.asList(rowKey.split("\\s*\\.\\s*"));
        .....
This table is huge so I would prefer to skip to the rows I actually need. Appreciate any help here.
Have you tried the .setReversed() property of the Scan? Keep in mind that in this case your start row would have to be the logical END of your rowKey range, and from there it would scan 'upwards'.
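A rough sketch of what that looks like with your scan (the start/stop keys are only placeholders for the logical end and start of your range):

Scan scan = new Scan();
scan.setReversed(true);                     // reverse scans are available since HBase 0.98
scan.setStartRow(Bytes.toBytes("yu"));      // logical END of the key range you care about
scan.setStopRow(Bytes.toBytes("yt"));       // the scan walks backwards towards this key
scan.setFilter(new FirstKeyOnlyFilter());
scan.setMaxResultSize(1000);
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for (Result r : scanner) {
    String rowKey = Bytes.toString(r.getRow());
    // rows now arrive in descending row-key order
}
scanner.close();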

I can't seem to swap the location of parallel nodes/subtrees within a pugixml document....?

I need to re-sequence the majority of child nodes at one level within my document.
The document has a structure that looks (simplified) like this:
sheet
    table
        row
            parameters
        row
            parameters
        row
            parameters
        row
            cell
                header string
            cell
                header string
            cell
                header string
        data row A
            cell
                data
            cell
                data
            cell
                data
        data row B
            cell
                data
            cell
                data
            cell
                data
        data row C
            cell
                data
            cell
                data
            cell
                data
        data row D
            cell
                data
            cell
                data
            cell
                data
        data row E
            cell
                data
            cell
                data
            cell
                data
        row
            parameters
        row
            parameters
        row
            parameters
        row
            parameters
        row
            parameters
I'm using pugixml now to load, parse, traverse and access the large XML file, and I'm ultimately working out a new sequence for the data rows. I know I'm parsing everything correctly and, looking at the resequencing results, I can see that the reading and processing is correct. The resequencing solution, after all my optimizing and processing, is a list of indices in a revised order, like { D, A, E, C, B } for the example above. So now I need to actually resequence the rows into this new order and then output the resulting xml to a new file. The actual data is about 16 MB, with several hundred data row nodes and more than a hundred data elements in each row.
I've written a routine to swap two data rows, but something I'm doing is destroying the xml structural consistency during the swaps. I'm sure I don't understand the way pugi is moving nodes around and/or invalidating node handles.
I create and set aside node handles -- pugi::xml_node -- to the "table" level node, to the "header" row node, and to the "first data" row node, which in the original form above would be node "data row A". I know these handles give me correct access to the right data -- I can pause execution and look into them during the optimization and resequencing calculations and examine the rows and their siblings and see the input order.
The "header row" is always a particular child of the table, and the "first data row" is always the sibling immediately after the "header row". So I set these up when I load the file and check them for data consistency.
My understanding of node::insert_copy_before is this:
pugi::xml_node new_node_handle_in_document = parentnode.insert_copy_before( node_to_be_copied_to_child_of_parent , node_to_be_copied_nodes_next_sibling )
My understanding is that a deep recursive clone of node_to_be_copied_to_child_of_parent with all children and attributes will be inserted as the sibling immediately before node_to_be_copied_nodes_next_sibling, where both are children of parentnode.
Clearly, if node_to_be_copied_nodes_next_sibling is also the "first data row", then the node handle to the first data row may still be valid after the operation, but will no longer actually be a handle to the first data node. But will using insert_copy on the document force updates to individual node handles in the vicinity of the changes -- or not?
So let's look at the code I'm trying to make work:
// a method to switch data rows
bool switchDataRows( int iRow1 , int iRow2 )
{
// temp vars
int iloop;
// navigate to the first row and create a handle that can move along siblings until we find the target
pugi::xml_node xmnRow1 = m_xmnFirstDataRow;
for ( iloop = 0 ; iloop < iRow1 ; iloop++ )
xmnRow1 = xmnRow1.next_sibling();
// navigate to the second row and create another handle that can move along siblings until we find the target
pugi::xml_node xmnRow2 = m_xmnFirstDataRow;
for ( iloop = 0 ; iloop < iRow2 ; iloop++ )
xmnRow2 = xmnRow2.next_sibling();
// ok.... so now get convenient handles on the locations of the two nodes by creating handles to the nodes AFTER each
pugi::xml_node xmnNodeAfterFirstNode = xmnRow1.next_sibling();
pugi::xml_node xmnNodeAfterSecondNode = xmnRow2.next_sibling();
// at this point I know all the handles I've created are pointing towards the intended data.
// now copy the second to the location before the first
pugi::xml_node xmnNewRow2 = m_xmnTableNode.insert_copy_before( xmnRow2 , xmnNodeAfterFirstNode );
// here's where my concern begins. Does this copy do what I want it to do, moving a copy of the second target row into the position under the table node
// as the child immediately before xmnNodeAfterFirstNode ? If it does, might this operation invalidate other handles to data row nodes? Are all bets off as
// soon as we do an insert/copy in a list of siblings, or will handles to other nodes in that list of children remain valid?
// now copy the first to the spot before the second
pugi::xml_node xmnNewRow1 = m_xmnTableNode.insert_copy_before( xmnRow1 , xmnNodeAfterSecondNode );
// clearly, if other handles to data row nodes have been invalidated by the first insert_copy, then these handles aren't any good any more...
// now delete the old rows
bool bDidRemoveRow1 = m_xmnTableNode.remove_child( xmnRow1 );
bool bDidRemoveRow2 = m_xmnTableNode.remove_child( xmnRow2 );
// this is my attempt to remove the original data row nodes after they've been copied to their new locations
// we have to update the first data row!!!!!
bool bDidRowUpdate = updateFirstDataRow(); // a routine that starts with the header row node and finds the first sibling, the first data row
// as before, if using the insert_copy methods result in many of the handles moving around, then I won't be able to base an update of the "first data row node"
// handle on the "known" handle to the header data row node.
// return the result
return( bDidRemoveRow2 && bDidRemoveRow1 && bDidRowUpdate );
}
As I said, this destroys the structural consistency of the resulting xml. I can save it, but nothing will read it except notepad. The table ends up being somewhat garbled. If I try to use my own program to read it, the reader reports an "element mismatch" error and refuses to load it, understandably.
So I'm doing one or more things wrong. What are they?
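For what it's worth, a tiny standalone sketch like this (element names made up) is a handy way to check the assumptions about insert_copy_before and handle validity before running the swap on the real 16 MB document:

#include <iostream>
#include "pugixml.hpp"

int main()
{
    pugi::xml_document doc;
    doc.load_string("<table><row id='A'/><row id='B'/><row id='C'/></table>");
    pugi::xml_node table = doc.child("table");

    pugi::xml_node rowA = table.child("row");                   // first row
    pugi::xml_node rowC = rowA.next_sibling().next_sibling();   // third row

    // deep-copy C and insert the copy immediately before A; rowA and rowC still
    // refer to the original nodes afterwards
    pugi::xml_node copyOfC = table.insert_copy_before(rowC, rowA);

    // remove the original C
    table.remove_child(rowC);

    doc.save(std::cout);   // expected order: C, A, B
    return 0;
}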

Tables got over-written

I want to loop through a DBF and create a Word table for each record meeting the condition, but I got a one-page report with only the last record in a single table. It looks like all records are written to the same table. I tried to use n = n + 1 to place the variable as an index into the tables collection:
oTable = oDoc.tables[n]
But it seems that only supports a numeric literal rather than a variable?
You have to add each table as you go, making sure to leave space in between them (because Word likes to combine tables).
You'll need something like this inside your loop:
* Assumes you start with oDoc pointing to the document,
* oRange set to an empty range at the beginning of the area where you want to add the tables,
* and that nRows and nCols give you the size of the table.
oTable = oDoc.Tables.Add(m.oRange, m.nRows, m.nCols)
oRange = oTable.Range()
oRange.Collapse(0)
oRange.InsertParagraphAfter()
oRange.Collapse(0)
After this code, you can use oTable to add the data you want to add. Then, on the next time through the loop, you're ready to add another table below the one you just filled.
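For context, the surrounding loop might look roughly like this (the cursor name, the filter field and the table dimensions are made up):

* Assumes oDoc points to the document and the report data is in cursor "mydbf".
oRange = oDoc.Range()
oRange.Collapse(0)                              && collapse to the end of the document

SELECT mydbf
SCAN FOR lPrint                                 && only the records meeting the condition
    oTable = oDoc.Tables.Add(m.oRange, 3, 4)    && a new table for each record
    * ... fill oTable.Cell(r, c).Range.Text from the current record here ...
    oRange = oTable.Range()
    oRange.Collapse(0)
    oRange.InsertParagraphAfter()               && blank paragraph so Word doesn't merge the tables
    oRange.Collapse(0)
ENDSCAN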
