Does an HBase scan return sorted columns? - hadoop

I am working on an HBase MapReduce job and need to understand whether the columns in a single column family are returned sorted by their names (keys). If so, I wouldn't need to sort them in the shuffle/sort stage.
Thanks

I have a very similar data model to yours. Upon insertion, however, I set my own values for the timestamps on the Put object. I did so in a way that took a "seed" of the current time and appended an incrementing counter for each event I persisted in the batch.
When I pulled the results out from the Scan, I wrote a comparator:
import java.util.Comparator;
import org.apache.hadoop.hbase.KeyValue;
// Orders KeyValues by ascending timestamp.
public class KVTimestampComparator implements Comparator<KeyValue> {
    @Override
    public int compare(KeyValue kv1, KeyValue kv2) {
        return Long.compare(kv1.getTimestamp(), kv2.getTimestamp());
    }
}
Then I sorted the raw row:
List<KeyValue> row = Arrays.asList(result.raw());
Collections.sort(row, new KVTimestampComparator());
I got this idea from the person who answered this question: Sorted results from hbase scanner
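The seed-plus-counter timestamps mentioned above could be generated roughly like this. This is only a sketch: BatchTimestamps and forBatch are hypothetical names, not part of the HBase API, and it assumes "appending a counter" means adding the counter to the seed time in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an HBase class): one distinct, increasing
// timestamp per event in a batch.
final class BatchTimestamps {

    // Returns `count` timestamps starting at `seedMillis`, each one larger
    // than the previous, so sorting by timestamp restores insertion order.
    static List<Long> forBatch(long seedMillis, int count) {
        List<Long> timestamps = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            timestamps.add(seedMillis + i);
        }
        return timestamps;
    }

    public static void main(String[] args) {
        // Each value would be used as the explicit timestamp of one cell in the batch.
        System.out.println(forBatch(System.currentTimeMillis(), 3));
    }
}
```

Note that two batches seeded within a few milliseconds of each other can produce overlapping timestamps, so this only guarantees ordering within a single batch.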

No, columns are not sorted.
They are stored internally as key-value pairs in a long byte array. But, you should clarify your question about what you actually need this for.

Related

PageRequest and OrderBy method name Issue

In our Spring application we have a table that contains a lot of "Payment" records. We need a query that pages through the results, sorted from the largest total to the smallest. We are facing an error because sometimes the same record is contained in two successive pages.
We create a PageRequest and pass it to the repository. Here is our implementation:
Repository:
public interface StagingPaymentEntityRepository extends JpaRepository<StagingPaymentEntity, Long> {
Page<StagingPaymentEntity> findAllByStatusAndCreatedDateLessThanEqualAndOperationTypeOrderByEffectivePaymentDesc(String status, Timestamp batchStartTimestamp, String operationType, Pageable pageable);
}
public class BatchThreadReiteroStorni extends ThreadAbstract<StagingPaymentEntity> {
PageRequest pageRequest = PageRequest.of (index, 170);
Page<StagingPaymentEntity> records = ((StagingPaymentEntityRepository) repository).findAllByStatusAndCreatedDateLessThanEqualAndOperationTypeOrderByEffectivePaymentDesc("REITERO", batchStartTimestamp, "STORNO", pageRequest) ;
}
where index is the index of the page we are requesting.
Is there a way to understand why this is happening? Thanks for your support.
This can have multiple causes.
Non-deterministic ordering: if the ordering you are using isn't deterministic, i.e. there are rows that might come in any order, that order might change between selects, resulting in items getting skipped or returned multiple times. Fix: add the primary key as the last column of the ordering.
If you change the entities in a way that affects the ordering, or another process does, you might end up with items getting processed multiple times.
In this scenario I see a couple of approaches:
Do value-based pagination, i.e. don't select pages but select the next N rows after the last row you already processed.
Instead of paging, use a Stream. This allows you to use a single select while still processing the results one element at a time. You might have to flush and evict entities, and I'm not 100% sure that works, but it is certainly worth a try.
Finally, you can mark all rows that you want to process in a separate column, then select N marked entities and unmark them once they are processed.
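To illustrate the first two points (a deterministic order with the primary key as tie-breaker, plus value-based paging), here is a minimal in-memory sketch. KeysetPaging, Payment and nextPage are hypothetical names standing in for the real table and repository; no Spring Data API is used here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory stand-in for the payments table.
final class KeysetPaging {

    static final class Payment {
        final long id;      // primary key, unique
        final long amount;  // sort column, may contain duplicates
        Payment(long id, long amount) { this.id = id; this.amount = amount; }
    }

    // Deterministic order: amount descending, then id ascending as tie-breaker.
    static final Comparator<Payment> ORDER =
            Comparator.comparingLong((Payment p) -> p.amount).reversed()
                      .thenComparingLong(p -> p.id);

    // Returns up to `size` rows that sort strictly after `last` (null means first page).
    static List<Payment> nextPage(List<Payment> table, Payment last, int size) {
        List<Payment> sorted = new ArrayList<>(table);
        sorted.sort(ORDER);
        List<Payment> page = new ArrayList<>();
        for (Payment p : sorted) {
            if (page.size() == size) break;
            if (last == null || ORDER.compare(p, last) > 0) page.add(p);
        }
        return page;
    }
}
```

In SQL terms the same idea is a condition such as "amount < :lastAmount OR (amount = :lastAmount AND id > :lastId)" combined with "ORDER BY amount DESC, id ASC". Because each page starts after the last row seen instead of at an offset, rows with equal amounts cannot shift between pages.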

Spark - sort by value with a JavaPairRDD

I am working with Apache Spark using Java. I have a JavaPairRDD<String,Long> and I want to sort this dataset by its value. However, it seems to offer only a sortByKey method. How can I sort it by the Long value?
dataset.mapToPair(x -> x.swap()).sortByKey(false).mapToPair(x -> x.swap()).take(100)
'Secondary sort' is not supported by Spark yet (See SPARK-3655 for details).
As a workaround, you can sort by value by swapping key <-> value and sorting by key as usual.
In Scala, it would be something like:
val kv: RDD[(String, Long)] = ???
// swap key and value
val vk = kv.map(_.swap)
// sort by the former value, which is now the key
val vkSorted = vk.sortByKey()
I did this using a List, which now has a sort(Comparator c) method:
List<Tuple2<String, Long>> tuples = new ArrayList<>();
tuples.addAll(myRdd.collect()); // brings every record to the driver
tuples.sort((Tuple2<String, Long> o1, Tuple2<String, Long> o2) -> o2._2.compareTo(o1._2));
It is longer than @Atul's solution and I don't know if it performs better. On an RDD with 500 items it shows no difference; I wonder how it works with an RDD of a million records.
You can also use Collections.sort and pass in the list returned by collect along with the lambda-based Comparator.
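A rough sketch of that local sort, without the Spark dependency: Map.Entry stands in for Tuple2, and the input list stands in for whatever collect() returned. The class and method names are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class SortByValueLocally {

    // Copies the collected pairs into a mutable list and sorts by value, largest first.
    static List<Map.Entry<String, Long>> sortedByValueDesc(List<Map.Entry<String, Long>> collected) {
        List<Map.Entry<String, Long>> copy = new ArrayList<>(collected);
        Collections.sort(copy, (a, b) -> b.getValue().compareTo(a.getValue()));
        return copy;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> collected = new ArrayList<>();
        collected.add(new SimpleEntry<String, Long>("a", 1L));
        collected.add(new SimpleEntry<String, Long>("b", 3L));
        collected.add(new SimpleEntry<String, Long>("c", 2L));
        System.out.println(sortedByValueDesc(collected));
    }
}
```

This only makes sense when the whole dataset fits in the driver's memory; for large RDDs the swap-and-sortByKey approach keeps the sort distributed.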

Javafx: Re-sorting a column in a TableView

I have a TableView associated to a TreeView. Each time a node in the TreeView is selected, the TableView is refreshed with different data.
I am able to sort any column in the TableView, just pressing the corresponding column header. That works fine.
But when I select a different node in the tree view, the column headers still show as sorted even though the data is not.
Is there a way to programmatically enforce the sort order made by the user each time the data changes?
Ok, I found how to do it. I will summarize it here in case it is useful to others:
Before you update the contents of the TableView, you must save the sort column (if any) and the sort type:
TableView rooms;
...
TableColumn sortcolumn = null;
SortType st = null;
if (rooms.getSortOrder().size()>0) {
sortcolumn = (TableColumn) rooms.getSortOrder().get(0);
st = sortcolumn.getSortType();
}
Then, after you are done updating the data in the TableView, you must restore the lost sort-column state and perform a sort.
if (sortcolumn!=null) {
rooms.getSortOrder().add(sortcolumn);
sortcolumn.setSortType(st);
sortcolumn.setSortable(true); // This performs a sort
}
I do not take into account the possibility of having multiple columns in the sort, but this would be very simple to do with this information.
I had the same problem and found out that after an update of the data you only have to call the function sort() on the table view:
TableView rooms;
...
// Update data of rooms
...
rooms.sort();
The table view knows the columns for sorting thus the sort function will sort the new data in the wanted order.
This function is only available as of Java 8.
If your TableView is not reinitialized, you can also do the following:
TableColumn<BundleRow, ?> sortOrder = rooms.getSortOrder().get(0);
rooms.getSortOrder().clear();
rooms.getSortOrder().add(sortOrder);
The example of fornacif works, but not if there is more than one sort order (try shift-click on a second column to create secondary sort order).
To do a re-sort on all columns you would need to do something like this:
List<TableColumn<Room, ?>> sortOrder = new ArrayList<>(roomTable.getSortOrder());
roomTable.getSortOrder().clear();
roomTable.getSortOrder().addAll(sortOrder);
If you use the TableView.setItems() method, it appears to reset several aspects of the TableView. Leave the ObservableList in the TableView in place, clear its contents, and then add your new items. Then, TableView.sort() will still know which columns were previously sorted and it will work. Like this:
tableView.getItems().clear();
tableView.getItems().addAll(newTableData);
tableView.sort();
Marco Jakob's answer is good for most cases, but I found that I needed to create a comparator that matches the table sort order for more flexibility. You can then use any method that takes a comparator to do sorting, searching, etc. To create the comparator, I extended the ComparatorChain class from Apache Commons Collections to easily do multi-column sorting. It looks like this:
public class TableColumnListComparator extends ComparatorChain {
    public TableColumnListComparator(ObservableList<? extends TableColumn> columns) {
        // Get list of comparators from column list.
        for (TableColumn column : columns) {
            addComparator(new ColumnComparator(column));
        }
    }
    /**
     * Compares two items in a table column as if they were being sorted in the TableView.
     */
    private static class ColumnComparator implements Comparator {
        private final TableColumn column;
        /**
         * Creates a comparator based on the given table column's sort order.
         *
         * @param column the column whose value factory, comparator and sort type are used
         */
        public ColumnComparator(TableColumn column) {
            this.column = column;
        }
        @Override
        public int compare(Object o1, Object o2) {
            // Could not find a way to do this without casts unfortunately
            // Get the value of the column using the column's cell value factory.
            final ObservableValue<?> obj1 = (ObservableValue) column.getCellValueFactory().call(
                    new TableColumn.CellDataFeatures(column.getTableView(), column, o1));
            final ObservableValue<?> obj2 = (ObservableValue) column.getCellValueFactory().call(
                    new TableColumn.CellDataFeatures(column.getTableView(), column, o2));
            // Compare the column values using the column's given comparator.
            final int compare = column.getComparator().compare(obj1.getValue(), obj2.getValue());
            // Sort by proper ascending or descending.
            return column.getSortType() == TableColumn.SortType.ASCENDING ? compare : -compare;
        }
    }
}
You can then sort at any time with
Collections.sort(backingList, new TableColumnListComparator(table.getSortOrder()));
I use this to sort multiple lists with the same sort, sort on background threads, do efficient updates without re-sorting the whole list, etc. I think there are going to be some improvements to table sorting in JavaFX 8, so this won't be necessary in the future.
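If you would rather not depend on Commons Collections, the chaining part alone can be sketched with Comparator.thenComparing from Java 8. This is a different technique from the ComparatorChain subclass above, not a drop-in replacement: ComparatorChaining and chain are hypothetical names, and the per-column comparators would still have to be built from the table columns as shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ComparatorChaining {

    // Combines per-column comparators in priority order, like a multi-column table sort.
    static <T> Comparator<T> chain(List<Comparator<T>> comparators) {
        Comparator<T> result = (a, b) -> 0; // no sort columns: every pair compares equal
        for (Comparator<T> next : comparators) {
            result = result.thenComparing(next);
        }
        return result;
    }

    public static void main(String[] args) {
        // Rows are {building, room}; sort by building ascending, then room descending.
        Comparator<String[]> byBuilding = Comparator.comparing((String[] r) -> r[0]);
        Comparator<String[]> byRoomDesc = Comparator.comparing((String[] r) -> r[1]).reversed();
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"b", "1"});
        rows.add(new String[] {"a", "1"});
        rows.add(new String[] {"a", "2"});
        rows.sort(chain(Arrays.asList(byBuilding, byRoomDesc)));
        for (String[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Since List.sort is stable, an empty chain leaves the list in its existing order, which matches a table with no sort columns.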
You can also use a SortedList.
SortedList<MatchTableBean> tableItems = new SortedList<>(
observableList, Comparator.comparing(MatchTableBean::isMarker).reversed().thenComparing(MatchTableBean::getQueryRT));
tableItems.comparatorProperty().bind(table.comparatorProperty());
table.setItems(tableItems);
This way the table is sorted, even when the content changes or is completely replaced.
You can also do this for 0 or more Sort-Columns:
List<TableColumn<Room, ?>> sortColumns = new LinkedList<>(rooms.getSortOrder());
// rooms.setItems(...)
rooms.getSortOrder().addAll(sortColumns);
The reason you create a new LinkedList is that you don't want to just point at rooms.getSortOrder() like this:
List<TableColumn<Room, ?>> sortColumns = rooms.getSortOrder();
because this way both rooms.getSortOrder() and sortColumns will be empty after you call rooms.setItems(...), which seems to clear rooms.getSortOrder().

SingleColumnValueFilter not returning proper number of rows

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we want to process only the rows from a given crawl at any one time. In order to run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job, added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or in between these two jobs.
Since adding the scan filter changed the number of visible rows so dramatically, we suspect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;
    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }
    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }
    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}
The filter setup is like this:
String crawlIdentifier=configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)){
throw new IllegalArgumentException("Crawl Identifier not set.");
}
// build an HBase scanner
Scan scan=new Scan();
SingleColumnValueFilter filter=new SingleColumnValueFilter(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
CompareOp.EQUAL,
Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter uses Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the native character encoding in HBaseSchema, or when you write the record if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter compares like for like in the filtered scan.
Bytes.toBytes(crawlIdentifier)
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
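To see why the encoding can matter, here is a small standalone sketch with no HBase involved. ISO-8859-1 is used only as an example of a non-UTF-8 platform default, which is an assumption; the identifier strings are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingMismatch {

    // True when UTF-8 and ISO-8859-1 produce identical bytes for the given text.
    static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("crawl-2013"));    // true: ASCII encodes the same in both
        System.out.println(sameBytes("crawl-\u00e9"));  // false: e-acute is 2 bytes in UTF-8, 1 in ISO-8859-1
    }
}
```

A byte-wise EQUAL comparison would therefore only miss rows whose identifiers contain non-ASCII characters; if all identifiers are plain ASCII, the cause is probably elsewhere.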

Is it possible in Siena to order by a calculated field?

I'm trying to get a query returned ordered on a field which is calculated in Play.
This is the query I'm using.
return all().order("points").fetch();
where points is defined as
public Integer points;
and is retrieved thanks to this getter
public int getPoints(){
List<EventVote> votesP = votes.filter("isPositive", true).fetch();
List<EventVote> votesN = votes.filter("isPositive", false).fetch();
this.points= votesP.size()-votesN.size();
return this.points;
}
The getter is correctly called when I do
int votes=objectWithPoints.points;
I have the feeling I'm expecting a bit too much from Siena, but I would love this to work (or some similar code). Currently it just skips the order condition. Ordering on any other field works correctly.
I think you're right when you say you expect a bit too much :)
The Siena query all().order("points").fetch() performs a request to the DB.
So it will order the values stored in the DB, not those in your program.
From what you say, I see that you have a getter getPoints which computes a value.
Yet, if you don't store this value in the database, the ordering can't be performed by Siena.
So either you compute the value, set it in your object and save the object to the DB:
objectWithPoints.points = objectWithPoints.getPoints();
objectWithPoints.save();
Or you order the values yourself in your program after computing them.
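The second option, ordering in the program, could look roughly like this. It is a sketch with a hypothetical Event class whose vote counts are plain fields; in the real model getPoints() would run the two Siena vote queries instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OrderByComputedPoints {

    // Hypothetical stand-in for the model; the vote counts replace the Siena queries.
    static final class Event {
        final String name;
        final int positiveVotes;
        final int negativeVotes;
        Event(String name, int positiveVotes, int negativeVotes) {
            this.name = name;
            this.positiveVotes = positiveVotes;
            this.negativeVotes = negativeVotes;
        }
        int getPoints() {
            return positiveVotes - negativeVotes;
        }
    }

    // Sorts a copy of the fetched list by the computed value, lowest first.
    static List<Event> orderedByPoints(List<Event> fetched) {
        List<Event> copy = new ArrayList<>(fetched);
        copy.sort(Comparator.comparingInt(Event::getPoints));
        return copy;
    }
}
```

Keep in mind that with the real getter every comparison may call getPoints() again, and each call hits the database twice, so computing the points once per object before sorting is likely cheaper.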
