Spark not able to retrieve all HBase data in a specific column - hadoop

My HBase table has 30 million records. Each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging from a few KB to 50 MB. When I run the following Spark code, it only gets 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd: RDD[Map[String, Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the id list first and then iterating over it to fetch the column raw:sample with the plain HBase Java client in a Spark foreach.
Any ideas why I cannot get all of the column raw:sample through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one of my datanodes went down, but I fixed them soon, and the replication factor is 3; could this be the reason? Do you think running hbck -repair would help? Thanks a lot!

Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark) configured to retrieve the same column from HBase, and see if the problem repeats:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from scan result
for (Result result = scanner.next(); result != null; result = scanner.next())
System.out.println("Found row : " + result);
//closing the scanner
scanner.close();
In addition, by default, TableInputFormat is configured to request only a very small chunk of data from the HBase server per RPC, which causes a large overhead. Configure the scan to fetch more rows per call (and disable block caching, which is of no use for a full scan):
scan.setCacheBlocks(false);
scan.setCaching(2000);
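Since the Spark job in the question lets TableInputFormat build the Scan itself, a hedged equivalent is to pass the same settings through the Hadoop Configuration using the TableInputFormat property constants (the values are just examples mirroring the snippet above):
// org.apache.hadoop.hbase.mapreduce.TableInputFormat constants, set on the job Configuration
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "sampleData");
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample");
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false"); // same effect as scan.setCacheBlocks(false)
conf.set(TableInputFormat.SCAN_CACHEDROWS, "2000");   // same effect as scan.setCaching(2000)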

For a high throughput like yours, Apache Kafka is the best solution for integrating the data flow and keeping the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some use cases of Kafka.
One more:
http://sites.computer.org/debull/A12june/pipeline.pdf

Related

How do I update one column of all rows in a large table in my Spring Boot application?

I have a Spring Boot 2.x project with a big table in my Cassandra database. In my Liquibase migration class, I need to replace a value in one column in all rows.
For me it's a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
because of the total number of rows, even when I select only one column.
Is it possible to do something like a "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy to hear about any other solution approaches you have.
Must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, when you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with batch updates in relational databases.
For fine-grained operations like this you want to go back to using the drivers directly, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
Spring Data Cassandra core
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currpage < page) { // currpage/page: your own progress counters
    slice = myEntityRepo.findAll(slice.nextPageable());
    currpage++;
}
slice.getContent();
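If the repository route is good enough, here is a hedged sketch of driving the update itself from such a Slice loop (the entity, repository and setter names are assumed):
// Page through the table with Slice and rewrite the column chunk by chunk.
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(1000));
while (true) {
    for (MyEntity e : slice.getContent()) {
        e.setMyColumn("newValue");              // hypothetical setter for the column to replace
    }
    myEntityRepo.saveAll(slice.getContent());   // one write per row, issued chunk by chunk
    if (!slice.hasNext()) {
        break;
    }
    slice = myEntityRepo.findAll(slice.nextPageable());
}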
Drivers:
// Prepare statements to speed up the queries
PreparedStatement selectPS = session.prepare(QueryBuilder
        .selectFrom("myEntity").all()
        .build()
        .setPageSize(1000)                        // 1000 rows per page
        .setTimeout(Duration.ofSeconds(10)));     // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
        .update("mytable")
        .setColumn("myColumn", QueryBuilder.bindMarker())
        .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
        .build()
        .setConsistencyLevel(ConsistencyLevel.ONE)); // fast writes

// Paginate (statements in driver 4 are immutable, so capture each copy)
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...)); // bind the new value and the row's key
}
ByteBuffer pagingStateAsBytes = page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
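A minimal sketch of such a loop, reusing the selectPS, updatePS and session objects prepared above (the bound values and column types are assumptions):
// Carry the paging state forward until it becomes null, i.e. the last page.
ByteBuffer pagingState = null;
do {
    BoundStatement bound = selectPS.bind();
    if (pagingState != null) {
        bound = bound.setPagingState(pagingState);   // statements are immutable, keep the copy
    }
    ResultSet page = session.execute(bound);
    while (page.getAvailableWithoutFetching() > 0) { // stay inside the current page
        Row row = page.one();
        session.executeAsync(updatePS.bind("newValue", row.getString("myPK"))); // assumes text columns
    }
    pagingState = page.getExecutionInfo().getPagingState(); // null once fully exhausted
} while (pagingState != null);
It is advisable to throttle the executeAsync calls (for example with a semaphore) so the asynchronous writes do not overwhelm the cluster.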

Datastax Cassandra Java Driver 4 paging problem

I have a requirement to fetch results from my cassandra database table in a paginated manner. I am using spring boot version 2.3.1 which in turn is using cassandra java driver 4. In previous driver versions, while paginating, there were no issues and the driver used to fetch the results equal to specified page size, like so:
Where select = QueryBuilder.select("column1", "column2", "column3")
.from("my_table")
.where(QueryBuilder.eq("column4", "some_value"))
.and(QueryBuilder.eq("column5", "some_value"));
select.setFetchSize(5);
if (!page.equals("0"))
select.setPagingState(PagingState.fromString(page));
ResultSet results = cassandraTemplate.getCqlOperations().queryForResultSet(select);
PagingState nextPage = results.getExecutionInfo().getPagingState();
int remaining = results.getAvailableWithoutFetching(); // gives 5 results as specified
In Java driver version 4, the method has been changed from setFetchSize(int) to setPageSize(int). But the same thing is not working here. It is fetching all the results even after specifying the size:
SimpleStatement stmt = QueryBuilder.selectFrom("my_keyspace", "my_table")
    .columns(Arrays.asList("column1", "column2", "column3"))
    .where(Relation.column("column4")
            .isEqualTo(QueryBuilder.literal("some_value")),
        Relation.column("column5")
            .isEqualTo(QueryBuilder.literal("some_value")))
    .build();
stmt.setPageSize(5);
if (!page.equals("0"))
stmt.setPagingState(Bytes.fromHexString(page));
ResultSet results = cassandraTemplate.getCqlOperations().queryForResultSet(stmt);
ByteBuffer nextPage = results.getExecutionInfo().getPagingState();
int remaining = results.getAvailableWithoutFetching(); // gives all the results, even though size is specified as 5
Am I doing something wrong? If I'm not then what should be the solution to this problem?
Statements in driver 4 are immutable. You need to change the following lines:
stmt = stmt.setPageSize(5);
if (!page.equals("0"))
stmt = stmt.setPagingState(Bytes.fromHexString(page));
In other words, each mutating method returns a new instance, so you need to capture that by reassigning the stmt variable each time.
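Putting it together, a hedged end-to-end version of the snippet from the question with those reassignments applied:
SimpleStatement stmt = QueryBuilder.selectFrom("my_keyspace", "my_table")
        .columns(Arrays.asList("column1", "column2", "column3"))
        .where(Relation.column("column4").isEqualTo(QueryBuilder.literal("some_value")),
               Relation.column("column5").isEqualTo(QueryBuilder.literal("some_value")))
        .build();
stmt = stmt.setPageSize(5);                                 // capture the copy
if (!page.equals("0")) {
    stmt = stmt.setPagingState(Bytes.fromHexString(page));  // capture the copy
}
ResultSet results = cassandraTemplate.getCqlOperations().queryForResultSet(stmt);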

Large Resultset with Spring Boot and QueryDSL

I have a Spring Boot application where I use QueryDSL for dynamic queries.
Now the results should be exported as a CSV file.
The model is an Order which contains products. The products should be included in the CSV file.
However, as there are many thousands of orders with millions of products, this should not be loaded into memory all at once.
The solutions proposed for Hibernate (ScrollableResults) and streams, however, are not supported by QueryDSL.
How can this be achieved while still using QueryDSL (to avoid duplication of filtering logic)?
One workaround to this problem is to keep iterating using offset and limit.
Something like:
long limit = 100;
long lastLimitUsed = 0;
List<MyEntity> entities;
do {
    entities = new JPAQuery<MyEntity>(em)
            .from(QMyEntity.entity)
            .limit(limit)
            .offset(lastLimitUsed)
            .fetch();
    lastLimitUsed += limit;
    // process/export the current chunk here
} while (!entities.isEmpty());
With that approach you can fetch smaller chunks of data. It is important to analyze whether limit and offset will work well with your query: there are situations where, even with limit and offset, you end up doing a full scan of the tables involved in the query. If that happens you will trade the memory problem for a performance problem.
Use JPAQueryFactory
// com.querydsl.jpa.impl.JPAQueryFactory
JPAQueryFactory jpaQueryFactory = new JPAQueryFactory(entityManager);

Expression<MyEntity> select = QMyEntity.myEntity;
EntityPath<MyEntity> path = QMyEntity.myEntity;
Stream<MyEntity> stream = jpaQueryFactory
        .select(select)
        .from(path)
        .where(cond)       // cond: the dynamic QueryDSL predicate built elsewhere
        .createQuery()     // get the underlying JPA query
        .getResultStream();
// process the stream here
stream.close();

Adding partitions to Hive from a MapReduce Job

I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS at /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folder mentioned above, like:
"/data/logs/dt=2012-10-01/server01/"
"/data/logs/dt=2012-10-01/server02/"
...
...
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
the alter table command -- too many alter table commands
adding dynamic partitions
For approach two I see only examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?
To do this from within a Map/Reduce job I would recommend using Apache HCatalog, which is a newer project under the Hadoop umbrella.
HCatalog really is an abstraction layer on top of HDFS so you can write your outputs in a standardized way, be it from Hive, Pig or M/R. Where this comes into the picture for you is that you can directly load data into Hive from your Map/Reduce job using the output format HCatOutputFormat. Below is an example taken from the official website.
A current code example for writing out a specific partition for (a=1,b=1) would go something like this:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
And to write to multiple partitions, separate jobs will have to be kicked off with each of the above.
You can also use dynamic partitions with HCatalog, in which case you could load as many partitions as you want in the same job!
I recommend reading further on HCatalog on the website provided above, which should give you more details if needed.
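For reference, here is a minimal sketch of what the dynamic-partition setup can look like with the newer OutputJobInfo-based API (the same calls appear in the answer below; dbName, tblName and job are assumed to be defined elsewhere):
OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, tblName, null); // null => dynamic partitions
HCatOutputFormat.setOutput(job, outputJobInfo);
// NOTE: as described below, getTableSchema() may not include the partition
// columns, in which case they have to be appended manually as HCatFieldSchema.
HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
HCatOutputFormat.setSchema(job, schema);
job.setOutputFormatClass(HCatOutputFormat.class);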
In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog MapReduce jobs to work with writing to dynamic partitions:
In the record-writing phase of the job (usually the reducer), I have to manually add my dynamic partitions (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields. They need to be added manually:
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
Here's the code for writing into multiple tables with dynamic partitioning in one job using HCatalog; the code has been tested on Hadoop 2.5.0 and Hive 0.13.1:
// ... Job setup, InputFormatClass, etc ...
String dbName = null;
String[] tables = {"table0", "table1"};
job.setOutputFormatClass(MultiOutputFormat.class);
MultiOutputFormat.JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);
List<String> partitions = new ArrayList<String>();
partitions.add(0, "partition0");
partitions.add(1, "partition1");
HCatFieldSchema partition0 = new HCatFieldSchema("partition0", TypeInfoFactory.stringTypeInfo, null);
HCatFieldSchema partition1 = new HCatFieldSchema("partition1", TypeInfoFactory.stringTypeInfo, null);
for (String table : tables) {
    configurer.addOutputFormat(table, HCatOutputFormat.class, BytesWritable.class, HCatRecord.class);
    OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, table, null);
    outputJobInfo.setDynamicPartitioningKeys(partitions);
    HCatOutputFormat.setOutput(configurer.getJob(table), outputJobInfo);
    // getTableSchema() does not include the partition columns, append them manually
    HCatSchema schema = HCatOutputFormat.getTableSchema(configurer.getJob(table).getConfiguration());
    schema.append(partition0);
    schema.append(partition1);
    HCatOutputFormat.setSchema(configurer.getJob(table), schema);
}
configurer.configure();
return job.waitForCompletion(true) ? 0 : 1;
Mapper:
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, HCatRecord> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        HCatRecord record = new DefaultHCatRecord(3); // including partitions
        record.set(0, value.toString());
        // partitions must be set after non-partition fields
        record.set(1, "0"); // partition0=0
        record.set(2, "1"); // partition1=1
        MultiOutputFormat.write("table0", null, record, context);
        MultiOutputFormat.write("table1", null, record, context);
    }
}

Hadoop: Map Reduce: read from HBase, but filter rows by content of one column

I am really new to Hadoop and I am not able to find an answer to my question. I want to write a MapReduce job where I read from HBase and then write into a simple text file.
In HBase, I've got a column representing an id. Now I don't want to work on all rows contained in my HBase table, but only on those between a minId and a maxId.
I found out that I could possibly use filters (scan.setFilter) so that I can filter out rows which don't match my request.
This is my first MapReduce job, so please be patient :-)
I've got a starter class where I configure the job and the Scan object and then start the job.
Now, my first try looks like this:
private Scan getScan()
{
    final Scan scan = new Scan();

    // ** FILTER **
    List<Filter> filters = new ArrayList<Filter>();
    Filter filter1 = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(minId))));
    filters.add(filter1);
    Filter filter2 = new ValueFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(maxId))));
    filters.add(filter2);
    FilterList filterList = new FilterList(filters);
    scan.setFilter(filterList);

    scan.setCaching(500);
    scan.setCacheBlocks(false);

    // id
    scan.addColumn("columnfamily".getBytes(), "id".getBytes());

    return scan;
}
Well, I'm not sure if this is the right way to do it. I also read that I could maybe pass my minId and maxId to the map job with the Configuration object, but I'm not sure how.
Besides, what do I have to do afterwards? I would normally just initiate the job with initTableMapperJob and pass the Scan object to it. I've read something about ResultScanner and the like; do I need them? I thought the MapReduce framework would automatically pass the correct rows to my map job, is that correct?
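For illustration only, a hedged sketch of the job setup described above; the class names (MyStarter, MyMapper), table name, output path and property keys are made up:
// min/max ids go into the Configuration; the Scan with the filters is handed to initTableMapperJob
Configuration conf = HBaseConfiguration.create();
conf.set("my.minid", minId); // read back in the mapper via context.getConfiguration().get("my.minid")
conf.set("my.maxid", maxId);

Job job = Job.getInstance(conf, "hbase-id-range-export");
job.setJarByClass(MyStarter.class);
TableMapReduceUtil.initTableMapperJob(
        "mytable",          // input HBase table
        getScan(),          // the Scan with the filter list from above
        MyMapper.class,     // must extend TableMapper<Text, Text>
        Text.class,         // mapper output key
        Text.class,         // mapper output value
        job);
FileOutputFormat.setOutputPath(job, new Path("/output/path")); // plain text output (default TextOutputFormat)
job.waitForCompletion(true);
With this setup the framework does feed the mapper only the rows that pass the Scan's filters, so no explicit ResultScanner is needed in the mapper.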
