Spring GCP Datastore: Batch processing error: Binding site #limit for limit bound to non-integer value parameter., code=INVALID_ARGUMENT

I have 5,000 entity records in my GCP Datastore. Using repo.findAll() takes 45 seconds to fetch the results; this is the one-liner:
Iterable<StoreCache> storeCaches = your_kindRepository.findAll();
So I thought of using the pagination feature to fetch 25 records at a time, but I get the run-time error below at the line repo.findAllSlice(DatastorePageable.of(page, 25)):
com.google.datastore.v1.client.DatastoreException: Binding site #limit for limit bound to non-integer value parameter., code=INVALID_ARGUMENT
This is my repo code:
@Repository
@Transactional
public interface your_kindRepository extends DatastoreRepository<your_kind, Long> {

    @Query("select * from your_kind")
    Slice<TestEntity> findAllSlice(Pageable pageable);
}
This is my service class code:
LOGGER.info("start");
int page = 0;
Slice<TestEntity> slice = null;
while (true) {
if (slice == null) {
slice = repo.findAllSlice(DatastorePageable.of(page, 25));
} else {
slice = repo.findAllSlice(slice.nextPageable());
}
if (!slice.hasNext()) {
break;
}
LOGGER.info("processed: " + page);
page++;
}
LOGGER.info("end");

Related

Spring GCP - Datastore performance: Batch processing, iterating through the whole entity list is very slow

The following code is really slow; it takes almost 30 seconds to process 400 entities:
int page = 0;
org.springframework.data.domain.Page<MyEntity> slice = null;
while (true) {
    if (slice == null) {
        slice = repo.findAll(PageRequest.of(page, 400, Sort.by("date")));
    } else {
        slice = repo.findAll(slice.nextPageable());
    }
    if (!slice.hasNext()) {
        break;
    }
    slice.getContent().forEach(v -> v.setApp(SApplication.NAME_XXX));
    repo.saveAll(slice.getContent());
    LOGGER.info("processed: " + page);
    page++;
}
Instead, I use the following (the plain GCP client library for Datastore), which takes 4-6 seconds per 400 entities:
Datastore service = DatastoreOptions.getDefaultInstance().getService();
StructuredQuery.Builder<?> query = Query.newEntityQueryBuilder();
int limit = 400;
query.setKind("ENTITY_KIND").setLimit(limit);
int count = 0;
Cursor cursor = null;
while (true) {
    if (cursor != null) {
        query.setStartCursor(cursor);
    }
    QueryResults<?> queryResult = service.run(query.build());
    List<Entity> entityList = new ArrayList<>();
    while (queryResult.hasNext()) {
        Entity loadEntity = (Entity) queryResult.next();
        Entity.Builder newEntity = Entity.newBuilder(loadEntity).set("app", SApplication.NAME_XXX.name());
        entityList.add(newEntity.build());
    }
    service.put(entityList.toArray(new Entity[0]));
    count += entityList.size();
    if (entityList.size() == limit) {
        cursor = queryResult.getCursorAfter();
    } else {
        break;
    }
    LOGGER.info("Processed: {}", count);
}
Why can't I use Spring to do that batch processing?
Full discussion here: https://github.com/spring-cloud/spring-cloud-gcp/issues/1824
First, you need to use the correct library version: at least 1.2.0.M2.
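As a sketch, the Maven dependency would look something like this (the artifact coordinates are my assumption for the milestone release; check them against the Spring milestone repository, which milestone builds are served from):

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-gcp-starter-data-datastore</artifactId>
    <version>1.2.0.M2</version>
</dependency>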
Second, you need to add a new method to the repository interface:
@Query("select * from your_kind")
Slice<TestEntity> findAllSlice(Pageable pageable);
The final code looks like this:
LOGGER.info("start");
int page = 0;
Slice<TestEntity> slice = null;
while (true) {
if (slice == null) {
slice = repo.findAllSlice(DatastorePageable.of(page, 400, Sort.by("date")));
} else {
slice = repo.findAllSlice(slice.nextPageable());
}
if (!slice.hasNext()) {
break;
}
slice.getContent().forEach(v -> v.setApp("xx"));
repo.saveAll(slice.getContent());
LOGGER.info("processed: " + page);
page++;
}
LOGGER.info("end");

icCube ETL - Java View - group by on more than 1 column + retrieve max and min value

In the icCube Builder ETL, I want to group the data on more than one field. Also, as aggregation functions, I would like to use MAX and MIN.
Example data:
groupId  phase   startDate    endDate
100      start   1-May-2018   5-May-2018
100      start   4-May-2018   7-May-2018
100      start   28-Apr-2018  1-May-2018
100      middle  4-May-2018   11-May-2018
100      middle  1-May-2018   10-May-2018
100      end     12-May-2018  15-May-2018
100      end     11-May-2018  13-May-2018
100      end     13-May-2018  14-May-2018
100      end     9-May-2018   12-May-2018
200      start   4-Apr-2018   2-May-2018
200      middle  18-Apr-2018  3-May-2018
200      middle  1-May-2018   1-May-2018
300      end     21-Apr-2018  24-Apr-2018
I would like to group this data on groupId and phase and get the minimum startDate and the maximum endDate:
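Expected result (derived from the example data above):

groupId  phase   startDate    endDate
100      start   28-Apr-2018  7-May-2018
100      middle  1-May-2018   11-May-2018
100      end     9-May-2018   15-May-2018
200      start   4-Apr-2018   2-May-2018
200      middle  18-Apr-2018  3-May-2018
300      end     21-Apr-2018  24-Apr-2018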
What is the best way to do that in the icCube ETL?
We're adding a new version of the groupBy view in the ETL layer to support this. In the meantime, you can create a Java view to perform the groupBy. Something like:
package iccube.pub;

import java.util.*;
import org.joda.time.*;

import crazydev.iccube.pub.view.*;

public class CustomJavaView implements IOlapBuilderViewLogic
{
    private Map<List<Comparable>, List<Agg>> cached;

    public CustomJavaView()
    {
    }

    public void onInitMainTable(Map<String, IOlapCachedTable> cachedTables, IOlapDataTableDef mainTable)
    {
        cached = new HashMap<>();
    }

    public boolean onNewRow(IOlapViewContext context, Map<String, IOlapCachedTable> cachedTables, IOlapDataTableDef mainTable, IOlapReadOnlyDataRow mainTableRow)
    {
        // create the group-by key (list of values)
        final List<Comparable> groupBy = Arrays.asList(mainTableRow.get("phase"), mainTableRow.get("groupId"));

        // get the aggregators for the key, building them if not already there
        // (index 0 aggregates the MIN of startDate, index 1 the MAX of endDate)
        final List<Agg> aggs = cached.computeIfAbsent(groupBy, key -> Arrays.asList(new Agg(true), new Agg(false)));

        // add values
        aggs.get(0).add(mainTableRow.getAsDateTime("startDate"));
        aggs.get(1).add(mainTableRow.getAsDateTime("endDate"));

        return true; // false to stop
    }

    public void onProcessingCompleted(IOlapViewContext context, Map<String, IOlapCachedTable> cachedTables)
    {
        // now we can fire one row per group
        for (Map.Entry<List<Comparable>, List<Agg>> entry : cached.entrySet())
        {
            final List<Comparable> groupByKey = entry.getKey();
            final List<Agg> aggs = entry.getValue();

            // create an empty row and fill it
            final IOlapDataTableRow row = context.newRow();
            row.set("phase", groupByKey.get(0));
            row.set("groupId", groupByKey.get(1));
            row.set("startDate", aggs.get(0).date);
            row.set("endDate", aggs.get(1).date);

            context.fireRow(row);
        }
    }

    // this is the aggregator; you could implement something more complicated
    static class Agg
    {
        final int isMin;
        LocalDateTime date;

        Agg(boolean isMin)
        {
            this.isMin = isMin ? -1 : 1;
        }

        void add(LocalDateTime ndate)
        {
            if (ndate != null)
            {
                // keeps the earlier date when isMin == -1, the later one when isMin == 1
                date = (date != null && ((date.compareTo(ndate) * isMin) > 0)) ? date : ndate;
            }
        }
    }
}

Spring Data Neo4j Ridiculously Slow Over Rest

public List<Errand> interestFeed(Person person, int skip, int limit)
        throws ControllerException {
    person = validatePerson(person);

    String query = String.format(
            "START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %s",
            person.getLongitude(), person.getLatitude(), person.getWidth(), skip, limit);

    String queryFast = String.format(
            "START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %s",
            person.getLongitude(), person.getLatitude(), person.getWidth(), skip, limit);

    Set<Errand> errands = new TreeSet<Errand>();
    System.out.println(queryFast);

    Result<Map<String, Object>> results = template.query(queryFast, null);
    Iterator<Errand> objects = results.to(Errand.class).iterator();
    return copyIterator(objects);
}

public List<Errand> copyIterator(Iterator<Errand> iter) {
    Long start = System.currentTimeMillis();
    Double startD = start.doubleValue();

    List<Errand> copy = new ArrayList<Errand>();
    while (iter.hasNext()) {
        Errand e = iter.next();
        copy.add(e);
        System.out.println(e.getType());
    }

    Long end = System.currentTimeMillis();
    Double endD = end.doubleValue();
    p((endD - startD) / 1000); // p is the author's print helper
    return copy;
}
When I profile the copyIterator function, it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j REST to connect to a Neo4j server running on my local machine. I even added a print statement to see how fast the iterator is converted to a list, and it does appear slow. Does each iterator.next() make a new HTTP call?
If Errand is a node entity, then yes: spring-data-neo4j will make an HTTP call for each entity to fetch all its labels (this is a shortcoming of Neo4j, which doesn't return labels when you return a whole node from Cypher).
You can enable debug-level logging on org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
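For example, assuming log4j is the logging backend in use, that is a one-liner in log4j.properties:

log4j.logger.org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine=DEBUG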
To avoid this per-entity call, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult
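A minimal sketch of such a projection, assuming Spring Data Neo4j 3.x (the interface name and columns are illustrative, not from the original post):

// hypothetical projection; column names must match the AS aliases in the query
@QueryResult
public interface ErrandView {
    @ResultColumn("type")
    String getType();

    @ResultColumn("added")
    Long getAdded();
}

The Cypher query then returns the needed properties explicitly (e.g. RETURN n.type AS type, n.added AS added) instead of whole nodes, so no per-entity label fetch is involved.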

HBase InternalScanner and filter in coprocessor

All:
Recently I wrote a coprocessor in HBase (0.94.17): a class that extends BaseEndpointCoprocessor with a rowCount method to count one table's rows. And I ran into a problem.
If I do not set a filter on the scan, my code works fine for two tables. One table has 1,000,000 rows, the other 160,000,000 rows; it took about 2 minutes to count the bigger table.
However, if I set a filter on the scan, it only works on the small table. On the bigger table it throws an exception:
org.apache.hadoop.hbase.ipc.ExecRPCInvoker$1#2c88652b, java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Trust me, I checked my code over and over again. So, to count my table with a filter, I had to write the following clumsy code: first, I do not set the filter on the scan; then, after getting each row record, I call a method of my own to filter it.
That works on both tables, but I do not know why. I tried to read the scanner source code in HRegion.java, but I did not get it. So if you know the answer, please help me. Thank you.
@Override
public long rowCount(Configuration conf) throws IOException {
    Scan scan = new Scan();
    parseConfiguration(conf);

    Filter filter = null;
    if (this.mFilterString != null && !mFilterString.equals("")) {
        ParseFilter parse = new ParseFilter();
        filter = parse.parseFilterString(mFilterString);
        // scan.setFilter(filter); // setting the filter here triggers the exception
    }
    scan.setCaching(this.mScanCaching);

    InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
    long sum = 0;
    try {
        List<KeyValue> curVals = new ArrayList<KeyValue>();
        boolean hasMore = false;
        do {
            curVals.clear();
            hasMore = scanner.next(curVals);
            if (filter != null) {
                filter.reset();
                if (HbaseUtil.filterOneResult(curVals, filter)) {
                    continue;
                }
            }
            sum++;
        } while (hasMore);
    } finally {
        scanner.close();
    }
    return sum;
}
The following is my HBase util code:
public static boolean filterOneResult(List<KeyValue> kvList, Filter filter) {
    if (kvList.size() == 0)
        return true;

    KeyValue kv = kvList.get(0);
    if (filter.filterRowKey(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength())) {
        return true;
    }

    for (KeyValue kv2 : kvList) {
        if (filter.filterKeyValue(kv2) == Filter.ReturnCode.NEXT_ROW) {
            return true;
        }
    }

    filter.filterRow(kvList);
    return filter.filterRow();
}
OK, it was my mistake. After using jdb to debug my code, I got the following exception:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)

Obviously, my result list was empty after

hasMore = scanner.next(curVals);

This means that when a Filter is set on the scan, the curVals list can be empty even though hasMore is true. I had assumed that if a record was filtered out, the scanner would jump to the next row and the list would never be empty; I was wrong.
And my client did not print the remote error message on my console: it just caught this remote exception and retried. After 10 retries, it printed another, meaningless exception.
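So the fix is simply to set the filter on the scan and guard against empty batches; a minimal sketch of the corrected counting loop, reusing the variables from the code above:

scan.setFilter(filter); // let the region server apply the filter
InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
long sum = 0;
try {
    List<KeyValue> curVals = new ArrayList<KeyValue>();
    boolean hasMore = false;
    do {
        curVals.clear();
        hasMore = scanner.next(curVals);
        // with a filter set, a batch can be empty even though hasMore is true
        if (!curVals.isEmpty()) {
            sum++;
        }
    } while (hasMore);
} finally {
    scanner.close();
}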

Replacing a foreach with LINQ

I have some very simple code that I'm trying to get running marginally quicker (there are a lot of these small calls dotted around the code, which seems to be slowing things down) by using LINQ instead of a plain loop.
The problem is this: I have a variable outside of the LINQ query to which the result of the query needs to be added.
The original code looks like this:
double total = 0;
foreach (Crop c in p.Crops)
{
    if (c.CropType.Type == t.Type)
        total += c.Area;
}
return total;
This method isn't slow until the loop gets large; then it slows down on the phone. Can this sort of code be moved to a relatively quick and simple piece of LINQ?
Looks like you could use Sum() (edit: my original syntax was wrong):
total = (from c in p.Crops
         where c.CropType.Type == t.Type
         select c.Area).Sum();

Or in extension-method format:

total = p.Crops.Where(c => c.CropType.Type == t.Type).Sum(c => c.Area);
As for people saying LINQ won't perform better: where is your evidence? I ran the following in LINQPad, based, I believe, on a post by Hanselman (you will need to download and reference NBuilder to get it to run):
void Main()
{
    // NBuilder is used to create a chunk of sample data
    // http://nbuilder.org
    var crops = Builder<Crop>.CreateListOfSize(1000000).Build();

    var t = new Crop();
    t.Type = Type.grain;

    double total = 0;
    var sw = new Stopwatch();
    sw.Start();
    foreach (Crop c in crops)
    {
        if (c.Type == t.Type)
            total += c.area;
    }
    sw.Stop();
    total.Dump("For Loop total:");
    sw.ElapsedMilliseconds.Dump("For Loop Elapsed Time:");

    sw.Restart();
    var result = crops.Where(c => c.Type == t.Type).Sum(c => c.area);
    sw.Stop();
    result.Dump("LINQ total:");
    sw.ElapsedMilliseconds.Dump("LINQ Elapsed Time:");

    sw.Restart();
    var result2 = (from c in crops
                   where c.Type == t.Type
                   select c.area).Sum();
    sw.Stop();
    result2.Dump("LINQ (sugar syntax) total:");
    sw.ElapsedMilliseconds.Dump("LINQ (sugar syntax) Elapsed Time:");
}

public enum Type
{
    wheat,
    grain,
    corn,
    maize,
    cotton
}

public class Crop
{
    public string Name { get; set; }
    public Type Type { get; set; }
    public double area;
}
The results come out favorably for LINQ:
For Loop total: 99999900000
For Loop Elapsed Time: 25
LINQ total: 99999900000
LINQ Elapsed Time: 17
LINQ (sugar syntax) total: 99999900000
LINQ (sugar syntax) Elapsed Time: 17
The main way to optimize this would be changing p, which may or may not be possible.
Assuming p is a P, and looks something like this:
internal sealed class P
{
    private readonly List<Crop> mCrops = new List<Crop>();

    public IEnumerable<Crop> Crops { get { return mCrops; } }

    public void Add(Crop pCrop)
    {
        mCrops.Add(pCrop);
    }
}
(If p is a .NET type like a List<Crop>, you can create a wrapper class like this.)
You can optimize your loop by maintaining a dictionary:
internal sealed class P
{
    private readonly List<Crop> mCrops = new List<Crop>();
    private readonly Dictionary<Type, List<Crop>> mCropsByType
        = new Dictionary<Type, List<Crop>>();

    public IEnumerable<Crop> Crops { get { return mCrops; } }

    public void Add(Crop pCrop)
    {
        if (!mCropsByType.ContainsKey(pCrop.CropType.Type))
            mCropsByType.Add(pCrop.CropType.Type, new List<Crop>());
        mCropsByType[pCrop.CropType.Type].Add(pCrop);
        mCrops.Add(pCrop);
    }

    public IEnumerable<Crop> GetCropsByType(Type pType)
    {
        return mCropsByType.ContainsKey(pType)
            ? mCropsByType[pType]
            : Enumerable.Empty<Crop>();
    }
}
Your code then becomes something like:
double total = 0;
foreach (Crop crop in p.GetCropsByType(t.Type))
    total += crop.Area;
return total;
Another possibility that would be even faster is:
internal sealed class P
{
    private readonly List<Crop> mCrops = new List<Crop>();
    private double mTotalArea;

    public IEnumerable<Crop> Crops { get { return mCrops; } }
    public double TotalArea { get { return mTotalArea; } }

    public void Add(Crop pCrop)
    {
        mCrops.Add(pCrop);
        mTotalArea += pCrop.Area;
    }
}
Your code would then simply access the TotalArea property and you wouldn't even need a loop:
return p.TotalArea;
You might also consider extracting the code that manages the Crops data to a separate class, depending on what P is.
This is a pretty straightforward sum, so I doubt you will see any benefit from using LINQ.
You haven't told us much about the setup here, but here's an idea: if p.Crops is large and only a small number of the items in the sequence are of the desired type, you could build another sequence that contains just the items you need.
I assume that you know the type when you insert into p.Crops. If that's the case, you can insert the relevant items into another collection and use that instead for the sum loop. That reduces N and gets rid of the comparison, though it is still O(N).
