Apache Ignite indexing performance - clustered-index

I have a cache with a String key and TileKey (class below) as the value. I've noticed that when I execute a query (below), performance degrades almost linearly with the cache size, even though all the fields used in the query are indexed.
Here is a representative benchmark. I used the same query (below) with the same parameters in all runs, and it returns the same 30 entries every time:
Query on 5350 entries cache took 6-7ms
Query on 10700 entries cache took 8-10ms
Query on 48150 entries cache took 30-42ms
Query on 96300 entries cache took 50-70ms
I've run the benchmark on a single 8 GB node and on two 4 GB nodes; the results were pretty much the same in terms of query speed relative to cache size.
I've also tried QuerySqlField.Group, using the "time" field as the first group field; it should reduce the result set to only 1000 entries in all benchmarks. I'm not sure this is the right usage of QuerySqlField.Group, as from my understanding it is mainly meant for join queries between caches.
Am I doing something wrong, or is this the expected query performance with Ignite indexing?
Code:
String strQuery = "time = ? and zoom = ? and x >= ? and x <= ? and y >= ? and y <= ?";
SqlQuery<String, TileKey> query= new SqlQuery<String, TileKey>(TileKey.class, strQuery);
query.setArgs(time, zoom, xMin,xMax,yMin, yMax);
QueryCursor<Entry<String, TileKey>> tileKeyCursor = tileKeyCache.query(query);
Map<String, TileKey> tileKeyMap = new HashMap<String, TileKey>();
for (Entry<String, TileKey> p : tileKeyCursor) {
tileKeyMap.put(p.getKey(), p.getValue());
}
Cache config:
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="KeysCache" />
<property name="cacheMode" value="PARTITIONED" />
<property name="atomicityMode" value="ATOMIC" />
<property name="backups" value="0" />
<property name="queryIndexEnabled" value="true"/>
<property name="indexedTypes">
<list>
<value>java.lang.String</value>
<value>org.ess.map.TileKey</value>
</list>
</property>
</bean>
Class:
@QueryGroupIndex.List(@QueryGroupIndex(name = "idx1"))
public class TileKey implements Serializable {
private static final long serialVersionUID = 1L;
private String id;
@QuerySqlField(index = true)
@QuerySqlField.Group(name = "idx1", order = 0)
private int time;
@QuerySqlField(index = true)
@QuerySqlField.Group(name = "idx1", order = 1)
private int zoom;
@QuerySqlField(index = true)
@QuerySqlField.Group(name = "idx1", order = 2)
private int x;
@QuerySqlField(index = true)
@QuerySqlField.Group(name = "idx1", order = 3)
private int y;
@QuerySqlField(index = true)
private boolean inCache;
}

I have found the problem; thank you bobby_brew for pointing me in the right direction.
The indexing example in the Ignite documentation is incorrect; there is an open issue about it.
I've changed the indexed field annotations from
@QuerySqlField(index = true)
@QuerySqlField.Group(name = "idx1", order = x)
to
@QuerySqlField(index = true, orderedGroups = {@QuerySqlField.Group(name = "idx1", order = x)})
and now the query duration is a solid 2 ms in all scenarios.
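For reference, this is roughly how the corrected annotations look on the TileKey fields (a sketch based on the change above; only the group-index declaration moves into orderedGroups, the class-level @QueryGroupIndex.List stays as it was):
@QuerySqlField(index = true, orderedGroups = {@QuerySqlField.Group(name = "idx1", order = 0)})
private int time;
@QuerySqlField(index = true, orderedGroups = {@QuerySqlField.Group(name = "idx1", order = 1)})
private int zoom;
@QuerySqlField(index = true, orderedGroups = {@QuerySqlField.Group(name = "idx1", order = 2)})
private int x;
@QuerySqlField(index = true, orderedGroups = {@QuerySqlField.Group(name = "idx1", order = 3)})
private int y;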

Related

Save millions of rows from CSV to Oracle DB Using Spring boot JPA

On a regular basis, another application dumps a CSV that contains more than 7-8 million rows. I have a cron job that loads the data from the CSV and saves it into my Oracle DB. Here's my code snippet:
String line = "";
int count = 0;
LocalDate localDateTime;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
List<ItemizedBill> itemizedBills = new ArrayList<>();
try {
BufferedReader br=new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
while((line=br.readLine())!=null) {
if (count >= 1) {
String [] data= line.split("\\|");
ItemizedBill customer = new ItemizedBill();
customer.setEventType(data[0]);
String date = data[1].substring(0,2);
String month = data[1].substring(3,6);
String year = data[1].substring(7,9);
month = WordUtils.capitalizeFully(month);
String modifiedDate = date + "-" + month + "-" + year;
localDateTime = LocalDate.parse(modifiedDate, formatter);
customer.setEventDate(localDateTime.atStartOfDay(ZoneId.systemDefault()).toInstant());
customer.setaPartyNumber(data[2]);
customer.setbPartyNumber(data[3]);
customer.setVolume(Long.valueOf(data[4]));
customer.setMode(data[5]);
if(data[6].contains("0")) { customer.setFnfNum("Other"); }
else{ customer.setFnfNum("FNF Number"); }
itemizedBills.add(customer);
}
count++;
}
itemizedBillRepository.saveAll(itemizedBills);
} catch (IOException e) {
e.printStackTrace();
}
}
This works, but it takes a lot of time to process. How can I make it efficient and speed this process up?
There are a couple of things you should do to your code.
String.split, while convenient, is relatively slow because it recompiles the regular expression on every call. It is better to precompile a Pattern once and call its split method to reduce that overhead.
Use proper JPA batching strategies as explained in this blog.
First enable batch processing in your Spring application.properties. We will use a batch size of 50 (you will need to experiment to find the right batch size for your case).
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
Then save the entities directly to the database and, every 50 items, do a flush and clear. This flushes the pending state to the database and clears the first-level cache (which prevents excessive dirty checking).
With all the above your code should look something like this.
String line;
int count = 0;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
Pattern splitter = Pattern.compile("\\|");
try {
BufferedReader br=new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
while((line=br.readLine())!=null) {
if (count >= 1) {
String[] data = splitter.split(line);
ItemizedBill customer = new ItemizedBill();
customer.setEventType(data[0]);
String date = data[1].substring(0,2);
String month = data[1].substring(3,6);
String year = data[1].substring(7,9);
month = WordUtils.capitalizeFully(month);
String modifiedDate = date + "-" + month + "-" + year;
LocalDate localDate = LocalDate.parse(modifiedDate, formatter);
customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
customer.setaPartyNumber(data[2]);
customer.setbPartyNumber(data[3]);
customer.setVolume(Long.valueOf(data[4]));
customer.setMode(data[5]);
if(data[6].contains("0")) {
customer.setFnfNum("Other");
} else {
customer.setFnfNum("FNF Number");
}
itemizedBillRepository.save(customer);
}
count++;
if ( (count % 50) == 0) {
this.entityManager.flush(); // sync with database
this.entityManager.clear(); // clear 1st level cache
}
}
} catch (IOException e) {
e.printStackTrace();
}
Two other optimizations you could do:
If your volume property is a long rather than a Long, use Long.parseLong(data[4]) instead. It avoids the Long creation and unboxing. With just 10 rows this might not be an issue, but over millions of rows those milliseconds add up.
Use ddMMMyy as the DateTimeFormatter pattern and remove the substring calls from your code. Just do LocalDate.parse(data[1].toUpperCase(), formatter) to achieve the same result without the overhead of five extra String objects.
String line;
int count = 0;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("ddMMMyy");
Pattern splitter = Pattern.compile("\\|");
try {
BufferedReader br=new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
while((line=br.readLine())!=null) {
if (count >= 1) {
String[] data = splitter.split(line);
ItemizedBill customer = new ItemizedBill();
customer.setEventType(data[0]);
LocalDate localDate = LocalDate.parse(data[1].toUpperCase(), formatter);
customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
customer.setaPartyNumber(data[2]);
customer.setbPartyNumber(data[3]);
customer.setVolume(Long.parseLong(data[4]));
customer.setMode(data[5]);
if(data[6].contains("0")) {
customer.setFnfNum("Other");
} else {
customer.setFnfNum("FNF Number");
}
itemizedBillRepository.save(customer);
}
count++;
if ( (count % 50) == 0) {
this.entityManager.flush(); // sync with database
this.entityManager.clear(); // clear 1st level cache
}
}
} catch (IOException e) {
e.printStackTrace();
}
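The snippets above reference this.entityManager and itemizedBillRepository without showing how they are wired. A minimal sketch of that wiring, assuming a Spring-managed service (the class name, method name and path parameter are illustrative; the @Transactional boundary is needed so the flush/clear pattern and JDBC batching actually apply):
@Service
public class ItemizedBillImportService {

    @PersistenceContext
    private EntityManager entityManager; // used for flush() and clear()

    @Autowired
    private ItemizedBillRepository itemizedBillRepository; // Spring Data JPA repository

    @Transactional // keep the whole import in one transaction
    public void importCsv(String path) {
        // ... the read/parse/save loop shown above goes here ...
    }
}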
You can use Spring Data JPA batch inserts. This link explains how to do it: https://www.baeldung.com/spring-data-jpa-batch-inserts
You can try streaming MySQL results using Java 8 Streams and Spring Data JPA. The link below explains it in detail:
http://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
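For reference, a rough sketch of what streaming results looks like with Spring Data JPA (the repository and entity names match the code above; the Long id type, the JPQL query and the fetch-size hint value are assumptions):
public interface ItemizedBillRepository extends JpaRepository<ItemizedBill, Long> {

    // Stream rows instead of materialising the whole result list;
    // must be consumed inside a transaction and closed afterwards.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "1000"))
    @Query("select i from ItemizedBill i")
    Stream<ItemizedBill> streamAll();
}

// usage, e.g. inside a @Transactional(readOnly = true) service method:
try (Stream<ItemizedBill> bills = itemizedBillRepository.streamAll()) {
    bills.forEach(bill -> {
        // process each row without holding all of them in memory
    });
}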

Hextoraw() not working with IN clause while using NamedParameterJdbcTemplate

I am trying to update certain rows in my Oracle DB using an id column of type RAW(255).
Sample ids: 0BF3957A016E4EBCB68809E6C2EA8B80, 1199B9F29F0A46F486C052669854C2F8...
@Autowired
private NamedParameterJdbcTemplate jdbcTempalte;
private static final String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (:ids)";
public void saveSubscriptionsStatus(List<String> ids, String status) {
MapSqlParameterSource paramSource = new MapSqlParameterSource();
List<String> idsHexToRaw = new ArrayList<>();
String temp = new String();
for (String id : ids) {
temp = "hextoraw('" + id + "')";
idsHexToRaw.add(temp);
}
paramSource.addValue("ids", idsHexToRaw);
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
The block of code above executes without any error, but the updates are not reflected in the DB. If I skip hextoraw() and just pass the list of ids, it works fine and updates the data in the table; see the code below.
public void saveSubscriptionsStatus(List<String> ids, String status) {
MapSqlParameterSource paramSource = new MapSqlParameterSource();
paramSource.addValue("ids", ids);
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
This code works fine and updates the table, but since I am not using hextoraw() it does a full table scan for the update, which I don't want since I have created indexes. Using hextoraw() should make the update use the index, but then it does not update the values, which is kind of weird.
I got a solution myself by trying different combinations:
@Autowired
private NamedParameterJdbcTemplate jdbcTempalte;
public void saveSubscriptionsStatus(List<String> ids, String status) {
String UPDATE_SUB_STATUS = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (";
MapSqlParameterSource paramSource = new MapSqlParameterSource();
String subQuery = "";
for (int i = 0; i < ids.size(); i++) {
String temp = "id" + i;
paramSource.addValue(temp, ids.get(i));
subQuery = subQuery + "hextoraw(:" + temp + "), ";
}
subQuery = subQuery.substring(0, subQuery.length() - 2);
UPDATE_SUB_STATUS = UPDATE_SUB_STATUS + subQuery + ")";
paramSource.addValue("status", status);
jdbcTempalte.update(UPDATE_SUB_STATUS, paramSource);
}
What this does is build a query that wraps each id in hextoraw() as named parameters id0, id1, id2, ..., adds those values to the MapSqlParameterSource instance, and then the update works fine and also uses the index on the table.
After running the new method the query looks like:
update SUBSCRIPTIONS set status = :status, modified_date = systimestamp where id in (hextoraw(:id0), hextoraw(:id1), hextoraw(:id2)...)
and the MapSqlParameterSource instance looks like:
{("id0", "randomUUID"), ("id1", "randomUUID"), ("id2", "randomUUID").....}
Instead of doing string manipulation, convert the list to a List of byte arrays:
// Note: GuidHelper.asBytes expects a java.util.UUID, so each hex id string has to be
// converted to a UUID first (UUID.fromString expects the dashed form).
List<byte[]> productGuidByteList = stringList.stream()
        .map(item -> GuidHelper.asBytes(UUID.fromString(item)))
        .collect(Collectors.toList());
parameters.addValue("productGuidSearch", productGuidByteList);

public static byte[] asBytes(UUID uuid) {
    ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
    bb.putLong(uuid.getMostSignificantBits());
    bb.putLong(uuid.getLeastSignificantBits());
    return bb.array();
}
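With the ids bound as byte arrays, the original named-parameter IN clause can stay as it is. A sketch (assuming the same SUBSCRIPTIONS table and that the Oracle driver maps byte[] to the RAW column):
String sql = "update SUBSCRIPTIONS set status = :status, modified_date = systimestamp "
        + "where id in (:productGuidSearch)";
parameters.addValue("status", status);
jdbcTempalte.update(sql, parameters);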

Using Bulk Insert dramatically slows down processing?

I'm fairly new to Oracle, but I have used bulk inserts in a couple of other applications. Most run faster with them, but I've had a couple where they slowed the application down. This is the second one where it slowed things down significantly, so I'm wondering if I have something set up incorrectly or need to set it up differently. In this case I have a console application that processes ~1,900 records. Inserting them individually takes ~2.5 hours, and when I switched over to the bulk insert it jumped to 5 hours.
The article I based this off of is http://www.oracle.com/technetwork/issue-archive/2009/09-sep/o59odpnet-085168.html
Here is what I'm doing: I retrieve some records from the DB, do calculations, and then write the results out to a text file. After the calculations are done I have to write those results back to a different table in the DB, so we can look back at those calculations later if needed.
When I make a calculation I add the result to a List. Once I'm done writing out the file I look at that List, and if there are any records I do the bulk insert.
With the bulk insert I have a setting in the App.config for the number of records I want to insert per batch. In this case I'm using 250 records; I assumed it would be better to limit my in-memory arrays to, say, 250 records versus the full 1,900. I loop through that list up to the count from the App.config and create an array for each column. Those arrays are then passed as parameters to Oracle.
App.config
<add key="UpdateBatchCount" value="250" />
Class
class EligibleHours
{
public string EmployeeID { get; set; }
public decimal Hours { get; set; }
public string HoursSource { get; set; }
}
Data Manager
public static void SaveEligibleHours(List<EligibleHours> listHours)
{
//set the number of records to update batch on from config file Subtract one because of 0 based index
int batchCount = int.Parse(ConfigurationManager.AppSettings["UpdateBatchCount"]);
//create the arrays to add values to
string[] arrEmployeeId = new string[batchCount];
decimal[] arrHours = new decimal[batchCount];
string[] arrHoursSource = new string[batchCount];
int i = 0;
foreach (var item in listHours)
{
//Create an array of employee numbers that will be used for a batch update.
//update after every X amount of records, update. Add 1 to i to compensated for 0 based indexing.
if (i + 1 <= batchCount)
{
arrEmployeeId[i] = item.EmployeeID;
arrHours[i] = item.Hours;
arrHoursSource[i] = item.HoursSource;
i++;
}
else
{
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
//reset counter and array
i = 0;
arrEmployeeId = new string[batchCount];
arrHours = new decimal[batchCount];
arrHoursSource = new string[batchCount];
}
}
//process last array
if (arrEmployeeId.Length > 0)
{
UpdateDbWithEligibleHours(arrEmployeeId, arrHours, arrHoursSource);
}
}
private static void UpdateDbWithEligibleHours(string[] arrEmployeeId, decimal[] arrHours, string[] arrHoursSource)
{
StringBuilder sbQuery = new StringBuilder();
sbQuery.Append("insert into ELIGIBLE_HOURS ");
sbQuery.Append("(EMP_ID, HOURS_SOURCE, TOT_ELIG_HRS, REPORT_DATE) ");
sbQuery.Append("values ");
sbQuery.Append("(:1, :2, :3, SYSDATE) ");
string connectionString = ConfigurationManager.ConnectionStrings["Server_Connection"].ToString();
using (OracleConnection dbConn = new OracleConnection(connectionString))
{
dbConn.Open();
//create Oracle parameters and pass arrays of data
OracleParameter p_employee_id = new OracleParameter();
p_employee_id.OracleDbType = OracleDbType.Char;
p_employee_id.Value = arrEmployeeId;
OracleParameter p_hoursSource = new OracleParameter();
p_hoursSource.OracleDbType = OracleDbType.Char;
p_hoursSource.Value = arrHoursSource;
OracleParameter p_hours = new OracleParameter();
p_hours.OracleDbType = OracleDbType.Decimal;
p_hours.Value = arrHours;
OracleCommand objCmd = dbConn.CreateCommand();
objCmd.CommandText = sbQuery.ToString();
objCmd.ArrayBindCount = arrEmployeeId.Length;
objCmd.Parameters.Add(p_employee_id);
objCmd.Parameters.Add(p_hoursSource);
objCmd.Parameters.Add(p_hours);
objCmd.ExecuteNonQuery();
}
}

Spring Data Neo4j Ridiculously Slow Over Rest

public List<Errand> interestFeed(Person person, int skip, int limit)
throws ControllerException {
person = validatePerson(person);
String query = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
String queryFast = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
Set<Errand> errands = new TreeSet<Errand>();
System.out.println(queryFast);
Result<Map<String, Object>> results = template.query(queryFast, null);
Iterator<Errand> objects = results.to(Errand.class).iterator();
return copyIterator (objects);
}
public List<Errand> copyIterator(Iterator<Errand> iter) {
Long start = System.currentTimeMillis();
Double startD = start.doubleValue();
List<Errand> copy = new ArrayList<Errand>();
while (iter.hasNext()) {
Errand e = iter.next();
copy.add(e);
System.out.println(e.getType());
}
Long end = System.currentTimeMillis();
Double endD = end.doubleValue();
p((endD - startD) / 1000); // p(...) is presumably a local print helper
return copy;
}
When I profile the copyIterator function it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j REST to connect to a Neo4j server running on my local machine. I even added a print statement to see how fast the iterator is converted to a list, and it does appear slow. Does each iterator.next() make a new HTTP call?
If Errand is a node entity then yes, Spring Data Neo4j will make an HTTP call for each entity to fetch all of its labels (this is Neo4j's fault, as it does not return labels when you return a whole node from Cypher).
You can enable debug-level logging on org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
To avoid this call, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult
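A rough sketch of the @QueryResult approach, returning only the scalar columns the feed needs so no per-node lookups are required (annotation names are from Spring Data Neo4j 3.x; the projected fields and the adjusted RETURN clause are illustrative and not verified against a specific SDN version):
@QueryResult
public interface ErrandFeedEntry {
    @ResultColumn("type") String getType();
    @ResultColumn("added") Long getAdded();
}

// Return scalar columns instead of the whole node, then map onto the projection:
String queryFast = "START n=node:ErrandLocation('withinDistance:[...]') "
        + "RETURN n.type as type, n.added as added SKIP 0 LIMIT 10";
Iterator<ErrandFeedEntry> entries =
        template.query(queryFast, null).to(ErrandFeedEntry.class).iterator();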

Setting defaultRowPrefetch has no effect on query

I've got a weird issue with using Spring JDBC + Oracle 10g. Here's my dataSource config:
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
destroy-method="close">
<property name="driverClassName" value="oracle.jdbc.OracleDriver" />
<property name="url" value="jdbc:oracle:thin:#localhost:1521:XE" />
<property name="username" value="admin" />
<property name="password" value="admin" />
<property name="validationQuery" value="SELECT 1 FROM DUAL"/>
<property name="testOnBorrow" value="true"/>
<property name="connectionProperties" value="defaultRowPrefetch=1000" />
</bean>
At first I thought the connectionProperties value was being applied, but while I fine-tuned the query in SQL Developer (the cost went from 3670 to 285 and the explain plan time went from :45 to :03), the time in the application never moved from the original 15 seconds. Removing the connectionProperties setting had no effect either. So what I did was this:
DAO class
private List<Activity> getAllActivitiesJustJDBC() {
String query = "select * " + "from activity a, work_order w "
+ "where a.ac_customer = 'CSC' "
+ "and w.wo_customer = a.ac_customer "
+ "and a.ac_workorder = w.wo_workorder ";
long startTime = System.currentTimeMillis();
List<Activity> activities = new ArrayList<Activity>();
try {
Connection conn = jdbcTemplate.getDataSource().getConnection();
PreparedStatement st = conn.prepareStatement(query);
st.setFetchSize(1000);
ResultSet rs = st.executeQuery();
ActivityMapper mapper = new ActivityMapper();
while (rs.next()) {
Activity activity = mapper.mapRow(rs, 1);
activities.add(activity);
}
} catch (Exception ex) {
ex.printStackTrace();
}
System.out.println("Time it took...."
+ (System.currentTimeMillis() - startTime));
System.out.println("Number of activities = " + activities.size());
return activities;
}
This time, fetching the 11,115 rows took about 2 seconds on average. The key statement is setFetchSize(1000). So... I like option #2, but do I need to close the connection, or is Spring handling that for me? In option #1 I would use the jdbcTemplate to call the query method, passing in the parameterized query and a BeanPropertyRowMapper instance for my data object, and then return the List.
OK, from looking at other questions similar to this one, I do need to close the connection, starting with the result set, then the statement, and then the connection itself, in a finally block. I also remembered (from the days before Spring) that I need to wrap all of that in a try/catch and not do much of anything if an exception occurs while closing a connection.
FYI, I still wonder if there's a way to set the fetch size when defining the data source in Spring.
Here's the final method:
private List<Activity> getAllActivitiesJustJDBC() {
String query = "select * " + "from activity a, work_order w "
+ "where a.ac_customer = 'CSC' "
+ "and w.wo_customer = a.ac_customer "
+ "and a.ac_workorder = w.wo_workorder ";
long startTime = System.currentTimeMillis();
List<Activity> activities = new ArrayList<Activity>();
Connection conn = null;
PreparedStatement st = null;
ResultSet rs = null;
try {
conn = jdbcTemplate.getDataSource().getConnection();
st = conn.prepareStatement(query);
st.setFetchSize(1000);
rs = st.executeQuery();
while (rs.next()) {
activities.add(ActivityMapper.mapRow(rs));
}
} catch (Exception ex) {
ex.printStackTrace();
}
finally {
try {
rs.close();
st.close();
conn.close();
}
catch (Exception ex){
//Not much we can do here
}
}
System.out.println("Time it took...."
+ (System.currentTimeMillis() - startTime));
System.out.println("Number of activities = " + activities.size());
return activities;
}
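Regarding the FYI above about configuring this in Spring: JdbcTemplate itself has a fetchSize property, so the fetch size can be set once instead of per statement. A sketch (assuming ActivityMapper implements, or is adapted to, Spring's RowMapper; with this approach Spring also opens and closes the connection for you):
private List<Activity> getAllActivitiesViaTemplate() {
    String query = "select * " + "from activity a, work_order w "
            + "where a.ac_customer = 'CSC' "
            + "and w.wo_customer = a.ac_customer "
            + "and a.ac_workorder = w.wo_workorder ";
    jdbcTemplate.setFetchSize(1000); // applies to all statements this template creates
    return jdbcTemplate.query(query, new ActivityMapper());
}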
