I have written a method insert() in which I am trying to use JDBC Batch for inserting half a million records into a MySQL database:
public void insert(int nameListId, String[] names) {
    String sql = "INSERT INTO name_list_subscribers (name_list_id, name, date_added)" +
                 " VALUES (?, ?, NOW())";
    Connection conn = null;
    PreparedStatement ps = null;
    try {
        conn = getConnection();
        ps = conn.prepareStatement(sql);
        for (String s : names) {
            ps.setInt(1, nameListId);
            ps.setString(2, s);
            ps.addBatch();
        }
        ps.executeBatch();
    } catch (SQLException e) {
        throw new RuntimeException(e);
    } finally {
        closeDbResources(ps, null, conn);
    }
}
But whenever I try to run this method, I get the following error:
java.lang.OutOfMemoryError: Java heap space
com.mysql.jdbc.ServerPreparedStatement$BatchedBindValues.<init>(ServerPreparedStatement.java:72)
com.mysql.jdbc.ServerPreparedStatement.addBatch(ServerPreparedStatement.java:330)
org.apache.commons.dbcp.DelegatingPreparedStatement.addBatch(DelegatingPreparedStatement.java:171)
If I replace ps.addBatch() with ps.executeUpdate() and remove ps.executeBatch(), it works fine, though it takes some time. Please let me know whether using a batch is appropriate in this situation, and if it is, why it throws an OutOfMemoryError.
Thanks
addBatch and executeBatch give you the mechanism to perform batch inserts, but you still need to do the batching algorithm yourself.
If you simply pile every statement into the same batch, as you are doing, then you'll run out of memory. You need to execute/clear the batch every n records. The value of n is up to you; JDBC can't make that decision for you. The larger the batch size, the faster things will go, but make it too large and you'll get memory starvation and things will slow down or fail. How large is too large depends on how much memory you have.
Start off with a batch size of 1000, for example, and experiment with different values from there.
final int batchSize = 1000;
int count = 0;
for (String s : names) {
    ps.setInt(1, nameListId);
    ps.setString(2, s);
    ps.addBatch();
    if (++count % batchSize == 0) {
        ps.executeBatch();
        ps.clearBatch(); // not sure if this is necessary
    }
}
ps.executeBatch(); // flush the last few records
It runs out of memory because it holds the whole batch in memory and only sends it over to the database when you call executeBatch.
If you don't need the insert to be atomic and would like better performance, you can keep a counter and call executeBatch every n records.
Related
This is related to a previous question I posted. I think that while it is related, it might be different enough to warrant its own question.
The code used is:
public static void main(String[] args) {
    ChronicleQueue QUEUE = SingleChronicleQueueBuilder.single("./chronicle/roll")
            .rollCycle(RollCycles.MINUTELY).build();
    ExcerptTailer TAILER = QUEUE.createTailer();
    ArrayList<Long> seqNums = new ArrayList<>();
    // this reads all roll cycles starting from the first and carries on to the next roll cycle
    // busy spinner that spins non-stop trying to read from the queue
    int currentCycle = TAILER.cycle();
    System.out.println(TAILER.cycle());
    while (true) {
        // if it moves over to a new cycle, start over the sequencing (fresh start for the next day)
        int cycleCheck = TAILER.cycle();
        long indexCheck = TAILER.index();
        System.out.println(cycleCheck);
        System.out.println("idx: " + indexCheck);
        if (currentCycle != cycleCheck) {
            LOGGER.warn("Changing to new roll cycle, from: " + currentCycle + " to: " + cycleCheck
                    + ". Clearing list of size " + seqNums.size());
            seqNums.clear(); // this may cause a memory issue, see: https://stackoverflow.com/a/6961397/16034206
            currentCycle = cycleCheck;
            TAILER.moveToCycle(currentCycle);
            cycleCheck = TAILER.cycle();
            indexCheck = TAILER.index();
            System.out.println("cycle: " + cycleCheck);
            System.out.println("idx: " + indexCheck);
        }
        // TODO: 2nd option, on starting the chronicle runner, always move to the end and wait for the next day's cycle to start
        if (!TAILER.readDocument(w -> w.read("packet").marshallable(m -> {
            long seqNum = m.read("seqNum").readLong();
            int size = seqNums.size();
            if (size > 0) {
                int idx;
                if ((idx = seqNums.indexOf(seqNum)) >= 0) {
                    LOGGER.warn("Duplicate seqNum: " + seqNum + " at idx: " + idx);
                } else {
                    long previous = seqNums.get(size - 1);
                    long gap = seqNum - previous;
                    if (Math.abs(gap) > 1L) {
                        LOGGER.error("sequence gap at seqNum: " + previous + " and " + seqNum + "! Gap of " + gap);
                    }
                }
            }
            seqNums.add(seqNum);
            System.out.println(m.read("moldUdpHeader").text());
        }))) {
            // nothing left to read: break out of the spinner
            // (a named tailer could be used to pick up from where it left off)
            TAILER.close();
            break;
        }
    }
}
At this point, I have 2 roll cycle files; one ends at a sequence number of 1001, and the next file starts with a seqNum of 0. Using the while loop, it reads both files, with an if statement to check whether the cycle has changed and to reset accordingly.
(Output screenshots omitted: one from the run as written, and one from the run with .moveToCycle() commented out.)
As the output shows, the first index of the next file is read as part of the previous file, but when I use TAILER.moveToCycle(currentCycle) it moves to the start of the next file again, though with a different index this time. If you comment that line out, it will not re-read the entry with seqNum of 0.
Alright, I tested the following and it works just fine. How it works: it reads a value first (I am assuming the internals only advance the index and cycle after reading an incoming value), and only then tests for a cycle change (the test moves from before the read to after the read). This is probably how one should iterate over multiple roll cycle files while keeping track of when the queue rolls over.
Also, note that the previous version printed the cycle and index before printing the object, while this one prints the object before the cycle and index, so you may misread the output and assume it doesn't work when you test the following code.
public static void main(String[] args) {
    ChronicleQueue QUEUE = SingleChronicleQueueBuilder.single("./chronicle/roll")
            .rollCycle(RollCycles.FIVE_MINUTELY).build();
    ExcerptTailer TAILER = QUEUE.createTailer();
    ArrayList<Long> seqNums = new ArrayList<>();
    // this reads all roll cycles starting from the first and carries on to the next roll cycle
    // busy spinner that spins non-stop trying to read from the queue
    int currentCycle = TAILER.cycle();
    System.out.println(TAILER.cycle());
    AtomicLong seqNum = new AtomicLong();
    while (true) {
        if (TAILER.readDocument(w -> w.read("packet").marshallable(m -> {
            long val = m.read("seqNum").readLong();
            seqNum.set(val);
            System.out.println(m.read("moldUdpHeader").text());
        }))) {
            // if it moves over to a new cycle, start over the sequencing (fresh start for the next day)
            int cycleCheck = TAILER.cycle();
            long indexCheck = TAILER.index();
            System.out.println("cycle: " + cycleCheck);
            System.out.println("idx: " + indexCheck);
            if (currentCycle != cycleCheck) {
                LOGGER.warn("Changing to new roll cycle, from: " + currentCycle + " to: " + cycleCheck
                        + ". Clearing list of size " + seqNums.size());
                seqNums.clear(); // this may cause a memory issue, see: https://stackoverflow.com/a/6961397/16034206
                currentCycle = cycleCheck;
            }
            int size = seqNums.size();
            long val = seqNum.get();
            if (size > 0) {
                int idx;
                if ((idx = seqNums.indexOf(val)) >= 0) { // look up the long value (autoboxed to Long)
                    LOGGER.warn("Duplicate seqNum: " + val + " at idx: " + idx);
                } else {
                    long previous = seqNums.get(size - 1);
                    long gap = val - previous;
                    if (Math.abs(gap) > 1L) {
                        LOGGER.error("sequence gap at seqNum: " + previous + " and " + val + "! Gap of " + gap);
                    }
                }
            }
            seqNums.add(val);
        } else {
            // nothing left to read: break out of the spinner
            // (a named tailer could be used to pick up from where it left off)
            TAILER.close();
            break;
        }
    }
}
I am trying to write a custom reader that serves the purpose of reading a record (residing on two lines) with a defined number of fields.
For example:
1,2,3,4 ("," may or may not be there)
,5,6,7,8
My requirement is to read the record and push it into the mapper as a single record like {1,2,3,4,5,6,7,8}. Please give some inputs.
UPDATE:
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (key == null) {
        key = new LongWritable();
    }
    // Current offset is the key
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    int newSize = 0;
    int numFields = 0;
    Text temp = new Text();
    boolean firstRead = true;
    while (numFields < reqFields) {
        while (pos < end) {
            // Read up to the '\n' character and store it in 'temp'
            newSize = in.readLine(temp, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            // If 0 bytes were read, then we are at the end of the split
            if (newSize == 0) {
                break;
            }
            // Otherwise update 'pos' with the number of bytes read
            pos += newSize;
            // If the line is not too long, check number of fields
            if (newSize < maxLineLength) {
                break;
            }
            // Line too long, try again
            LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        // Exit, since we're at the end of the split
        if (newSize == 0) {
            break;
        } else {
            String record = temp.toString();
            StringTokenizer fields = new StringTokenizer(record, "|");
            numFields += fields.countTokens();
            // Reset 'value' if this is the first append
            if (firstRead) {
                value = new Text();
                firstRead = false;
            }
            value.append(temp.getBytes(), 0, temp.getLength());
        }
    }
    if (newSize == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}
This is the nextKeyValue method that I am trying to get working, but the mapper is still not getting proper values.
reqFields is 4.
Look at how TextInputFormat is implemented. Look at its superclass, FileInputFormat, as well. You must subclass either TextInputFormat or FileInputFormat and implement your own record handling.
The thing to be aware of when implementing any kind of file input format is this:
The framework will split the file and give you the start offset and byte length of the piece of the file you have to read. It may very well happen that it splits the file right across some record. That is why your reader must skip the bytes of a record at the beginning of the split if that record is not fully contained in the split, and must also read past the last byte of the split to pick up the whole last record if that one is not fully contained in the split.
For example, TextInputFormat treats \n characters as record delimiters, so when it gets a split it skips the bytes up to the first \n character and reads past the end of the split up to the next \n character.
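For reference, Hadoop's LineRecordReader handles this in its initialize method in roughly the following shape; this is a hedged sketch, with the reader's fields (start, end, pos, in) assumed to be declared elsewhere:
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    start = split.getStart();
    end = start + split.getLength();
    FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
    FSDataInputStream fileIn = fs.open(split.getPath());
    fileIn.seek(start);
    in = new LineReader(fileIn, context.getConfiguration());
    if (start != 0) {
        // Not the first split: the record containing 'start' belongs to the
        // previous split's reader, so skip everything up to the first '\n'.
        start += in.readLine(new Text(), 0, (int) Math.min(Integer.MAX_VALUE, end - start));
    }
    pos = start;
}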
As for the code example:
You need to ask yourself the following question: say you open the file, seek to a random position, and start reading forward. How do you detect the start of a record? I don't see anything in your code that deals with that, and without it you cannot write a good input format, because you don't know what the record boundaries are.
Now, it is still possible to make the input format read the whole file end to end by making the isSplitable(JobContext, Path) method return false. That makes the file be read wholly by a single map task, which reduces parallelism.
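A minimal sketch of that option, assuming the new mapreduce API (the reader class name is hypothetical):
public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // the whole file goes to one mapper, so no record straddles a split
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new TwoLineRecordReader(); // hypothetical reader built around your nextKeyValue
    }
}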
Your inner while loop seems problematic, since it checks for lines that are too long and skips them. Given that your records are written across multiple lines, it can happen that you merge one part of one record with part of another record when you read it.
The string had to be tokenized using StringTokenizer and not split. The code has been updated with the new implementation.
After I replaced the MySQL JDBC driver 5.1 with the MariaDB JDBC driver 1.1.5 and tested the existing code base against MySQL Server 5.0 and MariaDB Server 5.2, everything worked fine except for a JDBC call to update a blob field in a table.
The blob field contains an XML configuration file. The code reads it out, converts it to XML, and inserts some values.
It then converts the result to a ByteArrayInputStream object and calls the method
statement.updateBinaryStream(columnLabel, the ByteArrayInputStream object, its length)
but an exception is thrown:
Perhaps you have some incorrect SQL syntax?
java.sql.SQLFeatureNotSupportedException: Updates are not supported
    at org.mariadb.jdbc.internal.SQLExceptionMapper.getFeatureNotSupportedException(SQLExceptionMapper.java:165)
    at org.mariadb.jdbc.MySQLResultSet.updateBinaryStream(MySQLResultSet.java:1642)
    at org.apache.commons.dbcp.DelegatingResultSet.updateBinaryStream(DelegatingResultSet.java:511)
I tried the updateBlob method; the same exception was thrown.
The code works well with the MySQL JDBC driver 5.1.
Any suggestions on how to work around this situation?
See the ticket updating blob with updateBinaryStream, which in a comment states that this isn't supported.
A workaround would be to use two SQL statements: one to select the data and another to update it. Something like this:
final Statement select = connection.createStatement();
try {
    final PreparedStatement update = connection.prepareStatement(
            "UPDATE table SET blobColumn=? WHERE idColumn=?");
    try {
        final ResultSet selectSet = select.executeQuery("SELECT idColumn,blobColumn FROM table");
        try {
            if (selectSet.next()) { // position the cursor on the first row before reading
                final int id = selectSet.getInt("idColumn");
                final InputStream stream =
                        workWithStreamAndReturnANew(selectSet.getBinaryStream("blobColumn"));
                update.setBinaryStream(1, stream);
                update.setInt(2, id);
                update.execute();
            }
        } finally {
            if (selectSet != null)
                selectSet.close();
        }
    } finally {
        if (update != null)
            update.close();
    }
} finally {
    if (select != null)
        select.close();
}
But be aware that you need some way to uniquely identify a table entry; in this example the column idColumn was used for that purpose. Furthermore, if you stored an empty stream in the database you might get an SQLException.
A simpler workaround is to use binary literals (like X'2a4b54') and concatenation (UPDATE table SET blobcol = blobcol || X'2a4b54'), like this:
int iBUFSIZ = 4096;
byte[] buf = new byte[iBUFSIZ];
int iLength = 0;
int iUpdated = 1;
for (int iRead = stream.read(buf, 0, iBUFSIZ);
     (iUpdated == 1) && (iRead != -1) && (iLength < iTotalLength);
     iRead = stream.read(buf, 0, iBUFSIZ)) {
    String sValue = "X'" + toHex(buf, 0, iRead) + "'";
    if (iLength > 0)
        sValue = sBlobColumn + " || " + sValue;
    String sSql = "UPDATE " + sTable + " SET " + sBlobColumn + "= " + sValue;
    Statement stmt = connection.createStatement();
    iUpdated = stmt.executeUpdate(sSql);
    stmt.close();
    iLength += iRead; // track bytes written so later chunks append instead of overwrite
}
I'm making a mobile app which needs thousands of fast string lookups and prefix checks. To speed this up, I made a trie out of my word list, which has about 180,000 words.
Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) takes about 10 seconds currently on my phone, which is really slow.
Here's the code that builds the trie.
public SimpleTrie makeTrie(String file) throws Exception {
    String line;
    SimpleTrie trie = new SimpleTrie();
    BufferedReader br = new BufferedReader(new FileReader(file));
    while ((line = br.readLine()) != null) {
        trie.insert(line);
    }
    br.close();
    return trie;
}
The insert method, which runs in O(length of key):
public void insert(String key) {
    TrieNode crawler = root;
    for (int level = 0; level < key.length(); level++) {
        int index = key.charAt(level) - 'A';
        if (crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
    }
    crawler.valid = true;
}
I'm looking for intuitive methods to build the trie faster. Maybe I could build the trie just once on my laptop, store it somehow on disk, and load it from a file on the phone? But I don't know how to implement this.
Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?
Any suggestions are appreciated. Thanks in advance.
EDIT
Someone suggested using Java Serialization. I tried it, but it was very slow with this code:
public void serializeTrie(SimpleTrie trie, String file) {
    try {
        ObjectOutput out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)));
        out.writeObject(trie);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public SimpleTrie deserializeTrie(String file) {
    try {
        ObjectInput in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)));
        SimpleTrie trie = (SimpleTrie) in.readObject();
        in.close();
        return trie;
    } catch (IOException | ClassNotFoundException e) {
        e.printStackTrace();
        return null;
    }
}
Can this above code be made faster?
My trie: http://pastebin.com/QkFisi09
Word list: http://www.isc.ro/lists/twl06.zip
Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand
Double-array tries are very fast to save and load because all data is stored in linear arrays. They are also very fast to look up, but insertions can be costly. I bet there is a Java implementation somewhere.
Also, if your data is static (i.e. you don't update it on phone) consider DAFSA for your task. It is one of the most efficient data structures for storing words (must be better than "standard" tries and radix tries both for size and for speed, better than succinct tries for speed, often better than succinct tries for size). There is a good C++ implementation: dawgdic - you can use it to build DAFSA from command line and then use a Java reader for the resulting data structure (example implementation is here).
You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.
public class SimpleTrie {
    public class TrieNode {
        boolean valid;
        int[] children;
    }

    private TrieNode[] nodes;
    private int numberOfNodes;

    private TrieNode getNode() {
        TrieNode t = nodes[++numberOfNodes];
        return t;
    }
}
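Storing and loading then becomes a plain dump of that array. A rough sketch of the save side, assuming the fields above and using -1 as the sentinel index for a missing child; loading is the mirror image with DataInputStream:
void save(DataOutputStream out) throws IOException {
    out.writeInt(numberOfNodes);
    for (int i = 0; i < numberOfNodes; i++) {
        out.writeBoolean(nodes[i].valid);
        for (int c = 0; c < 26; c++) {
            out.writeInt(nodes[i].children[c]); // -1 means no child
        }
    }
}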
Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.
Prefix look-up example:
Compare method:
private static int compare(String string, String prefix) {
    if (prefix.length() > string.length()) return Integer.MIN_VALUE;
    for (int i = 0; i < prefix.length(); i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s != p) {
            if (p < s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    return 0;
}
Finds an occurrence of the prefix in the array and returns its location (MIN or MAX mean not found):
private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start == end) {
        String lastValue = strings[start]; // start == end
        if (compare(lastValue, prefix) == 0)
            return start; // start == end
        return Integer.MAX_VALUE;
    }
    int low = start;
    int high = end + 1; // zero indexed, so add one
    int middle = low + ((high - low) / 2);
    String middleValue = strings[middle];
    int comp = compare(middleValue, prefix);
    if (comp == Integer.MIN_VALUE) return comp;
    if (comp == 0)
        return middle;
    if (comp > 0)
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}
Gets a String array and a prefix, and prints out occurrences of the prefix in the array:
private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length - 1);
    if (i == Integer.MAX_VALUE || i == Integer.MIN_VALUE) {
        // not found
        return false;
    }
    // Found an occurrence, now search up and down for other occurrences
    int up = i + 1;
    int down = i;
    while (down >= 0) {
        String string = strings[down];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up < strings.length) {
        String string = strings[up];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
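For example, a small usage sketch (the array must be sorted before searching):
public static void main(String[] args) {
    String[] words = {"TOO", "TREE", "TRIE", "TRY"};
    java.util.Arrays.sort(words); // binary search requires sorted input
    testPrefix(words, "TR"); // prints the found entry TRIE, then its neighbors TREE and TRY
}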
Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".
For example, the input
TREE \b\bIE \b\b\bOO \b\b\b
gives the word list
TREE
TRIE
TOO
On your desktop, construct the trie using whichever method you like, then serialize it with the following recursive algorithm (pseudocode):
serialize(node):
    if node is valid: put(' ')
    for letter in A-Z:
        if node has a child under letter:
            put(letter)
            serialize(child)
            put('\b')
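For completeness, a hedged Java sketch of the deserializer described above, assuming access to the question's trie internals (a root node with a 26-slot children array and a valid flag):
static void deserialize(java.io.Reader in, TrieNode root) throws java.io.IOException {
    java.util.Deque<TrieNode> stack = new java.util.ArrayDeque<>();
    stack.push(root);
    int ch;
    while ((ch = in.read()) != -1) {
        if (ch == ' ') {
            stack.peek().valid = true; // mark a complete word
        } else if (ch == '\b') {
            stack.pop(); // back up to the parent
        } else { // a letter A-Z: allocate a child and descend
            TrieNode child = new TrieNode();
            stack.peek().children[ch - 'A'] = child;
            stack.push(child);
        }
    }
}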
This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.
I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:
#include <string>
#include <fstream>

#define USE_NODE_POOL

#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif

struct Node {
    void insert(const std::string &s) { insert_helper(s, 0); }
    void insert_helper(const std::string &s, int idx) {
        if (idx >= s.length()) return;
        int char_idx = s[idx] - 'A';
        if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
            children[char_idx] = &node_pool[node_pool_idx++];
#else
            children[char_idx] = new Node();
#endif
        }
        children[char_idx]->insert_helper(s, idx + 1);
    }
    Node *children[26] = {};
};

int main() {
#ifdef USE_NODE_POOL
    node_pool = new Node[400000];
#endif
    Node n;
    std::ifstream fin("TWL06.txt");
    std::string word;
    while (fin >> word) n.insert(word);
}
Tries that preallocate space for all possible children (256 of them) waste a huge amount of space. You are making your cache cry. Store the pointers to children in a resizable data structure.
Some tries will optimize by having one node represent a long string, and break that string up only when needed.
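A hedged sketch of the resizable-children idea (illustrative, not the asker's TrieNode): only children that actually exist are stored, so a node costs memory proportional to its fan-out rather than a fixed 26- or 256-slot array.
class CompactNode {
    boolean valid;
    java.util.TreeMap<Character, CompactNode> children = new java.util.TreeMap<>();

    void insert(String key, int level) {
        if (level == key.length()) {
            valid = true;
            return;
        }
        // create the child lazily, then descend
        children.computeIfAbsent(key.charAt(level), c -> new CompactNode())
                .insert(key, level + 1);
    }
}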
Instead of a simple file, you can use a database like SQLite and a nested set or Celko tree to store the trie. You can also build a faster and shorter (fewer nodes) trie with a ternary search trie.
I don't like the idea of addressing nodes by index into an array, but only because it requires one more addition (index to pointer). However, with an array of preallocated nodes you will maybe save some time on allocation and initialization. And you can also save a lot of space by reserving the first 26 indices for leaf nodes, so you won't need to allocate and initialize 180,000 leaf nodes.
Also, with indices you will be able to read the prepared nodes array from disk in binary format. This has to be several times faster. But I'm not sure how to do this in your language. Is this Java?
If your source vocabulary is sorted, you may also save some time by comparing some prefix of the current string with the previous one, e.g. the first 4 characters. If they are equal, you can start your
for(int level=0 ; level < key.length() ; level++) {
loop from the 5th level.
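A hedged sketch of that optimization, reusing the question's TrieNode and getNode() and assuming sorted input (the path-array length is an assumed maximum key length):
private final TrieNode[] path = new TrieNode[64]; // assumed max key length + 1
private String prev = "";

public void insertSorted(String key) {
    if (path[0] == null) path[0] = root;
    // length of the prefix shared with the previously inserted key
    int common = 0;
    int max = Math.min(prev.length(), key.length());
    while (common < max && prev.charAt(common) == key.charAt(common)) {
        common++;
    }
    TrieNode crawler = path[common]; // skip straight past the shared prefix
    for (int level = common; level < key.length(); level++) {
        int index = key.charAt(level) - 'A';
        if (crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
        path[level + 1] = crawler;
    }
    crawler.valid = true;
    prev = key;
}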
Is it space inefficient or time inefficient? If you are rolling a plain trie, then space may be part of the problem when dealing with a mobile device. Check out Patricia/radix tries, especially if you are using it as a prefix look-up tool.
Trie:
http://en.wikipedia.org/wiki/Trie
Patricia/Radix trie:
http://en.wikipedia.org/wiki/Radix_tree
You didn't mention a language but here are two implementations of prefix tries in Java.
Regular trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java
Patricia/Radix (space-efficient) trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java
Generally speaking, avoid creating lots of objects from scratch in Java; it is slow and carries a massive overhead. Better to implement your own pooling class for memory management that allocates, say, half a million entries at a time in one go.
Also, serialization is too slow for large lexicons. Use a binary read to quickly populate the array-based representations proposed above.
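A hedged sketch of such a bulk binary read; the flat 27-ints-per-node layout (26 child indices plus a valid flag) is an assumption matching the array-of-nodes idea proposed earlier:
static int[] loadNodes(java.io.File f) throws java.io.IOException {
    try (java.io.DataInputStream in = new java.io.DataInputStream(
            new java.io.BufferedInputStream(new java.io.FileInputStream(f)))) {
        int n = in.readInt();
        byte[] raw = new byte[n * 27 * 4];
        in.readFully(raw); // one bulk read instead of per-node object I/O
        int[] nodes = new int[n * 27];
        java.nio.ByteBuffer.wrap(raw).asIntBuffer().get(nodes);
        return nodes;
    }
}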
I have the following query (column log is of type CLOB):
UPDATE table SET log=? where id=?
The query above works fine when using the setAsciiStream method to put a value longer than 4000 characters into the log column.
But instead of replacing the value, I want to append it, hence my query looks like this:
UPDATE table SET log=log||?||chr(10) where id=?
The above query DOES NOT work any more and I get the following error:
java.sql.SQLException: ORA-01461: can bind a LONG value only for insert into a LONG column
It looks to me like you have to use a PL/SQL block to do what you want. The following works for me, assuming there's an entry with id 1:
import oracle.jdbc.OracleDriver;

import java.sql.*;
import java.io.ByteArrayInputStream;

public class JDBCTest {
    // How much test data to generate.
    public static final int SIZE = 8192;

    public static void main(String[] args) throws Exception {
        // Generate some test data.
        byte[] data = new byte[SIZE];
        for (int i = 0; i < SIZE; ++i) {
            data[i] = (byte) (64 + (i % 32));
        }
        ByteArrayInputStream stream = new ByteArrayInputStream(data);

        DriverManager.registerDriver(new OracleDriver());
        Connection c = DriverManager.getConnection(
                "jdbc:oracle:thin:@some_database", "user", "password");
        String sql =
                "DECLARE\n" +
                "  l_line CLOB;\n" +
                "BEGIN\n" +
                "  l_line := ?;\n" +
                "  UPDATE table SET log = log || l_line || CHR(10) WHERE id = ?;\n" +
                "END;\n";
        PreparedStatement stmt = c.prepareStatement(sql);
        stmt.setAsciiStream(1, stream, SIZE);
        stmt.setInt(2, 1);
        stmt.execute();
        stmt.close();
        c.commit();
        c.close();
    }
}
BLOBs are not mutable from SQL (well, besides setting them to NULL), so to append, you would have to download the blob first, concatenate locally, and upload the result again.
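A hedged sketch of that read-append-write approach, using the question's table and column names (for very large values you would use setCharacterStream instead of setString):
static void appendToLog(Connection conn, int id, String newChunk) throws SQLException {
    String current;
    try (PreparedStatement sel = conn.prepareStatement("SELECT log FROM table WHERE id=?")) {
        sel.setInt(1, id);
        try (ResultSet rs = sel.executeQuery()) {
            rs.next();
            current = rs.getString(1); // materialize the current CLOB value
        }
    }
    try (PreparedStatement upd = conn.prepareStatement("UPDATE table SET log=? WHERE id=?")) {
        upd.setString(1, current + newChunk + "\n"); // append locally, write back
        upd.setInt(2, id);
        upd.executeUpdate();
    }
}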
The usual solution is to write several records to the database with a common key and a sequence which tells the DB how to order the rows.
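A hedged illustration of that design; the log_lines table and its columns are invented for the example:
// Assumed schema: log_lines(id NUMBER, seq NUMBER, line VARCHAR2(4000)).
// Each appended chunk becomes its own row; ORDER BY seq reassembles the log.
try (PreparedStatement ins = conn.prepareStatement(
        "INSERT INTO log_lines (id, seq, line) VALUES (?, ?, ?)")) {
    ins.setInt(1, id);
    ins.setInt(2, nextSeq); // caller tracks the next sequence number per id
    ins.setString(3, newChunk);
    ins.executeUpdate();
}
// To read back: SELECT line FROM log_lines WHERE id=? ORDER BY seq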