Why is Document.html() so slow? - performance

I was under the impression that the most costly method in Jsoup's API is parse().
But I just discovered that Document.html() could be even slower.
Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.
Why is Document.html() so slow?

Answering myself. The Element.html() method is implemented as:
public String html() {
    StringBuilder accum = new StringBuilder();
    html(accum);
    return accum.toString().trim();
}
Using a StringBuilder instead of String concatenation is already a good thing, and the calls to StringBuilder.toString() and String.trim() alone are unlikely to explain the slowness of Document.html(), even for a relatively large document.
In the middle, however, the method calls an overloaded version, Element.html(StringBuilder), which loops over all child nodes:
private void html(StringBuilder accum) {
    for (Node node : childNodes)
        node.outerHtml(accum);
}
Each child's outerHtml() call in turn serializes that child's entire subtree, so the whole document is walked and re-serialized; with many nodes, that walk is what makes the call slow.
It would be interesting to see whether a faster implementation is possible.
For example, Jsoup could keep a cached copy of the raw HTML that was handed to Jsoup.parse(). As an opt-in feature, of course, to preserve backward compatibility and a small memory footprint.
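A minimal sketch of that idea, assuming you control the call site (the ParsedPage wrapper below is hypothetical, not part of Jsoup's API):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical wrapper: keep the raw input alongside the parsed Document so callers
// that only need the original markup back can skip re-serializing the node tree.
public final class ParsedPage {

    private final String rawHtml;   // the exact string handed to Jsoup.parse()
    private final Document document;

    public ParsedPage(String rawHtml) {
        this.rawHtml = rawHtml;
        this.document = Jsoup.parse(rawHtml);
    }

    public Document document() {
        return document;
    }

    // Cheap: returns the cached input instead of walking and serializing the tree.
    public String rawHtml() {
        return rawHtml;
    }
}

The obvious trade-off is that the cached string goes stale as soon as the tree is modified, which is presumably why Jsoup does not do something like this by default.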

Related

Extracting all children belonging to a specific parent in GraphQL

I am using GraphQL and Java. I need to extract all the children belonging to a specific parent. I have used the approach below, but it fetches only the parent and none of the children.
schema {
    query: Query
}

type LearningResource {
    id: ID
    name: String
    type: String
    children: [LearningResource]
}

type Query {
    fetchLearningResource: LearningResource
}
@Component
public class LearningResourceDataFetcher implements DataFetcher {

    @Override
    public LearningResource get(DataFetchingEnvironment dataFetchingEnvironment) {
        LearningResource lr3 = new LearningResource();
        lr3.setId("id-03");
        lr3.setName("Resource-3");
        lr3.setType("Book");

        LearningResource lr2 = new LearningResource();
        lr2.setId("id-02");
        lr2.setName("Resource-2");
        lr2.setType("Paper");

        LearningResource lr1 = new LearningResource();
        lr1.setId("id-01");
        lr1.setName("Resource-1");
        lr1.setType("Paper");

        List<LearningResource> learningResources = new ArrayList<>();
        learningResources.add(lr2);
        learningResources.add(lr3);
        lr1.setChildren(learningResources);
        return lr1;
    }
}
return RuntimeWiring.newRuntimeWiring()
        .type("Query", typeWiring -> typeWiring.dataFetcher("fetchLearningResource", learningResourceDataFetcher))
        .build();
My Controller endpoint
@RequestMapping(value = "/queryType", method = RequestMethod.POST)
public ResponseEntity query(@RequestBody String query) {
    System.out.println(query);
    ExecutionResult result = graphQL.execute(query);
    System.out.println(result.getErrors());
    System.out.println(result.getData().toString());
    return ResponseEntity.ok(result.getData());
}
My request would be like below
{
    fetchLearningResource
    {
        name
    }
}
Can anybody please help me sort this out?
Because I get asked this question a lot in real life, I'll answer it in detail here so people have an easier time googling (and I have something to point at).
As noted in the comments, the selection for each level has to be explicit; there is no notion of an infinitely recursive query like "get everything under a node, down to the bottom" (or "get all children of this parent recursively").
The reason is mostly that allowing such queries could easily put you in a dangerous situation: a user would be able to request the entire object graph from the server in one easy go! For any non-trivial data size, this would kill the server and saturate the network in no time. Additionally, what would happen once a recursive relationship is encountered?
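For example, to get two levels of children back, the client has to spell out each level of the selection explicitly:

{
    fetchLearningResource
    {
        name
        children
        {
            name
            children
            {
                name
            }
        }
    }
}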
Still, there is a semi-controlled escape-hatch you could use here. If the scope in which you need everything is limited (and it really should be), you could map the output type of a specific query as a (complex) scalar.
In your case, this would mean mapping LearningResource as a scalar. Then, fetchLearningResource would effectively be returning a JSON blob, where the blob would happen to be all the children and their children recursively. Query resolution doesn't descend deeper once a scalar field is reached, as scalars are leaf nodes, so it can't keep resolving the children level-by-level. This means you'd have to recursively fetch everything in one go, by yourself, as the GraphQL engine can't help you here. It also means sub-selections become impossible (as scalars can't have sub-selections - again, they're leaf nodes), so the client would always get all the children and all the fields from each child back. If you still need the ability to limit the selection in certain cases, you can expose two different queries, e.g. fetchLearningResource and fetchAllLearningResources, where the former would be mapped as it is now, and the latter would return the scalar as explained.
An object scalar implementation is provided by the graphql-java ExtendedScalars project.
The schema could then look like:
schema {
    query: Query
}

scalar Object

type Query {
    fetchLearningResource: Object
}
And you'd use the method above to produce the scalar implementation:
RuntimeWiring.newRuntimeWiring()
        .scalar(ExtendedScalars.Object) // register the scalar implementation
        .type("Query", typeWiring -> typeWiring.dataFetcher("fetchLearningResource", learningResourceDataFetcher))
        .build();
Depending on how you process the results of this query, the DataFetcher for fetchLearningResource may need to turn the resulting object into a map-of-maps (JSON-like object) before returning to the client. If you simply JSON-serialize the result anyway, you can likely skip this. Note that you're side-stepping all safety mechanisms here and must take care not to produce enormous results. By extension, if you need this in many places, you're very likely using a completely wrong technology for your problem.
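If you do find you need that conversion, a minimal sketch could look like the following, assuming Jackson is available and that LearningResource has standard getters (the loadTree() helper is hypothetical and stands in for the construction code from the question):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

import graphql.schema.DataFetcher;
import graphql.schema.DataFetchingEnvironment;

public class LearningResourceDataFetcher implements DataFetcher<Map<String, Object>> {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public Map<String, Object> get(DataFetchingEnvironment environment) {
        LearningResource root = loadTree(); // build lr1/lr2/lr3 as in the question
        // Convert the object graph into a map-of-maps so the Object scalar can return it as-is.
        return MAPPER.convertValue(root, new TypeReference<Map<String, Object>>() {});
    }

    private LearningResource loadTree() {
        // ... same construction as in the question ...
        return new LearningResource();
    }
}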
I have not tested this with your code myself, so I might have skipped something important, but this should be enough to get you (or anyone googling) onto the right track (if you're sure this is the right track).
UPDATE: I've seen someone implement a custom Instrumentation that rewrites the query immediately after it's parsed, and adds all fields to the selection set if no field had already been selected, recursively. This effectively allows them to select everything implicitly.
In graphql-java v11 and prior, you could mutate the parsed query (represented by the Document class). As of v12 this is no longer possible, but instrumentations in turn gain the ability to replace the Document explicitly via the new instrumentDocument method.
Of course, this only makes sense if your schema is such that it cannot be exploited, or you fully control the client so there's no danger. You could also do it selectively for only some types, but that would be extremely confusing to use.

Java 8: extract predicates as fields or methods?

What is the cleaner way of extracting predicates that will be used in multiple places: methods or class fields?
The two examples:
1. Class field

void someMethod() {
    IntStream.range(1, 100)
            .filter(isOverFifty)
            .forEach(System.out::println);
}

private IntPredicate isOverFifty = number -> number > 50;
2. Method

void someMethod() {
    IntStream.range(1, 100)
            .filter(isOverFifty())
            .forEach(System.out::println);
}

private IntPredicate isOverFifty() {
    return number -> number > 50;
}
For me, the field way looks a little bit nicer, but is this the right way? I have my doubts.
Generally you cache things that are expensive to create, and these stateless lambdas are not. A stateless lambda will have a single instance created for the entire pipeline (under the current implementation). The first invocation is the most expensive one: the underlying Predicate implementation class will be created and linked; but this happens only once, for both stateless and stateful lambdas.
A stateful lambda will use a different instance for each element and it might make sense to cache those, but your example is stateless, so I would not.
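To make the stateless/stateful distinction concrete, here is a small illustration (not from the question; readThresholdFromConfig() is a hypothetical stand-in for whatever supplies the captured state):

import java.util.function.IntPredicate;
import java.util.stream.IntStream;

public class LambdaKinds {

    public static void main(String[] args) {
        // Stateless: captures nothing, so the runtime can hand out one shared instance.
        IntPredicate overFifty = n -> n > 50;

        int threshold = readThresholdFromConfig();
        // Capturing ("stateful"): a fresh instance is created each time this lambda
        // expression is evaluated, because it has to hold the captured 'threshold'.
        IntPredicate overThreshold = n -> n > threshold;

        System.out.println(IntStream.range(1, 100).filter(overFifty).count());
        System.out.println(IntStream.range(1, 100).filter(overThreshold).count());
    }

    private static int readThresholdFromConfig() {
        return 50; // placeholder value
    }
}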
If you still want to extract it (for readability, I assume), I would put it in a dedicated class, say Predicates. That way it is reusable across different classes as well, something like this:
public final class Predicates {

    private Predicates() {
    }

    public static IntPredicate isOverFifty() {
        return number -> number > 50;
    }
}
You should also note that using Predicates.isOverFifty() inside a stream and writing x -> x > 50 inline, while semantically the same, have different memory characteristics.
In the first case, only a single instance (and class) is created and served to all callers, while the second (x -> x > 50) creates not only a different instance but also a different class for each of its call sites (think of the same expression used in different places in your application). This happens because linkage occurs per CallSite, and in the second case the CallSite is always different.
But that is nothing you should rely on (or probably even consider): these objects and classes are cheap to create and quick for the GC to collect. Use whatever fits your needs.
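If you are curious, the per-call-site behaviour can be observed with a throwaway sketch (illustrative only; the printed class names are a JVM implementation detail):

import java.util.function.IntPredicate;

public class CallSiteDemo {

    public static void main(String[] args) {
        IntPredicate a = x -> x > 50; // first call site
        IntPredicate b = x -> x > 50; // second, textually identical call site

        // On current HotSpot builds these typically print two different synthetic
        // class names, one per call site, and the two instances are distinct.
        System.out.println(a.getClass());
        System.out.println(b.getClass());
        System.out.println(a == b);
    }
}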
To answer, it helps to expand those lambda expressions into old-fashioned anonymous classes. You can see that these are just the two forms we have always written by hand, so it comes down to how you prefer to structure a particular piece of code.
private IntPredicate isOverFifty = new IntPredicate() {
    @Override
    public boolean test(int number) {
        return number > 50;
    }
};

private IntPredicate isOverFifty() {
    return new IntPredicate() {
        @Override
        public boolean test(int number) {
            return number > 50;
        }
    };
}
1) In the field case, the predicate is allocated for every instance of your object. That is not a big deal if you have only a few instances (e.g. services), but if this is a value object of which there can be many, it is not a good solution. Also keep in mind that someMethod() may never be called at all. One possible fix is to make the predicate a static field.
2) In the method case, the predicate is created on every someMethod() call and discarded by the GC afterwards.
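A sketch of the static-field variant mentioned in 1) (NumberFilter is just a placeholder class name):

import java.util.function.IntPredicate;
import java.util.stream.IntStream;

public class NumberFilter {

    // Allocated once when the class is initialized and shared by all instances.
    private static final IntPredicate IS_OVER_FIFTY = number -> number > 50;

    void someMethod() {
        IntStream.range(1, 100)
                .filter(IS_OVER_FIFTY)
                .forEach(System.out::println);
    }
}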

Fetch 1M records in orientdb: why is it 6x slower than bare SQL+MySQL

For some graph algorithm I need to fetch a lot of records from a database into memory (~1M records). I want this to be done fast and I want the records to be objects (that is: I want ORM). To crudely benchmark different solutions I created a simple problem of one table with 1M Foo objects, like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?
One can see that fetching them using bare SQL is extremely fast; converting the records to objects using a simple for-loop is also fast. Both execute in around 2-3 seconds. However, using ORMs like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
SQLAlchemy advertises a "Mature, High Performing Architecture" (http://www.sqlalchemy.org/features.html), and Hibernate similarly claims "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow very generic object-oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than plain SQL and native code. Personally I think they should publish better benchmarks to show this, that is, benchmarks comparing against native SQL + Java or Python. But that is not the problem at hand.
Of course, I don't want SQL + native code, as it is hard to maintain. So I was wondering why there does not exist something like an object-oriented database that handles the database-to-object mapping natively. Someone suggested OrientDB, so I tried it. The API is quite nice: once you have your getters and setters right, the object is insertable and selectable.
But I want more than just API-sweetness, so I tried the 1M example:
import java.io.Serializable;

public class Foo implements Serializable {
    public Foo() {}
    public Foo(int a, int b, int c) { this.a = a; this.b = b; this.c = c; }

    public int a, b, c;

    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;

public class Main {
    public static void insert() throws Exception {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        int N = 1000000;
        long time = System.currentTimeMillis();
        for (int i = 0; i < N; i++) {
            Foo foo = new Foo(i, i * i, i + i * i);
            db.save(foo);
        }
        db.close();
        System.out.println(System.currentTimeMillis() - time);
    }

    public static void fetch() {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        long time = System.currentTimeMillis();
        for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
            if (f.getA() == 345234) System.out.println(f.getB());
        }
        System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
        db.close();
    }

    public static void main(String[] args) throws Exception {
        //insert();
        fetch();
    }
}
Fetching 1M Foos with OrientDB takes approximately 18 seconds. The for-loop with getA() is there to force the object fields to actually be loaded into memory, as I noticed that by default they are fetched lazily. I suspect this is also why fetching the Foos is slow: there is a database access on every iteration instead of a single access that fetches everything (including the fields).
I tried to fix that with setFetchPlan("*:-1"), figuring it might also apply to fields, but that did not seem to work.
Question: Is there a way to do this fast, preferably in the 2-3 seconds range? Why does this take 18 seconds, whilst the bare SQL version uses 3 seconds?
Addition: Using an ODatabaseDocumentTx as @frens-jan-rumph suggested gave me a speedup not of approximately 5x, but only of approximately 2x. Adjusting the code as follows gave me a running time of approximately 9 seconds. This is still 3 times slower than raw SQL, even though no conversion to Foo objects was performed. Almost all of the time goes to the for-loop.
public static void fetch() {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
    long time = System.currentTimeMillis();
    ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
    it.setFetchPlan("*:0");
    System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    time = System.currentTimeMillis();
    for (ODocument f : it) {
        //if ((int) f.field("a") == 345234) System.out.println(f.field("b"));
    }
    System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    db.close();
}
The answer lies in convenience.
During an interview, when I asked a candidate what they thought of LINQ (C# I know, but pertinent to your question), they quite rightly answered that it was a sacrifice of performance, over convenience.
A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query into nice, easy-to-use POCOs.
That said, the difference should not be as great as you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that large. I do have experience here, and within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.
With large swaths of data, I would expect an initial slow-down from an ORM, but after that it should be negligible. 3 seconds to 18 seconds is huge.
If you profile your test, you will discover that around 60-80% of the CPU time is taken by the execution of the following four methods:
com.orientechnologies...OObjectEntitySerializer.getField(...)
com.orientechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
com.orientechnologies...OObjectMethodFilter.isScalaClass(...)
javassist...SecurityActions.getDeclaredMethods(...)
So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x, which might just get you where you want to be.
Still, a lot of the time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That is expensive for just reading an integer from a memory location. I don't understand why plain java.nio ByteBuffers are not used here; that would save a lot of crossing of the Java/native border, etc.
Apart from these micro benchmarks and remarkable behaviour in OrientDB I think that there are at least two other things to consider:
Does this test reflect your expected workload?
I.e. you read a straightforward list of records. If so, why use a database? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
Does this test reflect your expected setup?
E.g. you are reading from a plocal database, while reading from any database over TCP/IP might just as well have its bottleneck somewhere else. Also, you are reading from one thread/process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more bookkeeping overhead, etc.).
P.S. I would recommend warming up the code before benchmarking.
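For example, something along these lines in place of the main method above (a rough sketch; the iteration count is arbitrary, and for serious numbers a harness such as JMH is the usual choice):

public static void main(String[] args) throws Exception {
    // Warm-up: exercise the code path a few times so class loading and JIT
    // compilation happen before the timed run.
    for (int i = 0; i < 5; i++) {
        fetch();
    }
    long start = System.currentTimeMillis();
    fetch();
    System.out.println("Measured run: " + (System.currentTimeMillis() - start) + " ms");
}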
What you do here is a worst-case scenario. As you wrote (or should have written), your database test just reads a table and writes it straight out to some stream.
So what you see is the full overhead of a lot of magic. Usually, if you do something more complex like joining, selecting, filtering and ordering, the overhead of your ORM comes down to a more reasonable share of 5 to 10%.
Another thing to think about (I guess Orient does the same): the ORM solution creates new objects, multiplying memory consumption, and Java is really bad at memory consumption. That is the reason I use custom in-memory tables whenever I handle a lot of data/objects, where an "object" is simply a row in a table.
Also, your objects get inserted into a list/map (at least Hibernate does this) so that their dirtiness can be tracked once you change them. This insertion takes a lot of time as the collection grows and is one reason we use paginated lists or maps; copying 1M references is dead slow as the backing area grows.

Does Java optimize new HashSet(someHashSet).contains() to be O(1)?

This class is an example of where the issue arises:
public class ContainsSet {
    private static HashSet<E> myHashSet;

    [...]

    public static Set<E> getMyHashSet() {
        return new HashSet<E>(myHashSet);
    }

    public static boolean doesMyHashSetContain(E e) {
        return myHashSet.contains(e);
    }
}
Now imagine two possible functions:
boolean method1() {
    return ContainsSet.getMyHashSet().contains(someE);
}

boolean method2() {
    return ContainsSet.doesMyHashSetContain(someE);
}
My question is whether method 1, after Java's optimizations, has the same time complexity as method 2.
(I used HashSet instead of just Set to emphasize that myHashSet.contains(someE) has complexity O(1).)
Without optimization it would not. Although .contains() has complexity O(1), the new HashSet<E>(myHashSet) has complexity O(n), which would give method 1 a complexity of O(n) + O(1) = O(n), which is horrible compared to the beloved O(1).
The reason this issue matters is that I was taught not to return lists or sets directly if an external class should not be allowed to change their contents. Returning a copy is the obvious solution, but it can be really time-consuming.
No, javac does not (and cannot) optimize this away; at this level it is required to emit byte code that matches your source. The JVM is not nearly intelligent enough to optimize it away either: the copy is far too likely to have side effects for the JVM to prove otherwise.
Don't return a copy of the HashSet if you want immutability. Wrap it in an unmodifiable wrapper: Collections.unmodifiableSet(myHashSet)
What can I say here but that creating a new HashSet and populating it via the constructor is expensive!
Java will not "optimize away" this work: even though you and I know it would give the same result as "passing through" the contains() call, Java cannot know this.
No. That is beyond what optimization can do. You return a new object and could use it somewhere else; Java is not allowed to omit it. A new HashSet will be created.
Returning a copy is not good practice here; it is not only time-consuming but also space-consuming. As Sean said, you can wrap the set with unmodifiableSet or wrap it in your own class.
You can try this:
public static Set<E> getMyHashSet() {
    return Collections.unmodifiableSet(myHashSet);
}
Note: this method returns a view of your set, not a copy.

Tree traversal without recursive / stack usage (C#)?

I'm working with WPF and I'm developing a complex usercontrol, which is composed of a tree with rich functionality etc.
For this purpose I used the view-model design pattern, because some operations couldn't be achieved directly in WPF. So I take the IHierarchyItem (which is a node) and pass it to this constructor to create a tree structure:
private IHierarchyItemViewModel(IHierarchyItem hierarchyItem, IHierarchyItemViewModel parent)
{
    this.hierarchyItem = hierarchyItem;
    this.parent = parent;

    List<IHierarchyItemViewModel> l = new List<IHierarchyItemViewModel>();
    foreach (IHierarchyItem item in hierarchyItem.Children)
    {
        l.Add(new IHierarchyItemViewModel(item, this));
    }
    children = new ReadOnlyCollection<IHierarchyItemViewModel>(l);
}
The problem is that this constructor takes about 3 seconds !! for 200 items on my dual-core.
Am I doing anything wrong, or is a recursive constructor call really that slow?
Thank you very much!
OK, I found a non-recursive version myself, although it uses a Stack.
It traverses the whole tree:
Stack<MyItem> stack = new Stack<MyItem>();
stack.Push(root);
while (stack.Count > 0)
{
    MyItem taken = stack.Pop();
    foreach (MyItem child in taken.Children)
        stack.Push(child);
}
There should be nothing wrong with a recursive implementation of a tree, especially for such a small number of items. A recursive implementation is sometimes less space efficient, and slightly less time efficient, but the code clarity often makes up for it.
It would be useful for you to perform some simple profiling on your constructor. Using one of the suggestions from http://en.csharp-online.net/Measure_execution_time, you can see for yourself how long each piece is taking.
There is a possibility that one piece in particular is taking a long time. In any case, that might help you narrow down where you are really spending the time.
