Loading a large, non-local file in MarkLogic 5 - SAX parser

I have a very large file that is not on the same box as the MarkLogic server. Putting the file on the same server is not an option.
What is the best way to load the file into the database? I'm thinking that a SAX parser could pick off nodes and load them into the database.
<a>
<b>xxx</b>
<b>yyy</b>
<b>zzz</b>
</a>
So, using the above XML, I'd create a document in MarkLogic containing an empty <a> element.
Then, using the SAX parser, I'd grab the first <b> element and insert it as child of <a>. I'd repeat for all remaining <b> elements.
Does that sound like the best approach? Would it be performant?
TIA

As Eric indicates, one large document is probably not what you want. MarkLogic is designed to work best with many (thousands to billions) of bite-sized (dozens to hundreds of kilobytes) documents. If you’re familiar with relational databases, you can think of relational rows (not tables) as roughly equivalent to documents in a MarkLogic database.
Can you provide some more details on what you’re trying to do? What types of queries do you expect to perform? What does your data actually look like? How large is “very large”?

First, once you have all that data in MarkLogic, you probably won't want it all in one document.
You can use a tool like mlcp to help you break up your doc and load it. See http://developer.marklogic.com/products/mlcp as well as http://docs.marklogic.com/guide/ingestion and in particular the bit on mlcp (content pump) at http://docs.marklogic.com/guide/ingestion/content-pump

You may find the StAX scripting features of xmlsh useful to you.
This parses like SAX but is much easier to deal with because it's a PULL, not a PUSH, technology, so there are no callbacks. And it's fully scriptable, so it needs much less code than pure Java while being nearly as fast.
http://www.xmlsh.org/FunctionsStAX
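To illustrate the pull-style idea (this is not xmlsh, just a minimal sketch in Python using the standard library's xml.etree.ElementTree.iterparse), the loop below reads the stream incrementally and hands back each <b> element as its own small document. The function name iter_records and the in-memory sample are assumptions for illustration; the actual insert into the database is left as a placeholder.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, record_tag="b"):
    """Yield each <record_tag> element as its own small XML string."""
    # iterparse reads the stream incrementally instead of loading it whole.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            yield ET.tostring(elem, encoding="unicode")
            # Release the element's text and children once it has been handed off.
            elem.clear()


# Stand-in for the remote file; any binary file-like object works,
# for example the response object returned by urllib.request.urlopen(...).
sample = io.BytesIO(b"<a><b>xxx</b><b>yyy</b><b>zzz</b></a>")

docs = list(iter_records(sample))
for doc in docs:
    print(doc)  # replace with the call that inserts the document into the database
```

Because each record is yielded as soon as its closing tag is seen, the caller decides when to pull the next one, which is what makes this easier to script than SAX callbacks.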
-David

I want to create a desktop app with database-like search functions but without the SQL database

I know basic SQL, and SQL is all I know when it comes to storing and retrieving data. I want to create 1 .exe and it should contain all ~100,000 key-value pairs (I have the data in .txt files) and maybe an extra attribute for a description (this I would add myself, like a note to myself).
I would also like to write it in a new language I don't know yet, like Python or C# (I have made desktop apps written in Java and VB.NET, all with SQL databases). So language will not be an issue and I would appreciate suggestions.
These key-value pairs might not need to be updated, and I'm willing to re-compile/repackage the code to make 1 change in the data. The key is 6 letters long with 2 numbers at the end, like hxnaaa01. Each of these letters represents or describes something about the entry, so I would also need to search for a specific letter at a specific position to get exactly what I need.
I know that regex would work well with what I need but all I mentioned is all I know. I don't know enough and I don't know what keywords to google.
I have read about XML and CSV. I don't really know what they are and I'm not sure how all of this would fit in 1 executable.
To summarize, I need:
1 executable (Windows Desktop App)
Search function ~100k KVP+1more attribute (using regex?)
no database
with GUI
ability to add a "note" to each KVP
should be fast and lightweight
You asked for: "1 executable (Windows Desktop App), no database".
Data persistence will require either additional files or a database. It's pretty much unavoidable: you can store data in memory, but it only persists for as long as it resides there.
You have another requirement: "fast and lightweight".
To achieve this requirement, you'll need to really think about your solution, what technology you use and how you can improve it in future.
Although searching through data is pretty trivial, an efficient solution is not. It requires upfront research into algorithms, data structures and general practices. (which is a rabbit hole itself).
In the case of JSON [1], you'll need to create an additional file to contain all your key/value pairs. You can use C# to create the extra file (on first launch, for example).
JSON promises to be lightweight. I tend to agree; some may not. Dealing with the filesystem, though, is often far from a lightweight solution, as I think most would agree.
JSON is very readable though:
{
"key": "value",
"comment": "oh this is cool"
}
There are a lot of factors that play into something being fast and lightweight, so there's a need for some research on your part.
Honestly, depending on your experience, I wouldn't focus so much on the fast, I'd focus more on it working, then refactor that into something that's fast if it's too slow. [2]
And again, depending on your experience, I'd stick to opening the file, using a for/loop to find my key and do something with the data found, plus reward myself for having something that works.
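As a rough sketch of that "open the file, loop, find the key" approach, here it is in Python (one of the languages the question mentions). The record layout (key mapped to a value and a note), the sample keys, and the function name find_by_position are assumptions for illustration, and the JSON is held in a string so the example runs without a file.

```python
import json
import re

# Hypothetical contents of the data file: key -> {"value", "note"}.
raw = json.dumps({
    "hxnaaa01": {"value": "first item", "note": "oh this is cool"},
    "hxnbaa02": {"value": "second item", "note": ""},
    "kxmaaa03": {"value": "third item", "note": ""},
})

records = json.loads(raw)  # with a real file, read its text and parse it the same way


def find_by_position(records, position, letter):
    """Return keys whose character at `position` (0-based) equals `letter`."""
    # Build a pattern such as ^.{3}b : skip `position` characters, then match.
    pattern = re.compile(r"^.{%d}%s" % (position, re.escape(letter)))
    matches = []
    for key in records:  # the plain for loop suggested above
        if pattern.match(key):
            matches.append(key)
    return matches


starts_with_h = find_by_position(records, 0, "h")
fourth_is_b = find_by_position(records, 3, "b")
print(starts_with_h, fourth_is_b)
```

A linear scan like this over ~100k short keys is usually quick enough to start with; an index per position could be added later if it turns out to be too slow.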
TL;DR: you need either a file, or database for truly persistent storage, JSON or a remotely hosted MySQL would work. Try not to focus too much on fast before you have something that works.
https://www.json.org/json-en.html [1]
https://stackoverflow.com/a/5581595/2932298
https://stackify.com/premature-optimization-evil/ [2]

Front and back end techniques to increase performance

What are some of the common and notable performance issues/bottlenecks that are typically encountered in a web application, in both the front-end and back-end layers?
An example of what I mean in a database is querying on a column that is not indexed. That would slow down the query. On the front-end it might be something funky going on with JavaScript that makes your application seem slow.
What are the general rules of thumb that help navigate such issues? And what are some good to-do's?
Thanks,
Alex
On front-end:
-push all of your assets - css files, images, static content - to a CDN. Edgecast is pretty good and reasonably priced.
-don't load entire JavaScript frameworks when you only need a few features from them. Only load what's needed.
On back-end
-memcache the results from all database calls by using a hash of the sql query as the key name, and the result set as the value
-make sure you are not making your database tables really 'wide' - tons of columns and column types like 'text' and 'blob'
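The query-caching point above can be sketched roughly like this in Python. A plain dict stands in for the memcached client and an in-memory SQLite database stands in for the real one; both, along with the function name cached_query and the users table, are assumptions for illustration.

```python
import hashlib
import sqlite3

query_cache = {}  # stand-in for a memcached client (get/set by key)
db_hits = 0       # counts how often the database is actually queried

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alfred",), ("alex",)])


def cached_query(sql, params=()):
    """Return rows for `sql`, hitting the database only on a cache miss."""
    global db_hits
    # Key on the SQL text plus its parameters so different arguments do not collide.
    key = hashlib.sha256(repr((sql, params)).encode("utf-8")).hexdigest()
    if key not in query_cache:
        db_hits += 1
        query_cache[key] = conn.execute(sql, params).fetchall()
    return query_cache[key]


first = cached_query("SELECT name FROM users WHERE id = ?", (1,))
second = cached_query("SELECT name FROM users WHERE id = ?", (1,))
print(first, second, db_hits)
```

The sketch leaves out invalidation: cached entries need an expiry time or an explicit delete when the underlying rows change, otherwise stale results are served.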
For the front-end, there are well-known guidelines/rules you can follow, and there are some great tools like YSlow that can help you pinpoint the bottlenecks.
For the back-end, as you've noted, efficient use of indexes is a must. Other optimizations usually involve caching, and basic stuff like avoiding doing stuff within loops that can be done once. I'm sure people here will have suggestions, but remember "premature optimization is the root of all evil!" :-)
Millhouse is on to it. I can also add:
Batch expensive operations up. For example: don't make lots of individual calls to a database if you can do it all in one hit.
Avoid server hops where you can.
Process in parallel if you can (not so common for your 'average' web app but quite possible in larger Enterprise scale apps).
Pre-process: crunching data, pre-publishing content etc. The more you can do before it's needed, the better.
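To make the batching point concrete, here is a small sketch in Python with an in-memory SQLite database (the products table and its contents are made up for illustration). It fetches the same rows first with one query per id and then with a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(1, "apple"), (2, "pear"), (3, "plum")],
)

wanted = [1, 3]

# One round trip per id: N queries for N ids.
one_by_one = [
    conn.execute("SELECT name FROM products WHERE id = ?", (pid,)).fetchone()[0]
    for pid in wanted
]

# One round trip in total: build an IN (...) list with one placeholder per id.
placeholders = ", ".join("?" for _ in wanted)
batched = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM products WHERE id IN (%s) ORDER BY id" % placeholders,
        wanted,
    )
]

print(one_by_one, batched)
```

With a local in-memory database the difference is invisible, but against a database on another host each extra query costs a network round trip.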
Use a CQRS-based architecture. CQRS stands for Command/Query Responsibility Segregation; it basically means that you have different code (services) for reading from the DB and writing to the DB. A good practice for scalability is to have separate DBs for reading and writing (it actually does make sense, if you read more about CQRS), and you can scale out the reading database by having copies run on multiple servers.
CQRS is not only interesting from a scalability point of view, but also from a code maintenance and clarity point of view. It does take some effort to learn about CQRS and understand it, though.
Check out these links:
http://www.slideshare.net/skillsmatter/ddd-exchange-2010-udi-dahan-on-architectural-innovation-cqrs
http://www.slideshare.net/pjvdsande/rethink-your-architecture-with-cqrs
Convert dynamic content to static content, and regenerate that static content when the objects it depends on change. I saw one article that said more than 80 percent of the content on the Amazon website is static.

RETS data fetching problem

I am working on a real estate website that uses a RETS service to get data onto my local server.
But I have a small problem here. I can fetch data from RETS, which has about 3 lakh (300,000) records in its database, but I haven't found a way to fetch all of those records in batches of 50k at a time.
I didn't find any 'LIMIT' keyword in RETS, so how can I fetch 50k records at a time without 'LIMIT'?
Please help me.
RETS is not really much of a standard. It more closely resembles a pseudo-standard. It loosely defines an XML schema that describes real estate listings.
In version 1.x, the "standard" was composed of DTD documents. In 2.x, the "standard" uses XSD documents to describe the list.
http://www.rets.org/documentation
However, in practice, there is almost no consistency amongst implementers. Having connected to hundreds of "RETS Compliant" service providers, I'm convinced that not one of them is like any other one.
Furthermore, the 2.x "standard" has not changed in 3 years. It's an unmaintained, sloppy attempt at a standard. It (RETS) is often used as a business buzz word by non-technical people. In reality, it's just an arbitrary attempt at modeling real estate listing in XML.
Try asking the specific implementer for their documentation. Often, they don't have any. So, emailing the lead developer has frequently been helpful. Sometimes they'll provide a WSDL which will outline the supported calls. Often, the WSDL doesn't coincide with the actual service, so beware.
As for your specific question, try caching the results. Usually, the use of a limit on a RETS call is a sign of a direct dependency. As requests for your service increase, so does the load that your service puts on theirs, until it breaks (which will not be appreciated). Also, if their service goes down (even temporarily), yours will be interrupted as well. Most importantly, it will make the live requests to your pages really, really slow (especially if their system is slow at the time). The listings usually don't change frequently enough to worry about stale data, so caching for up to an hour is pretty acceptable.
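A minimal sketch of that kind of time-limited cache, in Python, might look like the following. The class name TTLCache, the fetch_listings stand-in, and the cache key are all assumptions for illustration; the real fetch would be whatever RETS search call you already make.

```python
import time


class TTLCache:
    """Keep fetched results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be exercised
        self.store = {}     # key -> (expiry time, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no call to the remote service
        value = fetch()      # missing or stale: call the remote service
        self.store[key] = (now + self.ttl, value)
        return value


calls = []


def fetch_listings():
    # Hypothetical stand-in for the real RETS search request.
    calls.append(1)
    return ["listing-1", "listing-2"]


fake_now = [0.0]
listing_cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])

listing_cache.get_or_fetch("city=Austin", fetch_listings)  # miss: fetches
listing_cache.get_or_fetch("city=Austin", fetch_listings)  # hit: served from cache
fake_now[0] = 3601.0
listing_cache.get_or_fetch("city=Austin", fetch_listings)  # expired: fetches again
print(len(calls))
```

Page requests then read from the cache, and the remote service is contacted at most once per hour per distinct query.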
Best of luck!
libRets provides support for generating a query with fetch limits:
http://www.crt.realtors.org/projects/rets/librets/documentation/api/classlibrets_1_1_search_request.html
But last I knew, the company Intereality either ignored RETS or outright didn't provide complete compatibility with it. The quickest way to know you're dealing with them is that they also thought it was a good idea to make all the "System" names for table fields numeric.
If you're lucky, you're using a Rapattoni-backed server; they do provide spec-compatible servers.
Last point: I can't for the life of me remember its name, but I used to use a free Java-based RETS tool to build valid queries (including offset/limit clauses), and that made it a tad easier to build automated fetchers for a client's batch processing system.
In RETS, if the count is more than the limit, we can download in batches, or we can remove that limit using regex while downloading.
The best way to solve the problem is to divide the data into small units for download, keeping the download limit in mind. As the field to divide on in MLS/IDX, I suggest the modification date and ListingDate.
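The date-based splitting can be sketched in Python as below. This only generates the date ranges; how a range is written into the query, and the exact name of the modification-date field, depend on the server and are not shown here. The function name date_windows and the sample dates are assumptions for illustration.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Split [start, end] into consecutive, non-overlapping windows of `days` days."""
    windows = []
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        windows.append((current, window_end))
        current = window_end + timedelta(days=1)
    return windows


windows = date_windows(date(2020, 1, 1), date(2020, 1, 25), days=10)
for low, high in windows:
    # Each window becomes one search restricted to that modification-date range.
    print(low.isoformat(), high.isoformat())
```

If a single window still returns more records than the server's limit, it can be split again with a smaller number of days.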

efficient serverside autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think wrong autocomplete can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side) and which preferably can fit into RAM (for best performance). So no browser-side JavaScript autocomplete (yui/jquery/dojo); I think there are enough topics about that on stackoverflow. But I could not find a good thread about server-side autocomplete on stackoverflow (maybe I did not look hard enough).
For example autocomplete names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL query, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (which could be handled) would be cool.
Using something like solr terms like for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this so users with big sites please tell me.
Maybe something like in memory redis trie which I also haven't tested performance on.
I also read in this thread about how to implement this in java (lucene and some library created by shilad)
What I would like to hear about is the implementation used by sites, and numbers on how well it can handle load, preferably with:
Link to implementation or code.
numbers to which you know it can scale.
It would be nice if it could be accessed by HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use the indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work as ever-decreasing subsets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire resultset for query 1 then there is no need to hit the database again for the following result sets as they are a sub set of your original query.
Depending on your data, this may open a further opportunity for optimisation.
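Here is a rough Python sketch of the subset idea: query once for the shortest prefix, then answer every longer prefix by filtering that stored result. The names list, the query_db stand-in, and the PrefixSession class are assumptions for illustration.

```python
names = ["alan", "alana", "albert", "alfred", "amy", "bob"]
db_queries = []


def query_db(prefix):
    # Stand-in for: SELECT name FROM users WHERE name LIKE '<prefix>%'
    db_queries.append(prefix)
    return [n for n in names if n.startswith(prefix)]


class PrefixSession:
    """Serve longer prefixes by filtering the result of a shorter one."""

    def __init__(self):
        self.base_prefix = None
        self.base_results = []

    def suggest(self, prefix):
        if self.base_prefix is None or not prefix.startswith(self.base_prefix):
            # New or unrelated prefix: one database query, result kept whole.
            self.base_prefix = prefix
            self.base_results = query_db(prefix)
        # Every longer prefix is a subset of the stored result.
        return [n for n in self.base_results if n.startswith(prefix)]


session = PrefixSession()
for_a = session.suggest("a")
for_al = session.suggest("al")
for_ala = session.suggest("ala")
print(for_a, for_al, for_ala, db_queries)
```

This only holds if the first result set is complete. If the first query is truncated (for example to the top 100 rows), a longer prefix may need a fresh query.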
I will not comply with your requirements, and obviously the numbers for scale will depend on hardware, size of the DB, architecture of the app, and several other items. You must test it yourself.
But I will tell you the method I've used with success:
Use a simple SQL query, for example: SELECT name FROM users WHERE name LIKE 'al%', but use TOP 100 to limit the number of results.
Cache the results and maintain a list of terms that are cached
When a new request comes in, first check in the list if you have the term (or part of the term cached).
Keep in mind that your cached results are limited, so you may need to run a SQL query if the term is still valid at the end of the cached result (by valid I mean that the last cached result still matches the term).
Hope it helps.
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
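To show what "making an index and then making simple calls to it" can look like when the data fits in RAM, here is a minimal Python sketch that uses a sorted list as the index and binary search for the prefix lookup. It is an illustration of the idea, not how Solr or any particular database implements it; the function name prefix_search and the sample names are assumptions.

```python
import bisect

# A sorted list plays the role of the index on the name column.
index = sorted(["miathe", "alfred", "alan", "bob", "albert", "alex"])


def prefix_search(index, prefix, limit=10):
    """Return up to `limit` entries starting with `prefix` using binary search."""
    lo = bisect.bisect_left(index, prefix)
    # "\uffff" sorts after any character expected in the data, so it marks the end of the range.
    hi = bisect.bisect_right(index, prefix + "\uffff")
    return index[lo:min(hi, lo + limit)]


all_al = prefix_search(index, "al")
top_two = prefix_search(index, "al", limit=2)
print(all_al, top_two)
```

Each lookup is two binary searches plus a slice, so the cost grows with the logarithm of the number of names rather than with the number itself.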
What I would want to know is "what you are trying to auto complete".
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale the system by replicating data. Trying to cache calls or predict results just makes things complicated, and doesn't get to the root of the problem (i.e. you can only take it so far, for example if every request misses the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.

Where is Pentaho Kettle's architecture?

Where can I find Pentaho Kettle architecture? I'm looking for a short wiki, design document, blog post, anything to give a good overview on how things work. This question is not meant for specific "how to" starting guides but rather a good view at the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines to using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation, every step produces a 'tuple', or a row with fields. Every field is a pair of data and metadata. Every step has an input and an output: it takes rows from its input, modifies them, and sends them to its outputs. In most cases all of this information is in memory. But steps read data in a streaming fashion (from JDBC or other sources), so typically only part of the data from a stream is in memory at any one time.
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. 'Mostly' because a transformation can contain very different steps, and some of them, like collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, like sending an email on success, loading some files from the net, or executing different transformations one by one.
How are the Collect steps implemented?
It depends only on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, which can be a cause of OutOfMemory exceptions. If the data is too big, consider replacing 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all the data).
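As an analogy only (this is Python generators, not how Kettle is implemented), the difference between a streaming step and a collecting step can be sketched like this. The step names and row fields are made up for illustration.

```python
def read_rows(n):
    """Input step: produce rows one at a time."""
    for i in range(n):
        yield {"id": i, "amount": (i * 7) % 5}


def add_doubled(rows):
    """Streaming step: handles one row, passes it on, keeps nothing."""
    for row in rows:
        row["doubled"] = row["amount"] * 2
        yield row


def sort_by_amount(rows):
    """Collecting step: must hold every row before it can emit the first one."""
    buffered = list(rows)  # this is where memory grows with the input size
    buffered.sort(key=lambda r: r["amount"])
    yield from buffered


streamed = add_doubled(read_rows(5))
first = next(streamed)  # only one row has been produced so far

result = list(sort_by_amount(add_doubled(read_rows(5))))
amounts = [r["amount"] for r in result]
print(first, amounts)
```

The streaming step needs memory for one row regardless of input size, while the sorting step needs memory for all of them, which is the situation that leads to the OutOfMemory problems described above.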
Any specific performance guidelines to using it?
There are a lot of them. It depends on the steps the transformation consists of and on the data sources used. I would rather speak about an exact scenario than about general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there may be some issues with those steps, such as some parameters not being saved, or the HTTP-FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the 'Do' is to understand a tool before you start using it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration that you can search for on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects.
http://forums.pentaho.com/
https://help.pentaho.com/Documentation

Resources