Can Sphinx help user-driven reporting requirements? - reporting

I'm creating a mysql-centric site for our client who wants to do custom reporting on their data (via http), something like doing selects in PHPMyAdmin with readonly access (*). Perhaps like reporting in Pentaho, though I've never used it.
My boss is saying Sphinx is the answer, but reading through the Sphinx docs, I don't see that it will make user-driven reporting any easier. Does Sphinx help with user driven reporting? If so, what do they call it?
(*) the system has about 25 tables that would likely be used in reporting, with anywhere between 2 and 50 fields. Reports would likely need up to maybe 5 joins per report.
update: I found http://code.google.com/p/sphinx-report/ so I guess Sphinx doesn't natively do it.

I can only answer for SphinxSEARCH -- dont really know much about the other sphinxes.
It itself doesnt contain features particully for writing reports. Its a general purpose search backend. Just like mysql is not particully designed for reports, but it can be be used as such.
In general think of sphinx as providing a very flexible, fast and scalable 'index' to a database table. Almost like creating a materialized view in mysql.
Because you can think very carefully what data to include in this index. And the index can be 'denormalized' to include all the data (via complex joins if required) - you can then run very fast queries against this index.
Sphinx also supports "GROUP BY" which makes it very useful for creating reports - and because the attribute data is held entirely in memory its generally very fast.
Basically sphinx will be very good to provide the backend for a 'dynamic' and 'interactive' report system. Where speed is required - particully if combined with allowing the user to filter reports (via keywords) - that is where sphinx shines.
Because of the upfront work required in designing this index, its less suited for 'flexible' reports. In building the index, will probably be a number of comprimises, so might be limiting in what reports will be possible. (at least without creating lots of differnt indexes)
short version: lots of upfront work to build the system, for very fast queries down the line.
Sphinx wont really do anything that mysql can't. But using sphinx as part of the system will allow performance to be improved (over a pure mysql solution).

Related

HANA CDS Views vs Calculation Views vs Table Functions

In SAP HANA I am used to create Calculation Views.
Previously I learned that Calculation Views (which after compilation are column-views) are to be prefered over Database-SQL-Views.
Now with CDS-Views I am not sure if this is still the case. Especially with regards to performance.
Also what is now the difference between a table function (which replaced scripted calculation views) and CDS Views?
Ok, this is a question that I believe requires some background to be answered.
A long, long time ago...
When SAP HANA was first developed, it heavily reused concepts and technology from other, already existing SAP products (TREX, P*TIME, MaxDB, Business Warehouse Accelerator).
One of the fundamental elements of the high query performance was (and is) the column store data-storage, which came in large parts from the TREX/BWA products. These products, in turn, had been solutions to very specific problems (full-text search for catalogs and speed-up of analytical queries from the SAP Business Warehouse data warehouse product).
Especially the BWA use case reflects in the column views of SAP HANA. Due to the limited use case of supporting SAP BW queries, no general SQL/relational query support was required (e.g. no arbitrary join-chain optimizations, no SQL features beyond SQL:92 etc.) whereas other, rather exotic features (like "vertical join") that could be used by SAP BW, were built into a query tool/engine ("engine" clearly was a very popular term with the SAP developers).
Once HANA proved successful as a platform to run SAP BW on, the next step was to add flexibility and make more general platforms like SAP Netweaver (the software that SAP's business solution products run on/with) working on SAP HANA. Now, SQL features were added and those required additional capabilities from the query optimizer and execution "engines".
Query optimization had to be flexible and fast and should lead to query performance that would still beat the existing RDBMS vendors' offering (which had been around for 40+ years).
This, clearly, is a hard problem and throwing is operational aspects of DB development (scaling, solution deployment, data federation, etc.).
This led to an overlapping development of different tools addressing different aspects of DB development.
SQL support and the underlying SQL optimizer were made more powerful, so much so, that (some) SQL queries could be as fast or faster than those modeled in calculation views. And since both of these "query frontends" eventually had to talk to the same internal data structures (row/column store) it was desirable to have just a single query optimizer, that would support all the different use cases.
Somewhere around HANA 1 SPS11/12 most calculation views started to be "unrolled" internally to feed into the common optimizer (that was what the "Execute in SQL Engine" flag was about).
I'd say, since then, the performance argument for using calculation views only holds in very specific circumstances.
I mentioned the overlapping developments and CDS (core data services) is one of them. The idea here is a very different one from SQL. While SQL gives you "the way to talk to the database", CDS wants to give your application a single data definition, that is used by the UI, the program logic and the data storage/query execution.
SQL != CDS
This probably needs some context (again): a major usage pattern of how SQL databases are used by application developers is that the application is written in some form of OO-implementation and the talking to the DB is left to a mapping layer/library (e.g. O/R-mappers). This means, that the knowledge of what the application is about (aka business process knowledge), is spread out in the application.
There is some information about it in the UI (labels, formatting, visibility, ...), some of it is in the application-object model (object dependencies, hierarchies, value domains...) and then some of it is in the queries against the database.
Such scattered knowledge/definition makes it hard to make changes consistent, which in turn, slows the development process and in turn prolongs the time until the application can run and deliver some positive outcome.
"Time-to-value" is the thing under optimization here as this is important for companies that give the promise of "success through innovation".
Ok, so this CDS thing is now part of the development models proposed by SAP and nearly en-passant also addresses topics like schema evolution and deployment of the data model. It is, in fact, independent of the actual database platform as shown in the CDS for ABAP variety.
How does this lead back to query performance? It does not really.
CDS' advantage is that one can provide more information about the data model than what is possible in HANA SQL.
Associations and joins with cardinality declaration (albeit now retrofitted to plain SQL) can enable the optimizer to use additional optimizations. Yet, the same optimizer and the same query execution "engines" are used here.
So, from a (query execution) performance point of view, it does not make a big difference, as long as no query semantics are required for which CDS does not have syntax (e.g. some window functions).
The main point of CDS really is about application development process performance and whether that works well with how you do development really depends on how much of it you can use.
Now for the question "scripted calc view" vs. "table function" vs. "CDS view".
Looking at these different object types from the point of "what can I do with them functionally?" will result in the observation "basically, the same".
The difference lies in how these can be optimized (scripted calc views cannot be generally unrolled into the global query to be optimized), and what one can do with the object once created.
Table functions allow for very easy reuse across multiple views and queries. They also provide the option to provide parameters into the function (similar to parameterized views) and in addition allow for imperative coding.
Functionally speaking, table functions are a kind of swiss-army knife; one can do nearly anything with them and they still can be part of global query optimization.
CDS views, as mentioned above, are nothing "special" in terms of query runtime or optimization. The main reason why CDS views are "a thing" is that with HANA SAP started to develop development models (such as XS, XSA, CAM) that revolve around "virtual data models".
The idea for those is that the structure of tables very often is stable and changes only little over time.
In a way, this is the "write-schema" of applications that enter the data into tables.
The "read-schema" is most of the time different from that. Queries re-combine the normalized data into records that the application can map into objects. This allows applications to look at the data differently than the original application.
With "virtual data models" these queries are baked into tangible development artifacts (the views) that can be reused across the application. In fact, these can be treated as if this was the database with its tables, presented in a way that makes sense for the application.
Once again, if that is something that is beneficial for your application development depends on how your application development looks like.
Can you use HANA without CDS? Absolutely, and there are many areas where CDS lacks (i.e. the limited syntax and feature mapping to HANA features) but it does have its merits.
Should you abandon calculation views?
I would not necessarily change existing developments if they still serve their purpose, but calculation views certainly are an odd development object. Training folks in using those and SQL most likely is overly expensive compared to just sticking to SQL.
Personally, I prefer the code-based SQL development (better tooling available, allows for easier comparison with other DBMS, doesn't require WEB IDE/HANA Studio).
The only thing, SQL based development does not provide is the extended annotations/semantic information used by the SAP analytic frontend tools (SAC & BO) - these really are specific to CDS and Information Models (calculation views) but barely used by other analytic tools.
And that's my take on it.
I would add that
Calculation Views are semantically richer. A SQL View does not know about measures, dimensions, hierarchies. https://blogs.sap.com/2019/08/26/what-is-the-difference-calcview-versus-sql-view/
The difference from the execution plan point of view is getting less and less. In Hana 2.0 SP4 most graphical calc views are turned internally into a single SQL statement to be executed by the SQL engine. So in that sense, using a CalcView gives you the additional information about the model plus the query performance of the SQL engine.
Lars' explanation of CDS is perfect. Nothing to add there.
But Imagine the situation when you can't create a table function because of limited license (aka runtime version). Just stay with scripted views.
The main advantage of Hana artifacts over CDS at present is the ability to use input parameters in complex cases to optimize resources and query performance - when your logic is pushed down into DB instead of AS / app. But many native SQL features are still not available in graphical views (for example - exists, JOIN on BETWEEN), so I think that 10 years later HANA artifacts will become "very rare".
So learn CDS syntax :)
Always a glad experience reading an article or pov from Lars, on any media (StackOverflow, SAP blog, article, twitter).
I just want to point out that another thing that I miss from the SQL scripting (SP, TF, SF) is the join optimization and SQL propagation that Information View has.
This is for me the focus to flexible models (apart from dynamic join that is only relevant for certain scenarios), to deliver one view that will perform depending on which columns the user or app will request.
For the semantics use, I can simply expose a TF inside an information view to add some.
You can tell me that CDS have both options available (join optimization, SQL propagation, and annotation) but for advanced or complicated scenarios (window functions not present at CDS), and also for non-SAP developers, it will be more simple and the go-to approach for beginners

Text Search Engine for Mailing List Archive Cataloging and Search

I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.
I see both commercial products and some open-source projects (like Lucene.NET)
Has anyone else done any similar kind of work?
I'm working in Win2k3 server now, so the immediate thought was to use ASP Classic or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider that as well. I'm not going to throw out something becuse of that ;)
Since you are setting up mail search you will need two things : a search engine and a database.
There are many search engines that offer what you need.
Sphinx
Solr(Lucene and Solr are merged now)
PostgreSQL(inbuilt search)
They provide advanced search tools like keywords, field-restricted search, boolean queries, phrase search and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Sphinx and Solr are pretty fast in search. Sphinx does full database search and also does partial indexing. Solr uses indexed based search, and is scalable with almost linear performance.
Second most important choice is the database where you store your mails. The mails will be in some format (schema), like fields in a table. It would be plain crazy not to use any format. It is not file search, right? Some search engines require particular DB's to work. Sphinx uses SQL databases only, Solr can be integrated with noSQL databases.
If you are not worried about scaling issues (you have thousands of users, having GB's of data, needing real time performance) then, you are fine with SQL databases. Otherwise you will have to use noSQL database with Solr.
SQL databases(like PostgreSQL) are simplest to work with, do what you need and require minimal setup/effort. Connectors will allow you to send query(mail search) from browser to your database.
Also you said you use Win2k3, you will have to switch to linux distribution to take advantage of these search engines. Win2k3 is slow, does not offer performance comparable to linux distros.
First, you should think about what you need.
What do you want to search in your e-mail archive? Just full text search in the the e-mail’s plein data? You will not get matches in mails that are base64 encoded then, for example. Do you need ‘fielded’ search? E.g.: search only in ‘subject’, ‘from’, ‘to’, ‘body’, ‘attachments’?
How do you want to provide access to search in the mails? Via a web page? On a command line? In some windows program?
If you didn’t yet, you should examine what your data looks like. Maybe ‘mbox’ format (one file with mail plain text concatenated) ‘maildir’ (a directory with many files, each contain one mail), or something else?
Setting up a search engine means to think about how data needs to be prepared:
E-Mails can contain different data inside. You will have to deal with base64 encoded data, character encodings as UTF-8 and attachments.
Usegroup mails may even be split across multiple e-mail messages.
If you want to search different ‘fields’ (‘Subject’, ‘date’, ‘body’) they need to be extracted.
Data needs to be prepared by linguistic means. You will need to find out which language the mails are in (if there are several) and process the data, eg. to make a search on mouse match on notions of mice and, perhaps, rats; or cursor and pointing device, depending on the topic of your mailing list.
Also think about:
Will there be updates to the data in future?
Are there deletes (including messages being relabeled later)?
Then compare the products (commercial or open source) that you favour how much of this they do provide already and what you will have to write yourself. Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.

Writing Native SQL for CRM 2011

I wonder about that can I write native SQL to add or delete operations instead of using Query Expression or FetchXML etc. I know Query Expression is very useful but my real concern is performance and I've thought writing SQL can be faster than the others.
To put it simply, using direct SQL (especially for create/update actions) is not supported. DO NOT DO IT!
The database model for CRM is complex and updates to data can have effects that extend beyond a simple update to a single table or two.
my real concern is performance
Have you validated this concern? Take a look at this link which documents performance tests on CRM. This is an enterprise-level, scalable platform. If you have proven performance issues then perhaps your code needs optimising or your kit needs beefing up...? :)
I totally agree with Greg's answer, this is just as a side note regarding performance. If you really are seeing "performance issues", maybe you should spend your time focusing on seeing if adding an index would be helpful. Although database indexes aren't included within CRM solutions, and will require manual propagation between dev, qa, staging, and prod environments, and are only supported for on-site installations, they can make some queries, 10 or 100 times faster... (of course if they are abused, they can slow everything down as well. Know what you are doing before you use them)
On top of what #Greg & #Daryl have said, when you say performance do you mean its quicker for you to write sql?
Regardless, CRM has some unique ways of doing things.
For example activate/deactivate a record, invoice related actions or the way CRM converts an Opportunity.
It's not that hard to do. You should spend some time in the sdk...

Microsoft Access equivalent of explain in MySQL

I'm working on a very large query, in a inherited application. This is a large insert-query, that takes 4 tables with well over a million records. I know, I would also rather have this in SQL-server, but there is no infrastructure at this customer to do this :-)
This query has worked for over a year. However, the source-tables keep on growing, and last week it threw the dreaded 'out of system resources'-error. Bummer...!
I think it is possible to optimize this query. Working in MySQL, I would use the explain-command, to see where optimalisation might occur. Is there a equivalent of this in Access? I cannot seem to find it....
kind regards,
Paul
Probably Jet ShowPlan is closest to what you want. You will have to set a registry key. Then query plan information gets dumped to a text file named SHOWPLAN.OUT. You can read about the details in this article on TechRepublic: Use Microsoft Jet's ShowPlan to write more efficient queries
Also try the Performance Analyzer wizard. You can ask it to examine your query alone, or also ask it to examine table or other queries used by that query.
If you haven't compacted the database recently, see whether that improves performance. Compacting also updates index statistics which allows the engine to make better decisions for the query plan.

Best strategy for retrieving large dynamically-specified tables on an ASP.NET page

Looking for a bit of advice on how to optimise one of our projects. We have a ASP.NET/C# system that retrieves data from a SQL2008 data and presents it on a DevExpress ASPxGridView. The data that's retrieved can come from one of a number of databases - all of which are slightly different and are being added and removed regularly. The user is presented with a list of live "companies", and the data is retrieved from the corresponding database.
At the moment, data is being retrieved using a standard SqlDataSource and a dynamically-created SQL SELECT statement. There are a few JOINs in the statement, as well as optional WHERE constraints, again dynamically-created depending on the database and the user's permission level.
All of this works great (honest!), apart from performance. When it comes to some databases, there are several hundreds of thousands of rows, and retrieving and paging through the data is quite slow (the databases are already properly indexed). I've therefore been looking at ways of speeding the system up, and it seems to boil down to two choices: XPO or LINQ.
LINQ seems to be the popular choice, but I'm not sure how easy it will be to implement with a system that is so dynamic in nature - would I need to create "definitions" for each database that LINQ could access? I'm also a bit unsure about creating the LINQ queries dynamically too, although looking at a few examples that part at least seems doable.
XPO, on the other hand, seems to allow me to create a XPO Data Source on the fly. However, I can't find too much information on how to JOIN to other tables.
Can anyone offer any advice on which method - if any - is the best to try and retro-fit into this project? Or is the dynamic SQL model currently used fundamentally different from LINQ and XPO and best left alone?
Before you go and change the whole way that your app talks to the database, have you had a look at the following:
Run your code through a performance profiler (such as Redgate's performance profiler), the results are often surprising.
If you are constructing the SQL string on the fly, are you using .Net best practices such as String.Concat("str1", "str2") instead of "str1" + "str2". Remember, multiple small gains add up to big gains.
Have you thought about having a summary table or database that is periodically updated (say every 15 mins, you might need to run a service to update this data automatically.) so that you are only hitting one database. New connections to databases are quiet expensive.
Have you looked at the query plans for the SQL that you are running. Today, I moved a dynamically created SQL string to a sproc (only 1 param changed) and shaved 5-10 seconds off the running time (it was being called 100-10000 times depending on some conditions).
Just a warning if you do use LINQ. I have seen some developers who have decided to use LINQ write more inefficient code because they did not know what they are doing (pulling 36,000 records when they needed to check for 1 for example). This things are very easily overlooked.
Just something to get you started on and hopefully there is something there that you haven't thought of.
Cheers,
Stu
As far as I understand you are talking about so called server mode when all data manipulations are done on the DB server instead of them to the web server and processing them there. In this mode grid works very fast with data sources that can contain hundreds thousands records. If you want to use this mode, you should either create the corresponding LINQ classes or XPO classes. If you decide to use LINQ based server mode, the LINQServerModeDataSource provides the Selecting event which can be used to set a custom IQueryable and KeyExpression. I would suggest that you use LINQ in your application. I hope, this information will be helpful to you.
I guess there are two points where performance might be tweaked in this case. I'll assume that you're accessing the database directly rather than through some kind of secondary layer.
First, you don't say how you're displaying the data itself. If you're loading thousands of records into a grid, that will take time no matter how fast everything else is. Obviously the trick here is to show a subset of the data and allow the user to page, etc. If you're not doing this then that might be a good place to start.
Second, you say that the tables are properly indexed. If this is the case, and assuming that you're not loading 1,000 records into the page at once and retreiving only subsets at a time, then you should be OK.
But, if you're only doing an ExecuteQuery() against an SQL connection to get a dataset back I don't see how Linq or anything else will help you. I'd say that the problem is obviously on the DB side.
So to solve the problem with the database you need to profile the different SELECT statements you're running against it, examine the query plan and identify the places where things are slowing down. You might want to start by using the SQL Server Profiler, but if you have a good DBA, sometimes just looking at the query plan (which you can get from Management Studio) is usually enough.

Resources