RDB2RDF Approaches - etl

I read this W3C description of the two different approaches for RDB2RDF.
The ETL approach is pretty clear to me.
But I'm not sure I understood the "Virtual Mapping" approach. Is it a direct translation of SPARQL queries to SQL queries without any mapping file, or does virtual mapping use the Direct Mapping, with or without a mapping file?

A mapping has to be involved, whether it is the Direct Mapping (the default mapping of relational data to RDF) or a custom mapping written by somebody in R2RML.
With the mapping, you can do two things:
ETL: extract the relational data and transform it to RDF per the mapping, so that it can then be loaded into a triplestore
NoETL (the "virtual mapping" approach): view the relational database as if it were a triplestore, which means that SPARQL queries are translated into SQL queries per the mapping
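
To make the difference concrete, here is a minimal sketch of both options in Python, using an in-memory SQLite database as a stand-in for the relational source. The `person` table, the `http://example.com/` base IRI, the `MAPPING` dictionary and all function names are assumptions made up for illustration. The IRI scheme is simplified and is neither the exact W3C Direct Mapping nor R2RML, and the "virtual" function handles only a single triple pattern rather than full SPARQL.

```python
import sqlite3

# Hypothetical base IRI and toy schema, for illustration only.
BASE = "http://example.com/"


def setup():
    """Create a small in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, "Alice", "Berlin"), (2, "Bob", "Paris")],
    )
    return conn


# The mapping: predicate IRI -> (table, key column, value column).
# Both approaches below are driven by this same mapping.
MAPPING = {
    BASE + "person#name": ("person", "id", "name"),
    BASE + "person#city": ("person", "id", "city"),
}


def subject_iri(table, key):
    """Build a subject IRI for a row (simplified scheme)."""
    return f"{BASE}{table}/{key}"


def etl_dump(conn):
    """ETL: materialize every mapped triple so it can be loaded elsewhere."""
    triples = []
    for predicate, (table, key_col, val_col) in MAPPING.items():
        # Identifiers come from the trusted mapping, not from user input.
        for key, value in conn.execute(f"SELECT {key_col}, {val_col} FROM {table}"):
            triples.append((subject_iri(table, key), predicate, value))
    return triples


def virtual_query(conn, predicate, obj):
    """Virtual mapping: answer the pattern `?s <predicate> "obj"` by
    generating SQL from the mapping and running it on the source database."""
    table, key_col, val_col = MAPPING[predicate]
    sql = f"SELECT {key_col} FROM {table} WHERE {val_col} = ?"
    return [subject_iri(table, key) for (key,) in conn.execute(sql, (obj,))]
```

With ETL, `etl_dump(conn)` produces all triples up front and the relational database is no longer consulted. With the virtual approach, `virtual_query(conn, BASE + "person#city", "Paris")` builds and runs a SQL query at request time, so no triples are ever stored.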

The wording here is indeed weird. The ETL approach means that you convert the entire dataset into triples.
From the linked document, I understand that Virtual Mapping is indeed an approach where you translate SPARQL queries into SQL queries and run the latter directly on the source database.

Related

Is it OK to have multiple merge steps in an Excel Power query?

I have data from multiple sources - a combination of Excel (table and non table), csv and, sometimes, even a tsv.
I create queries for each data source and then I am bringing them together one step at a time or, actually, it's two steps: merge and then expand to bring in the fields I want for each data source.
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views, not full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter one having a comprehensive list for Best Practices (you can also just do a search for that term, there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Avoid data replication when using Elasticsearch + MySQL backend?

I'm working on a project where we have some legacy data in MySQL and now we want to deploy ES for better full text search.
We still want to use MySQL as the backend data storage because the current system is closely coupled with that.
It seems that most of the available solutions suggest syncing the data between the two, but this would result in storing all the documents twice in both ES and MySQL. Since some of the documents can be rather large, I'm wondering if there's a way to have only a single copy of the documents?
Thanks!
Impossible. This is analogous to asking the following: if you have legacy data in an Excel spreadsheet, can you use a MySQL database to query the data without also storing it in MySQL?
Elasticsearch is not just an application layer that interprets userland queries and turns them into database queries, it is itself a database system (in fact, it can be used as your primary data store, though it's not recommended due to various drawbacks). Its search functionality fundamentally depends on how its own backing storage is organized. Elasticsearch cannot query other databases.
You should consider what portions of your data actually need to be stored in Elasticsearch, i.e. what fields need text completion. You will need to build a component which syncs that view of the data between Elasticsearch and your MySQL database.
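
A minimal sketch of such a sync step follows, assuming a hypothetical `documents` table with `id`, `title`, `body` and a large `attachment` column that does not need to be searchable. SQLite stands in for MySQL so that the example is self-contained, and the table, column, index and function names are all illustrative. The code only builds the newline-delimited JSON body that Elasticsearch's bulk endpoint accepts; sending it is left out.

```python
import json
import sqlite3


def fetch_searchable_rows(conn):
    """Select only the fields that need full-text search, not the whole row."""
    return conn.execute("SELECT id, title, body FROM documents").fetchall()


def build_bulk_body(rows, index_name):
    """Build an NDJSON bulk request body: one action line plus one source
    line per document, terminated by a trailing newline."""
    lines = []
    for doc_id, title, body in rows:
        # Reuse the MySQL primary key as the document id so that search
        # hits can be resolved back to the full row in the database.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        lines.append(json.dumps({"title": title, "body": body}))
    return "\n".join(lines) + "\n"
```

Because the document id matches the primary key, the application can search in Elasticsearch and then load the complete record, including the large columns, from MySQL. The searchable fields are still stored twice, but the rest of each document is not.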

BIRT Scripted Data Source using existing JDBC DataSource

I know that my overall problem is generally approached using two of the more common solutions such as a join data set or a sub-table, sub-report. I have looked at those and I am not sure this will work effectively.
Background:
The JDBC data source has local data, which includes a series of IDs that each reference a record in a master data repository interfaced via a web service. This is where the need for a scripted data source arises. The data can be filtered on attributes within the local JDBC data and/or on the extended data from the web service. The complication is that my only interface is the ID argument to the web service.
Ideal Solution:
Aside from creating a reporting table or other truly desirable scenarios, I am looking to create a unified data source through a single scripted data source that will handle all the complexities. This hopefully leaves the report generation and parameter creation a bit cleaner. The idea is to leverage the JDBC query as well as the web service queries in the scripted data source to do the filtering and joins and create that single unified view.
I tried using the following code as a reference for reusing the existing JDBC connection in the BIRT report definition to execute the query. However, I think my breakdown of what should go in open vs. fetch may be the problem, given that the code came from beforeFactory and served a completely different purpose. Truth is, I see no errors; it just returns 0 records.
a link
I have also found a code snippet to dynamically load a JDBC connection but that seems a bit obtuse and a ton of overhead for what I am needing to do. a link
In short: how in all-that-is-holy do you simply run a query against a database within a scripted data source, if you wanted to? The merit of doing that is another issue, but technically, how?
Thanks in Advance!

Is a NoSQL database a good solution for a small application?

I am developing an internal web application that needs a back end. The data stored is not really RDBMS type. Currently it is in XML document fashion that the application parses (XQuery) to display html tables and other type of fields.
It is likely that I will have a few more types of XML documents and CSV (comma-separated values) files coming up. Given the scenario, I can always back the data with a MySQL database, changing the process that generates the XML or CSV so that it inserts straight into the database.
Is a NoSQL database a good choice in this scenario, or is MySQL still better? I do not see any need for clustering, high availability or distributed processing.
Define "better".
I think the choice should be made based on how relational (MySQL) or document-based (NoSQL) your data is.
A good way to know is to analyze typical use cases. Better yet, write two prototypes and measure.

NHibernate Criteria query on in-memory collection of entities

I would like to apply a Criteria query to an in-memory collection of entities, instead of to the database. Is this possible? Can the Criteria API work like LINQ? Or, alternatively, can a Criteria query be converted to a LINQ query?
Thanks!
I don't believe you can use Criteria to query an in-memory collection, and come to think of it, it doesn't seem to make much sense. If I understand correctly, you've already queried your database. I'd suggest either tuning your original query (whichever method you choose) to include all of your filters, or using LINQ (as you suggested) to refine your results.
Also, what's your reasoning for wanting to query from memory?
It sounds like you're rolling your own caching mechanism. I would highly recommend checking out NHibernate's 2nd level cache. It handles many complex scenarios gracefully such as invalidating query results on updates to the underlying tables.
http://ayende.com/Blog/archive/2009/04/24/nhibernate-2nd-level-cache.aspx
