Suppose I want to use an Oracle database, and I have some flat binary file containing structured data. Suppose I have a relational model that fits this data structure.
Does Oracle provide an API to implement some adapter to be able to relationally query this sequence of bytes as a set of views?
If so:
where should the data reside?
what version offers this feature?
If not:
is there any other RDBMS that offers such an API?
You can use an external table. The flat file has to reside on the database server's filesystem, in a location mapped by a DIRECTORY object. Normally external tables read plain-text files, but you can use the PREPROCESSOR directive (available in Oracle 11g and later) to name a script that converts the source file on the fly before it is loaded.
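A rough sketch of what that could look like (the directory objects, preprocessor script, and columns below are assumptions, not taken from the question):

CREATE TABLE flat_data_ext (
  id     NUMBER,
  name   VARCHAR2(100),
  amount NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'decode_binary.sh'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('records.bin')
)
REJECT LIMIT UNLIMITED;

-- decode_binary.sh would convert the binary file to delimited text on stdout;
-- the external table can then be wrapped in ordinary views:
CREATE OR REPLACE VIEW flat_data_v AS SELECT id, name, amount FROM flat_data_ext;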
You could also use UTL_FILE to read the file from disk and process it however you want inside the database, for instance with a pipelined table function that you query through the TABLE operator.
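Here is a minimal sketch of that approach, assuming a simple fixed-width record layout (the type, function, directory, and file names are all made up):

CREATE OR REPLACE TYPE flat_rec_t AS OBJECT (id NUMBER, name VARCHAR2(100));
/
CREATE OR REPLACE TYPE flat_rec_tab_t AS TABLE OF flat_rec_t;
/
CREATE OR REPLACE FUNCTION read_flat_file(p_dir IN VARCHAR2, p_file IN VARCHAR2)
  RETURN flat_rec_tab_t PIPELINED
IS
  l_file UTL_FILE.FILE_TYPE;
  l_line VARCHAR2(4000);
BEGIN
  -- For truly binary content you would open with 'rb' and use UTL_FILE.GET_RAW instead.
  l_file := UTL_FILE.FOPEN(p_dir, p_file, 'r');
  LOOP
    BEGIN
      UTL_FILE.GET_LINE(l_file, l_line);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN EXIT;  -- end of file
    END;
    -- Format-specific parsing goes here; fixed-width fields are assumed.
    PIPE ROW (flat_rec_t(TO_NUMBER(SUBSTR(l_line, 1, 10)), TRIM(SUBSTR(l_line, 11))));
  END LOOP;
  UTL_FILE.FCLOSE(l_file);
  RETURN;
END;
/
-- Expose it relationally as a view:
CREATE OR REPLACE VIEW flat_file_v AS
  SELECT * FROM TABLE(read_flat_file('DATA_DIR', 'records.bin'));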
I will explain my use case to help decide which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement involving joins across 5-10 tables with multiple clauses; there are around 20-30 such statements overall.
All of these extract queries may need to run multiple times a day, with the frequency varying from day to day, depending on how many times we receive data from the source system and other factors.
We are planning to use Kafka to publish a message that lets the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered (I can't simply trigger the NiFi flow based on an "incremental" column value; there may be update-only scenarios where no new rows are created in the tables).
How should I go about designing my NiFi flow? There are ExecuteSQL, GenerateTableFetch, ExecuteSQLRecord and QueryDatabaseTable processors available. Which one fits my requirement best?
Thanks!
I suggest you use ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to create a JSON message, parse that JSON, and create attributes from it. In the example I have, the SQL is created from a file; you can adjust it to create the query from the Kafka message instead.
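A rough sketch of how the pieces could fit together (the message fields, attribute names, tables, and columns below are all hypothetical): a ConsumeKafka processor picks up a JSON message such as {"table":"orders","run_date":"2021-06-01"}, an EvaluateJsonPath processor turns those fields into flowfile attributes, and the ExecuteSQL processor's SQL select query property references them through Expression Language:

SELECT o.*, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.load_date = '${run_date}'

If the 20-30 statements differ a lot, you can also put the complete SQL text in the message (or in a lookup file keyed by the message) and set the attribute to the whole query.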
I am developing an internal web application that needs a back end. The data stored is not really relational. Currently it is kept as XML documents that the application parses (with XQuery) to display HTML tables and other kinds of fields.
It is likely that I will have a few more types of XML documents and CSV (comma-separated values) files coming up. Given that, I could always back the data with a MySQL database, changing the process that generates the XML or CSV to insert straight into the database instead.
Is a NoSQL database a good choice in this scenario, or is MySQL still better? I do not see any need for clustering, high availability, or distributed processing.
Define "better".
I think the choice should be made based on how relational (MySQL) or document-based (NoSQL) your data is.
A good way to know is to analyze typical use cases. Better yet, write two prototypes and measure.
I am thinking of storing a bunch of data in XML files. Each file will hold information about a distinct entity, let's say contacts. Now I am trying to retrieve a contact based on some information, e.g. find all the contacts who live in CA. How do I search for this information? Can I use something like LINQ? I am looking at XElement, but does it work across multiple XML files?
Does converting to DataSets help? I am thinking my application could have a constructor that loads all the XML files into a DataSet and then run queries against that DataSet. If this is a good approach, can someone point me to examples/resources?
And most importantly, is this a good solution, or should I use a database? The reason I am using XML files is that I need to extend this solution to use XQuery in the back-end tiers (business logic, database) in the future, and I thought having the data in XML files would help.
Update: I already have the schema here: http://ideone.com/ZRPco
If you put the data in a database, it's easy to output it as XML. Don't start off in XML just because you're going to need to end up there. If you need to run queries on the data, then a database is almost certainly the best option.
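For instance, with the SQL/XML functions available in databases such as Oracle or PostgreSQL you can emit XML straight from a query (the contacts table and columns here are just illustrative; MySQL lacks XMLELEMENT, so there you would build the XML in the application layer):

SELECT XMLELEMENT(NAME "contact",
         XMLELEMENT(NAME "name",  name),
         XMLELEMENT(NAME "city",  city),
         XMLELEMENT(NAME "state", state))
FROM   contacts
WHERE  state = 'CA';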
You can use XML in your case. To take your example:
You may have 1000 employees in your company.
Each employee can have zero or more contacts (primary, secondary, etc.).
So every employee can have a contacts.xml document, stored in an XML database such as eXist, MarkLogic, or Berkeley DB XML.
Once the data is inside an XML database, the database can fetch all sorts of details based on whatever facet you want: contacts by zip code, by city, by name, and so on.
All you need to do is write the specific XQuery that mines the data for your request (in the MarkLogic XML database server, the terminology for this is faceted browsing).
XML databases are designed to handle this kind of information; view contacts as a mass of documents rather than as rows and columns.
Here are two reasons not to use XML ...
If the dataset is large, I would not use XML. You either have to use a DOM parser (slow on big data) or a SAX parser (faster, but you lose the ability to validate until the whole file has been read).
If the data is going to change: you have to rewrite the whole XML file in order to change any portion of it.
Here is the reason I would use XML ..
If the dataset is small, is naturally hierarchical, and needs to be viewable/editable in a text editor.
If you need XML as output, producing XML from a database is not a problem.
Lots of comments here, but nobody has much understanding of MarkLogic Server and XML databases, or of how powerful XML can be as a storage format when multiple types of indexes are applied (element, value, attribute, XML structure, XML node order, word, and phrase indexes).
MarkLogic can store and index billions of XML documents and allow sub-second searching across all of them, plus complex SUM/COUNT/MIN/MAX operations, etc.
I've used relational XML files with C#.NET LINQ-to-XML to do what the original poster wants (no MarkLogic at this point, just plain XML files and C# LINQ code that joins them together for whatever type of search I'm looking for). You may have an XML file for contacts:
<contacts>
  <contact id="1" companyid="1">
    <name></name>
    <address></address>
    <city></city>
    <state></state>
  </contact>
</contacts>
You may also want to join this to another XML file for companies:
<companies>
  <company id="1">
    <name></name>
    <address></address>
    <city></city>
    <state></state>
  </company>
</companies>
Here is some sample C#.NET LINQ-to-XML code that performs a LEFT OUTER JOIN between these two files:
using System.Linq;
using System.Xml.Linq;

XDocument xDocContacts = XDocument.Load("contacts.xml");
XDocument xDocCompanies = XDocument.Load("companies.xml");

// Root is already <contacts> / <companies>, so enumerate its child elements directly.
var results = from ct in xDocContacts.Root.Elements("contact")
              join cp in xDocCompanies.Root.Elements("company")
                  on (string)ct.Attribute("companyid") equals (string)cp.Attribute("id")
                  into joined
              from cp in joined.DefaultIfEmpty()   // left outer join: cp is null when no match
              select new { Contact = ct, Company = cp };

foreach (var item in results)
{
    // item.Company is null for contacts with no matching company
}
I've used this with 90 MB XML files joined against smaller 4-5 MB XML files, and complex searches with multiple WHERE conditions run in the 2-3 second range.
It definitely sounds like a database would be the correct solution. The two requirements I see here are that you need to run certain types of queries against the dataset and that you need the data in XML at a certain point. A SQL database will handle complex queries far better than XML files, and at the same time you can always convert the data to XML when you need it.
In my experience, using XML as a master data source is not a good idea; it becomes a pain at some point. Try SQLite instead: it is a powerful and portable relational database.
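For example, here is a minimal sketch of the contacts case from the question in SQLite (the table and column names are assumptions based on the sample XML):

CREATE TABLE contacts (
  id    INTEGER PRIMARY KEY,
  name  TEXT,
  city  TEXT,
  state TEXT
);

-- "Find all the contacts who live in CA"
SELECT name, city FROM contacts WHERE state = 'CA';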
We need a CSV viewer that can handle 10-15 million rows in a Windows environment, with some filtering capability on each column (regex or text searching is fine).
I strongly suggest using a database instead and running queries (e.g. with Access). With proper SQL queries you should be able to filter down to the columns and rows you need to see, without handling such huge files all at once. You may need to have someone write a script to load each row of the CSV file (and future CSV deliveries) into the database.
I don't want to be the end user of that app. Store the data in SQL. Surely you can define criteria to query on before generating a .csv file. Give the user an online interface with the column headers and filters to apply. Then generate a query based on the selected filters, providing the user only with the lines they need.
This will save many people time, headaches, and eyestrain.
We had this same issue and used a 'report builder' to build the criteria for the reports prior to actually generating the downloadable csv/Excel file.
As others have suggested, I would also choose a SQL database; it is already optimized to perform queries over large data sets. There are a couple of embedded databases you could use, such as SQLite or Firebird (embedded):
http://www.sqlite.org/
http://www.firebirdsql.org/manual/ufb-cs-embedded.html
You can easily import the CSV into the database with just a few lines of code and then build SQL queries, instead of writing your own solution for filtering large tabular data.
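As a rough sketch with the sqlite3 command-line shell (the file name and columns are made up):

.mode csv
.import big_export.csv rows
CREATE INDEX idx_rows_status ON rows(status);
SELECT * FROM rows WHERE status = 'FAILED' AND customer LIKE '%Acme%';

With an index on the columns you filter on most, queries over 10-15 million rows remain responsive, and nobody has to open the whole file in a viewer.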