I'm new to eXist-db. What I want to do is store a large amount of data in XML format in a native XML database for fast processing (searching/updating/etc.). Unfortunately, the documentation doesn't clearly explain how to save/modify data in a persistent database (or back to XML files).
Below is roughly what I want to do in eXide. The lines I don't know how to write are marked with the questions Q1, Q2, and Q3:
xquery version "3.0";
let $data := doc('file:///c:/eXist/database.xml')
let $newdata := doc('file:///c:/import/newdata.xml')
(: Q1. How to do merging of data like below? :)
update insert $newdata into $data
(: Q2. How to save the changes back to database.xml? :)
doc('file:///c:/eXist/database.xml') := $data
let $result := <result>
{
for $t in $data/book/title
where $t/../publisher = 'XYZ Company'
return $t
}
</result>
(: Q3 How to save query result to a new file? :)
doc('file:///c:/export/XYZ Company Report.xml') := $result
Thanks in advance.
Your doc() calls all point to files on your file system, which indicates a misunderstanding about how one works with XML data in eXist-db: all manipulation of XML data happens inside the database. You first need to store the XML data in the database, then you can perform your manipulations, and finally, if needed, you can serialize the data back out to the file system.
To store data into the database, see Getting Data into eXist-db.
To merge data, see XQuery Update Extensions.
To serialize the data back out onto the file system, see the function documentation for file:serialize().
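Putting those pieces together, here is a rough sketch of what the whole thing could look like once both files have been stored in the database. The collection paths (/db/app, /db/import, /db/export) are just placeholders, and you should check your eXist version's function documentation for the exact xmldb:store() and file:serialize() signatures:
xquery version "3.0";
let $data    := doc('/db/app/database.xml')     (: stored copy of database.xml :)
let $newdata := doc('/db/import/newdata.xml')   (: stored copy of newdata.xml  :)
return (
    (: Q1/Q2: eXist's XQuery Update Extensions modify the stored document in
       place, so the change is persisted immediately - nothing to "save back" :)
    update insert $newdata/* into $data/*,

    (: Build the report; in practice you would usually run this as a second
       query, after the merge has completed :)
    let $result :=
        <result>{
            for $t in $data/book/title
            where $t/../publisher = 'XYZ Company'
            return $t
        }</result>
    return (
        (: Q3, option A: store the report as a new resource in the database
           (the /db/export collection must already exist) :)
        xmldb:store('/db/export', 'XYZ-Company-Report.xml', $result),

        (: Q3, option B: serialize it out to the file system (requires the
           query to run with sufficient permissions; path syntax may vary) :)
        file:serialize($result, 'C:/export/XYZ-Company-Report.xml', ())
    )
)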
In SAS, what does the documentation mean by "A data file is static; a SAS view is dynamic"? Or, in practice, how does the dynamic nature help when processing data in a PROC step?
I found this link http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000975382.htm. But it doesn't help much.
"A data file is static; a SAS view is dynamic. ...." - the dynamic aspect here is meant, IMHO, such that if the underlying data members (on which the view is based) changes, the view (the data it returns) is automatically updated, it returns fresh/latest data without the needed for a "refresh". This is simply because the view does not contain/store data, it's like a compiled data step that is "run" each time the view is accessed.
The actual sentence from the docs sounds somewhat "promising", but there's not much special behind it once you understand the nature of views.
I'd say the statement could also be read as a small warning - if you change/lose/damage the underlying data, the view won't return the original data anymore, so the dynamic nature can also be less safe.
Please note that if the underlying structures change (adding/dropping/modifying column properties), you'd need to recreate the views (both DATA step views and SQL views) to keep them valid and pick up the changes in the underlying data.
It allows you to avoid writing code to process the data every time.
Random example: if my IT dept chose to store my data in monthly files that followed a naming convention such as:
Y2014_M01
Y2014_M02
I could theoretically write a view that was
data Y2014/view=Y2014;
set Y2014:;
run;
And then when I needed to process the data I could simply refer to Y2014 as my data set. When it was updated monthly, I wouldn't need to update my code. A bit of a contrived example, but I hope it helps to explain. A concrete version of the same pattern, using SASHELP.CLASS:
data class_F;
set sashelp.class;
where sex='F';
run;
data class_M;
set sashelp.class;
where sex='M';
run;
data class/view=class;
set class:;
run;
proc means data=class;
run;
I know how to store and retrieve data using isolated storage. My problem is how to search the data that has already been stored. All my data is stored in a single file.
The user stores data every day; on a particular date he might make two entries, and on some other day none. In that case, how should I search for the info from the particular day I need?
Could you also explain how data is stored in isolated storage - in packets of data or in some other way? If I store two data sets in the same file, does it automatically move to a new line for the second one, or do I have to specify that?
Moreover, if I want to save data on the same line, does it automatically separate the values with a tab or some other character between the two data sets, or does the developer have to take care of this?
If the date is saved in the file name, then you can use the following code to search for it:
var appStorage = IsolatedStorageFile.GetUserStoreForApplication();
// GetFileNames takes a search pattern and returns an array of matching file names
string[] matches = appStorage.GetFileNames("the date you are looking for");
In answer to your other question: if you use .Write() when writing text to a file, it will create one long stream of data; if you use .WriteLine(), the text is written and a newline is added at the end.
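As a rough sketch of the writing side (the file name and the tab separator are just examples, not anything IsolatedStorage requires): when you put two values on one line you insert the separator yourself, while WriteLine takes care of the line break:
var appStorage = IsolatedStorageFile.GetUserStoreForApplication();
// Example: append today's entry to a file named after the date
using (var writer = new StreamWriter(appStorage.OpenFile("2014-01-15.txt", FileMode.Append, FileAccess.Write)))
{
    writer.Write("firstValue");
    writer.Write("\t");              // you add the separator between values yourself
    writer.WriteLine("secondValue"); // WriteLine appends the newline for you
}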
I'm not sure, as you weren't very clear, but just in case, here is a general procedure for reading files from IsolatedStorage:
var appStorage = IsolatedStorageFile.GetUserStoreForApplication();
using (StreamReader reader = new StreamReader(appStorage.OpenFile(fileName, FileMode.Open, FileAccess.Read)))
{
string fileContent = reader.ReadToEnd();
}
I'm doing an ETL process with Pentaho (Spoon / Kettle) where I'd like to read an XML file and store element values to a DB.
This works just fine with the "Get data from XML" component... but the XML file is quite big, several gigabytes, and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM parsers that need in-memory processing, and even the purging of parts of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it just doesn't seem to work out as planned. I'm testing this with an XML file which has only one element group. The file is read and then mapped/inserted into a table... but now I get multiple rows in the table where all the values are "undefined", and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
Flow goes like:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
var id;
if ( xml_data_type_description.equals("CHARACTERS") &&
xml_path.equals("/labels/label/id") ) {
id = xml_data_value; }
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How do I use StAX for reading an XML file and storing the element values to the DB?
I've been trying to look for examples but haven't found much. The above example uses a "Filter Rows" component before inserting the rows. I don't quite understand why it's being used - can't I just map the values I need? It might be that this problem occurs because I don't use, or don't know how to use, the Filter Rows component.
Cheers!
I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
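If you prefer, the same bucket computation could also go into a Modified Java Script Value step instead of the Calculator - just a sketch, assuming the sequence field from the Add Sequence step is named rownum:
// rownum comes from the Add Sequence step; 4 fields make up one logical record
var bucket = Math.floor((rownum - 1) / 4);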
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.
I have a LINQ query mapped with the Entity Framework that looks something like this:
image = this.Context.ImageSet
.Where(n => n.ImageId == imageId)
.Where(n => n.Albums.IsPublic == true)
.Single();
This returns a single image object and works as intended.
However, this query returns all the properties of my Image table in the DB.
Under normal circumstances, this would be fine but these images contain a lot of binary data that takes a very long time to return.
Basically, in its current state my LINQ query is doing:
Select ImageId, Name, Data
From Images
...
But I need a query that does this instead:
Select ImageId, Name
From Images
...
Notice I want to load everything except the Data. (I can get that data on a second async pass.)
Unfortunately, if using LINQ to SQL, there is no optimal solution.
You have 3 options:
You return the Entity, with Context tracking and all, in this case Image, with all fields
You choose your fields and return an anonymous type
You choose your fields and return a strongly typed custom class, but you lose tracking, if that's what you want.
I love LINQ to SQL, but that's the way it is.
My only solution for you would be to restructure your database: move all the large Data into a separate table and link to it from the Image table.
This way when returning Image you'd only return a key in the new DataID field, and then you could access that heavier Data when and if you needed it.
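Something along these lines, where the table and column names are only illustrative:
-- The Image table keeps only the lightweight columns plus a key to the blob
Create Table Images (ImageId int primary key, Name nvarchar(100), DataID int)
-- The heavy binary lives in its own table and is only queried when needed
Create Table ImageData (DataID int primary key, Data varbinary(max))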
cheers
This will create a new image with only those fields set. When you go back to get the Data for the images you select, I'd suggest going ahead and getting the full dataset instead of trying to merge it with the existing id/name data. The id/name fields are presumably small relative to the data and the code will be much simpler than trying to do the merge. Also, it may not be necessary to actually construct an Image object, using an anonymous type might suit your purposes just as well.
image = this.Context.ImageSet
.Where(n => n.ImageId == imageId)
.Where(n => n.Albums.IsPublic == true)
.Select(n => new Image { ImageId = n.ImageId, Name = n.Name })
.Single();
[If using LINQ to SQL] Within the DBML designer, there is an option to make individual table columns delay-loaded. Set this to true for your large binary field. Then that data is not loaded until it is actually used.
[Question for you all: does anyone know if the Entity Framework supports delay-loaded varbinary/varchar columns in VS 2010?]
Solution #2 (for Entity Framework or LINQ to SQL):
Create a view of the table that includes only the primary key and the varchar(max)/varbinary(max). Map that into EF.
Within your Entity Framework designer, delete the varbinary(max)/varchar(max) property from the table definition (leaving it defined only in the view). This should exclude the field from read/write operations to that table, though you might verify that with the logger.
Generally you'll access the data through the table that excludes the data blob; when you need the blob, you load a row from the view. I'm not sure how you would handle writes - you may be able to write to the view, you may need a stored procedure, or you can bust out a DBML file for that one table.
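A rough sketch of how the two mappings might then be used; the entity set and property names here are assumptions, not your actual model:
// Normal reads go through the table mapping, which no longer carries the blob
var image = this.Context.ImageSet
    .Where(n => n.ImageId == imageId)
    .Single();

// Only when the binary is actually needed, pull it from the view mapping
var data = this.Context.ImageDataViewSet
    .Where(v => v.ImageId == imageId)
    .Select(v => v.Data)
    .Single();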
You cannot do it with LINQ at least for now...
The best approach I know of is to create a View of the table without the large fields and use LINQ with that View.
Alternatively you could use the select new in the query expression...
var image =
(
from i in db.ImageSet
where i.ImageId == imageId && i.Albums.IsPublic
select new
{
ImageId = i.ImageId,
Name = i.Name
}
).Single();
LINQ query expressions actually get converted to lambda (method) syntax at compile time, but I generally prefer the query expression form because I find it more readable and understandable.
Thanks :)
I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
select new { FirstCode = record.Substring(0, 2),
OtherCode = record.Substring(18, 4) };
For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.
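For example, here's a rough sketch of that approach. The second file, field positions, and lengths are invented for illustration, and File.ReadLines (available since .NET 4) stands in for the LineReader class mentioned above:
// Stream the main file; one anonymous-typed record per line
var records = from line in File.ReadLines("data.txt")
              select new
              {
                  FirstCode = line.Substring(0, 2),
                  OtherCode = line.Substring(18, 4)
              };

// A lookup table from a second flat file, materialised once with ToList()
var lookups = (from line in File.ReadLines("lookup.txt")
               select new
               {
                   Code = line.Substring(0, 4),
                   Description = line.Substring(4).Trim()
               }).ToList();

// Plain LINQ to Objects join between the two
var report = from r in records
             join l in lookups on r.OtherCode equals l.Code
             select new { r.FirstCode, l.Description };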
I don't think there's a better way out of the box.
One could define a flat-file LINQ provider which would make the whole thing much simpler, but as far as I know, no one has yet.