XSD - validating values from an external dictionary file

I would like to define a schema for a document like:
...
<car>
  <make>ford</make>
  <model>mondeo</model>
</car>
...
The problem is that I would like to constrain the possible values (so ford/mondeo or audi/a4 would be valid make/model combinations, but audi/mondeo would not) from an external data dictionary. When new car models need to be added, only the external data file would change; the XSD schema would remain the same.
Is this possible at all? I have looked at the key/keyref constraints; I see I can use them within a single document, but that is not what I'm looking for. I don't want to repeat the full data dictionary in every document instance; I would prefer the data file to effectively form part of the schema.

That is not possible in XML Schema 1.0.
XML Schema 1.1 will add some support for expressing this kind of constraint (although, AFAIK, not against external files), but it is not yet a W3C Recommendation.
It is possible to implement this now with Schematron, possibly embedded in XML Schema.
However, there has already been work in this area with usable results; see the OASIS Code Lists work:
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=codelist
More details can be found here:
http://www.genericode.org/
This is used in the OASIS Universal Business Language (UBL)
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ubl
Best Regards,
George

Validate a GraphQL schema against another reference schema

I'm not quite sure of the wording I should be searching for on this.
I have a GraphQL schema which wraps a group of services using graphql-link-schema to perform the data resolution on the client side. The schema is intended to be built against a separate reference schema. How can I programmatically validate that my implementation matches the reference?
For bonus points- is it possible to determine whether a schema is a superset of another?
Thanks in advance (:
It's an interesting use case, but it's a bit unclear how validation like that would work. What causes validation to fail? Any differences between the two schemas? Extra types? Extra fields on existing types? Differences in return types? Differences in arguments or argument types?
Depending on your answer to the above questions, though, you may be able to cobble together your own validation function using the utility functions graphql-js ships alongside findBreakingChanges. Besides the main findBreakingChanges function, some of the utility functions available in that module are:
findRemovedTypes
findTypesThatChangedKind
findFieldsThatChangedTypeOnObjectOrInterfaceTypes
findFieldsThatChangedTypeOnInputObjectTypes
findTypesRemovedFromUnions
findValuesRemovedFromEnums
findArgChanges
findInterfacesRemovedFromObjectTypes
If you have a reference or base schema available, though, rather than validating against it, you might also consider extending it when building the second schema. In doing so, you would effectively guarantee that the second schema matches the first except in whatever ways you intentionally deviate from it (by extending existing types, etc.). You could use extendSchema for relatively simple changes, or something like graphql-tools' mergeSchemas for more complicated changes.

What does code generation mean in Avro?

Kindly excuse me if this question is silly. I am finding it difficult to grasp what it really means. When I read 'Hadoop: The Definitive Guide', it says that the best advantage of Avro is that code generation is optional. This link has a program for Avro serialization/deserialization with and without code generation. Could someone help me understand exactly what with/without code generation means, and the real context of the same?
It's not a silly question -- it's actually a very important aspect of Avro.
With code-generation usually means that before compiling your Java application, you have an Avro schema available. You, as a developer, will use an Avro compiler to generate a class for each record in the schema and you use these classes in your application.
In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc, and then uses the student_marks class directly.
In this case, each instance of the class student_marks inherits from SpecificRecord, with custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).
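For instance, reading an Avro container file with the generated class might look roughly like this (a sketch only; the file name "students.avro" and the exact getter name are assumptions based on the schema above):
// Sketch: read with the generated student_marks class (a SpecificRecord);
// classes are from org.apache.avro.specific and org.apache.avro.file.
DatumReader<student_marks> datumReader = new SpecificDatumReader<student_marks>(student_marks.class);
DataFileReader<student_marks> fileReader =
    new DataFileReader<student_marks>(new File("students.avro"), datumReader);
for (student_marks record : fileReader) {
    System.out.println(record.getStudentId()); // typed accessor generated from the schema
}
fileReader.close();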
Without code generation usually means that your application isn't tied to one specific schema known at compile time (for example, it can handle many different kinds of data).
In this case, no student_marks class is generated, but you can still read Avro records from an Avro container file. You won't have instances of student_marks, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can use get("student_id") or get(0).
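Reading the same container file without any generated classes would look roughly like this (again a sketch, under the same assumptions):
// Sketch: read into GenericRecord instances - no generated classes needed;
// the reader picks up the writer's schema stored in the container file.
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> fileReader =
    new DataFileReader<GenericRecord>(new File("students.avro"), datumReader);
for (GenericRecord record : fileReader) {
    System.out.println(record.get("student_id")); // access by field name (or position)
}
fileReader.close();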
Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.
A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.

How do I get CsvDozerBeanWriter to pull column headers from Dozer XML mapping files

I'm writing a feature to produce CSV snapshots of screen data.
I need this to be data-driven. Thus I need to avoid hard-coding each snapshot in Java, but rather load it from a data source such as an XML file or a database. The data is contained in Java beans.
I'm using Super CSV with the Dozer extension, both at version 2.1.0.
This combination seems perfect since I can code the mappings from the beans to the columns in Dozer XML mapping files.
This works well for the data, but I have not found a way to specify the strings to use for the CSV's column headers other than to hard-code them in Java as is done in all of the examples and test cases I've looked at. That is not data-driven.
Is there a way for me to code the column headers in the mapper file? Or even to extract them from the mapper file, construct a List, and pass it to the writeHeader() method?
I think it would be OK to just use the bean property names as the headers, although the ideal situation would be some additional metadata notation in the XML's <field> tag that specifies the header.
I'd have posted this on SourceForge, but I'm getting a 500 error there.
I'm a Super CSV developer. You're the first person I've heard of who's using CsvDozerBeanWriter with their own DozerBeanMapper - great to hear that feature is useful :)
So what's the goal of being 'data driven'? It sounds like you want your code to be really generic, so you can alter the CSV just by changing the XML. Is that right? Of course, you can't configure the cell processors dynamically...or are you trying to do that too!!??
I'd take a look at the MappingMetadata API of Dozer, which you can access by calling getMappingMetadata() on the DozerBeanMapper. I've never used it, but it looks like you could derive the column names this way (though you'd probably be limited to the field names).
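Untested, but from the metadata API I'd expect it to look roughly like this (mapper is your existing DozerBeanMapper, and whether you want the source or destination names depends on which side of your mapping holds the bean fields):
// Sketch only - collect the bean-side field names from the Dozer mapping metadata
List<String> headers = new ArrayList<String>();
for (ClassMappingMetadata classMapping : mapper.getMappingMetadata().getClassMappings()) {
    for (FieldMappingMetadata field : classMapping.getFieldMappings()) {
        headers.add(field.getSourceName()); // or getDestinationName(), depending on your mapping
    }
}
beanWriter.writeHeader(headers.toArray(new String[headers.size()]));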
Otherwise, you'll have to parse the XML file yourself (I'd probably use XPath). You'd have to do it this way if you want to use some other metadata in the XML for the column name.
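If you go the XPath route, something along these lines should work (untested; the expression assumes a standard Dozer mapping file where <field><a> holds the bean property name, and local-name() sidesteps the Dozer namespace - "yourDozerMapping.xml" is a placeholder):
// Sketch only - pull the <a> (bean-side) field names straight out of the mapping XML
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList fields = (NodeList) xpath.evaluate(
        "//*[local-name()='field']/*[local-name()='a']",
        new InputSource("yourDozerMapping.xml"), XPathConstants.NODESET);
String[] header = new String[fields.getLength()];
for (int i = 0; i < fields.getLength(); i++) {
    header[i] = fields.item(i).getTextContent().trim();
}
beanWriter.writeHeader(header);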

How can I find all elements in an XML Schema whose value is specified as a QName?

Suppose that...
I have a complex XML schema, one that imports/includes other schema files, which in turn import/include even more schema files.
I want to find all the elements in this XML schema that have a value (i.e., text node) that is declared to be of type QName.
I want the location (path) of these elements to be expressed as XPath statements (e.g., /foo/bar).
If I'm writing a Java application, what's the right technology for this job? Is it a schema object model like XSOM? Is it the Java XPath API? Something else?
Edit: For those who want a jumpstart on accessing the SCM in Saxon (per Michael Kay's recommendation below), here's some Java code (sans exception handling):
// Load the XSD into Saxon (schema-aware processor required)
Processor processor = new Processor(true);
SchemaManager schemaManager = processor.getSchemaManager();
SAXSource saxSource = new SAXSource(new InputSource("path/to/yourSchema.xsd"));
schemaManager.load(saxSource);
// Export the SCM
XdmDestination destination = new XdmDestination();
schemaManager.exportComponents(destination);
XdmNode xdmNode = destination.getXdmNode();
System.out.println(xdmNode.toString());
Querying schema documents is a difficult thing to get right, because in XSD there are so many ways of saying the same thing: for example named model groups and attribute groups complicate your task considerably.
If you're looking for types derived from QName as well as QName itself, then it really gets quite difficult.
Doing it on a "compiled" schema of some kind is therefore much easier than doing it on raw schema documents.
Using XSOM is one approach, though it doesn't have a query capability IIRC. Another approach is to use Saxon's SCM output: this is a representation of the compiled "schema component model" in XML form; being the compiled schema you don't have to worry about all the complexities of xs:include, xs:redefine, etc, while being XML means you can use XQuery on it. (I would recommend XQuery rather than XPath because there will be a lot of joins involved, including recursive joins for which you need user-defined functions.)
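For example, once the SCM has been exported to a file (as in the code above), you could run an XQuery over it with s9api along these lines (a sketch only: the scm element and attribute names, the '#QName' reference style, and the namespace URI are assumptions to be checked against the SCM file Saxon actually produces):
// Sketch only - query the exported SCM for element declarations typed as xs:QName
Processor processor = new Processor(false); // schema-aware features not needed just to query the SCM XML
XQueryCompiler compiler = processor.newXQueryCompiler();
compiler.declareNamespace("scm", "http://ns.saxonica.com/schema-component-model"); // assumed SCM namespace
XQueryExecutable exec = compiler.compile(
    "//scm:element[@type = '#QName']/@name/string()"); // extend to chase type references for QName-derived types
XQueryEvaluator evaluator = exec.load();
evaluator.setSource(new SAXSource(new InputSource("yourSchema.scm")));
for (XdmItem item : evaluator.evaluate()) {
    System.out.println(item.getStringValue());
}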

Appropriate data structure for flat file processing?

Essentially, I have to get a flat file into a database. The flat files come in with the first two characters on each line indicating which type of record it is.
Do I create a class for each record type with properties matching the fields in the record? Should I just use arrays?
I want to load the data into some sort of data structure before saving it in the database so that I can use unit tests to verify that the data was loaded correctly.
Here's a sample of what I have to work with (BAI2 bank statements):
01,121000358,CLIENT,050312,0213,1,80,1,2/
02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/
03,600812345678,GBP,fab1,111319005,,V,050314,0000/
88,fab2,113781251,,V,050315,0000,fab3,113781251,,V,050316,0000/
88,fab4,113781251,,V,050317,0000,fab5,113781251,,V,050318,0000/
88,010,0,,,015,0,,,045,0,,,100,302982205,,,400,302982205,,/
16,169,57626223,V,050311,0000,102 0101857345,/
88,LLOYDS TSB BANK PL 779300 99129797
88,TRF/REF 6008ABS12300015439
88,102 0101857345 K BANK GIRO CREDIT
88,/IVD-11 MAR
49,1778372829,90/
98,1778372839,1,91/
99,1778372839,1,92
I'd recommend creating classes (or structs, or whatever value type your language supports), as
record.ClientReference
is so much more descriptive than
record[0]
and, if you're using the (wonderful!) FileHelpers Library, then your terms are pretty much dictated for you.
Validation logic usually has at least two levels: the coarser level being "well-formatted" and the finer level being "correct data".
There are a few separate problems here. One issue is that of simply verifying the data, or writing tests to make sure that your parsing is accurate. A simple way to do this is to parse into a class that accepts a given range of values, and throws the appropriate error if not,
e.g.
public void setField1(int i)
{
    if (i > 100) throw new InvalidDataException("field1 out of range: " + i);
    this.field1 = i;
}
Creating different classes for each record type is something you might want to do if the parsing logic is significantly different for different codes, so you don't have conditional logic like
public void setField2(String s)
{
if (field1==88 && s.equals ...
else if (field2==22 && s
}
yechh.
When I have had to load this kind of data in the past, I have put it all into a work table with the first two characters in one field and the rest in another. Then I have parsed it out to the appropriate other work tables based on the first two characters. Then I have done any cleanup and validation before inserting the data from the second set of work tables into the database.
In SQL Server you can do this through a DTS (2000) or SSIS package. Using SSIS, you may be able to process the data on the fly without storing it in work tables first, but the process is similar: use the first two characters to determine the data flow branch to use, then parse the rest of the record into some type of holding mechanism, and then clean up and validate before inserting. I'm sure other databases also have some type of mechanism for importing data and would use a similar process.
I agree that if your data format has any sort of complexity you should create a set of custom classes to parse and hold the data, perform validation, and do any other appropriate model tasks (for instance, return a human-readable description, although some would argue this would be better placed in a separate view class). This would probably be a good situation to use inheritance, where you have a parent class (possibly abstract) that defines the properties and methods common to all types of records, and each child class can override these methods to provide its own parsing and validation if necessary, or add its own properties and methods.
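A bare-bones sketch of that shape (the record chosen, field positions, and class names here are illustrative, not a faithful BAI2 parser):
// Parent class: code common to every record type; the first two characters pick the subclass.
public abstract class StatementRecord {
    protected final String rawLine;

    protected StatementRecord(String rawLine) {
        this.rawLine = rawLine;
    }

    public abstract void validate();

    public static StatementRecord parse(String line) {
        String code = line.substring(0, 2);
        if ("03".equals(code)) {
            return new AccountIdentifierRecord(line);
        }
        // ... one branch (or a lookup table) per record code
        throw new IllegalArgumentException("Unknown record code: " + code);
    }
}

// One child class per record type, each with its own fields, parsing and validation.
class AccountIdentifierRecord extends StatementRecord {
    private final String accountNumber;
    private final String currency;

    AccountIdentifierRecord(String line) {
        super(line);
        String[] fields = line.split(",");
        this.accountNumber = fields[1]; // e.g. 600812345678 in the sample above
        this.currency = fields[2];      // e.g. GBP
    }

    @Override
    public void validate() {
        if (currency.length() != 3) {
            throw new IllegalArgumentException("Bad currency code: " + currency);
        }
    }
}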
Creating a class for each type of row would be a better solution than using Arrays.
That said, however, in the past I have used ArrayLists of Hashtables to accomplish the same thing. Each item in the ArrayList is a row, and each entry in the Hashtable is a key/value pair representing column name and cell value.
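For illustration, that looks roughly like this with the modern List/Map equivalents (the column names are made up for the BAI2 sample above):
// One Map per row, keyed by column name instead of positional index
List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
Map<String, String> row = new HashMap<String, String>();
row.put("RecordCode", "16");
row.put("TypeCode", "169");
row.put("Amount", "57626223");
rows.add(row);
// Later: rows.get(0).get("Amount") rather than remembering which index held the amount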
Why not start by designing the database that will hold the data? Then you can use the Entity Framework to generate the classes for you.
Here's a wacky idea:
If you were working in Perl, you could use DBD::CSV to read data from your flat file, provided you gave it the correct values for the separator and EOL characters. You'd then read rows from the flat file by means of SQL statements; DBI will turn them into standard Perl data structures for you, and you can run whatever validation logic you like. Once each row passes all the validation tests, you'd be able to write it into the destination database using DBD::whatever.
-steve
