Check RDF file syntax using Jena

I have written a .ttl file on a Mac, and I would like to use Apache Jena to check that it is valid. Which terminal commands do I need? I have already downloaded the Apache Jena package.

You can use riot with the --validate option: $PATH_TO_JENA/bin/riot --validate $PATH_TO_FILE
The riot help output lists the available options:
riot [--time] [--check|--noCheck] [--sink] [--base=IRI] [--out=FORMAT] [--compress] file ...
Parser control
--sink Parse but throw away output
--syntax=NAME Set syntax (otherwise syntax guessed from file extension)
--base=URI Set the base URI (does not apply to N-triples and N-Quads)
--check Addition checking of RDF terms
--strict Run with in strict mode
--validate Same as --sink --check --strict
--rdfs=file Apply some RDFS inference using the vocabulary in the file
--nocheck Turn off checking of RDF terms
--stop Stop parsing on encountering a bad RDF term
Output control
--output=FMT Output in the given format, streaming if possible.
--formatted=FMT Output, using pretty printing (consumes memory)
--stream=FMT Output, using a streaming format
--compress Compress the output with gzip
Time
--time Time the operation
Symbol definition
--set Set a configuration symbol to a value
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
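If you have several files to check, the riot call can be wrapped in a small function. This is only a sketch: JENA_HOME is an assumed variable pointing at the unpacked Jena directory, and it assumes riot returns a non-zero exit status when it reports errors, which is worth confirming for your Jena version.

```shell
# Validate one or more RDF files with riot and print a per-file result.
# JENA_HOME (assumed name) must point at the unpacked Jena directory.
validate_rdf() {
    riot_bin="${JENA_HOME:?set JENA_HOME to the Jena directory}/bin/riot"
    failed=0
    for f in "$@"; do
        # --validate parses and checks the file, then discards the output.
        if "$riot_bin" --validate "$f"; then
            echo "OK: $f"
        else
            echo "INVALID: $f"
            failed=1
        fi
    done
    # Non-zero if at least one file failed validation.
    return "$failed"
}
```

Usage would be something like validate_rdf myfile.ttl other.ttl, after setting JENA_HOME.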

Related

Use fmpp command line parameter in template

I have some configuration templates which use FMPP to generate the
real runtime config files based upon info in a csv and properties
file (defined in config.fmpp).
I want to be able to configure a second cluster server for the same task using the same set of templates and config.fmpp information. However, there are slight differences needed in the generated runtime config and I can do this if I know which server instance I am on ("serverA" or "serverB") using a standard fmpp variable like ${myserver}.
But there must only be one set of templates and FMPP config files so I need to somehow get the value of "myserver" from the runtime
environment in each server.
Some of the options I might have are:
pass value of myserver on the command line tool invocation (best way); or
get it from an environment variable.
Does anyone have an example of the code to do any of these and any suggestions of the best approach? Online reference would be great.
fmpp -S /home/me/sample-project/src -Param myserver:serverA
Environment settings:
fmpp v0.9.14
freemarker v2.3.19
Use the -D command line option (see --help):
-D, --data=<TDD> Creates shared data that all templates will see. <TDD> is the
Textual Data Definition, e.g.:
-D "properties(style.properties), onLine:true"
Note that paths like "style.properties" are relative to the
data root directory.
Like:
fmpp -S /home/me/sample-project/src -D myserver:serverA
Note that there's a space after the -D. (This follows the standard GNU command-line syntax rather than the java command-line syntax.)
This -D has nothing to do with Java's -D option.
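For the environment-variable option from the question, the shell can expand the variable into the -D argument before fmpp sees it. A minimal sketch, assuming an environment variable named MYSERVER (a made-up name) and a fallback of serverA when it is unset:

```shell
# Build the TDD argument for -D from the environment.
# MYSERVER is an assumed variable name; serverA is an assumed default.
fmpp_data_arg() {
    printf 'myserver:%s' "${MYSERVER:-serverA}"
}

# Run fmpp so that templates see ${myserver} with the per-server value.
run_fmpp() {
    fmpp -S /home/me/sample-project/src -D "$(fmpp_data_arg)"
}
```

With export MYSERVER=serverB set on the second server, run_fmpp passes myserver:serverB. This assumes the value is a plain word; anything with spaces or punctuation would need quoting inside the TDD.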
The documentation shows onLine:true, but such Boolean values are legacy and no longer accepted. Use online:yes to parse Boolean values.
For example:
fmpp \
-S /path/ \
--verbose \
-D "online:yes"
Then, within the template:
<p>
online: ${online}
</p>
Will result in:
online: yes
The --verbose command-line parameter is useful to show any errors when parsing the template.

What is the `#Name#` in command line?

I'm looking at the Tsung source code. The file tsung.sh.in contains the following line:
ERL_OPTS=" $ERL_DIST_PORTS -smp auto +P $MAX_PROCESS +A 16 +K true #ERL_OPTS# "
What does the #ERL_OPTS# mean?
This looks like a placeholder that gets substituted during the build process, probably by autoconf or a script that it drives.
Generally, a .in file is a template that gets preprocessed by some build script. A placeholder such as #IDENTIFIER# marks the place where the actual value has to be put in (autoconf itself writes its substitution variables as @IDENTIFIER@). The preprocessed version loses the .in extension, so tsung.sh.in becomes tsung.sh in this particular case.
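To illustrate the idea, here is roughly what such a preprocessing step amounts to, written with sed. This is a sketch of the general technique, not Tsung's actual build rule, and the replacement value in the usage line is hypothetical.

```shell
# Replace every #ERL_OPTS# placeholder read from stdin with the given value.
# Assumes the value contains no "|", "&" or backslash, which sed would interpret.
fill_placeholder() {
    sed "s|#ERL_OPTS#|$1|g"
}
```

A build script could then run something like fill_placeholder "-some-flag" < tsung.sh.in > tsung.sh.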

How to successfully run kmeans clustering using Mahout (esp. get human-readable output)

I have tried to follow many online tutorials to run the kmeans example that ships with Mahout,
but I have not yet managed to get meaningful output. The main problem I am facing is
the conversion from text file to sequence file and back.
When I followed the steps of the "Mahout Wiki" for "Clustering of synthetic control data"
(https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html) I could run the clustering process (using $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job), and that created some readable console output. But because the output is large, I want to get output files from the clustering process.
The output files generated by Mahout clustering are all sequence files, and I can't convert them to readable files.
When I tried to run "clusterdump" ($MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10...) I got errors.
First, it complains that the "seqFileDir" option is unexpected, so I guess either clusterdump has no "seqFileDir" option or I am missing something.
Trying to use Mahout the way "Mahout in Action" does seems tricky. I am not sure which classes ("import ??") are required to compile that code.
Can you please suggest the steps to successfully run kmeans on Mahout, and especially how to get readable output from sequence files?
Regarding the 2nd question: you can obtain the source code for the book from the repository. The code in the master branch is for Mahout 0.5, while the code in the mahout-0.6 and mahout-0.7 branches is for the corresponding Mahout versions.
The source code is also posted on the book's site, so you can download it there (but that version is only for Mahout 0.5).
P.S. If you're reading the book right now, I recommend using Mahout 0.5 or 0.6: all the code was checked against version 0.5, and for other versions it will differ. This is especially true for the clustering code in Mahout 0.7.
As for seqFileDir in clusterdump, you need to use --input, not --seqFileDir.
I'm using Mahout 0.7. The call to clusterdump that I use to (for example) get a simple dump is:
mahout clusterdump --input output/clusters-9-final --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
Be sure that the path to the directory output/clusters-9-final above is correct for your system. Depending on the clustering algorithm, this directory may be different. Look in the output directory and make sure you use the directory with the word "final" in it.
To dump data as CSV or GRAPH_ML, add the -of argument to the above call. For example, for CSV:
mahout clusterdump --input output/clusters-9-final -of CSV --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
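Since the number in the "final" directory name varies, you can look it up instead of hard-coding it. A small sketch, assuming the output directory is on the local file system (for output stored in HDFS you would list it with hadoop fs -ls instead); find_final_clusters is just a helper name made up here.

```shell
# Print the first clusters-*-final directory found under the given directory.
find_final_clusters() {
    for d in "$1"/clusters-*-final; do
        # The glob stays unexpanded when nothing matches, so test for a directory.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    # Nothing matched.
    return 1
}
```

The result can then be passed on, e.g. mahout clusterdump --input "$(find_final_clusters output)" followed by the remaining options shown above.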
Hope that helps.

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
each of which contains part-0000 files. I would like to read the contents of these files for a better understanding, but the cat command doesn't seem to work: it prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using hadoop's filesystem -text option. Change into the hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar that contains the corresponding class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
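Note that a plain export like the one above overwrites anything already in HADOOP_CLASSPATH. If you need to keep existing entries, a helper along these lines appends instead (the function name is made up for illustration):

```shell
# Append a jar to HADOOP_CLASSPATH, keeping any entries that are already set.
add_to_hadoop_classpath() {
    # ${VAR:+...} expands to "existing value plus colon" only if VAR is non-empty.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$1"
    export HADOOP_CLASSPATH
}
```

For example: add_to_hadoop_classpath /src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar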
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$: bin/mahout seqdumper -s \
~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 \
-o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file

dblatex ignores the --texstyle or -s option

I want to write an asciidoc document and convert it into a PDF document. However, I want to use a format style different from the default ones. To do so, I convert the txt file to docbook using asciidoc and then try to convert the resulting docbook XML to a PDF file using dblatex.
The idea is to set a particular TeX style for dblatex to obtain the desired PDF result. I copied the existing docbook.sty style, as recommended, to make a small style modification. The only change made in the ./docbook.sty file is \setlength{\textwidth}{18cm} to \setlength{\textwidth}{12cm}. However, when I run the command
dblatex --texstyle=./docbook.sty test.txt
Or the command
dblatex -s ./docbook.sty test.txt
Both produce the same result: no style change at all. No matter what modification I make to the ./docbook.sty file, it is not applied to the output; I always get a PDF with the default formatting. Do you have any idea where the problem is?
Thanks in advance.
I would recommend:
Copy the Dblatex docbook.sty to a new filename in your working directory which is "obviously yours" (e.g., mydbstyle.sty).
Continue to supply a full or relative path argument to the --texstyle option (e.g., /path/to/mydbstyle.sty or ./mydbstyle.sty). Without a path, mydbstyle.sty must be in a directory listed in the TEXINPUTS environment variable (which you likely have not explicitly set).
Within mydbstyle.sty, use the following directives to initialize your style:
\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{mydbstyle}[2013/02/15 DocBook Style]
\RequirePackageWithOptions{docbook}
% ...
% your LaTeX commands here
Pass a DocBook 4.5 XML file as an argument to Dblatex (in your example you are passing test.txt, which makes me suspect you are passing the AsciiDoc source file rather than the DocBook XML):
dblatex --texstyle=./mydbstyle.sty mybook.xml
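Put together, the two conversion steps could look like the sketch below. It assumes asciidoc and dblatex are both on your PATH and that the source file name has an extension; asciidoc_to_pdf is a helper name made up for illustration.

```shell
# Convert an AsciiDoc source to DocBook XML, then to PDF with a custom style.
# Arguments: AsciiDoc source file, LaTeX style file for dblatex.
asciidoc_to_pdf() {
    src=$1
    style=$2
    # Swap the source extension for .xml (test.txt becomes test.xml).
    xml="${src%.*}.xml"
    # Run dblatex only if the DocBook conversion succeeded.
    asciidoc -b docbook -o "$xml" "$src" &&
        dblatex --texstyle="$style" "$xml"
}
```

For example: asciidoc_to_pdf test.txt ./mydbstyle.sty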
