I want to walk through a directory's subdirectories and either convert or collect the .TIF images in each into a PDF. I have a directory structure like this:
folder
    item_one
        file1.TIF
        file2.TIF
        ...
        fileN.TIF
    item_two
        file1.TIF
        file2.TIF
        ...
    ...
I'm working on a Mac and considered using sips to convert my .TIF files to .PNG files and then pdfjoin to join all the .PNG files into a single .PDF file per folder.
I have used:
for filename in *.TIF; do sips -s format png "$filename" --out "${filename%.TIF}.png"; done
but this only works for the .TIF files in a single directory. How would one write a shell script that walks through a series of directories as well?
Once the .PNG files were created, I'd do essentially the same thing, but using:
pdfjoin --a4paper --fitpaper false --rotateoversize false *.png
Is this a valid way of doing this? Is there a better, more efficient way of performing such an action? Or am I being an idiot and should be doing this with some sort of software, like ImageMagick or something?
Try using the find command with the -exec switch to call your image-conversion command. Alternatively, instead of -exec, you could pipe the output of find to xargs. There is plenty of information online about using find; here's one example from Stack Overflow.
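A minimal sketch of that find-based walk, using the layout from the question. Since sips is macOS-only, this version just prints each conversion command; drop the echo on a Mac to run them for real:

```shell
#!/bin/sh
# Recreate the question's layout so the walk can be tried anywhere.
mkdir -p folder/item_one folder/item_two
touch folder/item_one/file1.TIF folder/item_one/file2.TIF folder/item_two/file1.TIF

# Visit every subdirectory and convert each .TIF in it to a .PNG.
find folder -type d | while IFS= read -r dir; do
  for f in "$dir"/*.TIF; do
    [ -e "$f" ] || continue                             # directory had no .TIF files
    echo sips -s format png "$f" --out "${f%.TIF}.png"  # drop "echo" to run for real
  done
done
```

The `${f%.TIF}.png` expansion strips the old extension, so `file1.TIF` becomes `file1.png` rather than `file1.TIF.png`.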
As far as the image conversion, I think that really depends on your requirements for speed and efficiency. If you've verified the process you described, and this is a one-time process, and it only takes seconds or minutes to run, then you're probably fine. On the other hand, if you need to do this frequently, then it might be worth investing the time to find a one-step conversion solution that takes less time than your current, two-pass solution.
Note that, instead of two passes, you may be able to pipe the output of sips to pdfjoin; however, that would require some investigation to verify.
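As one possible one-step route, ImageMagick (which the asker mentions) can read TIFs and write a multi-page PDF directly, skipping the PNG middle step entirely. A sketch, assuming ImageMagick's convert is on the PATH and the folder layout from the question:

```shell
#!/bin/sh
# One PDF per subfolder: convert reads all the TIFs in a directory and
# writes them as pages of a single PDF.
for dir in folder/*/; do
  set -- "$dir"*.TIF
  [ -e "$1" ] || continue              # skip folders with no .TIF files
  convert "$dir"*.TIF "${dir%/}.pdf"   # e.g. folder/item_one/ -> folder/item_one.pdf
done
```

Whether this beats the sips/pdfjoin pipeline on speed would need measuring, but it does collapse the two passes into one.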
How do I turn a raster GIS file (TIFF) into .json?
Bostock's example uses some JSON data to feed his d3.geom.contour plugin. But how do I convert a GIS raster, say a tiny 11 px by 15 px TIFF image, into JSON?
The final .JSON should look like this [EDIT: this is NOT the topojson format]:
[
[103,104,104,105,105,106,106,106,107,107,106],
[104,104,105,105,106,106,107,107,107,107,107],
[104,105,105,106,106,107,107,108,108,108,108],
[105,105,106,106,107,107,108,108,109,109,109],
[105,106,106,107,107,108,108,109,109,110,110],
[106,106,107,107,107,108,109,109,110,110,111],
[106,107,107,108,108,108,109,110,110,111,112],
[107,107,108,108,109,109,110,110,112,113,114],
[107,108,108,109,109,110,111,112,114,115,116],
[107,108,109,109,110,110,110,113,115,117,118],
[107,108,109,109,110,110,110,112,115,117,119],
[108,108,109,109,110,110,110,112,115,118,121],
[108,109,109,110,110,111,112,114,117,120,124],
[108,109,110,110,110,113,114,116,119,122,126],
[108,109,110,110,112,115,116,118,122,124,128]
]
Note: .shp to .json: there is already a tutorial on turning shapefiles into lighter topojson, but it is not useful here.
I don't think you can do it directly; it's probably a several-step process:
Process
Convert .tiff -> .shp
gdal_contour -a elev input.tif output.shp -i 10.0
Convert .shp -> .json (topojson)
topojson input.shp -o output.json
Resources
Be aware of the TopoJSON command line;
the tutorial "Working with Terrain Data in QGIS" points to the GDAL utilities;
gdal_contour builds vector contour lines from a raster elevation model.
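The two conversion steps above can be chained into one small script. This is a sketch: it assumes GDAL's gdal_contour and the topojson CLI are installed and that input.tif is your raster, and it falls back to printing the commands when the tools (or the input file) are absent:

```shell
#!/bin/sh
# Raster -> contour shapefile -> TopoJSON, as one small pipeline.
if command -v gdal_contour >/dev/null 2>&1 \
   && command -v topojson >/dev/null 2>&1 \
   && [ -f input.tif ]
then
  gdal_contour -a elev input.tif contours.shp -i 10.0  # a contour line every 10 elevation units
  topojson contours.shp -o contours.json               # vector contours -> TopoJSON
else
  echo "gdal_contour/topojson/input.tif not found; the two commands are:"
  echo "  gdal_contour -a elev input.tif contours.shp -i 10.0"
  echo "  topojson contours.shp -o contours.json"
fi
```

Note this yields TopoJSON contour lines, not the raw 2D array of pixel values shown in the question; as the answer says, there does not seem to be a direct one-step route to that format.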
I tried to follow many online tutorials to run the k-means example that ships with Mahout,
but have not yet managed to get meaningful output. The main problem I am facing is
the conversion from text file to sequence file and back.
When I followed the steps of the Mahout wiki for "Clustering of synthetic control data"
(https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html), I could run the clustering process (using $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job), and it produced some readable console output. But I want to get output files (as the size is large) from the clustering process.
The output files generated by Mahout clustering are all sequence files, and I can't convert them to readable files.
When I tried to do "clusterdump" ($MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10...) I got errors.
First it complains that the "seqFileDir" option is unexpected, so I guess either clusterdump has no "seqFileDir" option or I am missing something.
Trying to use Mahout the way "Mahout in Action" does seems tricky. I am not sure which classes (the "import ??" lines) are needed to compile that code.
Can you please suggest the steps to successfully RUN k-means on Mahout? Especially, how do I get readable output from the sequence files?
Regarding the 2nd question: you can obtain the source code for the book from its repository. The code in the master branch is for Mahout 0.5, while the code in the mahout-0.6 and mahout-0.7 branches is for the corresponding Mahout versions.
The source code is also posted on the book's site, so you can download it there (but that version is only for Mahout 0.5).
P.S. If you're reading the book right now, I recommend using Mahout 0.5 or 0.6, as all the code was checked against version 0.5 and will differ for other versions; this is especially true for the clustering code in Mahout 0.7.
As for seqFileDir in clusterdump: you need to use --input, not --seqFileDir.
I'm using Mahout 0.7. The call to clusterdump that I use to (for example) get a simple dump is:
mahout clusterdump --input output/clusters-9-final --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
Be sure that the path to the directory output/clusters-9-final above is correct for your system. Depending on the clustering algorithm, this directory may be different; look in the output directory and make sure you use the directory with the word "final" in it.
To dump data as CSV or GRAPH_ML, add the -of CSV argument to the above call. For example:
mahout clusterdump --input output/clusters-9-final -of CSV --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
Hope that helps.
I tried deploying the 20 Newsgroups example with Mahout, and it seems to work fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
each of which contains part-0000 files. I would like to read the contents of these files for better understanding, but the cat command doesn't seem to work; it prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the relevant Mahout jar to your HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar containing that class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 files (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file