I've been compiling data to build a GTFS feed, and I want to know which GIS systems people are using to create shapes.txt. What problems are you running into, and what are the benefits of your workflow?
Here is my question:
Is there any service or technology for running parallel algorithms on many computers without knowing them individually?
For example: I write a parallel algorithm. My friends install a simple client app, and whenever they have an internet connection they can contribute their spare processor capacity to my calculation. I would like to see each of them as an additional core in my CPU.
If there is no such technology, are there any unsolvable problems in developing one? (I know there must be a lot of problems with code transfer, operating systems, and compatibility.)
I believe you can use BOINC to set up your own volunteer computing project, but I have no first-hand experience of it to report.
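If you just want a feel for the pattern such projects follow, here is a toy sketch in Python (this is not BOINC's API; the coordinator URL and the JSON work-unit format are made up for illustration). Each volunteer runs a small client that polls a coordinator for work, computes on spare cycles, and posts the result back:

```python
# Toy volunteer-computing client: poll a (hypothetical) coordinator for
# work units, compute on spare CPU, and post the results back.
import json
import time
import urllib.request

COORDINATOR = "http://example.org/coordinator"  # hypothetical endpoint

def compute(work_unit):
    # Stand-in for the real parallel algorithm: sum the squares of the
    # numbers the coordinator sent in this work unit.
    return sum(x * x for x in work_unit["numbers"])

while True:
    try:
        with urllib.request.urlopen(COORDINATOR + "/next") as resp:
            work_unit = json.load(resp)
    except OSError:
        time.sleep(60)  # no connection or no work: try again later
        continue
    result = {"id": work_unit["id"], "value": compute(work_unit)}
    req = urllib.request.Request(
        COORDINATOR + "/result",
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The hard parts that BOINC handles for you are exactly the ones the question anticipates: shipping application code to heterogeneous operating systems, validating results from untrusted volunteers, and scheduling and retrying work units.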
I've been hearing a lot about Apache Thrift lately, though I know very little about it. I understand that it's a remote procedure call framework that abstracts calling functions across languages and on different machines. I've looked into MPI and found it absurdly low-level. Would Thrift be a good higher-level replacement for performing parallel computation on a networked group of machines?
The answer depends on your performance requirements. If you are looking for pure computational power from a networked group of machines, then Thrift is not quite ready.
Thrift has its own serialization layer to abstract away type conversion between languages and between versions of the API. This is great for enterprise client/server systems, which can absorb the performance hit of those conversions in exchange for allowing clients and servers in different languages. For a high-performance networked group of machines, however, that abstraction buys you little, since your nodes will probably all use the same language.
Also, asynchronous I/O is fairly new and immature in most of the language bindings, which means falling back on blocking network I/O. That is probably not ideal for what you want to do.
I use Thrift extensively and it solves a lot of problems, and the community is fairly active. However, it's probably not the right tool for your problem.
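For a concrete sense of what the abstraction looks like, here is a minimal Thrift client sketch in Python. It assumes a hypothetical work.thrift IDL declaring a service Worker with a method crunch, with stubs generated by `thrift --gen py work.thrift` and the gen-py directory on the Python path:

```python
# Minimal Thrift client sketch. Assumes a hypothetical work.thrift with:
#   service Worker { i64 crunch(1: i64 n) }
# and Python stubs generated via `thrift --gen py work.thrift`.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from work import Worker  # generated module; its name follows the IDL file

transport = TTransport.TBufferedTransport(TSocket.TSocket("node1", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Worker.Client(protocol)

transport.open()
print(client.crunch(1000000))  # blocking call: the work runs on node1
transport.close()
```

Note that the call blocks until node1 answers, which illustrates the point above about synchronous I/O: to keep many nodes busy you would need one connection per worker thread or an asynchronous binding.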
I have a make build that used to take around an hour to complete. Using the -j option I was able to reduce it to 40 minutes. I observed that CPU utilization was high, and my mentor suggested that I distribute the jobs across the different servers and machines available in our organization. I read about distcc, but it can be used for C code only, and we have a mix of C and Java code. Kindly suggest an appropriate tool to look at, and which is the easiest to install and deploy, as I am the only one working on this project.
Platform: Solaris on SPARC and also x86.
Thank you
Ankit
ElectricAccelerator, a commercial product from Electric Cloud, is a drop-in replacement for GNU make that accelerates make-based builds by distributing the work to a cluster of computers. It can also distribute and parallelize ant-based builds. Accelerator uses a different mechanism than distcc so it is not tied to any particular toolchain or development language.
Disclaimer: I'm the architect and lead developer of ElectricAccelerator.
Check out DistCC:
http://distcc.samba.org/
It works on both Solaris SPARC and x86.
Good Luck!
You can also hand-craft a solution. Suppose you build four libraries and have four servers: build one library on each server, using remote execution commands.
This is just one simple example, of course, to give you the idea.
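A rough sketch of that idea, using Python to fire off ssh builds in parallel and wait for them (the host names and library directories are invented):

```python
# Hand-rolled distribution sketch: build one library per server over ssh
# and wait for all of them. Hosts and directories are placeholders.
import subprocess

builds = {
    "server1": "libs/core",
    "server2": "libs/net",
    "server3": "libs/gui",
    "server4": "libs/util",
}

procs = {
    host: subprocess.Popen(["ssh", host, f"cd {path} && make -j4"])
    for host, path in builds.items()
}

for host, proc in procs.items():
    print(host, "ok" if proc.wait() == 0 else "FAILED")
```

This only works cleanly if the libraries are independent; anything with cross-library dependencies needs a proper distributed build tool like the ones mentioned above.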
Besides distcc:
dmake is said to do what you're asking for: http://docs.oracle.com/cd/E19422-01/819-3697/dmake.html
DMS also exists: http://www.nongnu.org/dms/faq.html
See also ccache, which speeds up recompilation.
The other day I needed to archive a lot of data on our network, and I was frustrated that I had no immediate way to harness the power of multiple machines to speed up the process.
I understand that creating a distributed job management system is a leap from a command-line archiving tool.
I'm now wondering what the simplest solution to this type of distributed-performance scenario would be. Would a custom tool always be a requirement, or are there ways to use standard utilities and somehow distribute their load transparently at a higher level?
Thanks for any suggestions.
One way to tackle this might be to use a distributed make system to run scripts across networked hardware. This is (or used to be) an experimental feature of (some implementations of) GNU Make. Solaris implements a dmake utility for the same purpose.
Another, more heavyweight, approach might be to use Condor to distribute your archiving jobs. But I think you wouldn't install Condor just for the twice-yearly archiving runs; it's more of a system for regularly scavenging spare cycles from networked hardware.
The SCons build system, which is really a Python-based replacement for make, could probably be persuaded to hand work off across the network.
Then again, you could use scripts that ssh into networked PCs to start jobs.
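For example (a sketch only; the host names and paths are placeholders), a few lines of scripting can fan the archiving work out so that each idle PC compresses one directory from the shared filesystem:

```python
# Fan the archiving work out over ssh: each idle PC tars up one directory
# from the network share. Hosts and paths below are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

jobs = [
    ("pc-alice", "/net/archive/2009_q1"),
    ("pc-bob",   "/net/archive/2009_q2"),
    ("pc-carol", "/net/archive/2009_q3"),
]

def archive(host, path):
    # -C / stores paths relative to the filesystem root in the tarball
    cmd = ["ssh", host, f"tar czf {path}.tar.gz -C / {path.lstrip('/')}"]
    return host, subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    for host, rc in pool.map(lambda job: archive(*job), jobs):
        print(host, "ok" if rc == 0 else "failed")
```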
So there are a few ways you could approach this without having to take up parallel programming with all the fun that that entails.
I was looking for an ETL tool, and on Google I found a lot about Pentaho Kettle.
I also need a data analyzer to run on a star schema so that business users can play around and generate any kind of report or matrix. Again, Pentaho Analyzer is looking good.
The other part of the application will be developed in Java, and the application should be database agnostic.
Is Pentaho good enough, or are there other tools I should check out?
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But...the chances are that companies wanting to go the open source route for their BI solution are also most likely to end up using open source database technology...and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatever MDX/XMLA your cube sends to the database will be interpreted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end up interacting with PostgreSQL or MySQL. I can't vouch for how PostgreSQL performs in the OLAP realm, but I do know from experience that MySQL - for all its undoubted strengths - has "issues" with the types of SQL that typically crop up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be spent working around the fact that Pentaho doesn't always know which database it is talking to - robbing Peter to (at least partially) pay Paul, so to speak.
Unfortunately, more info is needed. For example:
will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
what database products (and versions) and file types do you need to talk to?
do you need to support querying of web-services?
do you need near real-time trickling of data?
do you need rule-level auditing & counts to account for every single row?
do you need delta processing?
what kinds of machines do you need this to run on? linux? windows? mainframe?
what kind of version control, testing and build processes will this tool have to comply with?
what kind of performance & scalability do you need?
do you mind if the database ends up driving the transformations?
do you need this to run in userspace?
do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
how many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
I've used Talend before with some success. You create your transformation by chaining operations together in a graphical designer. There were definitely some WTFs, and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.