Programming through PBS - cluster-computing

I want to schedule a program to run on multiple nodes; how can I do so? I'm new to this kind of programming, and I got a hint that PBS is the way to do it. How can I do this with PBS?
Thanks in advance.

If you have a particular problem, you should describe it in your question. If you have no idea how to deal with PBS, you should read the following:
http://www.phy.bme.hu/~cluster/docs/PBS.html
http://www.adaptivecomputing.com/resources/docs/torque/index.php (section 2)
Once again, I advise you to post a complete description of what you want to submit to your cluster (kind of job, input data, output data, number of jobs, ...) so that we can help you a bit more.
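In the meantime, here is a minimal sketch of what a multi-node Torque/PBS submission can look like, written as a small Python wrapper around qsub. The resource requests, the walltime, and the program name my_program are placeholders rather than anything from your setup; check the documentation above for the directives and limits your site actually supports.

```python
# Minimal sketch of submitting a multi-node job to a Torque/PBS queue.
# All resource values and "my_program" are illustrative assumptions.
import subprocess

pbs_script = """#!/bin/bash
#PBS -N example_job            # job name
#PBS -l nodes=2:ppn=4          # request 2 nodes with 4 cores each (adjust to your cluster)
#PBS -l walltime=00:30:00      # maximum run time
cd "$PBS_O_WORKDIR"            # start in the directory the job was submitted from
# $PBS_NODEFILE lists the hosts assigned to this job; an MPI launcher such as
# mpirun typically consumes it to start one process per requested core.
mpirun -machinefile "$PBS_NODEFILE" -np 8 ./my_program
"""

# qsub reads the job script from stdin and prints the assigned job id.
result = subprocess.run(["qsub"], input=pbs_script, text=True,
                        capture_output=True, check=True)
print("Submitted job:", result.stdout.strip())
```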

Related

LCI data format in BW2

So I want to import my own LCI database into Brightway2, and my process has three valuable products.
I found this example with co-products: https://github.com/massimopizzol/B4B/blob/main/02.2_Simple_LCA_co_products.py
The example shows more or less how it works, but I would like to use allocation for my process rather than substitution. Should I simply change the type from "substitution" to "allocation", or does bw2 not support allocation? Also, if there are three valuable products, is the first one part of the main activity with type="production", while the other two get type="substitution"? And for those other two, do we create two separate activities, each essentially a one-exchange activity whose type is "production", like in the example?
Also, just to make sure: if one of the inputs has type="technosphere", we need to create another activity that describes the process behind it, right? And raw materials have type="biosphere", with negative amounts, in contrast to emissions?
I set the type of the other valuable products to "substitution" and created a new activity for each of them, whose exchange type was "production". Overall it ran, but the LCA score I obtained wasn't correct, so I don't know whether I made a conceptual mistake.
Thank you in advance for all your help and time!
So, Brightway currently does not have a model where you can enter a multifunctional process and get the software to do allocation for you. You will need to do the allocation yourself :) Here is a notebook I wrote up that shows a simple allocation procedure.
P.S. In the future, please post to either the beginners mailing list or SO, not both, so that people don't get notified twice.
Changing "substitution" by "allocation" will not work. If you want to use allocation / partition, I would create the activities with the exchanges already allocated.
The meaning of the "substitution" exchange as well as the sign conventions for biosphere flows is explained in the documentation here.
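To make that concrete, here is a rough sketch of doing the partitioning yourself before the data is written, using the plain-dict database format from the linked example. All names, keys, amounts, and allocation factors below are invented for illustration; only the general pattern (split the multifunctional process into single-output activities and scale the shared inputs and emissions by an allocation factor) reflects the advice above.

```python
# Sketch: manual allocation of a multifunctional process into
# single-output activities before writing the Brightway2 database.
# Numbers, keys, and factors are made up for illustration.

# Shared inputs and emissions of the original multifunctional process.
shared_exchanges = [
    # (exchange key, amount, exchange type)
    (("mydb", "electricity"), 10.0, "technosphere"),
    (("biosphere3", "co2-flow-code"), 5.0, "biosphere"),  # hypothetical flow key
]

# Allocation factors per valuable product (e.g. from economic allocation).
products = {"product_A": 0.8, "product_B": 0.2}

data = {}
for name, factor in products.items():
    code = f"process_{name}"
    exchanges = [{"input": ("mydb", code), "amount": 1.0, "type": "production"}]
    for key, amount, exc_type in shared_exchanges:
        # Each single-output activity carries its allocated share of the burdens.
        exchanges.append({"input": key, "amount": amount * factor, "type": exc_type})
    data[("mydb", code)] = {
        "name": f"process producing {name}",
        "unit": "kilogram",
        "exchanges": exchanges,
    }

# Writing the database would then look like the linked example, roughly:
# from brightway2 import Database
# Database("mydb").write(data)
```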

Eudyptula Challenge List

I was interested in the Eudyptula Challenge, but I've just submitted my solution to Task 1 and they are taking a very long time to respond. Can someone just post the task list (not the solutions) so that I can work through them at my own pace? Thanks in advance :)
P.S. I know this is not exactly a programming question, but I didn't know where else to ask it.
See here: https://github.com/agelastic/eudyptula
All tasks are in the README file. Try not to look at the solutions!

Monitor Hadoop Cluster using Collectl

I am evaluating various system monitoring tools to use one to monitor my hadoop cluster.
One of the tools I am impressed by is collectl. I have been playing around with it for a couple of days.
I am struggling to figure out how to aggregate the metrics captured by collectl when using colmux.
Say I have 10 nodes in my Hadoop cluster, each running collectl as a service. Using colmux I can see the performance metrics of each node in a single view (in single- and multi-line formats). Great!
But what if I want an aggregate of CPU, I/O, etc. across all the nodes in the cluster? That is, I want to see how the cluster as a whole is performing by aggregating the performance metrics from each node into corresponding totals, giving me cluster-level metrics instead of node-level ones.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list, but for the benefit of those not on it I'll repeat myself here.
That's a cool idea. So if I understand you correctly, you'd like to see some sort of totals line at the bottom? I can always add it to my wish list, but no promises. However, I think I may also have a solution if you don't mind doing a little extra work on your own ;) By the way, can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full-screen mode and simply print everything as scrolling output. If you then also include "--lines 99999" (or some other big number), it will print the output from all the remote systems so you don't miss anything. Finally, you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and compute the totals yourself. Then, whenever you see a new header fly by, print the totals and reset the counters to zero. You could even add timestamps and maybe ultimately turn it into your own open-source project. I bet others would find it useful too.
-mark
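For reference, here is a rough sketch of the kind of post-processing script Mark describes, assuming colmux is run with --noesc and a large --lines value so that a repeating header line is followed by one numeric row per host. The exact columns depend on the collectl/colmux switches you pass, so the parsing heuristic is only illustrative.

```python
#!/usr/bin/env python3
# Sketch: pipe "colmux ... --noesc --lines 99999" into this script; it echoes
# the output and prints a TOTAL line (and resets its counters) each time a new
# header scrolls by. Column layout is an assumption; adapt to your switches.
import sys

def flush(totals):
    # Print an aggregate line for the block of samples seen since the last header.
    if totals:
        print("TOTAL " + " ".join(f"{t:.0f}" for t in totals))

totals = None
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    # Heuristic: data rows look like "<host> <num> <num> ..."; any row whose
    # remaining fields are not all numeric is treated as a header line.
    try:
        values = [float(f) for f in fields[1:]]
    except ValueError:
        flush(totals)          # new header: print totals and reset the counters
        totals = None
        print(line, end="")
        continue
    if totals is None:
        totals = [0.0] * len(values)
    totals = [t + v for t, v in zip(totals, values)]
    print(line, end="")

flush(totals)
```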

Big data analysis on CDR (call detail records). Help :D

I have been assigned an analysis project on the internal call records of a call center. This being my first experience with big data analysis, can someone guide me on how to go about this project? Where should I begin, and which tools should I use? Pentaho, ETL tools, Hadoop? Any suggestions?
Tips
Understanding the data
Identify what kind of insights you want to gather, i.e. what questions do you want to ask.
Is it really big data (use the 4 V's, i.e. volume, velocity, variety, veracity, to figure this out), or something that tools like R can handle?
Once you have understood the above, you will know how to proceed.

Hadoop for the Wikipedia pagecount dataset

I want to build a Hadoop job that basically takes the Wikipedia pagecount statistics as input and creates a list like
en-Articlename: en:count de:count fr:count
For that I need the article names for each language, e.g. Bruges (en, fr) and Brügge (de), which the MediaWiki API can return article by article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
My question is about the right approach to solving this problem.
My sketched approach would be:
Process the pagecount file line by line (example line: 'de Brugge 2 48824')
Query the MediaWiki API and write something like 'en-Articlename: processed-language-key:count'
Aggregate all values for an en-Articlename into one line (maybe in a second job?)
It seems rather unwieldy to query the MediaWiki API for every line, but at the moment I cannot come up with a better solution.
Do you think this approach is feasible, or can you think of a different one?
On a side note: the resulting job chain will be used for some timing measurements on my (small) Hadoop cluster, so altering the task is still okay.
Edit:
Here is a quite similar discussion which I just found.
I think it isn't a good idea to query the MediaWiki API during your batch processing, because of:
network latency (your processing will be considerably slowed down)
a single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
an external dependency (it's hard to repeat the calculation and get the same result)
legal issues and the possibility of being banned
A possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to the same article in the other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the others.
Then you can use that correspondence in a map/reduce job that processes the pagecount statistics, as sketched below. If you do that, you become independent of MediaWiki's API, speed up your data processing, and make debugging easier.
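Here is a rough sketch of the second step as a Hadoop Streaming mapper in Python, under these assumptions: the correspondence extracted from the dump has been shipped to each task as a tab-separated file langlinks.tsv with lines of the form "<lang>\t<title>\t<english_title>" (for example via the -files option), and the pagecount input lines look like 'de Brugge 2 48824'. The file name and its format are illustrative, not something defined by Hadoop or Wikipedia.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper that joins pagecount lines against a
# precomputed (language, title) -> English title correspondence and keys the
# output by the English article name. Formats are assumptions; adapt as needed.
import sys

# Load the correspondence into memory. Fine for a small test cluster; a
# map-side join like this may need something smarter at larger scale.
mapping = {}
with open("langlinks.tsv", encoding="utf-8") as f:
    for raw in f:
        parts = raw.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        lang, title, en_title = parts
        mapping[(lang, title)] = en_title

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 3:
        continue
    lang, title, count = parts[0], parts[1], parts[2]
    # English pages map to themselves; unknown titles are dropped here.
    en_title = mapping.get((lang, title), title if lang == "en" else None)
    if en_title is None:
        continue
    # Key by the English article name so the reducer can collect all languages.
    print(f"{en_title}\t{lang}:{count}")

# A matching reducer would concatenate the values per key into one line, e.g.
# "en-Articlename: en:count de:count fr:count".
```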
