I've come across a specific task: creating a NetCDF file that stores data from only certain processors. The situation is the following. I have a 3D field divided into (nx) x (ny) x (nz) domains, and each domain has a processor assigned to it. I would like to save data only from the domains at a certain position in the x direction, which means the data would come from only ny x nz processors. I've been trying to find examples of how to write such data, but unsuccessfully. Does anyone know whether this is doable, and are there specific commands I should use?
For example, I tried to do the write inside an if condition, if(mpid%rank==0) then..., together with a call to nf90_var_par_access(ncid, varid, nf90_independent), but without success: the procedure seems to get stuck.
Thank you in advance!
Actually, I managed to resolve the issue just hours after I posted the question. The main trouble was the dimension length defined in the nf90_def_dim call. By default, my code assumed the dimension length was the product of the number of domains and the number of points in each. I changed that definition to handle the case where only certain domains are used, and after that the write from only a few processors worked.
Regards to all
I am trying to get data out of debug log messages created by a certain piece of open source software. It has many lines describing what it is doing at each stage. The output has no specific structure: some data spans multiple lines with different indents and no separator, so it does not import nicely into a pandas data frame, which would usually be my go-to.
Is there a good way to structure a Python script that parses this data, can be reused in the future for the same job, and can be extended to extract different data? I have to do a bunch of different steps to extract the data. The other complication is that the file is much too big to store in memory (10^6 lines), so I need to iterate through the lines.
Could anyone give me some tips on how to do this? Is it best to do each step separately and save the result to a new file? My other idea is to create a data object that stores the relevant line numbers as attributes in lists, generated by different methods, so that each subsequent method only loads the lines from those lists.
Or maybe I am using totally the wrong tool and need to learn awk or regex commands to do it? I already know Python, so I have a preference for it. I'm not necessarily looking for a specific answer; some tips and pointers would also be very useful!
(--details--) On a freeradius server, I am trying to trace the differences between the log messages for requests, accepts and rejects of a MAC address, to find out why it is sometimes accepted and other times rejected, seemingly at random.
A lot of plugins were running on the server setup before I got to deal with it, so the debug output is a massive wall of text that labels each request with a number. I can split it into requests by that number, find the requests that mention the MAC, and split those requests into different files. Then I want to filter out all the boilerplate info that comes with each message and get to the things that differ between them. (--details--)
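One way to structure this in Python is as a chain of generators, so that only one request's lines are held in memory at a time and each step can be swapped out or extended. The following is a minimal sketch, not a definitive parser. It assumes every debug line begins with the request number in parentheses, for example "(12) ...", and that unprefixed lines continue the previous request. The prefix pattern, the function names and the sample lines are all assumptions made for illustration.

```python
import re
from itertools import groupby

# Assumed line format: "(<request number>) <message>"; adjust the pattern to the real log.
REQUEST_PREFIX = re.compile(r"^\((\d+)\)\s?(.*)$")


def tagged_lines(lines):
    """Yield (request_id, text) pairs; lines without a prefix inherit the previous id."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = REQUEST_PREFIX.match(line)
        if match:
            current = int(match.group(1))
            yield current, match.group(2).strip()
        elif current is not None:
            yield current, line.strip()


def requests(lines):
    """Group consecutive lines that carry the same request id."""
    for request_id, group in groupby(tagged_lines(lines), key=lambda pair: pair[0]):
        yield request_id, [text for _, text in group]


def requests_mentioning(lines, needle):
    """Keep only the requests in which some line contains `needle` (e.g. a MAC address)."""
    for request_id, body in requests(lines):
        if any(needle in text for text in body):
            yield request_id, body


# Made-up sample lines; in practice pass an open file object instead of a list.
sample = [
    "(0) Received Access-Request\n",
    '(0)   Calling-Station-Id = "aa-bb-cc-dd-ee-ff"\n',
    "    continuation line\n",
    "(1) Received Access-Request\n",
    '(1)   Calling-Station-Id = "11-22-33-44-55-66"\n',
]
matches = list(requests_mentioning(sample, "aa-bb-cc-dd-ee-ff"))
```

Because a file object is itself a line iterator, the same functions work on the full log without loading it. Note that groupby only merges consecutive lines, so if requests are interleaved in the real log, the grouping step would have to collect lines per id in a dictionary instead.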
I'm working on a Beam IO for Elasticsearch in Golang. At the moment I have a working draft, but I only managed to make it work by doing something whose purpose is not clear to me.
Basically, I looked at the existing IOs and found that writes only work if I add the following:
x := beam.AddFixedKey(s, pColl)
y := beam.GroupByKey(s, x)
A full example is in the existing BigQuery IO.
I would like to understand why I need AddFixedKey followed by GroupByKey to make it work. I also checked the issue BEAM-3860, but it doesn't give many more details.
Those two transforms essentially function as a way to group all elements in a PCollection into one list. For example, its usage in the BigQuery example you posted allows grouping the entire input PCollection into a list that gets iterated over in the ProcessElement method.
Whether to use this approach depends on how you are implementing the IO. The BigQuery example you posted performs its writes as a batch once all elements are available, but that may not be the best approach for your use case. You might prefer to write elements one at a time as they come in, especially if you can parallelize writes among different workers. In that case you would want to avoid grouping the input PCollection together.
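To make the grouping effect concrete, here is a plain Python model of what the two transforms do to the data. It is only a conceptual sketch and does not use the Beam API: the functions add_fixed_key and group_by_key are hypothetical stand-ins, and the key value 0 and the sample elements are assumptions for illustration.

```python
from collections import defaultdict


def add_fixed_key(elements, key=0):
    """Pair every element with the same constant key."""
    return [(key, element) for element in elements]


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


docs = ["doc-a", "doc-b", "doc-c"]
batches = group_by_key(add_fixed_key(docs))
# Every element shares one key, so the result holds a single list with all elements.
```

Since there is only one key, everything downstream of the grouping sees the whole collection as one list, which is what allows a single batched write and also what removes the parallelism.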
Below is a general ETL flow chart diagram. I am really confused about whether it is good practice to draw such a flow chart, especially regarding the lines connecting to the final output and the big box used to generalize the whole process that goes from input format to 'Validate files as per type of file' to 'Adjustments for desired output' and finally to 'outputs'.
In concept, to provide an overview of the process, a Context diagram or Data Flow Diagram can be used as well as a Flow Chart. While all these diagrams are old, they are usually useful.
I suggest that you check the following points in your diagram - What I have here are personal suggestions based on long history with ETL.
Your processing shape (say, a rectangle) must have at least one entry (input) and at least one exit (output). Example: the two large rectangles don't have clear inputs and outputs! What is going on there?
You can't have outputs without a process. Example: Type 1 is somehow split into Type 1.1 and Type 1.2. How does that split happen? Is it done by a program? Which one? Another example is that Type 1.1 has an arrow connected to Output 1; the same questions apply. Yet another example is the arrow from "Rejected folder" to email.
The name of the process should indicate what the process is or the actual program component name or physical file names. This is valuable in ETL.
You may want to show the process triggering event such as time of day.
Use remarks.
Make the level of detail consistent.
I want to build a Hadoop-Job that basically takes the wikipedia pagecount-statistic as input and creates a list like
en-Articlename: en:count de:count fr:count
For that I need the different article names in each language, e.g. Bruges (en, fr) and Brügge (de), which the MediaWiki API returns article by article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
My question is to find the right approach to solve this problem.
My sketched approach would be:
Process the pagecount file line by line (line-example 'de Brugge 2 48824')
Query the MediaWiki API and write something like 'en-Articlename: process-language-key:count'
Aggregate all en-Articlename values into one line (maybe in a second job?)
Now, it seems rather unwieldy to query the MediaWiki API for every line, but I currently cannot get my head around a better solution.
Do you think the current approach is feasible, or can you think of a different one?
On a sidenote: The created job-chain shall be used to do some time-measuring on my (small) Hadoop-Cluster, so altering the task is still okay
Edit:
Here is a quite similar discussion that I have just found.
I think it isn't a good idea to query the MediaWiki API during your batch processing, due to:
network latency (your processing will be slowed down considerably)
single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
external dependency (it's hard to repeat the calculation and get the same result)
legal issues and the possibility of a ban
A possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to that article in other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the rest.
Then you can use the correspondence in a map/reduce job that processes the pagecount statistic. If you do that, you'll become independent of MediaWiki's API, speed up your data processing and make debugging easier.
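The two steps can be sketched in plain Python, with ordinary functions standing in for the two jobs. This is only an illustration under stated assumptions: it assumes the interlanguage links appear in the article text as [[de:Brügge]], that a pagecount line has four whitespace-separated fields as in 'de Brugge 2 48824', and that titles use underscores instead of spaces. The function names and sample data are made up.

```python
import re
from collections import defaultdict

# Assumed interlanguage link format inside the article text: [[de:Brügge]]
LANGLINK = re.compile(r"\[\[([a-z\-]+):([^\]|]+)\]\]")


def build_correspondence(english_articles):
    """Job 1: map (language, local title) -> English title from English article texts."""
    mapping = {}
    for en_title, text in english_articles:
        mapping[("en", en_title)] = en_title
        for lang, title in LANGLINK.findall(text):
            mapping[(lang, title.strip().replace(" ", "_"))] = en_title
    return mapping


def aggregate_pagecounts(lines, mapping):
    """Job 2: sum view counts per English title and language from pagecount lines."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        lang, title, count, _size = line.split()
        en_title = mapping.get((lang, title))
        if en_title is not None:  # titles without a known English article are skipped
            totals[en_title][lang] += int(count)
    return {en: dict(per_lang) for en, per_lang in totals.items()}


# Made-up sample data for illustration.
articles = [("Bruges", "Bruges is a city. [[de:Brügge]] [[fr:Bruges]]")]
pagecounts = ["en Bruges 10 1000", "de Brügge 2 48824", "fr Bruges 3 500", "de Unknown 1 10"]
result = aggregate_pagecounts(pagecounts, build_correspondence(articles))
```

In a real Hadoop job the correspondence would be produced once from the dump and then joined against the pagecount input, for example by shipping it to the mappers as a side file if it fits in memory. The link pattern above would also match other lowercase prefixes, so a list of known language codes would be a sensible filter.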
ETL is pretty commonplace. Data is out there somewhere, so you go get it. After you get it, it's probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is that you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (NoSQL), but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven, so maybe you could create a line-iterator type of abstraction layer that simply requires the device to spit out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines, which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as the ones you illustrate, you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme: the data will all be related in some way, but you may be retrieving it from different sources (some might come from a nicely structured database, some from an Excel, XML or text file, some from a web service call, etc.).
When coding up a custom ETL application, a common pattern is the Provider model. It lets you write a whole bunch of custom providers that load/query the data and then transform it. All the providers implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementations of those methods will be wildly different depending on the data source being dealt with; the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code for it to call.
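As a rough illustration of the Provider model, here is a minimal Python sketch with one provider for the printer from the question. The class and method names mirror QueryData()/TransformData() but are otherwise made up, and the fetch callable is a hypothetical stand-in for the real network exchange on port 9000. It is a sketch under those assumptions, not a finished design.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Common interface: every data source knows how to query and how to transform."""

    @abstractmethod
    def query_data(self):
        """Return the raw text obtained from the device."""

    @abstractmethod
    def transform_data(self, raw):
        """Turn raw text into a dict of field name -> value."""

    def run(self):
        return self.transform_data(self.query_data())


class PrinterProvider(Provider):
    def __init__(self, fetch):
        # `fetch` is a hypothetical callable standing in for the exchange on port 9000.
        self._fetch = fetch

    def query_data(self):
        return self._fetch()

    def transform_data(self, raw):
        record = {}
        for line in raw.splitlines():
            if ":" in line:  # skips the "> status" echo and the "=====" separator line
                key, value = line.split(":", 1)
                record[key.strip()] = value.strip()
        return record


# Canned response copied from the printer example, used instead of a live device.
canned = "> status\n===============\nhas_paper:true\njobs:0\nink:low\n"
printer_state = PrinterProvider(lambda: canned).run()
```

An ATM or voicemail provider would implement the same two methods with its own interactive dialogue, and the calling code would only ever see run().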
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, I would just write a new provider, which can sit in its very own assembly (dll), so it can be shipped (or modified, upgraded, etc.) in isolation from any other providers I already have. Or if I was using SSIS then I would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.
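For the ATM case, deferring the calculation could look like the following sketch: the raw conversation is staged as-is, and the cash balance is derived from it whenever it is needed. The reply format is taken from the ATM example in the question; the function names and the staged list are assumptions for illustration.

```python
import re

# Reply format as shown in the ATM example: "[$5 bills]: 2"
BILL_LINE = re.compile(r"^\[\$(\d+) bills\]:\s*(\d+)$")


def parse_bill_counts(raw_lines):
    """Extract {denomination: number of bills} from staged raw lines."""
    counts = {}
    for line in raw_lines:
        match = BILL_LINE.match(line.strip())
        if match:  # prompt lines such as "maint-mode> GET BILLS_1" are ignored
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def cash_total(counts):
    """Compute the balance later, from the parsed counts."""
    return sum(denomination * number for denomination, number in counts.items())


# Raw lines as they might sit in a staging area, copied from the example session.
staged = [
    "maint-mode> GET BILLS_1",
    "[$1 bills]: 7",
    "maint-mode> GET BILLS_5",
    "[$5 bills]: 2",
]
counts = parse_bill_counts(staged)
total = cash_total(counts)
```

With the example session this gives 7 one-dollar bills and 2 five-dollar bills, so a total of 17.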