Hadoop for the Wikipedia pagecount dataset

I want to build a Hadoop job that takes the Wikipedia pagecount statistics as input and creates a list like
en-Articlename: en:count de:count fr:count
For that I need the different article names for each language - i.e. Bruges (en, fr), Brügge (de) - which the MediaWiki API can be queried for article by article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
My question is to find the right approach to solve this problem.
My sketched approach would be:
Process the pagecount file line by line (example line: 'de Brugge 2 48824')
Query the MediaWiki API and write something like 'en-Articlename: processed-language-key:count'
Aggregate all en-Articlename values into one line (maybe in a second job?)
It seems rather unwieldy to query the MediaWiki API for every line, but I currently can't get my head around a better solution.
Do you think this approach is feasible, or can you think of a different one?
On a side note: the resulting job chain will be used for some time measurements on my (small) Hadoop cluster, so altering the task is still okay.
Edit:
Here is a quite similar discussion which I just found.

I don't think it's a good idea to query the MediaWiki API during your batch processing, due to:
network latency (your processing will be considerably slowed down)
single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
external dependency (it's hard to repeat the calculation and get the same result)
legal issues and the possibility of being banned
A possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to the same article in the other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the rest.
Then you can use that correspondence in the map/reduce job that processes the pagecount statistics. If you do that, you'll become independent of MediaWiki's API, speed up your data processing and make debugging easier.
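A rough sketch of what that second job could look like, using Hadoop's newer MapReduce API; lookupEnglishTitle() is a hypothetical stand-in for the correspondence built from the dump (e.g. loaded from the distributed cache), not working code for that part:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PagecountAggregation {

        public static class PagecountMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                // pagecount line: "de Brugge 2 48824" -> language, title, count, bytes
                String[] parts = value.toString().split(" ");
                if (parts.length < 3) return;
                String lang = parts[0], title = parts[1], count = parts[2];
                String enTitle = lookupEnglishTitle(lang, title);   // hypothetical lookup
                if (enTitle != null) {
                    ctx.write(new Text(enTitle), new Text(lang + ":" + count));
                }
            }

            private String lookupEnglishTitle(String lang, String title) {
                // placeholder: would consult the correspondence built from the dump
                return "en".equals(lang) ? title : null;
            }
        }

        public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                // collect "lang:count" pairs for one English article into a single line
                StringBuilder sb = new StringBuilder();
                for (Text v : values) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(v);
                }
                ctx.write(key, new Text(sb.toString()));
            }
        }
    }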

Related

Is there a process for munging data from many different formats in RapidMiner?

I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:
Processes files that are dropped into a folder, on a schedule (this one I think I know, but I'd love tips on this as scheduled processes are new to me)
Automatically identifies input filetype and routes to the correct operator ("Read CSV" for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases, attributes are named the same way as our ingestion format and in others they are not (phone vs phone # vs Phone for example)
The attributes we process mostly consist of name, id, phone, email, address. Also, in some cases names are split first/last and in some they are full name.
I recognize that munging files for such simple attributes shouldn't be that hard, but the number of files we receive and the lack of order make it very difficult to streamline a process without a bit of automation. I'm also going to move to a standardized receiving format, but for a number of reasons that's on the horizon and not an immediate solution.
I appreciate any tips or guidance you can share.
Your question is relatively broad, so unfortunately I can't give you a complete answer. But here are some ideas on how I would tackle the points you mentioned:
For full process scheduling, RapidMiner Server is what you are looking for. In that case you can either define a schedule (e.g., check regularly for new files) or even define a web service to trigger the process.
For selecting the correct operator depending on file type, you could use a combination of "Loop Files" and macro extraction to get the correct type, and then use either "Branch" or "Select Subprocess" to switch between the different input routes.
The "Select Attributes" operator has some very powerful options to
select specific subsets only. In your example I would go for a
regular expression akin to [pP]hone.* to get the different spelling
variants. Also very helpful in that case would be the "Reorder
Attributes" operator and "Rename by Replacing" to create a common
naming schema.
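If you want to sanity-check which attribute-name variants a regular expression like [pP]hone.* would actually catch before putting it into "Select Attributes", a small plain-Java check (outside RapidMiner, attribute names made up) could look like this:

    import java.util.regex.Pattern;

    public class AttributeRegexCheck {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("[pP]hone.*");
            for (String name : new String[] { "phone", "Phone", "phone #", "PHONE", "Telephone" }) {
                // matches() requires the whole attribute name to match the expression
                System.out.println(name + " -> " + p.matcher(name).matches());
            }
            // prints: phone -> true, Phone -> true, "phone #" -> true,
            //         PHONE -> false, Telephone -> false
        }
    }

That tells you quickly whether you need extra variants (e.g. fully uppercase names) in the expression or an additional "Rename by Replacing" pass.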
A general tip when building more complex process pipelines is to organize your different tasks in sub-processes and use the "Execute Process" operator. This makes everything much more readable and maintainable. A good error-handling strategy is also important for dealing with unforeseen data formats.
For more elaborate answers and tips from many advanced RapidMiner users, I also highly recommend the RapidMiner community.
I hope this gives a good starting point for your project.

How do I output the current dB level on Mac?

Hello and thanks for checking out my question,
I am working on a project analysing film and visualizing the data I got from it. I'm quite new at programming and only have some basic experience in java and javascript.
For my project I want to store the dB levels of a movie in a CSV file, to later work with the data in Processing. I couldn't find anything for Mac (OS X) that wasn't too complex for me to comprehend.
Help would be much appreciated!
Thank you.
You're going to have to break your problem down into smaller steps.
Step 1: Generating the CSV file.
There are probably a million different ways to do this, and that can be pretty confusing. But break this down into smaller sub-steps and then take those steps one at a time. Can you get a movie playing in Processing? There is a Video library that does just that. Then can you get the volume level every X seconds? You might start with a separate sketch that just prints something to the console every X seconds. For getting the volume, you might try out the Minim library. If that doesn't work, Google is your friend, and remember to keep breaking your problem down into smaller steps!
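For example, a minimal Processing sketch along those lines might look like the following. This assumes the Minim library is installed and that the movie's audio reaches the selected audio input (e.g. through a loopback device); the one-second interval, the column names and the dB conversion are my own choices, not the only way to do it:

    import ddf.minim.*;

    Minim minim;
    AudioInput in;
    Table levels;
    int lastSample = 0;

    void setup() {
      size(200, 200);
      minim = new Minim(this);
      in = minim.getLineIn(Minim.STEREO, 512);   // monitor the current audio input
      levels = new Table();
      levels.addColumn("seconds", Table.FLOAT);
      levels.addColumn("db", Table.FLOAT);
    }

    void draw() {
      if (millis() - lastSample >= 1000) {        // sample once per second
        lastSample = millis();
        float level = in.mix.level();             // current level of the buffer (0.0 to 1.0)
        float db = 20 * log(max(level, 0.000001)) / log(10);  // rough dBFS conversion
        TableRow row = levels.addRow();
        row.setFloat("seconds", millis() / 1000.0);
        row.setFloat("db", db);
      }
    }

    void keyPressed() {
      saveTable(levels, "data/levels.csv");       // press any key to write the CSV
    }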
Step 2: Loading the CSV file.
Now that you have the CSV file, you have to load it into Processing. There are several functions in the reference that might come in handy. Again, start with an example program that just prints the values to the console. Get that working perfectly before moving on.
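For instance, loading the CSV produced above and printing it could be as small as this (assuming the same column names as in the earlier sketch):

    Table levels;

    void setup() {
      levels = loadTable("levels.csv", "header");
      for (TableRow row : levels.rows()) {
        println(row.getFloat("seconds") + "s -> " + row.getFloat("db") + " dB");
      }
    }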
Step 3: Visualizing the data.
Now that you have the data in your Processing code, you can start thinking about how you want to visualize it. Maybe a line chart that shows the volume over time, just to start with.
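A bare-bones sketch of such a line chart, under the same assumptions about the CSV columns and assuming the values sit roughly between -60 dB and 0 dB, might be:

    Table levels;

    void setup() {
      size(800, 300);
      levels = loadTable("levels.csv", "header");
    }

    void draw() {
      background(255);
      stroke(0);
      for (int i = 1; i < levels.getRowCount(); i++) {
        // map row index to x and the dB value to y (louder = higher on screen)
        float x1 = map(i - 1, 0, levels.getRowCount() - 1, 0, width);
        float x2 = map(i, 0, levels.getRowCount() - 1, 0, width);
        float y1 = map(levels.getRow(i - 1).getFloat("db"), -60, 0, height, 0);
        float y2 = map(levels.getRow(i).getFloat("db"), -60, 0, height, 0);
        line(x1, y1, x2, y2);
      }
    }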
If you get stuck on a specific step, then try to break it down into smaller sub-steps. Create an example program that just tests one of those smaller sub-steps (also known as an MCVE), and you'll be able to ask a more specific code-oriented question. Good luck, sounds like an interesting project!

TDD strategy when implementing a multi-stage process?

At the moment I'm developing a piece of code which first gathers sentences from a set of documents, then tokenises these, then uses the results to analyse recurring frequencies of token sequences, including case variations (upper case/lower case/leading cap/other), then prints out the results.
Now I want to introduce two more stages before printing out the results:
1. firstly, removing "stop words" (i.e. words or short sequences whose frequency can never be of interest, such as, in English, "the", "of the", "of which", etc.) - these stop words/"stop sequences" are to be taken from a database table
2. secondly, bringing up a dialog enabling the user to identify sequences of new stop words, which would then remove the token sequences involved and also add the sequence in question to the database table.
The thing is, this is a multi-stage process, and I'm just wondering what TDD experts do when faced with a situation like this: do I create a new test method for each individual stage? The problem is that each individual stage requires "live memory data" from the previous stage. Another possibility could be to somehow serialise this data and then deserialise it when testing the next stage... but then the app code would be doing things which benefit only the testing code, i.e. it would mean tweaking ("distorting"?) the app code for the benefit of the tests, which seems wrong in principle...
Also, if anyone can point me in the direction of a book or site which helps TDD newbs like myself go to "the next level" I would be very grateful.
later
To the person who marked this as "favorite": I've now got hold of a book called "Growing Object-Oriented Software, Guided by Tests", which is well-reviewed and appears to be for someone wanting to move from beginner to intermediate. First impressions good.
Any views on this book by experts also welcome, of course...
On the face of it, you seem to be building a pipeline. From what I can tell, you're currently implementing all of it within a single class, which stores both the data that's being worked on and implements the methods that do the processing. One approach that you could take would be to break down the problem into smaller chunks. Rather than having a single class, you have a class for each stage of the pipeline and another class for orchestrating the process which is responsible for plugging the stages together in the correct order.
So, scanning through what you've described, you appear to have the following processors:
DocumentReader (reads documents from somewhere into an in-memory document)
SentenceExtractor (document/list of documents in, list of sentences out)
1 or more SentenceAnalysers (sentences in, statistics out), you might want to break this down depending on the type of analysis and how complex it is.
StopWordExtractor (StopWordProvider and sentences in, sentences out)
Additional supporting classes would be needed, for example to write new stop words to the database and, depending on how the StopWordProvider is implemented, to keep it in sync as the user selects new ones.
Essentially, what I'm saying is that you appear to be doing too much in a single location. If you're really happy that the code as you've described it is a single unit, then there is nothing wrong with you testing it all in one place, but then your inputs will be your starting documents/sentences and your outputs will be the end of the process. If you agree with me that really, there are several distinct components involved in the process that could change independently, then I would suggest breaking the process down into smaller classes and testing that those perform as expected for given sets of inputs/outputs...
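To make that concrete, here is a rough sketch (plain Java 9+, all names hypothetical) of how one stage could be isolated behind a small interface so it can be tested with fixed in-memory inputs instead of data handed over from the previous stage:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Each stage gets its own small class; dependencies are interfaces so tests
    // can substitute fixed in-memory data for the database or for earlier stages.
    interface StopWordProvider {
        Set<String> stopWords();
    }

    class StopWordFilter {
        private final StopWordProvider provider;

        StopWordFilter(StopWordProvider provider) {
            this.provider = provider;
        }

        List<String> filter(List<String> tokens) {
            return tokens.stream()
                         .filter(t -> !provider.stopWords().contains(t.toLowerCase()))
                         .collect(Collectors.toList());
        }
    }

    // A test drives the stage directly with literal inputs - no serialisation,
    // no data produced by the previous stage (plain assert here instead of a test framework):
    class StopWordFilterTest {
        void removesConfiguredStopWords() {
            StopWordProvider fixed = () -> Set.of("the", "of");
            List<String> result = new StopWordFilter(fixed)
                    .filter(List.of("The", "history", "of", "Bruges"));
            assert result.equals(List.of("history", "Bruges"));
        }
    }

The orchestrating class is then the only place that knows the order of the stages, and its own test only needs to check that the stages are wired together correctly.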

Automatically take a list of terms, import it into the Windows search function (for content), and export lists of results (AutoIt?)

My next big challenge is to write a script (I assume it would be in AutoIt, an area I have little experience with) to automate the Windows search function.
The end goal is to take a list of search terms from a .txt file (one string per line), and search the contents of every document on the computer for said search terms (one at a time).
I can make this happen by hand - turn on the search by content function, index all files on all attached drives, search the terms one by one, and highlight all > shift-click > Copy as path > paste in notepad, and save as [searchterm].txt.
However, I need to automate that whole process. I understand that I might need to write a separate script for each version of Windows it would be used with (XP, Vista, 7, 8).
Is this an easy enough task to accomplish, or would it take a lot of programming hours? Can anyone point me in the right direction? All help is appreciated.
Well, assuming your text file of queries is large enough, and you don't want to actually iterate the entire file system for each, you are describing a classic information retrieval problem.
Index the data from your file system (this is a preprocessing step that is done only once)
For each query, search for it in the index and get the relevant documents.
The field of Information Retrieval is a huge area of research, and I really don't encourage you to try implementing it from scratch.
I do encourage using existing libraries that are already developed and tested and that do this for you. For example, in Java a popular choice is Lucene, which is very widely used for search.
If you are not familiar with Java, there are also Python (PyLucene) and .NET (Lucene.NET) bindings of this library.
To learn more about Information Retrieval, I recommend Manning's Introduction to Information Retrieval.
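As a rough illustration of the two phases, here is a minimal Lucene sketch (written against the Lucene 4.x API; the directory paths, field names and query list are made up):

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ContentSearch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("C:/searchindex"));
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);

            // Phase 1: index the documents once.
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_47, analyzer));
            for (File f : new File("C:/documents").listFiles()) {
                Document doc = new Document();
                doc.add(new StringField("path", f.getAbsolutePath(), Field.Store.YES));
                doc.add(new TextField("contents", new FileReader(f)));
                writer.addDocument(doc);
            }
            writer.close();

            // Phase 2: run each search term from the list against the index.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            QueryParser parser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
            for (String term : new String[] { "invoice", "contract" }) {
                System.out.println("Results for: " + term);
                for (ScoreDoc hit : searcher.search(parser.parse(term), 100).scoreDocs) {
                    System.out.println("  " + searcher.doc(hit.doc).get("path"));
                }
            }
        }
    }

In your case the terms would come from the .txt file and each result list would be written back out to a [searchterm].txt file, which replaces the manual "Copy as path" step.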

Where is Pentaho Kettle's architecture documented?

Where can I find a description of Pentaho Kettle's architecture? I'm looking for a short wiki, design document, blog post - anything that gives a good overview of how things work. This question is not about specific "how to" getting-started guides but rather about a good view of the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation, every step produces a 'tuple', i.e. a row with fields. Every field is a pair of data and metadata. Every step has inputs and outputs: a step takes rows from its input, modifies them and sends them to its outputs. In most cases all the information is in memory. However, steps read data in a streaming fashion (from JDBC or other sources), so typically only a part of the stream's data is in memory at any time.
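To make the 'data plus metadata' idea concrete, here is a tiny sketch against the Kettle Java API (class names are from PDI 5.x-era libraries; the field names are made up):

    import org.pentaho.di.core.row.RowMeta;
    import org.pentaho.di.core.row.RowMetaInterface;
    import org.pentaho.di.core.row.value.ValueMetaInteger;
    import org.pentaho.di.core.row.value.ValueMetaString;

    public class RowModelExample {
        public static void main(String[] args) throws Exception {
            // the metadata describes the fields of a row...
            RowMetaInterface meta = new RowMeta();
            meta.addValueMeta(new ValueMetaString("page"));
            meta.addValueMeta(new ValueMetaInteger("count"));

            // ...while the data itself travels between steps as a plain Object[]
            Object[] row = new Object[] { "Bruges", 42L };
            System.out.println(meta.getString(row, 0) + ": " + meta.getInteger(row, 1));
        }
    }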
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. 'Mostly' because a transformation can contain very different steps, and some of them - like collect steps - may try to collect all data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, such as sending an email on success, loading files from the net, or executing different transformations one by one.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all data from the stream, which can be a cause of OutOfMemory exceptions. If your data is too big, consider replacing the 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all data).
Any specific performance guidelines for using it?
There are a lot. They depend on the steps the transformation consists of and on the data sources used. I would rather discuss an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the edtFTP implementation, and there may be some issues with those steps, like some parameters not being saved or an HTTP/FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the main 'Do' is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration that you can search for on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects.
http://forums.pentaho.com/
https://help.pentaho.com/Documentation
