How do I exclude everything with links from appearing in the Pipes output?

I'm new to Yahoo Pipes, and I want to know how to exclude tweets that contain links from appearing in the pipe's output. I know the Filter module can exclude everything containing a keyword (say, college), but I don't know how to exclude links.
I tried a Filter rule of item.description contains [one rule per shortener: bit.ly, goo.gl, and so on], basically blocking everything that uses a URL shortener. Is there a way to effectively block all links without having to enumerate every shortener?
Any help would be greatly appreciated. Thanks
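A sketch of one way to avoid enumerating shorteners, assuming the Filter module offers a "Matches regex" operator: block items whose item.description matches a generic URL pattern instead of listing each domain. Illustrated in JavaScript:

// Pattern for a "Block items that match" Filter rule:
//   item.description  Matches regex  https?://\S+
var urlPattern = /https?:\/\/\S+/;
// What such a rule would block or permit:
console.log(urlPattern.test('Check this out http://bit.ly/abc123')); // true  -> blocked
console.log(urlPattern.test('Great day at college, no links here')); // false -> permitted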

Related

Find Files in Folder - Search Query Includes Parentheses

I have a flow that pulls a list of filenames from an Excel file and then looks for them in a folder. Sometimes the filenames have parentheses in them, which breaks the search query so the flow doesn't even look for the file. I'm not sure how to handle the parentheses, but I don't want to remove them from the filenames (and therefore from the search query). I thought about trimming the parentheses from just the search query, but I want to make sure the right file is found. Perhaps I just need a way to escape the parentheses? I'm not sure how to do that, though.
Here's a picture of the flow section in question:
I tried to find another post on this but after searching for a while I couldn't find anything, so I'm sorry if this has been answered already!
Any help is appreciated!
Edit: I'm going to try replacing any parentheses found with %28/%29 per Expiscornovus' suggestion.
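For reference, the substitution described in that edit is just two replacements; a minimal JavaScript sketch of it (in a flow this would typically be a pair of nested replace() expressions):

// Percent-encode parentheses in a filename before using it as a search query.
function encodeParens(name) {
  return name.replace(/\(/g, '%28').replace(/\)/g, '%29');
}
console.log(encodeParens('Report (Final).xlsx')); // Report %28Final%29.xlsx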
Can you use a different search mode in the settings of your Find Files in Folder action (OneDriveSearch instead of Pattern)?
Ignore my previously suggested encoding. Entering the search query with the parentheses left in should work.

Multiple XPath expressions

Can somebody help me write this expression? It's driving me crazy!
Take this site as an example: http://www.jigsaw-online.com/
I'm trying to build up an expression where I can get all links under a category of my choice.
E.g. I want all four <a> links under 'New In'.
I can get the New In link via
//header//li[@class='nav-level-1-list']/a[contains(text(),'New In')]
I've then tried going up a level to then get all the links via:
//header//li[@class='nav-level-1-list']//li/a
but that doesn't work because it's still trying to find an anchor that contains 'New In'
How can I combine these two expressions together so I can get all the links under the category?
//header//li[@class='nav-level-1-list']/ul[preceding-sibling::a[contains(text(),'New In')]]//li/a (untested; I have nothing at hand to run XPath against the HTML)
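To sanity-check an expression like that, you can run it in the browser console on the page itself; a minimal sketch using the standard document.evaluate API, with the class name and link text taken from the question:

// Find the ul whose preceding sibling anchor says 'New In',
// then collect every anchor in its nested list items.
var xpath = "//header//li[@class='nav-level-1-list']" +
            "/ul[preceding-sibling::a[contains(text(),'New In')]]//li/a";
var result = document.evaluate(xpath, document, null,
                               XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < result.snapshotLength; i++) {
  console.log(result.snapshotItem(i).href);
}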

How do I have my plugin fire before another in DocPad so I can preprocess the content?

I would like to create a plugin which preprocesses content like markdown before it is passed to marked.
I don't want to create yet another extension to tack onto the filename; I would rather just search for a pattern in the content and, if found, do a substitution before marked has a chance to render.
I tried using the render event, but my plugin seems to fire after marked even though its name sorts below it. What order are the plugins run in?
I also tried using the renderBefore event, but I can't figure out how to manipulate the content from there.
Any help would be appreciated.
Thanks in advance!
Jeff
Adding a plugin.priority affects the order in which plugins are called: greater priorities are executed first.
The default plugin priority is 500.
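A minimal sketch of a plugin that uses this, assuming the usual DocPad BasePlugin pattern and the render event's opts object; the plugin name, the priority value, and the substituted pattern are all illustrative:

// docpad-plugin-preprocess: priority 700 > the default 500, so it runs before marked.
module.exports = function (BasePlugin) {
  class PreprocessPlugin extends BasePlugin {}
  PreprocessPlugin.prototype.name = 'preprocess';
  PreprocessPlugin.prototype.priority = 700;
  PreprocessPlugin.prototype.render = function (opts) {
    // Only touch markdown sources; substitute before marked gets to render.
    if (opts.inExtension === 'md' || opts.inExtension === 'markdown') {
      opts.content = opts.content.replace(/@@DATE@@/g, new Date().toDateString());
    }
  };
  return PreprocessPlugin;
};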

RoboHelp CSH always goes to the first help page

I have a WebHelp content directory created using RoboHelp 9. From a web application, I'm trying to display a specific help page using their CSH JavaScript API:
RH_ShowHelp(0, "WebHelp/index.htm>MainWindow", HH_HELP_CONTEXT, <some map id>);
The problem is, the resultant popup always displays the first help topic, regardless of the map id I pass. Does the map file that was created for the RoboHelp project need to be included somewhere in the resultant WebHelp directory? I would think that RoboHelp would handle including whatever it needed in the generated content.
I think what's more likely is that I messed up somewhere in generating the map file/ids. To generate the map ids, I did the following:
Created a new map file
Double clicked it to open the map file window
Selected everything from the right list block (all the topics and help sections)
Clicked 'Auto Generate'
Are there further steps I need to follow before CSH will work?
Perhaps you forgot to include your mapfile in the generated output.
This is done in Web Help, under Content Categories.
Then, you can specify the topic number in the last argument to RH_ShowHelp.
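For instance, once the map file is in the output, the last argument carries the map number (101 here is hypothetical; it would come from the project's map file):

// 101 is a hypothetical map number defined in the project's map file.
RH_ShowHelp(0, "WebHelp/index.htm>MainWindow", HH_HELP_CONTEXT, 101);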
Are you using the published output (not the generated output) in your content directory?
If that doesn't help, you can use simple links like this, which open the specified topic in help in the Help framework:
http://example.com/WebHelp/index.htm#someSubfolderThatIsAChildOfTheRootHelpFolder/theTopicYouWant.htm

Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want it to scrape XPath matches from a URL list I've generated with another program (which has more options than the 'Crawl Web' operator in RapidMiner).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way to make RapidMiner just read the URLs and scrape the XPath matches from each of them?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect the operators in the right order.
I need some input to keep my motivation going. I would like to know what operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages internally using RapidMiner is a two-step process:
Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference: instead of the Crawl Web operator, use the Process Documents from Web operator. There will be no option to specify an output directory, because the results will be loaded into the ExampleSet. The ExampleSet will contain the links matching the crawling rules.
Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40, with the following difference: put the Extract Information subprocess inside the Process Documents from Web operator created previously. The ExampleSet will contain the links and the attributes matching the XPath queries.
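For example, a query one might enter in the Extract Information operator to pull each page's title would be //h:title/text(); the h: namespace prefix follows the convention the tutorial uses for the XHTML that RapidMiner converts pages into (assuming that convention applies to your setup).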
I have pretty much the same problem as you, and maybe these posts from the RapidMiner forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)
