Yahoo Pipes: filter items in a feed based on words in a text file - yahoo

I have a pipe that filters an RSS feed and removes any item that contains "stopwords" that I've chosen. Currently I've manually created a filter for each stopword in the pipe editor, but the more logical way is to read these from a file. I've figured out how to read the stopwords out of the text file, but how do I apply the filter operator to the feed, once for every stopword?
The documentation states explicitly that operators can't be applied within the loop construct, but hopefully I'm missing something here.

You're not missing anything - the filter operator can't go in a loop.
Your best bet might be to generate a regex out of the stopwords and filter using that. e.g. generate a string like (word1|word2|word3|...|wordN).
You may have to escape any odd characters. Also I'm not sure how long a regex can be so you might have to chunk it over multiple filter rules.

In addition to Gavin Brock's answer the following Yahoo Pipes
filters the feed items (title, description, link and author) according to multiple stopwords:
Pipes Info
Pipes Edit
Pipes Demo
Inputs
_render=rss
feed=http://example.com/feed.rss
stopwords=word1-word2-word3

Related

How to use uBlock static filter operator

uBlock wiki says:
Filters such as ||example.com^ are still considered generic.
What is || operator? What does the above filter do?
uBlock uses/supports some AdBlockPlus syntax, you can refer to docs on it:
https://help.eyeo.com/en/adblockplus/how-to-write-filters#creating-filters
You might want to block http://example.com/banner.gif as well as https://example.com/banner.gif and http://www.example.com/banner.gif. You can do this by putting two pipe symbols in front of the filter. This ensures that the filter matches at the beginning of the domain name: ||example.com/banner.gif, and blocks all of these addresses while not blocking http://badexample.com/banner.gif or http://gooddomain.example/analyze?http://example.com/banner.gif.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should i analyze my query-string and what type of query should i use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" are separate tokens. The "A" token might get cut out by the stop token filter (See Standard Analyzer). So without any custom analyzers, you will typically get any documents with just "PDF".
You can try creating your own analyzer modeled off the standard analyzer that includes a Mapping Char Filter. The idea would that "PDF/A" might get transformed into something like "pdf_a" at index and query time. A simple match query will work just fine. But this is a very simplistic approach and you might want to consider how '/' characters are used in your content and use slightly more complex regex filters which are also not perfect solutions.
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters i now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html)
As of not using a query parser it is a bit limited (e.g. no field-queries like id:5), but it solves the purpose.

ElasticSearch Nest AutoComplete based on words split by whitespace

I have AutoComplete working with ElasticSearch (Nest) and it's fine when the user types in the letters from the begining of the phrase but I would like to be able to use a specialized type of auto complete if it's possible that caters for words in a sentence.
To clarify further, my requirement is to be able to "auto complete" like such:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
body => body
.Index(indexName)
.SuggestCompletion("content-suggest" + Guid.NewGuid(),
descriptor =>
descriptor
.OnField(t => t.Title.Suffix("completion"))
.Text(searchTerm)
.Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer here but I leave it here since the title/description is probably a little easier to search by search engines.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
#Kha i think it's better to use the NGram Tokenizer
So you should use this tokenizer when you create the mapping.
If you want more info, and maybe an example write back.

Yahoo pipes: how can I add an additional nodes/elements to RSS/feed items

I am merging two feeds using Yahoo pipes and using the output feed on a website. However, as would like to identify the "feed source" for each item in the output feed. Is it possible to manipulate the original feeds so I can add another node/element to the feed items?
Thanks
One way to do that is using the Regex operator. Let's say you want to add a new field called source. You could use Regex with parameters:
In: item.source
replace: .*
with: (the text you want)
See it in action here:
http://pipes.yahoo.com/janos/7a3b9993cfc143d414fe7b637b1bd95a
That is, I have two feeds, I added a source attribute in the first with value "Question 1" and in the second with value "Question 2".
As an added bonus interesting undocumented Yahoo Pipes hack, I used one more Regex after the Union to make the source appear in the title.
However, this only adds the attribute to the node in the pipe debugger. You can use it for further processing, like I added it here to the title, it won't create a <source> tag in the output. That's because the RSS output of Yahoo Pipes removes all other fields that are not in the RSS standard. You can still see it in the JSON output though.

Yahoo Pipes and YQL: Can I strip HTML tags from my items?

Given this pipe, I'm trying to strip all HTML tags from -div class="post-text"- and return plain text.
In other words, for this stackoverflow question, the first item should return:
"Background: Over the next month, I'll
be giving three talks...
{...}
complex generic signatures (e.g. Enumerable.Join)"
Can anyone assist?
At face value, it would be convenient to get the HTML-less text content in the YQL select clause, but I'd settle for a subsequent Regex module if that is the only way.

Resources