Only return slimmed down version of feed - yahoo-pipes

I'm trying to create a new Yahoo Pipe that will return only a slimmed-down version of an XML feed.
Say my original XML looks like:
<?xml version="1.0" encoding="UTF-8" ?>
<name>Joe bloggs</name>
<age>31</age>
<description>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse aliquam metus id eros blandit vel convallis nunc accumsan. Fusce adipiscing eros a enim feugiat vestibulum. Cras vulputate malesuada neque vel ultricies. Nunc commodo condimentum risus, eu interdum odio rutrum ut. Nullam nec neque eget dolor tristique dignissim sit amet non nibh. Donec sagittis, elit eget tempus laoreet, tellus eros gravida nunc, eu elementum sem turpis eget velit. In hac habitasse platea dictumst. Donec sed nibh nec arcu feugiat malesuada nec sollicitudin neque. Morbi egestas gravida blandit. Praesent luctus ipsum sed sem porta a tempus ipsum congue. Cras non lectus metus. Fusce non purus quam, vel convallis urna. Aenean dignissim consequat tincidunt. Nunc posuere pulvinar est, id pretium sem vestibulum non</description>
I'm trying to create a Yahoo Pipe that will change the tag names, for which I'm using the Rename module, and that works fine.
Now, I'm wanting to get rid of the description tag, so my XML only returns name and age.
How can I do that with yahoo pipes?
Cheers in advance for any help

Use the Regex module on the description field and replace .* with an empty text field. That deletes the field.

Use the "Create RSS" module as the last step in the chain. Then only include the fields you want.

Related

Processing a specific part of a text according to pattern from AWK script

I'm developing a script in awk to convert a TeX document into HTML, according to my preferences.
#!/bin/awk -f
BEGIN {
FS="\n";
print "<html><body>"
}
# Function to print a row with one argument to handle either a 'th' tag or 'td' tag
function printRow(tag) {
for(i=1; i<=NF; i++) print "<"tag">"$i"</"tag">";
}
NR>1 {
[conditions]
printRow("p")
}
END {
print "</body></html>"
}
It's at a very early stage of development, as you can see.
\documentclass[a4paper, 11pt, titlepage]{article}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{imakeidx}
[...]
\begin{document}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.
\end{document}
What I want is for the script to interpret only the lines between \begin{document} and \end{document}, since everything before that is library imports, variable definitions, etc., which don't interest me at the moment.
How do I make it so that it only processes the text within that pattern?
AWK has a feature called range patterns: when you provide two conditions separated by a comma, the action is applied only to the lines between those matching the two conditions (including the matching lines themselves). Consider the following simple example; let file.txt content be
junk
\begin{document}
desired text
more desired text
\end{document}
more junk
then
awk '$0=="\\begin{document}",$0=="\\end{document}"{print}' file.txt
gives output
\begin{document}
desired text
more desired text
\end{document}
(tested in gawk 4.2.1)
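The two range conditions can also be regular expressions instead of exact string comparisons. A minimal sketch of the same range written with regex conditions (the file path is illustrative):

```shell
# sample input, as in the answer above
cat > /tmp/range_demo.txt <<'EOF'
junk
\begin{document}
desired text
more desired text
\end{document}
more junk
EOF

# range pattern with regex conditions; the default action is to print
awk '/^\\begin{document}/,/^\\end{document}/' /tmp/range_demo.txt
```

This prints the same four lines, from \begin{document} through \end{document} inclusive.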
Use a regex to set a flag and then print based on that flag:
awk '/^\\begin{document}/{flag=1}
flag
/^\\end{document}/{flag=0}' file
That prints everything between the start and end strings, inclusive:
\begin{document}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.
\end{document}
If you only want the text between and not including the start and end strings:
awk '
/^\\begin{document}/{flag=1; next}
/^\\end{document}/{flag=0}
flag' file
Prints:
# leading blank line printed...
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.
# ending blank line printed...
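If those padding blank lines are unwanted too, one possible variant (not from the original answer) adds an NF test so only non-empty lines strictly between the delimiters are printed. Note this also removes intentional blank lines such as paragraph breaks:

```shell
# sample input with blank padding around the body
cat > /tmp/flag_demo.tex <<'EOF'
\documentclass{article}
\begin{document}

Lorem ipsum dolor sit amet.

\end{document}
EOF

# print only non-empty lines strictly between the delimiters
awk '/^\\begin{document}/{flag=1; next}
     /^\\end{document}/{flag=0}
     flag && NF' /tmp/flag_demo.tex
```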

How to use HEREDOC to pass as an argument to a method?

Code example:
create_data_with(
first: "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
second: <<~TEXT
Aenean vel ex bibendum, egestas tortor sit amet, tempus lorem. Ut sit
amet rhoncus eros. Vestibulum ante ipsum primis in faucibus orci
luctus et ultrices posuere cubilia curae; Quisque non risus vel lacus
tristique laoreet. Curabitur quis auctor mauris, nec tempus mauris.
TEXT,
third: "Nunc aliquet ipsum at semper sodales."
)
The error is present in this line:
second: <<~TEXT
RuboCop describes it like this:
Lint/Syntax: unterminated string meets end of file
(Using Ruby 3.1 parser; configure using TargetRubyVersion parameter, under AllCops)
second: <<~TEXT
Can you please tell me what should be the syntax? I need to keep the look and use of <<~.
Another option is to put the whole call on a single line and move the heredoc body after it; the body begins on the line following the one that contains the <<~TEXT marker:
create_data_with(first: "foo", second: <<~TEXT, third: "bar")
Aenean vel ex bibendum, egestas tortor sit amet, tempus lorem. Ut sit
amet rhoncus eros. Vestibulum ante ipsum primis in faucibus orci
luctus et ultrices posuere cubilia curae; Quisque non risus vel lacus
tristique laoreet. Curabitur quis auctor mauris, nec tempus mauris.
TEXT
For longer values, you could use multiple heredocs:
create_data_with(first: <<~FIRST, second: <<~SECOND, third: <<~THIRD)
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
FIRST
Aenean vel ex bibendum, egestas tortor sit amet, tempus lorem. Ut sit
amet rhoncus eros. Vestibulum ante ipsum primis in faucibus orci
luctus et ultrices posuere cubilia curae; Quisque non risus vel lacus
tristique laoreet. Curabitur quis auctor mauris, nec tempus mauris.
SECOND
Nunc aliquet ipsum at semper sodales.
THIRD
With heredocs, the parser expects the exact delimiter, alone on its line, to close the literal. You open with TEXT, but your closing line is TEXT, (with a trailing comma), which is not the exact delimiter, so Ruby doesn't consider the literal closed. However, you can (and should in this case) put the comma directly after the opening delimiter. Here's a fix:
create_data_with(
first: "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
second: <<~TEXT,
Aenean vel ex bibendum, egestas tortor sit amet, tempus lorem. Ut sit
amet rhoncus eros. Vestibulum ante ipsum primis in faucibus orci
luctus et ultrices posuere cubilia curae; Quisque non risus vel lacus
tristique laoreet. Curabitur quis auctor mauris, nec tempus mauris.
TEXT
third: "Nunc aliquet ipsum at semper sodales."
)
You can even call methods on a heredoc this way. For example, before the squiggly heredoc (<<~TEXT) existed, the same dedenting was done in Rails as <<-TEXT.strip_heredoc.
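As a quick illustration of that chaining (a minimal sketch, not from the original answer): the heredoc opener is an expression, so a method chained onto it runs on the finished string.

```ruby
# A method chained onto the opening delimiter is called on the
# final string, after <<~ has stripped the common indentation.
text = <<~TEXT.upcase
  lorem ipsum
  dolor sit amet
TEXT

puts text
```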

How to split a file in bash by pattern if find a number

I have a text like:
1Lorem ipsum dolor sit amet, consectetur adipiscing elit. 2Vivamus dictum, justo mattis sollicitudin pretium, ante magna gravida ligula, 3a condimentum libero tortor sit amet lectus. Nulla congue mauris quis lobortis interdum. 4Integer eget ante mattis ante egestas suscipit. Suspendisse imperdiet pellentesque risus, a luctus sem pellentesque nec. Curabitur vel luctus eros. Morbi id magna sit amet 5risus hendrerit porta. Praesent vitae sapien in nunc aliquet pharetra vitae sed lectus. Donec id magna magna. Phasellus eget rhoncus purus, vitae vestibulum nisl. 6Phasellus massa mi, ultricies id mi sit amet, tristique auctor mi.
I want to split the text at each number found, whatever it is, like:
1Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2Vivamus dictum, justo mattis sollicitudin pretium, ante magna gravida ligula,
3a condimentum libero tortor sit amet lectus. Nulla congue mauris quis lobortis interdum.
...
In awk, I tried:
cat text | awk -F'/^[-+]?[0-9]+$/' '{for (i=1; i<= NF; i++) print $i}'
Where -F is /^[-+]?[0-9]+$/, a pattern meant to test whether something is a number or not. But it doesn't split the text.
If I change the pattern to any ordinary separator it works without problems, so what pattern should I use here?
I would harness GNU AWK for this task in the following way. Let file.txt content be
1Lorem ipsum dolor sit amet, consectetur adipiscing elit. 2Vivamus dictum, justo mattis sollicitudin pretium, ante magna gravida ligula, 3a condimentum libero tortor sit amet lectus. Nulla congue mauris quis lobortis interdum. 4Integer eget ante mattis ante egestas suscipit. Suspendisse imperdiet pellentesque risus, a luctus sem pellentesque nec. Curabitur vel luctus eros. Morbi id magna sit amet 5risus hendrerit porta. Praesent vitae sapien in nunc aliquet pharetra vitae sed lectus. Donec id magna magna. Phasellus eget rhoncus purus, vitae vestibulum nisl. 6Phasellus massa mi, ultricies id mi sit amet, tristique auctor mi.
then
awk 'BEGIN{RS="[-+]?[0-9]+"}{printf "%s%s%s", $0, NR==1?"":"\n", RT}' file.txt
gives output
1Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2Vivamus dictum, justo mattis sollicitudin pretium, ante magna gravida ligula,
3a condimentum libero tortor sit amet lectus. Nulla congue mauris quis lobortis interdum.
4Integer eget ante mattis ante egestas suscipit. Suspendisse imperdiet pellentesque risus, a luctus sem pellentesque nec. Curabitur vel luctus eros. Morbi id magna sit amet
5risus hendrerit porta. Praesent vitae sapien in nunc aliquet pharetra vitae sed lectus. Donec id magna magna. Phasellus eget rhoncus purus, vitae vestibulum nisl.
6Phasellus massa mi, ultricies id mi sit amet, tristique auctor mi.
Explanation: I inform GNU AWK that the record separator (RS) is - or + repeated 0 or 1 times, followed by a digit repeated 1 or more times. Then for every record I printf the content of said record, followed by a newline (only for non-first records), followed by the matched record terminator (RT).
(tested in gawk 4.2.1)
This inserts a new line before every number, except the first, and also strips any whitespace before the new line.
sed -E 's/[[:blank:]]*([0-9]+)/\
\1/g; s/\n//'
You still have the problem of numbers within each line which are regular content. These will also have a new line prepended.
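With GNU sed, the embedded literal newline can also be written as \n directly in the replacement (a GNU extension; the portable form is the literal newline shown above). A short sketch on an abbreviated input:

```shell
# split before every number, then remove the newline inserted
# before the very first number (GNU sed)
printf '%s\n' '1Lorem ipsum dolor. 2Vivamus dictum. 3a condimentum.' |
  sed -E 's/[[:blank:]]*([0-9]+)/\n\1/g; s/\n//'
```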
There is absolutely no need for vendor-proprietary solutions:
{m,n,g}awk '
(NF=NF)+gsub("[0-9]+[^0-9]+[.]? ","&\n")+gsub("[ \t]+\n",FS)' FS='\n' OFS= \
RS='^$' ORS=
_
1Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2Vivamus dictum, justo mattis sollicitudin pretium, ante magna gravida ligula,
3a condimentum libero tortor sit amet lectus. Nulla congue mauris quis lobortis interdum.
4Integer eget ante mattis ante egestas suscipit. Suspendisse imperdiet pellentesque risus, a luctus sem pellentesque nec. Curabitur vel luctus eros. Morbi id magna sit amet
5risus hendrerit porta. Praesent vitae sapien in nunc aliquet pharetra vitae sed lectus. Donec id magna magna. Phasellus eget rhoncus purus, vitae vestibulum nisl.
6Phasellus massa mi, ultricies id mi sit amet, tristique auctor mi.

How to ignore URL when searching using ElasticSearch?

Hi, I have a set of documents which contain some text but may also have URLs inside them:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam tincidunt metus a convallis imperdiet. Praesent interdum magna ut lorem bibendum vehicula. Maecenas consectetur tortor a ex pulvinar, sit amet sollicitudin nunc maximus. Pellentesque non gravida ligula, imperdiet pharetra odio. Nunc non massa vitae mauris tempor tempus. Nulla ac laoreet tellus. Nulla consequat tortor eu eros euismod bibendum. Curabitur ante ligula, aliquet at lacus at, pretium convallis eros. Fusce id mi condimentum, tempor lorem ut, pharetra libero.
https://document.io/document/ipsum
In eget eleifend neque. Morbi ex leo, tincidunt non enim ut, rutrum suscipit metus. Cras laoreet ex ut massa consequat condimentum. Aenean finibus eu nisl ut rhoncus. Aliquam finibus nisl risus, id facilisis justo rutrum et. Aenean enim libero, commodo id mi ut, mattis sollicitudin tellus. Aliquam molestie ligula sit amet lorem malesuada, aliquet pretium dolor malesuada. Phasellus fringilla libero in sollicitudin tristique. Quisque molestie, enim et aliquam dapibus, ex erat ultrices nisi, luctus ornare lorem metus eu sapien.
I am using a match query to search for words inside the documents; however, as you can see, sometimes the URL contains words that are also part of the actual text. This is messing up the results. I am just wondering if ElasticSearch has a way for me to simply ignore the URLs and focus only on the text?
I am using the english analyzer for this field at the moment.
You can use a Pattern replace character filter in your analyzer. To remove URLs from your text, you can add a char_filter like this to your analyzer (the filter name strip_urls is arbitrary):
Filter:
"char_filter": {
"strip_urls": {
"type": "pattern_replace",
"pattern": "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
"replacement": ""
}
}
This filter replaces URLs with an empty string, so you will not get results from URL matches.
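For context, here is one way such a filter might be wired into index settings; this is a sketch, and the index name, the filter name strip_urls, and the analyzer name english_no_urls are illustrative rather than part of the original answer:

```json
PUT /documents
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_urls": {
          "type": "pattern_replace",
          "pattern": "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
          "replacement": ""
        }
      },
      "filter": {
        "english_stemmer": { "type": "stemmer", "language": "english" },
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      },
      "analyzer": {
        "english_no_urls": {
          "type": "custom",
          "char_filter": ["strip_urls"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "english_stemmer"]
        }
      }
    }
  }
}
```

The char_filter runs before tokenization, so the URLs never reach the index; the filter chain approximates the stock english analyzer.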

Is the following lossless data compression algorithm theoretically valid?

I am wondering if the following algorithm is a valid lossless data compression algorithm (although not practical with traditional computers, maybe quantum computers?).
At a high and simplified level, the compression steps are:
Calculate the character frequency of the uncompressed text.
Calculate the SHA3-512 (or another hash function) of the uncompressed text.
Concatenate the SHA3-512 and the character frequency (this is now the compressed text that would be written to a file).
And at a high and simplified level, the decompression steps are:
Using the character frequency in the compressed file, generate a permutation of the uncompressed text (keeping track of which permutations have already been tried).
Calculate the SHA3-512 of the generated permutation in step 1.
If the SHA3-512 calculated in step 2 matches the SHA3-512 in the compressed file, the decompression is complete. Else, go to step 1.
Would it be possible to have a SHA3-512 collision with a permutation of the uncompressed text (i.e. can two permutations of a given character frequency have the same SHA3-512?)? If so, when could this start happening (i.e. after how many uncompressed text characters?)?
One simplified example is as follows:
The uncompressed text is: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas et enim vitae ligula ultricies molestie at ac libero. Duis dui erat, mollis nec metus nec, porttitor scelerisque enim. Aenean feugiat tellus sit amet facilisis imperdiet. Fusce et nisl porta, aliquam quam eget, mollis sapien. Sed purus est, efficitur elementum quam quis, congue rutrum libero. Etiam metus leo, hendrerit ac dui in, hendrerit blandit sem. Etiam pellentesque enim dapibus luctus volutpat. Praesent aliquet ipsum vitae mauris pulvinar, et pharetra leo semper. Nulla a mauris tellus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Integer sollicitudin dui sapien, in tempus arcu facilisis in. Vivamus dui dolor, faucibus eu accumsan eu, porttitor id risus. In auctor congue pellentesque. Cras malesuada enim eget est vehicula pretium. Phasellus scelerisque imperdiet lorem, eu euismod lectus convallis consequat. Nam vitae euismod est, vitae lacinia arcu. Praesent fermentum sit amet erat feugiat cursus. Pellentesque magna felis, euismod vel vehicula eu, tincidunt ac ex. Vestibulum viverra justo nec orci semper, nec consequat justo faucibus. Curabitur dignissim feugiat nulla, in cursus nunc facilisis id. Suspendisse potenti. Etiam commodo turpis non fringilla semper. Vivamus aliquam ex non lorem tincidunt, et sagittis tellus placerat. Proin malesuada tortor eu viverra faucibus. Curabitur euismod orci lorem, ut fermentum velit consectetur vel. Nullam sodales cursus maximus. Curabitur nec turpis erat. Vestibulum eget lorem nunc. Morbi laoreet massa vel nulla feugiat gravida. Nulla a rutrum neque. Phasellus maximus tempus neque, eu sagittis ex volutpat ac. Duis malesuada sem vitae lacus suscipit, eu dictum elit euismod. Sed id sagittis leo. Sed convallis nisi nisl, vel pretium elit cursus vel. Duis quis accumsan odio. Ut arcu ex, iaculis a lectus sit amet, lacinia pellentesque enim. Donec maximus ante odio, a porta odio luctus at. 
Nullam dapibus aliquet sollicitudin. Sed ultrices iaculis blandit. Suspendisse dapibus, odio non venenatis faucibus, justo urna euismod neque, non finibus ante ante in massa. Sed sit amet nunc vel lacus dictum euismod. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Interdum et malesuada fames ac ante ipsum primis in faucibus. Fusce varius lacus velit, venenatis consequat justo rutrum nec. Nunc cursus odio arcu, nec egestas purus feugiat nec. Aliquam efficitur ornare ullamcorper. Mauris consectetur, quam vitae ultricies ullamcorper, nulla nulla tempus risus, aliquet euismod urna erat gravida neque. Suspendisse et viverra enim, ut facilisis enim. Quisque quis elit diam. Morbi quis nulla bibendum, molestie risus egestas, pharetra nisl. Aliquam sed massa dictum, scelerisque odio vel, finibus tellus. Nam tristique commodo sem, a dictum risus euismod sed. Morbi vel urna nec sem consectetur auctor quis ac augue. Donec ac pellentesque tortor. In hendrerit ultricies consequat. Pellentesque non metus vitae elit euismod efficitur in in leo. Nulla ac pulvinar nunc. Donec porttitor nunc ante, et congue augue laoreet ac. Vivamus bibendum id est eleifend efficitur. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc arcu neque, molestie ac lorem id, feugiat efficitur erat. Vestibulum vel condimentum lectus, eu euismod turpis.".
The character frequency is: "⎵:501 e:345 i:277 u:266 s:240 t:226 a:219 l:161 n:154 r:147 m:132 c:128 o:117 d:79 .:64 p:54 ,:47 v:40 q:39 f:35 g:31 b:31 h:11 P:9 N:9 S:8 x:7 D:6 V:6 M:5 I:4 C:4 j:4 L:3 A:3 E:3 F:2 U:1 Q:1".
The SHA3-512 is: "45ebde65cf667d1bfdcf779baab84301c1d4abe60448be821adda9cf7b99b36a61c53233db4a0eda93a04c75201be13bbb638b5e78f5047560fffc97f1c95adb".
The compressed file contents are: "45ebde65cf667d1bfdcf779baab84301c1d4abe60448be821adda9cf7b99b36a61c53233db4a0eda93a04c75201be13bbb638b5e78f5047560fffc97f1c95adb⎵:501 e:345 i:277 u:266 s:240 t:226 a:219 l:161 n:154 r:147 m:132 c:128 o:117 d:79 .:64 p:54 ,:47 v:40 q:39 f:35 g:31 b:31 h:11 P:9 N:9 S:8 x:7 D:6 V:6 M:5 I:4 C:4 j:4 L:3 A:3 E:3 F:2 U:1 Q:1".
Your compression method assumes that there is only one permutation of the given character frequency table that will generate the given hash code. That's provably false.
A 512-bit hash can represent on the order of 1.34E+154 unique values. The number of permutations in a 100-character file is 100!, or 9.33E+157.
Given a 100-character file, there are, on average, over 6,900 different permutations for each possible 512-bit hash code.
Using a larger hash code won't help. The number of hash codes doubles with each bit you add, but the number of possible permutations grows by a much larger factor with each character you add to the file.
