Can anybody tell me what's wrong with my pipe?
I am trying to make a hand-built RSS feed from this webpage and put the full HTML content of the links into the description tag. The output looks fine in the debug panel, but when I run this pipe, it returns an empty result.
I will try to explain what I think is wrong.
The first YQL is good.
The first loop is unnecessary.
The second loop's YQL is a problem because it will give you an object instead of a string. You want the Fetch Page module instead.
For the final loop with the RSS Item Builder, I believe you just want a Rename module instead.
Problem Summary:
Hi, I'm trying to learn to use the Scrapy framework for Python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I wanted to use a different site for practice rather than just copying their Alibaba example. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down the tutorial page on the scrapehero site, at the "Construct Xpath selectors for the product list" section). The problem is I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to figure out the right syntax, but I haven't been able to get it.
Background info:
So what I want is: from https://www.mlb.com/scores, an xpath() command which will return an array with all the games displayed.
Following along with the tutorial, my understanding of how to do this is to inspect the elements on the webpage, determine their class/id, and specify that in the xpath command.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[@class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[@class='g5-component']")
response.xpath("//li[@class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[@class='mlb-scores__list-item']")
response.xpath("//div[@data-game-pk-id > 0]")
response.xpath("//div[contains(@class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've only been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page just by leaving out the predicates, but whenever I try to specify one I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try using Scrapy-Splash to render the target content from your start_urls, or you can find the direct HTTP request used to fetch the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
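If you go the JSON route there is no XPath at all; you just walk the response dictionary. Here is a minimal sketch of that, assuming the statsapi response keeps its usual `dates[].games[]` nesting (the inline sample below stands in for the live response, so the exact field names are assumptions taken from that shape, not guaranteed):

```python
import json

# Stand-in for the body returned by the statsapi schedule endpoint;
# the real response has the same nesting but many more fields.
sample = json.loads("""
{
  "dates": [
    {
      "date": "2019-06-26",
      "games": [
        {
          "gamePk": 565432,
          "status": {"detailedState": "Final"},
          "teams": {
            "away": {"team": {"name": "New York Yankees"}, "score": 4},
            "home": {"team": {"name": "Boston Red Sox"}, "score": 2}
          }
        }
      ]
    }
  ]
}
""")

def list_games(schedule):
    """Flatten the dates[].games[] nesting into one simple dict per game."""
    games = []
    for date in schedule.get("dates", []):
        for game in date.get("games", []):
            teams = game["teams"]
            games.append({
                "gamePk": game["gamePk"],
                "away": teams["away"]["team"]["name"],
                "home": teams["home"]["team"]["name"],
                "state": game["status"]["detailedState"],
            })
    return games

for g in list_games(sample):
    print(g["away"], "at", g["home"], "-", g["state"])
```

In a real spider you would fetch that URL with a plain Request and call `json.loads(response.text)` instead of using the inline sample.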
I'm building a Yahoo Pipe using the XPath Fetch Page module. You can see my pipe here.
I can see the right information in the Debugger (it shows 13 items), but when I run my pipe the list of 13 items comes back as null.
I've already tried choosing the "Emit items as string" option in the XPath Fetch Page module and then using a regex to strip the unwanted HTML code, but I get the same result when running.
Can you point me in the right direction to fix this problem?
Thanks
I am working on a project using a bash shell script. The idea is to grep a wget-retrieved page in order to pick out a certain paragraph on the web page. The area I would like to copy usually starts with a
<p><b>
but the paragraph also contains other bits of HTML code, such as anchor tags, that I don't want in the output of the grep.
I have tried
cat page.html | grep "<p><b>" > grep.txt
and then I grep the output file, which now contains the paragraph I want
cat grep.txt|grep -v '<p>|<b>|<a>' >grep.txt
but then all it does is clear everything from the file and not read anything. How can I get it to exclude only the HTML code?
I am also trying to follow the links that are in the paragraph I grep, in order to do the same thing with those pages. Only 2 levels deep, so the main page and then whatever sub-page(s) stem from the first paragraph of the main page. I know this is a difficult idea; hopefully I explained it well enough to get some help. If you have any ideas, any help is appreciated.
Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.
I've used it for parsing HTML in the past and it's the easiest tool I could find, with good documentation for dealing with HTML.
Perhaps you could write a standalone Python script that extracts the HTML and then echoes the string you're after. The Python script could then be called from inside your bash script if you have some bash functions you want to perform on the string.
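As a rough sketch of that idea, here using only the standard library's html.parser rather than Beautiful Soup so it stays dependency-free: a script that prints the text of the first paragraph, with inner tags such as `<b>` and `<a>` stripped out (the inline HTML is a made-up example standing in for the wget-retrieved page):

```python
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element, dropping all inner
    tags (anchors, bold markers, etc.) but keeping their text."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_p and not self.done:
            self.text.append(data)

# Stand-in for page.html; a real script would read the file instead.
html = '<html><body><p><b>Lead:</b> see <a href="/x">this link</a>.</p><p>Other.</p></body></html>'
parser = FirstParagraph()
parser.feed(html)
print("".join(parser.text))  # prints: Lead: see this link.
```

From bash you could then call the script and capture its stdout, e.g. `text=$(python3 extract.py)`, instead of chaining greps.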
I know this is 7 years old, but I'm just posting the solution I have:
https://api.jquery.com/jquery.grep/
Using Yahoo Pipes, I know that there are many ways to take something from a string and insert it into a feed. I am wondering if it is possible the other way around. So far I have been unsuccessful in taking something, for instance a title (item.title), and turning it into a string.
What I want to accomplish is taking words from an RSS description and placing them into a URL, for the URL Builder.
Use the Fetch feed module to get the raw feed.
Then use a loop module, and put a string builder inside it. Make the source in the string builder be item.description. If there are other parts you want to concatenate together with it, add those in to the string builder as well. Assign that value to item.link (or wherever you want it).
You can repeat the loop with a regular expression inside it if you need to process the URL further.
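Pipes wires those steps up graphically, but the logic they perform can be sketched in plain Python; the feed items, base URL, and field names below are placeholders for illustration, not anything Pipes-specific:

```python
import re

# Toy stand-in for a fetched feed: each item is a dict, as in Pipes.
items = [
    {"title": "First Post", "description": "hello world news"},
    {"title": "Second Post", "description": "more breaking stories"},
]

for item in items:
    # String Builder step: concatenate a base URL with item.description.
    raw = "https://example.com/search?q=" + item["description"]
    # Regex step: process the URL further, e.g. turn spaces into '+'.
    item["link"] = re.sub(r"\s+", "+", raw)

print(items[0]["link"])  # https://example.com/search?q=hello+world+news
```

The assignment to `item["link"]` mirrors assigning the String Builder's output to item.link inside the Loop module.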
I'm looking into the possibilities of JMeter, and it looks just great. However, one of the things my testing script should be able to do is search for some values and click on a random resulting link.
So what I would need to automate is:
Entering the values in the search box (I could do this by using the correct GET URL in a second page, but how do I do this 5000 times?)
Clicking on one of the results listed.
Thanks for the help!
This can be done using the CSV Data Set Config; it loads a CSV file and feeds the contents into the variables of your choice:
http://jmeter.apache.org/usermanual/component_reference.html#CSV_Data_Set_Config
After that, you can use the Regular Expression Extractor to extract the URLs you want from the resulting HTML, and follow those links:
http://jmeter.apache.org/usermanual/component_reference.html#Regular_Expression_Extractor
You could use a HTML Link Parser for your second part (the clicking on one of the results):
http://jmeter.apache.org/usermanual/component_reference.html#HTML_Link_Parser
See http://theworkaholic.blogspot.co.at/2009/11/randomly-clicking-links-in-jmeter.html for an example of using the link parser in a context similar to your question.
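In JMeter all of the above is GUI configuration rather than code, but the flow it produces is roughly the following, sketched in Python (the CSV layout, HTML, and URLs are made up for illustration):

```python
import csv, io, random, re

# Stand-in for the CSV Data Set Config: one search term per row.
csv_data = "term\nshoes\nbooks\nlamps\n"
terms = [row["term"] for row in csv.DictReader(io.StringIO(csv_data))]

# Stand-in for the HTML a search request would return.
results_html = (
    '<a href="/item/1">one</a> '
    '<a href="/item/2">two</a> '
    '<a href="/item/3">three</a>'
)

random.seed(0)  # deterministic only for this sketch
for term in terms:
    # Regular Expression Extractor step: pull out the candidate links.
    links = re.findall(r'href="([^"]+)"', results_html)
    # HTML Link Parser / random-click step: follow one link at random.
    choice = random.choice(links)
    print(f"search {term!r} -> click {choice}")
```

The outer loop over `terms` is what the CSV Data Set gives you for free across 5000 rows: each sampler iteration reads the next row into a variable.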