How to select data in scrapy from html file having both class and id? - xpath

<div class="section-body" id="section-2"><p>Most people with aortic stenosis do not develop symptoms until the disease is advanced. The diagnosis may have been made when the health care provider heard a heart murmur and performed tests.</p><p>Symptoms of aortic stenosis include:</p><ul><li>Chest discomfort: The chest pain may get worse with activity and reach into the arm, neck, or jaw. The chest may also feel tight or squeezed.</li><li>Cough, possibly bloody.</li><li>Breathing problems when exercising.</li><li>Becoming easily tired.</li><li>Feeling the heartbeat (palpitations).</li><li>Fainting, weakness, or dizziness with activity.</li></ul><p>In infants and children, symptoms include:</p><ul><li>Becoming easily tired with exertion (in mild cases)</li><li>Failure to gain weight</li><li>Poor feeding</li><li>Serious breathing problems that develop within days or weeks of birth (in severe cases)</li></ul><p>Children with mild or moderate aortic stenosis may get worse as they get older. They are also at risk for a heart infection called bacterial endocarditis.</p></div></div></section>
I have the HTML above and I want to scrape the data in the list, i.e. in
I have tried the following commands in Scrapy, but they do not work; I get '[]' as output.
response.css("article div.section-body p").extract() <-- this returns everything under section-body, but I want only what is under section-2
response.css("article div.section-body.section-2 p::text").extract()
response.xpath("//article/*[contains(@id, 'setion-2')]").extract()
Please help me extract this. Thanks

Try
response.css("article div.section-body#section-2 p::text").extract()
div.section-body#section-2 selects the div that has both the class section-body and the ID section-2.
Note that an ID is selected with # and a class with a dot. The selector in your question used .section-2, which looks for a class named section-2 rather than an ID, so it matched nothing.
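The same "class and id together" condition can also be written as an XPath predicate. The sketch below is only an illustration using Python's standard-library ElementTree on a shortened, well-formed stand-in for the page (the article wrapper and the section-1 div are assumptions, not taken from the question). In Scrapy, a similar expression would be passed to response.xpath instead.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the page in the question.
# The <article> wrapper and the section-1 div are assumed for illustration.
html = """
<article>
  <div class="section-body" id="section-1"><p>Intro text.</p></div>
  <div class="section-body" id="section-2">
    <p>Symptoms of aortic stenosis include:</p>
    <ul><li>Cough, possibly bloody.</li><li>Becoming easily tired.</li></ul>
  </div>
</article>
"""

root = ET.fromstring(html)

# Two chained predicates: class="section-body" AND id="section-2".
section = root.find(".//div[@class='section-body'][@id='section-2']")

# Direct <p> children and all <li> descendants of the matched div only.
paragraphs = [p.text for p in section.findall("p")]
items = [li.text for li in section.findall(".//li")]

print(paragraphs)
print(items)
```

One difference to keep in mind: @class='section-body' is an exact string comparison, so it would not match an element carrying several classes, whereas the CSS selector .section-body would.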

Related

Algorithm for translating MLB play-by-play records into descriptive text

I'm trying to collect a dataset that could be used for automatically generating baseball articles.
I have play-by-play records of MLB games from retrosheet.org that I would like to write out as plain text, like the sentences that could appear in a recap news article.
Here are some examples of the play-by-play records:
play,2,0,semim001,32,.CBFFFBBX,9/F
play,2,0,phegj001,01,FX,S7/G
play,2,0,martn003,01,CX,3/G
play,2,1,youne003,00,,NP
The following is what I would like to achieve:
For the first example
play,2,0,semim001,32,.CBFFFBBX,9/F,
I want it to be written out as something like:
"semim001 (Marcus Semien) was on three balls and two strikes in the second inning as the away player. He hit the ball into play after one called strike, one ball, three fouls, and another two balls. The fly ball was caught by the right outfielder."
The plays are formatted in the following way:
The first field is the inning, an integer starting at 1.
The second field is either 0 (for visiting team) or 1 (for home team).
The third field is the Retrosheet player id of the player at the plate.
The fourth field is the count on the batter when this particular event (play) occurred. Most Retrosheet games do not have this information, and in such cases, "??" appears in this field.
The fifth field is of variable length and contains all pitches to this batter in this plate appearance and is described below. If pitches are unknown, this field is left empty, nothing is between the commas.
The sixth field describes the play or event that occurred.
Explanations for all the symbols in the fifth and sixth field can be found on this Retrosheet page.
With Python 3, I've been able to format all the info of invariable length into a formatted sentence, which is all but the last two fields. I'm having difficulty in thinking of an efficient way to unparse (correct me if this is the wrong term to use here) the fifth and sixth fields, the pitches and the events that occurred, due to their variable length and wide variety of things that can occur.
I think I could write out all the rules based on the info on the Retrosheet website, but I'm looking for suggestions for a smarter way to do this. I tagged this as natural language processing, hoping it is a trivial problem in that field. Any pointers will be greatly appreciated!
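As a starting point for the rule-based route, here is a minimal Python sketch of a table-driven decoder for the pitch field. It assumes the fields described above follow the leading "play" record type, and it only knows the four pitch codes that can be read off the worked example (C, B, F, X); the function names and the wording of the labels are hypothetical, and any other symbol, such as the leading ".", is simply skipped.

```python
from itertools import groupby

# Only the codes that appear in the question's worked example. The full
# Retrosheet list is longer; symbols not listed here are skipped.
PITCH_NAMES = {
    "C": "called strike",
    "B": "ball",
    "F": "foul",
    "X": "ball put into play",
}

def parse_play(record):
    """Split one 'play' record into named fields."""
    _, inning, team, batter, count, pitches, event = record.split(",")[:7]
    return {
        "inning": int(inning),
        "home": team == "1",   # 0 = visiting team, 1 = home team
        "batter": batter,
        "count": count,        # e.g. "32" = three balls, two strikes
        "pitches": pitches,
        "event": event,
    }

def describe_pitches(pitches):
    """Collapse runs of identical known pitch codes into (name, count) pairs."""
    known = [code for code in pitches if code in PITCH_NAMES]
    return [(PITCH_NAMES[code], len(list(run))) for code, run in groupby(known)]

play = parse_play("play,2,0,semim001,32,.CBFFFBBX,9/F")
print(describe_pitches(play["pitches"]))
```

Keeping the codes in a lookup table means new symbols can be added without touching the logic, and the (name, count) pairs can then be fed into sentence templates. The event field could be handled with a second table in the same way.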

Kendo ui grid (web) row height , fixed cell height

One of the columns in my Kendo grid is the "Notes" column, which holds at least 3000 characters.
The problem I'm facing is that the grid cell (along with the row) expands to fit all the text in the cell, which makes the cell huge.
I would like to make the grid cell a single line with a fixed number of characters and show a tooltip on the cell.
I'm not sure whether I can achieve this.
Please let me know a possible solution for this case.
I have tried some CSS changes:
.k-grid tbody tr{height:38px;} /* not working */
Sample data in a cell in the Column (Notes):
17/11/2010 - Not received many enquiries, uses Extra Sure and Holmans. Finds our Medical Screening too lengthy. Took her through system and printed off Mes Screen questions, will use us more than N J Heritage. She will aim to use us for new enquiries.16/02/11 - L/M - R/C.08/03/2011 - JCO - Spoke to Matthew Salmon, not finding us competitive been using ExtraSure who are a lot cheaper. Advised our USP's and our benefits. Will use us for next travel enquiry.16/06/11 - Jane spoke to Richard and will be taking him through the system, SunWorld Plus.12/10/11-I Have spoken to Richard, I have emailed across details of Sunworld Extra & medical screening along with Username & Password. MP01/11/2011 - JCO spoke to Richard, Richard issuing a quotation today, likes that we don't have any age limits and restricted to 85 on AMT policies. First time to use us, usually use Citybond, likes the look of our product. JCO advised to contact me if requires detailed explanation of system. 25/05/12 - MW - Spoke to Richard, he is very nice, I thanked them for their continued support in using SunWorld and I am sending him the email with the Special Features and the SunWorld Extra info.. He said he rates SunWorld 8 and a half out of ten because he would like to see the option to increase the single article limit and also the rates have gone up quite a bit recently.. He said they don't really do that much travel but they are going to be pushing it over the next year because he said he thinks people are getting fed up of going on the internet to get insurance and realising they aren't actually covered for anything... He is generally happy with everything.08/08/12 - MW - Spoke to Luke, I asked why they hadn't used us since June and he said it was just because of a slow down in enquiries. Travel is not something they push, they just offer it to accommodate their existing clients. 
He said the only use us and one other provider so any enquiries they do get they always quote with us, he has done some quotes this week but they haven't come back. They are very happy with everything. No problems etc. I am sending him the Special Features for 2012 and also SunWorld Extra info,16/08/12 - MW - Spoke to Richard, asked if they would be interested in having a poster, he said it would not really be of any use to them as they are not a high street broker, they are in an office and not customer facing, he said they don't really do much travel, they are mainly a commercial broker but they are happy with any travel business they can do, what they would like is a flyer as opposed to a poster so they can email it out to their customers. He said he thinks the product is great, likes the age limits and limits etc,14/11/12 - MW - Spoke to Luke, told him about Snowman Cover and Broker Survey.04/12/12 - MW - Spoke to Luke, he said they are really quiet at the moment. Only using SunWorld but just not getting the enquiries. he is happy with SunWorld though and I have told him about the Changes for 2013.08/02/13 - MW - I can see that they said back in August that they would like some leaflets so I am sending them some out.28/02/13 - MW - Luke has sent this email - "Sorry, not sure if you are still doing this but can we havesome leaflets to send with our renewals to try and offer your services J thanks" , So I am sending them out some more leaflets.01/05/13 - MW - Spoke to Richard, he said the main person who does the travel, Luke, is on holiday in Turkey for the rest of the week and will be back on Tuesday. I have made a note in my diary to give him a call back on Wednesday.08/05/13 - MW - Spoke to Luke, he said SunWorld are their main travel provider, they have not really had many enquiries for travel lately. He said our rates are competitive for annual but people can get travel insurance so cheap online now that he thinks they have just been doing that. 
I said we will reduce the rates for him and he said that would be good. They are also set up to use us via the AXA route which he said is fine to be deactivated and they will carry on using this one as they have been. He said he would like some leaflets as he never received the last lot so I have checked his address and I am sending out 20 more. I told him about the new product and I am sending him the email with the Underwriting Changes and Special Features for 2013. RATES REDUCED01/07/13 - MW - Spoke to Luke, he said they are only using SunWorld so they must have not had any enquiries and that's why they haven't used us in the last month. He said they only really offer travel insurance to accommodate their existing clients, they don't really push for it. He said they have got the leaflets and they will be sending them out with renewals etc. As soon as they get the enquiries, we will be getting the business.13/09/13 - MW - Spoke to Richard, told him about the new product going live on the 1st October. I am sending him the email with the Underwriting Changes and Special Features for 2013. He said they would like some leaflets so I have confirmed their address and I am sending out 20.01/11/13 - MW - Spoke to Richard, he said they are very quiet at the moment, their customers are not going away and that's why they haven't issued anything. Luke is the main person who deals with this and he will be our contact going forward because he is the one who deals with it most of the time. I told Richard about the video tutorial and he is going to tell Luke and he will let us know if he has any queries. I am sending the email with the New Special Features and further information on the changes we have made to the website. luke.robson#aifltd.co.uk02/01/14 - MW - Spoke to Luke, I have confirmed all the contact information is correct and I have added his email address to the spreadsheet for the 2014 mailer and then he will forward it around to all the others. 
He said they only use SunWorld and it is the easiest to use, they just haven't had any enquiries for travel. I am sending him the email with the New Special Features and information about the changes we have made to the system. He said he has asked for some leaflets before but he has not received them. He really wants some to send out with all his renewals so I am sending him out 60 leaflets.
In addition to limiting the height of the row, you have to say that the overflowing text should be hidden.
Try adding the following style:
.k-grid tbody tr td {
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
See it here: http://jsfiddle.net/OnaBai/8U6rg/1/
To show a tooltip, use a template that puts the value of the cell in both the title attribute and the content of the element. Example of the column definition for a field called name:
{
field: "name",
title: "Name",
template: "<span title='${name}'>${name}</span>"
}
See it here: http://jsfiddle.net/OnaBai/8U6rg/3
EDIT: If you want to show HTML-formatted text in the tooltip, you cannot use standard tooltips; you will have to use KendoUI tooltips.
The steps are as follows.
First, to show the content of the cell as HTML, add the option encoded set to false, something like:
{
field: "name",
title: "Name",
encoded: false
}
Next, to use a KendoUI tooltip for this, what we are going to do is create a KendoUI Tooltip widget for each cell. You should do this once the grid is rendered, so I do it in Grid's dataBound handler:
dataBound: function() {
$("#grid").kendoTooltip({
...
});
}
To limit which elements get a tooltip, I'm going to mark the cells with the CSS class onabai, so now my column definition is:
{
field: "name",
title: "Name",
encoded: false,
template: "<span class='onabai'>#=name#</span>"
}
And the Tooltip in dataBound is:
dataBound: function() {
$("#grid").kendoTooltip({
filter: ".onabai",
position : "left",
width: 200,
...
});
}
But we still have to say what the content of the tooltip is going to be. To do so, we define a content property as a function that returns the content of the cell in the Grid. We do this using e.target.html():
dataBound: function() {
$("#grid").kendoTooltip({
filter: ".onabai",
position : "left",
width: 200,
content: function(e) {
return e.target.html();
}
});
}
You can see this running here: http://jsfiddle.net/OnaBai/8U6rg/8/

XPATH - cannot select grandparent node

I am trying to parse a live betting XML feed and need to grab each bet from within the code. In plain English, I need to use the tag 'EventSelections' as my base query and 'loop' through these tags in the XML, so that I grab all that data and create an entity for each one which I can use on a CMS.
My problem is I want to go up two places in the tree to a grandparent node to gather that info. Each EventID refers to the unique name of a game and some games have more bets than others. It's important that I grab each bet AND the EventID associated with it, problem is, this ID is the grandparent each time. Example:
<Sportsbet Time="2013-08-03T08:38:01.6859354+09:30">
<Competition CompetitionID="18" CompetitionName="Baseball">
<Round RoundID="2549" RoundName="Major League Baseball">
<Event EventID="849849" EventName="Los Angeles Dodgers (H Ryu) At Chicago Cubs (T Wood)" Venue="" EventDate="2013-08-03T05:35:00" Group="MTCH">
<Market Type="Match Betting - BIR" EachWayPlaces="0">
<EventSelections BetSelectionID="75989549" EventSelectionName="Los Angeles Dodgers">
<Bet Odds="1.00" Line=""/>
</EventSelections>
<EventSelections BetSelectionID="75989551" EventSelectionName="Chicago Cubs">
<Bet Odds="17.00" Line=""/>
</EventSelections>
Does anyone know how I can grab the grandparent tags as well?
Currently I am using:
//EventSelections (this is the context)
.//@BetSelectionID
.//@EventSelectionName
I have tried dozens of different ways to do this, including the ../.. operator, which won't work either. I'd be eternally grateful for any help on this. Thanks.
I think you just haven't gone far enough up the tree.
../* is a two-step location path written with abbreviations; it expands to parent::node()/child::*, so in effect you go up the tree with the first step but back down the tree with the second step.
Therefore, ../* gives you your siblings (parent's children), ../../* gives you your aunts and uncles (grandparent's children), and ../../../* gives you your grandparent and its siblings (great-grandparent's children).
For attributes, ../@* is an abbreviation for parent::node()/attribute::*. Attributes are attached to elements; they are not considered children. So in the second step you are going sideways, not down the tree.
Therefore, unlike above, ../@* gives you your parent's attributes, while ../../@* gives you your grandparent's attributes.
But using // in your situation is really inappropriate. // is an abbreviation for /descendant-or-self::node()/, which walks all the way down the tree to its leaves. It should be used only on rare occasions (and I cringe when I see it abused in SO questions).
So ..//..//..//@RoundID may work for you, but it is in effect addressing attributes all over the tree and not just an attribute of your great-grandparent, which is why it is finding the attribute of your grandparent. ../../@RoundID should be all you need to get the attribute of your grandparent.
If you torture a stylesheet long enough, it will eventually work for you, but it really is more robust and likely faster executing to address things properly.
You could go with ancestor::Event/@EventID, which does exactly what you asked for: it matches an ancestor element named Event and returns its EventID attribute.
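To make the "up two levels" idea concrete outside of a stylesheet, here is a small Python sketch using the standard-library ElementTree. It is an illustration only: the feed fragment from the question has been shortened and given the closing tags it was missing, and because ElementTree elements do not keep a pointer to their parent, a child-to-parent map (the name parent_of is made up here) stands in for the .. step.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, shortened and closed so that it parses.
xml = """
<Sportsbet>
  <Competition CompetitionID="18" CompetitionName="Baseball">
    <Round RoundID="2549" RoundName="Major League Baseball">
      <Event EventID="849849" EventName="Los Angeles Dodgers (H Ryu) At Chicago Cubs (T Wood)">
        <Market Type="Match Betting - BIR" EachWayPlaces="0">
          <EventSelections BetSelectionID="75989549" EventSelectionName="Los Angeles Dodgers">
            <Bet Odds="1.00" Line=""/>
          </EventSelections>
          <EventSelections BetSelectionID="75989551" EventSelectionName="Chicago Cubs">
            <Bet Odds="17.00" Line=""/>
          </EventSelections>
        </Market>
      </Event>
    </Round>
  </Competition>
</Sportsbet>
"""

root = ET.fromstring(xml)

# Map every element to its parent so we can walk upwards.
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
for selection in root.iter("EventSelections"):
    market = parent_of[selection]   # one step up:  ..
    event = parent_of[market]       # two steps up: ../..
    rows.append((
        event.get("EventID"),
        selection.get("BetSelectionID"),
        selection.get("EventSelectionName"),
        selection.find("Bet").get("Odds"),
    ))

for row in rows:
    print(row)
```

Each row pairs one bet selection with the EventID of its grandparent, which is the pairing the question asks for.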

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text from a document to index it for search. The code below mostly works, except that various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this, but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.
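For comparison, the same idea (drop script and style content, and put whitespace where block-level tags were) can be sketched with Python's standard-library html.parser. This is not Nokogiri and is not meant as a replacement for the answers above; the class name, the function name, and the list of block-level tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}                          # contents dropped entirely
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3"}  # illustrative, not exhaustive

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")   # a block boundary becomes a space

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html):
    """Return the visible text with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

sample = '<div><p>It was "Bob Griffin."</p><p>After that run</p><script>var x = 1;</script></div>'
print(extract_text(sample))
```

Inline tags such as a or em are deliberately left out of BLOCK, so words split by inline markup are not broken apart by extra spaces.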

Prolog list issue

I have the following rules:
/*The structure of a subject teaching team takes the form:
team(Subject, Leader, Non_management_staff, Deputy).
Non_management_staff is a (possibly empty) list of teacher
structures and excludes the teacher structures for Leader and
Deputy.
teacher structures take the form:
teacher(Surname, Initial,
profile(Years_teaching,Second_subject,Club_supervision)).
Assume that each teacher has his or her team's Subject as their
main subject.*/
team(computer_science,teacher(may,j,profile(20,ict,model_railways)),
[teacher(clarke,j,profile(32,ict,car_maintenance))],
teacher(hamm,p,profile(11,ict,science_club))).
team(maths,teacher(vorderly,c,profile(25,computer_science,chess)),
[teacher(o_connell,d,profile(10,music,orchestra)),
teacher(brankin,p,profile(20,home_economics,cookery_club))],
teacher(lynas,d,profile(10,pe,football))).
team(english,teacher(brewster,f,profile(30,french,french_society)),
[ ],
teacher(flaxman,j,profile(35,drama,debating_society))).
team(art,teacher(lawless,m,profile(20,english,film_club)),
[teacher(walker,k,profile(25,english,debating_society)),
teacher(brankin,i,profile(20,home_economics,writing)),
teacher(boyson,r,profile(30,english,writing))],
teacher(carthy,m,profile(20,music,orchestra))).
I am supposed to return the initial and surname of any leader of a team that contains a total of 2 or more teachers with ict as their second subject.
I am new to Prolog, so I am unsure how to do this. I do get the correct results back, but each one is returned 3 times.
Any help on this would be greatly appreciated.
Also, my apologies if this is terribly easy.
You didn't provide the code you use to find these teachers, so I can't say for sure, but if there is a team with 3 members with ict as their second subject (for example, computer_science), then there are 3 ways to pick 2 of them (AB, AC, and BC), which would explain your multiple results. But saying how to modify your code to fix that would require seeing the code to be fixed.
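The counting argument can be checked with a few lines of Python (not Prolog). The sketch below is an assumption about what the unseen code does: it compares a search that produces one answer per pair of ict teachers with one that counts the ict teachers once per team. The team data is trimmed from the facts in the question to surname, initial, and second subject.

```python
from itertools import combinations

# (subject, leader, non_management_staff, deputy); each teacher is
# (surname, initial, second_subject), trimmed from the question's facts.
teams = [
    ("computer_science", ("may", "j", "ict"),
     [("clarke", "j", "ict")], ("hamm", "p", "ict")),
    ("maths", ("vorderly", "c", "computer_science"),
     [("o_connell", "d", "music"), ("brankin", "p", "home_economics")],
     ("lynas", "d", "pe")),
    ("english", ("brewster", "f", "french"), [], ("flaxman", "j", "drama")),
    ("art", ("lawless", "m", "english"),
     [("walker", "k", "english"), ("brankin", "i", "home_economics"),
      ("boyson", "r", "english")],
     ("carthy", "m", "music")),
]

def ict_members(team):
    """All teachers in the team (leader, staff, deputy) with ict as second subject."""
    _, leader, staff, deputy = team
    return [t for t in [leader, *staff, deputy] if t[2] == "ict"]

# One answer per pair of ict teachers: 3 teachers give 3 pairs (AB, AC, BC).
per_pair = [(team[1][1], team[1][0])
            for team in teams
            for _ in combinations(ict_members(team), 2)]

# One answer per team: count the ict teachers once and compare with 2.
per_team = [(team[1][1], team[1][0])
            for team in teams
            if len(ict_members(team)) >= 2]

print(per_pair)
print(per_team)
```

If the Prolog code behaves like the first variant, collecting the matching teachers into a list and testing its length, or cutting after the first solution, would be the usual ways to get a single answer.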
