Unable to fetch the next_maintext of 2nd page - xpath

I have a page1 and a page2 URL for the same article. I want to fetch all the content from the 1st URL and only the main text from the 2nd URL, then append it to the main text of the 1st URL. This is only one article. The function parse_indianexpress_archive_links() returns a list of news article URLs. I'm getting all the results from page1, but the next_maintext column for page2 only outputs <GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>
class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = parse_indianexpress_archive_links()

    def parse(self, response):
        items = ScrapycrawlerItem()
        separator = ''
        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url
        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract(), key=len)[-27:]  #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <= 10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract(), key=len)[-27:]
        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline = separator.join(headline)
        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()
        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r', '')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()
        items['date_updated'] = date_updated
        items['headline'] = headline
        items['maintext'] = maintext
        items['image_url'] = image_url
        items['article_url'] = article_url
        next_page_url = response.xpath("//a[@rel='canonical']/@href").extract_first()
        if next_page_url:
            items['next_maintext'] = scrapy.Request(next_page_url, callback=self.parse_page2)
        yield items

    def parse_page2(self, response):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r', '')
        yield {next_maintext}
Output:
article_url,date_publish,date_updated,description,headline,image_url,maintext,next_maintext
http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/,,"Fri Apr 03 2009, 14:49 hrs ",,Congress approves 2010 budget plan,http://static.indianexpress.com/m-images/M_Id_69893_Obama.jpg,"The Democratic-controlled US Congress on Thursday approved budget blueprints embracing President Barack Obama's agenda but leaving many hard choices until later and a government deeply in the red. With no Republican support, the House of Representatives and Senate approved slightly different, less expensive versions of Obama's $3.55 trillion budget plan for fiscal 2010, which begins on October 1. The differences will be worked out over the next few weeks. Obama, who took office in January after eight years of the Republican Bush presidency, has said the Democrats' budget is critical to turning around the recession-hit US economy and paving the way for sweeping healthcare, climate change and education reforms he hopes to push through Congress this year. Obama, traveling in Europe, issued a statement praising the votes as ""an important step toward rebuilding our struggling economy."" Vice President Joe Biden, who serves as president of the Senate, presided over that chamber's vote. Democrats in both chambers voted down Republican alternatives that focused on slashing massive deficits with large cuts to domestic social spending but also offered hefty tax breaks for corporations and individuals. ""Democrats know that those policies are the wrong way to go,"" House Majority Leader Steny Hoyer told reporters. ""Our budget lays the groundwork for a sustained, shared and job-creating recovery."" But Republicans have argued the Democrats' budget would be a dangerous expansion of the federal government and could lead to unnecessary taxes that would only worsen the country's long-term fiscal situation. ""The Democrat plan to increase spending, to increase taxes, and increase the debt makes no difficult choices,"" said House Minority Leader John Boehner. ""It's a roadmap to disaster."" The budget measure is nonbinding but it sets guidelines for spending and tax bills Congress will consider later this year. BIPARTISANSHIP ABSENT AGAIN Obama has said he hoped to restore bipartisanship when he arrived in Washington but it was visibly absent on Thursday. ... contd.",<GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>

This is not how Scrapy works: assigning a Request to an item field does not fetch anything (see How to fetch the Response object of a Request synchronously on Scrapy?).
But in fact you don't need synchronous requests. All you need is to check for a next page and pass the current state (the item) to the callback that will process the next page. I'm using cb_kwargs (it's the recommended way now). You may need to use request.meta if you have an older Scrapy version.
import scrapy

class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = ['http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/']

    def parse(self, response):
        item = {}
        separator = ''
        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url
        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract(), key=len)[-27:]  #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <= 10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract(), key=len)[-27:]
        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline = separator.join(headline)
        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()
        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r', '')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()
        item['date_updated'] = date_updated
        item['headline'] = headline
        item['maintext'] = maintext
        item['image_url'] = image_url
        item['article_url'] = article_url
        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(
                url=next_page_url,
                callback=self.parse_next_page,
                cb_kwargs={
                    'item': item,
                },
            )
        else:
            yield item

    def parse_next_page(self, response, item):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r', '')
        item["maintext"] += next_maintext
        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(
                url=next_page_url,
                callback=self.parse_next_page,
                cb_kwargs={
                    'item': item,
                },
            )
        else:
            yield item
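If you're on an older Scrapy version without cb_kwargs, the same pattern works with request.meta; here is a minimal sketch (reusing the parse methods above, without the recursion to further pages) that changes only how the item is passed:

        # in parse(): pass the item through the request's meta dict
        if next_page_url:
            yield scrapy.Request(
                url=next_page_url,
                callback=self.parse_next_page,
                meta={'item': item},
            )
        else:
            yield item

    def parse_next_page(self, response):
        # pull the partially built item back out of response.meta
        item = response.meta['item']
        next_maintext = ' '.join(response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract())
        item["maintext"] += next_maintext.replace('\r', '')
        yield item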

Related

Technical Analysis (MACD) for crypto trading

Background:
I am writing a crypto trading bot for fun and profit.
So far, it connects to an exchange and gets streaming price data.
I am using this price data to create a technical indicator (MACD).
Generally for MACD, it is recommended to use closing prices over 26, 12 and 9 days.
However, for my trading strategy, I plan to use data for 26, 12 and 9 minutes.
Question:
I am getting multiple (say 10) price ticks in a minute.
Do I simply average them and round the time to the next minute (so they all fall in the same minute bucket)? Or is there a better way to handle this?
Many thanks!
This is how I handled it. Streaming data comes in at a < 1s period. The code checks for a new low and high during the streaming period and builds the candle. Probably ugly since I'm not a trained developer, but it works.
Adjust "...round('20s')" and "if dur > 15:" for whatever candle period you want.
# Method of a streaming-client class: self.df, self.candle, self.st, tkr, cnx
# and ctime are assumed to be set up elsewhere in that class.
def on_message(self, msg):
    df = pd.json_normalize(msg, record_prefix=msg['type'])
    df['date'] = df['time']
    df['price'] = df['price'].astype(float)
    df['low'] = df['low'].astype(float)
    # grab the latest tick and the current candle's running high/low
    for i in range(0, len(self.df)):
        if i == (len(self.df) - 1):
            self.rounded_time = self.df['date'][i]
            self.rounded_time = pd.to_datetime(self.rounded_time).round('20s')
            self.lhigh = self.df['price'][i]
            self.lhighcandle = self.candle['high'][i]
            self.llow = self.df['price'][i]
            self.lowcandle = self.candle['low'][i]
            self.close = self.df['price'][i]
    # update the running high/low for the candle being built
    if self.lhigh > self.lhighcandle:
        nhigh = self.lhigh
    else:
        nhigh = self.lhighcandle
    if self.llow < self.lowcandle:
        nlow = self.llow
    else:
        nlow = self.lowcandle
    newdata = pd.DataFrame.from_dict({
        'date': self.df['date'],
        'tkr': tkr,
        'open': self.df.price.iloc[0],
        'high': nhigh,
        'low': nlow,
        'close': self.close,
        'vol': self.df['last_size']})
    # DataFrame.append is deprecated in recent pandas; pd.concat is the modern equivalent
    self.candle = self.candle.append(newdata, ignore_index=True).fillna(0)
    closeit = False  # ensure closeit is defined when the candle period has not elapsed yet
    if ctime > self.rounded_time:
        closeit = True
        self.en = time.time()
    if closeit:
        dur = (self.en - self.st)
        if dur > 15:
            self.st = time.time()
            out = self.candle[-1:]
            out.to_sql(tkr, cnx, if_exists='append')
            dat = ['tkr', 0, 0, 100000, 0, 0]
            self.candle = pd.DataFrame([dat], columns=['tkr', 'open', 'high', 'low', 'close', 'vol'])
As far as I know, most or all technical indicator formulas rely on same-sized bars to produce accurate and meaningful results. You'll have to do some data transformation. Here's an example of an aggregation technique that uses quantization to get all your bars into uniform sizes. It will convert small bar sizes to larger bar sizes; e.g. second to minute bars.
// C#, see link above for more info
quoteHistory
    .OrderBy(x => x.Date)
    .GroupBy(x => x.Date.RoundDown(newPeriod))
    .Select(x => new Quote
    {
        Date = x.Key,
        Open = x.First().Open,
        High = x.Max(t => t.High),
        Low = x.Min(t => t.Low),
        Close = x.Last().Close,
        Volume = x.Sum(t => t.Volume)
    });
See Stock.Indicators for .NET for indicators and related tools.
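Since the question itself is in Python, here is a minimal sketch of the same quantization with pandas, assuming a DataFrame of raw ticks with a datetime index and illustrative 'price' and 'size' columns:

import pandas as pd

def ticks_to_candles(ticks: pd.DataFrame, period: str = "1min") -> pd.DataFrame:
    # bucket ticks into fixed-size bars: open/high/low/close per period
    candles = ticks["price"].resample(period).ohlc()
    candles["volume"] = ticks["size"].resample(period).sum()
    return candles.dropna()  # drop buckets that received no ticks

MACD can then be computed from candles["close"] with the usual 26/12/9-period EMAs.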

problems with the leaderboard discord.py

The leaderboard shows the same username for different users when they have the same value.
I don't know how to solve it, and when I print the variable in the code it gives me only 3 elements and not 4, even though 4 should come out.
code:
@client.command(aliases = ["lb"])
async def leaderboard(ctx, x = 10):
    leader_board = {}
    total = []
    for user in economy_system:
        name = int(user)
        total_amount = economy_system[user]["wallet"] + economy_system[user]["bank"]
        leader_board[total_amount] = name
        total.append(total_amount)
    print(leader_board)
    total = sorted(total, reverse=True)
    embed = discord.Embed(
        title = f"Top {x} Richest People",
        description = "This is decided on the basis of raw money in the bank and wallet",
        color = 0x003399
    )
    index = 1
    for amt in total:
        id_ = leader_board[amt]
        member = client.get_user(id_)
        name = member.name
        print(name)
        embed.add_field(
            name = f"{index}. {name}",
            value = f"{amt}",
            inline = False
        )
        if index == x:
            break
        else:
            index += 1
    await ctx.send(embed=embed)
The print outputs this:
{100: 523967502665908227, 350: 554617490806800387, 1100: 350886488235311126}
Padre Mapper
Flore (Orsolinismo)
Aetna
Aetna
In theory there should also be 100: 488826524791734275 (i.e. my user id) but it doesn't find it.
Your problem comes from this line:
leader_board[total_amount] = name
If total_amount is already a key (e.g. two users have the same amount of money), the previous value (which was a user ID) gets overwritten with the other user ID. So if multiple users have the same amount of money, only one of them will be saved in leader_board.
Then, you have this line:
total.append(total_amount)
In this case, if two users have the same amount of money, you would just have two identical values, which is normal but, considering the problem above, this will create a shift.
Let's say you have ten users with two of them who have the same amount of money. leader_board will only contain 9 items whereas total will contain 10 values. That's the reason why you have two of the same name in your message.
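A tiny illustration of that overwrite, using the two IDs from your output that share the amount 100 (assuming your entry was processed first):

leader_board = {}
leader_board[100] = 488826524791734275   # your ID, stored under key 100
leader_board[100] = 523967502665908227   # same key: the previous ID is overwritten
print(leader_board)                      # {100: 523967502665908227}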
To solve the problem:
@client.command(aliases = ["lb"])
async def leaderboard(ctx, x=10):
    d = {user_id: info["wallet"] + info["bank"] for user_id, info in economy_system.items()}
    leaderboard = {user_id: amount for user_id, amount in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    embed = discord.Embed(
        title = f"Top {x} Richest People",
        description = "This is decided on the basis of raw money in the bank and wallet",
        color = 0x003399
    )
    # enumerate from 1 and only keep the top x entries
    for index, (user_id, amount) in enumerate(list(leaderboard.items())[:x], start=1):
        member = client.get_user(user_id)  # use int(user_id) here if your keys are stored as strings
        embed.add_field(
            name = f"{index}. {member.display_name}",
            value = f"{amount}",
            inline = False
        )
    await ctx.send(embed=embed)
If I guessed right and your dictionary is organized like this, it should work:
economy_system = {
    user_id: {"bank": x, "wallet": y}
}

How can I get the score from Question-Answer Pipeline? Is there a bug when Question-answer pipeline is used?

When I run the following code
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal
"""

questions = [
    "Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
The answer that I get here is:
Question: Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.
Answer: dis is not yet on boarded to ARB portal
How do I get a score for this answer? The score should be similar to what I get when I run the Question-Answering pipeline.
I have to take this approach since the Question-Answering pipeline gives me a KeyError for the code below:
from transformers import pipeline
nlp = pipeline("question-answering")
context = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal.
"""
print(nlp(question="Dis asked if it is possible to post the two invoice in ARB?", context=context))
This is my attempt to get the score. I could not figure out what feature.p_mask is, so I could not remove the non-context indexes that contribute to the softmax at the moment.
# ... assuming question and context are defined as above
import numpy as np
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

inputs = tokenizer(question, context,
                   add_special_tokens=True,
                   return_tensors='pt')
input_ids = inputs['input_ids'].tolist()[0]
outputs = model(**inputs)
# used to compute score
start = outputs.start_logits.detach().numpy()
end = outputs.end_logits.detach().numpy()
# from source code
# Ensure padded tokens & question tokens cannot belong to the set of candidate answers.
#?? undesired_tokens = np.abs(np.array(feature.p_mask) - 1) & feature.attention_mask
# Generate mask
undesired_tokens = inputs['attention_mask']
undesired_tokens_mask = undesired_tokens == 0.0
# Make sure non-context indexes in the tensor cannot contribute to the softmax
start_ = np.where(undesired_tokens_mask, -10000.0, start)
end_ = np.where(undesired_tokens_mask, -10000.0, end)
# Normalize logits and spans to retrieve the answer
start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))
# Compute the score of each tuple(start, end) to be the real answer
outer = np.matmul(np.expand_dims(start_, -1), np.expand_dims(end_, 1))
# Remove candidate with end < start and end - start > max_answer_len
max_answer_len = 15
candidates = np.tril(np.triu(outer), max_answer_len - 1)
scores_flat = candidates.flatten()
idx_sort = [np.argmax(scores_flat)]
start, end = np.unravel_index(idx_sort, candidates.shape)[1:]
end += 1
score = candidates[0, start, end-1]
start, end, score = start.item(), end.item(), score.item()
print(tokenizer.decode(input_ids[start:end]))
print(score)
For more detail, see the pipeline source code.
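For the simpler BERT snippet at the top of the question, a compact way to attach a score to the argmax answer is to softmax the start and end logits and multiply the two probabilities. This is only a sketch, not the exact pipeline computation (the pipeline also masks out question and padding tokens via p_mask), and it reuses the variable names from that first snippet:

import torch

start_probs = torch.softmax(answer_start_scores, dim=-1)
end_probs = torch.softmax(answer_end_scores, dim=-1)
# probability of the chosen start token times probability of the chosen end token
score = (start_probs[0, answer_start] * end_probs[0, answer_end - 1]).item()
print(f"Answer: {answer} (score: {score:.4f})")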

discord.py: variable server-info

Hi, I'm writing a command to get the server information and I'm looking through the API (I've recently started with discord.py).
I just can't figure out some of the variables, so I'm posting only the variable code (the rest works perfectly).
I also looked for questions that could answer mine, but the answers I found did not cover my case (if there are any, I apologize).
async def serverinfo(ctx):
    author = ctx.author.name
    guild = ctx.guild
    name_server = guild.name
    icon_server = guild.icon_url
    create_server = guild.created_at
    owner_server = guild.owner.name
    total_member_server = guild.member_count
    #From here I can't find variables
    online_member_server = guild.online_members
    offline_member_server = guild.offline_members
    human_member_server = guild.memberUser
    bot_member_server = guild.member_bot
    total_channel_server = guild.channels
    text_channel_server = guild.text_channels
    vocal_channel_server = guild.voice_channels
    category_server = guild.categories
    total_role_server = guild.role_count
    boost_level_server = guild.level_boost
    number_boost_server = guild.boost
Some of your variables are valid, like guild.member_count, guild.text_channels, guild.voice_channels, guild.channels and guild.categories, but you use them the wrong way. Except for guild.member_count, those properties return lists, not integers, so you need to use len(property) if you want the total number of them.
Get the total categories, channels, text channels and voice channels:
channels_info = {
    "total categories": len(guild.categories),
    "total channels": len(guild.channels),
    "total text channels": len(guild.text_channels),
    "total voice channels": len(guild.voice_channels)
}
Get the members information:
members_info = {
    "total users": guild.member_count,
    "total online members": sum(member.status == discord.Status.online and not member.bot for member in ctx.guild.members),
    "total offline members": sum(member.status == discord.Status.offline and not member.bot for member in ctx.guild.members),
    "total humans": sum(not member.bot for member in ctx.guild.members),
    "total bots": sum(member.bot for member in ctx.guild.members)
}
Get the role information:
roles_info = {
    "total roles": len(guild.roles)
}
Get the server boost information (premium tier and subscription count):
boosts_info = {
    "boost level": guild.premium_tier,
    "total boosts": guild.premium_subscription_count
}
You also have to turn on the intents and pass fetch_offline_members=True when creating the bot, like this:
client = commands.Bot(command_prefix="c!",intents=intents,fetch_offline_members=True)
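For completeness, a minimal sketch of enabling the intents the member counts rely on (the members and presences intents must also be enabled for the bot in the Discord developer portal):

import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.members = True     # needed to see the full member list
intents.presences = True   # needed to read online/offline status

client = commands.Bot(command_prefix="c!", intents=intents, fetch_offline_members=True)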

Converge on Best Combination of Elements

You have $10,000 to invest in stocks. You are given a list of 200 stocks, and are told to select 8 of those stocks to buy, and also indicate how many of those stocks you want to buy. You cannot spend more than $2,500 on a single stock alone, and each stock has its own price ranging from $100 to $1000. You cannot buy a fraction of a stock, only whole numbers. Each stock also has a value attached to it indicating how profitable it is. This is an arbitrary number from 0-100 that serves as a simple rating system.
The end goal is to list the optimal selection of 8 stocks, and indicate the best quantity of each of those stocks to buy without going over the $2,500 limit for each stock.
• I'm not asking for investment advice; I chose stocks because they act as a good metaphor for the actual problem I'm trying to solve.
• Seems like what I'm looking at is a more complex version of the 0/1 Knapsack problem: https://en.wikipedia.org/wiki/Knapsack_problem.
• No, this isn't homework.
Here is lightly tested code for solving your problem exactly in time that is polynomial in the amount of money available, the number of stocks that you have, and the maximum amount of stock that you can buy.
#! /usr/bin/env python
from collections import namedtuple

Stock = namedtuple('Stock', ['id', 'price', 'profit'])

def optimize(stocks, money=10000, max_stocks=8, max_per_stock=2500):
    Investment = namedtuple('investment', ['profit', 'stock', 'quantity', 'previous_investment'])
    investment_transitions = []
    last_investments = {money: Investment(0, None, None, None)}
    for _ in range(max_stocks):
        next_investments = {}
        investment_transitions.append([last_investments, next_investments])
        last_investments = next_investments

    def prioritize(stock):
        # This puts the best profit/price, as a ratio, first.
        val = [-(stock.profit + 0.0)/stock.price, stock.price, stock.id]
        return val

    for stock in sorted(stocks, key=prioritize):
        # We reverse transitions so we have not yet added the stock to the
        # old investments when we add it to the new investments.
        for transition in reversed(investment_transitions):
            old_t = transition[0]
            new_t = transition[1]
            for avail, invest in old_t.items():  # .iteritems() in Python 2
                for i in range(int(min(avail, max_per_stock)/stock.price)):
                    quantity = i+1
                    new_avail = avail - quantity*stock.price
                    new_profit = invest.profit + quantity*stock.profit
                    if new_avail not in new_t or new_t[new_avail].profit < new_profit:
                        new_t[new_avail] = Investment(new_profit, stock, quantity, invest)

    best_investment = investment_transitions[0][0][money]
    for transition in investment_transitions:
        for invest in transition[1].values():
            if best_investment.profit < invest.profit:
                best_investment = invest

    purchase = {}
    while best_investment.stock is not None:
        purchase[best_investment.stock] = best_investment.quantity
        best_investment = best_investment.previous_investment
    return purchase

optimize([Stock('A', 100, 10), Stock('B', 1040, 160)])
And here it is with the tiny optimization of deleting investments once we see that continuing to add stocks to them cannot improve on the current best. This will probably run orders of magnitude faster than the old code with your data.
#! /usr/bin/env python
from collections import namedtuple

Stock = namedtuple('Stock', ['id', 'price', 'profit'])

def optimize(stocks, money=10000, max_stocks=8, max_per_stock=2500):
    Investment = namedtuple('investment', ['profit', 'stock', 'quantity', 'previous_investment'])
    investment_transitions = []
    last_investments = {money: Investment(0, None, None, None)}
    for _ in range(max_stocks):
        next_investments = {}
        investment_transitions.append([last_investments, next_investments])
        last_investments = next_investments

    def prioritize(stock):
        # This puts the best profit/price, as a ratio, first.
        val = [-(stock.profit + 0.0)/stock.price, stock.price, stock.id]
        return val

    best_investment = investment_transitions[0][0][money]
    for stock in sorted(stocks, key=prioritize):
        profit_ratio = (stock.profit + 0.0) / stock.price
        # We reverse transitions so we have not yet added the stock to the
        # old investments when we add it to the new investments.
        for transition in reversed(investment_transitions):
            old_t = transition[0]
            new_t = transition[1]
            # iterate over a copy so we can safely delete dead entries from old_t
            for avail, invest in list(old_t.items()):
                if avail * profit_ratio + invest.profit <= best_investment.profit:
                    # We cannot possibly improve with this or any other stock.
                    del old_t[avail]
                    continue
                for i in range(int(min(avail, max_per_stock)/stock.price)):
                    quantity = i+1
                    new_avail = avail - quantity*stock.price
                    new_profit = invest.profit + quantity*stock.profit
                    if new_avail not in new_t or new_t[new_avail].profit < new_profit:
                        new_invest = Investment(new_profit, stock, quantity, invest)
                        new_t[new_avail] = new_invest
                        if best_investment.profit < new_invest.profit:
                            best_investment = new_invest

    purchase = {}
    while best_investment.stock is not None:
        purchase[best_investment.stock] = best_investment.quantity
        best_investment = best_investment.previous_investment
    return purchase
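A small usage sketch with made-up prices and ratings (the return value maps each chosen Stock to the quantity to buy):

stocks = [
    Stock('A', 100, 10),    # price $100, rating 10
    Stock('B', 1040, 160),  # price $1040, rating 160
    Stock('C', 250, 35),    # price $250, rating 35
]
purchase = optimize(stocks, money=10000, max_stocks=8, max_per_stock=2500)
for stock, quantity in purchase.items():
    print(f"buy {quantity} x {stock.id} at ${stock.price} each")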
