I am running a spider that pulls information such as prices and shipping. The shipping information comes back like this: "Shipping:$.99,Shipping:,Shipping:,Shipping:$.49". The code that extracts it looks like this:
item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
Can I rewrite this line to pull just the price after the "Shipping:"?
Use a combination of substring-after and substring-before, i.e.:
substring-before(
substring-after(
"Shipping:$.99,Shipping:,Shipping:,Shipping:$.49",
"Shipping:"),
","
)
In XPath 1.0, there is no way to fetch all shipping amounts for an arbitrary number of shipping fees. You could query the 2nd, 3rd, ... value by repeatedly calling substring-after($string, "Shipping:") to remove the preceding value.
(Linebreaks can be omitted, of course.)
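For readers more familiar with Python than XPath, the two functions behave roughly like the sketch below. The helper names substring_after and substring_before are hypothetical Python stand-ins written for illustration, not part of Scrapy or lxml; they mimic the XPath 1.0 rule that both functions return an empty string when the separator is absent.

```python
def substring_after(text, sep):
    """Mimic XPath substring-after: text after the first sep, or '' if sep is absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def substring_before(text, sep):
    """Mimic XPath substring-before: text before the first sep, or '' if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


raw = "Shipping:$.99,Shipping:,Shipping:,Shipping:$.49"

# First value: strip the leading label, then cut at the next comma.
first = substring_before(substring_after(raw, "Shipping:"), ",")

# Second value: drop the first entry, then repeat the same pattern.
rest = substring_after(raw, "Shipping:")
second = substring_before(substring_after(rest, "Shipping:"), ",")

print(repr(first))   # '$.99'
print(repr(second))  # ''
```

Note that the last entry has no trailing comma, so the substring-before step returns an empty string for it; appending a comma to the input first is one way around that.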
You can extract the prices using a regular expression:
import re
str = "Shipping:$.99,Shipping:,Shipping:,Shipping:$.49"
re.findall(r'\d*\.\d+', str)
['.99', '.49']
EDIT
To have 0 if there is no shipping:
[float(x) if x else 0 for x in re.sub('Shipping:[$]?','',str).split(',')]
[0.99, 0, 0, 0.49]
I want to use a random value from the GET method, in the POST method.
List car = Audi, Porsche, Ford, VW, Honda, Citroen
$['carTypes'][0]['carType']['enum']
Result: [Audi]
$['carTypes'][${=(int)(Math.random()*6)}]['carType']['enum']
Result: [Porsche] (1 random car from the list of 6 available)
I would like to get a list of random cars but not limited to just one car - random list of cars but in the range of 0 to 6, not only 1 value.
Result: [Audi,Porsche]
Result: [Ford, VW, Honda]
Result: [Citroen]
I have tried like this.
$['carTypes'][${=(int)(Math.random()*6)},${=(int)(Math.random()*6)}]['carType']['enum']
Result: [[Citroen, Honda]]
The double square brackets [[ probably prevent me from using this data in the POST method. How can I get rid of the unnecessary brackets?
Groovy
import groovy.json.JsonOutput
Random random = new Random()
def list = ["Porsche","Ford","VW"]
def randomValue = random.nextInt(list.size())
def list2 = ["Porsche","Ford","VW"]
def randomValue2 = random.nextInt(list2.size())
def theValue = list2[randomValue2] +","+ list[randomValue]
I will be grateful for your help.
Instead of putting the above in your post step, you could create a Groovy script step in between the GET and POST requests.
In the Groovy script, you can then 'build' the string exactly how you want, including the removal of the brackets. The last line in the script should be a return statement that returns the string you built.
In the POST request, you can then 'pull' in the value from the groovy script step using the $ functionality. E.g. ${Groovy Script Name#result}
I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
    # word_count is the bag-of-words dict for a single review
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0
products['awesome'] = products['word_count'].apply(awesome_count)
So far so good, but I would need to create a separate function by hand for each of the selected words, e.g. great_count, etc. How can I avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review':['this book is awesome',
'I hate this book']})
products['word_count'] = \
graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually found an easier way to do this:
def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0
for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
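The helper above is essentially a dictionary lookup with a default of 0. A minimal sketch of the same loop, using plain Python dicts as a stand-in for the word_count column so that it runs without GraphLab Create (the sample rows are made up for illustration):

```python
selected_words = ["awesome", "hate"]

# Hypothetical stand-in for products['word_count']: one bag-of-words dict per review.
word_counts = [
    {"this": 1, "book": 1, "is": 1, "awesome": 1},
    {"i": 1, "hate": 1, "this": 1, "book": 1},
]

# One list of counts per selected word; dict.get supplies 0 for missing words.
columns = {
    word: [wc.get(word, 0) for wc in word_counts]
    for word in selected_words
}

print(columns)  # {'awesome': [1, 0], 'hate': [0, 1]}
```

With an SFrame, the same idea should be expressible as products['word_count'].apply(lambda wc: wc.get(word, 0)) inside the loop, assuming each entry of the column is a dict.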
I have 2 expressions to fill in the column for the current amount and prior amount:
Current Amount: =IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!AP.Value, Fields!DEPR.Value, "")
Prior Amount: =IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!PRAP.Value, Fields!DEPR.Value, "")
What I need to do is fill in a 3rd column (called "Diff") by subtracting the value in the prior amount field from the value in the current amount field.
I tried the following expression, which subtracts one from the other to get the difference:
=(=IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!AP.Value, Fields!DEPR.Value, 0)) – (=IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!PRAP.Value, Fields!DEPR.Value, 0))
However, I get the following error message:
The Value expression for the textrun ‘Textbox6.Paragraphs[0].TextRuns[0]’ contains an error: [BC30037] Character is not valid.
FYI, Textbox6 is the cell where this expression resides. Any help in correcting this expression would be greatly appreciated. Thanks for your help.
I think there are two problems with your Diff expression.
First, I would remove the extra equal signs, keeping only the leading one:
=IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!AP.Value, Fields!DEPR.Value, 0) - IIf(Fields!ACCOUNTING_PERIOD.Value = Parameters!PRAP.Value, Fields!DEPR.Value, 0)
The second thing is that, according to the error message, your fields seem to be formatted as characters, not doubles. So you will need to convert them before subtracting them. Ideally you should already do this in your SQL code with:
CAST(DEPR AS numeric)
Hope this helps
I am working on a very basic WEKA assignment, and I'm trying to use WEKA to preprocess data from the GUI (most current version). I am trying to do very basic if statements and mathematical statements in the expression box when double clicking on MathExpression and I haven't had any success. For example I want to do
if (a5 == 2 || a5 == 0) then y = 1; else y = 0
Many different variations of this haven't worked for me and I'm also unclear on how to refer to "y" or if it needs a reference within the line.
Another example is -abs(log(a7)–3) which I wasn't able to work out either. Any ideas about how to make these statements work?
From the javadoc of MathExpression:
The 'A' letter refers to the value of the attribute being processed. Other attribute values (numeric only) can be accessed through the variables A1, A2, A3, ...
Your filter applies to all attributes of your dataset. If I load the iris dataset and apply the following filter:
weka.filters.unsupervised.attribute.MathExpression -E log(A)
the values of the sepallength attribute change as follows:
            Before filter   After filter
Minimum     4.3             1.459
Maximum     7.9             2.067
Mean        5.843           1.755
StdDev      0.828           0.141
Also, according to the javadoc, there is no if/else statement, only an ifelse function. You should therefore write something like:
ifelse ( (A == 2 || A == 0), 1,0 )
This filter also applies to all attributes. If you want to change only one attribute based on the values of other attributes, you need to use the "Ignore range" option and refer to the other attribute values as A1, A2, ...
If you need to add a new attribute, use AddExpression, which the javadoc describes as:
An instance filter that creates a new attribute by applying a mathematical expression to existing attributes.
I've a list of strings which I want to group by their suffix and then print the values right-aligned, padding the left side with spaces.
What is the pythonic way to do that?
My current code is:
def find_pos(needle, haystack):
    for i, v in enumerate(haystack):
        if str(needle).endswith(v):
            return i
    return -1
# Show only Error and Warning things
search_terms = "Error", "Warning"
errors_list = filter(lambda item: str(item).endswith(search_terms), dir(__builtins__))
# alphabetical sort
errors_list.sort()
# Sort the list so Errors come before Warnings
errors_list.sort(lambda x, y: find_pos(x, search_terms) - find_pos(y, search_terms))
# Format for right-aligning the string
size = str(len(max(errors_list, key=len)))
fmt = "{:>" + size + "s}"
for item in errors_list:
    print fmt.format(item)
An alternative I had in mind was:
size = len(max(errors_list, key=len))
for item in errors_list:
    print str.rjust(item, size)
I'm still learning Python, so other suggestions for improving the code are welcome too.
Very close.
fmt = "{:>{size}s}"
for item in errors_list:
    print fmt.format(item, size=size)
The two sorting steps can be combined into one:
errors_list.sort(key=lambda x: (find_pos(x, search_terms), x))
Generally, using the key parameter is preferred over using cmp. Documentation on sorting
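As a small sketch of how a tuple key combines both orderings: the first tuple element decides the group (Errors before Warnings) and the second breaks ties alphabetically. The sample names are a made-up subset chosen for illustration, and find_pos is repeated from the question so the snippet runs on its own.

```python
search_terms = ("Error", "Warning")


def find_pos(needle, haystack):
    """Return the index of the first suffix in haystack that needle ends with, else -1."""
    for i, suffix in enumerate(haystack):
        if str(needle).endswith(suffix):
            return i
    return -1


# Hypothetical sample standing in for the filtered dir(__builtins__) list.
names = ["UserWarning", "KeyError", "DeprecationWarning", "IOError"]

# Sort by suffix group first, then alphabetically within each group.
names.sort(key=lambda name: (find_pos(name, search_terms), name))

print(names)  # ['IOError', 'KeyError', 'DeprecationWarning', 'UserWarning']
```

Putting the name first in the tuple would make the second element irrelevant, since distinct names never tie.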
If you are interested in the length anyway, using the key parameter to max() is a bit pointless. I'd go for
width = max(map(len, errors_list))
Since the length does not change inside the loop, I'd prepare the format string only once:
right_align = ">{}".format(width)
Inside the loop, you can now use the built-in format() function (i.e. not the str.format method):
for item in errors_list:
    print format(item, right_align)
str.rjust(item, size) is usually and preferably written as item.rjust(size).
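A short sketch comparing the two approaches on a made-up list of names; both should produce identical right-aligned strings. The print call is written with parentheses so the snippet also runs on Python 3.

```python
# Hypothetical sample list; any list of strings works the same way.
names = ["IOError", "KeyError", "DeprecationWarning"]

# Width of the longest name, computed once before the loop.
width = max(map(len, names))

# Format spec for the built-in format(): right-align within `width` columns.
right_align = ">{}".format(width)

formatted = [format(name, right_align) for name in names]
justified = [name.rjust(width) for name in names]

for line in formatted:
    print(line)
```

Which one to use is largely a matter of taste; rjust is arguably the more direct choice when padding is the only formatting needed.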
You might want to look here, which describes how to right-justify using str.rjust and using print formatting.