Test the difference in PROVISION - syntax

I have a dataset which includes data for the countries Belgium and the Netherlands.
I need to perform a t-test, but I receive the following error: SyntaxError: invalid syntax.
I hope that somebody can help me with this.
from scipy.stats import ttest_ind
data1 = df3[df3['PROVISIONS'] == "BELGIUM")
data2 = df3[df3['PROVISIONS'] == "NETHERLANDS")
ttest_ind(data1,data2)
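For reference, the SyntaxError comes from the mismatched brackets: each boolean filter opens with [ but closes with ). A minimal corrected sketch (assuming the measurements live in a numeric column, here called 'VALUE', which is a hypothetical name, since ttest_ind expects numeric samples rather than whole DataFrames):
from scipy.stats import ttest_ind
# close the boolean mask with ] instead of )
data1 = df3[df3['PROVISIONS'] == "BELGIUM"]
data2 = df3[df3['PROVISIONS'] == "NETHERLANDS"]
# 'VALUE' is a placeholder for whatever column holds the measurements
ttest_ind(data1['VALUE'], data2['VALUE'])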

I can't run a correlation with ggcorrmat

I am getting this error when running a correlation matrix with the R package ggstatsplot:
ggcorrmat(data = d, type = "nonparametric", p.adjust.method = "hochberg",pch = "")
Error: 'data_to_numeric' is not an exported object from 'namespace:datawizard'
Could somebody help me?
I expected the script to run normally, as I have used it before (around July) without any errors.
Is there anything that has changed about the ggstatsplot package?

Testrail API Ruby add_result_for_case error

I am executing the code below as a Cucumber step. The test case ID is
C70. I tried a run ID and it gave the same error.
The code and error are below:
-----------------
require 'testrail-ruby'
client = TestRail::APIClient.new('https://xxxx.testrail.net')
client.user = 'xxxxxxxxxx.com'
client.password = 'xxxxxx'
r = client.send_post(
'add_result_for_case/C270',
{ :status_id => 1, :comment => 'This test worked fine!' }
)
puts r
The Error:
TestRail API returned HTTP 400 ("Field :run_id is not a valid ID.")
What am I doing wrong? I have researched this topic and have not resolved it. Please advise.
I found my problem. I should have used the full test case ID, i.e. the ID that is shown when running the actual test case. The issue is resolved.
An extra suggestion: do not use characters other than numbers when passing ID information. For example, the case ID shows up as C270 in TestRail, but you have to pass it as 270 in the API. Likewise, if your run ID is R150, you only need to pass it as 150.
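For illustration only (none of this is taken from the original post), here is a minimal Python sketch of the same request made directly against TestRail's REST API; it assumes the add_result_for_case/{run_id}/{case_id} endpoint with the hypothetical numeric IDs 150 (run R150) and 270 (case C270):
import requests

run_id = 150   # shown as R150 in the TestRail UI
case_id = 270  # shown as C270 in the TestRail UI

response = requests.post(
    f"https://xxxx.testrail.net/index.php?/api/v2/add_result_for_case/{run_id}/{case_id}",
    json={"status_id": 1, "comment": "This test worked fine!"},
    auth=("xxxxxxxxxx.com", "xxxxxx"),  # user and password / API key
)
print(response.status_code, response.json())
In the Ruby client from the question, the equivalent call would presumably be client.send_post('add_result_for_case/150/270', ...).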

spark similarities between text sentences

I'm trying to find the similarity between text messages (about 1 million messages); in my implementation each line represents one entry.
To calculate the similarity between those texts we use TF-IDF and columnSimilarities.
Below is the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
//import scala.math.Ordering
//import org.apache.spark.RangePartitioner
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
    case (row, idx) => IndexedRow(idx, row)
  }))
  val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
  new RowMatrix(transposed.rows
    .map(idxRow => (idxRow.index, idxRow.vector))
    .sortByKey().map(_._2))
}
// split word based on spaces and special characters
val documents = sc.textFile("./test1").map(_.split(" |\\,|\\?|\\-|\\+|\\*|\\(|\\)|\\[|\\]|\\{|\\}|\\<|\\>|\\/|\\;|\\.|\\:|\\=|\\^|\\|").filter(_.nonEmpty).toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
println(tf.getNumPartitions)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val mat = new RowMatrix(tfidf)
// transpose the matrix to get similarities between rows (not between columns)
val sim = transposeRowMatrix(mat).columnSimilarities()
val trdd = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
println(trdd.getNumPartitions)
// to decrease the time spent writing to file
val transformedRDD = trdd.repartition(50)
println(transformedRDD.getNumPartitions)
transformedRDD.cache()
transformedRDD.saveAsTextFile("output")
The problem is that when the number of similar messages in the file increases, the similarity decreases.
e.g.
let's assume that we have the file below:
hello world
hello world 123
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
The output of the above code is:
%cat output/part-000*
5.0,6.0,0.7373482646933146
0.0,1.0,0.8164965809277261
4.0,5.0,0.053913565847778636
1.0,5.0,0.13144171271256438
2.0,4.0,0.16888723050548915
4.0,6.0,0.052731941041749664
Each line in the output represents the similarity between two input lines, in the form:
"lineX - 1", "lineY - 1", "similarity"
i.e. the zero-based indices of the two lines followed by their similarity.
The output showing the similarity between the last 2 lines is 5.0,6.0,0.7373482646933146, which is fine.
The two lines are
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
and similarity is 0.7373482646933146
whereas when the input file is:
hello world
hello world 123
hello world 956248
hello world 2564
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
corporate code 456-458 you ca also tap on this link to verify corporate.co/8965
corporate code 444-444 you ca also tap on this link to verify corporate.co/4444
the output is:
7.0,10.0,0.4855543123154418
2.0,3.0,0.32317021425463427
6.0,8.0,0.03657892871242232
6.0,10.0,0.03097823353416634
0.0,1.0,0.6661166307685673
7.0,8.0,0.5733398760974173
1.0,2.0,0.37867439463004254
9.0,10.0,0.4855543123154418
0.0,3.0,0.5684806190668547
8.0,9.0,0.6716256614182469
4.0,6.0,0.1903502047647684
8.0,10.0,0.4855543123154418
1.0,3.0,0.37867439463004254
6.0,9.0,0.03657892871242232
7.0,9.0,0.5733398760974173
6.0,7.0,0.03657892871242232
1.0,7.0,0.233827426275723
0.0,2.0,0.5684806190668547
The output for the same two lines tested in the first example is 7.0,8.0,0.5733398760974173;
the similarity has decreased from 0.7373482646933146 to 0.5733398760974173 for the same lines.
The two lines are:
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
and similarity is 0.5733398760974173
Is there any way to avoid this decrease in similarity between
sentences when the number of similar messages in the input increases? (Could TF-IDF be the problem here, i.e. does the similarity decrease because more similar sentences lower the IDF weights of their shared terms?)
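For what it's worth, here is a rough sketch (in Python, not part of the original post) of why TF-IDF alone could produce this effect, using Spark MLlib's IDF formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term; the counts below correspond to the two example inputs (7 vs. 11 lines, with the shared "corporate code ..." tokens appearing in 2 vs. 4 of them):
import math

def mllib_idf(num_docs, doc_freq):
    # Spark MLlib's IDF: log((m + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))

# First input: 7 lines; tokens shared by the two corporate lines appear in 2 docs,
# tokens unique to one line (e.g. "123-234") appear in 1 doc.
print(mllib_idf(7, 2), mllib_idf(7, 1))    # ~0.98 (shared) vs ~1.39 (unique)

# Second input: 11 lines; the shared tokens now appear in 4 docs.
print(mllib_idf(11, 4), mllib_idf(11, 1))  # ~0.88 (shared) vs ~1.79 (unique)
As more near-duplicate lines are added, the weight of the tokens the two corporate lines share drops while the weight of the tokens that differ (the codes and the URLs) grows, so their cosine similarity goes down. If that reading is right, this is expected TF-IDF behaviour rather than a bug.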
Is there any way to cluster similar messages?
For example, the input above contains multiple sentences like:
hello world 123
and likewise several sentences like:
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
Could they be grouped based on the similarities output? (One possible approach is sketched below.)
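One simple possibility, sketched here in Python and not part of the original post: treat every (row, col, similarity) entry above a chosen threshold as an edge and group line indices into connected components with a union-find structure. The threshold of 0.5 below is an arbitrary illustration value and would need tuning:
def cluster(edges, num_lines, threshold=0.5):
    """Group line indices into connected components of the similarity graph."""
    parent = list(range(num_lines))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for row, col, sim in edges:
        if sim >= threshold:
            union(int(row), int(col))

    groups = {}
    for i in range(num_lines):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# A few entries taken from the second output above: (row, col, similarity)
edges = [(7, 10, 0.4855), (0, 1, 0.6661), (7, 8, 0.5733), (9, 10, 0.4855),
         (0, 3, 0.5684), (8, 9, 0.6716), (8, 10, 0.4855), (0, 2, 0.5684)]
print(cluster(edges, 11))  # -> [[0, 1, 2, 3], [4], [5], [6], [7, 8, 9], [10]] with this subset
Line 10 stays on its own here only because its similarities in this subset fall just below the 0.5 threshold; with the full entry list and a tuned threshold, the "hello world ..." lines and the "corporate code ..." lines would each end up in their own group.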

How to use xpath to find a text node

I'm using Scrapy to get user information on Stack Overflow, and I try to use //h2[@class="user-card-name"]/text()[1] to get the name. However I get this:
['\n Ignacio Vazquez-Abrams\n \n
Could someone please help?
You should be able to clean up the surrounding whitespace from the result easily using Python's strip() method:
In [2]: result = response.xpath('//h2[@class="user-card-name"]/text()[1]').extract()
In [3]: [r.strip() for r in result]
Out[3]: [u'Ignacio Vazquez-Abrams']
The recommended way to crawl unstructured data with Scrapy is to use ItemLoaders, and scrapylib offers very good default_input_processor and default_output_processor implementations.
items.py
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapylib.processors import default_input_processor
from scrapylib.processors import default_output_processor

class MyItem(Item):
    field1 = Field()
    field2 = Field()

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_input_processor = default_input_processor
    default_output_processor = default_output_processor
Now, in your spider code, populate your items with:
from myproject.items import MyItemLoader
...
... # on your callback
loader = MyItemLoader(response=response)
loader.add_xpath('field1', '//h2[@class="user-card-name"]/text()[1]')
... keep populating the loader
yield loader.load_item() # to return an item
Try this:
result = response.xpath('//h2[@class="user-card-name"]/text()').extract()
result = result[0].strip() if result else ''

Scrapy restrict_xpath syntax error

I'm trying to limit Scrapy to a particular XPath location for following links. The XPath is correct (according to the XPath Helper plugin for Chrome), but when I run my CrawlSpider I get a syntax error at my Rule.
My Spider code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]
    rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
So inside the rule, the XPath '//a[starts-with(@title,"Next ")]' is returning an error and I'm not sure why, since the actual XPath is valid. I'm simply trying to get the spider to crawl each "Next Page" link. Can anyone help me out? Please let me know if you need any other parts of my code.
It's not the XPath that is the issue; rather, the syntax of the complete rule is incorrect: restrict_xpaths is an argument of the link extractor, not of the Rule, and a closing parenthesis is missing. The following rule fixes the syntax error, but should be checked to make sure that it is doing what is required:
rules = (
    Rule(SgmlLinkExtractor(allow=['/f126/index*'],
                           restrict_xpaths=('//a[starts-with(@title,"Next ")]')),
         callback='parse_item', follow=True),
)
As a general point, posting the actual error in a question is highly recommended since the perception of the error and the actual error may well differ.
