I'm trying to play with the Elasticsearch Java API and struggling to find the right QueryBuilder to use.
My JSON documents have fields like name, description, etc. Here is one sample:
{
"_id" : { "$oid" : "5160988c96cc620a5db6dafa" },
"name" : "Spinach Strata Recipe",
"ingredients" : "8 ounces day-old crusty bread, such as pain au levain, large dice (about 6 cups)\n2 cups coarsely chopped baby spinach leaves (about 3 1/2 ounces)\n3/4 cup crumbled provolone cheese (about 3 3/4 ounces)\n2 tablespoons extra-virgin olive oil, plus more for coating the pan\n1 tablespoon finely grated lemon zest (from about 1 medium lemon)\n2 teaspoons Dijon mustard\n1/8teaspoon kosher salt\n1/2 teaspoon freshly ground black pepper\n6 ex large eggs\n2 cups whole milk\n1/2 teaspoon finely chopped fresh oregano leaves",
"url" : "http://www.chow.com/users/recipes/29915-spinach-strata",
"image" : null,
"ts" : { "$date" : 1365285004038 },
"cookTime" : null,
"source" : "chow",
"recipeYield" : "6 servings",
"prepTime" : null,
"description" : "Aside from being comforting and filling, stratas are really easy to make. Whisked eggs and milk are poured over bread and your favorite fillings, then everything..."
}
This is how I"m indexing:
Settings settings = ImmutableSettings.settingsBuilder().put("cluster.name", "elasticsearch_sai").build();
final TransportClient client = new TransportClient(settings);
Client _client = client.addTransportAddress(new InetSocketTransportAddress("localhost", Integer.parseInt("9300")));
reader.lines().limit(10000).forEach(json -> {
    IndexResponse response = _client.prepareIndex("openrecipe", "recipe")
            .setSource(json)
            .execute()
            .actionGet();
    System.out.println("Indexed: " + response.getId() + " --> " + response.isCreated());
});
_client.admin().indices().prepareRefresh().execute().actionGet();
I want to find all documents that contain the word sandwich in the name field (case-insensitive).
But this doesn't work:
SearchResponse response = _client.prepareSearch("openrecipe")
        .setTypes("recipe")
        .setFrom(0).setSize(60)
        .setSearchType(SearchType.QUERY_AND_FETCH)
        .setQuery(QueryBuilders.termQuery("name", "sandwich"))
        .execute()
        .actionGet();
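A likely cause: a term query is not analyzed, so it only matches the exact terms as they were indexed, while the name field (with the default standard analyzer) was lowercased and tokenized at index time. For a case-insensitive word search you usually want a match query, i.e. QueryBuilders.matchQuery("name", "sandwich") in the Java API. Below is a minimal sketch of that query, shown through the Python client for brevity; the index name and host come from the code above, everything else is assumed.
# Sketch of the analyzed match query (Python client for illustration;
# the Java equivalent is QueryBuilders.matchQuery("name", "sandwich")).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed HTTP port
resp = es.search(index="openrecipe", body={
    "query": {"match": {"name": "sandwich"}},  # analyzed, hence case-insensitive
    "from": 0, "size": 60,
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"])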
I am using Databricks Labs Data Generator to send synthetic data to Event Hub.
Everything appears to work fine for about two minutes, but then the streaming stops with the following error:
The request was terminated because the entity is being throttled. Error code : 50002. Sub error : 102.
Can someone let me know how to adjust for the throttling?
The code I'm using to send data to Event Hub is as follows:
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
#.withColumn("body",StringType(), False)
.withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
.withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
.withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
.withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
.withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
.withColumn("reason", "string", values=delay_reasons, random=True)
)
df_flight_data = flightdata_defn.build(withStreaming=True, options={'rowsPerSecond': 100})
streamingDelays = (
df_flight_data
.groupBy(
#df_flight_data.body,
df_flight_data.flightNumber,
df_flight_data.airline,
df_flight_data.original_departure,
df_flight_data.delay_minutes,
df_flight_data.delayed_departure,
df_flight_data.reason,
window(df_flight_data.original_departure, "1 hour")
)
.count()
)
writeConnectionString = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
checkpointLocation = "///checkpoint"
# ehWriteConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# ehWriteConf = {
# 'eventhubs.connectionString' : writeConnectionString
# }
ehWriteConf = {
'eventhubs.connectionString' : writeConnectionString
}
# Write body data from a DataFrame to Event Hubs. Events are distributed across partitions using a round-robin model.
ds = streamingDelays \
.select(F.to_json(F.struct("*")).alias("body")) \
.writeStream.format("eventhubs") \
.options(**ehWriteConf) \
.outputMode("complete") \
.option("checkpointLocation", "...") \
.start()
I forgot to mention that I have 1 TU (throughput unit).
This is due to the usual traffic throttling from Event Hubs. Take a look at the limits for 1 TU: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas. You can increase the number of TUs to 2 and then go from there.
If you think this is unexpected throttling, then open a support ticket for the issue.
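If you want to stay on 1 TU, the other knob is the synthetic load itself. A minimal sketch, reusing the rowsPerSecond option from the question's own build() call: 1 TU allows roughly 1 MB/s or 1000 events/s of ingress, so lowering the generator rate keeps the stream under the cap.
# Sketch: throttle the synthetic source instead of raising TUs.
# rowsPerSecond comes from the question's build() call; 50 rows/s stays
# well under the ~1000 events/s ingress cap of a single throughput unit.
df_flight_data = flightdata_defn.build(
    withStreaming=True,
    options={'rowsPerSecond': 50}
)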
I am trying to extract product descriptions: the first loop runs through each product, and a nested loop enters each product page and grabs the description to extract.
for page in range(1, 2):
    guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
    soup = BeautifulSoup(guitarPage, 'lxml')
    guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
This is the loop for each product:
for guitar in guitars:
    title_text = guitar.h3.text.strip()
    print('Guitar Name: ', title_text)
    price = guitar.find(class_='price bold small').text.strip()
    print('Guitar Price: ', price)
    priceSave = guitar.find('span', {'class': 'price save'})
    if priceSave is not None:
        priceOf = priceSave.text
        print(priceOf)
    else:
        print("No discount!")
    image = guitar.img.get('src')
    print('Guitar Image: ', image)
    productLink = guitar.find('a').get('href')
    linkProd = url + productLink
    print('Link of product', linkProd)
Here I am adding the collected links to a list:
    productsPage.append(linkProd)
Here is my attempt at entering each product page and extracting the description:
    for products in productsPage:
        response = requests.get(products)
        soup = BeautifulSoup(response.content, "lxml")
        productsDetails = soup.find("div", {"class": "description-preview"})
        if productsDetails is not None:
            description = productsDetails.text
            # print('product detail: ', description)
        else:
            print('none')
        time.sleep(0.2)
    if None not in (title_text, price, image, linkProd, description):
        products = {
            'title': title_text,
            'price': price,
            'discount': priceOf,
            'image': image,
            'link': linkProd,
            'description': description,
        }
        result.append(products)
    with open('datas.json', 'w') as outfile:
        json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
    # print(result)
    print('--------------------------')
    time.sleep(0.5)
The outcome should be:
{
"title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
"price": "£399.00",
"discount": null,
"image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
"description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
},
but the description only works for the first one and does not change for the products after it.
[
{
"title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha APX600FM Flame Maple Amber",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
"price": "£399.00",
"discount": "Save £267.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
}
]
This is the result I am getting. It changes every run; sometimes it shows the previous product's description.
It does loop, but it seems there are protective measures in place server-side, and which pages fail changes between runs. I checked the pages that failed, and they did contain the content being searched for. No single countermeasure seemed to suffice in my testing (I didn't try sleeps over 2 seconds, but I did try some IP and User-Agent changes with sleeps <= 2).
You could try rotating IPs and User-Agents, backing off on retries, and varying the time between requests.
Changing proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Changing User-Agent: https://pypi.org/project/fake-useragent/
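A minimal sketch combining those suggestions (the User-Agent strings below are placeholders; fake-useragent, linked above, can generate real ones):
# Sketch: random User-Agent per request plus exponential backoff on failure.
import random
import time
import requests

USER_AGENTS = [  # placeholder strings; swap in fake-useragent values
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def fetch(url, retries=3):
    for attempt in range(retries):
        response = requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})
        if response.status_code == 200:
            return response
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s...
    return None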
I'm learning Scrapy. As an exercise, I want to get the product title on this web page, https://scrapingclub.com/exercise/detail_json/, using this code:
scrapy shell "https://scrapingclub.com/exercise/detail_json/"
response.xpath("//h3[1]/text()")
[]
but the only thing I get back is an empty list.
Try this:
response.xpath("//script[contains(., 'title')]/text()")
If you press Ctrl+U (view the page source), you can see this information near the footer:
var obj = {
    "title": "Short Sweatshirt",
    "price": "$24.99",
    "description": "Short sweatshirt with long sleeves and ribbing at neckline, cuffs, and hem. 57% cotton, 43% polyester. Machine wash cold.",
    "img_path": "/static/img/" + "96230-C" + ".jpg" };
Background
I was doing some tests to see which would be best for a primary key. I assumed that a BSON ObjectId would be better than a string. When I ran some tests, though, I got about the same results. Am I doing something wrong here, or can someone confirm that this is correct?
About my tests
I created 200k records with 2 Mongoid models and ran everything with Ruby's Benchmark. I ran three main queries: a find(id) query, a where(id: id) query, and a where(:id.in => array_of_ids) query. All of them gave me pretty similar response times.
Benchmark.bm(10) do |x|
  x.report("String performance") { 100.times { ModelString.where(id: '58205ae41d41c81c5a0289e5').pluck(:id) } }
  x.report("BSON performance") { 100.times { ModelBson.where(id: '581a1d271d41c82fc3030a34').pluck(:id) } }
end
Here are my models in Mongoid:
class ModelBson
include Mongoid::Document
end
class ModelString
include Mongoid::Document
field :_id, type: String, pre_processed: true, default: ->{ BSON::ObjectId.new.to_s }
end
Benchmark Results
ID miss "find" query
user system total real
String performance 0.140000 0.070000 0.210000 ( 2.187263)
BSON performance 0.280000 0.060000 0.340000 ( 2.308928)
ID hit "find" query
user system total real
String performance 0.280000 0.060000 0.340000 ( 2.392995)
BSON performance 0.190000 0.060000 0.250000 ( 2.245230)
100 IDs "in" query hit
String performance 0.850000 0.110000 0.960000 ( 9.221822)
BSON performance 0.770000 0.060000 0.830000 ( 8.055971)
Output of db.collection.stats() for both collections:
{
"ns" : "model_bsons",
"count" : 199221,
"size" : 9562704,
"avgObjSize" : 48,
"numExtents" : 7,
"storageSize" : 22507520,
"lastExtentSize" : 11325440,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"indexDetails" : {
},
"totalIndexSize" : 6475392,
"indexSizes" : {
"_id_" : 6475392
},
"ok" : 1
}
{
"ns" : "model_strings",
"count" : 197680,
"size" : 9488736,
"avgObjSize" : 48,
"numExtents" : 7,
"storageSize" : 22507520,
"lastExtentSize" : 11325440,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"indexDetails" : {
},
"totalIndexSize" : 9304288,
"indexSizes" : {
"_id_" : 9304288
},
"ok" : 1
}
This is correct.
As you can see from the collection stats, documents in both collections have the same average size (the avgObjSize field), so the choice of _id type makes essentially no difference to document size here.
What really matters is the index size. Notice that the _id index of the BSON collection is about 30% smaller than that of the String collection (6475392 vs. 9304288 bytes), because a BSON ObjectId can take full advantage of index prefix compression. That difference is too small to show a real performance change with 200,000 documents, but I would guess that increasing the number of documents could show different results.
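If you want to verify this on your own data, the numbers quoted above come straight from collStats. A pymongo sketch (the database name is an assumption; collection names come from the question's stats output):
# Sketch: compare the _id index sizes via collStats, the same fields
# quoted in the question.
from pymongo import MongoClient

db = MongoClient().get_database("benchmark")  # assumed database name
for name in ("model_bsons", "model_strings"):
    stats = db.command("collStats", name)
    print(name, stats["totalIndexSize"])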
I have a problem with count() performance in MongoDB.
I'm using ZF2 and Doctrine ODM with the SoftDelete filter. When I query the collection for the first time with db.getCollection('order').count({"deletedAt": null}), it takes about 30 seconds, sometimes even more. The second and subsequent queries take about 150ms. After a few minutes, the query takes about 30 seconds again. This happens only on collections larger than 700MB.
The server is an Amazon EC2 t2.medium instance running MongoDB 3.0.1.
Maybe it is similar to "MongoDB preload documents into RAM for better performance", but those answers do not solve my problem.
Any ideas what is going on?
Edit: here is the explain output:
{
"executionSuccess" : true,
"nReturned" : 111449,
"executionTimeMillis" : 24966,
"totalKeysExamined" : 0,
"totalDocsExamined" : 111449,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : []
},
"nReturned" : 111449,
"executionTimeMillisEstimate" : 281,
"works" : 145111,
"advanced" : 111449,
"needTime" : 1,
"needFetch" : 33660,
"saveState" : 33660,
"restoreState" : 33660,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 111449
},
"allPlansExecution" : []
}
The count has to go through each document, which is what creates the performance issue: your explain output shows a COLLSCAN with totalKeysExamined: 0, so no index is used and all 111449 candidate documents are examined.
Care about the precise number only when it's small. You're interested in knowing whether there are 100 results or 500, but once it goes beyond, say, 10000, you can just tell the user "more than 10000 results found".
db.getCollection('order').find({"deletedAt": null}).limit(10000).count(true)
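For illustration, the same capped count from a driver, as a pymongo sketch (connection details assumed; count_documents requires pymongo 3.7+):
# Sketch: cap the count so the server stops scanning after 10000 matches.
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017").mydb.order  # assumed connection
n = orders.count_documents({"deletedAt": None}, limit=10000)
print("more than 10000 results" if n >= 10000 else "{} results".format(n))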