What are the differences between the HTML Link Parser and the HTTP URL Re-writing Modifier?
Please explain with scenarios in which they are used in JMeter.
What is the context? Stack Overflow is not a suitable place to look for answers to interview questions or to get homework done. To get an answer, you need to describe the problem, what you have tried so far, and what you want to achieve.
HTML Link Parser has 2 main use cases:
-Choose random values or values from a given range when there is a choice; see Poll Example for details.
-Act as a website crawler (for broken link checking or simulating users browsing around the site); see How to Spider a Site with JMeter - A Tutorial for example usage.
HTTP URL Re-writing Modifier is used when a dynamic parameter representing a user session, or some other mandatory parameter, is appended to the URL. See the Correlation with HTTP URL Re-writing Modifier article for more information.
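To make the URL re-writing idea concrete, here is a minimal Python sketch of the concept only; it is not JMeter's implementation. It assumes the session parameter is called `jsessionid` and is carried as a `;name=value` path extension, and the function names `extract_session_id` and `rewrite_url` are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def extract_session_id(html_text, name="jsessionid"):
    """Find the first `name=value` occurrence in a previous response body."""
    # Assumption: the session value is purely alphanumeric.
    match = re.search(re.escape(name) + r"=([A-Za-z0-9]+)", html_text)
    return match.group(1) if match else None


def rewrite_url(url, session_id, name="jsessionid"):
    """Append the session parameter to the path as a `;name=value` extension."""
    parts = urlsplit(url)
    path = parts.path + ";" + name + "=" + session_id
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


previous_response = '<a href="/cart;jsessionid=ABC123?item=1">Cart</a>'
session_id = extract_session_id(previous_response)
next_url = rewrite_url("http://example.com/checkout?step=2", session_id)
```

The point of the sketch is the order of operations: the value is read from the previous response and then attached to the next request, which is the kind of correlation described above.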
We develop and maintain a large number of websites which have used the 'old' translate widget for quite some time. Recently, we've undertaken an effort to make all these sites ADA compliant. As it turns out, the widget's implementation is NOT ADA compliant and it's being deprecated anyway, so our strategy is to move forward and implement the Cloud Translation API.
Many of the site pages are quite large and contain a lot of markup within the body. The body of most sites' home pages is in the vicinity of 20KB; other pages are probably somewhat smaller. Rather than doing a POST to an endpoint on our server, which would in turn post to the API and then have to return the content to the browser, we believe the correct approach is to access the API directly from the browser. If we post the HTML content of the body, we expect the API to return the body with the markup intact and the text translated.
The only example we've been able to find shows code with a non-ajax $.get(...) translating a short text string. We're wondering if there might be other examples out there which more closely address what we're trying to accomplish.
One other side note: removing the markup from one of these 20KB bodies reduces its size to a bit over 5KB, so doing this could result in significant cost savings for our clients. If we were to do this by creating an array of strings to translate as part of the post, is it possible to instruct the API to do a batch translate, which would allow us to replace the original strings with the translated ones?
Right now, the only batch translation request available is the one described in [1]. It requires Cloud Storage, which is where the input files must be stored and where the translated files are written. Based on your explanation, I am not sure whether this would be of use to you.
I have found this post [2], which has a workaround that may help if you are able to concatenate what needs to be translated. Basically, the workaround is to build one string by concatenating the strings that need to be translated, and then, once it is translated, split it on a delimiter value.
[1] https://cloud.google.com/translate/docs/advanced/batch-translation
[2] Bulk translation of a big set of records via google translate
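The concatenate-and-split workaround can be sketched as follows. This is only an illustration under stated assumptions: `DELIMITER` is an arbitrary choice that must not occur in the texts and must come back unchanged from the translation, and `fake_translate` is a stand-in for the real API call, not part of any Google library.

```python
from typing import Callable, List

# Assumed delimiter: must not appear in the inputs and must survive translation.
DELIMITER = "\n|||\n"


def translate_batch(texts: List[str], translate: Callable[[str], str]) -> List[str]:
    """Translate many strings with one call by joining them and splitting the result."""
    if not texts:
        return []
    for text in texts:
        if DELIMITER in text:
            raise ValueError("delimiter occurs in input text")
    translated = translate(DELIMITER.join(texts))
    parts = translated.split(DELIMITER)
    # If the service altered the delimiter, the pieces no longer line up.
    if len(parts) != len(texts):
        raise ValueError("delimiter did not survive translation")
    return parts


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a translation request; it only upper-cases."""
    return text.upper()


result = translate_batch(["hello", "good morning"], fake_translate)
```

The length check matters in practice: if a real translation service rewrites or drops the delimiter, the results can no longer be matched to the original strings, and it is safer to fail than to misalign them.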
I wrote a Ruby script that appended "data" to the beginning of every word of the English dictionary, and then filtered out various strings using different parameters, and now I want to use a site like namecheap or gandi.net in order to take each of these strings and insert them into the domain name availability checker in order to determine which ones are available.
It is my understanding that this will involve making a POST HTTP request of some kind, as well as grabbing the element in question, but I don't really understand the dynamics of what to read about in order to do this kind of thing.
I imagine that after a few requests I will be limited, but as a learning exercise I am still curious as to how I would go about doing this.
I inspected the element (on namecheap) to see what the tag looked like, to find any uniquely identifiable class/id names that I could use to grab that specific part of the source, and found that inside a fieldset tag, there was a line of HTML that I can't seem to paste here, so here is a picture:
Thanks in advance for any guidance in helping me learn about web scripting!
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'Crawl Web' operator in RapidMiner).
I've seen the following tutorials from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even how to install them) since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need to have some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files.' I've looked at 'process documents from web' but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages internally using RapidMiner is a two-step process:
Step 1 Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan with the following difference:
Instead of the Crawl Web operator, use the Process Documents from Web operator. There will not be an option to specify the output directory, because the results will be loaded into the ExampleSet. The ExampleSet will contain links matching the crawling rules.
Step 2 Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html but only from 7:40 with the following difference:
put the Extract Information subprocess inside the Process Documents from Web which has been created previously.
ExampleSet will contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)
I've read over the Google specification for crawling AJAX-enabled pages. Since part of Google's indexing method uses the URL itself, will converting to #! negatively affect SEO?
For instance, if I have a page at www.mysite.com/surfing, Google will be likely to rate it highly if a user searches for "surfing" because it has "surfing" in the URL. Would the same be true for www.mysite.com/#!surfing or does it ignore the hash fragments for the purposes of weighting the URL itself?
Perhaps you have already read in the Google AJAX-crawling instructions that the #! is actually transformed into ?_escaped_fragment_= by the Google crawler. So let's use your example:
For www.mysite.com/#!surfing , the Google crawler will see the link as www.mysite.com/?_escaped_fragment_=surfing . So it comes down to the question: which is better for Google SEO, a link with the parameter ?_escaped_fragment_=surfing or one without it, /surfing ?
Search engine representatives have confirmed on numerous occasions that URLs with more than 2 dynamic parameters may not be spidered unless they are perceived as significantly important (i.e. have many, many links pointing to them). So unless you're using too many parameters in the URL, you don't have much to worry about. If you haven't done so already, you can read the detailed Google documentation: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started . Now, a piece of advice: don't rely on # in your AJAX website. Use history.pushState() to change your URL to whatever you wish. I use #! only on browsers that don't support history.pushState(), like IE. The SEO problem with #! doesn't come from the URL but from the difficulty of the server-side processing needed to provide an HTML snapshot for the crawler.
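The hashbang mapping discussed in this answer can be sketched in a few lines of Python. This is an illustration, not the crawler's code: the function name `escaped_fragment_url` is hypothetical, and the percent-encoding is a simplification of the exact character rules in the AJAX-crawling specification.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url: str) -> str:
    """Map a #! URL to the ?_escaped_fragment_= form a crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, nothing to map
    # Simplified escaping: keep "=" and "/" readable, percent-encode the rest.
    value = quote(parts.fragment[1:], safe="=/")
    param = "_escaped_fragment_=" + value
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


crawler_url = escaped_fragment_url("http://www.mysite.com/#!surfing")
```

Seen this way, the keyword moves out of the path and into a query parameter, which is why the comparison is between /surfing and ?_escaped_fragment_=surfing.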
The question is old.
Google no longer supports AJAX crawling:
https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html
And this document is officially deprecated:
https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
So don't use hashbangs in URLs.
Traditionally, from an SEO perspective, the hash (#) is used to avoid the following issues:
-Cannibalization issues
-Affiliate URLs (Here is a good article about how to use hash for tracking purpose instead of using question mark in the URL)
-Show limited content on the page (pagination issues)
The usage you are referring to is what Google recommends for making AJAX pages readable by Google - https://support.google.com/webmasters/answer/174992?hl=en
For more info about hash tag and its SEO benefits, check this blog post - https://digitalreadymarketing.com/adding-hash-in-urls-seo-benefits/
In my personal opinion, based on 8 years in SEO and development, adding the #! won't do harm, although the outcome depends more on the site's other parameters.
Do you have the site URL so I can take a more in-depth look?
That could cause a problem if Google's crawler thought that there could be an infinite number of possibilities. Like with a ? in the url. But the answer beyond that is clear.
website.com/oreo-cookies
is more semantic and easier to understand for both people and crawlers than
website.com/#!oreo-cookies
But is this going to have a major impact? If you were a client paying me for SEO, I would tell you that your incoming text links with relevant keyword phrases from relevant related websites are far more important. I would also say that if you are submitting an XML sitemap for Google to digest, and lots of popular websites are using the #!, Google will figure it out and ignore it.
So bottom line, if my content was worth linking to, and I made sure google was finding all my pages and indexing them, I would not worry about it.
I think that it will not harm your SEO in any way. I have been in SEO for the last 5 years and haven't experienced such a problem yet, so don't worry about it. In my opinion, you can add the #! with no harm.
I'm looking for a solution to replace all the links from a curl response to my site.
Let's say my site is example.com, and I make a cURL request to site.com.
site.com has various links:
Something!
<some html>......
Google!
<more html>
Something else
My goal is to prefix all the links with: example.com/?url={THE URL OF THE LINK} (AKA my site).
My current solution uses regexp to "catch" and process all the links.
This works most of the time, but from time to time I encounter invalid HTML that breaks the regex.
The regex has another disadvantage: I can't catch onclick="" actions and other link scenarios.
I have heard of several solutions, such as rewrite rules and a reverse proxy. Can any of them achieve my goal?
Thanks..
You should absolutely be able to use regex for that. However, your code will have to be a little more robust to handle inline scripting. Analyze a large sample of anchor attributes to determine all the possible link formats, over and above /href=""/ and /window.location.href/.
You will also have to parse referenced script files to see what the event handlers hold.
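As an alternative to the regex approach recommended above, here is a minimal sketch that uses Python's standard-library HTML parser, which tends to tolerate invalid markup better than a hand-written pattern. It is an assumption-laden illustration: `PROXY_PREFIX` is a guess at the prefix format from the question, `LinkRewriter` and `rewrite_links` are hypothetical names, only `href` attributes on `a` tags are rewritten, and onclick handlers or links built in script files are not handled at all.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote, urljoin

# Assumed prefix format, based on the question's example.com/?url={...}
PROXY_PREFIX = "http://example.com/?url="


class LinkRewriter(HTMLParser):
    """Re-emit the parsed HTML, prefixing every <a href> with PROXY_PREFIX."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        rendered = []
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                # Resolve relative links against the fetched page first.
                absolute = urljoin(self.base_url, value)
                value = PROXY_PREFIX + quote(absolute, safe="")
            if value is None:
                rendered.append(" " + name)
            else:
                rendered.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + tag + "".join(rendered) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_decl(self, decl):
        self.out.append("<!" + decl + ">")


def rewrite_links(html_text, base_url):
    parser = LinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


rewritten = rewrite_links('<p><a href="/about">About</a></p>', "http://site.com/")
```

Because the output is rebuilt from parser events, attribute quoting and tag case may differ slightly from the input. A reverse proxy would avoid rewriting the markup altogether, but that is a different design and is not shown here.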