Using the BlazeMeter extension in the Chrome browser, I saved the .jmx of a website and used that file in my JMeter test. It creates an HTTP Header Manager with the user agent below.
User Agent - Mozilla/5.0 (Operating_system; Intel xxx OS XXX xxx_xxx_xxx) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Question: Since the agent string mentions several browsers (Mozilla, Chrome, Safari), which browser will my test run in? How does the server understand my browser details?
As per documentation:
The Chrome (or Chromium/blink-based engines) user agent string is similar to the Firefox format. For compatibility, it adds strings like "KHTML, like Gecko" and "Safari".
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
So in your case the server will think that the virtual user is connecting from the macOS operating system and using the Chrome browser (or a derivative). JMeter itself does not launch any browser; it simply sends this header value along with each HTTP request.
You might want to parameterize this User-Agent header value to represent different users using different browsers (see the sketch below). Also pay attention to other headers, e.g. Accept-Encoding, as it has a huge impact on the size of the data returned by the server, i.e. whether it will be compressed or not.
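For example (a sketch, not something your recorded plan already contains): add a CSV Data Set Config pointing at a file such as user-agents.csv with one user agent per line and a variable name like userAgent, then reference that variable in the HTTP Header Manager so each virtual user picks up a different string. The file name and variable name here are only illustrative.

user-agents.csv:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0

HTTP Header Manager value:
User-Agent: ${userAgent}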
Related
I have recently been working on web scraping.
I found that we can use proxies or random user agents to avoid anti-scraping detection.
Is there any difference between proxy and random user agents?
I got confused because, as I understand it, both are used to hide the identity of the original client request.
If my understanding is wrong, please let me know.
User agents and proxies are totally different concepts.
1) User agents: the user agent is sent to the targeted website through the request headers.
When I send a request to stackoverflow, my useragent is :
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
It says I'm using Firefox on Linux, plus some other info. Everyone using the same browser (Firefox 68) on Linux will have the same user agent.
This library can help you find the most common user agents used on the web, so that your own user agent does not stand out: https://github.com/Lobstrio/shadow-useragent
2) Proxy
A proxy lets you hide your IP address: the website you target receives the IP address of the proxy rather than yours. If your IP gets blocked by the website, using a proxy will normally let you reach it again.
There can be many reasons why you get blocked while scraping, but rotating IPs and user agents can be effective in some cases (see the sketch below).
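Here is a minimal sketch with the Python requests library, assuming you already have a list of plausible user agents and a working proxy (the proxy address below is a placeholder, not a real endpoint):

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
]
# Hypothetical proxy URL -- replace with one you actually control or rent.
PROXIES = {"http": "http://12.34.56.78:8080", "https": "http://12.34.56.78:8080"}

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate the user agent per request
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)  # route through the proxy

Rotating the two together only helps against simpler kinds of blocking, which is why it works "in some cases" as noted above.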
I have a piece of a log:
2016-02-10T08:51:07.000+00:00 [RTR] OUT app_url -
[10/02/2016:08:51:07 +0000] "GET / HTTP/1.1" 200 3418
"https://abcsomeootherurl"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/10.2.3.4 Safari/537.36" 10.4.5.6.:44711 x_forwarded_for:"10.5.6.7"
vcap_request_id:req_id response_time:0.021407606 app_id:some_id
I would like to form an Elasticsearch query to be used in Kibana 4. How can I do that? Any lead would be highly appreciated.
I am using ActiveMQ as a full-featured broker. I have deployed my Spring application on Tomcat 8.0.8. I am sending a very large number of messages from a separate thread (about 230,000 STOMP messages in a while loop). When I use Chrome or Firefox, I can see in the ActiveMQ console that the messages are consumed almost instantly. The problem, as always, is IE. I can see that it stops consuming messages (after about 1000), and Tomcat fails with:
java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.catalina.connector.ClientAbortException: java.io.IOException: An existing connection was forcibly closed by the remote host
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:396)
at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:426)
at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java: 42)
at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:31 )
etc....
Is IE really such a slow consumer by default, or what? I have tried numerous slow-consumer policies with ActiveMQ, but without success.
The exception points to an I/O error of some sort on the client side. You can try to track what happens on the client side, e.g. use Fiddler to check for any reported errors, or, if that fails, Wireshark to track what HTTP messages are sent out and how far they get. Also try using the latest 4.0.6.BUILD-SNAPSHOT (or better yet 4.1.0.BUILD-SNAPSHOT) available from repo.spring.io/libs-snapshot. There are some recent SockJS-related fixes worth trying out.
I have tried with 4.1.0.BUILD-SNAPSHOT, but with no success. After a while Fiddler reports:
POST mylink/ami-0.1.0/liveEvents/072/omseen4z/xhr_send?t=1404832405318 HTTP/1.1
Accept: */*
Accept-Language: en-us
Referer: mylink/ami-0.1.0/liveEvents/iframe.html#ensha4ej
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
Host: mylink
Content-Length: 6
DNT: 1
Connection: Keep-Alive
Pragma: no-cache
Cookie: JSESSIONID=4CC71475E3FF1377C79D77C349710237; theme=standard
["\n"]
and an HTTP/1.1 404 Not Found error occurs.
What can be the reason for this (the newline message)? I have set an ActiveMQ prefetchSize of 1 and I ack every received message.
I am using wget for Windows (GnuWin32 wget-1.11.4-1) on Windows 8 with a helpdesk tool called Kayako, telling it to poll an email queue. The command line looks like this:
wget.exe -O null --timeout 25 http://xxx.kayako.com/cron/index.php?/Parser/ParserMinute/POP3IMAP
I know it takes around 20 seconds to receive a response from the server in my particular case when I open the URL from the command above in a browser. However, when using that command, wget returns almost immediately. This is an excerpt of the output:
Connecting to xxx.kayako.com[xxx.xxx.xxx.xxx]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
I would like to know what the difference between the two cases is, and how I could get wget to behave the same way as the browser (I know it doesn't, because Kayako is not polling the email queue).
There are a number of potential variables, but one of the more common distinctions made by web servers is based on the user agent string you are reporting. By default, wget will identify itself truthfully as wget. If this is an issue, you can use the --user-agent= option to change the user agent string.
For example, you could identify as Firefox on 64-bit Windows with something like --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0".
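Applied to your original command, that would look something like the following (the Firefox string is just an example; any realistic browser user agent will do):

wget.exe -O null --timeout 25 --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0" "http://xxx.kayako.com/cron/index.php?/Parser/ParserMinute/POP3IMAP"

If the server still responds differently, compare the remaining request headers a browser sends and wget does not (Accept, Accept-Encoding, cookies).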
Is there a Pig UDF that calculates time difference in the weblogs?
Assuming I have weblogs in the below format:
10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (1st entry in the weblogs). Next, the user visited someurl/page2 at 12/Jan/2012:14:41:47. So the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output can be:
10.171.100.10 someurl/page1 121 sec
10.171.100.10 someurl/page2 148 sec, etc.
The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about it using Pig UDFs or any other technology?
I don't know of any built-in function that would, by default, use the content of the following rows to compute a value, since the row order is variable and therefore highly unreliable.
You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER by IP and date and GROUP your data set by IP before starting the MapReduce job, so that all the rows corresponding to a particular IP (or IP group) are processed by the same node.
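A minimal sketch of that logic in Python (not a Pig UDF): group hits by IP, sort each group by timestamp, and report the gap between consecutive hits as the time spent on the earlier page. The regex and field positions are assumptions based on the sample lines above.

import re
from collections import defaultdict
from datetime import datetime

# ip - - [timestamp] "request" status bytes "referrer/page" ...
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \d+ "([^"]*)"')

def dwell_times(lines):
    hits = defaultdict(list)  # ip -> [(timestamp, page), ...]
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ip, ts, page = m.groups()
        when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        hits[ip].append((when, page))
    for ip, visits in hits.items():
        visits.sort()  # order by timestamp within each IP
        for (t1, page), (t2, _) in zip(visits, visits[1:]):
            yield ip, page, int((t2 - t1).total_seconds())

Fed the three sample lines, this yields ('10.171.100.10', 'someurl/page1', 121) and ('10.171.100.10', 'someurl/page2', 148). The same grouping-and-sorting shape maps onto Pig as a GROUP BY ip with an ORDER inside a nested FOREACH feeding a custom UDF.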
Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active and when is a user returning? You may end up with very long time ranges.