Is there a Pig UDF that calculates time difference in the weblogs?
Assuming I have weblogs in the below format:
10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (1st entry in the weblogs). The same user then visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 s. Similarly, the user stayed on page2 for 2 min 28 s (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output could be:
10.171.100.10 someurl/page1 121 sec
10.171.100.10 someurl/page2 148 sec
The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about this using Pig UDFs or any other technology?
I don't know of any built-in function that would use the content of subsequent rows to generate output, as the row order is variable and thus highly unreliable.
You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER by IP and date, and to GROUP your data set by IP before starting a MapReduce job on each IP (or group of IPs), to ensure that all the rows corresponding to a particular IP are processed by the same node.
Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active and when is a user returning? You may end up with very long time ranges.
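To make this concrete, here is a minimal sketch of the per-IP computation such a UDF would have to perform: group hits by IP, sort each group by timestamp, and diff successive timestamps. This is plain Python rather than Pig, and the regex and field positions are my assumptions based on the combined log format shown above:

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the IP, timestamp, and referrer fields of the combined log format.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \S+ \S+ "([^"]*)"')

def dwell_times(lines):
    """Return (ip, page, seconds) for each page view except the last per IP."""
    visits = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't parse
        ip, ts, page = m.groups()
        when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        visits[ip].append((when, page))
    out = []
    for ip, hits in visits.items():
        hits.sort()  # order by timestamp within each IP
        for (t1, page), (t2, _) in zip(hits, hits[1:]):
            out.append((ip, page, int((t2 - t1).total_seconds())))
    return out

logs = [
    '10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "UA"',
    '10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "UA"',
]
print(dwell_times(logs))  # -> [('10.171.100.10', 'someurl/page1', 121)]
```

In Pig itself the equivalent shape would be a GROUP BY ip, an ORDER by timestamp inside a FOREACH, and a UDF that walks each sorted bag emitting (ip, page, seconds) tuples.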
Using the BlazeMeter extension in the Chrome browser, I saved a .jmx of a website and used that file in my JMeter test. It creates an HTTP Header Manager with the user agent below.
User Agent - Mozilla/5.0 (Operating_system; Intel xxx OS XXX xxx_xxx_xxx) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Question: since the user agent string mentions several browsers (Mozilla, Chrome, Safari), which browser will my test run as? How does the server work out my browser details?
As per documentation:
The Chrome (or Chromium/blink-based engines) user agent string is similar to the Firefox format. For compatibility, it adds strings like "KHTML, like Gecko" and "Safari".
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
So in your case the server will think that the virtual user is connecting from the macOS operating system and using the Chrome browser (or a derivative).
You might want to parameterize this User-Agent header value to represent different users using different browsers. Also pay attention to other headers, e.g. Accept-Encoding, as it has a huge impact on the size of the data requested from the server, i.e. whether or not it will be compressed.
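Because Chrome's UA string deliberately carries the "Safari" and "Mozilla" tokens for compatibility, UA sniffers generally check for the most specific product token first. A rough sketch of that order-of-checks logic (my own illustration, not code from any particular parser):

```python
def classify_ua(ua: str) -> str:
    # Chrome appends "Safari/..." for compatibility, so it must be checked
    # before Safari; a real parser would also check "Edg/", "OPR/", etc. first.
    if "Chrome/" in ua:
        return "Chrome"
    if "Safari/" in ua:
        return "Safari"
    if "Firefox/" in ua:
        return "Firefox"
    return "unknown"

ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
print(classify_ua(ua))  # -> Chrome
```

The same first-match-wins idea explains why the server reports Chrome for your header even though Safari also appears in it.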
I have a very simple Gatling scenario which hits a single HTTP endpoint with concurrent users.
When I run this for 30 seconds with 10 requests per second, everything is fine.
When I run this for 30 seconds at 60 requests per second on Windows, I get very strange errors that look to me like the underlying network connections are getting corrupted or are being misused. Perhaps there is a race condition or concurrency bug somewhere in Gatling or somewhere else in my system.
I don't get the same problems on a linux machine.
The web server is nginx and PHP. I don't suspect that is the cause of the problem, but I might be wrong.
How can I track down and fix this bug?
The scenario code
val scn = scenario("my scenario - one endpoint only")
  .exec(http("fetch")
    .get("http://my.website/page"))
  .inject(
    constantUsersPerSec(requestsPerSecond)
      .during(30.seconds)
      .randomized)
  .protocols(httpProtocol)

setUp(scn)
Symptoms
The scenario reports about an 8% failure rate, with errors that look like the server is replying with malformed HTTP responses, returning HTML code where the HTTP status line should be. These vary in the details, but here is a representative example:
2017-02-20 17:30:59,875 DEBUG org.asynchttpclient.netty.request.NettyRequestSender - invalid version format: <META
java.lang.IllegalArgumentException: invalid version format: <META
at io.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:130)
at io.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:84)
at io.netty.handler.codec.http.HttpResponseDecoder.createMessage(HttpResponseDecoder.java:118)
at io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:219)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:152)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Similarly, the server logs include invalid requests where the client has sent HTML where the HTTP request line should be:
10.56.4.130 - - [20/Feb/2017:17:30:59 +0000] "span class=\x22id4-cta-size-small id5-cta id4-cta-color-blue id4-cta-small-blue\x22><a hr" 400 166 "-" "-" "-" 0.179 -
10.56.4.130 - - [20/Feb/2017:17:31:00 +0000] "<!doctype html>" 400 166 "-" "-" "-" 0.070 -
Version info
I am using:
O/S: Windows 8.1 64 bit
Virus Scanner: I have Kaspersky, which intercepts network traffic. I tried turning it off, which made no difference. I don't know if it was "really" off.
VPN: My machine has a Windows Direct Connect VPN. The target site does not fall within that VPN.
Java: "1.8.0_121", Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Scala: 2.11.8
Gatling: 2.2.3
Akka: 2.4.12
io.netty.netty-handler: 4.0.42.Final
(4.0.41.Final was requested by netty-reactive-streams v 1.0.8, I wonder if that's significant)
I have a piece of a log:
2016-02-10T08:51:07.000+00:00 [RTR] OUT app_url - [10/02/2016:08:51:07 +0000] "GET / HTTP/1.1" 200 3418 "https://abcsomeootherurl" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/10.2.3.4 Safari/537.36" 10.4.5.6.:44711 x_forwarded_for:"10.5.6.7" vcap_request_id:req_id response_time:0.021407606 app_id:some_id
I would like to form an Elasticsearch query to be used in Kibana 4. How can I do that? Any lead would be highly appreciated.
I am using ActiveMQ as a full-featured broker. I have deployed my Spring application on Tomcat 8.0.8. I am sending a very large number of messages, and I do it in a separate thread (about 230,000 STOMP messages in a while loop). When I use Chrome or Firefox with the ActiveMQ console, I see that messages are consumed almost instantly. The problem, as always, is IE. I can see that it stops consuming messages (after about 1000), and Tomcat fails with:
java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.catalina.connector.ClientAbortException: java.io.IOException: An existing connection was forcibly closed by the remote host
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:396)
    at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:426)
    at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java: 42)
    at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:31 )
    etc....
Is IE such a slow consumer by default, or what? I have tried numerous slow-consumer policies with ActiveMQ, but without success.
The exception points to an I/O error of some sort on the client side. You can try to track what happens on the client side, e.g. use Fiddler to check for any reported errors, or, if that fails, Wireshark to track what HTTP messages are sent out and how far they get. Also try using the latest 4.0.6.BUILD-SNAPSHOT (or, better yet, 4.1.0.BUILD-SNAPSHOT) available from repo.spring.io/libs-snapshot. There are some recent SockJS-related fixes worth trying out.
I have tried with 4.1.0.BUILD-SNAPSHOT, but with no success. After a while Fiddler reports:
POST mylink/ami-0.1.0/liveEvents/072/omseen4z/xhr_send?t=1404832405318 HTTP/1.1
Accept: */*
Accept-Language: en-us
Referer: mylink/ami-0.1.0/liveEvents/iframe.html#ensha4ej
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
Host: mylink
Content-Length: 6
DNT: 1
Connection: Keep-Alive
Pragma: no-cache
Cookie: JSESSIONID=4CC71475E3FF1377C79D77C349710237; theme=standard
["\n"]
and an HTTP/1.1 404 Not Found error occurs.
What can be the reason for this (the new-line message)? I have set an ActiveMQ prefetchSize of 1, and I ack every received message.
Marko
I am using Wget for Windows (GnuWin32 wget-1.11.4-1) on Windows 8, with a helpdesk tool called Kayako, telling it to poll from an email queue. The command line looks like this:
wget.exe -O null --timeout 25 http://xxx.kayako.com/cron/index.php?/Parser/ParserMinute/POP3IMAP
I know it takes around 20 seconds to receive a response from the server in my particular case when opening the URL from the command line above in a browser. However, when using that command, it returns almost immediately. This is an excerpt of the output:
Connecting to xxx.kayako.com[xxx.xxx.xxx.xxx]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
I would like to know what the difference between the two cases is, and how I could get wget to behave in the same way as the browser (I know it doesn't, because Kayako is not polling from the email queue).
There are a number of potential variables, but one of the more common distinctions made by web servers is based on the user agent string you are reporting. By default, wget will identify itself truthfully as wget. If this is an issue, you can use the --user-agent= option to change the user agent string.
For example, you could identify as Firefox on 64-bit Windows with something like --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0".