Packetbeat appears to be adding DNS packets that were not really sent - elasticsearch

I have an interesting problem with Packetbeat. It is the latest version of Packetbeat (installed fresh this week from the Elastic download area), running on a Debian 10 system and sending data to Elasticsearch v7.7, which is also installed on a Debian 10 system.
I am seeing the DNS data in the Elasticsearch logs (when viewing them using the Kibana-->Logs GUI). But I also see additional DNS packets in the log that do not appear in a packet analyzer running on the same system that Packetbeat is running on.
Here is the packet analyzer showing DNS calls to/from a client (10.5.52.47). The Wireshark capture filter is set to 'port 53' and the display filter is set to 'ip.addr==10.5.52.47'. It is running on the same system as Packetbeat (for purposes of troubleshooting this issue).
Wireshark screenshot
1552 2020-06-04 20:31:34.297973 10.5.52.47 10.1.3.200 52874 53 DNS 93 Standard query 0x95f7 SRV
1553 2020-06-04 20:31:34.298242 10.1.3.200 10.5.52.47 53 52874 DNS 165 Standard query response 0x95f7 No such name SRV
1862 2020-06-04 20:32:53.002439 10.5.52.47 10.1.3.200 59308 53 DNS 90 Standard query 0xd67f SRV
1863 2020-06-04 20:32:53.002626 10.1.3.200 10.5.52.47 53 59308 DNS 162 Standard query response 0xd67f No such name SRV
1864 2020-06-04 20:32:53.004126 10.1.3.200 10.5.52.47 64594 53 DNS 84 Standard query 0xaaaa A
1867 2020-06-04 20:32:53.516716 10.1.3.200 10.5.52.47 64594 53 DNS 84 Standard query 0xaaaa A
2731 2020-06-04 20:36:34.314959 10.5.52.47 10.1.3.200 53912 53 DNS 93 Standard query 0x2631 SRV
2732 2020-06-04 20:36:34.315058 10.1.3.200 10.5.52.47 53 53912 DNS 165 Standard query response 0x2631 No such name SRV
I removed the actual DNS query info from these packets as it is not pertinent to this topic.
From the Wireshark output, you can see a DNS query at 20:32:53 from 10.5.52.47 to the DNS server 10.1.3.200. The server responds to this query in the next packet. There are also two more DNS packets from the server within the same second.
The next DNS query by the client 10.5.52.47 occurs at 20:36:34. And this also gets an immediate response from the server.
This differs from the Kibana log output sent by Packetbeat. The Kibana logs show the following:
Screenshot of Kibana Log showing actual DNS call(s), and multiple non-existent DNS calls (highlighted in yellow)
All of the above info as captured in the packet capture, plus the following:
20:33:00.000 destination IP of 10.5.52.47 destination port of 53
Same thing at
20:33:10.000
20:33:20.000
20:33:30.000
20:33:40.000
Then at 20:36:34 it shows the DNS query that the packet capture shows.
So, these port 53 entries that land at 00/10/20/30/40 seconds after the minute appear to be made up out of thin air. Additionally, no other fields are populated in the Elasticsearch logs for these entries: client.ip is empty, as are client.bytes, client.port, and ALL the DNS fields. The DNS entries that appear in both the packet capture and Kibana have all the expected fields populated with correct data.
Does anyone have an idea why this is occurring? The example above is a small sample. This happens for multiple systems at 10-second intervals: at 10, 20, 30, 40, 50, or 60 seconds after the minute, I see between 10 and 100 (guesstimate) of these log entries where all the fields are blank except destination.ip, destination.bytes, and destination.port. There is no client info and no DNS info in the fields of these errant records.
The 'normal' DNS records have about 20 fields of information listed in the Kibana log; these errant ones have only four (the three fields listed above plus the timestamp).
Here is an example of the log from one of these 10 second intervals...
timestamp Dest.ip Dest.bytes Dest.port
20:02:50.000 10.1.3.200 105 53
20:02:50.000 10.1.3.200 326 53
20:02:50.000 10.1.3.200 199 53
20:02:50.000 10.1.3.200 208 53
20:02:50.000 10.1.3.201 260 53
20:02:50.000 10.1.3.200 219 53
20:02:50.000 10.1.3.200 208 53
20:02:50.000 10.1.3.200 199 53
...
(plus 42 more of these at the same second)
...
20:02:50.000 10.1.3.201 98 53
Kibana Log view of reported issue - the 'real' DNS call is highlighted in yellow, the non-existent DNS calls are marked by the red line - there are far more non-existent DNS calls logged than real DNS queries
And here is the packetbeat.yml file (only showing uncommented lines)
packetbeat.interfaces.device: enp0s25
packetbeat.flows:
  timeout: 30s
  period: 10s
packetbeat.protocols:
- type: dhcpv4
  ports: [67, 68]
- type: dns
  ports: [53]
  include_authorities: true
  include_additionals: true
setup.template.settings:
  index.number_of_shards: 1
setup.dashboards.enabled: true
setup.kibana:
  host: "1.1.1.1:5601"
output.elasticsearch:
  hosts: ["1.1.1.2:59200"]
setup.template.overwrite: true
setup.template.enabled: true
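One detail in this config lines up suspiciously well with the symptom: packetbeat.flows is enabled with period: 10s, which tells Packetbeat to emit a summary document for every active network flow every 10 seconds. Flow documents carry destination/byte/port data but none of the DNS transaction fields, which matches the shape of the errant records. If that hypothesis is right (it is only a hypothesis at this point), disabling flows should make the 10-second documents stop:
# Hypothesis test: turn off flow reporting entirely. Flow summaries are
# emitted once per `period` and contain no DNS fields.
packetbeat.flows:
  enabled: false
Flow events should also be distinguishable from DNS transactions in Kibana by their event fields (the exact field names vary by Packetbeat version).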
Thank you for your thoughts on what might be causing this issue.
=======================================================================
Update on 6/8/20
I had to shut down Packetbeat due to this issue until I can locate a resolution. A single Packetbeat system generated 100 million documents over the weekend for DNS queries alone, 98% of which were somehow created by Packetbeat and were not real DNS queries.
I stopped the Packetbeat service this morning on the Linux box that is capturing these DNS queries, and deleted the index. I then restarted Packetbeat, let it run for about 60 seconds, and stopped the service again. During those 60 seconds, 22,119 DNS documents were added to the index. When I removed the documents Packetbeat created (that were not real DNS queries), 21,391 were deleted, leaving me with 728 actual DNS queries. In this case, 97% of the documents were created by Packetbeat, and 3% were 'real' DNS queries made by our systems which Packetbeat captured.
Any ideas as to why this behavior is being exhibited by this system?
Thank you

Related

Postgres connect time delay on Windows

There is a long delay between "forked new backend" and "connection received", from about 200 to 13000 ms. Postgres 12.2, Windows Server 2016.
During this delay the client is waiting for the network packet to start the authentication. Example:
14:26:33.312 CEST 3184 DEBUG: forked new backend, pid=4904 socket=5340
14:26:33.771 CEST 172.30.100.238 [unknown] 4904 LOG: connection received: host=* port=56983
This was discussed earlier here:
Postgresql slow connect time on Windows
But I have not found a solution.
After rebooting the server the delay is much shorter, about 50 ms. Then it gradually increases in the course of a few hours. There are about 100 clients connected.
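To quantify that growth, the connect time can be sampled from a client over time; here is a minimal sketch using PowerShell and psql (the host, user, and database names are placeholders):
# Sample the wall-clock cost of a trivial connection ten times.
# Run this periodically after a reboot to watch the delay grow.
1..10 | ForEach-Object {
    (Measure-Command { psql -h dbserver -U app -d postgres -c "SELECT 1" }).TotalMilliseconds
}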
I use ip addresses only in "pg_hba.conf". "log_hostname" is off.
There is BitDefender running on the server but switching it off did not help. Further, Postgres files are excluded from BitDefender checks.
I used Process Monitor, which revealed the following: forking the postgres.exe process takes 3 to 4 ms. Then, after loading DLLs, postgres.exe looks for custom and extended locale info for 648 locales and finds none of them. This locale search takes 560 ms (though it contains a gap of 420 ms). Perhaps this step can be skipped by setting a connection parameter. After reading some TCP/IP parameters, there are no events for 388 ms; this period overlaps the 420 ms gap mentioned above. Then postgres.exe creates a thread. The total connection time measured by the client was 823 ms.
Locale example, performed 648 times:
"02.9760160","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","REPARSE","Desired Access: Read"
"02.9760500","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","SUCCESS","Desired Access: Read"
"02.9760673","RegQueryValue","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale\bg-BG","NAME NOT FOUND","Length: 532"
"02.9760827","RegCloseKey","HKLM\System\CurrentControlSet\Control\Nls\CustomLocale","SUCCESS",""
"02.9761052","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","REPARSE","Desired Access: Read"
"02.9761309","RegOpenKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","SUCCESS","Desired Access: Read"
"02.9761502","RegQueryValue","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale\bg-BG","NAME NOT FOUND","Length: 532"
"02.9761688","RegCloseKey","HKLM\System\CurrentControlSet\Control\Nls\ExtendedLocale","SUCCESS",""
No events for 388 ms:
"03.0988152","RegCloseKey","HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters\Winsock","SUCCESS",""
"03.4869332","Thread Create","","SUCCESS","Thread ID: 2036"

IdMappedPortTCP now requires "prodding" after telnet connection

I have used IdMappedPortTCP in a particular program to allow generic port-forwarding for many years. I'm testing an upgraded build/component environment, and I ran into a problem. First, here's the old & new version info:
OS: W2kSP4 --> same (Hey, why is everyone laughing?)
Delphi: 5 --> 7
Project Indy: 9.0.0.14 --> 9.[Latest SVN]
I'm testing it by inserting it in a telnet session using the standard Windows console telnet client, and a Linux server, and I'm seeing an odd change in behavior.
Direct connection: Client connects, immediately sees server greeting
Old Indy: Same as direct
New Indy: Client connects, sees nothing. Press key, sees server greeting + keystroke.
Here's a comparison of the event chain:
Old:
6/08/2017 6:47:16 PM - DEBUG: MappedPort-Connect
6/08/2017 6:47:16 PM - TCP Port Fwd: Connect: 127.0.0.1:4325 --> 127.0.0.1:23
6/08/2017 6:47:16 PM - DEBUG: MappedPort-OutboundConnect
6/08/2017 6:47:16 PM - TCP Port Fwd: Outbound Connect: 192.168.214.11:4326 --> 192.168.210.101:23
6/08/2017 6:47:16 PM - DEBUG: MappedPort-OutboundData
6/08/2017 6:47:16 PM - DEBUG: MappedPort-Execute
6/08/2017 6:47:16 PM - DEBUG: MappedPort-OutboundData
6/08/2017 6:47:16 PM - DEBUG: MappedPort-Execute
6/08/2017 6:47:16 PM - DEBUG: MappedPort-OutboundData
...
New:
6/08/2017 6:41:34 PM - DEBUG: MappedPort-Connect
6/08/2017 6:41:34 PM - TCP Port Fwd: Connect: 127.0.0.1:1085 --> 127.0.0.1:23
6/08/2017 6:41:34 PM - DEBUG: MappedPort-OutboundConnect
6/08/2017 6:41:34 PM - TCP Port Fwd: Outbound Connect: 192.168.214.59:1086 --> 192.168.210.101:23
6/08/2017 6:47:36 PM - DEBUG: MappedPort-Execute
6/08/2017 6:47:36 PM - DEBUG: MappedPort-OutboundData
6/08/2017 6:47:36 PM - DEBUG: MappedPort-Execute
6/08/2017 6:47:36 PM - DEBUG: MappedPort-OutboundData
6/08/2017 6:47:36 PM - DEBUG: MappedPort-Execute
In the first one, you see OutboundData right after connecting. In the second, nothing happens after connect until I sent a keystroke (6 minutes later), at which time you see the Execute and then the first OutboundData event.
This caused me to wonder: Is it really connecting to the server and only delaying the output, or is the connection itself being delayed?
My first conclusion was that the connection itself was being delayed, and here's why. The server has a 1-minute timeout at the login prompt. If you connect and get the greeting but just sit there, the server disconnects after a minute. With the new Indy version, I sat there after the connect event for 6 full minutes, then got the server greeting with no problem.
However... NETSTAT shows the connection to the remote server established soon after the connection event is logged! So, I'm left to conclude that the connection is indeed established, but perhaps some initial character is being "eaten" or something that's causing getty to not engage until it gets a keystroke?
Any suggestions? Are you aware of anything that changed that I might look for -- something that I ought to be doing but am not? Any insights are appreciated.
(Barring any good leads, I guess the next step in my sleuthing might be to sniff both machines w/ WireShark to see what's going on after the connection.)
Update: Wireshark (single leg)
A packet capture from outside the machines showing traffic between MappedPort & the server (but not traffic between the client & MappedPort) shows that the telnet server sends a "Do Authenticate", to which the client (via MappedPort) replies w/ a "Will Authenticate". This is followed by the server sending authenticate suboption (and the client agreeing) then all the other telnet options. Finally, after seeing the login text, the client sends "do echo" and they both sit there until after 1min, at which time the server sends a TCP FIN to close the connection. That's the "good old" version.
On the new version, the client doesn't respond to the "Will Authenticate", and they both sit there indefinitely. (Hmmm, I wonder what that ties up in terms of server resources -- could be good DOS attack. It is an old telnet daemon, though, so it's probably been fixed by now...) When I finally sent the first keystroke, that's all it sent in that packet. THEN the client sends the "will authenticate" (without additional prodding from the server) and the negotiation continues exactly as normal; the last packet from the server (containing echo parameters) also then includes the echoed character which was typed. So it's like the client doesn't see the initial "do authenticate" packet from the server, but once you start typing, goes ahead and responds as if it had just heard it (once it sends the keystroke).
6/13 Update: Wireshark (both legs)
I captured both legs of the "broken" conversation and analyzed it. Interesting behavior. Bottom line:
As soon as the server gets the TCP connection, it sends back a Telnet-DoAuth invitation. IdMappedPortTCP holds onto that packet and doesn't pass it on to the client -- yet. Once the client finally sends the first keystroke (seconds or minutes later), Id passes it on to the server. THEN Id passes the DoAuth packet that it got from the server on to the client.
Here's a more detailed accounting of the packets:
65 11-59 TCP Syn
67 59-11 TCP SynAck
69 11-59 TCP Ack
71 59-101 TCP Syn
73 101-59 TCP SynAck
74 59-101 TCP Ack
76 101-59 DoAuth
77 59-101 TCP Ack
nothing for 23 seconds
79 11-59 Data:\r\n (I pressed Enter)
81 59-101 Data:\r\n
83 59-11 DoAuth
85 11-59 WillAuth
87 101-59 TCP Ack
88 59-101 WillAuth
90 101-59 TCP Ack
91 101-59 Authentication option
92 59-11 Authentication option
94 11-59 Authentication option reply
96 59-101 Authentication option reply
98 101-59 Will/do Encryption/terminal/env options
99 59-101 Will/do Encryption/terminal/env options
101 11-59 Don't encrypt
103 59-101 Don't encrypt
105 101-59 TCP Ack
106 59-11 TCP Ack
108 11-59 Won't/will list
110 59-101 Won't/will list
112 101-59 TCP Ack
113 101-59 Do window size
114 59-11 Do window size
Packet dump line format: Pkt# From-To Payload
(Don't mind the packet# skips; the client & proxy are both running on VMs hosted by the machine where I was running the capture from, so Wireshark saw two copies of packets. I only included the pkt# so I can reference the original dump later if I want.)
From/To Machines:
10 = Linux client (see below)
11 = Windows client
59 = proxy
101 = server
An Interesting diversion: Linux Client
Though all my testing has been using various Windows clients (because that's what's used in production), I "accidentally" used Linux (because that's what I run on my workstation, where I ran Wireshark) because it was convenient. That client behaves differently -- more aggressively -- and thus avoids the problem. Here's what a dump of that looks like:
1 10-59 TCP Syn
2 59-10 TCP SynAck
3 10-59 TCP Ack
4 10-59 Do/Will list
5 59-101 TCP Syn
7 101-59 TCP SynAck
8 59-101 TCP Ack
10 59-101 Do/Will list
12 101-59 TCP Ack
13 101-59 DoAuth
14 59-10 DoAuth
15 10-59 TCP Auth
16 10-59 WontAuth
17 59-101 WontAuth
19 101-59 Will/Do list
20 59-10 Will/Do list
21 10-59 Do window size
22 59-101 Do window size
As you can see, the client doesn't wait for the telnet server to speak first -- as soon as the TCP connection is established, it sends a full Do/Will list. This is in turn passed on to the server once Id opens that connection. The server sends back the same "DoAuth" that it sent initially; the difference is that this time, having already passed traffic from the client, Id passes it on immediately. The client then sends auth flags, and things move right along.
So, if the client speaks first, IdMappedPortTCP does okay; it's only when the server speaks first that it holds onto its message and doesn't pass it on to the client until the client says something.
9/27 Update: Found the Code Change
Downgrading to 9.0.0.14 fixed the problem. Comparing the two versions' source code for IdMappedPortTCP.pas I found that the only difference is that the newer version added a block of code to procedure TIdMappedPortThread.OutboundConnect:
  DoOutboundClientConnect(Self);
  FNetData := Connection.CurrentReadBuffer;
  if Length(FNetData) > 0 then begin
    DoLocalClientData(Self);
    FOutboundClient.Write(FNetData);
  end;//if
except
except
(The first and last lines existed already, and are only shown for context.)
I confirmed that adding that code to 9.0.0.14 produced the problem.
I checked the SVN repo, and you added the offending code on 9/7/2008. The commit comment is:
Updated TIdMappedPortThread.OutboundConnect() to check for pending
data in the inbound client's InputBuffer after the OnOutboundConnect
event handler exits.
I don't fully understand the reason for or implications of the change -- obviously you had a good reason for doing it -- but it does appear to produce the effect I described ("holding onto" the server's initial output until the client sends something).
In Indy 9, TIdTCPConnection.CurrentReadBuffer() calls TIdTCPConnection.ReadFromStack() before then returning whatever data is stored in the TIdTCPConnection.InputBuffer property:
function TIdTCPConnection.CurrentReadBuffer: string;
begin
  Result := '';
  if Connected then begin
    ReadFromStack(False); // <-- here
  end;
  Result := InputBuffer.Extract(InputBuffer.Size);
end;
Regardless of what may already be in the InputBuffer, ReadFromStack() waits for the socket to receive new data to append to the InputBuffer. It does not exit until new data actually arrives, or the specified ReadTimeout interval elapses. The TIdTCPConnection.ReadTimeout property is set to 0 by default, so when CurrentReadBuffer() calls ReadFromStack(), it ends up using an infinite timeout:
function TIdTCPConnection.ReadFromStack(const ARaiseExceptionIfDisconnected: Boolean = True;
  ATimeout: Integer = IdTimeoutDefault; const ARaiseExceptionOnTimeout: Boolean = True): Integer;
// Reads any data in tcp/ip buffer and puts it into Indy buffer
// This must be the ONLY raw read from Winsock routine
// This must be the ONLY call to RECV - all data goes thru this method
var
  i: Integer;
  LByteCount: Integer;
begin
  if ATimeout = IdTimeoutDefault then begin
    if ReadTimeOut = 0 then begin
      ATimeout := IdTimeoutInfinite; // <-- here
    end else begin
      ATimeout := FReadTimeout;
    end;
  end;
  ...
end;
So, when TIdMappedPortTCP.OutboundConnect() calls CurrentReadBuffer() after connecting its OutboundClient to the server, it does indeed wait for data to arrive from the client before then reading data from the server. To avoid that, you can set a non-infinite ReadTimeout value in the TIdMappedPortTCP.OnConnect or TIdMappedPortTCP.OnOutboundConnect event, eg:
AThread.Connection.ReadTimeout := 1;
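Wired into an event handler, that workaround looks something like this (a sketch only; the handler name is hypothetical and the signature follows Indy 9's server-thread events):
procedure TForm1.IdMappedPortTCP1OutboundConnect(AThread: TIdPeerThread);
begin
  // A finite (1 ms) timeout makes the CurrentReadBuffer() call inside
  // OutboundConnect() return promptly instead of blocking until the
  // inbound client sends data, so the server's greeting can pass through.
  AThread.Connection.ReadTimeout := 1;
end;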
In Indy 10, this problem was fixed in TIdMappedPortTCP by avoiding this initial wait on the client data after connecting to the server. I have now updated TIdMappedPortTCP in Indy 9 to do the same.

How to configure Cassandra to work across multiple EC2 regions with Ec2MultiRegionSnitch

I am new to Cassandra and have been tasked with getting it up and running in the EC2 environment across multiple regions such that if an entire EC2 region goes belly up our app will continue on its merry way. I've read as much documentation as I could find regarding Ec2MultiRegionSnitch and have come to a dead stop. I am running cassandra 1.0.10.
My problems are as follows:
1) When I start bin/cassandra I get the error "Could not start register mbean in JMX". Even so, I can run bin/nodetool -h ring on any of the nodes and get the display you would expect from a healthy system. I have added the mx4j library to my cassandra deployment; I could try removing that, I suppose.
2) When I then start bin/cassandra-cli -h I am able to create the keyspace as follows:
CREATE KEYSPACE mykeyspace
WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = {us-east-1:2,us-west-1:2};
3) After I run 'use mykeyspace' I can create a column family as follows:
CREATE COLUMN FAMILY people
WITH comparator = UTF8Type
AND key_validation_class = UTF8Type
AND default_validation_class = UTF8Type
AND column_metadata = [
  {column_name: FIRST_NAME, validation_class: UTF8Type},
  {column_name: LAST_NAME, validation_class: UTF8Type},
  {column_name: EMAIL, validation_class: UTF8Type},
  {column_name: LOGIN, validation_class: UTF8Type, index_type: KEYS}
];
4) After I do this I can run bin/cassandra-cli -h on any of the 4 nodes, run use mykeyspace; describe; and each node correctly describes mykeyspace including the column family and seed list.
5) But when I try to perform a simple:
set people['1']['FIRST_NAME'] = 'John';
I get a stack trace as follows:
null
UnavailableException()
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15206)
at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:858)
at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:830)
at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:901)
My configuration:
I have performed ec2-authorize for ports 22, 7000, 7199 and 9160
I have 4 nodes in my cluster, one node in each of the following region:availability-zone pairs:
us-east-1:us-east-1a (initial_token: 0)
us-east-1:us-east-1c (initial_token: 85070591730234615865843651857942052864)
us-west-1:us-west-1a (initial_token: 1)
us-west-1:us-west-1c (initial_token: 85070591730234615865843651857942052865)
Each EC2 instance has been associated with a public IP address.
In each node I have configured cassandra.yaml as follows:
seeds: <set to the public ip address for the us-east-1a and us-west-1a nodes>
storage_port: 7000
listen_address: <private ip address of this node>
broadcast_address: <public ip address of this node>
rpc_address: 0.0.0.0
rpc_port: 9160
endpoint_snitch: Ec2MultiRegionSnitch
Additionally in each node's cassandra-env.sh I've included:
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<Node's local IP Address>"
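One thing worth double-checking with Ec2MultiRegionSnitch (an assumption on my part based on the DataStax docs, not something verified here): cross-region node-to-node traffic goes over the public IP addresses, so each region's security group must also allow the other region's public addresses on the storage port. With the classic EC2 API tools that looks roughly like this (the group name and CIDR are placeholders):
ec2-authorize my-cassandra-group -P tcp -p 7000 -s 203.0.113.0/24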
My Plea
Hopefully I have provided someone with enough information to help me get this thing working as one would like.
Additional Information
Stack trace from first mx4j issue:
WARN 22:07:17,651 Could not start register mbean in JMX java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.cassandra.utils.Mx4jTool.maybeLoad(Mx4jTool.java:66)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:243)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:356)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
Caused by: java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:353)
My cassandra-topology.properties
aaa.aaa.aaa.aaa=us-east-1:us-east-1a
bbb.bbb.bbb.bbb=us-east-1:us-east-1c
ccc.ccc.ccc.ccc=us-west-1:us-west-1a
ddd.ddd.ddd.ddd=us-west-1:us-west-1c
default=us-east-1:us-east-1a
My nodetool ring output:
Address DC Rack Status State Load Owns Token
85070591730234615865843651857942052865
aaa.aaa.aaa.aaa us-east 1a Up Normal 11.09 KB 50.00% 0
bbb.bbb.bbb.bbb us-west 1a Up Normal 6.68 KB 0.00% 1
ccc.ccc.ccc.ccc us-east 1c Up Normal 11.09 KB 50.00% 85070591730234615865843651857942052864
ddd.ddd.ddd.ddd us-west 1c Up Normal 15.5 KB 0.00% 85070591730234615865843651857942052865
I'm pretty certain I've added the regions/availability zone correctly. At least I think I matched what appears in the documentation. (Look at Ec2MultiRegionSnitch in this link)
http://www.datastax.com/docs/1.0/cluster_architecture/replication
I don't think I can just list the regions as us-west and us-east because there are two regions out west (us-west-1 is the California region and us-west-2 is the Oregon region). So I don't think just putting us-west would successfully differentiate regions.
My guess in my comment was right. Your replication settings and datacenter names don't match. A couple of things.
1) cassandra-topology.properties is only used by the PropertyFileSnitch. That file is irrelevant while using the ec2 snitch.
2) The reason the snitch is currently reporting 'us-west' instead of 'us-west-1' is due to a bug. https://issues.apache.org/jira/browse/CASSANDRA-4026. If you added nodes in 'us-west-2' they will correctly get reported as that.
So the solution here is to update your replication settings:
CREATE KEYSPACE mykeyspace
WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = {us-east:2,us-west:2};
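Since mykeyspace already exists, it can also be altered in place rather than dropped and recreated; a sketch using the same 1.0-era cassandra-cli syntax as above:
UPDATE KEYSPACE mykeyspace
WITH placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
AND strategy_options = {us-east:2,us-west:2};
After changing replication settings, running nodetool repair on each node brings existing data in line with the new placement.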
Also, I unfortunately do not know what is wrong with mx4j. It isn't needed by Cassandra, though, so unless you actually need it for something you can just remove it.

Performance degradation using Azure CDN?

I have experimented quite a bit with CDN from Azure, and I thought I was home safe after a successful setup using a web-role.
Why the web-role?
Well, I wanted the benefits of compression and caching headers, which I was unable to obtain using the normal blob approach. As an added bonus, the case-sensitivity constraint was eliminated as well.
Enough about the choice of CDN serving; while all content was previously served from the same domain, I now serve more or less all "static" content from cdn.cuemon.net. In theory, this should improve performance, since browsers can fetch content in parallel across "multiple" domains rather than from one domain only.
Unfortunately this has led to a decrease in performance, which I believe has to do with the number of hops before content is served (using a tracert command):
C:\Windows\system32>tracert -d cdn.cuemon.net
Tracing route to az162766.vo.msecnd.net [94.245.68.160]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms 192.168.1.1
2 21 ms 21 ms 21 ms 87.59.99.217
3 30 ms 30 ms 31 ms 62.95.54.124
4 30 ms 29 ms 29 ms 194.68.128.181
5 30 ms 30 ms 30 ms 207.46.42.44
6 83 ms 61 ms 59 ms 207.46.42.7
7 65 ms 65 ms 64 ms 207.46.42.13
8 65 ms 67 ms 74 ms 213.199.152.186
9 65 ms 65 ms 64 ms 94.245.68.160
C:\Windows\system32>tracert cdn.cuemon.net
Tracing route to az162766.vo.msecnd.net [94.245.68.160]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms 192.168.1.1
2 21 ms 22 ms 20 ms ge-1-1-0-1104.hlgnqu1.dk.ip.tdc.net [87.59.99.217]
3 29 ms 30 ms 30 ms ae1.tg4-peer1.sto.se.ip.tdc.net [62.95.54.124]
4 30 ms 30 ms 29 ms netnod-ix-ge-b-sth-1500.microsoft.com [194.68.128.181]
5 45 ms 45 ms 46 ms ge-3-0-0-0.ams-64cb-1a.ntwk.msn.net [207.46.42.10]
6 87 ms 59 ms 59 ms xe-3-2-0-0.fra-96cbe-1a.ntwk.msn.net [207.46.42.50]
7 68 ms 65 ms 65 ms xe-0-1-0-0.zrh-96cbe-1b.ntwk.msn.net [207.46.42.13]
8 65 ms 70 ms 74 ms 10gigabitethernet5-1.zrh-xmx-edgcom-1b.ntwk.msn.net [213.199.152.186]
9 65 ms 65 ms 65 ms cds29.zrh9.msecn.net [94.245.68.160]
As you can see from the above trace route, all external content is delayed for quite some time.
It is worth noting that the Azure service is set up in North Europe and I am located in Denmark, which makes this trace route seem a bit .. hmm .. over the top.
Another issue might be that the web-role runs on two extra small instances; I have not yet found the time to try with two small instances, but I know that Microsoft limits extra small instances to 5 Mbps of WAN bandwidth, where small and above get 100 Mbps.
I am just unsure if this goes for CDN as well.
Anyway - any help and/or explanation is greatly appreciated.
And let me state that I am very satisfied with the Azure platform - I am just curious with regard to the above-mentioned matters.
Update
New tracert without the -d option.
Being inspired by user728584 I have researched and found this article, http://blogs.msdn.com/b/scicoria/archive/2011/03/11/taking-advantage-of-windows-azure-cdn-and-dynamic-pages-in-asp-net-caching-content-from-hosted-services.aspx, which I will investigate further in regards to public cache-control and CDN.
This does not explain the excessive hops count phenomenon, but I hope a skilled network professional can help in casting light to this matter.
Rest assured, that I will keep you posted according to my findings.
Not to state the obvious, but I assume you set the Cache-Control HTTP header to a large max-age so that your content was not removed from the CDN cache and served from Blob Storage when you ran your tracert tests?
There are quite a few edge servers near you so I would expect it to perform better: 'Windows Azure CDN Node Locations' http://msdn.microsoft.com/en-us/library/windowsazure/gg680302.aspx
Maarten Balliauw has a great article on usage and use cases for the CDN (this might help?): http://acloudyplace.com/2012/04/using-the-windows-azure-content-delivery-network/
Not sure if that helps at all, interesting...
Okay, after I implemented public cache-control headers, the CDN appears to do what is expected: delivering content from some number of nodes in the CDN cluster.
The caveat to the above is that it is based on experience; it has not been measured for concrete validation.
However, this link supports my theory: http://msdn.microsoft.com/en-us/wazplatformtrainingcourse_windowsazurecdn_topic3:
The time-to-live (TTL) setting for a blob controls for how long a CDN edge server returns a copy of the cached resource before requesting a fresh copy from its source in blob storage. Once this period expires, a new request will force the CDN server to retrieve the resource again from the original blob, at which point it will cache it again.
This was my assumed challenge: the CDN-referenced resources kept polling the original blob.
Credit must also be given to this link (provided by user728584): http://blogs.msdn.com/b/scicoria/archive/2011/03/11/taking-advantage-of-windows-azure-cdn-and-dynamic-pages-in-asp-net-caching-content-from-hosted-services.aspx.
And the final link for now: http://blogs.msdn.com/b/windowsazure/archive/2011/03/18/best-practices-for-the-windows-azure-content-delivery-network.aspx
For ASP.NET pages, the default behavior is to set cache control to private. In this case, the Windows Azure CDN will not cache this content. To override this behavior, use the Response object to change the default cache control settings.
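In code, that override looks something like the following (a minimal sketch for classic ASP.NET / System.Web; the seven-day max-age is an arbitrary placeholder):
// Mark the response publicly cacheable so the CDN may cache it,
// and give it an explicit max-age.
Response.Cache.SetCacheability(HttpCacheability.Public);
Response.Cache.SetMaxAge(TimeSpan.FromDays(7));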
So my conclusion so far for this little puzzle is that you must pay close attention to your cache-control headers (which are often set to private for obvious reasons). If you skip the web-role approach, the TTL defaults to 72 hours, so you may never experience what I experienced; it will just work out of the box.
Thanks to user728584 for pointing me in the right direction.

How to verify that Squid used as a reversed proxy is working?

We want to decrease the load in one of our web servers and we are running some tests with squid configured as a reverse proxy.
The configuration is shown below:
http_port 80 accel defaultsite=original.server.com
cache_peer original.server.com parent 80 0 no-query originserver name=myAccel
acl our_sites dstdomain .contentpilot.net
http_access allow our_sites
cache_peer_access myAccel allow our_sites
cache_peer_access myAccel deny all
The situation we are seeing is that the server returns TCP_MISS almost all the time:
1238022316.988 86 69.15.30.186 TCP_MISS/200 797 GET http://original.server.com/templates/site/images/topnav_givingback.gif - FIRST_UP_PARENT/myAccel -
1238022317.016 76 69.15.30.186 TCP_MISS/200 706 GET http://original.server.com/templates/site/images/topnav_diversity.gif - FIRST_UP_PARENT/myAccel -
1238022317.158 75 69.15.30.186 TCP_MISS/200 570 GET http://original.server.com/templates/site/images/topnav_careers.gif - FIRST_UP_PARENT/myAccel -
1238022317.344 75 69.15.30.186 TCP_MISS/200 2981 GET http://original.server.com/templates/site/js/home-search-personalization.js - FIRST_UP_PARENT/myAccel -
1238022317.414 85 69.15.30.186 TCP_MISS/200 400 GET http://original.server.com/templates/site/images/submenu_arrow.gif - FIRST_UP_PARENT/myAccel -
1238022317.807 75 69.15.30.186 TCP_MISS/200 2680 GET http://original.server.com/templates/site/js/homeMakeURL.js - FIRST_UP_PARENT/myAccel -
1238022318.666 1401 69.15.30.186 TCP_MISS/200 103167 GET http://original.server.com/portalresource/lookup/wosid/intelliun-2201-301/image2.jpg - FIRST_UP_PARENT/myAccel image/pjpeg
1238022319.057 1938 69.15.30.186 TCP_MISS/200 108021 GET http://original.server.com/portalresource/lookup/wosid/intelliun-2201-301/image1.jpg - FIRST_UP_PARENT/myAccel image/pjpeg
1238022319.367 83 69.15.30.186 TCP_MISS/200 870 GET http://original.server.com/templates/site/images/home_dots.gif - FIRST_UP_PARENT/myAccel -
1238022319.367 80 69.15.30.186 TCP_MISS/200 5052 GET http://original.server.com/templates/site/images/home_search.jpg - FIRST_UP_PARENT/myAccel -
1238022319.368 88 69.15.30.186 TCP_MISS/200 5144 GET http://original.server.com/templates/site/images/home_continue.jpg - FIRST_UP_PARENT/myAccel -
1238022319.368 76 69.15.30.186 TCP_MISS/200 412 GET http://original.server.com/templates/site/js/showFooterBar.js - FIRST_UP_PARENT/myAccel -
1238022319.377 100 69.15.30.186 TCP_MISS/200 399 GET http://original.server.com/templates/site/images/home_arrow.gif - FIRST_UP_PARENT/myAccel -
We already tried clearing the entire cache. Any ideas? Could it be that my web site is marking some of the content as different each time, even though it has not changed since the last time the proxy requested it?
What headers is the origin server (web server) sending back with your content? In order to be cacheable by squid, I believe you generally have to specify either a Last-Modified or ETag in the response header. Web servers will typically do this automatically for static content, but if your content is being dynamically served (even if from a static source) then you have to ensure they are there, and handle request headers such as If-Modified-Since and If-None-Match.
Also, since I got pointed to this question by your subsequent question about sessions--- is there a "Vary" header coming out in the response? For example, "Vary: Cookie" tells caches that the content can vary according to the Cookie header in the request: so static content wants to have that removed. But your web server might be adding that to all requests if there is a session, regardless of the static/dynamic nature of the data being served.
In my experience, some experimentation with the HTTP headers to see what the effects are on caching is of great benefit: I remember finding that the solutions were not always obvious.
Examine the headers returned with wireshark or firebug in firefox (the latter is easier to prod around but the former will give you more low-level information if you end up needing that).
Look for these items in the Response Headers (click on an item in the `Net' view to expand it and see request and response headers):
Last-Modified date -> if not set to a sensible time in the past then it won't be cached
Etags -> if these change every time the same item is requested then it will be re-fetched
Cache-Control -> Requests from the client with max-age=0 will (I believe) request a fresh copy of the page each time
(edit) Expires header -> If this is set in the past (i.e. always expired) then squid will not cache it
As suggested by araqnid, the HTTP headers can make a huge difference to what the proxy will think it can cache. If your back-end is using apache then test that static files served without going via any PHP or other application layer are cacheable.
Also, check that the squid settings for maximum_object_size and minimum_object_size are set to sensible values (the defaults are 4Mb and 0kb, which should be fine), and maximum cache item ages are also set sensibly.
(See http://www.visolve.com/squid/squid30/cachesize.php#maximum_object_size for this and other settings)
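For a quick end-to-end check from the command line, you can request the same object twice and watch both the response headers and Squid's access log (a sketch; the URL is taken from the log excerpt above, and the log path may differ on your system):
# Dump the response headers, discard the body.
curl -s -D - -o /dev/null http://original.server.com/templates/site/images/home_dots.gif
# Repeat the request; a cacheable object should flip Squid's log entry
# from TCP_MISS to TCP_HIT (or TCP_MEM_HIT) on the second fetch.
tail /var/log/squid/access.log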
