Category: load-balancer

A10 AX series ADC look and feel (F5 comparison)

When searching for A10 help, I have come up short most of the time. The good thing about A10 aFleX rules is that you can reference F5 iRule documentation, because both are based on TCL. But for any other operational or troubleshooting task, I have not been so lucky. Coming from an F5 background, I would like to start a short series to log some of my findings from working with the AX series ADC, such as basic A10 commands, syntax and troubleshooting methods. Similar to the F5 (post 9.x), the A10 AX also has a very useful command line, but for some tasks the GUI is much cleaner and faster.

Let's start with some common terms, the configuration hierarchy, and how it differs from F5. On the F5, you had your VIPs (virtual servers), each tied to a specific IP:port and directly tied to a pool with pool members (server nodes). So you typically had multiple virtual servers for the same IP if you needed to expose multiple ports.
.
├── Virtual server
    └── pool
        └── node
Things are designed a bit differently on the A10; you will see this instead:
.
├── Virtual Server
    └── Virtual Service
        └── Service Group
            └── Server
You will only have one Virtual Server per IP address. In the Virtual Server configuration, you define port mappings, which are known as 'Virtual Services'. They map VIP ports to service groups, such as:
Port TCP 80 --> Service Group HTTP-EU-app01-tcp8080-sg
Port TCP 443 --> Service Group HTTPS-EU-app01-tcp8443-sg
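To make this concrete, here is a rough sketch of what that hierarchy looks like in the text configuration (the server name, IPs and ports are made up, and exact syntax varies a bit between ACOS releases):
slb server web01 10.2.2.11
   port 8080 tcp
slb service-group HTTP-EU-app01-tcp8080-sg tcp
   member web01:8080
slb virtual-server vip-app01 10.1.1.1
   port 80 http
      service-group HTTP-EU-app01-tcp8080-sg
The `port 80 http` stanza under the virtual server is what the GUI surfaces as a Virtual Service.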
Service Groups are very similar to F5 pools: this is where you configure member servers, the load balancing algorithm, server priority, health checks, etc. So looking back at A10's configuration hierarchy, the Virtual Server is just an abstraction layer that makes the GUI feel cleaner. You have one Virtual Server per IP, represented as one configuration page in the web UI, and from there you configure ports to direct requests to specific Service Groups.

But wait... what about 'Virtual Services'? These are generated when you map a port to a Service Group, and editing one brings you to a separate configuration page for that Virtual Service. In the text configuration, they are named like `_10.1.1.1_HTTP_80` if you happened to map port 80 as HTTP to VIP 10.1.1.1. This is nothing daunting, just a little different, with a small learning curve for someone coming from F5.

As far as the look and feel goes, the GUI is very easy to use.

![A10 ACOS UI](/images/a10_acos.png)

_The above picture is of a Thunder series box, not AX, but the OS is the same._

The UI is broken down into 'Monitor' and 'Config' modes. In Monitor mode you see graphs and counters for the objects you are looking at, while Config mode is strictly for configuration. The A10 has a proprietary HA engine, where there is an Active/Standby pair but also a VCS Primary/Secondary. You make configuration changes on the VCS Primary while traffic flows through the Active node, so the box you configure is not necessarily the box passing traffic. As for network interfaces, or in F5 terms, Self IPs: these are controlled via VRRP. You can have multiple VRRP domains, or you can throw all your networks into the default VRID domain.

Hope this helped as an introduction. I will write a few posts in the near future about basic configurations and also troubleshooting methods using the tools present on the ACOS CLI. Thanks!

Decrypt F5 SSL traffic for troubleshooting

Recently, I had to troubleshoot an issue where some improper API use was being blamed on the application. The traffic is SNAT'd, so the backend servers are essentially being proxied: from the backend servers' perspective, requests appear to come from the F5, not the original source IP. Since SSL is offloaded on the F5, we can only trace unencrypted traffic on the backend, and because that traffic is noisy and every request appears to come from the F5 self IP, it becomes very difficult to troubleshoot there. So we need to troubleshoot on the frontend, where the public source IP is preserved. That traffic is encrypted, so we need a means of viewing it unencrypted. This can be done with tcpdump and ssldump, both of which are installed on the F5 by default.

Now we are ready to start capturing traffic. You can do so with tcpdump, but be strict with your filters, as you can cause performance issues by sending too much output to stdout. You will want to save to a file, so we can later decrypt it with ssldump. Here is an example, listening on the frontend VLAN:
[root@LB01:Active] ~ # tcpdump -vvv -s 0 -nni vlan2025 -w test.pcap host 256.10.20.30
tcpdump: listening on vlan2025

30 packets received by filter
0 packets dropped by kernel
[root@LB01:Active] ~ #
Now you should have the PCAP file in your current directory, and you can view it with tcpdump, Wireshark, or any other packet analysis tool you have available, as PCAP is the industry-standard packet capture format. If you open the file in Wireshark, you will notice that the SSL/TLS payload displays **'Record Layer: Handshake Protocol: Encrypted Handshake Message'**, so you aren't able to view the decrypted data natively, which is expected. This is where ssldump comes in: it can use your F5 private keys to decrypt the trace. First, identify which SSL cert/key pair is used on the VIP you are troubleshooting. Look over the VIP configuration to find the client SSL profile it references; that profile, in turn, names the cert/key pair it uses. You will find these certs and keys in /config/ssl/ssl.crt and /config/ssl/ssl.key on 9.x/10.x, or under /config/filestore/files_d/Common_d/certificate_d and certificate_key_d on 11.x.
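On 11.x, a quick way to trace this from the CLI is something along these lines (the virtual server and profile names below are placeholders for your own):
[root@LB01:Active] ~ # tmsh list ltm virtual vs_example.com_443 profiles
[root@LB01:Active] ~ # tmsh list ltm profile client-ssl example.com_clientssl cert key
The first command lists the profiles attached to the VIP; the second shows the cert and key referenced by the client SSL profile. Once you have located the key file, you can decrypt the PCAP, as in the following example: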
[root@LB01:Active] ~ # ssldump -Aed -nr test.pcap -k /config/ssl/ssl.key/example.com.key
#1: 256.10.20.30(50123) <-> 192.168.255.10(443)
1 1  1421093568.0163 (0.1353)  C>SV3.1(125)  Handshake

(.. Omitted for brevity ..)

1 10 1421093568.3602 (0.1746)  C>SV3.1(351)  application_data
    ---------------------------------------------------------------
    POST /example/wrong/uri/app HTTP/1.1
    Content-Type: text/xml; charset=utf-8
    Host: example.com
    Content-Length: 472
    Expect: 100-continue
    Accept-Encoding: gzip, deflate
    Connection: Keep-Alive

(.. Omitted for brevity ..)

1 1421093568.8656 (0.0000) S>C TCP FIN
1 1421093569.0030 (0.1374) C>S TCP FIN
[root@LB01:Active] ~ #
So there you have it: you can decrypt SSL traffic with nothing more than tcpdump and ssldump, as long as you have the private key. Keep in mind this only works for sessions negotiated with RSA key exchange; ephemeral (DHE/ECDHE) cipher suites cannot be decrypted with the server's private key alone. You can perform the same task by using tcpdump to write the PCAP and then using the private key in Wireshark to decrypt the traffic, although I find it easier to troubleshoot with the tools on the F5 when I can. Let me know if you have any questions on the subject or any suggestions to improve this method. Feel free to comment!
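If you prefer the Wireshark route, a rough off-box equivalent with tshark would look like this (the VIP address and key name are taken from the examples above; the ssl.keys_list preference applies to the Wireshark releases that were current at the time of writing):
tshark -r test.pcap -o "ssl.keys_list:192.168.255.10,443,http,example.com.key" -V
This points the dissector at the RSA private key for traffic to 192.168.255.10:443 and decodes the decrypted payload as HTTP.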

Configure HAProxy to remove host on Nagios scheduled downtime

I was messing around with HAProxy yesterday and thought it would be useful to integrate Nagios downtime into the process of taking a node off the load balancer. This method uses xinetd to emulate an HTTP response and isn't limited to HAProxy; it can be used with any LB that supports basic HTTP health checks... so all of them? The required components to make this demonstration work are:

* Linux webserver with xinetd
* HAProxy server
* Nagios server with [Nagios-api][1] installed
* And root access to the above servers!

Now to get started, I used [this guide][2] to get Nagios-api up and running. Once you have Nagios-api running, you should be able to query the status of your webserver via:
[user@nagios ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool
{
    "content": {
        "acknowledgement_type": "0",
        "active_checks_enabled": "1",
        "check_command": "check-host-alive",
        "check_execution_time": "0.010",
        "check_interval": "5.000000",
        "check_latency": "0.024",
        "check_options": "0",
        "check_period": "",
        "check_type": "0",
        "comment": [],
        "current_attempt": "1",
        "current_event_id": "0",
        "current_notification_id": "0",
        "current_notification_number": "0",
        "current_problem_id": "0",
        "current_state": "0",
        "downtime": [],
        "event_handler": "",
        "event_handler_enabled": "1",
        "failure_prediction_enabled": "1",
        "flap_detection_enabled": "1",
        "has_been_checked": "1",
        "host": "prod-web01",
        "host_name": "prod-web01",
        "is_flapping": "0",
        "last_check": "1428676190",
        "last_event_id": "0",
        "last_hard_state": "0",
        "last_hard_state_change": "1428674980",
        "last_notification": "0",
        "last_problem_id": "0",
        "last_state_change": "1428674980",
        "last_time_down": "0",
        "last_time_unreachable": "0",
        "last_time_up": "1428676200",
        "last_update": "1428676315",
        "long_plugin_output": "",
        "max_attempts": "10",
        "modified_attributes": "0",
        "next_check": "1428676500",
        "next_notification": "0",
        "no_more_notifications": "0",
        "notification_period": "24x7",
        "notifications_enabled": "1",
        "obsess_over_host": "1",
        "passive_checks_enabled": "1",
        "percent_state_change": "0.00",
        "plugin_output": "PING OK - Packet loss = 0%, RTA = 0.06 ms",
        "problem_has_been_acknowledged": "0",
        "process_performance_data": "1",
        "retry_interval": "1.000000",
        "scheduled_downtime_depth": "0",
        "services": [
            "smb:139-dosamba-prod-web01",
            "http:43326-donagios-prod-web01",
            "int:load-donagios-prod-web01",
            "int:process_postfix-dopostfix-prod-web01",
            "ssh:15022-docommon-prod-web01",
            "int:process_puppetmaster-dopuppetmaster-prod-web01",
            "int:process_puppetdb-dopuppetmaster-prod-web01",
            "int:disk_root-donagios-prod-web01",
            "int:process_smbd-dosamba-prod-web01",
            "int:process_nagios-donagios-prod-web01"
        ],
        "should_be_scheduled": "1",
        "state_type": "1",
        "type": "hoststatus"
    },
    "success": true
}
So if you notice above, “scheduled_downtime_depth” is the value we are looking for. It is currently 0, meaning no downtime is set. We can easily grab that value with the following one-liner and save it for later:
[user@nagios ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}'
0
So now the fun part begins: creating the xinetd script to emulate the HTTP response. What we want is to return a 200 (OK) when our scheduled_downtime_depth query returns 0, and a 5xx (BAD) when it returns a non-zero value, meaning downtime is set. So there are a few things we need to do:

1. Write our script, which will return a 200 if our check passes, otherwise a 503. In the script below, 192.168.33.10 is the Nagios server and prod-web01 is the Nagios-configured host for our web server. The xinetd script resides on the webserver, since that is where the health check from HAProxy is directed:

#### /opt/serverchk
#!/bin/bash
# /opt/serverchk - xinetd health check script:
# returns HTTP 200 when no Nagios downtime is scheduled for this host, 503 otherwise.

DOWN=`curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}'`

if [ "$DOWN" == "0" ]
then
        # no downtime scheduled, return HTTP 200
        /bin/echo -e "HTTP/1.1 200 OK\r\n"
        /bin/echo -e "Content-Type: text/plain\r\n"
        /bin/echo -e "\r\n"
        /bin/echo -e "No downtime scheduled.\r\n"
        /bin/echo -e "\r\n"
else
        # downtime is scheduled, return HTTP 503
        /bin/echo -e "HTTP/1.1 503 Service Unavailable\r\n"
        /bin/echo -e "Content-Type: text/plain\r\n"
        /bin/echo -e "\r\n"
        /bin/echo -e "**Downtime is SCHEDULED**\r\n"
        /bin/echo -e "\r\n"
fi
2. Add the service name to the tail of /etc/services
serverchk	8189/tcp		# serverchk script
3. Add the xinetd configuration with the same service name as above:

#### /etc/xinetd.d/serverchk
# default: on
# description: serverchk
service serverchk
{
        flags           = REUSE
        socket_type     = stream
        port            = 8189
        wait            = no
        user            = nobody
        server          = /opt/serverchk
        log_on_failure  += USERID
        disable         = no
        only_from       = 0.0.0.0/0
        per_source      = UNLIMITED
}
4. Restart xinetd
[user@prod-web01 ~]$ sudo service xinetd restart
Redirecting to /bin/systemctl restart  xinetd.service
Now the web portion is complete. You can test it by curling the configured xinetd service port from HAProxy, or from any other host if you didn't restrict access via 'only_from':
root@haproxy:~# curl -s 192.168.56.101:8189
Content-Type: text/plain



No downtime scheduled.



root@haproxy:~#
Now that it works, we can configure HAProxy. To do so, let's look over the current backend config for our webserver. Here is the excerpt from /etc/haproxy/haproxy.cfg:
backend nagios-test_BACKEND
  balance roundrobin
  server nagios-test 192.168.56.101:80 check
We need to modify this by adding the httpchk option and specifying the check port:
backend nagios-test_BACKEND
  option httpchk HEAD
  balance roundrobin
  server nagios-test 192.168.56.101:80 check port 8189
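Optionally, on HAProxy 1.5 and later you can also state the expected response explicitly with `http-check expect`, which makes the intent of the check a little clearer:
backend nagios-test_BACKEND
  option httpchk HEAD
  http-check expect status 200
  balance roundrobin
  server nagios-test 192.168.56.101:80 check port 8189
For our script the behavior is the same either way (200 is up, 503 is down), so this is purely for readability.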
Now let's reload HAProxy and check the status:
root@haproxy:~# sudo /etc/init.d/haproxy reload
 * Reloading haproxy haproxy                                                                                                                                                                             [ OK ]
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#
Excellent! Now let's put the host into maintenance mode (downtime) on Nagios and see what comes of it!
[user@nagios nagios-api]$ ./nagios-cli -H localhost -p 8080 schedule-downtime prod-web01 4h
[2015/04/10 15:16:59] {diesel} INFO|Sending command: [1428679019] SCHEDULE_HOST_DOWNTIME;prod-web01;1428679019;1428693419;1;0;14400;nagios-api;schedule downtime
[user@nagios nagios-api]$
And now let's check the Nagios downtime value, hit the xinetd script remotely from HAProxy on port 8189, and check the status of the BACKEND resource:
root@haproxy:~# curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth
        "scheduled_downtime_depth": "1",
root@haproxy:~# curl -s 192.168.56.101:8189
Content-Type: text/plain



**Downtime is SCHEDULED**


root@haproxy:~# curl -sI 192.168.56.101:8189
HTTP/1.1 503 Service Unavailable

root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,DOWN
nagios-test_BACKEND,DOWN
root@haproxy:~#
As we can see, Nagios is reporting a non-zero value for downtime, the web server shows our script working correctly and returning a 503, and HAProxy shows the node as down. Awesome! Now let's cancel the downtime and watch it come back up:
[user@nagios nagios-api]$ ./nagios-cli -H localhost -p 8080 cancel-downtime prod-web01
[2015/04/10 15:24:09] {diesel} INFO|Sending command: [1428679449] DEL_HOST_DOWNTIME;4
[user@nagios nagios-api]$
And…
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#
SUCCESS! So effectively, this xinetd script can be deployed on all of your webservers, changing only the host that the Nagios-api query targets in each copy of the script. Also, using xinetd scripts in this fashion, you can perform many other "checks" on the server behind the load balancer: anything that can be done in a Bash (or language of your choice) script can be turned into the boolean up/down state needed to bring the node online or offline. I'd like to see if anyone else has done something similar or has any suggestions to improve! Please comment!

DISCLAIMER: Please test thoroughly before using this solution in a production environment. I am not liable for your mistakes 😉

[1]: https://github.com/zorkian/nagios-api
[2]: http://www.eventenrichment.com/installing-nagios-api-ubuntu-12-04-lts/