Jade's Tech Notes

More bits more places

Basic Configuration in Ciena SAOS

Ciena makes Ethernet & Optical switches for the service provider market. There’s essentially zero publicly available documentation, which is a shame because they make some interesting equipment and, as carriers upgrade, there’s a lot of used equipment available that smaller providers can put to good use. I’m not sure why it’s traditional for telco-oriented companies to be stingy with their documentation but it’s annoying.

I have some LE-311v, 3911, and a 3940 here for a project so here are my notes. As far as I know they apply to all devices unless otherwise noted.

This also applies to devices running World Wide Packets LE-OS. Ciena acquired WWP and built their packet Ethernet offerings on tech from the acquisition.

A Zone Named Management on a Juniper SRX

TL;DR: Don’t name a zone “management” on a Juniper SRX (11.4R7.5).

One of my on again, off again projects involves moving a datacenter management network with devices on public IP space with ACLs for protection to private IP space with a zone-based firewall (Juniper SRX240).

When I last touched it I ran into a problem where one zone would not pass traffic even though it had identical rules to a different zone that worked. It happens that the zone that didn’t work was named “management”.

Earlier today I was browsing around and found an article that mentioned the management functional zone, which got me wondering if there was something special about naming a zone “management”. I thought that didn’t make sense since the functionality comes from the functional-zone tag, not the zone name. Cue more unsuccessful searching for any mention of reserved zone names. Eventually I decided to just rename the zone and see what happened. One quick rename statement and “management” became “dcn-mgmt” and everything started to work.

What?

Then I came across a post to J-NSP that mentions management being a reserved keyword for zones. Oh. That explains it.

Cisco CSR 1000V on SmartOS

In pursuit of some Friday Evening Fun I decided to try to get a Cisco CSR 1000V VM running on SmartOS. I was successful. It runs. It seems to run quite well.

First, an overview of the systems involved since the CSR1000V is fairly new and SmartOS is relatively uncommon.

The Cisco Cloud Services Router 1000V (CSR1000V) is Cisco’s entry in the virtualized router field. It runs IOS-XE and brings many of the handy features that you can find on the ASR1k platform into the cloud so you can do fun and exciting things with MPLS, VPLS, LISP, etc. This is a pleasant change from Vyatta which lacks any sort of MPLS support and Mikrotik which blackholes traffic for no good reason.

SmartOS is a hypervisor based on Illumos which is in turn a fork of OpenSolaris which was next-gen Solaris before Oracle borged Sun. SmartOS has all the wonderful features of OpenSolaris (ZFS, Zones, network virtualization) and adds support for KVM. There are some interesting design choices in SmartOS that turn out to be pretty cool once you get used to them, in particular: Netboot only (or USB if you are desperate) with all local disk used for VMs and VM Management by JSON config files.

I started by downloading the CSR1000V ISO (csr1000v-universalk9.03.10.00.S.153-3.S-ext.iso) from CCO and then uploading it to a server in the datacenter. I followed the instructions for creating a KVM VM on SmartOS except I copied the CSR1000V ISO over instead of a Debian ISO. Installation was smooth but after boot and initial configuration I wasn’t able to reach the VM. It turns out that SmartOS has some decent network security policies by default so I had to add "allow_ip_spoofing": true to the VM config before traffic started to flow.
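As a rough sketch of that config change, the snippet below sets the flag on every NIC in a VM manifest before it is handed to vmadm. It assumes the property is applied per NIC inside the `nics` array; the alias, NIC tag, NIC model, and address are hypothetical placeholders, and the RAM and vCPU values are simply the footprint mentioned later in this post.

```python
import json


def enable_ip_spoofing(manifest: dict) -> dict:
    """Return a copy of a VM manifest with allow_ip_spoofing set on every NIC."""
    # Deep copy via a JSON round trip so the caller's dict is left untouched.
    patched = json.loads(json.dumps(manifest))
    for nic in patched.get("nics", []):
        nic["allow_ip_spoofing"] = True
    return patched


# Hypothetical manifest fragment; only the fields needed for the example.
example = {
    "brand": "kvm",
    "alias": "csr1000v",  # placeholder name
    "ram": 4096,
    "vcpus": 4,
    "nics": [
        {
            "nic_tag": "external",  # placeholder NIC tag
            "ip": "192.0.2.10",  # documentation address, not a real one
            "netmask": "255.255.255.0",
            "model": "e1000",  # assumed NIC model
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(enable_ip_spoofing(example), indent=2))
```

The printed JSON can be compared against the real manifest by hand; the actual config used here is the one linked in the Gist.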

After the CSR was installed and configured for basic connectivity I set it up as a route reflector and sent it the partial BGP feed that I run on our old Cisco 6500 series routers (~100,000 routes). sho proc cpu hist showed up to 40% CPU usage while it was loading. After that I added a table-map containing a deny statement to the BGP config to prevent it from wasting time trying to update the RIB. This reduced loading time to two seconds with no response lag on the terminal.

My feeling in general on the CSR is that it simply has too large of a footprint for general cloud routing & VPN duties. Vyatta gets the job done with a lot less than 4G of RAM and four CPU cores. If your needs go beyond simple routing & VPN then the CSR becomes more appealing. So far it seems well suited to the route reflector role but I have more testing to do before I commit to it.

If the CSR holds up to testing I hope to run a pair of them in place of the aging 7200s that are currently acting as my route reflectors. Cisco’s recommended route reflector platform is the ASR1k but as a small ISP I just can’t justify the cost on something that doesn’t generate revenue. If this works I can stay with Cisco instead of cobbling something together on a different platform.

The configs used are available as GitHub Gists: VM config JSON and CSR config.

Home Directories and SMF Error 96

tl;dr: make sure the user’s home directory exists.

The other day I was setting up a Tomcat instance (for Atlassian Stash) to run under SMF on SmartOS with an alternate user and low ports (on 443 without root). The service kept going into maintenance state with a reason of Start method exited with $SMF_EXIT_ERR_CONFIG and a service log message of

[ Jul 12 20:45:38 Enabled. ]
[ Jul 12 20:45:39 Executing start method ("/opt/stash/bin/start-stash.sh"). ]
[ Jul 12 20:45:39 Method "start" exited with status 96. ]

Everything I found when searching the web referred to config files missing and other sorts of things that are fairly obvious from log messages, except there wasn’t anything relevant in the logs. I replaced the start method with a shell script that printed a message to a file in /tmp and the file wasn’t updated. That got my focus back on SMF. Why would SMF not run a start method and say it was a configuration error?

After a few rounds of checking the SMF manifest and head banging I remembered that I didn’t specify -m (create home directory) when I created the user that the service was going to run under. I changed the user’s home directory from null to the app’s working dir in /var. The service started up after that.
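A quick pre-flight check along these lines would have caught it sooner. This is a minimal sketch, not anything SMF provides: it looks up the service user with the standard `pwd` module and reports whether the home directory is set and exists. The user name `stash` in the usage note is a hypothetical example.

```python
import os
import pwd
from typing import Optional


def home_dir_problem(home: str) -> Optional[str]:
    """Describe what is wrong with a home directory path, or return None if it looks fine."""
    if not home:
        return "no home directory set"
    if not os.path.isdir(home):
        return f"home directory {home!r} does not exist"
    return None


def check_user(username: str) -> Optional[str]:
    """Look up a user in the password database and check its home directory."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return f"user {username!r} not found"
    return home_dir_problem(entry.pw_dir)
```

Calling `check_user("stash")` before enabling the service returns `None` when the home directory is usable and a short description of the problem otherwise.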

I’m not sure why neither SMF’s verbose documentation nor the extended documentation online that is referenced in the error mentioned that when you start a service as a user other than root that user has to have a home directory. Instead everything points toward an application configuration error.

For reference, the SMF manifest I’m using to start Tomcat (as bundled with Atlassian Stash) is available as a GitHub Gist: stash.xml.

Generating RRD Graphs With Nginx

Nginx has a neat module called mod_rrd_graph that generates graphs from Round Robin Database files and serves up the resulting image using the web server. This is good because otherwise you are likely messing with CGI scripts or wasting time generating graphs that may be viewed infrequently.

I have an application (Bongo) that gathers key stats from a few thousand CPE devices every five minutes. This generates a fair amount of disk I/O activity but it is bearable with RAID10 & rrdcached. Before I got mod_rrd_graph working I was redrawing three graphs for every CPE device twice every day, which was pretty unbearable from both a performance and a user experience (why are my graphs five hours old?) perspective. It was also a sad waste of I/O activity as most graphs were never viewed; they are only checked if a customer has a problem.

The documentation is OK to get started but I quickly reached the limit of what I could find with Google.

Mikrotik DHCP & FreeRADIUS With a Hint of Unlang

Mikrotik has two particularly useful subscriber management features built into the DHCP server on RouterOS. First, if you set the rate-limit property on a DHCP lease it will dynamically manage a Simple Queue to enforce the rate limit. Second, it will authenticate to a RADIUS server using the DHCP client’s MAC address as the username. The RADIUS server can reply with the IP address, pool, or traffic shaping parameters.

We use this combination to rate limit customers on Ubiquiti AirMax equipment since Ubiquiti is somewhat unaccommodating towards OSS/BSS integrations. When we set up our system we envisioned three states of a customer so far as DHCP RADIUS is concerned:

  1. Users that aren’t known to the system don’t get an IP address. We do this by setting the default address-pool on the DHCP server instance to static-only.
  2. A user that is known to the system and has an active account is assigned a DHCP Pool of ubnt-cust (for Ubiquiti customer).
  3. A user with a delinquent account is assigned to the ubnt-deact (for Ubiquiti deactivated) pool. Users in this pool are assigned an RFC1918 IP address and redirected to a splash page.
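The three states above boil down to a small decision function. The sketch below is only an illustration of the policy, not how the B/OSS or FreeRADIUS actually implements it; the function name and its boolean inputs are hypothetical.

```python
from typing import Optional


def pool_for_customer(known: bool, active: bool) -> Optional[str]:
    """Map a customer's state to the DHCP pool name RADIUS should hand back."""
    if not known:
        # Unknown devices get no pool from RADIUS; the DHCP server's
        # static-only default then means no address is handed out.
        return None
    # Known customers: active accounts get the customer pool,
    # delinquent accounts get the deactivated (splash page) pool.
    return "ubnt-cust" if active else "ubnt-deact"
```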

You might think that we would want unknown users to get a splash page, and we do on our Cambium Canopy network which comes with strong AAA support. On the Ubiquiti network we assume that anyone we don’t know about that manages to connect to our network is hostile and want to give them as little data as possible. Every little obstacle helps. I may relax this after exploring the fairly new RADIUS authentication support in Ubiquiti AirOS.

Anyway, we ended up with a Mikrotik router at each AirMax tower running RADIUS-backed DHCP. Our B/OSS provisions our existing FreeRADIUS system with Ubiquiti device MAC addresses as usernames and puts them in a group with a Mikrotik-Rate-Limit attribute that matches the customer’s speed package. Everything was working great until…

We started to retrofit existing towers with AirMax. These towers were often connected in a star topology, as opposed to our previous expansion where we built out in a line and then expanded to the side to make a box. Since we had a central tower it didn’t make sense to deploy a router at every spoke site[1]. Of course each tower (at a minimum) has its own VLAN to make management easier and contain bad behavior. This presented a problem since the Mikrotik DHCP server can’t dynamically select the correct subnet based on the ingress interface (unlike the Cisco IOS or ISC DHCPd servers). The solution is to create a separately named pool for each interface/tower and get the RADIUS server to provide the appropriate pool in the RADIUS reply. How do we do that? Well, the “easy” way would be to manually set a different pool for each tower in RADIUS. The problem is that this is more work on an ongoing basis.

What to do? Well, the Mikrotik DHCP server sends the DHCP server name in the RADIUS request as the Called-Station-Id. On all our towers we had been setting this to ubnt-server. On routers serving multiple towers I set the server name to ubnt-server-TWRID[2]. Then I set up the pools for each tower as ubnt-cust-TWRID and ubnt-deact-TWRID. In the FreeRADIUS server config I used unlang to extract the TWRID portion of each server name in the request and append it to the pool in the reply.

In sites-enabled/default:

post-auth {
 # Rewrite Mikrotik IP Pool assignments for routers with multiple pools
 if (request:Called-Station-Id =~ /^ubnt-server-(.*$)/) {
  update reply {
   Framed-Pool := "%{reply:Framed-Pool}-%{1}"
  }
 }
}
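To see what that rewrite produces without a RADIUS server in the loop, the same match-and-append logic can be mirrored in a few lines of Python. This is only a sketch for checking pool names; the function name is hypothetical and FreeRADIUS itself does the real work.

```python
import re

# Same pattern as the unlang condition: capture everything after "ubnt-server-".
SERVER_NAME = re.compile(r"^ubnt-server-(.*)$")


def rewrite_pool(called_station_id: str, framed_pool: str) -> str:
    """Append the tower ID from the DHCP server name to the pool name, if there is one."""
    match = SERVER_NAME.match(called_station_id)
    if match:
        return f"{framed_pool}-{match.group(1)}"
    # Single-tower routers use the plain server name, so the pool is left alone.
    return framed_pool
```

For example, `rewrite_pool("ubnt-server-HRT2", "ubnt-cust")` gives the pool name `ubnt-cust-HRT2`, while a request from a plain `ubnt-server` keeps `ubnt-cust`.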

Example Mikrotik DHCP server config:

/ip pool
add name=ubnt-cust-HRT2 ranges=199.X.X.2-199.X.X.14
add name=ubnt-deact-HRT2 ranges=10.10.17.82-10.10.17.94
add name=ubnt-deact-HRT1 ranges=10.10.17.210-10.10.17.222
add name=ubnt-deact-SLNG1 ranges=10.10.17.194-10.10.17.206
add name=ubnt-cust-HRT1 ranges=199.X.X.18-199.X.X.30
add name=ubnt-cust-SLNG1 ranges=199.X.X.34-199.X.X.46

/ip dhcp-server
add add-arp=yes authoritative=yes disabled=no interface=v100-ether7 \
    lease-time=30m name=ubnt-server-HRT2 src-address=199.X.X.1 use-radius=yes
add add-arp=yes authoritative=yes disabled=no interface=v100-ether8 \
    lease-time=30m name=ubnt-server-HRT1 src-address=199.X.X.17 use-radius=yes
add add-arp=yes authoritative=yes disabled=no interface=v201-ether9 \
    lease-time=30m name=ubnt-server-SLNG1 src-address=199.X.X.33 use-radius=yes

[1] Since you aren’t doing subscriber management at the tower you need some kind of rate limiting in the CPE as a safety net to prevent subscriber-originated UDP floods saturating the tower backhaul on their way to the router.

[2] TWRID is the unique ID for each tower. We base it on a pseudo-CLLI code: SLNG1, HRFD2, WHWR1, etc.

Enable SNMP Write on Canopy Radio

Cambium (formerly Motorola) Canopy radios come with SNMP write disabled (as they should). Unfortunately there isn’t a documented way to enable it without clicking through the web UI. I’ve been thinking about how to enable it for a while as our OSS project needs SNMP write to enforce configurations on the device. Over the weekend I got the chance to sit down and figure out exactly what needed to happen by watching the HTTP transactions with Charles Proxy. The result is a short bit of Python that can be used from the command line or imported into an existing system as a Python module: snmpset.py.

Continuity of Blog

This August marks the 10-year anniversary of me occasionally writing things in public in blog format (I’ve been posting online since 1999 when I was 14). I started out writing using Movable Type, moved to Blogger publishing to my own site over SFTP, then normal Blogger when Google discontinued SFTP, then Posterous. At each move I’ve imported old entries and fixed links. I’m now wondering if that was a good idea. My writing style and topics have varied greatly. What I care about now is much different than what I cared about in 2002. Maybe there is value in leaving old content in old places and starting fresh.