I'm trying to figure out a way to replicate the issue, but I'm having trouble -- here's what I'm trying:
Setup:
A 2011 MacBook Pro is connected to the Almond+ over 802.11n WiFi (MCS 13-15)
The Almond+ is connected to a Cisco Catalyst 2960G
A Macintosh VM is running on a vSphere cluster that is also hooked up to the above switch
The MacBook Pro is the only device connected to the A+ via WiFi. All other devices go through other networking gear, with a pfSense firewall controlling access to the Internet.
On the MacBook Pro:
pc-macbook:~ pat$ dd if=/dev/urandom of=250MBtestfile bs=1m count=250
250+0 records in
250+0 records out
262144000 bytes transferred in 17.196924 secs (15243656 bytes/sec)
pc-macbook:~ pat$ md5 250MBtestfile
MD5 (250MBtestfile) = 5862b0c4be5e7bfb668fcb0fac503471
Then on the Macintosh VM (hardwired):
mac-vm01:~ pat$ for i in {1..20}; do nc -l 1337 > 250MBtestfile-recv ; md5 250MBtestfile-recv ; done
Then on the MacBook Pro:
pc-macbook:~ pat$ for i in {1..20}; do cat 250MBtestfile | nc mac-vm01 1337 ; sleep 5 ; done
This sends the 250MB test file from the laptop to the listening Mac 20 times. The listening Mac should print the same MD5 checksum 20 times, and it should match the original computed on the sending Mac. This seems to work fine in both directions, no matter which end does the sending, but here is what I'm noticing on the Almond+:
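The receiver loop above can be made to flag bad transfers automatically instead of requiring a visual comparison of 20 checksum lines. A minimal sketch of that idea (the `check_copy` helper is hypothetical; it uses Linux-style `md5sum`, where the Macs in the post would use `md5 -q` instead, and the demo substitutes a small local file for the nc transfer):

```shell
#!/bin/sh
# Hypothetical receiver-side refinement: compare each received copy
# against the sender's known-good checksum and print OK or MISMATCH.
# Assumption: Linux md5sum; on macOS this would be `md5 -q` instead.

check_copy() {
    # $1 = received file, $2 = expected checksum
    sum=$(md5sum "$1" | cut -d' ' -f1)
    if [ "$sum" = "$2" ]; then
        echo "OK"
    else
        echo "MISMATCH ($sum)"
    fi
}

# Demo against a small local file; on the real receiver each file would
# come from `nc -l 1337 > 250MBtestfile-recv` as in the post.
dd if=/dev/urandom of=testfile bs=1024 count=16 2>/dev/null
expected=$(md5sum testfile | cut -d' ' -f1)
check_copy testfile "$expected"    # → OK
```

On the VM the loop would then become `for i in {1..20}; do nc -l 1337 > 250MBtestfile-recv ; check_copy 250MBtestfile-recv $EXPECTED ; done`, so a corrupted transfer stands out immediately.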
top - 03:21:24 up 5 days, 22:11, 0 users, load average: 0.08, 0.03, 0.00
Tasks: 94 total, 1 running, 93 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 1.0%sy, 0.0%ni, 61.2%id, 0.0%wa, 0.0%hi, 37.5%si, 0.0%st
Cpu1 : 0.3%us, 0.7%sy, 0.0%ni, 74.0%id, 0.0%wa, 0.0%hi, 25.0%si, 0.0%st
Mem: 432392k total, 125184k used, 307208k free, 12500k buffers
Swap: 0k total, 0k used, 0k free, 35328k cached
Software IRQ time seems abnormally high to me. I've seen it peak in the 50s on one CPU and the 30s on the other at the same time. That's a lot considering only one computer is connecting through the Almond+ while everything else still goes through my old access point. I'm wondering whether heavy load that pushes the A+ to 100% si, or to 0% idle on one or both CPUs, is what causes the problem to occur.
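top's %si readings are instantaneous and bounce around; to catch sustained softirq load during a transfer, the cumulative counters in /proc/stat can be sampled directly on the router. A sketch, assuming the Almond+ exposes a standard Linux /proc/stat (the `sample`/`si_report` helpers are mine, not anything shipped on the A+):

```shell
#!/bin/sh
# Sketch, assuming a standard Linux /proc/stat: sample the per-CPU
# counters twice, one second apart, and report each CPU's softirq share
# over the interval.
# /proc/stat cpuN fields: user nice system idle iowait irq softirq ...

sample() { grep '^cpu[0-9]' /proc/stat; }

# $1 = first sample, $2 = second sample
si_report() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, l1, "\n"); split(b, l2, "\n")
        for (i = 1; i <= n; i++) {
            m = split(l1[i], f1, " "); split(l2[i], f2, " ")
            tot = 0
            for (j = 2; j <= m; j++) tot += f2[j] - f1[j]
            si = f2[8] - f1[8]   # softirq is the 7th counter (field 8)
            if (tot > 0) printf "%s softirq: %.1f%%\n", f1[1], 100 * si / tot
        }
    }'
}

before=$(sample)
sleep 1
after=$(sample)
si_report "$before" "$after"
```

Running this in a loop on the router during the 20 transfers would show whether %si really sits pegged for the whole transfer or just spikes briefly.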
I think the best way to stress-test the router is to have 5-6 computers connected to the same Almond+ running this test at the same time. If packet corruption is occurring, we should see the md5sums come back borked while the router appears to be pegged.
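If each receiving end logs its results to a per-client file, a quick tally afterwards would show whether corruption clusters around the times the router was pegged. A sketch under assumed conventions (the `client*.log` names and the `run N: OK` / `run N: MISMATCH` line format are my inventions, not anything the A+ or nc produces):

```shell
#!/bin/sh
# Sketch for after the multi-client run: tally MISMATCH lines from each
# client's log so corruption counts can be lined up against the router's
# %si history. Log names and line format are assumptions.

for log in client*.log; do
    [ -e "$log" ] || continue       # no logs collected yet
    bad=$(grep -c 'MISMATCH' "$log")
    total=$(grep -c '^run ' "$log")
    echo "$log: $bad/$total transfers corrupted"
done
```

Timestamping each log line (e.g. prefixing with `date "+%H:%M:%S"`) would make it possible to match individual mismatches against the %si samples from the router.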