The link level CRC was a 16 bit checksum. With only a 16 bit CRC, every time a packet gets corrupted by noise, you've got about a 1 in 65536 chance that the bad packet will pass the CRC test.
By using a logic analyzer to trigger the "stop" on the protocol analyzer, I was able to capture the moment. The last packet received before the crash was always a packet with a "good" CRC, but its actual contents were garbled.
The protocol used by the stat mux assumed that the link level CRC would be adequate, so it didn't do any further error checking on the contents of the packet. Since the product was on its last legs, my boss approved the kludge of adding an extra one byte checksum inside the link level packet. A garbled packet now had to fool both checks, which is roughly 256 times less likely, and this turned the once a week crash into a once every five years crash.
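The arithmetic behind those numbers can be sketched roughly as follows. This is a back-of-the-envelope model, not a measurement: it assumes a corrupted packet slips past an n-bit check with probability 2^-n and that the two checks fail independently, which a simple one byte checksum only approximates. The variable names are made up for the illustration.

```python
# Rough model of the crash rate before and after the extra checksum.
# Assumptions (not from the original system): a garbled packet passes an
# n-bit check with probability 2**-n, and the two checks are independent.

CRC_BITS = 16            # link level CRC width
EXTRA_CHECKSUM_BITS = 8  # the one byte checksum added inside the packet

# Chance that a corrupted packet passes the CRC alone.
p_crc_only = 2 ** -CRC_BITS

# Chance that it passes both the CRC and the extra checksum.
p_both = p_crc_only * 2 ** -EXTRA_CHECKSUM_BITS

# How much rarer an undetected bad packet becomes.
improvement = p_crc_only / p_both

# Observed crash interval before the fix: about one week.
old_interval_weeks = 1
new_interval_weeks = old_interval_weeks * improvement
new_interval_years = new_interval_weeks / 52.18  # average weeks per year

print(f"CRC only: 1 in {round(1 / p_crc_only)}")
print(f"CRC plus checksum: 1 in {round(1 / p_both)}")
print(f"Crash interval: about {new_interval_years:.1f} years")
```

Under these assumptions, a weekly crash stretches to about 256 weeks, which is where the "once every five years" figure comes from.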
Now, the newer link level protocols all use 32 bit CRCs. Furthermore, TCP (which was developed in the bad old days of 16 bit CRCs) includes its own checksum.