On 2014-03-31 16:38, Dirk Hohndel wrote:
On Mon, 2014-03-31 at 10:36 +0200, Jef Driesen wrote:
To be honest, I don't feel comfortable removing the error checking for the syscalls, and not just because omitting error checking is usually the wrong thing to do.
But instead of making the error fatal, how about you just log it and try to keep going, anyway?
Think of it this way: "what's the worst that could happen?" a) the error is real, communication will fail (same result as today) b) the error is bogus, communication will succeed (winning!)
I get your point, but please read on...
First of all, this only seems to be a problem for pty's and not the real thing. The use of pty's is basically a hack to make the simulator work. For that purpose I'm happy with the --enable-pty option. Only a libdivecomputer developer will ever need it.
That assumes there are no issues with the libraries used or the kernel used and the specific serial device.
True, but I think that's a reasonable assumption. If we can no longer rely on the kernel or device drivers to get things right, then I think we're in far more trouble. Of course there can and will be bugs (as we just discovered the hard way), but those are supposed to get fixed, right?
Linus' mail to the fedora/redhat developers made that pretty clear I think.
But what worries me a lot more is that trying to communicate with a real device using the wrong settings may actually harm the device. This has already happened in the past with the Mares Icon HD and Matrix. Both use the same protocol, but trying to talk to a Matrix with the Icon HD baudrate will cause the device to lock up completely. Luckily there was no permanent damage and the device could be revived by removing the batteries. But that's pretty scary and something I want to avoid at all cost. This Mares problem is now gone, but that doesn't mean it can't happen for others.
But that wasn't a problem of RTS/DTR or other serial parameters. That was a USB communication issue. And IIRC this wouldn't have been caught by an error return, either.
Yes and no.
The scenario where this particular problem happened, was that a user accidentally selected the Icon HD protocol variant to connect to a Matrix dive computer. That happens to be the default choice when using the libdivecomputer universal application with just the -b iconhd option. To select the Matrix variant you need the extra -m 0x0F option (or use the -n "Mares Matrix" option).
The key difference between the two protocol variants was the different baudrate. You are right that trying to set the wrong baudrate did not cause the tcsetattr call to fail. That's because the usb-serial driver/chip happens to support that baudrate too. But, assume for a moment we ignore tcsetattr errors, and try to proceed. How do we know the baudrate was changed correctly? We simply don't know. It could have failed before it even tried to change the baudrate. The only case where we do know, is when it returns without errors. The result is that trying to proceeding anyway after an error, may put the device at risk!
Note that this Mares issue is gone now, because we are no longer using different baudrates, but that's irrelevant for this discussion.
So I'm not convinced ignoring the return value of a syscall, and hoping for the best, is the right solution here. For pty's I'm fine with that, but not for real serial ports.
It's your library... I think it makes much more sense to try to keep going and fail if the responses of actual communication make no sense, not if something odd happens during the setup of the communication.
I'm aware that in practice this risk will probably be very small. But if I have to choose between keeping the strict error checking, and removing it as a workaround for a bug in one particular linux distribution (and which only affects developers but not end users!), then that's an easy choice for me.
For as long I have been working on the libdivecomputer project, I've always been very cautious and never caused any damage to any dive computer, and I would like to continue that trend. I can assure you that the incident with the Mares Matrix was pretty scary, and I never want to go through that again. If that causes some minor inconvenience here and there, then that's a price I'm willing to pay. I hope you'll understand that.
Note that I absolutely don't mind enabling the pty "hack" on the tcsetattr call too. I think that should already be sufficient to work around this glibc bug in fedora/redhat.
Jef