[PATCH] Cochran_commander refactoring

John Van Ostrand john at vanostrand.com
Wed Jan 21 11:34:08 PST 2015


On Wed, Jan 21, 2015 at 10:52 AM, Jef Driesen <jef at libdivecomputer.org>
wrote:

> On 16-01-15 18:14, John Van Ostrand wrote:
>
>> Is the application going to expect to see dives from oldest to newest?
>>
>
> Yes! This is very important for the "download only new dives" feature. If
> the application receives the dives in newest first order, it can stop the
> download as soon as it recognizes a dive that it already has.
> Libdivecomputer's internal implementation (using the fingerprint feature)
> also requires this.
>

That's what I expected. So I'll take your "process dives from the oldest to
the newest one (which you already do), and sum the profile length." to be
misstated. I'll process from newest to oldest to determine which dives have
valid profile data and I'll continue to present dives newest to oldest.


> That reminds me of a somewhat related issue. Currently the entire profile
> ringbuffer (or at least the part that contains dives) gets downloaded with
> a single large request. For applications not using libdivecomputer's
> built-in fingerprint feature, that means we will first download ALL dives,
> and then throw away the ones the application has already downloaded before.
> That's not really optimal. Especially because downloading only the new
> dives happens to be our most common use-case.
>

The logic downloads only the sample data from the fingerprint dive forward;
addresses are rounded to 16KB boundaries, so that's the only waste. I'll
experiment with more accurate addresses.
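To illustrate the rounding I mean, something like this (a sketch only; the constant and function names are illustrative, not the actual backend code):

```c
#include <stdint.h>

#define COCHRAN_BLOCK_SIZE 0x4000 /* 16KB read granularity (assumed) */

/* Round a profile address down to the start of its 16KB block. */
static uint32_t cochran_round_down(uint32_t address)
{
    return address & ~(uint32_t)(COCHRAN_BLOCK_SIZE - 1);
}

/* Round an end address up to the next 16KB boundary. */
static uint32_t cochran_round_up(uint32_t address)
{
    return (address + COCHRAN_BLOCK_SIZE - 1) & ~(uint32_t)(COCHRAN_BLOCK_SIZE - 1);
}
```

The waste is then at most one block minus one byte on each end of the requested range.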

> This could be improved by downloading the profile data with multiple
> smaller requests. For example one request per dive profile (or two requests
> if the dive happens to cross the ringbuffer end). I believe this is
> technically possible with the Cochran protocol. Multiple requests will
> probably add some extra overhead when doing a complete download. But I
> doubt that outweighs the benefits in the normal case of downloading only a
> few dives.
>
> With multiple smaller requests it's also possible to cancel a download. So
> that's an extra advantage that's nearly impossible to implement correctly
> with just one large request.
>

Downloading sample data is more complicated than downloading id, config,
and misc data.
First, an 800ms delay seems to be needed. Maybe this can be whittled down,
though I'm sure I tried already. Then we wait to see a heartbeat byte
issued every second. If we don't wait the DC will ignore our command. So
that's an average of 500ms. Then the command is sent and we switch to a
high baud rate. The DC takes 450ms to process the request. Round it all up
a bit and it's about 1.8 seconds per request. Then it's about 250ms per 4K
of data.

With that in mind let's think about some scenarios:

1. One dive to download will take the same time either way.
2. Ten dives (say 100KB of samples) would take 24 seconds (18 seconds of
overhead and 6 seconds of data transfer) done one dive at a time versus 8
seconds all at once.
3. One hundred dives (1MB of samples) would take 237 seconds (180 seconds
of overhead and 57 seconds of data transfer) versus 59 seconds all at once.

That's a lot of overhead: 3 to 4 times longer for dive-by-dive download. If
we ignore that 800ms, the times are 16 vs 7 seconds, or 157 vs 60; it's
still double.
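For reference, here's the back-of-the-envelope model I used for those numbers (the constants are the rough figures above, nothing measured from the API):

```c
#include <stddef.h>

/* ~800ms delay + ~500ms average heartbeat wait + ~450ms processing,
 * rounded up a bit as described above. */
#define REQUEST_OVERHEAD_MS 1800
#define MS_PER_4K 250

/* Estimated wall-clock time (ms) to fetch `bytes` of sample data
 * split across `requests` requests. */
static unsigned long estimate_ms(size_t bytes, unsigned requests)
{
    unsigned long chunks = (unsigned long)((bytes + 4095) / 4096);
    return (unsigned long)requests * REQUEST_OVERHEAD_MS + chunks * MS_PER_4K;
}
```

For scenario 2: `estimate_ms(102400, 10)` gives 24250ms versus `estimate_ms(102400, 1)` at 8050ms.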

>>> Rather than assuming a value that might be too large, I would start with a
>>> relatively small value and increase whenever there is evidence that there
>>> is
>>> indeed more memory. Since the last memory area contains the profile
>>> ringbuffer, this evidence will most likely be a ringbuffer pointer that
>>> points past our assumed end. That's something that's very easy to check.
>>>
>>>
>> Certainly it's easier if we know where memory starts and ends and I may be
>> close to figuring that out. I've programmed this with a few ideals in
>> mind.
>> The first is to not dump a dive just because we aren't absolutely sure of
>> the data. I argue this because I assume most end users of libdivecomputer
>> (e.g. subsurface users) are not relying on the data to validate diving
>> algorithms, they want to see a graphical representation of their dive. If
>> they are like me they want to see most of their dive even if the last few
>> samples are wrong. This explains why I'd like to see an attempt to
>> retrieve
>> corrupt dives and dives that wrap the sample buffer. In my experience it
>> shows useful data. I'll get to my other ideals later.
>>
>
> We're not dropping any dives, only informing the caller that we did
> encounter data that we (currently) can't handle correctly. That's something
> completely different!
>
>
Dropping dive profile data is what I meant. In all the cases I've seen, the
data is entirely recoverable; we simply don't stop early enough, and we see
20 seconds of bad data at the end of a dive. The user can delete or ignore
the dive if they want.

> Relying on guessing is always error-prone and unreliable. Just one wrong
> sample is enough to render the recovery of a corrupted dive completely
> useless. For example if there is one depth sample completely out of range,
> then the application will scale the profile accordingly. The result is that
> the good part of the profile will be squeezed completely, and you won't be
> able to see anything useful at all. And things can get much worse too (see
> examples below).
>

These are educated guesses based on empirical data. I have a dozen or so
corrupt dives that recover reasonably well. I'm working from presumption,
not assumption.

The profiles I've recovered look acceptable in Subsurface, and I like that I
can see my dive profile and know that the spike at the end can be ignored.


>  [...]
>>
>> I can understand the point of view that we should fail when we are unsure
>> of data and then expect the end user to file a bug report. This would
>> likely get earlier reports of errors but in the interim many new users
>> might be turned away when the software fails to work for them.
>>
>
> If you ask me, silently ignoring errors is an order of magnitude worse
> than gracefully failing with an error. Remember that if we ignore the error
> and try to proceed anyway, there is absolutely no guarantee that the result
> is correct. If you're lucky then the result is completely bogus and the
> user will notice immediately (e.g. unrealistic profiles). Smaller mistakes
> may go unnoticed at first, but if you realize after a while it might be
> already too late for re-downloading. And last but not least there is also
> the possibility of crashing the application.
>

We are still talking about corrupt dives, aren't we? Technically speaking,
given that we don't have the full specifications of the dive computers, it's
really not guaranteed that any data we parse is going to be correct. I'd
say that corrupt data might have less of a guarantee but I've had very good
success recovering dive profiles.

Preventing a crash because of random data can be done if we are careful.


> I can give plenty examples of real life scenarios that I encountered
> during libdivecomputer development. Two examples: If a ringbuffer pointer
> is out of range, and you try to use it anyway with code that expects a
> pointer in a certain range, then you'll likely end up with a buffer
> overflow and crash the application. When parsing "random" data, you can
> easily end up with depths (or times) that are way beyond the normal range
> (e.g. order of several kilometres). Definitely a challenge for the
> decompression algorithm in the application. (Hint: I ran out of patience
> and just killed it.)
>

Absolutely, checking pointers is a good idea.
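Something as simple as this would do (a minimal sketch; the bounds and names are made up for illustration, not the real memory map):

```c
#include <stdint.h>

/* Hypothetical ringbuffer bounds for a Commander-style layout. */
#define RB_PROFILE_BEGIN 0x010000
#define RB_PROFILE_END   0x100000

/* Reject ringbuffer pointers outside [begin, end) before using them,
 * so corrupt values (e.g. 0xFFFFFFFF) can't cause out-of-range reads. */
static int ringbuffer_pointer_valid(uint32_t ptr)
{
    return ptr >= RB_PROFILE_BEGIN && ptr < RB_PROFILE_END;
}
```

A pointer that fails the check would then route the dive down the corrupt-dive path instead of being dereferenced.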

>
> Anyway, my experience is that being very strict pays off in the long run.
> For the end-user, it may be a bit annoying initially, but it's certainly
> not making things worse.
>
>
> Note that I have the impression we're mixing two unrelated issues in this
> discussion: (1) problems associated with the ringbuffer end, and (2)
> recovery of corrupt dives.
>
> For issue #1, this is all about finding the correct ringbuffer begin/end
> address. Once we have those values, the problem is solved and there is
> absolutely no need for any guessing. That's why I insist on being very
> strict there. It will help us to find the correct values faster.
>

No argument there. I completely agree. It's why I raised it as an issue.


>
> For issue #2 the situation is very different. Because we're dealing with
> corrupt data here, this will not be as easy as finding some magic values.
> Maybe it's indeed possible to recover some data without too much serious
> issues. In that case I'm fine doing that, but if not I lean towards
> sticking to the logbook header only and dropping the partial profile. But
> let's first see how far we get.
>

We are dealing with missing pointers, not corrupt data. All we need to do
is better locate the profile end pointer.


>
>  But remember that even if libdivecomputer itself doesn't use this
>>> information nor provides an api to parse it, there might be someone out
>>> there who is interested in the raw data. It's perfectly possible to
>>> build a
>>> cochran specific application that use libdivecomputer only for
>>> downloading
>>> the data, and does the parsing on its own. Probably not a very likely
>>> scenario, but not impossible. Btw, that's also the reason why
>>> libdivecomputer always provides access to the raw data!
>>>
>>> Note that if we now strip this info, and change our minds later one, then
>>> it might be too late. Changing the data format will break backwards
>>> compatibility.
>>>
>>
>> I agree it would be nice to provide this data via API. If we decide to
>> provide this data via the existing API it's going to mess up a lot of
>> existing applications. The inter-dive events may not be so bad, since
>> there
>> are usually only a few events, keep in mind some dives have a lot. Also I
>> don't know how to parse data from these events.
>>
>
> If we consider the inter-dive events to be part of the dive, and include
> it in our raw dive data, that doesn't mean the parser has to do anything
> with it. The parser can simply ignore it if it doesn't fit well into the
> api. But it's still there, in case an applications wants to look at the raw
> binary data.
>
> Just to make myself clear: For the inter-dive events I'm not sure whether
> we should include it or not. The libdivecomputer api is based around the
> concept of dives. Inter-dive events don't really fit well into that model.
> They are not really part of the dive that comes before or after it. We
> don't have any way to deliver non-dive data to the application either. So
> the only options I see at the moment, are to include it as a pre-dive data,
> or to discard it. Currently my preference is more towards the latter
> option, but at this point I want to leave all options open.
>

There are two types of inter-dive data. The first is inter-dive events,
which appear before most dives. These are things like off, on, configuration
changes, connecting to Analyst, altitude/temp changes, etc. These are
events, but they aren't dive events, and their time stamps are going to skew
the profile; that's why they shouldn't be included.

What I meant was that it would be nice to provide this data but it would
have to be done through an extended API.

>
>  The off-gassing data, if sent out through the samples_foreach function, is
>> going to be incorporated into the applications' dive profiles and appear as
>> flat trailing lines in the profile. Temperature might be the only
>> interesting sample, and barely so.
>>
>
> I doubt this is off-gassing data. It's an artifact of how dive computers
> determine the end of a dive. They typically consider a dive after spending
> X amount of time above a certain depth threshold Y. During this period they
> keep recording because you might descend again and resume the dive. When
> the dive is really over, some dive computers keep this tail intact, others
> rewind and discard it. It looks like the cochran has some sort of
> compromise: It leaves the tail intact, but also indicates at which point it
> considered the dive finished.
>
> I would definitely include this data. Discarding the tail should be done
> by the application, where it can be configured by the user. (I believe
> subsurface already support this.) Some users will want to see it, and
> others don't.
>

This is the second type of inter-dive data: samples.

I call it "off-gassing" because I was told by tech support that the
computer stays on calculating off-gassing after a dive. After seeing the
data in detail I can see that was misstated.

Plotted, the data is a flat line at the surface with temperature changes,
lasting 10 to 20 minutes, which would plot as a significantly large flat
tail at the end of most dives.


>  Already other interesting data from the Cochran data stream is not
>> provided
>> through the API. Data like deepest stop time and tissue compartment loads
>> are not accessible. To make off-gassing samples meaningful we should also
>> present the in-dive TC loads and then we should indicate to the
>> application
>> that this is inter-dive data.
>>
>
> The libdivecomputer api supports deco/ndl info, but not tissue loadings.
> The latter are too device specific. To be able to interpret tissue
> loadings, you need to know the decompression algorithm and all its
> parameter. That's info we simply don't have (and likely never will).
> Implementing a decompression algorithm in the application makes a lot more
> sense.
>
>
>  The only explanation that I can come up with is that the dive computer
>>> started writing dive #95. Then it failed before the dive was finished
>>> (e.g.
>>> battery empty, reboot or whatever) and the end pointer didn't get written
>>> correctly. Hence the ffffffff value. But the end-of-profile pointer is
>>> probably only updated after the dive is finished. Hence the pre-dive
>>> pointer of the next dive retains the same value. I suspect that the
>>> current
>>> write position is somehow preserved and thus the next dive will start
>>> from
>>> there again. But this is of course just guessing.
>>>
>> That's what I think too. Given that flash can be slow and wears with
>> writes, I expect it's logical to eliminate unnecessary writes. Based on what
>> I've seen the Cochran code writes the first half of the log when the dive
>> starts (e.g. deeper than 6ft). There are two end-points to a dive. The
>> first is when the diver ascends to 3ft. I think at this point Cochran
>> updates the sample-end-pointer in config0. Cochran then continues logging
>> until the configured "Post dive surface interval minutes" (10 - 20
>> minutes)
>> elapses. This is why you see 1800 bytes (10 min * 60 seconds * 3
>> samples/sec) between dive-end-pointer and the pre-dive-pointer. At any
>> time
>> prior to this elapsing a submersion to below 6 ft continues the dive. I
>> think this is why these post-dive samples exist. After the elapsed
>> interval, the end-dive log is written and now the pre-dive-pointer is
>> known
>> and can be updated in config0.
>>
>> So although the pre-dive pointer is bogus on the dive after a corrupt
>> dive,
>> the dive sample data of the corrupt dive is intact but determining the end
>> has been difficult. Now that I've thought through the Cochran logic, I
>> should go over my logic to see if I can better determine the end of a
>> corrupt dive.
>>
>
> Makes sense.
>
> Excellent info about that 1800 (emc) or 1201 (commander) bytes block.
> Initially I assumed this was some fixed size trailer, and I had no
> explanation for the fact that on some dives it's a few bytes larger. But
> you are right. It's indeed the tail of the dive profile, and those few
> extra bytes are events. More pieces start to fall into place!
>
>  I understand why you're trying to salvage those corrupt dives (in the
>>> sense
>>> that having something is better than nothing at all), but I worry that
>>> this
>>> will complicate the code or produce bogus data (which is even worse).
>>> When
>>> I look at some of those salvaged dives, that seems to be the case
>>> already.
>>> They end abrupt, which is probably normal. But there appears to be some
>>> amount of "random" data (e.g. spikes in the profile) too. And that's not
>>> good. Maybe this is just a bug in the parsing that we can fix? I simply
>>> don't know at this point.
>>>
>>
>> The code isn't too complicated because of it. A corrupt dive is known
>> because of the FFFFFFFF dive-end-pointer. Knowing that we loop, exiting on
>> a memory boundary, when we see a sample that doesn't make sense (0xFF),
>> when we see normal end-dive events or normal start-dive events. If we
>> don't
>> find end-dive events and we stop because of start-dive events we may have
>> processed some inter-dive events as sample data and I think that's why we
>> see the spikes at the end. It's a few lines of code in the parse function
>> plus the guess function and a few if-thens. Since corrupt dives don't have
>> end-dive log data the code parses the samples for depth, temp and
>> duration,
>> etc.
>>
>> Now that I've walked through the cochran logic and why it's writing sample
>> pointers at certain times I might be able to do better guessing at
>> end-dives.
>>
>
> Well the code is indeed not really complicated, but it's also not without
> flaws. Those spikes at the end are examples of that. Fixing those issues
> might turn out harder than expected. That's my main concern.
>

Backtracking through inter-dive events should remove that spike and it
should be reasonably reliable. At worst it would still remove all events
and truncate a few seconds of profile. That's probably better than having a
spike.
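Roughly what I have in mind (a very rough sketch; the event classifier is a hypothetical placeholder, since the real inter-dive event encoding is more involved):

```c
#include <stddef.h>

/* Hypothetical classifier: returns nonzero if a byte looks like an
 * inter-dive event marker rather than a depth/temp sample. The real
 * test would decode the Cochran event bytes; this threshold is a
 * placeholder for illustration only. */
static int is_interdive_event(unsigned char byte)
{
    return byte >= 0xA0;
}

/* Trim trailing inter-dive events (mis-parsed as samples) from the end
 * of a recovered profile, returning the new length. */
static size_t trim_interdive_tail(const unsigned char *data, size_t size)
{
    while (size > 0 && is_interdive_event(data[size - 1]))
        size--;
    return size;
}
```

Worst case we backtrack a few samples too far and lose a couple of seconds of profile, which beats plotting the spike.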


>
> I suggest we first take care of the non-corrupt dives, and get the backend
> into a state where we can integrate it into libdivecomputer. Once we have
> that, we take a step back and try to come up with a solution to deal with
> those corrupt dives too. What do you think?
>

I've been working on this for 8 months now, and I'd like to see it
included. The changes you've asked for so far are quite intensive and are
going to require a significant amount of work. I'm going to keep the
corrupt profile recovery in mind since I'm so close on that too.


>  What does the Cochran application do with those corrupt dives? Does it
>>> show them? With or without a profile?
>>>
>>
>> Cochran refuses to display a corrupt dive.
>>
>
> So if we're not able to recover corrupt dives, we're at least not doing
> worse then Analyst. I think that's already a good starting point. If we
> manage to do better then great.
>

The Analyst software is pretty crappy Windows 3.1, non-standard-UI software
that almost always crashes before I can exit it. Doing better than Analyst
isn't going to be hard. I should also mention that behind that crappy,
crashy interface is quite a bit of functionality, but it doesn't feel well
programmed at all.


>  You also mention that the
>> Commander shows some erratic dives, although in the context of reclaimed
>> and overwritten sample data. There were a lot of Commander dives that
>> looked corrupt in Analyst. I was happy that libdivecomputer/subsurface
>> mirrored those corrupt dive profiles because it suggested the sample data
>> was corrupt or that we were at least as capable as Analyst.
>>
>
> Well, we're certainly not going to re-implement the bugs in the Analyst
> application :-)
>

I'm not suggesting we do, it's just the first bar to pass.

-- 
John Van Ostrand
At large on sabbatical