[PATCH] Cochran_commander refactoring

John Van Ostrand john at vanostrand.com
Mon Dec 22 12:58:49 PST 2014


On Mon, Dec 22, 2014 at 10:38 AM, Jef Driesen <jef at libdivecomputer.org>
wrote:

> On 08-12-14 00:50, John Van Ostrand wrote:
>
>> On Sun, Dec 7, 2014 at 3:14 PM, Jef Driesen <jef at libdivecomputer.org>
>> wrote:
>>
>>> With the exception of the id, config and misc packets, the protocol is
>>> mainly based on simple random access to the internal memory. There is no
>>> dedicated command to read the logbook and samples. They are simply stored
>>> in a certain area of the internal memory. So for the dump api it makes
>>> much
>>> more sense to implement as full memory dump, instead of reading just two
>>> "random" pieces of the memory. The result is that the dump would be much
>>> more generic, because you no longer need to interpret the data (in order
>>> to
>>> find the logbook and samples). This is important when you need the memory
>>> dump to diagnose bugs in the logic for downloading the dives. The fact
>>> that
>>> we don't seem to know the total amount of memory, might be a problem
>>> here.
>>>
>>
> What I meant here is that in the data that can be downloaded with the read
> command there are two main areas of interest: the logbook and sample
> ringbuffers. Right now, you read some part of each ringbuffer, and pack
> them into your memory dump structure. But because the start address and
> size depends on the current implementation, that's not very useful as a
> memory dump.
>
> I'll illustrate with an example. Assume for a moment that the current code
> finds valid sample data in the area 0x094000 to 0x096000. Later on we
> discover a bug in the code, and the end address should have been 0x098000.
> That means all memory dump that have been downloaded with the old code are
> now useless because they don't contain enough data. If the start address is
> wrong, it gets even worse.
>
>
That makes sense.


> A memory dump should be independent from whatever logic we use for
> locating the dives. If you simply read everything from address 0 to the
> end, then you we can keep using those old memory dumps. If a device has
> lots of memory, then yes, it may take a while to download, but that's fine.
> Downloading a memory dump is not the normal use-case. It's intended for
> troubleshooting, and then you do want all the data. (For reference,
> downloading a full memory on a Suunto Vyper takes 10-15 minutes, and that's
> only 8KB of data!)
>
> If we really don't know the total amount of memory, then we'll have to
> make an assumption. If that assumptions turns out to be wrong, we can
> always adjust it. But at least we always have all data up to a certain
> address.
>

I'll have to commit to knowing the end of memory then. Right now I've
avoided stating the end explicitly because I'm not sure where it is. If all
Cochran DCs return 0xFF when past memory I can be a little liberal and not
crash.


> Since the other pieces of data (id, misc and config) have a fixed size,
> the structure for a good memory dump could be as simple as concatenating
> all pieces:
>
> id       67 bytes
> misc     1500 bytes
> config   2x512 (emc) or 4x512 (commander) bytes
> memory   variable length, memory starting at address
>
> An alternative is to only include the memory part, and take care of the
> other three separately, by means of the vendor event. This is how it's done
> in the several other backends (e.g. oceanic and reefnet).
>

Considering I'm one of two users of this I'd like to have the extra data.
And considering the person doing the dump knows the model it should be
apparent how to parse the data so I don't need to make a generic format.


>  What I find interesting is that logbook data is at address 0 (for the read
>> command) and that the sample data read is far enough along to justify the
>> config pages and maybe some program data, although I have no idea how
>> large
>> the program for this could be. GIven the background of Mike Cochran I
>> suspect it's tight code.
>>
>> So for the EMC 16 and 20H the logbook is from 0 to 0x80000  (1024 entries)
>> then samples begine at 0x94000. I haven't looked to see what the 0x14000
>> bytes between the logbook and sample are, I suspect config and program
>> data, although I'm surprised that doesn't start a 0.
>>
>> Now for the EMC 20H the quoted maximum sample data is 1450 hours. 3 byte
>> samples * 60 seconds * 60 minutes * 1450 hours is 0xEEF3E0, or close to
>> 16MB, when added to the 0x94000 base. It's short but I suspect that 1450
>> is
>> a rounded number.
>>
>> To add to this, EMC models are sold with different amounts of memory
>> configured. Word from the dealer is that it's configurable via software.
>> That means each 20H is shipped with the same hardware configuration and
>> that something in one of the other data blocks indicates how much memory
>> is
>> enabled. I've been unable to find where that might be and I'm worried it's
>> stored as a feature code instead of simply the amount of memory.
>>
>
> My advice is to start with a very simple assumption. If your assumption is
> wrong, then someday someone will be affected, and report it. At that point
> you can fix your assumption.
>
> Let's look at logbook area. It seems the ringbuffer is located at address
> 0 to 0x80000 (1024 entries of 512 bytes) for the emc and address 0 to
> 0x20000 (512 entries of 256 bytes) for the commander . We also have the
> number of dives from the config block. If it's less than 1024/512, then we
> know the ringbuffer isn't full yet. A reasonable assumption is that it will
> be at offset 0. If there are more, then all entries will be valid and we
> can find the oldest one by means of the internal dive number. I think those
> are very reasonable assumptions. In fact that's pretty much the description
> of the OSTC3 algorithm.
>
> Now, if a device is limited by its firmware to less than 1024/512 logbook
> entries, then the above will work fine until someone exceeds this
> artificial limit. At that point our algorithm will fail, and we'll get a
> bugreport. But the good news is that we now also have real data to confirm
> and fix the problem.
>
> For the sample area, we know the start address of the ringbuffer, but not
> the end. What we could do is assume the ringbuffer is unlimited in size
> (e.g. address 0xFFFFFFFF, which is the maximum we can support with a 32bit
> integer) and fail when the end address is before the start address.
> Normally that indicates that we reached the end of the ringbuffer. Again,
> we now have access to data where we can figure out the exact ringbuffer end.
>
>
> The idea is that you always start with a very simple assumptions, and
> improve them if they turn out to be wrong. That's in a nutshell how things
> are usually done in libdivecomputer. When you make an (educated) guess,
> without any hard evidence, then you have a very high chance that your guess
> is still wrong. But the problem is that the resulting bugs will be a lot
> more subtle and harder to spot and fix.
>
> For example if a ringbuffer end address is wrong, then you'll either miss
> a few bytes (address too small) or have some garbage bytes inserted
> (address too large). A few missing samples can easily go unnoticed.
>
>  The next thing where I have some concerns, is the logic how you find the
>>> logbook and samples. It seems we only know the start address of the
>>> logbook
>>> and sample ringbuffers (respectively 0 and 0x094000 for the emc), but not
>>> the end address. And that results in a few issues:
>>>
>>
>> We don't really know the start. I just worked with my first EMC-14 and its
>> samples start at 0x20000, mostly because it supports only 256 logs.
>>
>
> If the first dive starts at that address, then it's likely the start
> address of the ringbuffer. Of course it's not a guarantee, but nevertheless
> a good assumption. Let's say it's correct until proven wrong :-)
>
>  For the logbook, you assume the first logbook entry is at the address 0,
>>> and then move forward until you reach the total number of dives. But in
>>> practice, the number of dives that can be stored is not unlimited, and
>>> the
>>> memory is most likely used in a circular way. Thus, once you have enough
>>> dives, this assumption will no longer be correct. Ideally we need to
>>> figure
>>> out the end of the ringbuffer, so we can take this into account
>>> correctly.
>>> If we are unable to find this address, then we need at least some checks
>>> to
>>> verify our assumptions (for example check that the dive number starts at
>>> zero and always increases, check that we don't go past the sample area,
>>> etc). If you ask me, returning an error is always better than silently
>>> returning some wrong data.
>>>
>>>
>> The number of log book entries is also configured via software. In other
>> words the DC can be configured to ignore part of the logbook ringbuffer.
>> It
>> might be a head/tail pointer or something.
>>
>> You can see that the number of dives is available in the config data
>> block.
>> I suspect that a starting dive number is also present  but since it's set
>> at 00 00 now it can't be identified. Also because of the insane number of
>> dives this computer will log it would take me 8 years to wrap mine.
>>
>> Checking dive numbers will actually work for most divers. I suspect that
>> all but a few will never wrap the ring buffer before changing DCs.
>>
>
> See my explanation above. Have a look at the hw_ostc3_device_foreach
> function.
>
> I believe I have found something else that will be of interest here.
>
> At offset 0x142 in config0 there is an "end-of-profile" pointer. In the
> data I got from you it contains 0x00191467. That nicely corresponds to the
> end pointer (0x00190d5f) of the last logbook entry (#114), plus the 1800
> byte "trailer". This is "end-of-profile" pointer exist in most data formats
> and is typically used internally by the dive computer as the position to
> start writing the sample data for the next dive.
>
> The next 4 bytes contains the value 0xE400. This is the address where the
> next logbook entry will be written (114 * 512). So this is probably the
> "end-of-logbook" pointer.
>
> With this information, the ringbuffer starts to look very similar to what
> I've seen in other data formats.
>

Good catch. It looks like I'm wrong about the start address of the EMC14
samples, it looks like 0x22000 based on the empty DC. Now to find that on
the Commander.


>  A similar problem exist when reading the samples. You have some code that
>>> tries to guess where the ringbuffer ends. But there is absolutely no
>>> guarantee that those assumptions are correct. I think it's much safer to
>>> not guess at all and just fail if the ringbuffer wrap point is crossed
>>> (e.g. begin address is higher than the end address). Once someone hits
>>> this
>>> error, we can investigate and determine the correct ringbuffer end
>>> address.
>>> If we're guessing, and get it wrong, then we're silently returning bogus
>>> data. That's not good. And the chance that your guess is wrong is pretty
>>> high. Although I agree that the total memory is likely nicely aligned,
>>> the
>>> sample ringbuffer can end anywhere. I've seen that more than once in
>>> other
>>> backends.
>>>
>>>
>> It's more likely that a ring buffer will wrap than the logbook. I have a
>> Commander that has already wrapped (I bought it used) and so I have a
>> working example. From USB traffic I could see the addresses used by
>> Analyst
>> and it's a nice clean alignment. I have no evidence on an EMC but it seems
>> safe to assume it's a minor upgrade on the Commander software.
>>
>
> So looks like you already have some real data to figure out the ringbuffer
> end for the commander. Great.
>
> Note that if the requested addresses are nicely aligned, then there is no
> guarantee that the ringbuffer end address is also nicely aligned. The
> application might read more data than it needs, there might be data that is
> not part of the ringbuffer but still used for something else, etc.
>

I haven't seen evidence of that on the Commander. The data looks like it
runs to the end.


>
>  [One method that I have used successfully in the past to locate the
>>> ringbuffer end, without having access to data that crosses it, is by
>>> looking at the full memory dump. Usually unused data in flash memory
>>> contains all 0xFF. So if there is another block located right after the
>>> ringbuffer area, then the unused tail of the ringbuffer will contain
>>> 0xFF,
>>> followed by some non 0xFF data from that other block. Thus the first
>>> address containing non 0xFF bytes is very likely the end of the
>>> ringbuffer.]
>>>
>>
>> There are three problems with that. First, for the EMC 20H, that's 16 MB
>> of
>> data and even at 800K baud it's going to be a while to download. Probing
>> might make more sense. Reading 1K at certain points as a test.
>>
>
> A long download might be annoying, but it's not a big deal either. It's
> not something you'll need to do every day.
>
>  Second, since the sample ring buffer size is software configurable the end
>> may not be apparent. I expect the end of the buffer is truncated on a DC
>> configured with less memory. That means that the technique of probing will
>> probably return a larger size than it should.
>>
>
> That might be true, but we can't write code to handle things we don't know
> yet. Only for the things we do know.
>
>  Third, I suspect that the end of sample memory is the end of physical
>> memory. It's a guess based on the number of quoted number of samples
>> supported and the starting address of samples. I don't know what will
>> happen if we read beyond physical memory.
>>
>
> Without actually trying, I have no idea what will happen when you read
> past the available memory. The read request may fail, the device could
> automatically wrap around to some other address (and not necessary to
> zero), etc. All possible outcomes that I've seen in the past.
>
>
>  The next big question is where does the sample data begins/ends? So far
>>> you have identified three pointers: pre, begin and end address. When I
>>> dump
>>> those pointers (see below), I noticed something interesting:
>>>
>>>   * The 5th column is the difference between begin and pre. According to
>>> your documentation, these are inter dive events. They can be zero length
>>> too. Should we consider those as part of the dive, or ignore them? Since
>>> I
>>> don't know what's in those interdive events, I have no idea.
>>>
>>>   * The 6th column is the difference between pre and end of the previous
>>> entry. This is nearly always equal to 1800. I doubt that's a
>>> coincidence. I
>>> guess that means there is some 1800 byte header or trailer. Do you have
>>> any
>>> idea what's in there?
>>>
>>> * When the end pointer is ffffffff, the next entry has the same pre
>>> pointer. Because it makes no sense that the sample data for two different
>>> dives start at the same address, I assume this means that for some reason
>>> (e.g. battery empty, too short duration, etc) this dive has no sample
>>> data.
>>> What you did in this case is take the data until the start of the next
>>> dive. But if I'm right, that's the inter dive events. So this makes no
>>> sense to me. Do you get anything useful from doing this?
>>>
>>>
>>>  Pre-dive events consist of the following (and more) according to
>> documentation.
>>
>> - Initialization of the unit.
>> - The unit is turned on
>> - Low batteries
>> - Altitude Changes of over 500 feet
>> - Temperature Changes of 10 degrees
>> - Sensor Malfunction
>> - Analyst ® P.C. Communication
>>
>> Interdive events like the above are multiple bytes in length, up to 20
>> bytes. The DC has no on or off switch. It turns on under certain
>> conditions
>> and off after a certain amount of off-gassing after a dive. So many dives
>> will have at least an On event occur before dives. The first dive has a
>> lot
>> of events because I was configuring the device via Analyst and each config
>> change seems to be logged.
>>
>> The DC also stays on after a dive and logs off-gassing. I'd bet that's the
>> 1800 bytes of logs after. Although I assumed it was longer. Off gassing
>> samples are recognizable by the depth change and ascent rate samples being
>> 0x40 and 0x80 respectively (these decode to negative zero.)
>>
>
> Do you think we should consider the pre-dive event as part of the dive
> data? Based on your description, these pre-dive events do not look really
> interesting from a dive logbook point of view. I have no idea how the
> parser should deal with it, other than just ignoring them.
>

If the data were readily available I'd be interested in some of the data,
like altitude and temp changes while travelling. However I've looked at
some of the raw pre-dive data and I can't parse the event data. All that
said I think this data is a passing interest.

I took my cue from Analyst. It doesn't show the pre-dive events on dive
profiles nor does it show the off-gassing data. It has an option to display
"inter-dive" events and does so similar to a dive. I've not been interested
in the data, except to find out what it was. I've been thinking of
libdivecomputer in the context of Subsurface which has no mechanism to show
inter-dive events or data. I think libdivecomputer would need some other
mechanism to present this data and if only the Cochran does this it may not
make sense to spend the time.


>
>  A pointer of FFFFFFFF means the end of dive was never logged. I had a
>> problem with my DC where it would reboot not logging the end of dive. The
>> code tries to salvage the dive logs.
>>
>
> Can you recover some of the samples? Based on what I've seen so far (see
> my long table) it makes no sense to try to recover the samples, because it
> looks like at least part of the data you are recovering are the pre-dive
> events of another dive. That certainly doesn't look right.
>
> Look for example at dive #79:
>
> 78 0014b8f3 0014b94a 0014ef9d    87 1800
> 79 0014f6a5 0014f70d ffffffff   104 1800
> 80 0014f6a5 0015043c 00151442  3479 1373862
>
> Your code to guess the end pointer will replace ffffffff with 0015043c,
> which is the start of the next dive (#80). So for dive #79 you take the
> range 0014f70d to 0015043c as the sample data. But that memory range is
> also the last part of the pre-dive events for dive #80. This overlap
> doesn't make any sense to me.
>

You can see several of those starting about there. The battery was low and
the DC was starting to exhibit the crash problem so I was intentionally
invoking it about that time so I could be confident in describing the
problem.

Take a look at

 95 001627c7 0016447b ffffffff 7348 1451976
 96 001627c7 0016463c 001661b3 7797 1451976

That 7348 byte of samples on dive 95 was a 40 minute dive in Tobermory,
Canada followed by another about that length. Both imported mostly
correctly. I'd rather have that flawed dive profile than nothing.

I'm not sure what dive 79 was, but I figured the user would rather see
corrupt dive imported than none so s/he could make a choice.

 I also tried to parse the test data I received from you, but I can't get
>>> anything useful out of it. That might be due to the fact that I don't
>>> know
>>> at which address the sample data was read when you created that data
>>> (that
>>> might have changed in your latest code). Btw, that's exactly why I
>>> prefer a
>>> more generic dump format as explained above. Can you provide some new
>>> data
>>> for me, or the correct address? I also tried to run the parser under
>>> valgrind, and it reports a *huge* number of problems. That might be due
>>> to
>>> this mismatch in data, but the parser should be robust against whatever
>>> you
>>> throw at it. The rule of thumb is never trust the input data, and always
>>> check everything before you use it. I didn't investigate further,
>>> because I
>>> first want to confirm with good input :-)
>>>
>>>
>> As you can see from the source, the reads are based on data from the
>> config
>> blocks and from the logbook. Config tells me how many logs to read.
>> Reading
>> the logbook for pointers results in knowing which memory addresses to
>> read.
>>
>> For an EMC 20H the sample data starts at 0x94000. It's been a while since
>> I
>> checked that the code is actually reading from that point, but it does
>> round down 16k.
>>
>> I've not tried to read the data, like you have. My simulator used earlier
>> files with names that included their memory addresses. From a glance the
>> data looks good.
>>
>> But let's go through an example.
>>
>> [...]
>>
>
> Ignore my question. I think it was a mistake on my side. Instead of modify
> the code to read the data from files, I extended the libdivecomputer
> simulator with the cochran protocol. It produces reasonable looking
> profiles now, so I assume there was some bug in previous hack to load the
> data from files.
>
>  I've been hesitant in reading the entire memory of the device. I may try
>> with my used device, which is a Commander.
>>
>> I've been holding off asking Cochran again for help. They previously
>> refused citing concern that open source code would reveal too much about
>> their algorithm. Now that the code exists I wonder if they would be
>> willing
>> to help by telling me how to determine model and memory configuration, or
>> if they will demand that I stop.. It's also the same reason I've been
>> trying to respect their privacy by not reading anything but my data.
>>
>
> That's their usual argument. The truth is that I've never encountered a
> device where you can download anything worth protecting (e.g. firmware
> code, decompression algorithm). All you can access is the memory containing
> the dives.
>
> Jef
>

Have you had any success in changing their minds? Do you have any good
examples of DC manufacturers that have been open that I could use as
examples when talking to Cochran next?


-- 
John Van Ostrand
At large on sabbatical
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://libdivecomputer.org/pipermail/devel/attachments/20141222/1d184389/attachment-0001.html>


More information about the devel mailing list