[deprecated🐸] 2022-24 BPS Debugging Notes

Note: These issues have already been resolved on https://github.com/lhr-solar/BPS/tree/CAN_UPDATE_DRIVERS, so the information in this doc may not reflect the current state of the BPS.


While preparing for the 2022-24 solar competition, we found a bug in the BPS that causes some voltage measurements to be read incorrectly via the ADC on the LTC6811s. Notes on the issue and the steps taken to debug it are written here.

Code: https://github.com/lhr-solar/BPS

Minion Board: https://github.com/lhr-solar/BPS-MinionPCB

Setup Steps: [wip] 2022-24 BPS Setup Steps

Debugging/Flashing: Flashing and debugging code on an MCU

LTC6811 Datasheet: https://www.analog.com/media/en/technical-documentation/data-sheets/ltc6811-1-6811-2.pdf

Issue 1: OS Error Assertion after Scheduler Unlock

We realized the BPS was reaching EnterFaultState() pretty early on. The backtrace in GDB showed that this happened whenever RTOS_BPS_DelayUs was called. The code below shows where we were entering the fault state.

void RTOS_BPS_DelayUs(uint32_t delay_us) {
    BPS_OS_ERR err;
    
    // lock the scheduler
    OSSchedLock(&err);
    assertOSError(err);

    // delay
    BSP_Timer_Start_OneShot(delay_us, DelayUsCallback);
    while (!DelayUsContinue);
    DelayUsContinue = 0;

    // unlock the scheduler
    OSSchedUnlock(&err);
    assertOSError(err); // FAULT STATE ENTERED HERE
}

This was entering the fault state because the value of err was not OS_ERR_NONE. Instead, the value was OS_ERR_SCHED_LOCKED, which Micrium OS-III returns from OSSchedUnlock when the scheduler is still locked after the call, i.e. when it had already been locked before RTOS_BPS_DelayUs took its own lock. Here Micrium is using its error mechanism to give us information about the scheduler state rather than to report an actual fault case. Since this should not be considered a fault, simply adding a catch for it as follows was enough to keep the BPS from faulting.

if(err != OS_ERR_SCHED_LOCKED) assertOSError(err);
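
To illustrate why an unlock can report "still locked" without anything being wrong, here is a minimal mock of a nesting lock counter. It is only a sketch, not the Micrium or BPS source: mock_err_t, mock_SchedLock, mock_SchedUnlock, is_fault_after_unlock, and delay_shape_faults are all hypothetical names, and the mock assumes the unlock reports MOCK_ERR_SCHED_LOCKED whenever an outer lock is still held.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the RTOS error codes (not the real Micrium values). */
typedef enum {
    MOCK_ERR_NONE = 0,
    MOCK_ERR_SCHED_LOCKED,      /* scheduler is still locked after an unlock */
    MOCK_ERR_SCHED_NOT_LOCKED   /* unlock was called with no lock held */
} mock_err_t;

static uint8_t lock_nesting = 0;   /* number of locks currently held */

void mock_SchedLock(mock_err_t *err) {
    lock_nesting++;
    *err = MOCK_ERR_NONE;
}

void mock_SchedUnlock(mock_err_t *err) {
    if (lock_nesting == 0) {
        *err = MOCK_ERR_SCHED_NOT_LOCKED;
        return;
    }
    lock_nesting--;
    /* An outer lock is still held: report it, but this is not a failure. */
    *err = (lock_nesting > 0) ? MOCK_ERR_SCHED_LOCKED : MOCK_ERR_NONE;
}

/* Returns 1 if the error should be treated as a fault, 0 otherwise. */
int is_fault_after_unlock(mock_err_t err) {
    return (err != MOCK_ERR_NONE) && (err != MOCK_ERR_SCHED_LOCKED);
}

/* Same lock/unlock shape as the delay function, minus the timer wait. */
int delay_shape_faults(void) {
    mock_err_t err;
    mock_SchedLock(&err);
    if (err != MOCK_ERR_NONE) return 1;
    /* ... the busy-wait on the one-shot timer would go here ... */
    mock_SchedUnlock(&err);
    return is_fault_after_unlock(err);
}
```

In this mock, calling delay_shape_faults() while the caller already holds a lock makes the inner unlock report MOCK_ERR_SCHED_LOCKED, and the filter treats that as informational rather than as a fault.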


Issue 2: Different Cell Configuration

Each minion module has the capacity to read from 12 voltage sensors, for a total of 36 voltage readings. However, we do not use all 36 readings.

The original code reads the voltages from the three modules into a length-32 array, expecting the configuration to be 12 voltage taps on the first module, 12 on the second, and 8 on the third. The configuration has since changed to 11 voltage taps on the first module, 10 on the second, and 11 on the third. This can be fixed by simply adding an offset to the code such that in place of Module 1, Voltage Tap 12 there is now Module 2, Voltage Tap 1 (and similarly for the other "skipped" values).

Original code:

// package raw voltage values into single array
for(uint8_t i = 0; i < NUM_BATTERY_MODULES; i++){
	rawVoltages[i] = Minions[i / MAX_VOLT_SENSORS_PER_MINION_BOARD].cells.c_codes[i % MAX_VOLT_SENSORS_PER_MINION_BOARD];
}

New code:

// package raw voltage values into single array
uint8_t offset = 0;
for(uint8_t i = 0; i < NUM_BATTERY_MODULES; i++){
    uint8_t minionIdx = i / MAX_VOLT_SENSORS_PER_MINION_BOARD;
    uint8_t sensorIdx = i % MAX_VOLT_SENSORS_PER_MINION_BOARD;

    if((minionIdx == 0 && sensorIdx == 11)
      || (minionIdx == 1 && sensorIdx >= 10)) offset++;
    rawVoltages[i - offset] = Minions[minionIdx].cells.c_codes[sensorIdx];
}

The sensors skipped are Minion 1: Sensor 12, Minion 2: Sensor 11, and Minion 2: Sensor 12. Since arrays are 0-indexed, these appear in the code as minionIdx 0 with sensorIdx 11, and minionIdx 1 with sensorIdx 10 and 11.

Note that this is a bit of a hacky fix. Ideally the code would be slightly restructured to allow for variable amounts of voltage sensors per minion board, and the macro values in config.h would be the only things needing to be edited, but this works for the time being.
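
One way the restructuring described above could look is a per-minion table of connected taps, so only the table changes when the wiring does. This is a sketch under assumptions, not the BPS code: NUM_MINIONS, MAX_TAPS_PER_MINION, tapsPerMinion, minion_cells_t, and packVoltages are hypothetical names, and the 11/10/11 counts are taken from the configuration described earlier.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_MINIONS           3
#define MAX_TAPS_PER_MINION  12

/* Assumed number of connected voltage taps on each minion board (11/10/11). */
static const uint8_t tapsPerMinion[NUM_MINIONS] = {11, 10, 11};

/* Hypothetical stand-in for the per-minion cell codes held by the driver. */
typedef struct {
    uint16_t c_codes[MAX_TAPS_PER_MINION];
} minion_cells_t;

/* Copies only the connected taps of each minion into out, in order.
 * Returns the number of values written, or 0 if out is too small. */
size_t packVoltages(const minion_cells_t minions[NUM_MINIONS],
                    uint16_t *out, size_t outLen) {
    size_t n = 0;
    for (size_t m = 0; m < NUM_MINIONS; m++) {
        for (size_t s = 0; s < tapsPerMinion[m]; s++) {
            if (n >= outLen) return 0;
            out[n++] = minions[m].c_codes[s];
        }
    }
    return n;
}
```

The design choice here is to iterate over the connected taps directly instead of walking a flat index and subtracting an offset. Traced by hand, and assuming NUM_BATTERY_MODULES is 32 and MAX_VOLT_SENSORS_PER_MINION_BOARD is 12, the offset loop appears to store each skipped tap on top of the previous array entry and to stop before the last three connected inputs of the third minion, so that may be worth double-checking against the readings.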

Issue 3: Invalid Readings

Here is a printout of the Voltages array (all thirty-two sensor readings).

{35633, 35628, 35617, 35621, 35619, 
  35620, 35626, 35622, 35611, 36811, 6, 
  36137, 35485, 35063, 34996, 34965, 34975, 
  36956, 35456, 35040, 0, 35617, 35628, 
  35616, 35628, 35624, 35628, 35622, 35634, 
  0, 28571, 2048}

Here it is split up by module:

Module 1: 35633, 35628, 35617, 35621, 35619, 35620, 35626, 35622, 35611, 36811, 6

Module 2: 36137, 35485, 35063, 34996, 34965, 34975, 36956, 35456, 35040, 0

Module 3: 35617, 35628, 35616, 35628, 35624, 35628, 35622, 35634, 0, 28571, 2048

The problem sensors are Sensor 11 of Module 1 (reading 6), Sensor 10 of Module 2 (reading 0), and Sensors 9/10/11 of Module 3 (readings 0, 28571, and 2048). What really bugs me about this issue is the fact that it's not simply just the last voltage sensor on each module that's faulting, but also the last three sensors on Module 3.
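
The raw values are LTC6811 ADC codes at 100 µV per count, so 35633 corresponds to about 3.5633 V. A quick way to pick suspect readings out of a dump is to convert each code and compare it against a plausibility window. The sketch below assumes a 3.0 V to 4.2 V window purely for illustration, and the names codeToMicrovolts, findSuspectReadings, CELL_MIN_UV, and CELL_MAX_UV are hypothetical, not part of the BPS code.

```c
#include <stdint.h>
#include <stddef.h>

#define LTC6811_UV_PER_COUNT 100u   /* LTC6811 cell ADC resolution: 100 uV per count */

/* Assumed plausibility window for one cell; adjust to the pack's real limits. */
#define CELL_MIN_UV 3000000u
#define CELL_MAX_UV 4200000u

/* Converts a raw cell code to microvolts. */
uint32_t codeToMicrovolts(uint16_t code) {
    return (uint32_t)code * LTC6811_UV_PER_COUNT;
}

/* Stores the indices of readings outside the window into bad[].
 * Returns how many were found (at most badLen of them are stored). */
size_t findSuspectReadings(const uint16_t *codes, size_t n,
                           size_t *bad, size_t badLen) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t uv = codeToMicrovolts(codes[i]);
        if (uv < CELL_MIN_UV || uv > CELL_MAX_UV) {
            if (count < badLen) bad[count] = i;
            count++;
        }
    }
    return count;
}
```

Run on the 32-entry printout above with this window, it flags array indices 10, 20, 29, 30, and 31, which are the same five sensors listed as problems.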

Possible Issue Idea 1:

One possible issue is that the LTC isn't getting enough power to properly operate on the last few cells. Here's the diagram of our input to the LTC:

And here's a picture of our missing voltage inputs for each of the connectors:

On Module 1, we leave pins 7/6 of the top connector floating. This means our last voltage input is on Pin 5.

On Module 2, we leave pins 7/6/5 of the top connector floating. This means our last voltage input is on Pin 4.

On Module 3, we leave pins 7/6 of the top connector floating. This means our last voltage input is on Pin 5.

First, I'm probing some voltages on Module 1. Probing V+ to V- gives 38.2V, while probing Pin 5 to V- gives 38.8V. The LTC6811 datasheet clearly states:

The LTC6811 is powered via two pins: V+ and VREG. The V+ input requires voltage greater than or equal to the top cell voltage minus 0.3V, and it provides power to the high voltage elements of the core circuitry.

Our current setup does not satisfy this: with the top cell at 38.8V, V+ needs to be at least 38.8V - 0.3V = 38.5V, but it is only 38.2V. Probing Pin 6/7 to V- also gives 38.2V, which suggests a 0.6V drop between Pin 5 and Pin 6 (probing across the pins also seems to confirm this). So in theory the voltage input (V+) should be at least 0.3V higher than it is. I don't know how much of an effect this would have on the operation of the chip, but given that the problem always occurs on the last sensor reading of each LTC, it may be related.
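
The headroom check can be written down as a small helper, working in millivolts to avoid floating-point comparisons. This is just a sketch of the arithmetic from the datasheet quote; vplusShortfall_mV and VPLUS_HEADROOM_MV are hypothetical names, and both inputs are assumed to be measured relative to V-.

```c
#include <stdint.h>

#define VPLUS_HEADROOM_MV 300   /* datasheet: V+ >= top cell voltage - 0.3 V */

/* Returns how many millivolts V+ is below the required minimum,
 * or 0 if V+ already meets the requirement. */
int32_t vplusShortfall_mV(int32_t vplus_mV, int32_t topCell_mV) {
    int32_t required_mV = topCell_mV - VPLUS_HEADROOM_MV;
    return (vplus_mV < required_mV) ? (required_mV - vplus_mV) : 0;
}
```

With the probed values for Module 1 (V+ at 38200 mV, Pin 5 at 38800 mV), the helper returns 300, matching the 0.3V shortfall worked out by hand.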

A solution to this problem could be to short Pin 6/7 with Pin 5 on Modules 1 and 3, and short Pin 5/6/7 with Pin 4 on Module 2. Basically we'd make sure that V+ receives exactly the top cell voltage. This option would have us shorting the pins externally to the minion module.

*typo: 42V should read 38.8V

An alternative solution: The datasheet suggests that we short the inputs to the LTC after the resistor, like so:

As you can see, in the 11-cell configuration (on the left) C11 is directly shorted to C12. For us, this would mean soldering a wire directly onto the minion module between C11 and C12.

In my opinion there wouldn't be much of a difference between these two options, but I'm on the side of just following what the datasheet says to do since there may be some weird current flow stuff if we just connect the nodes externally. Plus it wouldn't be too difficult to solder a trace on if this is actually the issue.


Issue 3.1: Last Value in Minion is Invalid

PROBLEM CHILDREN 3: 3/21/24

Voltages = {
35629, 35625, 35612, 35620, 35617, 35617, 35621, 35618, 35608, 36027, 914, 
35600, 35597, 35594, 35606, 35599, 35606, 35612, 35609, 35594, 31586, 
35617, 35624, 35615, 35627, 35623, 35627, 35622, 35635, 35611, 35629, 13320}

$3 = {
35630, 35624, 35615, 35621, 35617, 35615, 35624, 35619, 35610, 35991, 956, 
35602, 35594, 35594, 35602, 35591, 35603, 35608, 35612, 35594, 31582, 
35616, 35627, 35616, 35627, 35623, 35627, 35620, 35633, 35610, 35630, 13345}

$4 = {
35631, 35625, 35614, 35620, 35618, 35618, 35624, 35620, 35610, 35953, 1042, 
35599, 35597, 35596, 35604, 35598, 35607, 35611, 35607, 35591, 31584, 
35616, 35624, 35615, 35628, 35624, 35627, 35620, 35634, 35611, 35629, 13324}

These are 3 readings of the Voltage array after running it.

The last module in each minion is wrong. 

The reading for the 11th module in the segment changes much more between runs (914, 956, 1042) than the readings for the other modules, most of which move by only a few counts.
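
To put a number on the run-to-run variation, the spread (max minus min) of each array entry across several dumps can be computed. This is a minimal sketch; readingSpread is a hypothetical helper, and the dumps are assumed to be stored back to back in one flat array.

```c
#include <stdint.h>
#include <stddef.h>

/* For each of the n entries, stores (max - min) across numRuns dumps in spread[].
 * runs holds the dumps back to back: runs[r * n + i] is entry i of run r. */
void readingSpread(const uint16_t *runs, size_t numRuns, size_t n,
                   uint16_t *spread) {
    for (size_t i = 0; i < n; i++) {
        uint16_t lo = UINT16_MAX;
        uint16_t hi = 0;
        for (size_t r = 0; r < numRuns; r++) {
            uint16_t v = runs[r * n + i];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
        /* With no runs there is nothing to compare, so report 0. */
        spread[i] = (numRuns > 0) ? (uint16_t)(hi - lo) : 0;
    }
}
```

Worked by hand on the three dumps above, entry 11 spans 128 counts (914 to 1042), entry 10 spans 74 (35953 to 36027), entry 32 spans 25 (13320 to 13345), and every other entry stays within 8 counts.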