Where are my CLI args?

MP 93: Some surprising behavior from pdb.

Note: I've been busy this week preparing a talk for PyCon next month, and a local presentation on the same topic. The ongoing series about building a Django project from a single file will continue next week.

I was resuming work on an old project this morning, and I wasn't getting the output I was expecting when running the project's main program against newly-updated data. I thought it had something to do with one of the CLI arguments, but the argument I was passing didn't seem to affect the program's behavior at all.

I wasn't sure the CLI argument was actually being processed, so I inserted a breakpoint right after the call to parser.parse_args(). I was surprised to find that the argument I was passing wasn't being processed at all. Or at least, that's what it looked like.

In this post I'll share a little background about this project, and explain why it only looked like the CLI arguments weren't being handled properly.

Landslides in southeast Alaska

Many people think of Alaska as cold and snowy, but southeast Alaska is actually a temperate rainforest. The proximity of the Pacific Ocean provides a steady source of moisture, and it moderates temperatures here as well. As a result, we get steady rain throughout most of the year.

Steady rain throughout the year also means heavy rains at times, especially in the fall. Heavy rain on steep mountain slopes can lead to landslides, and that dynamic has shaped the landscape in this region as long as humans have been here. Climate change, however, has made landslides happen more frequently, and with higher consequence.

map showing most mountains in the area susceptible to landslide activity
There are many maps that attempt to model the potential impact of landslides in southeast Alaska. This map shows areas prone to landslide activity near Sitka, and the likely runout areas if a slide were to occur. Source: https://sitkalandslide.org/areas-at-risk/

I'm giving a talk at PyCon next month about using river activity in small watersheds to monitor landslide risk in southeast Alaska. I'm also giving a presentation in my hometown next week about a monitoring project I've been working on for almost ten years now.

The direct cause of landslides is an increase in soil moisture content to the point where soil on a slope no longer adheres to the underlying layers on a mountainside. The most accurate approach to landslide monitoring focuses on measuring soil moisture content directly. But that approach is difficult and laborious; you need to place sensors directly in the soil on every slope you want to monitor. This involves remote sensing, and maintaining monitoring equipment in locations that are often difficult to reach.

Other approaches to landslide monitoring involve measuring precipitation, and correlating that to soil moisture content and landslide activity.

map showing many areas of known landslide activity
The Tongass landslide inventory shows areas where landslide activity is known to have occurred. This map covers roughly the same region as the one shown above. Click here for this specific map, and here for the project's home page.

A watershed-based approach to landslide monitoring

My home town, Sitka, is surrounded by steep mountains. Our rivers are quite short. For many of them, you can hike from the ocean to the headwaters in a day if you're happy to do some serious bushwhacking. When Sitka was hit by a fatal landslide in 2015, a number of us started wondering if changes in river activity could be used to indirectly assess landslide risk. We had been in the mountains during periods of heavy rain, and noticed that the rivers rose quite noticeably during rain events that were associated with landslide activity.

When it rains heavily in a small watershed, the river rises quickly and begins to recede shortly after the rain lets up. We wondered if these changes could be correlated with landslide activity. Fortunately the main river in our area has a stream gauge that's been maintained for decades, and all the data from it is publicly available.

The two factors we wanted to look at were the rate of change in the river height, and the total rise of the river. Brief heavy rain causes a rapid rise in the river, but the river doesn't rise much overall. This seems to correlate with the mountains quickly shedding moisture, and a low risk of landslides. Extended periods of moderate rain can cause a significant rise in the river, but without any significant overall rise. This seems to correlate with the mountains shedding water at a steady, sustainable rate. Extended periods of heavy rain result in a rapid rise in the river's height, with a significant total rise in the river's height. This seems to correlate to the mountains not being able to shed water as fast as it's falling, which is exactly what causes landslides to occur.

I wrote a program to run through all the river gage readings over a roughly five-year period, during which a number of landslides had been known to occur. The program would flag any periods where a total rise of 2.5 feet occurred, with an average rate of rise of at least 6 inches per hour. In the initial analysis period, this happened 9 times; 3 of those events were associated with known landslides. This led to the development of a real-time monitoring tool.

Back to those CLI args

I haven't done much with this project in the last few years. One of the challenges of landslide monitoring is that events are relatively infrequent, but have significant impact on individuals and communities. It takes years to evaluate the effectiveness of a real-time monitoring system.

Most of my work has been limited to posting notices when conditions in the mountains are such that the model is almost certainly invalid. For example, landslides are extremely unlikely to occur during the winter. This model is unreliable for much of the winter, and into the spring when the snowpack affects drainage patterns in the mountains.

This morning I set out to update the long-term analysis of this approach to monitoring. We've had a number of heavy rain events since I last ran the analysis in early 2021, and I was curious to see how many critical events had occurred, and how many were associated with known slide events.

When I first ran the analysis program, I got an error:

$ python process_hx_data.py
Parsing raw data files...
Traceback (most recent call last):
  ...
  get_readings_from_data_file
    return all_readings
           ^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'all_readings'...

This wasn't too surprising. I remembered that I had written the program to use cached data during development work, and I couldn't remember if that was the default behavior or if I needed to pass a CLI argument to process the new data.

Here's the relevant code for the CLI arguments. I wanted to make sure the CLI arguments were being processed correctly, so I inserted a breakpoint immediately after the call to parse_args():

parser = argparse.ArgumentParser()
parser.add_argument('--no-interactive-plots',
    ...
parser.add_argument('--use-cached-data',
    help="Use pickled data; don't parse raw data files.",
    action='store_true')

args = parser.parse_args()
breakpoint()

In the debugger, I tried to examine the value of args, and of args.use_cached_data:

$ python process_hx_data.py
...
-> def process_hx_data(root_output_directory=''):
(Pdb) args
(Pdb) args.use_cached_data
(Pdb)

Nothing! That was a little surprising. I tried again, this time passing the argument --use-cached-data:

$ python process_hx_data.py --use-cached-data
...
-> def process_hx_data(root_output_directory=''):
(Pdb) args
(Pdb) args.use_cached_data
(Pdb)

Nothing again! This was really confusing. I forget exactly how argparse works when I haven't used it for a while. I thought maybe args is just empty if no CLI arguments are passed. But when I pass --use-cached-data, args should certainly be defined. I should see True for args.use_cached_data.

Falling back to print debugging

I used print debugging for many years before I started using pdb, so I replaced the breakpoint with a print() call:

parser = argparse.ArgumentParser()
...
parser.add_argument('--use-cached-data',
    help="Use pickled data; don't parse raw data files.",
    action='store_true')

args = parser.parse_args()
print(args)

This generated much different output than what I was seeing in the debugger:

$ python process_hx_data.py
Namespace(no_interactive_plots=False, no_static_plots=False,
  use_cached_data=False)
Parsing raw data files...
Traceback (most recent call last):
  ...
UnboundLocalError...

The print() call shows that args is an argparse.Namespace object, and we can see the default values of all three CLI arguments.

Here's the output when --use-cached-data is passed in:

$ python process_hx_data.py --use-cached-data
Namespace(no_interactive_plots=False, no_static_plots=False,
  use_cached_data=True)
Reading data from pickled files...
Generating interactive plots...
  ...

Looking at the Namespace object, we can see that use_cached_data is indeed set to True.

What was going on with the pdb session? I tried replacing the print() call with a breakpoint again, but still got no output for these values in the debugging session.

pdb help

I looked at the documentation page for pdb, and didn't see anything that would explain this behavior. Then I ran help in a debugging session:

(Pdb) help

Documented commands (type help <topic>):
========================================
EOF    c          d        h         list      q        rv       undisplay
a      cl         debug    help      ll        quit     s        unt      
alias  clear      disable  ignore    longlist  r        source   until    
args   commands   display  interact  n         restart  step     up       
b      condition  down     j         next      return   tbreak   w        
break  cont       enable   jump      p         retval   u        whatis   
bt     continue   exit     l         pp        run      unalias  where    

Miscellaneous help topics:
==========================
exec  pdb

(Pdb)

Oh! There's a pdb command named args!

Let's see what it does:

(Pdb) help args
      Usage: a(rgs)
      
      Print the argument list of the current function.

That's good to know! In a debugging session, args displays the list of arguments associated with the current function. My breakpoint was outside of any function, so there are no arguments to display.

To make sure I was understanding this right, I put a breakpoint on the first line of the main function in the file:

def process_hx_data(root_output_directory=''):
    """Process all historical data in ir_data_clean/.
    ...
    """
    breakpoint()

If I call args in the debugging session now, I should see something about the argument root_output_directory:

$ python process_hx_data.py
-> slides_file = 'known_slides/known_slides.json'
(Pdb) args
root_output_directory = ''
(Pdb)

In this debugger session, args is showing that process_hx_data() was called with no argument for the parameter root_output_directory.

That's it. In previous debugger sessions, instead of looking at the value of args I had defined to hold the parsed CLI arguments, args was showing that there were no arguments in the current function.

Conclusions

This was an interesting edge case to run into. The standard usage of argparse involves the line

args = parser.parseargs()

I wonder how many people have tried to look at the value of args by putting a breakpoint after this line, and run into this very issue.

This has made me look more closely at the full set of commands that pdb makes available. Most of them are names I probably wouldn't use in a file, but there's a few that I could see using. I can imagine using display in a program to refer to some kind of screen object. I might be tempted to use restart for a boolean about whether some process should be restarted or not. If I do use these names, I'll try to remember to run help again if I see some odd behavior in a debugging session.

I'm pretty sure the actual issue in my program had nothing to do with CLI arguments now. I haven't finished debugging yet, but I'm sure the process will go more smoothly with a little better understanding of exactly how pdb works. I'll share more about the landslide monitoring project after these upcoming talks.

Resources

You can find the relevant code from this post in the sitka_irg_analysis GitHub repository; the main file is process_hx_data.py. Please note that this project hasn't seen much refactoring. Once this approach to landslide monitoring was shown to have value, I started focusing on creating a real-time monitoring tool.