Organizing Python Projects

A Structured Approach

Outline


  • Libraries
  • Applications
  • Science Repository

Libraries


  • Target Audience: Python coders
  • Libraries are imported and used in other codes
  • Used via an Application Programming Interface (API)

Libraries


Key Components:

  • packages (directories with __init__.py) and their modules
  • pyproject.toml
  • tests
  • license
  • readme

Library Structure

DBScan1D example:

dbscan1d
├── src
│   └── dbscan1d
│       ├── __init__.py
│       ├── version.py
│       ├── core.py
│       └── utils.py
├── tests
│   ├── test_core.py
│   └── test_utils.py
├── LICENSE.txt
├── README.md
└── pyproject.toml
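
As a rough sketch, the layout above can be scaffolded with the standard library alone. The `scaffold` helper and the `LAYOUT` list are hypothetical names made up for this example (they are not part of dbscan1d or any packaging tool), and the files are created empty.

```python
from pathlib import Path
import tempfile

# Files in the src layout shown above, relative to the project root.
LAYOUT = [
    "src/dbscan1d/__init__.py",
    "src/dbscan1d/version.py",
    "src/dbscan1d/core.py",
    "src/dbscan1d/utils.py",
    "tests/test_core.py",
    "tests/test_utils.py",
    "LICENSE.txt",
    "README.md",
    "pyproject.toml",
]


def scaffold(root: Path) -> list[Path]:
    """Create an empty placeholder file for each LAYOUT entry under root."""
    created = []
    for relative in LAYOUT:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)  # make src/, tests/, ...
        path.touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp) / "dbscan1d"):
            print(path.relative_to(tmp))
```

Keeping the importable package under `src/` means the tests run against the installed library rather than whatever happens to be in the working directory.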

Libraries: Build systems


Libraries: Extras


  • docs
  • GitHub config files (.github)
    • actions
    • templates
  • .ini files
  • various configurations

Libraries: Tips


  • Write the readme (with a tutorial) before coding
  • Write tests as you develop
  • Use a linter (e.g., run through pre-commit hooks)
  • Set up CI with GitHub Actions
  • Many templates exist to get started
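
To make "write tests as you develop" concrete, here is a minimal pytest-style sketch. The function `split_on_gaps` is invented for this example and is not dbscan1d's API; in a real project it would live under `src/` and the tests under `tests/`.

```python
def split_on_gaps(values, max_gap):
    """Group sorted numbers, starting a new group when a gap exceeds max_gap."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)  # close enough: extend the current group
        else:
            groups.append([value])    # too far (or first value): new group
    return groups


def test_close_values_share_a_group():
    assert split_on_gaps([1.0, 1.2, 5.0], max_gap=0.5) == [[1.0, 1.2], [5.0]]


def test_empty_input_gives_no_groups():
    assert split_on_gaps([], max_gap=0.5) == []
```

pytest collects functions whose names start with `test_`, so running `pytest` from the project root is enough to execute both checks.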

Applications

Designed for people (generally not codes)

  • Graphical User Interface (GUI)
  • Terminal User Interface (TUI)
  • Command-line tools
  • Web application
  • Web services

Applications


  • Most applications are written using a framework
  • The framework specifies how the project is set up
  • Pinning dependencies is usually best
  • Containerization (e.g., Docker) can be a good idea
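
As a small illustration of what "pinning" means, the sketch below flags requirement lines that lack an exact `==` version. The helper name `unpinned` and the package versions are illustrative assumptions, and the check is deliberately simplified (it does not parse the full requirement syntax).

```python
def unpinned(requirements):
    """Return the requirement lines that do not use an exact '==' pin."""
    loose = []
    for line in requirements:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            loose.append(line)
    return loose


requirements = [
    "flask==3.0.3",  # pinned: the same version on every install
    "numpy>=1.20",   # a range: may resolve to a different version later
    "httpx",         # unconstrained
]
print(unpinned(requirements))  # prints ['numpy>=1.20', 'httpx']
```

For an application, every dependency ending up in the pinned category is what makes a deployment repeatable; libraries usually prefer ranges so they stay compatible with other packages.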

GUIs, TUIs, Command Line

TUIs

Textual

Command line

  • Make a library
  • Define command-line entry points in pyproject.toml
  • Install the library
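
The three steps above might look like the following sketch. The package name `mytool`, the module `cli.py`, and the `greet` command are hypothetical; the `[project.scripts]` table is the standard place in pyproject.toml to declare command-line entry points.

```python
# Hypothetical file: src/mytool/cli.py
#
# Matching entry in pyproject.toml (shown here as a comment):
#
#   [project.scripts]
#   greet = "mytool.cli:main"
#
# After `pip install .`, typing `greet World` in a shell calls main().
import argparse


def main(argv=None):
    """Parse command-line arguments and print a greeting."""
    parser = argparse.ArgumentParser(description="Print a greeting.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="use upper case")
    args = parser.parse_args(argv)  # argv=None means: read sys.argv[1:]

    message = f"Hello, {args.name}!"
    if args.shout:
        message = message.upper()
    print(message)
    return 0  # becomes the process exit code when run as an entry point
```

Accepting an optional `argv` keeps the function easy to test, for example `main(["World", "--shout"])`, without touching the real command line.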

Textual Examples


Web Applications/Services

Minimalist frameworks

Flask, Quart, Sanic

Batteries included frameworks

Django, Pyramid

Static site generators

Quarto, Pelican, Hugo

The Science Repo


Purposes

  • Organize experiments and analysis
  • Make research reproducible
  • Generate figures
  • Share research

Useful Frameworks and Workflow Engines


Frameworks: Kedro

  • Opinionated end-to-end framework
  • Supports containerization, dependency resolution, logging, version control, etc.

project-template    # Project folder
├── conf            # Configuration files
├── data            # Local project data
├── docs            # Documentation
├── logs            # Logs of pipeline runs
├── notebooks       # Exploratory Jupyter notebooks 
├── pyproject.toml  # Identifies the project root
├── README.md       # README.md explaining your project
├── setup.cfg       # Configuration options for testing and linting
└── src             # Source code for pipelines

Workflow Engines


  • Resolve dependencies between jobs
  • Rerun jobs only when needed
  • Long history (e.g., make, SCons)
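
Both ideas can be sketched in a few lines of standard-library Python. This is a toy illustration in the spirit of make, not how Prefect, SCons, or any particular engine is implemented; the names `run_order` and `needs_rerun` and the example job names are invented for the sketch.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path


def run_order(dependencies):
    """Return job names ordered so every job comes after its prerequisites.

    dependencies maps each job name to the set of jobs it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def needs_rerun(target: Path, sources: list[Path]) -> bool:
    """Make-style check: rerun if target is missing or older than any source."""
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(src.stat().st_mtime > target_time for src in sources)


if __name__ == "__main__":
    jobs = {
        "download": set(),
        "preprocess": {"download"},
        "detect": {"preprocess"},
        "plot": {"detect"},
    }
    print(run_order(jobs))  # prints ['download', 'preprocess', 'detect', 'plot']
```

Real engines add much more on top (caching by content hash, retries, scheduling, parallel execution), but dependency ordering and staleness checks are the common core.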

Workflow Engines: Prefect

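from typing import List

import httpx
from prefect import flow, task


@task(retries=3)  # rerun the task up to 3 times if it raises an exception
def get_stars(repo: str):
    """Fetch and print the star count of one GitHub repository."""
    url = f"https://api.github.com/repos/{repo}"
    count = httpx.get(url).json()["stargazers_count"]
    print(f"{repo} has {count} stars!")


@flow(name="GitHub Stars")
def github_stars(repos: List[str]):
    """Call the get_stars task once per repository."""
    for repo in repos:
        get_stars(repo)


# run the flow!
if __name__ == "__main__":
    github_stars(["PrefectHQ/Prefect"])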

Simple Scripts?

Simple Project
├── a010_download_data.py
├── a020_preprocess_seismograms.py
├── a030_detect_earthquakes.py
├── a040_plot_detections.py
├── local.py
├── utils.py
├── inputs/
├── outputs/
│   ├── a010_raw_data
│   ├── a020_preprocessed_seismograms
│   └── ...
├── environment.yml
├── README.md
└── test_*.py...

Jupyter Notebooks?

  • Some people really like notebooks for science
    • Easy to interact with
    • Simple to get started
  • Others think they are harmful
    • Discourage modular code
    • Difficult to test
    • Stateful GOTOs?
  • Yet others have built tools to remedy these issues

Class Discussion


How do you organize your projects?