kd.ckpts.Checkpointer
- class kauldron.checkpoints.Checkpointer(
- *,
- _fake_refs: type[_FakeRefsUnset] | dict[str, _FakeRootCfg] = <class 'kauldron.utils.config_util._FakeRefsUnset'>,
- workdir: epath.PathLike = _FakeRootCfg('cfg.workdir'),
- save_interval_steps: int,
- max_to_keep: Optional[int] = 3,
- keep_time_interval: Optional[datetime.timedelta] = None,
- keep_period: Optional[int] = None,
- save_on_steps: Optional[Sequence[int]] = None,
- best_metric_path: Optional[str] = None,
- best_mode: str = 'max',
- multiprocessing_options: ocp.options.MultiprocessingOptions = <factory>,
- fast: bool = True,
- create: bool = True,
- )
Bases:
kauldron.checkpoints.checkpointer.BaseCheckpointer
Wrapper around Orbax CheckpointManager.
- workdir
Root directory of the task
- Type:
str | os.PathLike
- save_interval_steps
See ocp.CheckpointManagerOptions
- Type:
int
- max_to_keep
See ocp.CheckpointManagerOptions
- Type:
int | None
- keep_time_interval
See ocp.CheckpointManagerOptions
- Type:
datetime.timedelta | None
- keep_period
See ocp.CheckpointManagerOptions
- Type:
int | None
- save_on_steps
See ocp.CheckpointManagerOptions
- Type:
Sequence[int] | None
- best_metric_path
Path to evaluator’s metric for best checkpoint selection. Warning: If using a best_metric_path, the evaluator must be run inside the train loop and cannot be run as a separate job.
- Type:
str | None
- best_mode
See ocp.CheckpointManagerOptions
- Type:
str
- multiprocessing_options
See ocp.MultiprocessingOptions
- Type:
orbax.checkpoint.options.MultiprocessingOptions
- fast
(internal) Activate some optimizations
- Type:
bool
- create
(internal) Whether to create the checkpoint directory. Kauldron sets this automatically based on whether the job is a training job (True) or an eval job (False).
- Type:
bool
- workdir: epath.PathLike = _FakeRootCfg('cfg.workdir')
- save_interval_steps: int
- max_to_keep: int | None = 3
- keep_time_interval: datetime.timedelta | None = None
- keep_period: int | None = None
- save_on_steps: Sequence[int] | None = None
- best_metric_path: str | None = None
- best_mode: str = 'max'
- multiprocessing_options: ocp.options.MultiprocessingOptions
- fast: bool = True
- create: bool = True
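The retention options above defer to Orbax. As a rough mental model only (a simplified sketch, not Orbax's or Kauldron's implementation), `max_to_keep` retains the most recent checkpoints, while `keep_period` additionally protects steps that are multiples of it. The helper `retained_steps` is hypothetical, and it ignores `keep_time_interval` and the metric-based ranking enabled by `best_metric_path` and `best_mode`; see ocp.CheckpointManagerOptions for the authoritative behavior.

```python
from __future__ import annotations

from collections.abc import Sequence


def retained_steps(
    steps: Sequence[int],
    *,
    max_to_keep: int | None = 3,
    keep_period: int | None = None,
) -> list[int]:
    """Hypothetical helper: which saved steps survive cleanup (simplified)."""
    ordered = sorted(steps)
    if max_to_keep is None:
        return ordered  # No limit: nothing is deleted.
    # Keep the newest `max_to_keep` steps (guard against the `[-0:]` slice).
    newest = set(ordered[-max_to_keep:]) if max_to_keep > 0 else set()
    # Steps that are multiples of `keep_period` are never deleted.
    protected = {
        s for s in ordered if keep_period is not None and s % keep_period == 0
    }
    return sorted(newest | protected)


saved = [100, 200, 300, 400, 500, 600]
print(retained_steps(saved))                   # [400, 500, 600]
print(retained_steps(saved, keep_period=300))  # [300, 400, 500, 600]
```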
- restore(
- state: kauldron.checkpoints.checkpointer._StateT,
- *,
- step: int = -1,
- noop_if_missing: bool = False,
- donate: bool = True,
- )
Restore state.
- Parameters:
state – The state object initialized from the trainer. If the state is not known, you can pass kd.ckpt.items.StandardCheckpointItem() to restore the nested dict of weights.
step – The training step of the checkpoint to restore. -1 means last step.
noop_if_missing – If False, raises an error when no checkpoint is found.
donate – Whether to delete the initial state to free up memory when restoring the checkpoint. This avoids 2x memory consumption. Donating is safe if you no longer need the initial state after restoring.
- Returns:
The restored state.
- Raises:
FileNotFoundError – If no checkpoint is found and noop_if_missing is False.
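The `step` and `noop_if_missing` semantics described above can be sketched in plain Python. This is an illustrative model of the documented behavior, not Kauldron's code: `resolve_restore_step` is a hypothetical helper, and its convention of returning `None` to mean "keep the initial state" is an assumption made for the example.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_restore_step(
    all_steps: Sequence[int],
    step: int = -1,
    *,
    noop_if_missing: bool = False,
) -> int | None:
    """Hypothetical helper: pick the step to restore, or None for a no-op."""
    if not all_steps:
        if noop_if_missing:
            return None  # Nothing to restore: the caller keeps its state.
        raise FileNotFoundError("No checkpoint found.")
    if step == -1:
        return max(all_steps)  # -1 selects the last step.
    return step  # Explicit steps are passed through unchanged in this sketch.


print(resolve_restore_step([100, 200, 300]))           # 300
print(resolve_restore_step([100, 200, 300], 200))      # 200
print(resolve_restore_step([], noop_if_missing=True))  # None
```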
- should_save(step: int) -> bool
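This entry carries no description. Assuming the method follows the Orbax options that the class wraps, a simplified model is that a step qualifies when it is a multiple of `save_interval_steps` or is listed in `save_on_steps`. The function below is a hypothetical stand-in for illustration; the actual decision is made by the Orbax CheckpointManager and may include further conditions.

```python
from __future__ import annotations

from collections.abc import Sequence


def should_save_sketch(
    step: int,
    *,
    save_interval_steps: int,
    save_on_steps: Sequence[int] | None = None,
) -> bool:
    """Hypothetical model of the save decision (simplified assumption)."""
    on_interval = step % save_interval_steps == 0
    on_listed_step = save_on_steps is not None and step in save_on_steps
    return on_interval or on_listed_step


print(should_save_sketch(1000, save_interval_steps=500))                      # True
print(should_save_sketch(1234, save_interval_steps=500))                      # False
print(should_save_sketch(1234, save_interval_steps=500, save_on_steps=[1234]))  # True
```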
- delete(step: int) -> None
- save(
- state: kauldron.checkpoints.checkpoint_items.CheckpointItem,
- *,
- step: int,
- force: bool = False,
- metrics: Any | None = None,
- )
Save state.
- maybe_save(
- state,
- *,
- step: int,
- force: bool = False,
- )
Save state.
- property latest_step: int | None
- property all_steps: Sequence[int]
- reload() -> None
Refresh the cache.
For performance, the checkpointer caches the directory names. Calling this function resets the cache to allow scanning the checkpoint directory for new checkpoints.
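The caching behavior described above can be illustrated with a small stand-alone class. This is a sketch under assumptions, not Kauldron's implementation: `CachedStepLister` is hypothetical, and it assumes each checkpoint lives in a subdirectory named after its integer step.

```python
from __future__ import annotations

import pathlib
import tempfile


class CachedStepLister:
    """Hypothetical example: caches the step directories found under a root."""

    def __init__(self, root: pathlib.Path):
        self._root = root
        self._steps: list[int] | None = None  # None means "not scanned yet".

    @property
    def all_steps(self) -> list[int]:
        if self._steps is None:  # Scan the directory only on a cache miss.
            self._steps = sorted(
                int(p.name)
                for p in self._root.iterdir()
                if p.is_dir() and p.name.isdigit()
            )
        return self._steps

    def reload(self) -> None:
        """Drops the cache so the next access rescans the directory."""
        self._steps = None


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "100").mkdir()
    lister = CachedStepLister(root)
    lister.all_steps          # First access scans the directory.
    (root / "200").mkdir()    # A new checkpoint appears, e.g. from another job.
    stale = lister.all_steps  # Served from the cache: step 200 is not visible.
    lister.reload()
    fresh = lister.all_steps  # Rescanned after the cache reset.
```

In this model, an evaluation job that polls for checkpoints written by a separate training job would need to call `reload()` before reading `all_steps` or `latest_step`, otherwise it keeps seeing the cached listing.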
- item_metadata(
- step: int = -1,
- )
Returns the metadata (tree, shape, …) associated with the step.
- iter_new_checkpoints(
- *,
- min_interval_secs: int = 0,
- timeout: int | None = None,
- timeout_fn: collections.abc.Callable[[], bool] | None = None,
- )
Wrapper around ocp.checkpoint_utils.checkpoints_iterator.
- wait_until_finished() -> None
Synchronizes the asynchronous checkpointing.
- close() -> None
Closes the checkpointer.