runner.SimpleSampleDatasetsProvider
December 14, 2023
Builds a sampling tf.data.Dataset from multiple filenames.
Inherits From: DatasetProvider
runner.SimpleSampleDatasetsProvider(
principal_file_pattern: Optional[str] = None,
extra_file_patterns: Optional[Sequence[str]] = None,
principal_weight: Optional[float] = None,
extra_weights: Optional[Sequence[float]] = None,
*,
principal_filenames: Optional[Sequence[str]] = None,
extra_filenames: Optional[Sequence[Sequence[str]]] = None,
principal_cardinality: Optional[int] = None,
fixed_cardinality: bool = False,
shuffle_filenames: bool = False,
interleave_fn: Callable[..., tf.data.Dataset],
examples_shuffle_size: Optional[int] = None
)
For complete explanations regarding sampling see _process_sampled_dataset().
This SimpleSampleDatasetsProvider builds a tf.data.Dataset as follows:
- The object is initialized with a list of filenames specified by the principal_filenames and extra_filenames arguments. For convenience, the corresponding file patterns principal_file_pattern and extra_file_patterns can be specified instead; each pattern is expanded to a sorted list of filenames.
- The filenames are sharded between replicas according to the InputContext (order matters).
- Filenames are shuffled per replica (if requested).
- Examples from all file patterns are sampled according to principal_weight and extra_weights.
- The files in each shard are interleaved after being read by the interleave_fn.
- Examples are shuffled (if requested), auto-prefetched, and returned for use in one replica of the trainer.
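The sharding and per-replica shuffling steps above can be sketched in plain Python. This is only an illustration, not the provider's implementation: it assumes sharding behaves like tf.data.Dataset.shard (element i goes to shard i % num_shards), and the helper names and filenames are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def shard_filenames(filenames: Sequence[str], num_shards: int, index: int) -> List[str]:
    """Keeps every `num_shards`-th filename, starting at position `index`."""
    return [name for i, name in enumerate(filenames) if i % num_shards == index]


def filenames_for_replica(
    filenames: Sequence[str],
    num_input_pipelines: int,
    input_pipeline_id: int,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[str]:
    # Sorting emulates the sorted expansion of a file pattern, so that every
    # replica shards the same ordered list (this is why order matters).
    shard = shard_filenames(sorted(filenames), num_input_pipelines, input_pipeline_id)
    if shuffle:
        # Shuffles only the filenames assigned to this replica.
        random.Random(seed).shuffle(shard)
    return shard


# Hypothetical filenames, for illustration only.
files = [f"train-{i:05d}-of-00006" for i in range(6)]
replica_0 = filenames_for_replica(files, num_input_pipelines=2, input_pipeline_id=0)
replica_1 = filenames_for_replica(files, num_input_pipelines=2, input_pipeline_id=1)
print(replica_0)
print(replica_1)
```

With two input pipelines, the replicas receive disjoint halves of the file list, so no example is read twice within one pass.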
Methods
get_dataset
get_dataset(
context: tf.distribute.InputContext
) -> tf.data.Dataset
Creates a tf.data.Dataset by sampling.
The contents of the resulting tf.data.Dataset are sampled from several
sources, each stored as a sharded dataset:
- one principal input, whose size determines the size of the resulting tf.data.Dataset;
- zero or more side inputs, which are repeated if necessary to preserve the requested sampling weights.
Each input dataset is sharded before interleaving. The result of interleaving is
shuffled only if an examples_shuffle_size is provided.
Sampling is done with tf.data.Dataset.sample_from_datasets. For
sampling details, please refer to the TensorFlow documentation at:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#sample_from_datasets.
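As a rough mental model of the weighted mixing, the following pure-Python sketch draws each output element from a randomly chosen source, repeats the side inputs indefinitely, and stops once the principal input is selected after it has run out. It is a stand-in for illustration and does not use or reproduce tf.data; sample_mixture and its arguments are hypothetical names.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]


def sample_mixture(
    principal: Sequence[Example],
    extras: Sequence[Sequence[Example]],
    weights: Sequence[float],
    seed: int = 0,
) -> List[Example]:
    """Mixes one principal input with repeated side inputs by weight."""
    if len(weights) != 1 + len(extras):
        raise ValueError("Expected one weight per input.")
    rng = random.Random(seed)
    # The principal input is read once; side inputs cycle so they never run out.
    sources = [iter(principal)] + [itertools.cycle(extra) for extra in extras]
    mixed: List[Example] = []
    while True:
        (choice,) = rng.choices(range(len(sources)), weights=weights)
        try:
            mixed.append(next(sources[choice]))
        except StopIteration:
            # Only the principal input can be exhausted, which ends the mixture.
            return mixed


principal = [("principal", i) for i in range(100)]
side = [("side", i) for i in range(3)]
mixed = sample_mixture(principal, [side], weights=[0.25, 0.75])
num_side = sum(1 for source, _ in mixed if source == "side")
print(len(mixed), num_side)
```

Every principal element appears exactly once and in order, while the three side elements recur many times to hold the requested 1:3 ratio.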
Two methods are supported to determine the end of the resulting
tf.data.Dataset:
- fixed_cardinality=True: Returns a dataset with a fixed cardinality, set at principal_cardinality // principal_weight. principal_dataset and principal_cardinality are required for this method. principal_weight is required iff extra_weights are also provided.
- fixed_cardinality=False: Returns a dataset that ends after the principal input has been exhausted, subject to the random selection of samples. principal_dataset is required for this method. principal_weight is required iff extra_weights are also provided.
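The fixed cardinality is plain floor division. The helper below is a hypothetical illustration of that arithmetic, not a library function. It assumes weights are probabilities in (0, 1], which the text above does not state.

```python
def fixed_cardinality(principal_cardinality: int, principal_weight: float) -> int:
    """Computes principal_cardinality // principal_weight as an integer."""
    if principal_cardinality < 0:
        raise ValueError("principal_cardinality must be non-negative.")
    # Assumption of this sketch: the weight is a probability in (0, 1].
    if not 0.0 < principal_weight <= 1.0:
        raise ValueError("principal_weight must be in (0, 1].")
    # Floor division by a float yields a float, so convert back to int.
    return int(principal_cardinality // principal_weight)


print(fixed_cardinality(1000, 1.0))   # 1000
print(fixed_cardinality(1000, 0.5))   # 2000
print(fixed_cardinality(1000, 0.25))  # 4000
```

A smaller principal_weight therefore yields a longer dataset: at weight 0.25, roughly three side elements accompany each principal element.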
The choice of principal_dataset is important and should, in most cases, be
the largest underlying dataset as compared to extra_datasets.
For example, given positives and negatives where len(negatives) >> len(positives) and with
positives corresponding to principal_dataset, the desired behavior of epochs
determined by the exhaustion of positives and the continued mixing of unique
elements from negatives may not occur: on reiteration of the sampled dataset,
positives will again be exhausted, but the elements from negatives may be the
same ones seen in the previous epoch (as they occur at the beginning of the same,
reiterated underlying negatives dataset). In this case, the recommendations
are to:
1) Reformulate the sampling in terms of the larger dataset (negatives), either with fixed_cardinality=False, if the exhaustion of negatives is desired, or with fixed_cardinality=True, where principal_cardinality can be used to specify the desired number of elements from negatives.
2) Ensure that the underlying principal_dataset of negatives is well-sharded. In this way, the nondeterminism of interleaving will randomly access elements of negatives on reiteration.
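The reiteration pitfall can be reproduced with a small pure-Python simulation. It is a hypothetical sketch that ignores sharding and interleaving: each epoch restarts both inputs from the beginning and ends when positives (the principal input) is exhausted.

```python
import random
from typing import List, Sequence

_DONE = object()  # Sentinel marking an exhausted iterator.


def negatives_seen_in_epoch(
    positives: Sequence[int],
    negatives: Sequence[int],
    positive_weight: float,
    seed: int,
) -> List[int]:
    """Returns the negatives drawn during one pass over the sampled dataset."""
    rng = random.Random(seed)
    positives_iter = iter(positives)
    negatives_iter = iter(negatives)  # Restarts at element 0 on every epoch.
    seen: List[int] = []
    while True:
        if rng.random() < positive_weight:
            if next(positives_iter, _DONE) is _DONE:
                return seen  # The principal input is exhausted: the epoch ends.
        else:
            value = next(negatives_iter, _DONE)
            if value is _DONE:
                return seen
            seen.append(value)


positives = list(range(10))
negatives = list(range(100_000))
epochs = [
    negatives_seen_in_epoch(positives, negatives, positive_weight=0.5, seed=epoch)
    for epoch in range(3)
]
for seen in epochs:
    # Each epoch only ever sees a prefix of negatives.
    print(len(seen), seen == list(range(len(seen))))
```

Even with a different random seed per epoch, every pass draws from the same leading elements of negatives, and the rest of the dataset is never reached.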
| Args | |
|---|---|
| context | A tf.distribute.InputContext for sharding. |
| Returns | |
|---|---|
| A tf.data.Dataset. | |