Version 0.22.0
May 15, 2020
Major Features and Improvements
Bug Fixes and Other Changes
- Crop values in natural language stats generator.
- Switch to using PyBind11 instead of SWIG for wrapping C++ libraries.
- CSV decoder support for multivalent columns by using tfx_bsl's decoder.
- When inferring a schema entry for a feature, do not add a shape with dim = 0 when min_num_values = 0.
- Add utility methods `tfdv.get_slice_stats` to get statistics for a slice and `tfdv.compare_slices` to compare statistics of two slices using Facets.
- Make `tfdv.load_stats_text` and `tfdv.write_stats_text` public.
- Add PTransforms `tfdv.WriteStatisticsToText` and `tfdv.WriteStatisticsToTFRecord` to write statistics proto to text and tfrecord files respectively.
- Modify `tfdv.load_statistics` to handle reading statistics from TFRecord and text files.
- Added an extra requirement group `mutual-information`. As a result, barebone TFDV does not require `scikit-learn` any more.
- Added an extra requirement group `visualization`. As a result, barebone TFDV does not require `ipython` any more.
- Added an extra requirement group `all` that specifies all the extra dependencies TFDV needs. Use `pip install tensorflow-data-validation[all]` to pull in those dependencies.
- Depends on `pyarrow>=0.16,<0.17`.
- Depends on `apache-beam[gcp]>=2.20,<3`.
- Depends on `ipython>=7,<8;python_version>="3"`.
- Depends on `scikit-learn>=0.18,<0.24`.
- Depends on `tensorflow>=1.15,!=2.0.*,<3`.
- Depends on `tensorflow-metadata>=0.22.0,<0.23`.
- Depends on `tensorflow-transform>=0.22,<0.23`.
- Depends on `tfx-bsl>=0.22,<0.23`.
Known Issues
- (Known issue resolution) It is no longer necessary to use Apache Beam 2.17 when running TFDV on Windows. The current release of Apache Beam will work.
Breaking Changes
- `tfdv.GenerateStatistics` now accepts a PCollection of `pa.RecordBatch` instead of `pa.Table`.
- All the TFDV coders now output a PCollection of `pa.RecordBatch` instead of a PCollection of `pa.Table`.
- `tfdv.validate_instances` and `tfdv.api.validation_api.IdentifyAnomalousExamples` now take `pa.RecordBatch` as input instead of `pa.Table`.
- The `StatsGenerator` interface (and all its sub-classes) now takes `pa.RecordBatch` as the input data instead of `pa.Table`.
- Custom slicing functions now accept a `pa.RecordBatch` instead of `pa.Table` as input and should output a tuple `(slice_key, record_batch)`.
Deprecations
- Deprecating Py2 support.