Handling sensitive data

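In some data-sensitive contexts (for example, private health records), sharing a report that includes a sample would violate privacy constraints. The following configuration shorthand groups together several options so that the report contains only aggregate information and does not display any individual records.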

report = df.profile_report(sensitive=True)

Additionally, ydata-profiling does not send data to external services, which makes it suitable for private data.

Sample and duplicates

To guarantee that the report does not directly leak any data, the dataset sample and the duplicate rows can be explicitly disabled:

report = df.profile_report(duplicates=None, samples=None)

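Alternatively, a sample can still be shown, populated with mock or synthetic data instead of real records. The following snippet demonstrates how to generate such a report. Note that the name and caption keys are optional.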

Generate profiling with sensitive data: mocked sample
import pandas as pd

# Replace with the sample you'd like to present in the report (can be from a mock or synthetic data generator)
sample_custom_data = pd.DataFrame()
sample_description = "Disclaimer: the following sample consists of synthetic data following the format of the underlying dataset."

report = df.profile_report(
    sample={
        "name": "Mock data sample",
        "data": sample_custom_data,
        "caption": sample_description,
    }
)
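
The empty DataFrame above is only a placeholder. One way to fill it is sketched below. The helper make_mock_sample is hypothetical (it is not part of ydata-profiling) and assumes it is enough to keep the column names and broad column kinds while generating values independently of the real data.

```python
import numpy as np
import pandas as pd


def make_mock_sample(df: pd.DataFrame, n_rows: int = 5, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: build a sample with df's columns but generated values."""
    rng = np.random.default_rng(seed)
    mock = {}
    for column in df.columns:
        if pd.api.types.is_bool_dtype(df[column]):
            # Random booleans, unrelated to the real distribution
            mock[column] = rng.integers(0, 2, size=n_rows).astype(bool)
        elif pd.api.types.is_numeric_dtype(df[column]):
            # Arbitrary integers in [0, 100), not derived from the real range
            mock[column] = rng.integers(0, 100, size=n_rows)
        else:
            # Placeholder labels such as "name_0", "name_1", ...
            mock[column] = [f"{column}_{i}" for i in range(n_rows)]
    return pd.DataFrame(mock)


# Toy stand-in for a sensitive dataset (assumed for illustration only)
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 51], "smoker": [True, False]})
sample_custom_data = make_mock_sample(df)
print(sample_custom_data)
```

The resulting sample_custom_data can then be passed as the "data" key of the sample argument shown above. A dedicated synthetic data generator would give more realistic values; this sketch only preserves the shape of the table.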

Warning

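Take care when using pandas.read_csv with sensitive data such as phone numbers. By default, pandas' type inference coerces phone numbers (for example, 0612345678) to numeric values. This leaks information through aggregates (min, max, quantiles). To prevent this, keep the string representation.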

pd.read_csv("filename.csv", dtype={"phone": str})
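
The effect can be reproduced with a small in-memory CSV. The snippet below is a minimal sketch; the column name phone and the two sample numbers are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a sensitive file
csv_text = "phone\n0612345678\n0698765432\n"

# Default type inference: the column is parsed as integers and the leading zero is lost
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["phone"].tolist())
print(inferred["phone"].min(), inferred["phone"].max())

# Explicit string dtype: the original representation is preserved
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})
print(as_text["phone"].tolist())
```

With the inferred integer column, numeric aggregates such as the minimum and maximum are themselves complete phone numbers. Read as strings, the column keeps its leading zero and is not treated as numeric.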

Note that type detection is hard. That is why visions, a type system that helps developers handle these cases, was developed.

Automated PII classification & management

You can find more details about this feature here.