Dataset description and metadata

Dataset description

The following example shows how to generate a report with a description, copyright_holder, copyright_year, creator, and url. In the generated report, these properties appear in the Overview section, under About.

Adding a description to the profiling report
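from pathlib import Path

import ydata_profiling  # registers DataFrame.profile_report

# df is an existing pandas DataFrame
report = df.profile_report(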
    title="Masked data",
    dataset={
        "description": "This profiling report was generated using a sample of 5% of the original dataset.",
        "copyright_holder": "StataCorp LLC",
        "copyright_year": 2020,
        "url": "http://www.stata-press.com/data/r15/auto2.dta",
    },
)

report.to_file(Path("stata_auto_report.html"))
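When the metadata is gathered from several sources, some fields may be unset. The following is a minimal sketch, not part of ydata-profiling, of a helper that assembles the dataset dictionary from the fields named above and leaves out any that are missing. The function name build_dataset_metadata is hypothetical.

```python
from typing import Optional


def build_dataset_metadata(
    description: Optional[str] = None,
    creator: Optional[str] = None,
    copyright_holder: Optional[str] = None,
    copyright_year: Optional[int] = None,
    url: Optional[str] = None,
) -> dict:
    """Return a dict for the `dataset` argument, skipping unset fields."""
    fields = {
        "description": description,
        "creator": creator,
        "copyright_holder": copyright_holder,
        "copyright_year": copyright_year,
        "url": url,
    }
    # Keep only the fields that were actually provided
    return {key: value for key, value in fields.items() if value is not None}


dataset = build_dataset_metadata(
    description="This profiling report was generated using a sample of 5% of the original dataset.",
    copyright_holder="StataCorp LLC",
    copyright_year=2020,
    url="http://www.stata-press.com/data/r15/auto2.dta",
)
# The result can then be passed on, for example:
# report = df.profile_report(title="Masked data", dataset=dataset)
```

Because creator was not supplied, it is simply absent from the dictionary.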

Column descriptions

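In addition to providing dataset details, users often want to include descriptions of specific columns when sharing reports with team members and stakeholders. ydata-profiling supports creating these descriptions, so that the report includes a built-in data dictionary. By default, the descriptions are presented in the Overview section of the report, next to each variable.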

Generate a report with per-variable descriptions
profile = df.profile_report(
    variables={
        "descriptions": {
            "files": "Files in the filesystem",  # variable name: variable description
            "datec": "Creation date",
            "datem": "Modification date",
        }
    }
)

profile.to_file("report.html")

Alternatively, column descriptions can be loaded from a JSON file

dataset_column_definition.json
{
    "column name 1": "column 1 definition",
    "column name 2": "column 2 definition"
}
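As a rough illustration of how such a file could be produced, the sketch below writes a dictionary of descriptions to dataset_column_definition.json with the standard json module and reads it back. The descriptions reuse the column names from the earlier example, and the temporary directory is only there to keep the sketch self-contained.

```python
import json
import tempfile
from pathlib import Path

# Illustrative descriptions, reusing the column names from the example above
definitions = {
    "files": "Files in the filesystem",
    "datec": "Creation date",
    "datem": "Modification date",
}

# A temporary directory keeps the sketch from touching the working directory
definition_file = Path(tempfile.mkdtemp()) / "dataset_column_definition.json"

# Write a JSON object whose keys are column names and whose values are descriptions
definition_file.write_text(json.dumps(definitions, indent=4), encoding="utf-8")

# Reading the file back yields the same dictionary
loaded = json.loads(definition_file.read_text(encoding="utf-8"))
```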

Generate a report with per-variable descriptions from a JSON definitions file
import json
import pandas as pd
import ydata_profiling

definition_file = "dataset_column_definition.json"

# Read the variable descriptions
with open(definition_file, "r") as f:
    definitions = json.load(f)

# By default, the descriptions are presented in the Overview section, next to each variable
report = df.profile_report(variables={"descriptions": definitions})

# We can disable showing the descriptions next to each variable
report = df.profile_report(
    variables={"descriptions": definitions}, show_variable_description=False
)

report.to_file("report.html")
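Since the definitions file is maintained separately from the data, its keys can drift away from the actual column names. The sketch below is one possible sanity check to run before building the report. The helper check_descriptions is hypothetical and not part of ydata-profiling, and the column list stands in for list(df.columns).

```python
def check_descriptions(columns, definitions):
    """Return (columns without a description, descriptions without a column)."""
    column_set = set(columns)
    missing = [name for name in columns if name not in definitions]
    unknown = [name for name in definitions if name not in column_set]
    return missing, unknown


# Stand-in for list(df.columns); "size" is an illustrative extra column
columns = ["files", "datec", "datem", "size"]

# Illustrative definitions; "created" does not match any column
definitions = {
    "files": "Files in the filesystem",
    "datec": "Creation date",
    "created": "Creation date (stale key)",
}

missing, unknown = check_descriptions(columns, definitions)
print("Columns without a description:", missing)
print("Descriptions without a column:", unknown)
```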

Dataset schema

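In addition to providing dataset details, users often want to include a schema that sets the variable types. This is particularly important when integrating ydata-profiling with the information already held in a data catalog. When using the ydata-profiling ProfileReport, users can set the type_schema property to control the data types used in the generated profile. By default, the type_schema is inferred automatically with visions.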

Set the variable type schema to generate the profiling report
import json
import pandas as pd

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
df = pd.read_csv(file_name)

type_schema = {"Survived": "categorical", "Embarked": "categorical"}

# We can set the type_schema only for the variables whose types we are certain of. All the others will be inferred automatically.
report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

report.to_file("report.html")
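When the types come from a data catalog, they usually have to be translated into the names that type_schema expects. The sketch below shows one way this might look. The catalog contents and the CATALOG_TO_PROFILE table are hypothetical; "categorical" is taken from the example above, while "numeric" is assumed to be an accepted type name and should be checked against the installed version.

```python
# Hypothetical catalog entries: column name -> type recorded in the catalog
catalog = {
    "Survived": "enum",
    "Embarked": "enum",
    "Age": "float",
    "Name": "string",
}

# Hypothetical translation table from catalog types to type_schema names.
# "numeric" is an assumption; only "categorical" appears in the example above.
CATALOG_TO_PROFILE = {
    "enum": "categorical",
    "float": "numeric",
}

# Columns with no known translation are left out, so their types are inferred
type_schema = {
    column: CATALOG_TO_PROFILE[catalog_type]
    for column, catalog_type in catalog.items()
    if catalog_type in CATALOG_TO_PROFILE
}

# The result can then be passed on, for example:
# report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)
```

Leaving untranslated columns out of the dictionary keeps the default behavior for them, in line with setting type_schema only for variables whose types are certain.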