Processes in this Phase
Data Selection
Select the records (rows) and attributes (columns) most relevant to the modeling goal, based on business requirements, data quality, and feasibility constraints such as data volume.
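As a minimal sketch of selection, the snippet below keeps only rows that match a business filter and only columns whose missing-value rate stays under a threshold. The field names, the `select_data` helper, and the 40% threshold are hypothetical choices for illustration, and only the Python standard library is used; a real project would likely use a dataframe library.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def missing_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


def select_data(
    rows: List[Row],
    candidate_columns: List[str],
    max_missing: float = 0.3,
    row_filter: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Keep rows passing `row_filter`, then columns with acceptable quality."""
    kept_rows = [r for r in rows if row_filter is None or row_filter(r)]
    kept_cols = [
        c for c in candidate_columns if missing_rate(kept_rows, c) <= max_missing
    ]
    return [{c: r.get(c) for c in kept_cols} for r in kept_rows]


# Hypothetical customer records; "fax" is mostly empty.
records = [
    {"customer_id": 1, "region": "EU", "age": 34, "fax": None},
    {"customer_id": 2, "region": "EU", "age": None, "fax": None},
    {"customer_id": 3, "region": "US", "age": 51, "fax": None},
    {"customer_id": 4, "region": "EU", "age": 29, "fax": "555-0100"},
]

selected = select_data(
    records,
    candidate_columns=["customer_id", "age", "fax"],
    max_missing=0.4,
    row_filter=lambda r: r["region"] == "EU",  # business requirement: EU only
)
```

Here the sparse `fax` column is dropped and the non-EU row is excluded, while `age` is kept because only one of the three remaining rows lacks it.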
Data Cleaning
Handle missing values, remove duplicates, correct errors, and address outliers to improve data quality for modeling.
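The cleaning steps named above can be sketched as follows, assuming a hypothetical list of order records with `order_id` and `amount` fields. Median imputation and clipping to the 1.5 × IQR fences are just one reasonable choice among many; the right strategy depends on the data and the model.

```python
import statistics


def clean_orders(rows):
    """Deduplicate by order_id, impute missing amounts, clip outliers."""
    # 1. Remove duplicates: keep the first occurrence of each order_id.
    seen_ids = set()
    unique = []
    for row in rows:
        if row["order_id"] in seen_ids:
            continue
        seen_ids.add(row["order_id"])
        unique.append(dict(row))  # copy so the input is not mutated

    # 2. Compute statistics from the observed (non-missing) amounts only.
    observed = [r["amount"] for r in unique if r["amount"] is not None]
    median = statistics.median(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # 3. Impute missing values with the median, then clip to the IQR fences.
    for r in unique:
        amount = median if r["amount"] is None else r["amount"]
        r["amount"] = min(max(amount, low), high)
    return unique


raw_orders = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 11},
    {"order_id": 1, "amount": 10},    # duplicate
    {"order_id": 3, "amount": 12},
    {"order_id": 4, "amount": 12},
    {"order_id": 5, "amount": None},  # missing value
    {"order_id": 6, "amount": 13},
    {"order_id": 7, "amount": 14},
    {"order_id": 8, "amount": 15},
    {"order_id": 9, "amount": 500},   # outlier
]

cleaned = clean_orders(raw_orders)
```

Error correction (for example fixing invalid codes or units) is domain specific and is left out of this sketch.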
Data Construction
Create new features through aggregation, derivation, and domain-specific transformations to enhance model performance.
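To illustrate aggregation and derivation, the sketch below turns a hypothetical transaction log into per-customer features. The feature names (`order_count`, `total_spent`, `avg_order_value`, `days_since_last_order`) are assumptions chosen for the example, not a prescribed set.

```python
from collections import defaultdict
from datetime import date


def build_customer_features(transactions, as_of):
    """Aggregate transactions per customer and derive simple features."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["customer_id"]].append(t)

    features = {}
    for customer_id, txs in grouped.items():
        total = sum(t["amount"] for t in txs)
        last_order = max(t["date"] for t in txs)
        features[customer_id] = {
            "order_count": len(txs),                              # aggregation
            "total_spent": total,                                 # aggregation
            "avg_order_value": total / len(txs),                  # derivation
            "days_since_last_order": (as_of - last_order).days,   # derivation
        }
    return features


transactions = [
    {"customer_id": "a", "date": date(2024, 1, 5), "amount": 40.0},
    {"customer_id": "a", "date": date(2024, 1, 20), "amount": 60.0},
    {"customer_id": "b", "date": date(2024, 1, 10), "amount": 25.0},
]

customer_features = build_customer_features(transactions, as_of=date(2024, 2, 1))
```

Passing `as_of` explicitly, rather than calling `date.today()` inside the function, keeps the derived recency feature reproducible.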
Data Integration
Combine data from multiple sources into a unified dataset, resolving schema conflicts and ensuring consistency.
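A small sketch of integration, assuming two hypothetical sources: a CRM export that stores the key as a zero-padded string (`cust_id`) and a billing system that stores it as an integer (`customerId`) with balances in cents. Each source is first mapped onto one shared schema, then the two are combined with a left join on the unified key.

```python
def normalize_crm(row):
    """Map a CRM row onto the shared schema (integer key, upper-case country)."""
    return {
        "customer_id": int(row["cust_id"]),
        "country": row["Country"].strip().upper(),
    }


def normalize_billing(row):
    """Map a billing row onto the shared schema (balance in currency units)."""
    return {
        "customer_id": row["customerId"],
        "balance": row["balance_cents"] / 100,
    }


def integrate(crm_rows, billing_rows):
    """Left-join billing data onto CRM customers by customer_id."""
    billing_by_id = {
        b["customer_id"]: b for b in map(normalize_billing, billing_rows)
    }
    merged = []
    for customer in map(normalize_crm, crm_rows):
        billing = billing_by_id.get(customer["customer_id"], {})
        # Customers without a billing record keep an explicit None balance.
        merged.append({**customer, "balance": billing.get("balance")})
    return merged


crm = [
    {"cust_id": "001", "Country": " de "},
    {"cust_id": "002", "Country": "US"},
]
billing = [{"customerId": 1, "balance_cents": 1250}]

unified = integrate(crm, billing)
```

Keeping unmatched customers with a `None` balance, instead of silently dropping them, makes gaps between the sources visible to the later cleaning step.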
Data Formatting
Transform data into the format the modeling tool requires, including encoding categorical variables, normalizing numeric features, and splitting into training and test sets.
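The formatting steps can be sketched as below, assuming hypothetical rows with one categorical field (`color`) and one numeric field (`size`). The split is done first and the encoding and min-max parameters are fitted on the training rows only, a common precaution against leaking test information into the model inputs.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_format(train_rows):
    """Learn category list and min/max from the training rows only."""
    sizes = [r["size"] for r in train_rows]
    return {
        "categories": sorted({r["color"] for r in train_rows}),
        "min": min(sizes),
        "max": max(sizes),
    }


def apply_format(row, params):
    """Return [scaled_size, one-hot color...] as a numeric vector."""
    span = (params["max"] - params["min"]) or 1.0  # avoid division by zero
    scaled = (row["size"] - params["min"]) / span
    # A category unseen during fitting encodes as all zeros.
    one_hot = [1.0 if row["color"] == c else 0.0 for c in params["categories"]]
    return [scaled] + one_hot


rows = [
    {"id": i, "color": color, "size": size}
    for i, (color, size) in enumerate(
        [("red", 1.0), ("blue", 2.0), ("red", 3.0), ("green", 4.0),
         ("blue", 5.0), ("green", 6.0), ("red", 7.0), ("blue", 8.0)]
    )
]

train_rows, test_rows = train_test_split(rows, test_fraction=0.25, seed=42)
params = fit_format(train_rows)
train_vectors = [apply_format(r, params) for r in train_rows]
test_vectors = [apply_format(r, params) for r in test_rows]
```

Test rows are scaled with the training minimum and maximum, so their scaled values may fall slightly outside the range 0 to 1.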
Data Pipeline Development
Build automated, reproducible data pipelines to ensure consistent data preparation across development and production.
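One simple way to get this consistency is to express preparation as an ordered list of named, side-effect-free steps and run the same list in every environment. The `Pipeline` class and the three steps below are a hypothetical, standard-library-only sketch; production systems typically rely on a dedicated pipeline or orchestration framework.

```python
class Pipeline:
    """Run named preparation steps in a fixed order."""

    def __init__(self, steps):
        self.steps = list(steps)  # list of (name, function) pairs

    def run(self, rows):
        for _name, step in self.steps:
            rows = step(rows)
        return rows


def drop_incomplete(rows):
    """Remove rows that contain any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def deduplicate(rows):
    """Keep the first row for each id."""
    seen, unique = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    return unique


def add_total(rows):
    """Derive total = qty * price without mutating the input rows."""
    return [{**r, "total": r["qty"] * r["price"]} for r in rows]


prepare = Pipeline([
    ("drop_incomplete", drop_incomplete),
    ("deduplicate", deduplicate),
    ("add_total", add_total),
])

raw = [
    {"id": 1, "qty": 2, "price": 3.0},
    {"id": 1, "qty": 2, "price": 3.0},
    {"id": 2, "qty": None, "price": 5.0},
    {"id": 3, "qty": 1, "price": 4.5},
]

prepared = prepare.run(raw)
```

Because each step returns new rows instead of modifying its input, running `prepare.run(raw)` again yields the same result, which is the property that makes development and production outputs comparable.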