πŸ“Š Phase 2 β€’ Process 2.1

Data Collection

Identify, access, and gather all relevant data sources needed for the ML project, creating a comprehensive data inventory and establishing data pipelines.

Duration: 5-10 Days
Key Roles: Data Engineer, Data Scientist
Complexity: 🟑 Medium-High
🎯 Overview

Data Collection is the first process of Phase 2 and marks the transition from planning to execution. This process involves identifying all potential data sources, securing access, and establishing the mechanisms to gather data for analysis and model development.

The quality and completeness of data collection directly impacts every subsequent phase. Poor data collection leads to poor models β€” no amount of sophisticated algorithms can compensate for missing or inadequate data.

This process produces a Data Inventory that catalogs all available data sources and a Data Access Plan that ensures the team can reliably retrieve data throughout the project.

πŸ—ƒοΈ Common Data Source Types

ML projects typically draw from multiple data source types. Identify all relevant sources for your project:

🏒 Internal Systems
  • Transactional databases (SQL, NoSQL)
  • Data warehouses (Snowflake, BigQuery)
  • CRM systems (Salesforce, HubSpot)
  • ERP systems (SAP, Oracle)
  • Log files and application data
  • Internal APIs and microservices
🌐 External Sources
  • Third-party APIs (weather, financial)
  • Public datasets (government, research)
  • Data vendors and marketplaces
  • Social media platforms
  • Web scraping (with permissions)
  • Partner data sharing
⚑ Streaming Data
  • IoT sensors and devices
  • Real-time event streams
  • Message queues (Kafka, RabbitMQ)
  • Clickstream data
  • Live feeds and webhooks
  • Monitoring and telemetry
βœ‹ Manual/Generated
  • Spreadsheets and exports
  • Survey responses
  • Manual annotations and labels
  • Expert knowledge capture
  • User feedback and ratings
  • Synthetic data generation
⚑ Key Activities

1. Data Source Identification
   Map all potential data sources based on the problem framing from Phase 1. Consider both obvious and non-obvious sources that might contain useful signals.
2. Access Request & Provisioning
   Request access to identified data sources. Work with data owners, IT, and security teams to obtain necessary permissions and credentials.
3. Data Extraction Setup
   Establish extraction mechanisms: database connections, API integrations, file transfer protocols, or streaming pipelines as appropriate.
4. Initial Data Pull
   Execute an initial data extraction to verify access works and to get representative samples. Document any issues or limitations encountered.
5. Data Inventory Creation
   Document all collected data sources with metadata: source, format, volume, refresh frequency, access method, and data owner.
6. Storage & Organization
   Set up appropriate storage infrastructure (data lake, staging area) and organize collected data with clear naming conventions and folder structures.
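The storage and organization step above can be sketched as a small path helper that enforces a naming convention. The layout (`raw/<source>/<YYYY-MM-DD>/<file>`) and the function name are illustrative assumptions, not a prescribed standard:

```python
from datetime import date

def staging_path(source: str, extract_date: date, filename: str) -> str:
    """Build a consistent staging-area path for a raw extract.

    Layout (an assumption; adapt to your own conventions):
        raw/<source>/<YYYY-MM-DD>/<filename>
    """
    # Normalize the source name so paths stay predictable.
    safe_source = source.strip().lower().replace(" ", "_")
    return f"raw/{safe_source}/{extract_date.isoformat()}/{filename}"

# Example: path for a daily pull from the customer transactions DB.
path = staging_path("Customer Transactions DB", date(2024, 5, 1), "transactions.parquet")
# → "raw/customer_transactions_db/2024-05-01/transactions.parquet"
```

Deriving paths from one function, rather than hand-writing them per extract, keeps the staging area navigable as sources multiply.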
πŸ”§ Collection Methods

Different data sources require different collection approaches. Match the method to your source and requirements:

| Method | Type | Best For | Considerations |
| --- | --- | --- | --- |
| SQL Queries | Batch | Relational databases, data warehouses | Performance impact, query optimization |
| API Integration | API | External services, SaaS platforms | Rate limits, authentication, pagination |
| File Transfer (SFTP/S3) | Batch | Partner data, exports, bulk transfers | Scheduling, file formats, validation |
| Stream Processing | Real-time | IoT, events, clickstream | Infrastructure, ordering, backpressure |
| Web Scraping | Batch | Public web data | Legal compliance, rate limiting, site changes |
| Manual Upload | Manual | Spreadsheets, one-time imports | Human error, format consistency |
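The API row hides most of the real work: pagination. A minimal sketch with the page-fetching function injected, so it can be swapped for a real `requests` call in production; the `{"items": ..., "has_more": ...}` response shape is an assumption to adapt to your API:

```python
from typing import Callable, Iterator

def fetch_all(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Iterate over every record of a page-numbered API.

    `fetch_page(page)` must return {"items": [...], "has_more": bool} --
    an assumed response shape; real APIs vary (cursors, Link headers).
    """
    page = 0
    while True:
        body = fetch_page(page)
        yield from body["items"]
        if not body["has_more"]:
            break
        page += 1

# Simulated API: 250 records served in pages of 100.
RECORDS = [{"id": i} for i in range(250)]

def fake_page(page: int) -> dict:
    chunk = RECORDS[page * 100:(page + 1) * 100]
    return {"items": chunk, "has_more": (page + 1) * 100 < len(RECORDS)}

collected = list(fetch_all(fake_page))
# → 250 records, in source order
```

Injecting the fetcher also makes rate-limit handling easy to add in one place (sleep or back off inside the real `fetch_page`).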
πŸ“‹ Data Inventory Structure

Create a comprehensive inventory for each data source. This becomes a critical reference document throughout the project:

πŸ“Š Sample Data Source Entry
  • Source Name: Customer Transactions DB
  • Source Type: PostgreSQL Database
  • Data Owner: Finance Team / John Smith
  • Access Method: Read-only SQL via VPN
  • Volume: ~50M records, 25GB
  • Refresh Frequency: Daily incremental
  • Date Range: Jan 2020 - Present
  • Key Tables: transactions, customers, products
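The same entry can be kept machine-readable, which makes the inventory easy to validate, diff, and dump to JSON or YAML. A sketch whose field set mirrors the sample above; the class name is an illustrative choice, not part of any standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class DataSourceEntry:
    """One row of the data inventory (fields mirror the sample entry)."""
    source_name: str
    source_type: str
    data_owner: str
    access_method: str
    volume: str
    refresh_frequency: str
    date_range: str
    key_tables: list[str]

entry = DataSourceEntry(
    source_name="Customer Transactions DB",
    source_type="PostgreSQL Database",
    data_owner="Finance Team / John Smith",
    access_method="Read-only SQL via VPN",
    volume="~50M records, 25GB",
    refresh_frequency="Daily incremental",
    date_range="Jan 2020 - Present",
    key_tables=["transactions", "customers", "products"],
)

# asdict() yields a plain dict, ready for json.dump or a catalog API.
record = asdict(entry)
```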
πŸ” Data Access Checklist

Ensure you have addressed all access requirements before considering data collection complete:

βœ“ Access Requirements
  • Data access permissions formally approved
  • Credentials/API keys obtained and secured
  • Network access configured (VPN, firewall rules)
  • Data privacy requirements documented (GDPR, HIPAA)
  • Data retention policies understood
  • Backup access method identified (if primary fails)
βœ“ Technical Requirements
  • Connection tested and verified working
  • Sample data successfully extracted
  • Data format and schema understood
  • Extraction scripts/queries documented
  • Storage location provisioned
  • Naming conventions established
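The technical items above lend themselves to an automated smoke test. A sketch that runs named check functions and records pass/fail; the checks shown are stand-ins for real connection and extraction tests:

```python
from typing import Callable

def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run each named check, treating any raised exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Stand-in checks; replace with real connection/extraction tests.
results = run_checks({
    "connection": lambda: True,      # e.g. open a DB connection
    "sample_extract": lambda: True,  # e.g. SELECT ... LIMIT 10
    "schema_known": lambda: 1 / 0,   # raises -> recorded as a failure
})
# → {"connection": True, "sample_extract": True, "schema_known": False}
```

Rerunning this after any credential or network change catches broken access early, rather than mid-extraction.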
πŸ“¦ Deliverables

  • πŸ“‹ Data Inventory: Comprehensive catalog of all data sources with metadata
  • πŸ”‘ Access Documentation: Credentials, connection details, and access procedures
  • πŸ“Š Initial Data Extracts: Sample data from each source for exploration
  • πŸ“ Extraction Scripts: Documented code for repeatable data extraction
  • πŸ“ Storage Structure: Organized data lake/staging area setup
  • ⚠️ Gap Analysis: Identified missing data and mitigation plans

πŸ› οΈ Recommended Tools

  • πŸ”„ Apache Airflow: Data pipeline orchestration
  • πŸ“Š dbt: Data transformation
  • πŸ”— Fivetran / Airbyte: Data integration
  • ☁️ AWS S3 / GCS: Cloud storage
  • 🐍 Python (pandas, requests): Custom extraction scripts
  • πŸ“‹ Data Catalogs: Alation, Collibra, DataHub
πŸ’‘ Best Practices

  • Start Early with Access Requests: Data access often takes longer than expected. Begin the approval process immediately and in parallel with other activities.
  • Document Everything: Future team members will need to understand and replicate your data collection. Document sources, methods, and any quirks discovered.
  • Validate Early and Often: Don't wait until you have all data to start validation. Check samples early to catch issues before investing significant effort.
  • Think About Reproducibility: Design collection processes that can be repeated reliably. Avoid one-off manual steps that can't be automated.
  • Consider Production Needs: Think ahead to deployment. Can the same data be accessed in production? At what latency? Plan for production from the start.
πŸ’‘ Pro Tips
  • Build relationships with data owners: They're your allies for understanding data nuances and getting timely access.
  • Collect more than you think you need: It's easier to filter down later than to go back for more data.
  • Version your extraction code: Data sources change; versioning helps track what was collected when.
  • Test with small samples first: Don't pull terabytes until you've validated the pipeline works.
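The "small samples first" tip in practice: validate the pipeline against a LIMIT-ed pull before running the full extract. A self-contained sketch using an in-memory SQLite table as a stand-in for the source database; table and column names are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, float(i) * 1.5) for i in range(10_000)],
)

# Pull a small sample first to validate schema and pipeline logic...
sample = conn.execute("SELECT * FROM transactions LIMIT 100").fetchall()
assert len(sample) == 100  # cheap sanity check before committing to the full pull

# ...then run the full extraction once the sample looks right.
full = conn.execute("SELECT * FROM transactions").fetchall()
```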
⚠️ Common Pitfalls
  • Underestimating access time: Enterprise data access can take weeks or months.
  • Ignoring data privacy: Collecting PII without proper approvals can halt projects.
  • Missing data drift: Data schemas and formats change over timeβ€”plan for it.
  • Forgetting about labels: Supervised learning needs labelsβ€”where will they come from?
πŸ“„ Templates & Resources

  • πŸ“₯ Data Inventory Template: Spreadsheet for cataloging data sources
  • πŸ“₯ Access Request Form: Standard form for data access requests
  • πŸ“₯ Data Collection Checklist: Comprehensive checklist for completeness
  • πŸ“₯ Sample Extraction Scripts: Python templates for common sources