πŸ“Š Phase 2 β€’ Process 2.1

Data Collection

Identify, access, and gather all relevant data sources needed for the ML project, creating a comprehensive data inventory and establishing data pipelines.

Duration: 5-10 Days
Key Roles: Data Engineer, Data Scientist
Complexity: 🟑 Medium-High
🎯 Overview

Data Collection is the first process of Phase 2 and marks the transition from planning to execution. This process involves identifying all potential data sources, securing access, and establishing the mechanisms to gather data for analysis and model development.

The quality and completeness of data collection directly impacts every subsequent phase. Poor data collection leads to poor models β€” no amount of sophisticated algorithms can compensate for missing or inadequate data.

This process produces a Data Inventory that catalogs all available data sources and a Data Access Plan that ensures the team can reliably retrieve data throughout the project.

πŸ—ƒοΈ Common Data Source Types

ML projects typically draw from multiple data source types. Identify all relevant sources for your project:

🏒 Internal Systems
  • Transactional databases (SQL, NoSQL)
  • Data warehouses (Snowflake, BigQuery)
  • CRM systems (Salesforce, HubSpot)
  • ERP systems (SAP, Oracle)
  • Log files and application data
  • Internal APIs and microservices
🌐 External Sources
  • Third-party APIs (weather, financial)
  • Public datasets (government, research)
  • Data vendors and marketplaces
  • Social media platforms
  • Web scraping (with permissions)
  • Partner data sharing
⚑ Streaming Data
  • IoT sensors and devices
  • Real-time event streams
  • Message queues (Kafka, RabbitMQ)
  • Clickstream data
  • Live feeds and webhooks
  • Monitoring and telemetry
βœ‹ Manual/Generated
  • Spreadsheets and exports
  • Survey responses
  • Manual annotations and labels
  • Expert knowledge capture
  • User feedback and ratings
  • Synthetic data generation
⚑ Key Activities

1. Data Source Identification
   Map all potential data sources based on the problem framing from Phase 1. Consider both obvious and non-obvious sources that might contain useful signals.
2. Access Request & Provisioning
   Request access to identified data sources. Work with data owners, IT, and security teams to obtain necessary permissions and credentials.
3. Data Extraction Setup
   Establish extraction mechanisms: database connections, API integrations, file transfer protocols, or streaming pipelines as appropriate.
4. Initial Data Pull
   Execute an initial data extraction to verify access works and to get representative samples. Document any issues or limitations encountered.
5. Data Inventory Creation
   Document all collected data sources with metadata: source, format, volume, refresh frequency, access method, and data owner.
6. Storage & Organization
   Set up appropriate storage infrastructure (data lake, staging area) and organize collected data with clear naming conventions and folder structures.
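The storage and organization step above can be sketched as a small path helper that enforces a naming convention. The layout (`raw/<source>/<YYYY-MM-DD>/<file>`) and the function name are illustrative assumptions, not a prescribed standard:

```python
from datetime import date

def staging_path(source: str, extract_date: date, filename: str) -> str:
    """Build a consistent staging-area path for a raw extract.

    Layout (an assumption; adapt to your own conventions):
        raw/<source>/<YYYY-MM-DD>/<filename>
    """
    # Normalize the source name so paths stay predictable.
    safe_source = source.strip().lower().replace(" ", "_")
    return f"raw/{safe_source}/{extract_date.isoformat()}/{filename}"

# Example: path for a daily pull from the customer transactions DB.
path = staging_path("Customer Transactions DB", date(2024, 5, 1), "transactions.parquet")
# → "raw/customer_transactions_db/2024-05-01/transactions.parquet"
```

Deriving paths from one function, rather than hand-writing them per extract, keeps the staging area navigable as sources multiply.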
πŸ”§ Collection Methods

Different data sources require different collection approaches. Match the method to your source and requirements:

| Method | Type | Best For | Considerations |
| --- | --- | --- | --- |
| SQL Queries | Batch | Relational databases, data warehouses | Performance impact, query optimization |
| API Integration | API | External services, SaaS platforms | Rate limits, authentication, pagination |
| File Transfer (SFTP/S3) | Batch | Partner data, exports, bulk transfers | Scheduling, file formats, validation |
| Stream Processing | Real-time | IoT, events, clickstream | Infrastructure, ordering, backpressure |
| Web Scraping | Batch | Public web data | Legal compliance, rate limiting, site changes |
| Manual Upload | Manual | Spreadsheets, one-time imports | Human error, format consistency |
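The API row hides most of the real work: pagination. A minimal sketch with the page-fetching function injected, so it can be swapped for a real `requests` call in production; the `{"items": ..., "has_more": ...}` response shape is an assumption to adapt to your API:

```python
from typing import Callable, Iterator

def fetch_all(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Iterate over every record of a page-numbered API.

    `fetch_page(page)` must return {"items": [...], "has_more": bool} --
    an assumed response shape; real APIs vary (cursors, Link headers).
    """
    page = 0
    while True:
        body = fetch_page(page)
        yield from body["items"]
        if not body["has_more"]:
            break
        page += 1

# Simulated API: 250 records served in pages of 100.
RECORDS = [{"id": i} for i in range(250)]

def fake_page(page: int) -> dict:
    chunk = RECORDS[page * 100:(page + 1) * 100]
    return {"items": chunk, "has_more": (page + 1) * 100 < len(RECORDS)}

collected = list(fetch_all(fake_page))
# → 250 records, in source order
```

Injecting the fetcher also makes rate-limit handling easy to add in one place (sleep or back off inside the real `fetch_page`).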
πŸ“‹ Data Inventory Structure

Create a comprehensive inventory for each data source. This becomes a critical reference document throughout the project:

πŸ“Š Sample Data Source Entry
  • Source Name: Customer Transactions DB
  • Source Type: PostgreSQL Database
  • Data Owner: Finance Team / John Smith
  • Access Method: Read-only SQL via VPN
  • Volume: ~50M records, 25GB
  • Refresh Frequency: Daily incremental
  • Date Range: Jan 2020 - Present
  • Key Tables: transactions, customers, products
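The same entry can be kept machine-readable, which makes the inventory easy to validate, diff, and dump to JSON or YAML. A sketch whose field set mirrors the sample above; the class name is an illustrative choice, not part of any standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class DataSourceEntry:
    """One row of the data inventory (fields mirror the sample entry)."""
    source_name: str
    source_type: str
    data_owner: str
    access_method: str
    volume: str
    refresh_frequency: str
    date_range: str
    key_tables: list[str]

entry = DataSourceEntry(
    source_name="Customer Transactions DB",
    source_type="PostgreSQL Database",
    data_owner="Finance Team / John Smith",
    access_method="Read-only SQL via VPN",
    volume="~50M records, 25GB",
    refresh_frequency="Daily incremental",
    date_range="Jan 2020 - Present",
    key_tables=["transactions", "customers", "products"],
)

# asdict() yields a plain dict, ready for json.dump or a catalog API.
record = asdict(entry)
```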
πŸ” Data Access Checklist

Ensure you have addressed all access requirements before considering data collection complete:

βœ“ Access Requirements
  • Data access permissions formally approved
  • Credentials/API keys obtained and secured
  • Network access configured (VPN, firewall rules)
  • Data privacy requirements documented (GDPR, HIPAA)
  • Data retention policies understood
  • Backup access method identified (if primary fails)
βœ“ Technical Requirements
  • Connection tested and verified working
  • Sample data successfully extracted
  • Data format and schema understood
  • Extraction scripts/queries documented
  • Storage location provisioned
  • Naming conventions established
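The technical items above lend themselves to an automated smoke test. A sketch that runs named check functions and records pass/fail; the checks shown are stand-ins for real connection and extraction tests:

```python
from typing import Callable

def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run each named check, treating any raised exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Stand-in checks; replace with real connection/extraction tests.
results = run_checks({
    "connection": lambda: True,      # e.g. open a DB connection
    "sample_extract": lambda: True,  # e.g. SELECT ... LIMIT 10
    "schema_known": lambda: 1 / 0,   # raises -> recorded as a failure
})
# → {"connection": True, "sample_extract": True, "schema_known": False}
```

Rerunning this after any credential or network change catches broken access early, rather than mid-extraction.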
πŸ“¦ Deliverables

  • πŸ“‹ Data Inventory: Comprehensive catalog of all data sources with metadata
  • πŸ”‘ Access Documentation: Credentials, connection details, and access procedures
  • πŸ“Š Initial Data Extracts: Sample data from each source for exploration
  • πŸ“ Extraction Scripts: Documented code for repeatable data extraction
  • πŸ“ Storage Structure: Organized data lake/staging area setup
  • ⚠️ Gap Analysis: Identified missing data and mitigation plans

πŸ› οΈ Recommended Tools

  • πŸ”„ Apache Airflow: Data pipeline orchestration
  • πŸ“Š dbt: Data transformation
  • πŸ”— Fivetran / Airbyte: Data integration
  • ☁️ AWS S3 / GCS: Cloud storage
  • 🐍 Python (pandas, requests): Custom extraction scripts
  • πŸ“‹ Data Catalogs: Alation, Collibra, DataHub
πŸ’‘ Best Practices

  • Start Early with Access Requests: Data access often takes longer than expected. Begin the approval process immediately and in parallel with other activities.
  • Document Everything: Future team members will need to understand and replicate your data collection. Document sources, methods, and any quirks discovered.
  • Validate Early and Often: Don't wait until you have all data to start validation. Check samples early to catch issues before investing significant effort.
  • Think About Reproducibility: Design collection processes that can be repeated reliably. Avoid one-off manual steps that can't be automated.
  • Consider Production Needs: Think ahead to deployment. Can the same data be accessed in production? At what latency? Plan for production from the start.
πŸ’‘ Pro Tips
  • Build relationships with data owners: They're your allies for understanding data nuances and getting timely access.
  • Collect more than you think you need: It's easier to filter down later than to go back for more data.
  • Version your extraction code: Data sources change; versioning helps track what was collected when.
  • Test with small samples first: Don't pull terabytes until you've validated the pipeline works.
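The "small samples first" tip in practice: validate the pipeline against a LIMIT-ed pull before running the full extract. A self-contained sketch using an in-memory SQLite table as a stand-in for the source database; table and column names are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, float(i) * 1.5) for i in range(10_000)],
)

# Pull a small sample first to validate schema and pipeline logic...
sample = conn.execute("SELECT * FROM transactions LIMIT 100").fetchall()
assert len(sample) == 100  # cheap sanity check before committing to the full pull

# ...then run the full extraction once the sample looks right.
full = conn.execute("SELECT * FROM transactions").fetchall()
```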
⚠️ Common Pitfalls
  • Underestimating access time: Enterprise data access can take weeks or months.
  • Ignoring data privacy: Collecting PII without proper approvals can halt projects.
  • Missing data drift: Data schemas and formats change over timeβ€”plan for it.
  • Forgetting about labels: Supervised learning needs labelsβ€”where will they come from?
πŸ“„ Templates & Resources

  • πŸ“₯ Data Inventory Template: Spreadsheet for cataloging data sources
  • πŸ“₯ Access Request Form: Standard form for data access requests
  • πŸ“₯ Data Collection Checklist: Comprehensive checklist for completeness
  • πŸ“₯ Sample Extraction Scripts: Python templates for common sources