The Data Science Process

Evolved Relationship with Information

New Flow of Information

Embodiment Factors

                          computer      human
bits/min                  billions      2,000
billion calculations/s    ~100          a billion
embodiment                20 minutes    5 billion years

The Three Ds of Machine Learning Systems Design

  • Three primary challenges of Machine Learning Systems Design.
  1. Decomposition
  2. Data
  3. Deployment

Data

  • Hard to overstate its importance.
  • Half the equation of \(\text{data} + \text{model}\).
  • Often utterly neglected.

Data Neglect

  • Arises for two reasons.
    1. Data cleaning is perceived as tedious.
    2. Data cleaning is complex.

Data Cleaning

  • Seems difficult to formulate into readily teachable principles.
  • Heavily neglected in data science, statistics and ML courses.
  • In practice most scientists spend around 80% of their time cleaning data.

The Software Crisis

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra (1930-2002), The Humble Programmer

The Data Crisis

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high-quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

Me

Data Readiness Levels

Data Readiness Levels (Lawrence, 2017b): https://arxiv.org/pdf/1705.02245.pdf

Three Grades of Data Readiness

  • Grade C - accessibility
    • Transition: data becomes electronically available
  • Grade B - validity
    • Transition: pose a question to the data.
  • Grade A - usability
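One way to read the grades and transitions above is as a simple two-step progression. The sketch below is only an illustration of that reading, not a definition taken from the paper: the function name `current_grade` and its two boolean flags are hypothetical, and real data sets will rarely reduce to two yes/no answers.

```python
def current_grade(electronically_available: bool, question_posed: bool) -> str:
    """Return the readiness grade under one reading of the slide.

    Assumption: data stays in Grade C until it is electronically
    available, then stays in Grade B until a question has been posed
    to it, and only then is considered Grade A.
    """
    if not electronically_available:
        return "C"  # accessibility is still the open problem
    if not question_posed:
        return "B"  # available, but not yet examined against a task
    return "A"      # usability for a specific question


# A data set that has been loaded but has no question attached yet.
print(current_grade(electronically_available=True, question_posed=False))
```

The point of making the grade explicit is that a project can report where its data stands before any modelling starts.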

Data Science as Debugging

80/20 in Data Science

  • Anecdotally, for a given challenge:
    • 80% of time is spent on data wrangling.
    • 20% of time is spent on modelling.
  • Many companies employ ML Engineers who focus on models, not data.

Lessons

  1. When you begin an analysis, behave as a debugger.
  • Write test code as you go.
    • Document tests … make them accessible.
  • Be constantly skeptical.
  • Develop a deep understanding of the best tools.
  • Share your experience of challenges and have others review your work.
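As a minimal sketch of what "write test code as you go" could look like, the function below collects problems found in a small table of records instead of failing silently. The field names `id` and `age`, the plausibility range, and the function name `check_records` are all hypothetical and stand in for whatever the real data set contains.

```python
def check_records(records):
    """Return a list of human-readable problems found in `records`.

    `records` is a list of dicts. The checks here (missing age,
    implausible age, duplicate id) are illustrative assumptions.
    """
    problems = []
    seen_ids = set()
    for i, row in enumerate(records):
        # Missing values are reported rather than silently dropped.
        if row.get("age") is None:
            problems.append(f"row {i}: missing age")
        elif not 0 <= row["age"] <= 120:
            problems.append(f"row {i}: implausible age {row['age']}")
        # Duplicate identifiers often point to a faulty join or export.
        if row.get("id") in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row.get("id"))
    return problems


sample = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 250},
]
for problem in check_records(sample):
    print(problem)
```

Returning the list of problems, rather than raising on the first one, keeps the checks easy to document and to share with reviewers.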

Lessons

  2. When managing a data science process:
  • Don’t deploy standard agile development. Explore modifications, e.g. Kanban.
  • Don’t leave data scientists alone to wade through the mess.
  • Integrate the data analysis with other team activities.
    • Have software engineers and domain experts work closely with data scientists.

GDPR Origins

How the GDPR May Help

  • Reflection on data ecosystems.
  • GDPR: Good Data Practice Rules
  • When viewed as best practice rather than regulation, the rules highlight problems in data ecosystems.

GDPR in Practice

  • Understand the lawful basis
  • For websites: provide a “Privacy Notice”

Putting in Practice

  • Access
  • Assess
  • Process

GitHub Template

Further Reading

  • Chapter 8 of Lawrence (2024)

  • Chapter 1 of Lawrence (2024)

Thanks!

References

Lawrence, N.D., 2024. The atomic human: Understanding ourselves in the age of AI. Allen Lane.
Lawrence, N.D., 2017b. Data readiness levels. arXiv.
Lawrence, N.D., 2017a. Living together: Mind and machine intelligence. arXiv.