The Data Science Process

Evolved Relationship with Information

New Flow of Information

Evolved Relationship

Embodiment Factors


	computer	human
bits/min	billions	2,000
billion calculations/s	~100	a billion
embodiment	20 minutes	5 billion years

The Three Ds of Machine Learning Systems Design

Three primary challenges of Machine Learning Systems Design.

Decomposition
Data
Deployment

Data

Hard to overstate its importance.
Half the equation of \(\text{data} + \text{model}\).
Often utterly neglected.

Data Neglect

Arises for two reasons.
1. Data cleaning is perceived as tedious.
2. Data cleaning is complex.

Data Cleaning

Seems difficult to formulate into readily teachable principles.
Heavily neglected in data science, statistics and ML courses.
In practice, most scientists spend around 80% of their time cleaning data.

The Software Crisis

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

Edsger Dijkstra (1930-2002), The Humble Programmer

The Data Crisis

The major cause of the data crisis is that machines have become more interconnected than ever before. Data access is therefore cheap, but data quality is often poor. What we need is cheap high-quality data. That implies that we develop processes for improving and verifying data quality that are efficient.

There would seem to be two ways for improving efficiency. Firstly, we should not duplicate work. Secondly, where possible we should automate work.

Me

Data Readiness Levels

Data Readiness Levels (Lawrence, 2017b): https://arxiv.org/pdf/1705.02245.pdf

Three Grades of Data Readiness

Grade C - accessibility
- Transition: data becomes electronically available.
Grade B - validity
- Transition: pose a question to the data.
Grade A - usability
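
As a rough illustration of how the three grades might be tracked in code, the sketch below encodes them as an ordered enum. The names `ReadinessGrade` and `meets` are hypothetical and are not taken from the paper; the sketch only captures the ordering C, then B, then A listed above.

```python
from enum import IntEnum


class ReadinessGrade(IntEnum):
    """Hypothetical encoding of the three grades; a higher value means more ready."""
    C = 1  # accessibility
    B = 2  # validity
    A = 3  # usability


def meets(current: ReadinessGrade, required: ReadinessGrade) -> bool:
    """Return True if a data set at grade `current` satisfies grade `required`."""
    return current >= required


# A data set that has reached Grade B is accessible, but not yet known to be usable.
grade = ReadinessGrade.B
is_accessible = meets(grade, ReadinessGrade.C)
is_usable = meets(grade, ReadinessGrade.A)
```

Recording the grade alongside a data set is one possible way to make its state explicit before modelling work is planned.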

Data Science as Debugging

Analogy: for software engineers, describe data science as debugging.

80/20 in Data Science

Anecdotally, for a given challenge:
- 80% of time is spent on data wrangling.
- 20% of time is spent on modelling.
Many companies employ ML engineers who focus on models, not data.

Lessons

When you begin an analysis, behave as a debugger.

Write test code as you go.
- document tests … make them accessible.
Be constantly skeptical.
Develop deep understanding of best tools.
Share your experience of challenges; have others review your work.
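
One way to read "write test code as you go" is to keep small, documented checks next to the analysis. The sketch below is a minimal example using only the standard library; the function `check_records`, the field names and the sample rows are all made up for illustration.

```python
def check_records(records, required_fields):
    """Return a list of human-readable problems found in `records`.

    Two checks are made: every required field must be present and non-empty,
    and no two rows may share the same values for the required fields.
    """
    problems = []
    seen = set()
    for i, row in enumerate(records):
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            problems.append(f"row {i}: duplicate of an earlier row")
        seen.add(key)
    return problems


# Made-up sample data: one missing value and one repeated row.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 1, "age": 34},
]
problems = check_records(records, ["id", "age"])
for problem in problems:
    print(problem)
```

Returning a list of problems, rather than stopping at the first failure, keeps the full picture of the data's state visible to anyone reviewing the work.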

Lessons

When managing a data science process.

Don’t deploy standard agile development. Explore modifications, e.g. Kanban.
Don’t leave data scientists alone to wade through the mess.
Integrate the data analysis with other team activities.
- Have software engineers and domain experts work closely with data scientists.

GDPR Origins

How the GDPR May Help

Reflection on data ecosystems.
GDPR: Good Data Practice Rules
When viewed as best practice rather than regulation, they highlight problems in data ecosystems.

GDPR in Practice

Understand the lawful basis
For websites: provide a “Privacy Notice”

Putting in Practice

Access
Assess
Process
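
The three stages above could be sketched as three small functions. This is only an illustration under stated assumptions: the function names mirror the stage names, but the bodies, the inline CSV text and the question being asked are invented here and are not drawn from the template repository.

```python
import csv
import io
from statistics import mean

# Made-up raw data with one missing temperature.
RAW = "city,temp_c\nSheffield,11.5\nCambridge,\nKampala,24.0\n"


def access(text):
    """Access: load the raw data into rows without interpreting it."""
    return list(csv.DictReader(io.StringIO(text)))


def assess(rows):
    """Assess: separate usable rows from rows with a missing temperature."""
    good = [row for row in rows if row["temp_c"] != ""]
    bad = [row for row in rows if row["temp_c"] == ""]
    return good, bad


def process(rows):
    """Process: answer a specific question, here the mean temperature."""
    return mean(float(row["temp_c"]) for row in rows)


rows = access(RAW)
good, bad = assess(rows)
average = process(good)
print(f"{len(bad)} row(s) set aside, mean temperature {average}")
```

Keeping the stages separate means the loading and checking steps can be reused when a different question is posed to the same data.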

GitHub Template

https://github.com/lawrennd/analysis_template/

Thanks!

twitter: @lawrennd
podcast: The Talking Machines
newspaper: Guardian Profile Page
blog posts:

System Zero

The 3Ds of Machine Learning Systems Design

Data Readiness Levels

Data Science as Debugging

References

Lawrence, N.D., 2024. The atomic human: Understanding ourselves in the age of AI. Allen Lane.

Lawrence, N.D., 2017b. Data readiness levels. arXiv.

Lawrence, N.D., 2017a. Living together: Mind and machine intelligence. arXiv.
