A Data Science Process

Neil D. Lawrence

LT2, William Gates Building

Data Context

  • Must not forget context of data.
  • Three challenges:
    1. Paradoxes of Data Society
    2. Quantifying value of data
    3. Privacy, loss of control and marginalization
    • Place people at the heart.

Deploying Artificial Intelligence

  • Challenges in deploying AI.
  • Currently this is in the form of “machine learning systems”

Internet of People

  • Fog computing: barrier between cloud and device blurring.
    • Computing on the Edge
  • Complex feedback between algorithm and implementation

Deploying ML in Real World: Machine Learning Systems Design

  • Major new challenge for systems designers.
  • Internet of Intelligence but currently:
    • AI systems are fragile

The Fynesse Framework

  • Access
  • Assess
  • Address

CRISP-DM

More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills—skills that are also necessary for understanding biases in the data, and for debugging logging output from code.

Cathy O’Neil and Rachel Schutt, from O’Neil and Schutt (2013)

Experiment, Analyze, Design

A Vision

We don’t know what science we’ll want to do in five years’ time, but we won’t want slower experiments, we won’t want more expensive experiments and we won’t want a narrower selection of experiments.

What do we want?

  • Faster, cheaper and more diverse experiments.
  • Better ecosystems for experimentation.
  • Data oriented architectures.
  • Data maturity assessments.
  • Data readiness levels.

Ride Sharing: Service Oriented

Ride Sharing: Data Oriented

Ride Sharing: Hypothetical

Inspiration

  • Operational data science with:
    • Data Science Africa
    • Amazon (particularly in supply chain)
    • The Royal Society DELVE Group (pandemic advice)

The Fynesse Framework

  • Three aspects
    • Access - before data is available electronically
    • Assess - work that can be done without the question
    • Address - giving answers to question at hand
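The separation between the three aspects can be sketched in code. This is a minimal illustration only: the three function names mirror the aspects, but the toy records, the field names `site` and `count`, and the question being answered are all hypothetical and not taken from the Fynesse template.

```python
from statistics import mean


def access():
    """Access: make the data electronically available.

    Here this is a hard-coded toy sample; in practice it would cover
    legal, ethical and extraction work.
    """
    return [
        {"site": "A", "count": 4},
        {"site": "B", "count": None},  # a missing entry
        {"site": "C", "count": 7},
    ]


def assess(records):
    """Assess: question-independent work, e.g. dropping missing values."""
    return [r for r in records if r["count"] is not None]


def address(records):
    """Address: answer the question at hand, here the mean count per site."""
    return mean(r["count"] for r in records)


clean = assess(access())
answer = address(clean)
print(len(clean), answer)
```

Keeping `assess` free of any knowledge of the question is what would make it reusable when the question changes.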

Access

  • Work to make data electronically accessible.
  • Legal work
  • Ethical work
  • Extraction of data from where it’s held
    • mobile phones, within software ecosystem, physical log books
  • Associated with data readiness level C.

Access Case Study: Crash Map Kampala

Bagonza Jimmy Kinyonyi, Michael T. Smith

Access Automation

  • Digital Transformation
  • Post-Digital Transformation

Assess

  • Only things you can do without knowing the “question”.
    • This ensures assess is reusable across tasks.
  • Driven by happenstance data.
  • Associated with data readiness level B

Case Study: Text Mining for Covid Misinformation

Joyce Nakatumba-Nabende

Data Access

Data Assessment

Data Exploration

  • Used the initial data analysis phase to understand the data, uncover patterns and gain insights.
  • Generated word clouds from the raw dataset.
  • Performed topic modelling using Latent Dirichlet Allocation (LDA) to discover topics and similarities in the Covid-19 social media data.
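A word cloud is driven by term frequencies, so the exploration step can be illustrated with a small counting sketch. The posts and the stop-word list below are invented for illustration and are not from the case study dataset; the study's actual tooling is not stated in the slides.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the study's list).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "are", "at"}


def term_frequencies(posts):
    """Count lower-cased tokens across posts, skipping stop words."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z0-9']+", post.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Hypothetical example posts.
posts = [
    "The vaccine is available in hospitals",
    "Truck drivers are tested at the border",
    "Vaccine queues at hospitals",
]
freqs = term_frequencies(posts)
print(freqs.most_common(3))
```

The resulting counts are what a word cloud would scale its words by; topic modelling with LDA starts from the same kind of per-document term counts.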

Data Annotation Process

  • Carried out by 7 annotators who could read/comprehend English and Luganda
  • Used the Doccano Tool - an open source text annotation tool.

Data Annotation Process

  • Annotation attributes:
    1. Data source [Facebook, Twitter]
    2. Language [English, Luganda, and codemixed]
    3. Aspect [truck drivers, hospitals, vaccine, cases, SOPs, NPIs, Testing, Border, Covid19_Impact, Presidential address, death, elections and Covid19]
    4. Sentiment [positive, negative and neutral]
    5. Misinformation [Not Fake, Fake, Partially Fake, and Others]
  • As part of quality assurance, the data was reviewed by an independent team to ensure that the annotation guidelines were followed.
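One way such an annotation scheme could be enforced in code is a small record type that rejects labels outside the allowed sets. This is a sketch under assumptions: the class name `Annotation` and its field names are hypothetical, the allowed values are copied from the list above, and the aspect field is left as free text because the aspect list is long.

```python
from dataclasses import dataclass

# Allowed values taken from the annotation attributes listed above.
SOURCES = {"Facebook", "Twitter"}
LANGUAGES = {"English", "Luganda", "codemixed"}
SENTIMENTS = {"positive", "negative", "neutral"}
MISINFORMATION = {"Not Fake", "Fake", "Partially Fake", "Others"}


@dataclass(frozen=True)
class Annotation:
    """One annotated post (hypothetical record layout)."""

    source: str
    language: str
    aspect: str  # free text here; not validated in this sketch
    sentiment: str
    misinformation: str

    def __post_init__(self):
        checks = [
            ("source", self.source, SOURCES),
            ("language", self.language, LANGUAGES),
            ("sentiment", self.sentiment, SENTIMENTS),
            ("misinformation", self.misinformation, MISINFORMATION),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name}={value!r} not in {sorted(allowed)}")


example = Annotation("Twitter", "Luganda", "vaccine", "neutral", "Not Fake")
print(example)
```

Rejecting unknown labels at construction time is a cheap check that can run alongside the independent quality review described above.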

Table: Portion of data that was annotated.

                         | Twitter Data | Facebook Data |
Initial dataset          | 15,354       | 430,075       |
Dataset after Annotation | 3,527        | 4,479         |

Cohen’s kappa was used to measure inter-annotator agreement.

Table: Cohen’s kappa agreement scores for the data.

Attribute      | Cohen’s kappa |
Language       | 0.89          |
Aspect         | 0.69          |
Sentiment      | 0.73          |
Misinformation | 0.74          |
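For reference, Cohen’s kappa for two annotators compares observed agreement p_o with the agreement p_e expected by chance from each annotator’s label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that standard formula; the function name and the toy labels are illustrative and are not the study’s data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two annotators on ten posts.
annotator_1 = ["P", "P", "N", "N", "P", "N", "P", "N", "P", "N"]
annotator_2 = ["P", "P", "N", "N", "P", "N", "P", "N", "N", "P"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on 8 of 10 items (p_o = 0.8) and each uses both labels half the time (p_e = 0.5), giving a kappa of 0.6.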

Automating Assess

  • Automated scheme detection
  • Automated data type detection (Valera and Ghahramani (2017))
  • The automatic statistician (Lloyd et al. (2014))
  • AI for Data Analytics (Nazábal et al. (2020))
  • Joyce’s case study also gives us POS tagging for new languages.
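To give a flavour of automated data type detection, here is a deliberately simple rule-based sketch. It is not the Bayesian method of Valera and Ghahramani (2017); the function name `infer_type`, the type labels and the thresholds are assumptions chosen for illustration.

```python
def infer_type(values):
    """Guess a coarse type for a column given as strings (or None)."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "empty"

    def all_parse(cast):
        # True if every present value can be converted by `cast`.
        for v in present:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if {v.lower() for v in present} <= {"true", "false", "yes", "no"}:
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    # Few distinct values relative to column length suggests categories.
    if len(set(present)) <= max(1, len(present) // 2):
        return "categorical"
    return "text"


print(infer_type(["1", "2", "3"]))
print(infer_type(["Facebook", "Twitter", "Facebook", "Twitter"]))
```

Heuristics like this fail on ambiguous columns, such as integer codes that are really categories, which is part of the motivation for the probabilistic approaches cited above.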

AI for Data Analytics

Address

  • Address the question.
  • Now we bring the context in.
  • Could require:
    • Confirmatory data analysis
    • An ML prediction model
    • Visualisation through a dashboard
    • An Excel spreadsheet
  • Associated with data readiness level A.

Automating Address

  • Auto ML
  • Automatic Statistician
  • Automatic Visualization

AutoML

Fynesse Template

Create Repository from Template

Assignment

Conclusions

  • Bandwidth constraints of humans
  • Big Data Paradox
  • Big Model Paradox
  • Access, Assess, Address

Q&A

Thanks!

References

Machine Learning Systems Design

Fragility of AI Systems

  • They are componentwise built from ML Capabilities.
  • Each capability is independently constructed and verified.
    • Pedestrian detection
    • Road line detection
  • Important for verification purposes.

Pigeonholing

Robust

  • Need to move beyond pigeonholing tasks.
  • Need new approaches to both the design of the individual components, and the combination of components within our AI systems.

Rapid Reimplementation

  • Whole systems are being deployed.
  • But they change their environment.
  • They experience evolved adversarial behaviour.

Machine Learning Systems Design

Figure: Science on Holborn Viaduct, cradling the Centrifugal Governor.

On Governors, James Clerk Maxwell 1868

Adversaries

  • Stuxnet
  • Mischievous-Adversarial

An Intelligent System

Joint work with M. Milo

Peppercorns

  • A new name for system failures which aren’t bugs.
  • Difference between finding a fly in your soup vs a peppercorn in your soup.

Turnaround And Update

  • There is a massive need for turnaround and update.
  • A redeploy of the entire system.
    • This involves changing the way we design and deploy.
  • Interface between security engineering and machine learning.

The Three Ds of Machine Learning Systems Design

  • Three primary challenges of Machine Learning Systems Design.
  1. Decomposition
  2. Data
  3. Deployment

Data Science and Professionalisation

  • Industrial Revolution 4.0?
  • Industrial Revolution (1760-1840) term coined by Arnold Toynbee (1852-1883).
  • Maybe: But this one is dominated by data not capital
  • A revolution in information rather than energy.
  • That presents challenges and opportunities
  • Consider Apple vs Nokia: How you handle disruption.

Compare the digital oligarchy with how Africa can benefit from the data revolution.

A Time for Professionalisation?

  • New technologies historically led to new professions:
    • Brunel (born 1806): Civil, mechanical, naval
    • Tesla (born 1856): Electrical and power
    • William Shockley (born 1910): Electronic
    • Watts S. Humphrey (born 1927): Software

Why?

  • Codification of best practice.
  • Developing trust

Where are we?

  • Perhaps around the 1980s of programming.
    • We understand if, for, and procedures
    • But we don’t share best practice.
  • Let’s avoid the over formalisation of software engineering.

Data as a Convener

  • Data allows externalisation of cognition.
  • Even when the data does not exist, we can ask: what data would we want?

Delve

Delve Reports

  1. Facemasks 4th May 2020 (The DELVE Initiative, 2020a)
  2. Test, Trace, Isolate 27th May 2020 (The DELVE Initiative, 2020b)
  3. Nosocomial Infections 6th July 2020 (The DELVE Initiative, 2020c)
  4. Schools 24th July 2020 (The DELVE Initiative, 2020d)
  5. Economics 14th August 2020 (The DELVE Initiative, 2020e)
  6. Vaccines 1st October 2020 (The DELVE Initiative, 2020f)
  7. Data 24th November 2020 (The DELVE Initiative, 2020g)

Delve Data Report

  • Surveillance data situation.
    • REACT Study (Imperial)
    • ONS Coronavirus (COVID-19) Infection Survey
    • RECOVERY Trial (Dexamethasone)
  • Happenstance data.
    • Our report’s focus (The DELVE Initiative, 2020g)

Delve Data Report: Recommendations

  • Update statutory objective of ONS to accommodate happenstance data.
  • ONS and ICO to collaborate on data driving license to standardise access processes.
  • Interdisciplinary pathfinder projects across government, business and academia
    • Nowcasting of economic metrics
    • Movement of populations (mobile phone data).

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R., 2000. CRISP-DM 1.0: Step-by-step data mining guide.
Lloyd, J.R., Duvenaud, D., Grosse, R., Tenenbaum, J.B., Ghahramani, Z., 2014. Automatic construction and natural-language description of nonparametric regression models, in: AAAI.
Nazábal, A., Williams, C.K.I., Colavizza, G., Smith, C.R., Williams, A., 2020. Data engineering for data analytics: A classification of the issues, and case studies.
O’Neil, C., Schutt, R., 2013. Doing data science: Straight talk from the frontline. O’Reilly.
The DELVE Initiative, 2020g. Data readiness: Lessons from an emergency. The Royal Society.
The DELVE Initiative, 2020e. Economic aspects of the COVID-19 crisis in the UK. The Royal Society.
The DELVE Initiative, 2020a. Face masks for the general public. The Royal Society.
The DELVE Initiative, 2020c. Scoping report on hospital and health care acquisition of COVID-19 and its control. The Royal Society.
The DELVE Initiative, 2020d. Balancing the risks of pupils returning to schools. The Royal Society.
The DELVE Initiative, 2020b. Test, trace, isolate. The Royal Society.
The DELVE Initiative, 2020f. SARS-CoV-2 vaccine development & implementation; scenarios, options, key decisions. The Royal Society.
Valera, I., Ghahramani, Z., 2017. Automatic discovery of the statistical types of variables in a dataset, in: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, pp. 3521–3529.