RoguePlanetoid

Introducing Data Science, to understand what Twitter really thinks of Elon!

24th February 2023

About

Introducing Data Science, to understand what Twitter really thinks of Elon! was a Tech on the Tyne talk presented by Dr Hollie Johnson, a senior data scientist at the National Innovation Centre for Data.

National Innovation Centre for Data (NICD)

The situation

National data skills shortage
Recruitment challenges
Data quality issues
Cost of data (data is not inherently an asset)
Low level of data readiness in organisations

Who are NICD?

NICD is a hub for data innovation and data skills. They are neither academics nor a traditional consultancy: they wear many hats and run collaborative data skills projects. They work with the public sector, SMEs, the voluntary sector, large corporations, start-ups and university departments.

Data skills projects

Projects are owned by the organisation and are a mixture of collaborative skills transfer sessions and offline work, aimed at solving a real business problem: their problem, their data, their solution. Organisations finish with a valuable project output, and their people gain new data skills to take forward to future projects.

What they do

They prepare and deliver conventional teaching materials, pair program with clients, read and understand cutting-edge research to find what is relevant, help organisations understand the options available and oversee the “data science process”, along with many other things.

What makes a good data science project?

A successful data science project is one that delivers value to an organisation. That value can be increased sales, reduced churn, better resourcing (human or otherwise), reduced waste or increased efficiency. Research shows that most failures are due to poor project management and scoping.

NICD data science workflow

The workflow starts with business understanding (goal, objectives, deliverables and resources), followed by data preparation, modelling, deployment and monitoring.

Getting started with Data Science

Organisations can start by considering their current use of data: is it effective, is the data strategy supported by the organisational culture, and who and what drives decisions on data projects? They can then consider business challenges: which of the challenges being faced can be addressed with data? Always pick the lowest-hanging fruit first, aim for small wins often, and iterate.

As an individual, remember that data science needs varied skill sets. Ask how your existing skills can support data science projects and where you can add value to an organisation. Then develop technical skills in either science or engineering, along with data literacy and communication.

Considerations

What is the level of data maturity where you work? What are the challenges around data usage, and are they problem-based or resource-based? What sort of data do you encounter, both internal and external? How can you demonstrate value, and does this come from the top down or the bottom up?

Natural Language Processing

NLP (Natural Language Processing) is a field covering many language-based tasks such as text segmentation, named entity recognition, sentiment analysis, text summarisation and machine translation. It is used for chatbots and virtual assistants, spam filtering, web search and text analysis.

Sentiment Analysis

Sentiment analysis filters text into positive, neutral and negative, so that feedback can be fed into processes and any issues fixed.

NLP emerged in the 1940s, following World War II. The 1960s saw the split between symbolic and stochastic NLP, with further research in the 1970s leading to probabilistic models in the 1980s. LSTM models were introduced in the 1990s, the 2010s saw deep learning methods, and the 2020s have seen the dominance of the transformer architecture, which underpins models such as ChatGPT.

How can we develop meaningful representation of language?

Bag of words - a unique token is assigned to each word in the text.
Word2Vec - an unsupervised learning approach that learns a continuous multidimensional vector representation of each word and can be used to predict a “centre word”.
Contextual word embeddings - allow each word to have more than one representation depending on the context.
Transformers - use self-attention mechanisms and positional encodings, e.g. BERT and GPT.
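
As a rough illustration of the bag of words idea, the sketch below assigns each distinct word an index and turns a text into a vector of word counts. The tokenisation rule, the function names and the example tweets are assumptions made for illustration and are not from the talk.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Give each distinct word a unique index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def bag_of_words(document, vocabulary):
    """Return a count vector: position i holds how often word i occurs."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Made-up example tweets, for illustration only.
tweets = [
    "I love the new update",
    "I do not love the new logo",
]
vocabulary = build_vocabulary(tweets)
vectors = [bag_of_words(tweet, vocabulary) for tweet in tweets]
```

Word order is discarded, so "love" and "not love" produce very similar vectors, which is one reason the richer representations in the list above are useful.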

Word embeddings

Word embeddings can capture relationships between words, for example verb tense or country and capital. Each word is translated from a higher-dimensional space to a lower-dimensional one that represents the words surrounding it.
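
To make the country and capital relationship concrete, here is a toy sketch of the usual vector arithmetic on embeddings. The vectors are hand-made for illustration, not learned by any model, and the `analogy` helper is a hypothetical name.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not real embeddings).
# The first three dimensions mark the country, the last marks "is a capital".
embeddings = {
    "france": [1.0, 0.0, 0.0, 0.0],
    "paris":  [1.0, 0.0, 0.0, 1.0],
    "italy":  [0.0, 1.0, 0.0, 0.0],
    "rome":   [0.0, 1.0, 0.0, 1.0],
    "spain":  [0.0, 0.0, 1.0, 0.0],
    "madrid": [0.0, 0.0, 1.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three input words so the answer is a new word.
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


# France is to Paris as Italy is to ...?
result = analogy("france", "paris", "italy", embeddings)
```

With learned embeddings the match is only approximate, so the nearest neighbour is taken rather than an exact equality.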

BERT

BERT is pre-trained using semi-supervised training on large amounts of text. Another layer can then be added on top and trained with supervised learning on a specific task using a labelled dataset. BERT can use the output at a masked word's position to predict the masked word, finding out what the missing word might be.
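
The masked-word idea can be tried with a pre-trained model. The following is a minimal sketch, assuming the third-party `transformers` library is installed and that the `bert-base-uncased` checkpoint can be downloaded. The helper names are hypothetical, and the talk did not specify which model or code was used.

```python
def mask_word(sentence, position, mask_token="[MASK]"):
    """Replace the word at the given position with BERT's mask token."""
    words = sentence.split()
    words[position] = mask_token
    return " ".join(words)


def predict_masked_word(masked_sentence, model_name="bert-base-uncased", top_k=3):
    """Return the model's top guesses for the masked word as (word, score) pairs.

    The import is done here so that mask_word can be used without the
    third-party library. The model name is an assumption, not from the talk.
    """
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


masked = mask_word("the cat sat on the mat", 1)
```

Calling `predict_masked_word(masked)` downloads the model on first use and returns candidate words with their scores.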

In Practice

Fortunately we don't need to implement or train any of this ourselves, as many pre-trained models are available and many are fine-tuned for specific tasks. For example, a model from Hugging Face can be applied to an extracted set of tweets on a given subject to perform sentiment analysis.
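
A minimal sketch of that approach is shown below, assuming the third-party `transformers` library is installed. The function names are hypothetical and the choice of model is left open, since the talk notes do not record which one was used. Leaving `model_name` as `None` falls back to the library's default sentiment model.

```python
from collections import Counter


def classify_tweets(tweets, model_name=None):
    """Label each tweet using a pre-trained sentiment model.

    Returns a list of dicts with "label" and "score" keys. The import is
    done here so the summary helper works without the third-party library.
    """
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    return classifier(list(tweets))


def summarise_sentiment(results):
    """Return the share of tweets given each label, e.g. {"positive": 0.5}."""
    counts = Counter(result["label"].lower() for result in results)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Passing the output of `classify_tweets` to `summarise_sentiment` gives an overall picture of opinion on the subject. Which labels appear depends on the model chosen: some only return positive and negative, while others also return neutral.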