Data and Text Analysis
Summer School

Justin Chun-ting Ho

Text Analysis: The Basics

“Systematic, objective, quantitative analysis of message characteristics”

Kimberly A. Neuendorf, The Content Analysis Guidebook

Types of Text Analysis


Degree of human involvement:

  • Human coding (100%)
  • Supervised
  • Unsupervised (0%)

Type of Objective:

  • Scaling
  • Classification

(Source: Justin Grimmer and Brandon Stewart, 2013)

4 Principles of Text Analysis

(Read: Grimmer, J. and Stewart, B. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267-297. doi:10.1093/pan/mps028)

1. All Quantitative Models of Language Are Wrong
—But Some Are Useful

  • The data-generating process for any text is a mystery
  • All methods necessarily fail to provide an accurate account of the data-generating process
  • Meaning can change drastically with context: “Time flies like an arrow. Fruit flies like a banana.”

2. Quantitative Methods Augment Humans, Not Replace Them

  • Text analysis will neither eliminate the need for careful thought nor remove the necessity of reading
  • Rather than replace humans, computers amplify human abilities

3. There Is No Globally Best Method for Automated Text Analysis

  • Different research questions and designs need different models
  • The same model will perform well on some data sets, but poorly on others

4. Validate, Validate, Validate

  • Results can be misleading or simply wrong
  • Supervised methods: demonstrate that the method reliably replicates human coding
  • Unsupervised methods: demonstrate that the measures are as conceptually valid as those from an equivalent supervised model

Bag of Words Assumption


- Word order doesn’t matter

- Under this assumption, the following are treated as identical:

  • I enjoy eating food and being with my family
  • I enjoy eating my family and being with food
  • and being eating enjoy family food I my with
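
The point can be checked with a minimal sketch, written here in Python using only the standard library. The helper name `bag_of_words` is an assumption made for illustration; it counts words after lowercasing and ignores their order.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts, ignoring case and word order."""
    return Counter(text.lower().split())

sentences = [
    "I enjoy eating food and being with my family",
    "I enjoy eating my family and being with food",
    "and being eating enjoy family food I my with",
]

bags = [bag_of_words(s) for s in sentences]

# All three sentences contain the same words with the same counts,
# so their bag-of-words representations are equal.
all_same = bags[0] == bags[1] == bags[2]
print(all_same)
```

Once word order is discarded, the model cannot tell these three sentences apart.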

Original Text

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 181
##
## Article 1. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

Remove Punctuation

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 178
##
## Article 1 All human beings are born free and equal in dignity and rights They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood

To Lower Case

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 178
##
## article 1 all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood

Remove Numbers

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 177
##
## article  all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood

Remove Stopwords

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 135
##
## article   human beings  born free  equal  dignity  rights   endowed  reason  conscience   act towards one another   spirit  brotherhood
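
The cleaning steps up to this point (remove punctuation, lower-case, remove numbers, remove stopwords) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code that produced the output above: the function name `clean` is hypothetical, and `STOPWORDS` is a deliberately tiny list chosen to cover this one sentence, whereas real stopword lists are much longer.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"all", "are", "and", "in", "they", "with", "should", "a", "of"}

def clean(text):
    """Apply the cleaning steps in order and return the remaining tokens."""
    # 1. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Convert to lower case
    text = text.lower()
    # 3. Remove numbers
    text = re.sub(r"\d+", "", text)
    # 4. Split on whitespace and drop stopwords
    return [token for token in text.split() if token not in STOPWORDS]

raw = ("Article 1. All human beings are born free and equal in dignity and rights. "
       "They are endowed with reason and conscience and should act towards one "
       "another in a spirit of brotherhood.")

tokens = clean(raw)
print(" ".join(tokens))
```

Unlike the output above, this sketch splits on whitespace, so the gaps left by removed words do not appear in the result.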

Stemming

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 108
##
## articl human be born free equal digniti right endow reason conscienc act toward one anoth spirit brotherhood

Optional: Create N-Gram


##
## article_1 1_all all_human human_beings beings_are are_born born_free free_and and_equal equal_in
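
A minimal Python sketch of n-gram creation, assuming the text has already been split into tokens. The function name `ngrams` and its parameters are hypothetical names chosen for illustration.

```python
def ngrams(tokens, n=2, sep="_"):
    """Join every run of n consecutive tokens with the separator."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "article 1 all human beings are born free and equal in".split()
bigrams = ngrams(tokens, n=2)
print(" ".join(bigrams))
```

With `n=2` this yields bigrams in the same `word_word` form as the output above; a larger `n` gives longer sequences, and a document with fewer than `n` tokens yields an empty list.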

Create DFM


##     Terms
## Docs act anoth articl born brotherhood conscienc digniti endow equal free
##    1   1     1      1    1           1         1       1     1     1    1
##    2   0     0      1    0           0         0       0     0     0    1
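
A document-feature matrix has one row per document and one column per term, with each cell holding a count. The following Python sketch builds one from tokenised documents. The function name `build_dfm` and the two short example documents are assumptions made up for illustration; they are not the documents behind the output above.

```python
from collections import Counter

def build_dfm(docs):
    """docs: list of token lists. Returns (sorted vocabulary, count matrix)."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # One row per document, one column per term; absent terms count as 0.
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Two hypothetical, already stemmed documents.
docs = [
    "articl human be born free equal".split(),
    "articl free speech".split(),
]

vocab, matrix = build_dfm(docs)
print(vocab)
for row in matrix:
    print(row)
```

Most cells in a realistic matrix are zero, which is why text analysis libraries usually store it in a sparse format.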

The Output

Time for R

https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558

Sentiment Analysis

What is Sentiment Analysis?


Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes.

What is Sentiment?

  • Opinions (Good vs Bad)
  • Emotions (Happy vs Sad)
  • Attitudes (Like vs Dislike)

    Levels of Analysis


    - Document level

    - Sentence level

    - Entity and Aspect level

    Examples of Usage


    - Product Review

    - Public Opinion

    - Voter Support

    - Wellbeing/Mental Health

    Lexicons


    - "Dictionary" for sentiment

    - Popular Lexicons:

    • LIWC
    • Lexicoder Sentiment Dictionary (positive, negative)
    • AFINN (positive to negative from +5 to -5)
    • Bing (positive, negative)
    • NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust)
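
A lexicon-based score is computed by looking up each word of a text in the dictionary and summing the values found. The Python sketch below uses a numeric scale in the style of AFINN. The `LEXICON` entries and their scores are illustrative assumptions, not values quoted from any published lexicon, and `sentiment_score` is a hypothetical helper name.

```python
import re

# Illustrative lexicon on a numeric scale (assumed values, not a real lexicon).
LEXICON = {"good": 3, "great": 3, "love": 3, "bad": -3, "terrible": -3, "hate": -3}

def sentiment_score(text):
    """Sum the lexicon scores of all words in the text; unknown words score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)

positive = sentiment_score("I love this camera, it is great!")
negative = sentiment_score("Terrible battery and bad lens")
neutral = sentiment_score("The camera arrived on Monday")
print(positive, negative, neutral)
```

A categorical lexicon such as Bing or NRC would be applied the same way, except that words are counted per category rather than summed on one scale.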

    Challenges


    1. Opposite orientations in different applications.

    “This camera sucks.” vs
    “This vacuum cleaner really sucks.”

    2. Sentence containing sentiment words may not express any sentiment.

    “If I can find a good camera in the shop, I will buy it.”

    3. Sarcasm

    “What a Genius! You uploaded your passwords to Github!”

    Note: Very common in political discussion, especially on social media.

    4. Sentences without sentiment words can also imply opinions.

    “This car burns a lot of fuel.”
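
These challenges can be made concrete by running a naive word-counting scorer over the example sentences. In the Python sketch below, the word lists `POSITIVE` and `NEGATIVE` and the function `naive_polarity` are assumptions kept deliberately small for illustration.

```python
import re

# Minimal illustrative word lists (assumptions, not a published lexicon).
POSITIVE = {"good", "genius"}
NEGATIVE = {"sucks"}

def naive_polarity(text):
    """Count positive words minus negative words, ignoring context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

examples = [
    "This camera sucks.",
    "This vacuum cleaner really sucks.",
    "If I can find a good camera in the shop, I will buy it.",
    "What a Genius! You uploaded your passwords to Github!",
    "This car burns a lot of fuel.",
]

scores = {sentence: naive_polarity(sentence) for sentence in examples}
for sentence, score in scores.items():
    print(score, sentence)
```

The scorer gives the camera and the vacuum cleaner the same negative score, treats the conditional sentence and the sarcastic one as positive, and scores the fuel complaint as neutral. Each of these is the kind of error the corresponding challenge describes.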

    Time for R

    Network Analysis

    What is Network Analysis?

    • The analytical study of relations among a set of actors/objects, e.g. people, corporations, countries, words

    What is a Network?

    • A network consists of two elements: a set of vertices (nodes) and a set of edges (links) between vertices.
    • The researcher needs to assign meanings to the links and the nodes.
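
In code, a network is often stored as a list of edges and converted into an adjacency structure that maps each node to its neighbours. The Python sketch below assumes an undirected network. The names in `edges` are made up, and `build_network` is a hypothetical helper.

```python
from collections import defaultdict

def build_network(edges):
    """Turn a list of undirected edges into a node -> set of neighbours mapping."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

# Hypothetical ties between four people; what a tie means
# (friendship, collaboration, ...) is up to the researcher.
edges = [("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Ann"), ("Cat", "Dan")]

network = build_network(edges)
for node in sorted(network):
    print(node, sorted(network[node]))
```

The set of keys is the set of nodes and each stored pair is an edge. The same structure works whether the nodes stand for people, countries, or words.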

    What is a Network?

    • Edges represent relationships between nodes.
    • Can be physical: collaboration, marriage, information/resource flow, travel history
    • Can be cognitive: trust, friendship, semantic similarity

    Example: Florentine Marriages

    Example: Small World Experiment

    Example: Weak Ties

    Example: Political Blogosphere

    Example: Political Discourse

    Example: Phylogenetic network analysis of SARS-CoV-2 genomes

    Forster, P., Forster, L., Renfrew, C., et al. 2020. 'Phylogenetic network analysis of SARS-CoV-2 genomes', Proceedings of the National Academy of Sciences 117(17): 9241–3.

    Question: Which node is the most "central"?

    Centrality

    • Degree Centrality: number of edges attached to a node
    • Closeness Centrality: based on the average shortest-path distance from a node to every other node (a smaller average distance means higher closeness)
    • Betweenness Centrality: number of shortest paths in the graph that pass through a node
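
The three measures can be computed on a small example with the Python sketch below. It assumes an undirected, unweighted, connected graph stored as a dict of neighbour sets. The graph and all function names are made up for illustration. Closeness is reported as the inverse of the average shortest-path distance, and betweenness counts each pair of nodes once, weighting by the share of that pair's shortest paths that pass through the node.

```python
from collections import deque
from itertools import combinations

def bfs(graph, source):
    """Return (dist, sigma): shortest-path lengths and path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:              # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma

def degree_centrality(graph):
    return {v: len(graph[v]) for v in graph}

def closeness_centrality(graph):
    result = {}
    for v in graph:
        dist, _ = bfs(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

def betweenness_centrality(graph):
    info = {v: bfs(graph, v) for v in graph}
    score = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                       # s and t are not connected
        for v in graph:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v is on a shortest s-t path if the two legs add up exactly.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

# Hypothetical five-node network: A - B - C, with D and E both attached to C.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"}, "D": {"C"}, "E": {"C"}}

degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)
print(degree, closeness, betweenness, sep="\n")
```

In this small graph node C ranks first on all three measures. On larger networks the measures often disagree, which is why the choice of measure depends on what "central" is meant to capture.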

    More ways to measure importance: PageRank
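
PageRank treats a node as important when other important nodes link to it. The Python sketch below is a basic power-iteration version for a directed network, written under stated assumptions: the function name `pagerank`, the damping value of 0.85, the fixed number of iterations, and the four-node example are illustrative choices, and every link target must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = links[u] or nodes
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical directed network: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

rank = pagerank(links)
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(rank[node], 3))
```

Node C comes out on top because three nodes link to it, while D, which nothing links to, keeps only the baseline share.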

    Time for R