Basic Text Analysis using R

Justin Chun-ting Ho


Materials: https://github.com/justinchuntingho/

Text Analysis:
The Basics

What is Text Analysis?

“Systematic, objective, quantitative analysis of message characteristics”

Kimberly A. Neuendorf, The Content Analysis Guidebook

Types of Text Analysis


Degree of human involvement:

  • Human coding (100%)
  • Supervised
  • Unsupervised (0%)

Type of Objective:

  • Scaling
  • Classification

(Source: Justin Grimmer and Brandon Stewart, 2013)

Caution!


"All Quantitative Models of Language Are Wrong 
— But Some Are Useful"

Justin Grimmer and Brandon Stewart, Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

Bag of Words Assumption


- Word order doesn’t matter

- The following are all treated as exactly the same:

I enjoy eating food and being with my family
I enjoy eating my family and being with food
and being eating enjoy family food I my with
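
A minimal base R sketch of this assumption, using the three sentences above. The variable names are illustrative only. Once each sentence is reduced to word counts, the three become indistinguishable:

```r
# Three word sequences from the slide: same words, different order
sentences <- c(
  "I enjoy eating food and being with my family",
  "I enjoy eating my family and being with food",
  "and being eating enjoy family food I my with"
)

# Split each sentence on spaces, then count how often each word occurs.
# table() orders its entries by word, so the original word order is discarded.
tokens <- strsplit(sentences, " ", fixed = TRUE)
bags <- lapply(tokens, function(w) table(w))

# Compare every bag of words against the first one
same_as_first <- sapply(bags, function(b) identical(b, bags[[1]]))
print(same_as_first)
```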

The Pre-Processing

Original Text

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 181
##
## Article 1. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

Remove Punctuation

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 178
##
## Article 1 All human beings are born free and equal in dignity and rights They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood

To Lower Case

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 178
##
## article 1 all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood

Remove Numbers

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 177
##
## article  all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood
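
The three cleaning steps so far can be sketched in base R as follows. This is an illustration with regular expressions, not necessarily the functions that produced the output above:

```r
text <- paste(
  "Article 1. All human beings are born free and equal in dignity and rights.",
  "They are endowed with reason and conscience and should act towards one",
  "another in a spirit of brotherhood."
)

# 1. Remove punctuation (here: the three full stops)
no_punct <- gsub("[[:punct:]]", "", text)

# 2. Convert to lower case
lower <- tolower(no_punct)

# 3. Remove numbers (the "1" disappears, leaving a double space behind)
no_numbers <- gsub("[[:digit:]]", "", lower)

print(no_numbers)
print(nchar(c(text, no_punct, lower, no_numbers)))
```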

Remove Stopwords

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 135
##
## article   human beings  born free  equal  dignity  rights   endowed  reason  conscience   act towards one another   spirit  brotherhood
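
Stopword removal can be sketched in base R like this. The stopword list is a hand-picked assumption covering only this sentence; real analyses use a much longer standard list. Unlike the output above, this sketch also collapses the leftover whitespace:

```r
text <- paste(
  "article  all human beings are born free and equal in dignity and rights",
  "they are endowed with reason and conscience and should act towards one",
  "another in a spirit of brotherhood"
)

# Hypothetical mini stopword list, chosen only for this example
stopwords_small <- c("all", "are", "and", "in", "they", "with", "should", "a", "of")

# Split on runs of whitespace, then drop every token found in the list
tokens <- strsplit(text, "[[:space:]]+")[[1]]
kept <- tokens[!tokens %in% stopwords_small]

print(paste(kept, collapse = " "))
```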

Stemming

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 108
##
## articl human be born free equal digniti right endow reason conscienc act toward one anoth spirit brotherhood

Optional: Create N-Gram


##
## article_1 1_all all_human human_beings beings_are are_born born_free free_and and_equal equal_in
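
A bigram (2-gram) joins each word with the word that follows it. A minimal base R sketch, using the first eleven tokens of the example sentence:

```r
tokens <- c("article", "1", "all", "human", "beings", "are",
            "born", "free", "and", "equal", "in")

# Pair every token except the last with its right-hand neighbour
bigrams <- paste(head(tokens, -1), tail(tokens, -1), sep = "_")

print(bigrams)
```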

Create DFM (Document-Feature Matrix)


##     Terms
## Docs act anoth articl born brotherhood conscienc digniti endow equal free
##    1   1     1      1    1           1         1       1     1     1    1
##    2   0     0      1    0           0         0       0     0     0    1
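
A matrix like this has one row per document, one column per term, and counts in the cells. A base R sketch follows. Document 1 uses the stemmed tokens from the earlier slide; the content of document 2 is a made-up assumption, since the slides do not show it:

```r
docs <- list(
  "1" = strsplit(paste(
    "articl human be born free equal digniti right endow reason",
    "conscienc act toward one anoth spirit brotherhood"
  ), " ", fixed = TRUE)[[1]],
  "2" = c("articl", "free")  # hypothetical second document
)

# The vocabulary: every distinct term across all documents, sorted
terms <- sort(unique(unlist(docs)))

# Count each term per document; documents become rows after t()
dfm <- t(vapply(
  docs,
  function(tokens) tabulate(match(tokens, terms), nbins = length(terms)),
  integer(length(terms))
))
colnames(dfm) <- terms

print(dfm)
```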

The Output

Time for R

Keyword Analysis


What is a Keyword?

“A keyword may be defined as a word which occurs with unusual frequency in a given text. This does not mean high frequency but unusual frequency, by comparison with a reference corpus of some kind”



(Scott, M. (1997). PC analysis of key words - and key key words. System, 25(2), 233-45.)


What is Keyness?

“The keyness of a keyword represents the value of log-likelihood or Chi-square statistics; in other words it provides an indicator of a keyword’s importance as a content descriptor for the appeal.”

(Scott, M. (1997). PC analysis of key words - and key key words. System, 25(2), 233-45.)


What is a Chi-squared test?

A comparison between the observed frequency of a word and the frequency we would expect if the word were used at the same rate in the target and reference corpora.
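
A sketch of the test for a single word, using base R's chisq.test(). All counts are made up for illustration, and the continuity correction is switched off as a simplifying assumption. Keyword tools may use a corrected statistic or log-likelihood instead:

```r
# Hypothetical counts: how often the word (vs. all other words)
# occurs in the target corpus and in the reference corpus
counts <- matrix(
  c(30,  970,
    10, 1990),
  nrow = 2, byrow = TRUE,
  dimnames = list(corpus = c("target", "reference"),
                  word   = c("keyword", "other"))
)

test <- chisq.test(counts, correct = FALSE)

# Expected counts if the word were equally common in both corpora
print(test$expected)

# The same statistic by hand: sum of (observed - expected)^2 / expected
manual <- sum((counts - test$expected)^2 / test$expected)
print(manual)
print(test$p.value)
```

A large statistic with more observed than expected occurrences in the target corpus marks the word as a candidate keyword.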

Time for R

Other Applications

Thank you very much!


Twitter: @justin_ct_ho
Github: justinchuntingho
Email: Justin.Ho@ed.ac.uk


Edinburgh Text Analysis Research Group: https://jiscmail.ac.uk/TEXTANALYSIS