R for Sign Language Linguistics

Author

Carl “Calle” Börstell

Published

August 17, 2023

Introduction

This is a short tutorial intended to give some helpful pointers to those working on sign language data and the programming language R. The content here does not provide a general introduction to R. For such an introduction, I suggest looking at the free, online book R for Data Science (Wickham, Çetinkaya-Rundel & Grolemund). I also recommend two books specifically targeting R for linguist(ic)s: Levshina’s (2015) How to do Linguistics with R and Winter’s (2020) Statistics for Linguists: An Introduction Using R.

The code here also assumes the use of the {tidyverse} libraries, for easy reading, writing and piping of data.

Working with ELAN files (.eaf)

Many (or most) sign language linguists work with ELAN at some point. It is fantastic for annotating multimodal language data, and there is huge potential in the functionality of ELAN itself. Nonetheless, sometimes it is nice to be able to work with ELAN files (.eaf format) directly from R. With the {signglossR} package, I wrote a function called read_elan() for reading ELAN files directly into R, outputting a data frame for data analysis. However, I have also made this function available as a standalone function in a GitHub repo(sitory), there called read_eaf(). The functionality is the same, and you can point the function to the path of either a single .eaf file or a whole directory (i.e. folder) containing multiple .eaf files, which will be bound together into one data frame.

Reading ELAN files (.eaf)

In order to read ELAN files, you can either use the read_elan() function from the {signglossR} package, or source the standalone read_eaf() function into your current work session:

# The below reads the function into the current session, for you to use
source("https://raw.githubusercontent.com/borstell/r_functions/main/read_eaf.R")
# You can then either read a single EAF file: 
read_eaf("/path/to/annotated_elan_file.eaf")

# Or point it to a directory (folder) containing multiple EAF files
read_eaf("/path/to/annotated_elan_files/")

# Remember to store this in a variable, if you want to work on it
my_elan_df <- read_eaf("/path/to/annotated_elan_files/")

The output of these read functions should be a data frame (technically a tibble) with the annotation data organized in rows and columns, looking something like this:

# A tibble: 5,682 × 14
   file     a     t1     t2     annotation tier  tier_type participant annotator
   <chr>    <chr> <chr>  <chr>  <chr>      <chr> <chr>     <chr>       <lgl>    
 1 file_001 a1322 ts579  ts582  GLOSS-0864 glos… gloss_LH  Signer_001  NA       
 2 file_001 a1323 ts585  ts609  GLOSS-0521 glos… gloss_LH  Signer_001  NA       
 3 file_001 a1324 ts669  ts677  GLOSS-0521 glos… gloss_LH  Signer_001  NA       
 4 file_001 a1325 ts681  ts707  GLOSS-0864 glos… gloss_LH  Signer_001  NA       
 5 file_001 a1327 ts741  ts744  GLOSS-1657 glos… gloss_LH  Signer_001  NA       
 6 file_001 a1329 ts1488 ts1496 GLOSS-1656 glos… gloss_LH  Signer_001  NA       
 7 file_001 a1330 ts1567 ts1570 GLOSS-1422 glos… gloss_LH  Signer_001  NA       
 8 file_001 a1333 ts2687 ts2697 GLOSS-0865 glos… gloss_LH  Signer_001  NA       
 9 file_001 a1335 ts3168 ts3170 GLOSS-1046 glos… gloss_LH  Signer_001  NA       
10 file_001 a1336 ts3175 ts3178 GLOSS-1047 glos… gloss_LH  Signer_001  NA       
# ℹ 5,672 more rows
# ℹ 5 more variables: ref <chr>, parent_tier <chr>, start <dbl>, end <dbl>,
#   duration <dbl>

Combining ELAN tiers in wide format

Sometimes you get ELAN annotations in a long format, with each row being a single annotation even though some tiers may be child tiers of others, for instance if you read .eaf files using the read_eaf() or read_elan() functions. Then you may want to combine the data into a wide format, such that child tier annotations are on the same rows as their parent tier annotations. This can be achieved with pivoting.

We’ll test this with some toy data (NB: not real data!) I have created:

library(tidyverse)

all_annotations <- read_csv("https://raw.githubusercontent.com/borstell/r_functions/main/data/elan_example_data.csv")

unique(all_annotations$tier_type)
[1] "gloss_LH"       "gloss_RH"       "articulator_LH" "articulator_RH"

This data frame contains four tier types: "gloss_LH" and "gloss_RH", which are parent tiers used for the sign glosses, and "articulator_LH" and "articulator_RH", which are child tiers of the gloss tiers and contain information about whether the parent annotation is a one- or two-handed sign.
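A quick way to double-check which tier types are parents and which are children is to count annotations by tier type and by whether they have a value in the parent_tier column (child tiers do; parent tiers have NA):

# Child tiers have a non-NA value in the parent_tier column
all_annotations |> 
  count(tier_type, is_child = !is.na(parent_tier))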

The problem now is that they are all on separate rows, and we cannot immediately link the child annotations to the parent annotations. Luckily, the ELAN format is smart and has reference annotations that can work as keys to connect the data points! But to do this, we need to subset and re-join the data.

# We can assign the wide data frame to a new variable
annotations_wide <- all_annotations |> 
  
  # We first filter to only parent tiers (they have NA values in the parent_tier)
  filter(is.na(parent_tier)) |> 
  
  # We then rejoin with the child tier annotations
  left_join(all_annotations |> 
              
              # Filter to only child tiers
              filter(!is.na(parent_tier)) |> 
              
              # Select relevant columns only
              select(file, 
                     parent_tier,
                     a, 
                     ref,
                     annotation,
                     tier,
                     tier_type),
            # Join by unique identifiers
            by = join_by(file,
                         tier == parent_tier,
                         a == ref))

We can compare the lengths of the two data frames, to see the difference:

nrow(all_annotations)
[1] 5682
nrow(annotations_wide)
[1] 2841
nrow(annotations_wide)*2
[1] 5682

Since the new wide format data frame is half as long as the original long format one, we can conclude that each gloss annotation had exactly one articulator annotation, and that they matched exactly 1-to-1.

One issue here is that the join involved multiple columns with the same name, and these have now received additional “tags” (suffixes) at the end to uniquely identify them:

colnames(annotations_wide)
 [1] "file"         "a"            "t1"           "t2"           "annotation.x"
 [6] "tier"         "tier_type.x"  "participant"  "annotator"    "ref"         
[11] "parent_tier"  "start"        "end"          "duration"     "a.y"         
[16] "annotation.y" "tier.y"       "tier_type.y" 

If we want to avoid column names like annotation.x and annotation.y, we could either rename the columns in the child tier data before joining them with the parent tiers, or we could simply rename them afterwards:

# We can assign the wide data frame to a new variable
annotations_wide <- all_annotations |> 
  
  # We first filter to only parent tiers (they have NA values in the parent_tier)
  filter(is.na(parent_tier)) |> 
  
  # We then rejoin with the child tier annotations
  left_join(all_annotations |> 
              
              # Filter to only child tiers
              filter(!is.na(parent_tier)) |> 
              
              # Rename columns before joining
              rename("a_child" = a,
                     "tier_child" = tier,
                     "tier_type_child" = tier_type,
                     "annotation_child" = annotation) |> 
              
              # Select relevant columns only
              select(file, 
                     parent_tier,
                     a_child, 
                     ref,
                     annotation_child,
                     tier_child,
                     tier_type_child),
            # Join by unique identifiers
            by = join_by(file,
                         tier == parent_tier,
                         a == ref))
colnames(annotations_wide)
 [1] "file"             "a"                "t1"               "t2"              
 [5] "annotation"       "tier"             "tier_type"        "participant"     
 [9] "annotator"        "ref"              "parent_tier"      "start"           
[13] "end"              "duration"         "a_child"          "annotation_child"
[17] "tier_child"       "tier_type_child" 

If you are joining multiple child tiers with the same parent tier(s), it is probably best to remove unnecessary columns after joining and/or rename them to something more informative. In this case, for example, we might want the annotation_child column to be called something like num_of_hands, seeing as this is what this tier type referred to: whether a sign is one- or two-handed. Good column names (and variable names in general) help a lot with keeping track of what your data actually is, for example when you want to plot certain variables:

annotations_wide |> 
  rename("num_of_hands" = annotation_child) |> 
  ggplot() +
  geom_bar(aes(x=num_of_hands), color="grey30", fill="dodgerblue") +
  labs(x="", y="Tokens", title="Distribution of one- and two-handed signs") +
  theme_minimal(base_size=15, base_family="Futura")

Segmenting “turn-taking” in ELAN files

ELAN annotation data is represented with a start and end time stamp for each annotation cell. These can be used to order all annotations within a file, or per signer within a file, to represent a chronological order of events. Such orderings can then be grouped together to represent utterances or “turns” in conversational data.
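At its simplest, this kind of ordering is just a sort on the time stamps. A minimal sketch, using an ELAN-style data frame like the all_annotations one from above:

# Order annotations chronologically within each file and signer
all_annotations |> 
  arrange(file, participant, start) |> 
  select(file, participant, start, end, annotation)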

I created three different methods for ordering and combining individual annotations into utterances or turns, which can be found here: segment_elan_turns. Each of the three methods can be sourced as an individual R function. These functions take an ELAN-style data frame as input and output a data frame where turns have been added: either as an identifier in a new column, or with the annotations collapsed into a one-row-one-utterance format (if using simplify=TRUE in the function call). These functions may help conversation analysis research and facilitate finding items based on their conversational distribution.

Let’s see an example:

# We read the same toy data as before, but filter to keep only glosses
my_glosses <- readr::read_csv("https://raw.githubusercontent.com/borstell/r_functions/main/data/elan_example_data.csv") |> 
  filter(tier_type %in% c("gloss_RH", "gloss_LH"))

my_glosses
# A tibble: 2,841 × 14
   file     a     t1     t2     annotation tier  tier_type participant annotator
   <chr>    <chr> <chr>  <chr>  <chr>      <chr> <chr>     <chr>       <lgl>    
 1 file_001 a1322 ts579  ts582  GLOSS-0864 glos… gloss_LH  Signer_001  NA       
 2 file_001 a1323 ts585  ts609  GLOSS-0521 glos… gloss_LH  Signer_001  NA       
 3 file_001 a1324 ts669  ts677  GLOSS-0521 glos… gloss_LH  Signer_001  NA       
 4 file_001 a1325 ts681  ts707  GLOSS-0864 glos… gloss_LH  Signer_001  NA       
 5 file_001 a1327 ts741  ts744  GLOSS-1657 glos… gloss_LH  Signer_001  NA       
 6 file_001 a1329 ts1488 ts1496 GLOSS-1656 glos… gloss_LH  Signer_001  NA       
 7 file_001 a1330 ts1567 ts1570 GLOSS-1422 glos… gloss_LH  Signer_001  NA       
 8 file_001 a1333 ts2687 ts2697 GLOSS-0865 glos… gloss_LH  Signer_001  NA       
 9 file_001 a1335 ts3168 ts3170 GLOSS-1046 glos… gloss_LH  Signer_001  NA       
10 file_001 a1336 ts3175 ts3178 GLOSS-1047 glos… gloss_LH  Signer_001  NA       
# ℹ 2,831 more rows
# ℹ 5 more variables: ref <chr>, parent_tier <chr>, start <dbl>, end <dbl>,
#   duration <dbl>
nrow(my_glosses)
[1] 2841

With this data, let’s try to simplify it into sequential turns:

source("https://raw.githubusercontent.com/borstell/r_functions/main/turn_seq.R")

my_turns <- turn_seq(my_glosses, 
                     file=file, 
                     start=start, 
                     end=end, 
                     participant=participant, 
                     annotation=annotation, 
                     simplify = TRUE, # We pivot into collapsed wide turns
                     long = FALSE # We do not combine overlapping same-producer turns
                     )

my_turns |> select(file, participant, annotations)
# A tibble: 266 × 3
   file     participant annotations                                             
   <chr>    <chr>       <chr>                                                   
 1 file_001 Signer_001  GLOSS-1466 GLOSS-1278 GLOSS-1523 GLOSS-1324 GLOSS-1524 …
 2 file_001 Signer_001  GLOSS-1224                                              
 3 file_001 Signer_001  GLOSS-0418                                              
 4 file_001 Signer_001  GLOSS-1657                                              
 5 file_001 Signer_001  GLOSS-1411                                              
 6 file_001 Signer_001  GLOSS-1569 GLOSS-1279 GLOSS-1656 GLOSS-0418 GLOSS-1452 …
 7 file_001 Signer_001  GLOSS-1657                                              
 8 file_001 Signer_001  GLOSS-0816                                              
 9 file_001 Signer_001  GLOSS-0521 GLOSS-0419 GLOSS-0864 GLOSS-1022 GLOSS-0009 …
10 file_001 Signer_001  GLOSS-1325                                              
# ℹ 256 more rows
nrow(my_turns)
[1] 266

We can now see that the number of rows has decreased massively, because all the “turns” have been pivoted into single rows, with all individual annotations collapsed into a single string!
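Since each turn is now a single space-separated string of glosses, we can also get a rough measure of turn length by counting the glosses per turn (a small sketch, assuming individual glosses never contain spaces):

# Count the number of glosses in each collapsed turn
my_turns |> 
  mutate(n_signs = str_count(annotations, "\\S+")) |> 
  select(file, participant, n_signs)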

This can be useful when looking at turn-taking patterns in data:

my_turns |> 
  ggplot() +
  geom_segment(aes(x=start/1000, xend=end/1000, 
                   y=participant, yend=participant, 
                   color=participant),
               linewidth=5, show.legend = F, alpha=.5) +
  scale_color_manual(values=rep(c("dodgerblue", "orange"),3)) +
  scale_x_continuous(labels=scales::comma_format()) +
  labs(y="", x="Time (seconds)") +
  facet_wrap(~file, ncol=1, scales="free") +
  theme_classic(base_size=14)

The function turn_seq() used here simply merges any uninterrupted sequential annotations belonging to the same participant into the same turn. The other two functions for turn-segmentation are based on either a maximal time interval allowed for sequential annotations (end of one to start of next) – turn_interval() – or an estimation of whose turn it is based on who has the most annotations in the current three-annotation-window in the file – turn_quant(). Since they work differently, they sometimes yield different results when grouping annotations into longer turns.
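To make the logic of the simplest method more concrete, here is a rough sketch of the uninterrupted-sequence idea behind turn_seq() (not the actual implementation, and without its extra options):

# A new turn starts whenever the participant changes between
# consecutive annotations (ordered by start time) within a file
my_glosses |> 
  arrange(file, start) |> 
  group_by(file) |> 
  mutate(turn = cumsum(participant != lag(participant, default = first(participant))) + 1) |> 
  group_by(file, turn, participant) |> 
  summarise(annotations = paste(annotation, collapse = " "), .groups = "drop")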

Splitting videos based on ELAN annotation

In some use cases, a researcher may have ELAN files in which individual signs have each been annotated: for example, when collecting lexical items for a dictionary project. I wrote a basic function that can segment the video into shorter video clips based on the position of each ELAN annotation in the .eaf files. This function requires FFmpeg to be installed, as it runs ffmpeg commands behind the scenes.

# Source the R script to load the functions into the session
source("https://raw.githubusercontent.com/borstell/r_functions/main/split_elan_videos.R")

# Example of use
split_elan_video(elan_path = "/path/to/eaf/file(s)",
                 segmentation_tier = "name_of_segmentation_tier",
                 video_path = "/path/to/video/file(s)",
                 annotation_tag = TRUE, # add the contents of the ELAN cells to the output filenames
                 padding = 0, # add (or subtract, if negative) padding in milliseconds before/after the segment duration
                 video_input_format = ".mov", # specify the input video format in the directory (default is .mp4)
                 video_output_format = ".mp4") # specify the output video format (default is .mp4)
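For context, the kind of command run behind the scenes looks roughly like the following (a hypothetical example with made-up file names and times; it assumes ffmpeg is available on your PATH):

# Cut a clip between two annotation time stamps (in milliseconds)
start_ms <- 1500
end_ms <- 3200
cmd <- sprintf("ffmpeg -ss %.3f -t %.3f -i %s -c copy %s",
               start_ms / 1000,
               (end_ms - start_ms) / 1000,
               "input_video.mp4",
               "GLOSS-0864.mp4")
system(cmd)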

Further possibilities

There is other ELAN functionality available through the {signglossR} package.

Plotting handshape fonts

Sometimes, especially when working on phonological data, it is useful to illustrate handshapes as visual forms rather than resort to language-specific labels (e.g. “baby-C”). One possibility is to use the handshape fonts created by CSLDS, CUHK. After downloading the main handshape font, you can install it on your computer, where it should be available as a local font with the name handshape2002. In principle, any font available on your local computer should also be available for plotting with {ggplot2}.

Let’s first read some (once again) toy data. The data file contains 1,000 example data points, each describing a handshape using a label referring to the ASL manual alphabet. This could, for instance, be the annotations from an ELAN file, where annotators have annotated each of 1,000 signs in a text/dictionary/etc. with a single label representing the handshape.

library(tidyverse)

handshapes <- read_csv("https://raw.githubusercontent.com/borstell/r_functions/main/data/handshape_example_data.csv")

handshapes
# A tibble: 1,000 × 1
   hs   
   <chr>
 1 A    
 2 O    
 3 B    
 4 O    
 5 1    
 6 1    
 7 S    
 8 1    
 9 A    
10 1    
# ℹ 990 more rows

We may want to plot this data, to see the distribution of handshapes:

handshapes |> 
  ggplot() +
  geom_bar(aes(x=hs), color="grey30", fill="dodgerblue") +
  labs(x="", y="Tokens", title="Distribution of handshapes") +
  theme_minimal(base_size=15, base_family="Futura")

So, this plot is quite OK looking, but I would personally prefer it if we could change two things (and perhaps a third as a bonus):

  1. Reorder the bars to reflect their frequency
  2. Use a more visual label for the handshapes (e.g. actual handshapes)
  3. Bonus: we may also be more interested in percentages than in absolute totals.

# We add the package `{scales}` to get the percent format
library(scales)

handshapes |> 
  count(hs) |> 
  
  # Calculate proportion of n of total ns
  mutate(percent = n/sum(n)) |> 
  
  ggplot() +
  geom_col(aes(x=fct_reorder(hs, -n), y=percent), color="grey30", fill="dodgerblue") +
  scale_y_continuous(labels=percent_format()) +
  labs(x="", y="", title="Distribution of handshapes") +
  theme_minimal(base_size=15, base_family="Futura") +
  theme(axis.text.x = element_text(color="black", size=rel(3), family="handshape2002"))

Oops! This is not what we wanted! Oh, right, the handshape labels “B”, “O”, etc. are not equivalent to the intended handshapes in the handshape2002 font. So we will need to create a conversion table for our labels. Below is a manually created conversion table mapping our labels to the corresponding font characters.

hs_conversion <- c("x", "1", "6", "<", "A", ">", "B")
names(hs_conversion) <- c("B", "A", "S", "C", "O", "5", "1")
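Indexing this named vector with one of our labels returns the corresponding font character, which is exactly what we will do inside the plotting code below:

hs_conversion["B"]
  B 
"x" 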

Let’s try again, but changing the x mapping to include the handshape-to-font conversion:

# We add the package `{scales}` to get the percent format
library(scales)

handshapes |> 
  count(hs) |> 
  
  # Calculate proportion of n of total ns
  mutate(percent = n/sum(n)) |> 
  
  ggplot() +
  geom_col(aes(x=fct_reorder(hs_conversion[hs], -n), y=percent), color="grey30", fill="dodgerblue") +
  scale_y_continuous(labels=percent_format()) +
  labs(x="", y="", title="Distribution of handshapes") +
  theme_minimal(base_size=15, base_family="Futura") +
  theme(axis.text.x = element_text(color="black", size=rel(3), family="handshape2002"))

This looks nicer! And a whole lot more informative than opaque and language-specific handshape labels, in my opinion!

Working with image/video data

There is plenty of functionality in the {magick} package for image processing, and some functionality for video processing in the {av} package. The {signglossR} package relies heavily on these packages for its image and video processing. I refer you to these three packages and their respective documentation for further information about functionality and use.
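As a small taste, here is a sketch using {magick} that reads a still image, resizes it and adds a gloss label as a caption (the file names here are hypothetical):

library(magick)

# Read, resize and caption a still image, then write it back to disk
img <- image_read("sign_still.png") |> 
  image_scale("600") |> 
  image_annotate("GLOSS-0864", size = 30, gravity = "southwest", color = "white")

image_write(img, "sign_still_labeled.png")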

License

This tutorial is published under a CC BY-NC-SA 4.0 license.

Feel free to share or reuse the content for your own research or teaching.

“I don’t have a SoundCloud”, but if you enjoyed this tutorial:

  • if you have the means, donate to a charity or someone in need,
  • give praise to others for their work and actions,
  • support your peers and students,
  • be an ally, and
  • work towards a kinder academia.

Citation

If you want to cite this tutorial, here’s a suggested bibtex format:

@manual{r4sl,
  author = {Börstell, Carl},
  title = {R for Sign Language Linguistics},
  year = {2023},
  url = {https://borstell.github.io/misc/r4sl},
}