Reading ELAN annotations

One of the most basic use cases for {readelan} is to read the annotations inside an ELAN annotation file (.eaf) directly into R.

Reading a file

Reading a single file looks like this:

library(readelan)

annotations <- read_eaf("/path/to/elan/file.eaf")

The {readelan} package includes a small example EAF file, which we can locate with system.file() and read:

library(readelan)

eaf_file <- system.file("extdata", 
                        "example.eaf", 
                        package = "readelan")

annotations <- read_eaf(eaf_file)

head(annotations)

     filename  a time_1 time_2 annotation   tier  tier_type participant
1 example.eaf a1    ts1    ts2          I  words default-lt        s001
2 example.eaf a2    ts3    ts4       like  words default-lt        s001
3 example.eaf a3    ts5    ts6       cats  words default-lt        s001
4 example.eaf a4    ts1    ts2    pronoun    pos        pos        s001
5 example.eaf a7    ts1    ts2    subject syntax     syntax        s001
6 example.eaf a5    ts3    ts4       verb    pos        pos        s001
  annotator parent_ref a_ref language_ref  cv_id
1       ABC       <NA>  <NA>         <NA>   <NA>
2       ABC       <NA>  <NA>         <NA>   <NA>
3       ABC       <NA>  <NA>         <NA>   <NA>
4       DEF      words    a1          eng    pos
5       GHI      words    a1          eng syntax
6       DEF      words    a2          eng    pos
                                     cve_ref start  end duration    time_unit
1                                       <NA>   490 1590     1100 milliseconds
2                                       <NA>  2010 3670     1660 milliseconds
3                                       <NA>  4370 5980     1610 milliseconds
4 cveid_89b2ac38-7313-4737-aa5f-19e1231ccb18   490 1590     1100 milliseconds
5 cveid_4990ed36-c1d1-40c6-800e-2bd264d9a89b   490 1590     1100 milliseconds
6 cveid_d5558ab7-11c3-47d5-9a0f-403724b0e0b7  2010 3670     1660 milliseconds
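Because the result is an ordinary data frame, it can be filtered and transformed with base R. The sketch below rebuilds a few rows of the output above by hand, so that it runs without the package, and shows that the duration values shown there match end minus start. The object names ann and words are illustrative only.

```r
# Hand-built copy of a few rows and columns of the output above (illustrative only)
ann <- data.frame(
  a          = c("a1", "a2", "a3", "a4"),
  annotation = c("I", "like", "cats", "pronoun"),
  tier       = c("words", "words", "words", "pos"),
  start      = c(490, 2010, 4370, 490),
  end        = c(1590, 3670, 5980, 1590),
  stringsAsFactors = FALSE
)

# Keep only the annotations on the "words" tier
words <- ann[ann$tier == "words", ]

# Durations in milliseconds, and the same values converted to seconds
words$duration   <- words$end - words$start
words$duration_s <- words$duration / 1000

print(words)
```

The same subsetting applies unchanged to the data frame returned by read_eaf(), since it has the tier, start and end columns shown above.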

Reading an online file

Files do not have to be stored locally: with an internet connection, they can be read directly from a URL. For example, we can read a file directly from the DGS-Korpus repository like so:

library(readelan)

dgs_file <- read_eaf("https://www.sign-lang.uni-hamburg.de/meinedgs/eaf/1413451-11105600-11163240.eaf")

head(dgs_file)

                       filename        a time_1 time_2
1 1413451-11105600-11163240.eaf a2834768    ts1   ts12
2 1413451-11105600-11163240.eaf a2834772   ts12   ts19
3 1413451-11105600-11163240.eaf a2834780   ts19   ts40
4 1413451-11105600-11163240.eaf a2834793   ts40   ts63
5 1413451-11105600-11163240.eaf a2834798   ts63   ts68
6 1413451-11105600-11163240.eaf a2834800   ts68   ts76
                                                                                    annotation
1                                                                     Wie mein Leben aussieht?
2                                                  Na ja, ich bin als Gehörloser aufgewachsen.
3 Ich habe eher das Gefühl, wenn ich mir vorstelle, dass ich allein bin, dann wäre ich einsam.
4     Da treffe ich lieber viele Gehörlose und mache mit denen was, dann ist mein Leben schön.
5                                                                        Aber das ist ja klar.
6                                                           Dann ist Alleinsein nicht schlimm.
                    tier                 tier_type participant annotator
1 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
2 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
3 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
4 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
5 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
6 Deutsche_Übersetzung_A L_text__finer_granularity      ber-36      <NA>
  parent_ref a_ref language_ref cv_id cve_ref start   end duration    time_unit
1       <NA>  <NA>         <NA>  <NA>    <NA>   240  2160     1920 milliseconds
2       <NA>  <NA>         <NA>  <NA>    <NA>  2160  4120     1960 milliseconds
3       <NA>  <NA>         <NA>  <NA>    <NA>  4120  7860     3740 milliseconds
4       <NA>  <NA>         <NA>  <NA>    <NA>  7860 11180     3320 milliseconds
5       <NA>  <NA>         <NA>  <NA>    <NA> 11180 12340     1160 milliseconds
6       <NA>  <NA>         <NA>  <NA>    <NA> 12340 13500     1160 milliseconds
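If the same online file is read repeatedly, it may be convenient to keep a local copy and read that instead. The following is only a sketch under that assumption: local_eaf_copy() is a hypothetical helper, not part of {readelan}, and it relies solely on base R's download.file().

```r
# Hypothetical helper (not part of readelan): return the path to a local copy
# of an EAF file, downloading it only if it is not already present in `dir`.
local_eaf_copy <- function(url, dir = tempdir()) {
  dest <- file.path(dir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" avoids line-ending conversion on Windows
    utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Usage (requires an internet connection on the first call only):
# eaf_path <- local_eaf_copy("https://www.sign-lang.uni-hamburg.de/meinedgs/eaf/1413451-11105600-11163240.eaf")
# dgs_file <- readelan::read_eaf(eaf_path)
```

The default of tempdir() only keeps the copy for the current R session; pass a permanent directory to dir to keep it longer.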

Additional arguments

Specifying tiers

There are two arguments for specifying which tiers or tier types to read from your EAF file: tiers (simple to use but constrained) and xpath (more complicated to use but more customizable).

With tiers, you pass a named list whose names are either tier or tier_type and whose values are single character strings or character vectors naming the tiers (or tier types) to read. For instance, if I only want to read two specific tiers from a file, I can specify them like this:

library(readelan)

dgs_file <- 
  read_eaf(file = "https://www.sign-lang.uni-hamburg.de/meinedgs/eaf/1413451-11105600-11163240.eaf",
           tiers = list(tier = c("Lexem_Gebärde_l_A",
                                 "Lexem_Gebärde_r_A")))

head(dgs_file)

                       filename          a time_1 time_2    annotation
1 1413451-11105600-11163240.eaf a2914979_1    ts2    ts3       SEHEN1*
2 1413451-11105600-11163240.eaf a2567654_1    ts4    ts5     SELBST1A*
3 1413451-11105600-11163240.eaf a2567651_1    ts6    ts7      LEBEN1A*
4 1413451-11105600-11163240.eaf a2567657_1    ts8    ts9        SEHEN1
5 1413451-11105600-11163240.eaf a2567656_1   ts10   ts11    $GEST-OFF^
6 1413451-11105600-11163240.eaf a2567658_1   ts13   ts14 AUFWACHSEN1A*
               tier                              tier_type participant
1 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
2 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
3 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
4 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
5 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
6 Lexem_Gebärde_r_A L_tokens_right_left__finer_granularity      ber-36
  annotator parent_ref a_ref language_ref cv_id cve_ref start  end duration
1      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>   700  860      160
2      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>   920 1140      220
3      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>  1280 1520      240
4      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>  1660 1800      140
5      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>  1960 2060      100
6      <NA>       <NA>  <NA>         <NA>  <NA>    <NA>  2300 2480      180
     time_unit
1 milliseconds
2 milliseconds
3 milliseconds
4 milliseconds
5 milliseconds
6 milliseconds

If you want to specify tier types instead, that can be done in the same way:

library(readelan)

eaf_file <- system.file("extdata", 
                        "example.eaf", 
                        package = "readelan")

annotations <- 
  read_eaf(file = eaf_file,
           tiers = list(tier_type = c("default-lt", "syntax")))

head(annotations)

     filename  a time_1 time_2 annotation   tier  tier_type participant
1 example.eaf a1    ts1    ts2          I  words default-lt        s001
2 example.eaf a2    ts3    ts4       like  words default-lt        s001
3 example.eaf a3    ts5    ts6       cats  words default-lt        s001
4 example.eaf a7    ts1    ts2    subject syntax     syntax        s001
5 example.eaf a8    ts3    ts4  predicate syntax     syntax        s001
6 example.eaf a9    ts5    ts6     object syntax     syntax        s001
  annotator parent_ref a_ref language_ref  cv_id
1       ABC       <NA>  <NA>         <NA>   <NA>
2       ABC       <NA>  <NA>         <NA>   <NA>
3       ABC       <NA>  <NA>         <NA>   <NA>
4       GHI      words    a1          eng syntax
5       GHI      words    a2          eng syntax
6       GHI      words    a3          eng syntax
                                     cve_ref start  end duration    time_unit
1                                       <NA>   490 1590     1100 milliseconds
2                                       <NA>  2010 3670     1660 milliseconds
3                                       <NA>  4370 5980     1610 milliseconds
4 cveid_4990ed36-c1d1-40c6-800e-2bd264d9a89b   490 1590     1100 milliseconds
5 cveid_de8c8f23-6b49-42da-9bd4-5b8b59d8b1da  2010 3670     1660 milliseconds
6 cveid_f37323ed-9b7e-48d9-bbd6-b9105186ed02  4370 5980     1610 milliseconds

With the xpath argument, you can target tiers directly using XPath syntax. This is harder to use, but it allows more customization, such as targeting tiers based on substrings, like tiers whose names start with “wo” (targeting “words”):

library(readelan)

eaf_file <- system.file("extdata", 
                        "example.eaf", 
                        package = "readelan")

annotations <- 
  read_eaf(file = eaf_file,
           xpath = ".//TIER[starts-with(@TIER_ID,'wo')]")

head(annotations)

     filename  a time_1 time_2 annotation  tier  tier_type participant
1 example.eaf a1    ts1    ts2          I words default-lt        s001
2 example.eaf a2    ts3    ts4       like words default-lt        s001
3 example.eaf a3    ts5    ts6       cats words default-lt        s001
  annotator parent_ref a_ref language_ref cv_id cve_ref start  end duration
1       ABC       <NA>  <NA>         <NA>  <NA>    <NA>   490 1590     1100
2       ABC       <NA>  <NA>         <NA>  <NA>    <NA>  2010 3670     1660
3       ABC       <NA>  <NA>         <NA>  <NA>    <NA>  4370 5980     1610
     time_unit
1 milliseconds
2 milliseconds
3 milliseconds
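XPath predicates can also be combined with "or", which makes it possible to build an expression from a vector of values. The sketch below uses a hypothetical helper, tier_xpath(), which is not part of {readelan}. It assumes that TIER elements carry a TIER_ID attribute (as in the example above) and a LINGUISTIC_TYPE_REF attribute for the tier type (as in the EAF format), and it does not handle values that themselves contain single quotes.

```r
# Hypothetical helper (not part of readelan): build an XPath expression that
# matches TIER elements whose attribute equals any of the given values.
tier_xpath <- function(values, attribute = "TIER_ID") {
  conditions <- sprintf("@%s='%s'", attribute, values)
  sprintf(".//TIER[%s]", paste(conditions, collapse = " or "))
}

tier_xpath(c("words", "pos"))
# ".//TIER[@TIER_ID='words' or @TIER_ID='pos']"

tier_xpath("syntax", attribute = "LINGUISTIC_TYPE_REF")
# ".//TIER[@LINGUISTIC_TYPE_REF='syntax']"
```

The resulting string has the same shape as the expression used above, so it should be usable as the xpath argument of read_eaf(), although that has not been verified here.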

Time slots for child tiers

The argument fill_times is set to TRUE by default, which means that read_eaf() attempts to fill the empty time slots (i.e., start and end times of annotations) of child annotations based on their parent annotations. This is the output most users will want, but it can fail in some cases. It works in cases like the ones above, where there are parent annotations to fill times from, but it will not work if the file lacks time slots altogether, or if only child tiers are targeted. In such cases, setting fill_times to FALSE should allow the file to be read, but leaves the time slots empty, as in the original EAF file:

library(readelan)

eaf_file <- system.file("extdata", 
                        "example.eaf", 
                        package = "readelan")

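# With the default fill_times = TRUE, this results in an error, because only
# a child tier type is targeted and there are no parent times to fill from:
# annotations <- 
#   read_eaf(file = eaf_file,
#            tiers = list(tier_type = c("syntax")))

# With fill_times = FALSE the file can be read, but times are left empty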
annotations <- 
  read_eaf(file = eaf_file,
           tiers = list(tier_type = c("syntax")),
           fill_times = FALSE)

head(annotations)

     filename  a time_1 time_2 annotation   tier tier_type participant
1 example.eaf a7   <NA>   <NA>    subject syntax    syntax        s001
2 example.eaf a8   <NA>   <NA>  predicate syntax    syntax        s001
3 example.eaf a9   <NA>   <NA>     object syntax    syntax        s001
  annotator parent_ref a_ref language_ref  cv_id
1       GHI      words    a1          eng syntax
2       GHI      words    a2          eng syntax
3       GHI      words    a3          eng syntax
                                     cve_ref start end duration    time_unit
1 cveid_4990ed36-c1d1-40c6-800e-2bd264d9a89b    NA  NA       NA milliseconds
2 cveid_de8c8f23-6b49-42da-9bd4-5b8b59d8b1da    NA  NA       NA milliseconds
3 cveid_f37323ed-9b7e-48d9-bbd6-b9105186ed02    NA  NA       NA milliseconds
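To illustrate what filling times from parent annotations means, the sketch below does the lookup by hand in base R: each child's a_ref is matched against the parent's a, and the parent's start and end are copied over. It is not the package's implementation, only a minimal illustration for a single parent-child level, with hand-built data frames that mirror the values shown above.

```r
# Hand-built parent ("words") and child ("syntax") annotations,
# mirroring the columns of the output above (illustrative only)
parents <- data.frame(
  a     = c("a1", "a2", "a3"),
  start = c(490, 2010, 4370),
  end   = c(1590, 3670, 5980),
  stringsAsFactors = FALSE
)
children <- data.frame(
  a     = c("a7", "a8", "a9"),
  a_ref = c("a1", "a2", "a3"),
  start = NA_real_,
  end   = NA_real_,
  stringsAsFactors = FALSE
)

# Look up each child's parent via a_ref and copy the parent's times
idx <- match(children$a_ref, parents$a)
children$start    <- parents$start[idx]
children$end      <- parents$end[idx]
children$duration <- children$end - children$start

print(children)
```

If the parent rows are missing, as when only child tiers are read, match() returns NA for every child and the times stay empty, which is the situation described above.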

Writing full paths

The full_path argument determines whether the full file path given as input is written to the output data frame as the filename (if TRUE; e.g., “/path/to/elan_file.eaf”), or whether it is shortened to the base name only (if FALSE, the default; e.g., “elan_file.eaf”).
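The difference between the two forms corresponds to what base R's basename() returns for a path. Whether read_eaf() uses basename() internally is an assumption; the snippet only illustrates the two values using the example path from the text.

```r
path <- "/path/to/elan_file.eaf"

# Base name only, the form written with full_path = FALSE (the default)
short_name <- basename(path)

# Full path as given, the form written with full_path = TRUE
full_name <- path

print(c(short_name, full_name))
```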

Progress bar

If progress is set to TRUE, a progress bar will be printed to the console as files are read. This is mostly useful when reading multiple files that take some time to complete (see Multiple files).