Back to Basics
This month I worked out strategies for cleaning my corpus. I am going to represent each text in the corpus in three formats:
- All words in a text separated by \n
- Every unique word and their frequencies (“bag of words”)
- Every sentence in a text with punctuation maintained
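As a rough illustration of the first two formats, here is a minimal sketch in base R. It is not my actual cleaning script: the sample text is made up, the variable names are placeholders, and splitting on anything that is not a letter or an apostrophe is a simplifying assumption.

```r
# Hypothetical sample text; the real script reads a full novel from disk.
text <- "The sea was calm. The sky was dark, and the sea was still."

# Format 1: every word, lowercased, separated by \n.
# Splitting on non-letter characters is an assumption made for this sketch.
words <- unlist(strsplit(tolower(text), "[^a-z']+"))
words <- words[words != ""]
words_by_line <- paste(words, collapse = "\n")

# Format 2: bag of words, i.e. each unique word with its frequency.
bag_of_words <- sort(table(words), decreasing = TRUE)
```

Calling `cat(words_by_line)` prints one word per line, and `bag_of_words` prints each unique word with its count, most frequent first.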
Here is a snippet of my text cleaning script that shows the third representation. I am using the
tokenizers package (https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html) to accomplish this. It has a handy, lightning-quick function that returns the individual sentences of an input text. However, it was acting a little funny when I gave it some of my texts in their entirety, so I elected to split each one in half, tokenize the sentences of each half separately, and then join the results again. It worked pretty well.
library(tokenizers)

# read the text, one line per element
heartOfDarkness <- scan("conrad-heart-of-darkness.txt", what = "character", sep = "\n")
# keep only the novel itself: from the "I" line through the final line
heartOfDarkness.start <- which(heartOfDarkness == "I")
heartOfDarkness.end <- which(heartOfDarkness == "sky--seemed to lead into the heart of an immense darkness.")
heartOfDarkness <- heartOfDarkness[heartOfDarkness.start:heartOfDarkness.end]
# drop three lines by position, starting with the "I" line matched above
heartOfDarkness.sents <- heartOfDarkness[-(1)]
heartOfDarkness.sents <- heartOfDarkness.sents[-(1160)]
heartOfDarkness.sents <- heartOfDarkness.sents[-(2128)]
#paste is dumb with big inputs so break in half to be more manageable
#break near middle at full sentence.
first_half <- heartOfDarkness.sents[1:1543]
second_half <- heartOfDarkness.sents[-(1:1543)]
heartOfDarkness.sents.first <- paste0(first_half, collapse = "\n")
heartOfDarkness.sents.first <- unlist(tokenize_sentences(heartOfDarkness.sents.first))
heartOfDarkness.sents.second <- paste0(second_half, collapse = "\n")
heartOfDarkness.sents.second <- unlist(tokenize_sentences(heartOfDarkness.sents.second))
#recombine
heartOfDarkness.sents <- c(heartOfDarkness.sents.first, heartOfDarkness.sents.second)
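Since I will have to repeat the split, tokenize, and recombine step for every text, it could be wrapped in a small helper. The following is only a sketch under a few assumptions: `sentences_in_chunks` and `naive_sentences` are hypothetical names, and the naive regex splitter is a stand-in so the example runs without extra packages. In the real script the tokenizer would be `tokenize_sentences` from the tokenizers package.

```r
# Stand-in sentence splitter (an assumption for this sketch); the real
# script would pass tokenizers::tokenize_sentences instead.
naive_sentences <- function(x) {
  # Join wrapped lines, then split after ., ! or ? followed by whitespace.
  x <- gsub("\n", " ", x, fixed = TRUE)
  strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)
}

# Hypothetical helper: split `lines` after each index in `breaks`,
# tokenize every chunk separately, and recombine the sentences in order.
# Each index in `breaks` should point at a line that ends a sentence.
sentences_in_chunks <- function(lines, breaks, tokenizer = naive_sentences) {
  # Lines up to the first break get id 0, the next chunk id 1, and so on.
  chunk_id <- findInterval(seq_along(lines), breaks + 1)
  chunks <- split(lines, chunk_id)
  sents <- lapply(chunks, function(chunk) {
    unlist(tokenizer(paste0(chunk, collapse = "\n")))
  })
  unlist(sents, use.names = FALSE)
}

# Made-up example lines, broken after the first line.
lines <- c("It was dark.", "We waited", "for the tide.", "Nobody spoke.")
result <- sentences_in_chunks(lines, breaks = 1)
```

With the snippet above, the equivalent call would be something like `sentences_in_chunks(heartOfDarkness.sents, breaks = 1543, tokenizer = tokenize_sentences)`. The break points still have to be chosen by hand so that no sentence is cut in two.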
In the next month I will need to make up my mind about exactly what type of machine learning technique I’ll use to help answer my research question. I am leaning towards training a Support Vector Machine, but k-nearest neighbors could be helpful, too. In the meantime I will try to carry out my goal of having a perfectly cleaned and machine-usable corpus by the end of October. This includes the addition of a poetic, “ground-truth” corpus if I do in fact use a classification scheme. Look out for more updates in next month’s post!
For reference, here is my corpus: