the gReat RetuRn
Back to Basics
This month I worked out strategies to clean my corpus. I am going to split my corpus into three formats.
- All words in a text separated by \n
- Every unique word and their frequencies (“bag of words”)
- Every sentence in a text with punctuation maintained
Here is a snippet of my text cleaning script that shows the third representation. I am using the tokenizers
[https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html] package to accomplish this. It has a handy, lightning quick output of the unique sentences of an inputted text. However, it was acting a little funny with the entirety of some of my texts so I elected to bifurcate them, separate their unique sentences and then adjoin them again. It worked pretty well.
heartOfDarkness <- scan("conrad-heart-of-darkness.txt",what="character",sep="\n")
heartOfDarkness.start<- which(heartOfDarkness == "I")
heartOfDarkness.end <- which(heartOfDarkness == "sky--seemed to lead into the heart of an immense darkness.")
heartOfDarkness<-heartOfDarkness[heartOfDarkness.start: heartOfDarkness.end]
heartOfDarkness.sents<-heartOfDarkness[-(1)]
heartOfDarkness.sents<-heartOfDarkness.sents[-(1160)]
heartOfDarkness.sents<-heartOfDarkness.sents[-(2128)]
#paste is dumb with big inputs so break in half to be more manageable
#break near middle at full sentence.
first_half <- heartOfDarkness.sents[1:1543]
second_half<- heartOfDarkness.sents[-(1:1543)]
heartOfDarkness.sents.first <- paste0(first_half, collapse = "\n")
heartOfDarkness.sents.first <- unlist(tokenize_sentences(heartOfDarkness.sents.first))
heartOfDarkness.sents.second <- paste0(second_half, collapse = "\n")
heartOfDarkness.sents.second <- unlist(tokenize_sentences(heartOfDarkness.sents.second))
#recombine
heartOfDarkness.sents <- c(heartOfDarkness.sents.first,heartOfDarkness.sents.second)
In the next month I will need to make up my mind about exactly what type of machine learning technique I’ll use to help answer my question of inquiry. I am leaning towards training a Support Vector Machine, but K-Nearest Neighbor could be helpful, too. In the meantime I will try to carry out my goal of having a perfectly cleaned and machine-usable corpus by the end of October. This includes the addition of a poetic, “ground-truth” corpus if I do in fact using a classification scheme. Look out for more updates in next month’s post!
For reference, here is my corpus:
- conrad-heart-of-darkness.txt
- cormac-mccarthy-blood-meridian.txt
- cormac-mccarthy-no-country-for-old-men.txt
- cormac-mccarthy-the-road.txt
- dh-lawrence-the-rainbow.txt
- dh-lawrence-the-serpent
- dh-lawrence-women-in-love.txt
- edgar-allen-poe-eureka
- edgar-allen-poe-narrative-of-arthur-gordon-pym
- fscott-fitzergald-the-great-gatsby.txt
- herman-melville-billy-budd.txt
- herman-melville-moby-dick.txt
- james-joyce-portrait-of-the-artist.txt
- jean-rhys-francis-wide-sargasso-sea.txt
- jm-coetzee-life-and-times-of-michael-k.txt
- malcolm-lowry-under-the-volcano.txt
- oscar-wilde-picture-of-dorian-gray.txt
- thomas-pynchon-gravitys-rainbow.txt
- virginia-woolf-mrs-dalloway.txt
- virginia-woolf-orlando.txt
- virginia-woolf-to-the-lighthouse.txt
- vladimir-nabokov-lolita.txt
- vladimir-nabokov-pale-fire.txt
- william-faulkner-absolam-absolam.txt
- william-faulkner-the-sound-and-the-fury.txt
- william-h-gass-in-the-heart-of-the-heart-of-the-country.txt