Back to Basics

This month I worked out strategies to clean my corpus. I am going to split my corpus into three formats.

  1. All words in a text separated by \n
  2. Every unique word and their frequencies (“bag of words”)
  3. Every sentence in a text with punctuation maintained

Here is a snippet of my text cleaning script that shows the third representation. I am using the tokenizers[] package to accomplish this. It has a handy, lightning quick output of the unique sentences of an inputted text. However, it was acting a little funny with the entirety of some of my texts so I elected to bifurcate them, separate their unique sentences and then adjoin them again. It worked pretty well.

heartOfDarkness <- scan("conrad-heart-of-darkness.txt",what="character",sep="\n")

heartOfDarkness.start<- which(heartOfDarkness == "I")
heartOfDarkness.end <- which(heartOfDarkness == "sky--seemed to lead into the heart of an immense darkness.")

heartOfDarkness<-heartOfDarkness[heartOfDarkness.start: heartOfDarkness.end]


#paste is dumb with big inputs so break in half to be more manageable

#break near middle at full sentence. 
first_half <- heartOfDarkness.sents[1:1543]
second_half<- heartOfDarkness.sents[-(1:1543)]

heartOfDarkness.sents.first <- paste0(first_half, collapse = "\n")
heartOfDarkness.sents.first <- unlist(tokenize_sentences(heartOfDarkness.sents.first))

heartOfDarkness.sents.second <- paste0(second_half, collapse = "\n")
heartOfDarkness.sents.second <- unlist(tokenize_sentences(heartOfDarkness.sents.second))


heartOfDarkness.sents <- c(heartOfDarkness.sents.first,heartOfDarkness.sents.second)

In the next month I will need to make up my mind about exactly what type of machine learning technique I’ll use to help answer my question of inquiry. I am leaning towards training a Support Vector Machine, but K-Nearest Neighbor could be helpful, too. In the meantime I will try to carry out my goal of having a perfectly cleaned and machine-usable corpus by the end of October. This includes the addition of a poetic, “ground-truth” corpus if I do in fact using a classification scheme. Look out for more updates in next month’s post!

For reference, here is my corpus:

  • conrad-heart-of-darkness.txt
  • cormac-mccarthy-blood-meridian.txt
  • cormac-mccarthy-no-country-for-old-men.txt
  • cormac-mccarthy-the-road.txt
  • dh-lawrence-the-rainbow.txt
  • dh-lawrence-the-serpent
  • dh-lawrence-women-in-love.txt
  • edgar-allen-poe-eureka
  • edgar-allen-poe-narrative-of-arthur-gordon-pym
  • fscott-fitzergald-the-great-gatsby.txt
  • herman-melville-billy-budd.txt
  • herman-melville-moby-dick.txt
  • james-joyce-portrait-of-the-artist.txt
  • jean-rhys-francis-wide-sargasso-sea.txt
  • jm-coetzee-life-and-times-of-michael-k.txt
  • malcolm-lowry-under-the-volcano.txt
  • oscar-wilde-picture-of-dorian-gray.txt
  • thomas-pynchon-gravitys-rainbow.txt
  • virginia-woolf-mrs-dalloway.txt
  • virginia-woolf-orlando.txt
  • virginia-woolf-to-the-lighthouse.txt
  • vladimir-nabokov-lolita.txt
  • vladimir-nabokov-pale-fire.txt
  • william-faulkner-absolam-absolam.txt
  • william-faulkner-the-sound-and-the-fury.txt
  • william-h-gass-in-the-heart-of-the-heart-of-the-country.txt