Distributed Cloud Environment on Ubuntu 14.04 with Docker

Text Mining

I've written a simple application that uses the Weibo API to fetch friends' posts and all of their comments on Weibo (the biggest social media platform in China).

HBase Table Design

I store all the Weibo data in the HBase cluster I built in the previous section, then compute the tf-idf of each post and its comments so that each topic can be labelled with its most representative terms.

The following tables are what I designed to store all the data. This is probably not the best solution, so if there is a better architecture that fits my needs, please let me know and I will enhance my system and this note. A sketch of creating these tables with the HBase client API follows the listing below.

time-index-table
  column family: post_index
  columns: create_time, user_id, post_id

post-table
  column family: post
  columns: user_id&post_id, text, date

comment-table
  column family: comment
  columns: user_id&comment_id, text, date
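
To make the design concrete, here is a minimal sketch of creating these three tables with the 0.98-era HBase client API. The table and column family names come from the listing above; everything else (an hbase-site.xml on the classpath, a single column family per table) is an assumption.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

// Sketch: create the three tables above, each with its single column family.
// Assumes an hbase-site.xml pointing at the cluster is on the classpath.
object CreateWeiboTables {
  def main(args: Array[String]): Unit = {
    val conf  = HBaseConfiguration.create()
    val admin = new HBaseAdmin(conf)

    val tables = Map(
      "time-index-table" -> "post_index",
      "post-table"       -> "post",
      "comment-table"    -> "comment"
    )

    tables.foreach { case (name, family) =>
      if (!admin.tableExists(name)) {
        val descriptor = new HTableDescriptor(TableName.valueOf(name))
        descriptor.addFamily(new HColumnDescriptor(family))
        admin.createTable(descriptor)
      }
    }

    admin.close()
  }
}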

Data Pre-Process

Data cleaning (TODO)

To finish all the scenarios I've designed, I have put this step on my to-do list.

Segmentation

To extract the terms of a topic, I chose a powerful segmentation library maintained by Stanford University. If you are interested in the segmenter itself, you can read more here:

Stanford Word Segmenter: http://nlp.stanford.edu/software/segmenter.shtml

The following code is a simple demo showing how to use the segmenter to extract the terms of a Weibo topic.

import java.io.PrintStream
import java.util.Properties

import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

object Main {
  def main(args: Array[String]) {
    /** I wrote a class that scans the needed data by date from HBase
      * and transforms it into a List[List[String]] for the following
      * steps, which extract all the terms of each topic. A post and
      * all the comments of one topic go into an inner List; each topic
      * is one element of the outer List.
      */
    val allDocuments = new ExtractDocuments().get(fromDate = "20150420")

    val seg = new SegmentationFactory()
    val props = seg.properties
    val segmenter = seg.classifier(props)

    allDocuments.foreach { document =>
      document.foreach(new Transform().retrieveTerms(segmenter)(_))
    }
  }
}

// Wraps the segmenter call and guards each document against runtime errors.
class Transform {
  def retrieveTerms(segmenter: CRFClassifier[CoreLabel])(content: String) = {
    try {
      segTerms(segmenter)(content)
    } catch {
      case e: Exception => e.printStackTrace()
    }
  }

  private def segTerms(segmenter: CRFClassifier[CoreLabel])(content: String) = {
    // Ensure the segmented Chinese terms are printed as UTF-8 on the console.
    System.setOut(new PrintStream(System.out, true, "utf-8"))

    val segmented = segmenter.segmentString(content)
    println(segmented)
  }
}

// The segmentation factory
class SegmentationFactory {
  private val baseDir = System.getProperty("SegPropertyFactory", "data")

  def properties: Properties = {
    val props = new Properties()
    props.setProperty("sighanCorporaDict", baseDir)
    props.setProperty("serDictionary", baseDir + "/dict-chris6.ser.gz")
    props.setProperty("inputEncoding", "UTF-8")
    props.setProperty("sighanPostProcessing", "true")
    props
  }

  def classifier(props: Properties): CRFClassifier[CoreLabel] = {
    val segmenter = new CRFClassifier[CoreLabel](props)
    segmenter.loadClassifierNoExceptions(baseDir + "/ctb.gz", props)
    segmenter
  }
}
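
The ExtractDocuments class used in Main above is not listed here. A rough sketch of how it could scan time-index-table from a given date and fetch each post's text is shown below; the qualifier names follow the table design earlier, while the row-key encoding and the way comments are joined to a post are assumptions, so treat it as illustrative only.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Illustrative sketch only. Assumes the row key of time-index-table starts
// with the creation date, and the row key of post-table is user_id&post_id.
class ExtractDocuments {
  private val conf = HBaseConfiguration.create()

  def get(fromDate: String): List[List[String]] = {
    val indexTable = new HTable(conf, "time-index-table")
    val postTable  = new HTable(conf, "post-table")

    val scan = new Scan()
    scan.setStartRow(Bytes.toBytes(fromDate))
    val scanner = indexTable.getScanner(scan)

    val documents = scanner.asScala.toList.map { row =>
      val family = Bytes.toBytes("post_index")
      val userId = Bytes.toString(row.getValue(family, Bytes.toBytes("user_id")))
      val postId = Bytes.toString(row.getValue(family, Bytes.toBytes("post_id")))

      val post = postTable.get(new Get(Bytes.toBytes(s"$userId&$postId")))
      val text = Bytes.toString(post.getValue(Bytes.toBytes("post"), Bytes.toBytes("text")))

      // The comments of the same topic would be appended here once the
      // post-to-comment mapping is settled.
      List(text)
    }

    scanner.close()
    indexTable.close()
    postTable.close()
    documents
  }
}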

Console output:
serDictionary=data/dict-chris6.ser.gz
sighanCorporaDict=data
inputEncoding=UTF-8
sighanPostProcessing=true
Loading classifier from data/ctb.gz ... Loading Chinese dictionaries from 1 file:
  data/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
done [45.0 sec].
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb
Loading character dictionary file from data/dict/character_list
Loading affix dictionary from data/dict/in.ctb
[关爱, 身边, 的, 人]
[[, 心, ]]
[李, 开腹]
[好, 视频]
[生动, 形象]
[乘, 未, 入魔, 前, 赶紧, 治治]
[嘿嘿]
[李开复]
[放下, 手机]
[抬起头来]
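
The segmented term lists above are the input to the tf-idf step mentioned at the beginning of this note. As a reference only, a minimal tf-idf sketch over the segmented topics could look like the following; the object name and the weighting details are illustrative, not the implementation I will end up using.

// Illustrative tf-idf over segmented topics: `docs` holds one sequence of
// terms per topic (a post plus all of its comments).
object TfIdf {
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble
    // Document frequency: in how many topics each term appears.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

    docs.map { doc =>
      val counts = doc.groupBy(identity).mapValues(_.size.toDouble).toMap
      counts.map { case (term, count) =>
        val tf  = count / doc.size
        val idf = math.log(n / df(term))
        term -> tf * idf
      }
    }
  }

  // Pick the k highest-weighted terms as the label of a topic.
  def topTerms(weights: Map[String, Double], k: Int = 3): Seq[String] =
    weights.toSeq.sortBy(-_._2).take(k).map(_._1)
}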

To be continued...