I've written a simple application that uses the Weibo API to fetch friends' posts and all of their comments on Weibo (the biggest social media platform in China).
I store all of the Weibo data in the HBase I built in the previous section, then compute the tf-idf of each post and its comments to label each topic with a term.
The following tables show how I designed the storage. It's not necessarily the best solution, so if there is a better architecture that fits my needs, please let me know and I will enhance my system and this note.
time-index-table

| post_index | | |
|---|---|---|
| create_time | user_id | post_id |

post-table

| post | | |
|---|---|---|
| user_id&post_id | text | date |

comment-table

| comment | | |
|---|---|---|
| user_id&comment_id | text | date |
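Since the scans below depend on these composite row keys, here is a small illustrative sketch of how they can be built. `RowKeys` and its method names are my own assumptions for illustration, not part of the stored schema:

```scala
// Sketch of the composite row keys the tables above assume.
// HBase sorts rows lexicographically by key, so the "&"-joined
// user_id&post_id key keeps one user's posts adjacent on disk.
object RowKeys {
  def postKey(userId: String, postId: String): String =
    s"$userId&$postId"

  // The time-index-table is keyed by create_time first, so a
  // date-range scan yields the (user_id, post_id) pairs for a period.
  def timeIndexKey(createTime: String, userId: String, postId: String): String =
    s"$createTime&$userId&$postId"
}
```

Leading with `create_time` in the index table is what makes the "scan needed data by date" step below a cheap range scan.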
// TODO
To finish all the scenarios I've designed, I've put this step (tf-idf labeling) on my todo list.
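As a rough sketch of that todo item, here is one way the tf-idf step could look, assuming each topic arrives as a `List[String]` of segmented terms (the `List[List[String]]` shape used later). `TfIdf`, its method names, and the labeling rule (pick the top-scoring term) are my assumptions, not the final implementation:

```scala
// Minimal tf-idf sketch: each inner List[String] holds the segmented
// terms of one topic (a post plus all of its comments).
object TfIdf {
  def tfidf(docs: List[List[String]]): List[Map[String, Double]] = {
    val n = docs.size.toDouble
    // Document frequency: in how many topics does each term appear?
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    docs.map { doc =>
      // Term frequency within this topic, normalized by topic length.
      val tf = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble / doc.size }
      tf.map { case (term, f) => term -> f * math.log(n / df(term)) }
    }
  }

  // Label a topic with its highest-scoring term.
  def label(scores: Map[String, Double]): String = scores.maxBy(_._2)._1
}
```

A term that appears in every topic gets idf = log(1) = 0, so common filler words are automatically ruled out as labels.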
To extract the terms of a topic, I chose a powerful segmentation library from Stanford University. If you are interested in the segmentation topic, you can read more here:
Stanford Word Segmenter: http://nlp.stanford.edu/software/segmenter.shtml
The following code is a simple demo that shows how to use the segmenter to extract the terms of a Weibo topic.
```scala
import java.io.PrintStream
import java.util.Properties
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

object Main {
  def main(args: Array[String]): Unit = {
    /* I wrote a class that scans the needed data by date from HBase
     * and transforms it into a List[List[String]] for the following
     * steps to extract all terms of each topic. A post and all comments
     * of one topic go in the inner List; each topic is one instance
     * of the outer List.
     */
    val allDocuments = new ExtractDocuments().get(fromDate = "20150420")
    val seg = new SegmentationFactory()
    val props = seg.properties
    val segmenter = seg.classifier(props)
    allDocuments.foreach { document =>
      document.foreach(
        new Transform().retrieveTerms(segmenter)(_)
      )
    }
  }
}

class Transform {
  def retrieveTerms(segmenter: CRFClassifier[CoreLabel])(content: String): Unit = {
    try {
      segTerms(segmenter)(content)
    } catch {
      case e: Exception => e.printStackTrace()
    }
  }

  private def segTerms(segmenter: CRFClassifier[CoreLabel])(content: String): Unit = {
    // Force UTF-8 on stdout so Chinese terms print correctly.
    System.setOut(new PrintStream(System.out, true, "utf-8"))
    val segmented = segmenter.segmentString(content)
    println(segmented)
  }
}

// The segmentation factory
class SegmentationFactory {
  private val baseDir = System.getProperty("SegPropertyFactory", "data")

  def properties: Properties = {
    val props = new Properties()
    props.setProperty("sighanCorporaDict", baseDir)
    props.setProperty("serDictionary", baseDir + "/dict-chris6.ser.gz")
    props.setProperty("inputEncoding", "UTF-8")
    props.setProperty("sighanPostProcessing", "true")
    props
  }

  def classifier(props: Properties): CRFClassifier[CoreLabel] = {
    val segmenter = new CRFClassifier[CoreLabel](props)
    segmenter.loadClassifierNoExceptions(baseDir + "/ctb.gz", props)
    segmenter
  }
}
```
Console output:

```
serDictionary=data/dict-chris6.ser.gz
sighanCorporaDict=data
inputEncoding=UTF-8
sighanPostProcessing=true
Loading classifier from data/ctb.gz ... Loading Chinese dictionaries from 1 file:
data/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
done [45.0 sec].
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb
Loading character dictionary file from data/dict/character_list
Loading affix dictionary from data/dict/in.ctb
[关爱, 身边, 的, 人]
[[, 心, ]]
[李, 开腹]
[好, 视频]
[生动, 形象]
[乘, 未, 入魔, 前, 赶紧, 治治]
[嘿嘿]
[李开复]
[放下, 手机]
[抬起头来]
```
To be continued...