--add_segmented
Switch
--add_segmented
Description
Creates a word-segmented version of the message table (for Chinese only!).
Argument and Default Value
None
Details
This creates a table called TABLE_seg (where TABLE is the table specified by -t) in the database specified by -d. The message column in the new table holds each message as a list of segmented words. Note that word segmentation is only meaningful for Chinese messages.
Choose the segmentation model with --segmentation_model.
Afterwards, use --add_ngrams_from_tokenized to extract ngrams from the segmented table.
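To read a segmented message back, the stored list has to be decoded. The sketch below assumes the message column holds JSON text, as the escaped example further down this page suggests. The `stored` string is an abbreviated copy of that example, not output from a real table.

```python
import json

# Abbreviated copy of the segmented example on this page (assumed to be JSON text).
stored = r'["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9", "2014", "http://t.cn/RvCypCj"]'

# json.loads turns the \uXXXX escapes back into Chinese characters.
words = json.loads(stored)

# Joining with spaces gives a readable, whitespace-delimited message.
readable = " ".join(words)
print(readable)
```

Each element of `words` is one segmented token, so the bracketed emoji label and the URL each come back as a single item.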
How it works:
The infrastructure writes the (message_id, message) pairs to a temporary file, runs the segmenter from the command line (via os.system), and has the segmented messages written to a second temporary file.
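The round trip through temporary files can be sketched as follows. This is an illustration, not DLATK's own code: the function name `segment_via_tempfiles`, the tab-separated one-message-per-line layout, and the convention that the segmenter takes an input path and an output path as its last two arguments are all assumptions. It also uses subprocess.run instead of os.system so that a failing segmenter raises an error.

```python
import os
import subprocess
import tempfile


def segment_via_tempfiles(pairs, command):
    """Run an external segmenter over (message_id, message) pairs.

    `command` is an argv list (hypothetical interface); the input and
    output file paths are appended to it before it is executed.
    Returns a list of (message_id, tokens) pairs, with ids as strings.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "messages.tsv")
        out_path = os.path.join(tmp, "segmented.tsv")

        # One message per line: message_id <TAB> message (assumed layout).
        with open(in_path, "w", encoding="utf-8") as f:
            for message_id, message in pairs:
                # Collapse tabs/newlines so each message stays on one line.
                clean = " ".join(message.split())
                f.write(f"{message_id}\t{clean}\n")

        # Run the segmenter; check=True raises if it exits with an error.
        subprocess.run(command + [in_path, out_path], check=True)

        # The segmenter is assumed to return space-separated tokens.
        results = []
        with open(out_path, encoding="utf-8") as f:
            for line in f:
                message_id, _, segmented = line.rstrip("\n").partition("\t")
                results.append((message_id, segmented.split()))
    return results
```

Both files live in a temporary directory, so they are removed automatically once the segmented messages have been read back.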
The segmenter introduces some errors (it splits up long numbers and URLs incorrectly), so the Python code repairs them.
By default, Weibo turns each emoji into a bracketed label of the form '[emoji_label_word]', which gets split up by the segmenter, so the Python code joins the pieces back together.
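A minimal sketch of the emoji repair step, assuming the segmenter emits "[" and "]" as standalone tokens around the label text. The function name `rejoin_bracketed` is hypothetical, and the actual DLATK code may handle this differently.

```python
def rejoin_bracketed(tokens):
    """Merge runs such as "[", "label", "]" back into a single "[label]" token."""
    merged = []
    buffer = None  # pieces of a label that has been opened but not yet closed
    for tok in tokens:
        if buffer is None:
            if tok == "[":
                buffer = [tok]      # start collecting a bracketed label
            else:
                merged.append(tok)  # ordinary token, pass through unchanged
        else:
            buffer.append(tok)
            if tok == "]":
                merged.append("".join(buffer))  # label complete, emit as one token
                buffer = None
    if buffer is not None:
        # Unclosed "[": keep the original tokens rather than guessing.
        merged.extend(buffer)
    return merged


tokens = ["[", "\u795e\u9a6c", "]", "\u6b27\u6d32", "\u7ad9"]
repaired = rejoin_bracketed(tokens)
```

Here `repaired` starts with the single token "[神马]", followed by the remaining words unchanged. The same buffering idea could be applied to number and URL fragments, given a rule for recognising where they start and end.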
Example on one message:
Original message:
[神马]欧洲站夏季女装雪纺短袖长裤女士运动时尚休闲套装女夏装2014新款 http://t.cn/RvCypCj
Will turn into:
["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9", "\u590f\u5b63", "\u5973\u88c5", "\u96ea\u7eba",
"\u77ed\u8896", "\u957f\u88e4", "\u5973\u58eb", "\u8fd0\u52a8", "\u65f6\u5c1a", "\u4f11\u95f2",
"\u5957\u88c5", "\u5973", "\u590f\u88c5", "2014", "\u65b0\u6b3e", "http://t.cn/RvCypCj"]
Other Switches
Required Switches:
Optional Switches:
Example Commands
# creates the table msgs_seg, segmented with the Penn Chinese Treebank model
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_segmented