--add_segmented

Switch

--add_segmented

Description

Creates a word-segmented version of the message table. Intended for Chinese text only.

Argument and Default Value

None

Details

This creates a table called TABLE_seg (where TABLE is the table given by -t) in the database given by -d. The message column of the new table holds each message as a list of segmented words. Word segmentation is only meaningful for Chinese messages.

Choose the segmentation model by using --segmentation_model.

Afterwards, use --add_ngrams_from_tokenized to extract ngrams from the segmented table.

How it works:

The infrastructure writes the (message_id, message) pairs to a temporary file, runs the segmentor from the command line (via os.system), and has the segmented messages written to a second temporary file.
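The temp-file round trip can be sketched roughly as follows. This is only an illustration, not the DLATK code: the function name, the {infile}/{outfile} command template and the stand-in COPY_COMMAND are all made up for the example, and it assumes a POSIX shell. As a simplification it writes only the message text to the input file and keeps the message ids in memory, relying on line order to pair them up again.

```python
import os
import shlex
import sys
import tempfile


def segment_via_tempfiles(rows, command_template):
    """Run an external segmentor over (message_id, message) pairs.

    command_template is a shell command containing {infile} and {outfile}
    (a hypothetical convention used only in this sketch).
    """
    ids = []
    with tempfile.TemporaryDirectory() as tmp:
        infile = os.path.join(tmp, "messages.txt")
        outfile = os.path.join(tmp, "segmented.txt")
        with open(infile, "w", encoding="utf-8") as f:
            for message_id, message in rows:
                ids.append(message_id)
                # One message per line, so output line k belongs to ids[k].
                f.write(" ".join(message.split()) + "\n")
        command = command_template.format(
            infile=shlex.quote(infile), outfile=shlex.quote(outfile)
        )
        status = os.system(command)
        if status != 0:
            raise RuntimeError(f"segmentor command failed with status {status}")
        with open(outfile, encoding="utf-8") as f:
            # Assumes the segmentor separates words with whitespace.
            segmented = [line.split() for line in f]
    return list(zip(ids, segmented))


# Stand-in "segmentor" that only copies the file, so the sketch runs
# without any segmentation tool installed.
COPY_COMMAND = (
    shlex.quote(sys.executable)
    + ' -c "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])"'
    + " {infile} {outfile}"
)
```

To try it with a real segmentor, replace COPY_COMMAND with that tool's own command line, keeping the two placeholders for the input and output files.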

The segmentor mishandles some input (it splits up long numbers and breaks URLs apart incorrectly), so the Python code repairs these cases afterwards.
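One possible way to do this kind of repair is to look up URLs and digit runs in the original message and glue back together any run of tokens that spells one of them. The sketch below assumes exactly that; the regular expression, the function name and the way the pieces are split in the usage lines are illustrative guesses, not the actual DLATK logic or real segmentor output.

```python
import re

# Spans that should stay whole (assumed patterns): URLs and runs of digits.
PROTECTED = re.compile(r"https?://\S+|\d+")


def rejoin_protected(tokens, original):
    """Merge consecutive tokens whose concatenation is a URL or number
    that appears in the original message."""
    protected = set(PROTECTED.findall(original))
    repaired = []
    i = 0
    while i < len(tokens):
        merged = None
        candidate = ""
        for j in range(i, len(tokens)):
            candidate += tokens[j]
            # Stop growing once no protected span starts with the candidate.
            if not any(p.startswith(candidate) for p in protected):
                break
            if candidate in protected:
                merged = (candidate, j + 1)  # remember the longest match
        if merged is not None:
            repaired.append(merged[0])
            i = merged[1]
        else:
            repaired.append(tokens[i])
            i += 1
    return repaired


# Hypothetical split of the URL from the example message.
url_fixed = rejoin_protected(
    ["新款", "http", "://", "t.cn", "/", "RvCypCj"],
    "新款 http://t.cn/RvCypCj",
)
# Hypothetical split of a long number.
number_fixed = rejoin_protected(["2014", "0601"], "20140601")
```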

By default, Weibo turns each emoji into '[emoji_label_word]', which gets split up by the segmentor, so the Python code joins the pieces back together.
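Joining the bracketed labels back together could look something like this. It is a minimal sketch that assumes the segmentor leaves the brackets at the start and end of the split pieces; the function name and the max_parts limit are invented for the example.

```python
def rejoin_bracket_labels(tokens, max_parts=5):
    """Glue '[', 'label', ']' style pieces back into one '[label]' token."""
    repaired = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("[") and not token.endswith("]"):
            # Only look a few tokens ahead, so a stray '[' cannot
            # swallow the rest of the message.
            window = tokens[i:i + max_parts]
            close = next(
                (k for k, t in enumerate(window) if t.endswith("]")), None
            )
            if close is not None:
                repaired.append("".join(window[:close + 1]))
                i += close + 1
                continue
        repaired.append(token)
        i += 1
    return repaired


labels_fixed = rejoin_bracket_labels(["[", "神马", "]", "欧洲", "站"])
```

Tokens that are already whole, and an opening bracket with no closing bracket nearby, are passed through unchanged.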

Example on one message:

Original message:

[神马]欧洲站夏季女装雪纺短袖长裤女士运动时尚休闲套装女夏装2014新款  http://t.cn/RvCypCj

Will turn into:

["[\u795e\u9a6c]", "\u6b27\u6d32", "\u7ad9", "\u590f\u5b63", "\u5973\u88c5", "\u96ea\u7eba",
"\u77ed\u8896", "\u957f\u88e4", "\u5973\u58eb", "\u8fd0\u52a8", "\u65f6\u5c1a", "\u4f11\u95f2",
"\u5957\u88c5", "\u5973", "\u590f\u88c5", "2014", "\u65b0\u6b3e", "http://t.cn/RvCypCj"]

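The stored list looks like JSON with non-ASCII characters written as \uXXXX escapes. Assuming that is the format, the readable words can be recovered with Python's json module, as in this short sketch (the variable names are just for illustration, and only the first three tokens of the example are used):

```python
import json

# First three tokens of the segmented example message, as stored.
stored = '["[\\u795e\\u9a6c]", "\\u6b27\\u6d32", "\\u7ad9"]'

# json.loads turns the escapes back into the actual characters.
tokens = json.loads(stored)

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which reproduces the stored form.
roundtrip = json.dumps(tokens)
```
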
Other Switches

Required Switches:

-d, -g, -t

Optional Switches:

--segmentation_model

Example Commands

# creates the table msgs_seg using the Penn Chinese Treebank segmentation model
./dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_segmented