--add_tokenized
Switch
--add_tokenized
Description
Creates a tokenized version of the message table.
Argument and Default Value
None
Details
This creates a table called TABLE_tok (where TABLE is the table specified by -t) in the database specified by -d. The message column in this new table holds each message as a JSON-encoded list of tokens. Tokenization uses DLATK's built-in tokenizer, Happier Fun Tokenizer, which is an extension of Happy Fun Tokenizer.
If your message is:
"Mom said she's gonna think about getting a truck."
the message column of the corresponding row in the tokenized table will look like this:
["mom", "said", "she's", "gonna", "think", "about", "getting", "a", "truck", "."]
To use the tokenized table in standalone Python scripts, parse each message with json.loads(message), which turns the stored JSON string into a list of tokens.
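As a minimal sketch of that parsing step in Python: the standard library call that decodes a JSON string is json.loads. The variable names below (raw_message, tokens, counts) are illustrative and not part of DLATK, and the string is the example row from above rather than a value fetched from a real database.

```python
import json
from collections import Counter

# Value of the `message` column as it would be read from TABLE_tok.
# Hard-coded here from the example above; in a real script it would come
# from a query against the database.
raw_message = (
    '["mom", "said", "she\'s", "gonna", "think", '
    '"about", "getting", "a", "truck", "."]'
)

# Decode the JSON text into a Python list of strings.
tokens = json.loads(raw_message)

# Once decoded, the tokens behave like any other list, e.g. for counting.
counts = Counter(tokens)

print(len(tokens))     # number of tokens in the message
print(tokens[2])       # third token
print(counts["truck"]) # how often "truck" occurs
```

The same call can be applied to every row returned by a query on the tokenized table; only the source of raw_message changes.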
Other Switches
Required Switches: -d, -c, -t
Example Commands
# Creates the table: msgs_tok
dlatkInterface.py -d dla_tutorial -t msgs -c message_id --add_tokenized
mysql> select message from msgs_tok limit 1;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| message |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["can", "you", "believe", "it", "?", "?", "my", "mom", "wouln't", "let", "me", "go", "out", "on", "my", "b'day", "...", "i", "was", "really", "really", "mad", "at", "her", ".", "still", "am", ".", "but", "i", "got", "more", "presents", "from", "my", "friends", "this", "year", ".", "so", "thats", "great", "."] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+