Mallet LDA Interface
NOTE: DLATK can now produce LDA topics through a simpler interface: DLATK LDA Interface. Although the steps described below are still valid, you are advised to use the new automated interface.
For a conceptual overview of LDA, see this intro.
Step 0: Get Access To Mallet
This tutorial uses Mallet.
Download Mallet from http://mallet.cs.umass.edu/download.php and install it in your home directory.
NOTE: if you plan to run on large datasets (~15M Facebook messages or similar) you may have to adjust the memory setting in your mallet script file. See the note in Step 4 ("Run LDA with Mallet").
Step 1: (If necessary) Create sample tweet table
If necessary, create a message table to run LDA on:
use dla_tutorial;
create table msgs_lda like msgs;
insert into msgs_lda select * from msgs where rand() < (2/6);  -- keep a random ~1/3 sample of the messages
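You can sanity-check the sample size (it should be roughly a third of the rows in msgs):
select count(*) from msgs_lda;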
We will also create an output folder in your home directory:
mkdir ~/lda_tutorial
Step 2, option A: Create a tokenized table in MySQL and export it
Uses the DLATK infrastructure to tokenize the messages and print the tokenized messages to a file:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_tokenized #creates table *msgs_lda_tok* in *dla_tutorial*
dlatkInterface.py -d dla_tutorial -t msgs_lda --print_tokenized_lines ~/lda_tutorial/msgs_lda.txt
This will create the file ~/lda_tutorial/msgs_lda.txt:
NOTE: if your messages contain newlines, this step may cause issues down the line; it is recommended to replace all newlines with a representative character before proceeding, for example as shown below.
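One way to do this, assuming you are willing to modify the msgs_lda copy in place, is a MySQL replace (the space used here as the replacement character is an illustrative choice):
update msgs_lda set message = replace(replace(message, '\r', ' '), '\n', ' ');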
OPTIONAL: If you want to restrict your LDA words to a subset of the words in the corpus (e.g., those said by at least N people, or said in no more than Y percent of messages), you can use the --feat_whitelist flag with a filtered feature table (see the sketch below), e.g.:
dlatkInterface.py -d dla_tutorial -t msgs_lda --print_tokenized_lines ser1_filt.txt --feat_whitelist 'feat$1gram$msgs_lda$user_id$16to16$0_005'
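If you do not already have such a feature table, the following sketch shows one way it might be built; it assumes the 0_005 suffix above came from DLATK's --feat_occ_filter with --set_p_occ 0.005 (check the resulting table name, since the exact suffix depends on the threshold you choose):
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id --add_ngrams -n 1
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id \
-f 'feat$1gram$msgs_lda$user_id$16to16' --feat_occ_filter --set_p_occ 0.005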
Step 2, option B: Generate a feature table and convert it to a Mallet-formatted text file
You can use this step in place of step 2, option A.
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_ngrams -n 1
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id -f 'feat$1gram$msgs_lda$message_id$16to16' --print_joined_feature_lines msgs_lda.txt
Step 3: Format for Mallet
Prepares the messages and tokenizes them again for Mallet, removing stopwords and non-English characters. For help, run ./bin/mallet import-file --help
./bin/mallet import-file --input ~/lda_tutorial/msgs_lda.txt \
--token-regex "(#|@)?(?!(\W)\2+)([a-zA-Z\_\-\'0-9\(-\@]{2,})" \
--output ~/lda_tutorial/msgs_lda.mallet \
--remove-stopwords --keep-sequence [--extra-stopwords EXTRA_STOPWORDS_FILE]
This will create the file ~/lda_tutorial/msgs_lda.mallet.
Step 4: Run LDA with Mallet
This is the actual LDA step, which might take a while (four and a half days on 20 million tweets). For help, run ./bin/mallet train-topics --help
./bin/mallet train-topics --input ~/lda_tutorial/msgs_lda.mallet \
--alpha 5 --num-topics 2000 --optimize-burn-in 0 --output-model ~/lda_tutorial/msgs_lda.model \
--output-state ~/lda_tutorial/msgs_lda_state.gz \
--output-topic-keys ~/lda_tutorial/msgs_lda.keys
Here alpha is a prior on the number of topics per document. The other hyper-parameter, beta (which we usually do not change), is a prior on the number of words per topic.
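If you do want to set beta, Mallet exposes it directly; this is the same training command with beta given explicitly at its Mallet default of 0.01, shown only to make the knob visible:
./bin/mallet train-topics --input ~/lda_tutorial/msgs_lda.mallet \
--alpha 5 --beta 0.01 --num-topics 2000 --optimize-burn-in 0 \
--output-model ~/lda_tutorial/msgs_lda.model \
--output-state ~/lda_tutorial/msgs_lda_state.gz \
--output-topic-keys ~/lda_tutorial/msgs_lda.keys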
This creates the following files:
msgs_lda.model
msgs_lda_state.gz
msgs_lda.keys
The file msgs_lda.keys contains the topics; at this point it is good to inspect them to see whether you should change any of the above parameters (number of topics, alpha, beta, etc.):
> head ~/lda_tutorial/msgs_lda.keys
0 0.0025 italian pig americans wolf american straw pigs 04 stick film defamation censored pinocchio brick 03 sopranos pig's offensive june wop
1 0.0025 yoga levi yea pages yay morning boy mix lorraine evening bit law she's joe honey exam property study gay spring
2 0.0025 projection sng recap mtt dried 18 sessions planetpoker 5.50 pokerstars faulty limit summary music-lovers end-users 1540-1608 raritan tester dch brigade
3 0.0025 :) urllink misses marion lovely managed melbourne sean pub sort round rock bus :( tent ;) perth we've elephant flat
4 0.0025 jill alcohol aaaannnndd soorry georgie celica precariously timid mirrors cam defined cat's praying expects hitler valley asses
5 0.0025 apc uhm downsides surpression tly pary undies spybot band-aids pow sprinkles trashed bloated celebrated paramaters thas elses country's immortal association
Note: when dealing with giant data sets (for example, when creating Facebook topics) you might encounter the error Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. In that case, edit the line MEMORY=1g in ~/Mallet/bin/mallet, raising the 1g value to 2g, 4g, or higher, depending on how much RAM your machine has.
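For example, assuming Mallet is installed at ~/Mallet, you can raise the limit to 4g with:
sed -i 's/^MEMORY=1g$/MEMORY=4g/' ~/Mallet/bin/mallet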
Step 5: Add message IDs to state file
Adds the message IDs to the topic distributions and stores the result in lda_topics:
dlatkInterface.py --add_message_id ~/lda_tutorial/msgs_lda.txt ~/lda_tutorial/msgs_lda_state.gz --output_name ~/lda_tutorial/lda_topics
This creates the file ~/lda_tutorial/lda_topics:
> head ~/lda_tutorial/lda_topics
#doc source pos typeindex type topic
#alpha : 0.0025 0.0025 0.0025 0.0025 ....
#beta : 0.01
0 2 0 0 miss 1840
0 2 1 1 su 661
0 2 2 2 living 623
0 2 3 3 skills 466
0 2 4 4 teacher 466
0 2 5 5 form 1319
Step 6: Import state file into database
Imports the topic-message probability distributions in a raw, JSON-like format that is not directly readable by DLATK:
dlatkInterface.py -d dla_tutorial -t msgs_lda_tok --add_lda_messages ~/lda_tutorial/lda_topics
This creates the table msgs_lda_tok_lda$lda_topics in the database dla_tutorial:
select message from msgs_lda_tok_lda$lda_topics limit 1;
| [{"topic_id": "296", "doc": "10", "term": "urllink", "index": "0", "term_id": "191", "message_id": "40"}, {"topic_id": "947", "doc": "10", "term": "busy", "index": "1", "term_id": "249", "message_id": "40"}, {"topic_id": "1804", "doc": "10", "term": "roadway", "index": "2", "term_id": "250", "message_id": "40"}, {"topic_id": "296", "doc": "10", "term": "urllink", "index": "3", "term_id": "191", "message_id": "40"}]
NOTE - "Duplicate entry 'xxxx' for key 'PRIMARY'" errors may be indicative of an issues with newlines. See step 2 for a solution.
Step 7: Create topic-word distributions
Creates the readable topic-word distributions from the messages:
dlatkInterface.py -d dla_tutorial -t msgs_lda -m 'msgs_lda_tok_lda$lda_topics' --create_dists
This creates the following files:
msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv
msgs_lda_tok_lda.lda_topics.lik.csv
msgs_lda_tok_lda.lda_topics.loglik.csv
msgs_lda_tok_lda.lda_topics.topicGivenWord.csv
msgs_lda_tok_lda.lda_topics.wordGivenTopic.csv
Step 8: Add topic-lexicon to lexicon database
Generates the lexicons based on the different probability distribution types.
Topic given word:
dlatkInterface.py --lex_interface --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.topicGivenWord.csv \
-c msgs_lda_cp
This will create the table msgs_lda_cp in the database dlatk_lexica. You can change the database with the flag --lexicondb:
mysql> select * from msgs_lda_cp limit 5;
+----+----------------+----------+--------------------+
| id | term | category | weight |
+----+----------------+----------+--------------------+
| 1 | erm | 274 | 0.0625 |
| 2 | productively | 274 | 0.333333333333333 |
| 3 | jared | 274 | 0.125 |
| 4 | book | 274 | 1 |
| 5 | sketch | 274 | 0.0909090909090909 |
+----+----------------+----------+--------------------+
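To sanity-check a single topic, sort its terms by weight (category 274 is the topic shown above):
mysql> select term, weight from msgs_lda_cp where category = '274' order by weight desc limit 10;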
Frequency, thresholded to loglik >= 50:
dlatkInterface.py --lex_interface --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv \
-c msgs_lda_freq_t50ll
This will create the table msgs_lda_freq_t50ll in the database dlatk_lexica. You can change the database with the flag --lexicondb:
mysql> select * from msgs_lda_freq_t50ll limit 5;
+----+------------------+----------+--------+
| id | term | category | weight |
+----+------------------+----------+--------+
| 1 | wildest | 766 | 1 |
| 2 | kazadoom | 766 | 1 |
| 3 | sentirnos | 766 | 1 |
| 4 | charlie | 766 | 1 |
| 5 | commondreams.org | 766 | 1 |
+----+------------------+----------+--------+
Step 9: Extract features from lexicon
You’re now ready to start using the topic distribution lexicon:
dlatkInterface.py -d DATABASE -t MESSAGE_TABLE -c GROUP_ID --add_lex_table -l msgs_lda_cp --weighted_lexicon
(Always extract features using the _cp lexicon. The freq_t50ll lexicon is only used when generating topic tagclouds, via --topic_tagcloud --topic_lexicon; see the sketch below.)
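For reference, a hypothetical tagcloud invocation might look like the following; the feature table name, outcome table, and outcome field are placeholders (check the table that --add_lex_table actually created), and the exact flags may vary by DLATK version:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id \
-f 'feat$cat_msgs_lda_cp_w$msgs_lda$user_id$1gra' \
--outcome_table OUTCOME_TABLE --outcomes OUTCOME_FIELD \
--topic_tagcloud --topic_lexicon msgs_lda_freq_t50ll \
--output_name ~/lda_tutorial/topic_tagcloud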