--lex_interface
Switch
--lex_interface
Description
Overrides the argument parser in dlatkInterface and passes all remaining arguments to lexInterface. lexInterface is often used to upload CSV files to MySQL during the LDA process. See the DLATK LDA Interface tutorial for more details.
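The hand-off described above can be pictured with a small sketch. This is not DLATK's actual implementation: `build_lex_parser` and `dispatch` are hypothetical names, and only three of the lexInterface flags listed under Details are modelled. It assumes the switch is simply stripped from the argument list before a second parser takes over.

```python
import argparse


def build_lex_parser():
    # Hypothetical stand-in for the lexInterface parser (three flags only).
    parser = argparse.ArgumentParser(prog="lexInterface.py")
    parser.add_argument("--topic_csv", "--weighted_file",
                        action="store_true", dest="topic_csv")
    parser.add_argument("--topicfile")
    parser.add_argument("-c", "--create")
    return parser


def dispatch(argv):
    """Return ("lex", namespace) if --lex_interface is present, else ("dlatk", argv)."""
    if "--lex_interface" in argv:
        # Drop the switch itself and let the second parser see everything else.
        remaining = [arg for arg in argv if arg != "--lex_interface"]
        return "lex", build_lex_parser().parse_args(remaining)
    return "dlatk", argv


mode, args = dispatch(["--lex_interface", "--topic_csv",
                       "--topicfile=/tmp/topics.csv", "-c", "msgs_lda_cp"])
print(mode, args.topic_csv, args.topicfile, args.create)
```

Without `--lex_interface`, the sketch returns the argument list untouched, which mirrors the idea that the usual dlatkInterface parsing applies.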
Details
The full list of available flags in lexInterface:
python lexInterface.py -h
usage: lexInterface.py [-h] [-f FILENAME] [-g GFILE] [--sparsefile SPARSEFILE]
                       [--weightedsparsefile WEIGHTEDSPARSEFILE]
                       [--dicfile DICFILE] [--topicfile TOPICFILE]
                       [--topic_csv] [--filter] [-n NAME] [-c CREATE] [-p]
                       [--print_weighted] [--pprint] [-w WHERE] [-u UNION]
                       [-i INTERSECT] [--super_topic SUPERTOPIC] [-r]
                       [--depol] [--ungroup] [--compare COMPARE]
                       [--annotate_senses SENSE_ANNOTATED_LEX]
                       [--topic_threshold TOPICTHRESHOLD] [-a] [-l]
                       [--corpus_examples] [--corpus_samples] [-e] [-d DB]
                       [-t TABLE] [--lexicondb DB] [--corpus_term_field FIELD]
                       [--corpus_message_field FIELD]
                       [--corpus_messageid_field FIELD] [--min_word_freq NUM]
                       [--lexicon_category CATEGORY] [--num_rand_messages NUM]
On Features Class.
optional arguments:
  -h, --help            show this help message and exit
:
  -f FILENAME, --file FILENAME
                        Lexicon Filename (default: None)
  -g GFILE, --gfile GFILE
                        Lexicon Filename in google format (default: None)
  --sparsefile SPARSEFILE
                        Lexicon Filename in sparse format (default: None)
  --weightedsparsefile WEIGHTEDSPARSEFILE
                        Lexicon Filename in weighted sparse format (default:
                        None)
  --dicfile DICFILE     Lexicon Filename in dic (LIWC) format (default: None)
  --topicfile TOPICFILE
                        Lexicon Filename in topic format (default: None)
  --topic_csv, --weighted_file
                        tells interface to use the topic csv format to make a
                        weighted lexicon (default: False)
  --filter              Allows lexicon filtering if True (default: False)
  -n NAME, --name NAME  Existing Lexicon Table Name (will load) (default:
                        None)
  -c CREATE, --create CREATE
                        Create a new lexicon table (must supply new lexicon
                        name, and either -f, -g or -n) (default: None)
  -p, --print           print lexicon to stdout (default csv format) (default:
                        False)
  --print_weighted      print lexicon to stdout (weighted csv format)
                        (default: False)
  --pprint              print lexicon to stdout as pprint output (default:
                        False)
  -w WHERE, --where WHERE
                        where phrase to add to sql query (default: None)
  -u UNION, --union UNION
                        Unions two tables and uses the result as myLexicon
                        (default: None)
  -i INTERSECT, --intersect INTERSECT
                        Intersects two tables and uses the result as myLexicon
                        (default: None)
  --super_topic SUPERTOPIC
                        Maps the current lexicon with a super topic mapping
                        lexicon to make a super_topic (default: None)
  -r, --randomize       Randomizes the categories of terms (default: False)
  --depol               Depolarize the categories (removes +/-) (default:
                        False)
  --ungroup             places each word in its own category (default: False)
  --compare COMPARE     Unions two tables and uses the result as myLexicon
                        (default: None)
  --annotate_senses SENSE_ANNOTATED_LEX
                        Asks the user to annotate senses of words and creates
                        a new lexicon with senses (new lexicon name is the
                        parameter) (default: None)
  --topic_threshold TOPICTHRESHOLD
                        sets the threshold to use for a csv topicfile
                        (default: None)
  -a, --add_terms       Adds terms from the loaded lexicon to a given corpus
                        (options below) (default: False)
  -l, --corpus_lexicon  Load a lexicon based on finding words in a given
                        corpus (BETA) (options below) (default: False)
  --corpus_examples     Find example instances of words in the given corpus
                        (using rlike; equal number for all words) (default:
                        False)
  --corpus_samples      Find sample of matches for lexicon. (default: False)
  -e, --expand_lexicon  Expands the lexicon to more terms. (default: False)
Terms OR Corpus Lexicon Options:
  -d DB, --corpus_db DB
                        Corpus database to use [default: dla_tutorial]
  -t TABLE, --corpus_table TABLE
                        Corpus table to use [default: msgs]
  --lexicondb DB        The database which stores all lexicons. (default:
                        dlatk_lexica)
  --corpus_term_field FIELD
                        field of the corpus table that contains terms (lexicon
                        table always uses 'term') [default: term]
  --corpus_message_field FIELD
                        field of the corpus table that contains the actual
                        message [default: message]
  --corpus_messageid_field FIELD
                        field of the table that contains message ids (set to
                        '' to not use group by [default: message_id]
  --min_word_freq NUM   minimum number of instances to include in lexicon (-l
                        option) [default: 1000]
  --lexicon_category CATEGORY
                        category in lexicon to get random samples from
                        (default: None)
  --num_rand_messages NUM
                        number of random messages to select when getting
                        samples from lexicon category (default: 100)
Example Commands
Upload the topic-given-word probability distributions generated during LDA. This creates a table called msgs_lda_cp in the dlatk_lexica database.
dlatkInterface.py --lex_interface --topic_csv \
--topicfile=/home/user/lda_tutorial/msgs_lda_tok_lda.lda_topics.topicGivenWord.csv \
-c msgs_lda_cp
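When the upload is scripted, for example once per LDA run, it can help to assemble the command programmatically. The following is a minimal sketch: `build_upload_command` is a hypothetical helper and not part of DLATK, and it uses only the flags shown in the example above. It prints the command rather than running it.

```python
import shlex


def build_upload_command(topic_csv_path, lexicon_name):
    """Return the argument list for uploading a topic CSV as a weighted lexicon."""
    return [
        "dlatkInterface.py",
        "--lex_interface",                   # hand all other flags to lexInterface
        "--topic_csv",                       # read the topic csv format
        f"--topicfile={topic_csv_path}",
        "-c", lexicon_name,                  # name of the lexicon table to create
    ]


argv = build_upload_command(
    "/home/user/lda_tutorial/msgs_lda_tok_lda.lda_topics.topicGivenWord.csv",
    "msgs_lda_cp",
)
# shlex.join quotes any argument that needs it, so the result is safe to paste into a shell.
print(shlex.join(argv))
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.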