--language_filter

Switch

--language_filter lang1 [lang2 lang3 ...]

Description

Creates a language-filtered message table.

Argument and Default Value

Each argument (lang1, lang2, ...) is a two-letter code identifying a language to filter for, for example en. There is no default value.

Details

Uses the langid Python package. By default, messages are lowercased before being passed to langid. To turn this off, use --no_lower.

langid is trained on the following languages: af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu.
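Because only the codes listed above can ever be predicted, it can help to check the requested codes before starting a long run. The following is a minimal sketch, not part of DLATK: the names LANGID_CODES and check_codes are hypothetical, and the set is copied from the list above.

```python
# Codes copied from the list above; LANGID_CODES and check_codes are
# hypothetical helper names, not part of DLATK or langid.
LANGID_CODES = frozenset(
    "af am an ar as az be bg bn br bs ca cs cy da de dz el en eo es et eu "
    "fa fi fo fr ga gl gu he hi hr ht hu hy id is it ja jv ka kk km kn ko "
    "ku ky la lb lo lt lv mg mk ml mn mr ms mt nb ne nl nn no oc or pa pl "
    "ps pt qu ro ru rw se si sk sl sq sr sv sw ta te th tl tr ug uk ur vi "
    "vo wa xh zh zu".split()
)


def check_codes(requested):
    """Return the requested codes, or raise if any is not a langid code."""
    unknown = sorted(set(requested) - LANGID_CODES)
    if unknown:
        raise ValueError("not a langid language code: " + ", ".join(unknown))
    return list(requested)
```

For example, check_codes(["en", "es"]) returns the list unchanged, while check_codes(["english"]) raises a ValueError.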

--clean_messages removes hashtags, URLs and @mentions from each message before it is classified, which generally improves classification.
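The preprocessing and filtering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DLATK's implementation: the regular expressions and the function names preprocess and filter_by_language are hypothetical, and the sketch assumes the original (uncleaned) message is what gets kept. The classifier is passed in as a callable so the sketch runs without langid installed.

```python
import re

# Hypothetical patterns; DLATK's own cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def preprocess(message, lower=True, clean=True):
    """Optionally strip URLs, @mentions and hashtags, then optionally lowercase."""
    if clean:
        # URLs are removed first so that '@' or '#' inside a URL is not
        # treated as a mention or hashtag.
        for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
            message = pattern.sub(" ", message)
    # Collapse whitespace, including gaps left behind by the removals.
    message = " ".join(message.split())
    return message.lower() if lower else message


def filter_by_language(messages, keep, classify, lower=True, clean=True):
    """Return the messages whose predicted language code is in `keep`.

    `classify` is any callable mapping text to a (code, score) pair.
    Assumption: the original message, not the cleaned one, is kept.
    """
    kept = []
    for msg in messages:
        code, _score = classify(preprocess(msg, lower=lower, clean=clean))
        if code in keep:
            kept.append(msg)
    return kept
```

With langid installed, langid.classify (which returns a language code and a score) can be passed as the classify argument, for example filter_by_language(msgs, {"en"}, langid.classify).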

Other Switches

Required Switches:

-d, -t, -c

Optional Switches:

--clean_messages
--no_lower

Example Commands

Remove non-English text, stripping hashtags, URLs and @mentions before classification:

# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en --clean_messages