
--print_tokenized_lines

Switch

--print_tokenized_lines

Description

Prints a tokenized version of the messages, one message per line.

Argument and Default Value

You must supply an output file name.

Details

Looks for the table TABLENAME_tok, where TABLENAME is specified by -t. Each line of the output file contains the message id, language, and tokens. Example:

# Sample message from tokenized input table:
# ["is", "worth", "it", "just", "follow", "your", "heart", "its", "never", "wrong", ":", "-", "rrb", "-"]
# Output line:
# 128675651556356096 en is worth it just follow your heart its never wrong : - rrb -
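The output is plain text and easy to consume downstream. As a minimal sketch (not part of DLATK), assuming whitespace-separated fields as in the example line above, each output line can be split back into a message id, a language code, and the token list:

# Sketch only (not part of DLATK): parse a file produced by
# --print_tokenized_lines, assuming whitespace-separated fields as above.
def parse_tokenized_line(line):
    fields = line.split()
    message_id, language = fields[0], fields[1]
    tokens = fields[2:]  # remaining fields are the message tokens
    return message_id, language, tokens

# Hypothetical usage; the file name matches the example command below.
with open("twt_20mil.txt") as f:
    for line in f:
        if not line.strip():
            continue
        message_id, language, tokens = parse_tokenized_line(line)
        print(message_id, language, len(tokens))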

Other Switches

Required Switches: -d, -t
Optional Switches: --feat_whitelist

Example Commands

# General command
python fwInterface.py -d DATABASE -t TABLE --print_tokenized_lines OUTPUTFILE_NAME

# Example command
# searches for the table twt_20mil_tok
# outputs the file twt_20mil.txt
python fwInterface.py -d twitterGH -t twt_20mil --print_tokenized_lines twt_20mil.txt
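Note that --print_tokenized_lines only reads an existing TABLENAME_tok table; it does not create one. If the tokenized table is missing, it typically has to be generated first with DLATK's tokenization preprocessing step (see the Preprocessing flags section). A hedged sketch, assuming the --add_tokenized flag and the same database and message table as above; additional switches may be required depending on your setup:

# Possible prerequisite (sketch): build twt_20mil_tok before printing it.
# Check the Preprocessing flags section for any additional required switches.
python fwInterface.py -d twitterGH -t twt_20mil --add_tokenized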

