Package sentspace

Sentspace 0.0.2 (C) 2020-2022 EvLab, MIT BCS. All rights reserved.

Homepage: https://sentspace.github.io/sentspace

For questions, email:

{gretatu,asathe} @ mit.edu

sentspace

About

sentspace is an open-source tool for characterizing text using diverse features related to how humans process and understand language. sentspace characterizes textual input using cognitively motivated lexical, syntactic, and semantic features computed at the token and sentence levels. Features are derived from psycholinguistic experiments, large-scale corpora, and theoretically motivated models of language processing. The sentspace features fall into two core modules: Lexical and Contextual. The Lexical module operates on individual lexical items (words) within a sentence and computes a summary representation by combining information across the words in the sentence. This module includes features such as frequency, concreteness, age of acquisition, lexical decision latency, and contextual diversity. The Contextual module operates on sentences as a whole and includes syntactic features, such as the depth of center embedding. Note that using the Contextual module requires additional setup steps (see the setup section below).

New modules can be easily added to SentSpace to provide additional ways to characterize text. In this manner, sentspace provides a quantitative and interpretable representation of any sentence.

GitHub repository: http://github.com/sentspace/sentspace

Screencast video demo: https://youtu.be/a66_nvcCakw

CLI usage demo: an animated demo is included in the GitHub repository.

Documentation

Documentation is available online (click on the title above).

Usage

1. CLI

Example: get lexical and embedding features for stimuli from a csv containing columns for 'sentence' and 'index'. First, inspect the available options (an example invocation follows the help output below):

$ python3 -m sentspace -h
usage: 


positional arguments:
  input_file            path to input file or a single sentence. If supplying a file, it must be .csv .txt or .xlsx, e.g., example/example.csv

optional arguments:
  -h, --help            show this help message and exit
  -sw STOP_WORDS, --stop_words STOP_WORDS
                        path to delimited file of words to filter out from analysis, e.g., example/stopwords.txt
  -b BENCHMARK, --benchmark BENCHMARK
                        path to csv file of benchmark corpora For example benchmarks/lexical/UD_corpora_lex_features_sents_all.csv
  -p PARALLELIZE, --parallelize PARALLELIZE
                        use multiple threads to compute features? disable using `-p False` in case issues arise.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to output directory where results may be stored
  -of {pkl,tsv}, --output_format {pkl,tsv}
  -lex LEXICAL, --lexical LEXICAL
                        compute lexical features? [False]
  -con CONTEXTUAL, --contextual CONTEXTUAL
                        compute syntactic features? [False]
  --emb_data_dir EMB_DATA_DIR
                        path to output directory where results may be stored
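
For instance, to compute lexical and contextual features for the example file and write results to an output directory, an invocation might look like the following (a hedged example built from the flags above; adjust paths to your setup):

$ python3 -m sentspace -lex 1 -con 1 -o ./out example/example.csv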

2. As a library

Example: get embedding features in a script

import sentspace

s = sentspace.Sentence('The person purchased two mugs at the price of one.')
emb_features = sentspace.embedding.get_features(s)

Example: parallelize feature extraction for multiple sentences using multithreading

import sentspace

sentences = [
    'Hello, how may I help you today?',
    'The person purchased three mugs at the price of five!',
    'This is an example sentence we want features of.'
             ]

# construct sentspace.Sentence objects from strings
sentences = [*map(sentspace.Sentence, sentences)]
# make use of parallel processing to get lexical features for the sentences
lex_features = sentspace.utils.parallelize(sentspace.lexical.get_features, sentences,
                                           wrap_tqdm=True, desc='Computing lexical features')
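
The lexical submodule returns a list of per-token feature dicts for each sentence. A convenient way to inspect the results is to flatten them into a pandas DataFrame and aggregate per sentence, mirroring what the pipeline does internally (a sketch; it assumes the token-level output includes a 'sentence' column, as in the pipeline source further below):

import pandas as pd
from itertools import chain

# flatten the per-sentence lists of token dicts into one token-level DataFrame
token_df = pd.DataFrame(chain.from_iterable(lex_features))

# average numeric features within each sentence to obtain sentence-level values
sentence_df = token_df.groupby('sentence').mean(numeric_only=True)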

Installing

1. Install using uv

  1. Install uv: refer to the link above.
  2. Install pyicu: installation instructions vary by OS, and this package lives outside the Python packaging system, so you'll need to install it manually.
     - Ubuntu: a pre-built package is available via apt: sudo apt-get install python3-icu
     - macOS: see here
  3. Install sentspace and its dependencies using uv and the uv.lock lockfile already present in this repo: uv sync --extra polyglot (recommended: install polyglot to support additional optional features such as accurate morphological segmentation/lemmatization)
  4. Activate the virtual env: . ./.venv/bin/activate
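
With the environment activated, a quick way to check that the install works is to import the package and construct a Sentence (a minimal smoke test; the exact printed representation may differ):

import sentspace

# if this runs without errors, the core package and its dependencies are installed
s = sentspace.Sentence('The person purchased two mugs at the price of one.')
print(s)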

2. Container-based usage


Requirements: singularity or docker.

Singularity:

singularity shell docker://aloxatel/sentspace:latest

Alternatively, from the root of the repo, run bash singularity-shell.sh. This step can take a while the first time you run it, as it needs to download the image from Docker Hub and convert it to the Singularity image format (.sif); each subsequent run will start quickly.

Docker: use the corresponding Docker commands with the same image (see the example below).
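
For instance, a hedged Docker equivalent of the Singularity command above (this assumes the image provides a bash shell):

docker run -it aloxatel/sentspace:latest /bin/bash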

Now you are inside the container and ready to run sentspace!

⚠ Archived: these instructions are for an older version of sentspace; only kept here in case they assist in troubleshooting. Please follow 1 or 2 above.

3. Manual install (use as a last resort)

On Debian/Ubuntu-like systems, follow the steps below. On other systems (RHEL, etc.), substitute commands and package names with appropriate alternates.

```bash
# optional (but recommended): create a virtual environment using your
# favorite method (venv, conda, …) before any of the following

# install basic packages using apt (you likely already have these)
sudo apt update
sudo apt install python3.8 python3.8-dev python3-pip
sudo apt install python2.7 python2.7-dev
sudo apt install build-essential git

# install ICU
DEBIAN_FRONTEND="noninteractive" TZ="America/New_York" sudo apt install python3-icu

# install ZS package separately (pypi install fails)
python3.8 -m pip install -U pip cython
git clone https://github.com/njsmith/zs
cd zs && git checkout v0.10.0 && pip install .

# install rest of the requirements using pip
cd ..  # make sure you're in the sentspace/ directory
pip install -r ./requirements.txt
polyglot download morph2.en
```

Submodules

SentSpace features fall into two core modules: Lexical and Contextual. In general, each submodule implements a major class of features. You can run each module on its own by passing its flag with a value of 0 or 1:

python -m sentspace -lex {0,1} -con {0,1} <input_file_path>

For a full list of available features, refer to the Feature Descriptions page on the hosted SentSpace frontend.

sentspace.lexical

The Lexical module consists of features that pertain to individual lexical items (words), regardless of the context in which they appear. These features are returned for each word and are also aggregated at the sentence level to provide a corresponding value for each sentence.

contextual

The Contextual module consists of features that quantify contextual and combinatorial inter-word relations that are not captured by individual lexical items. This module encompasses features that relate to the syntactic structure of the sentence (Contextual_syntax features) and features that apply to the sentence context but are not (exclusively) related to syntactic structure (Contextual_misc features).

⚠ Additional steps to set up the contextual module

The core sentspace program doesn't include a syntax server; it queries a separate backend where PCFG processing can happen. You'll need to run this backend alongside sentspace so that sentspace can query it and obtain features. The server should keep running in a terminal for as long as you're using sentspace, after which you can kill it with Ctrl+C.

- Here's a link to the module: https://github.com/sentspace/sentspace-syntax-server
- Jump to the "Setup" section in the readme to run it: https://github.com/sentspace/sentspace-syntax-server?tab=readme-ov-file#setup-how-to-get-it-up-and-running
- There is a pre-built docker image, so this setup should only need one command (sudo docker run -it --net=host --expose 8000 -p 8000:8000 aloxatel/berkeleyparser:latest). There is also a corresponding singularity command for HPC cluster environments that works with the same docker image.
- This will start loading and eventually (it can take 5-10 minutes; it is slow) expose the syntax server on port 8000 (this is just a virtual address so other processes on your computer know where to look).
- Now sentspace can query your localhost port 8000 with sentences to be processed, and the server will return syntax-based features for further post-processing and packaging into an output format similar to the rest of sentspace.
- To make sure sentspace talks to the correct port, you can either pass it on the CLI (--syntax_port 8000) or as an argument to the library function: https://github.com/sentspace/sentspace/blob/4b0f79c7f6dcab6285d3af42101b04b05f421b01/sentspace/__main__.py#L127. However, both the library and the syntax server default to localhost:8000, so unless you have a special circumstance, you won't need to worry about this.
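
For reference, here is a minimal sketch of the corresponding library call, modeled on how run_sentence_features_pipeline invokes the syntax submodule (see the source below); the server and port shown are the defaults:

import sentspace

s = sentspace.Sentence('The person purchased two mugs at the price of one.')

# query the separately running syntax server for DLT and left-corner features
syntax_features = sentspace.syntax.get_features(
    s,
    dlt=True,
    left_corner=True,
    syntax_server='http://localhost/',
    syntax_port=8000,
)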

Contributing

Any contributions you make are greatly appreciated, and no contribution is too small.

  1. Fork the project on Github (how to fork)
  2. Create your feature/patch branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request (PR) and we will take a look asap!

Whom to contact for help

  • gretatu % mit ^ edu
  • asathe % mit ^ edu

(C) 2020-2022 EvLab, MIT BCS

Expand source code
"""
### Sentspace 0.0.2 (C) 2020-2022 [EvLab](evlab.mit.edu), MIT BCS. All rights reserved.

Homepage: https://sentspace.github.io/sentspace

For questions, email:

`{gretatu,asathe} @ mit.edu`

.. include:: ../README.md
"""

from pathlib import Path

import sentspace.utils as utils
import sentspace.syntax as syntax
import sentspace.lexical as lexical
# import sentspace.embedding as embedding

from sentspace.Sentence import Sentence

import pandas as pd
from functools import reduce
from itertools import chain
from tqdm import tqdm


def run_sentence_features_pipeline(
    input_file: str,
    stop_words_file: str = None,
    benchmark_file: str = None,
    output_dir: str = None,
    output_format: str = None,
    batch_size: int = 2_000,
    process_lexical: bool = False,
    process_syntax: bool = False,
    process_embedding: bool = False,
    process_semantic: bool = False,
    parallelize: bool = True,
    # preserve_metadata: bool = True,
    syntax_server: str = "http://localhost/",
    syntax_port: int = 8000,
    limit: float = float("inf"),
    offset: int = 0,
    emb_data_dir: str = None,
) -> Path:
    """
    Runs the full sentence features pipeline on the given input according to
    requested submodules (currently supported: `lexical`, `syntax`, `embedding`,
    indicated by boolean flags).

    Returns an instance of `Path` pointing to the output directory resulting from this
    run of the full pipeline. The output directory contains Pickled or TSVed pandas
    DataFrames containing the requested features.


    Args:
        input_file (str): path to input text file containing sentences
                            one per line [required]
        stop_words_file (str): path to text file containing stopwords to filter
                                out, one per line [optional]
        benchmark_file (str): path to a file containing a benchmark corpus to
                                compare the current input against; e.g. UD [optional]

        {lexical,syntax,embedding,semantic,...} (bool): compute submodule features? [False]
    """

    # lock = multiprocessing.Manager().Lock()

    # create output folder
    utils.io.log("creating output folder")
    output_dir = utils.io.create_output_paths(
        input_file, output_dir=output_dir, stop_words_file=stop_words_file
    )
    # config_out = output_dir / "this_session_log.txt"
    # with config_out.open('a+') as f:
    #     print(args, file=f)

    utils.io.log("reading input sentences")
    sentences = utils.io.read_sentences(input_file, stop_words_file=stop_words_file)
    utils.io.log("---done--- reading input sentences")

    for part, sentence_batch in enumerate(
        tqdm(
            utils.io.get_batches(
                sentences, batch_size=batch_size, limit=limit, offset=offset
            ),
            desc="processing batches",
            total=len(sentences) // batch_size + 1,
        )
    ):
        sentence_features_filestem = f"sentence-features_part{part:0>4}"
        token_features_filestem = f"token-features_part{part:0>4}"

        ################################################################################
        #### LEXICAL FEATURES ##########################################################
        ################################################################################
        if process_lexical:
            utils.io.log("*** running lexical submodule pipeline")
            _ = lexical.utils.load_databases(features="all")

            lexical_features = utils.parallelize(
                lexical.get_features,
                sentence_batch,
                wrap_tqdm=True,
                desc="Lexical pipeline",
                max_workers=None if parallelize else 1,
            )

            lexical_out = output_dir / "lexical"
            lexical_out.mkdir(parents=True, exist_ok=True)
            utils.io.log(f"outputting lexical token dataframe to {lexical_out}")

            # lexical is a special case since it returns dicts per token (rather than per sentence)
            # so we want to flatten it so that pandas creates a sensible dataframe from it.
            token_df = pd.DataFrame(chain.from_iterable(lexical_features))

            if output_format == "tsv":
                token_df.to_csv(
                    lexical_out / f"{token_features_filestem}.tsv", sep="\t", index=True
                )
                token_df.groupby("sentence").mean(numeric_only=True).to_csv(
                    lexical_out / f"{sentence_features_filestem}.tsv",
                    sep="\t",
                    index=True,
                )
            elif output_format == "pkl":
                token_df.to_pickle(
                    lexical_out / f"{token_features_filestem}.pkl.gz", protocol=5
                )
                token_df.groupby("sentence").mean(numeric_only=True).to_pickle(
                    lexical_out / f"{sentence_features_filestem}.pkl.gz", protocol=5
                )
            else:
                raise ValueError(f"output format {output_format} not known")

            utils.io.log("--- finished lexical pipeline")

        ################################################################################
        #### SYNTAX FEATURES ###########################################################
        ################################################################################
        if process_syntax:
            utils.io.log("*** running syntax submodule pipeline")

            syntax_features = [
                syntax.get_features(
                    sentence,
                    dlt=True,
                    left_corner=True,
                    syntax_server=syntax_server,
                    syntax_port=syntax_port,
                )
                for i, sentence in enumerate(
                    tqdm(sentence_batch, desc="Syntax pipeline")
                )
            ]

            # put all features in the sentence df except the token-level ones
            token_syntax_features = {"dlt", "leftcorner"}
            sentence_df = pd.DataFrame(
                [
                    {
                        k: v
                        for k, v in feature_dict.items()
                        if k not in token_syntax_features
                    }
                    for feature_dict in syntax_features
                ],
                index=[s.uid for s in sentence_batch],
            )

            # output gives us dataframes corresponding to each token-level feature. we need to combine these
            # into a single dataframe
            # we use functools.reduce to apply the pd.concat function to all the dataframes and join dataframes
            # that contain different features for the same tokens
            token_dfs = [
                reduce(
                    lambda x, y: pd.concat([x, y], axis=1, sort=False),
                    (v for k, v in feature_dict.items() if k in token_syntax_features),
                )
                for feature_dict in syntax_features
            ]

            for i, df in enumerate(token_dfs):
                token_dfs[i]["index"] = df.index
            #     token_dfs[i].reset_index(inplace=True)

            dicts = [
                {k: v[list(v.keys())[0]] for k, v in df.to_dict().items()}
                for df in token_dfs
            ]
            token_df = pd.DataFrame(dicts)
            token_df.index = token_df["index"]
            # by this point we have merged dataframes with tokens along a column (rather than just a sentence)
            # now we need to stack them on top of each other to have all tokens across all sentences in a single dataframe
            # token_df = reduce(lambda x, y: pd.concat([x.reset_index(drop=True), y.reset_index(drop=True)]), token_dfs)
            # token_df = token_df.loc[:, ~token_df.columns.duplicated()]

            syntax_out = output_dir / "syntax"
            syntax_out.mkdir(parents=True, exist_ok=True)
            utils.io.log(f"outputting syntax dataframes to {syntax_out}")

            if output_format == "tsv":
                sentence_df.to_csv(
                    syntax_out / f"{sentence_features_filestem}.tsv",
                    sep="\t",
                    index=True,
                )
                token_df.to_csv(
                    syntax_out / f"{token_features_filestem}.tsv", sep="\t", index=True
                )
            elif output_format == "pkl":
                sentence_df.to_pickle(
                    syntax_out / f"{sentence_features_filestem}.pkl.gz", protocol=5
                )
                token_df.to_pickle(
                    syntax_out / f"{token_features_filestem}.pkl.gz", protocol=5
                )
            else:
                raise ValueError(f"unknown output format {output_format}")

            utils.io.log("--- finished syntax pipeline")

        # Calculate PMI
        # utils.GrabNGrams(sent_rows,pmi_paths)
        # utils.pPMI(sent_rows, pmi_paths)

        # Plot input data to benchmark data
        # utils.plot_usr_input_against_benchmark_dist_plots(df_benchmark, sent_embed)

    ################################################################################
    #### \end{run_sentence_features_pipeline} ######################################
    ################################################################################
    return output_dir

Sub-modules

sentspace.Sentence
sentspace.lexical
sentspace.package_lexical
sentspace.syntax
sentspace.utils
sentspace.vis

Functions

def run_sentence_features_pipeline(input_file: str, stop_words_file: str = None, benchmark_file: str = None, output_dir: str = None, output_format: str = None, batch_size: int = 2000, process_lexical: bool = False, process_syntax: bool = False, process_embedding: bool = False, process_semantic: bool = False, parallelize: bool = True, syntax_server: str = 'http://localhost/', syntax_port: int = 8000, limit: float = inf, offset: int = 0, emb_data_dir: str = None) ‑> pathlib.Path

Runs the full sentence features pipeline on the given input according to requested submodules (currently supported: sentspace.lexical, sentspace.syntax, embedding, indicated by boolean flags).

Returns an instance of Path pointing to the output directory resulting from this run of the full pipeline. The output directory contains Pickled or TSVed pandas DataFrames containing the requested features.

Args

input_file : str
path to input text file containing sentences one per line [required]
stop_words_file : str
path to text file containing stopwords to filter out, one per line [optional]
benchmark_file : str
path to a file containing a benchmark corpus to compare the current input against; e.g. UD [optional]

{lexical,syntax,embedding,semantic,…} (bool): compute submodule features? [False]
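
As a usage sketch (the input path is hypothetical; only the lexical submodule is enabled and TSV output is requested):

import sentspace

# run the lexical submodule on a file of sentences and write TSV outputs;
# returns a pathlib.Path to the directory containing the resulting DataFrames
output_dir = sentspace.run_sentence_features_pipeline(
    'example/example.csv',
    output_format='tsv',
    process_lexical=True,
)
print(output_dir)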
