---
title: "stringfish"
output:
  html_vignette:
    keep_md: no
  rmarkdown::github_document: default
vignette: >
  %\VignetteIndexEntry{stringfish}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, setup, echo=FALSE}
IS_GITHUB <- Sys.getenv("IS_GITHUB") != ""
```

```{r results='asis', echo=FALSE, eval=IS_GITHUB}
cat('
[![R-CMD-check](https://github.com/traversc/stringfish/workflows/R-CMD-check/badge.svg)](https://github.com/traversc/stringfish/actions)
[![CRAN-Status-Badge](https://www.r-pkg.org/badges/version/stringfish)](https://cran.r-project.org/package=stringfish)
[![CRAN-Downloads-Badge](https://cranlogs.r-pkg.org/badges/stringfish)](https://cran.r-project.org/package=stringfish)
[![CRAN-Downloads-Total-Badge](https://cranlogs.r-pkg.org/badges/grand-total/stringfish)](https://cran.r-project.org/package=stringfish)
')
```

`stringfish` is a framework for string and sequence operations built on the ALTREP system (introduced in R 3.5), which allows R objects to be represented with a custom memory layout.

This package has two primary goals:

- Provide R users a way to speed up common string operations compared to base R (see benchmarks below)
- Create a common ALTREP framework that can be used by other packages with full interoperability

`stringfish` currently provides two ALTREP backends with the same semantics: `sf_vec`, a simple vector of string objects, and `slice_store`, which stores strings within large contiguous blocks of memory. They make different storage tradeoffs, but the same `stringfish` operations work across both.

For text data, `stringfish` is intentionally UTF-8-centric outside of explicit byte mode, so conversions, comparisons, and ALTREP views stay consistent across normal R vectors and both backends.

## Installation

```{r eval=FALSE}
install.packages("stringfish", type="source", configure.args="--with-simd=AVX2")
```

## Benchmark

The simplest way to show the utility of the ALTREP framework is through a quick benchmark comparing `stringfish` and base R.

```{r echo=FALSE, results='asis'}
if(IS_GITHUB) {
  cat('![](vignettes/bench_v3.png "bench_v3"){width=576px}')
} else {
  cat('![](bench_v3.png "bench_v3"){width=576px}')
}
```

On favorable workloads, some functions in `stringfish` can be more than an order of magnitude faster than vectorized base R operations, and built-in multithreading can widen that gap further. On large text datasets, this can turn minutes of computation into seconds.

## Currently implemented functions

A list of implemented `stringfish` functions and analogous base R functions:

* `sf_iconv` (`iconv`)
* `sf_nchar` (`nchar`)
* `sf_substr` (`substr`)
* `sf_paste` (`paste0`)
* `sf_collapse` (`paste0`)
* `sf_readLines` (`readLines`)
* `sf_writeLines` (`writeLines`)
* `sf_grepl` (`grepl`)
* `sf_gsub` (`gsub`)
* `sf_toupper` (`toupper`)
* `sf_tolower` (`tolower`)
* `sf_starts` (`startsWith`)
* `sf_ends` (`endsWith`)
* `sf_trim` (`trimws`)
* `sf_split` (`strsplit`)
* `sf_match` (`match` for strings only)
* `sf_compare`/`sf_equals` (`==`, ALTREP-aware semantic string equality)
* `sf_concat`/`sfc` (`c`)

Utility functions:

* `sf_vector_create` -- creates a new empty `sf_vec`-backed stringfish vector
* `sf_vector` -- backwards-compatible alias for `sf_vector_create`
* `slice_store_create` -- creates a new empty `slice_store`-backed stringfish vector
* `slice_store_create_with_size` -- creates a `slice_store`-backed stringfish vector with an explicit initial slice size
* `sf_assign` -- assign strings into a `stringfish` vector in place (like `x[i] <- "mystring"`)
* `convert_to_sf_vector` -- converts a character vector to a `stringfish` vector
* `convert_to_slice_store` -- converts a character vector to a `stringfish` slice store
* `get_string_type` -- determines string type (whether ALTREP or normal)
* `materialize` -- converts any ALTREP object into a normal R object
* `random_strings` -- creates random strings as either a `stringfish` or normal R vector
* `string_identical` -- compares strings either semantically or exactly across encodings

In addition, many R operations in base R and other packages are already ALTREP-aware (i.e. they don't cause materialization). Functions that subset or index into string vectors generally do not materialize.

* `sample`
* `head`
* `tail`
* `[` -- e.g. `x[20:30]`
* various tidyverse filters and operations
* Etc.

`stringfish` functions are not intended to exactly replicate their base R analogues. One difference is that `subject` parameters are always the first argument, which is easier to use with pipes. E.g., `gsub(pattern, replacement, subject)` becomes `sf_gsub(subject, pattern, replacement)`.

## Extensibility

`stringfish` as a framework is intended to be easily extensible. Stringfish vectors can be worked into `Rcpp` scripts or even into other packages. The example below creates an `sf_vec`-backed output because it is simple and direct, but the same indexing semantics work across both backends.

Below is a detailed `Rcpp` script that creates a function to alternate upper and lower case of strings.

```{c eval=FALSE}
// [[Rcpp::depends(stringfish)]]
#include <Rcpp.h>
#include "sf_external.h"
using namespace Rcpp;

// [[Rcpp::export]]
SEXP sf_alternate_case(SEXP x) {
  // Iterate through a character vector using the RStringIndexer class
  // If the input vector x is a stringfish character vector it will do so without materialization
  RStringIndexer r(x);
  size_t len = r.size();

  // Create an output stringfish vector
  // Like all R objects, it must be protected from garbage collection
  SEXP output = PROTECT(sf_vector_create(len));

  // Obtain a reference to the underlying output data
  sf_vec_data & output_data = sf_vec_data_ref(output);

  // You can use range based for loop via an iterator class that returns RStringIndexer::rstring_info e
  // rstring info is a struct containing const char * ptr, int len, and an encoding flag
  // ptr should be treated as a byte pointer plus length, not as a null-terminated C string
  // a NA string is represented by a nullptr
  // Alternatively, access the data via the function r.getCharLenCE(i)
  size_t i = 0;
  for(auto e : r) {
    // check if string is NA and go to next if it is
    if(e.ptr == nullptr) {
      i++; // increment output index
      continue;
    }

    // Create a temporary output string and process the results.
    // This example intentionally toggles ASCII letters only.
    std::string temp(e.len, '\0');
    bool case_switch = false;
    for(int j=0; j<e.len; j++) {
      if((e.ptr[j] >= 65) && (e.ptr[j] <= 90)) { // char j is upper case
        if((case_switch = !case_switch)) { // check if we should convert to lower case
          temp[j] = e.ptr[j] + 32;
          continue;
        }
      } else if((e.ptr[j] >= 97) && (e.ptr[j] <= 122)) { // char j is lower case
        if(!(case_switch = !case_switch)) { // check if we should convert to upper case
          temp[j] = e.ptr[j] - 32;
          continue;
        }
      } else if(e.ptr[j] == 32) { // a space resets the alternation for the next word
        case_switch = false;
      }
      temp[j] = e.ptr[j];
    }

    // Create a new vector element sfstring and insert the processed string into the stringfish vector
    // sfstring has three constructors, 1) taking a std::string and encoding,
    // 2) a char pointer and encoding, or 3) a CHARSXP object (e.g. sfstring(NA_STRING))
    output_data[i] = sfstring(temp, e.enc);
    i++; // increment output index
  }

  // Finally, call unprotect and return result
  UNPROTECT(1);
  return output;
}
```

Example function call:

```{r eval=FALSE}
sf_alternate_case("hello world")
[1] "hElLo wOrLd"
```
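The inner loop of `sf_alternate_case` does not depend on R or `Rcpp`, so its behavior can be checked in isolation. The sketch below is a minimal standalone version of that loop, assuming only the C++ standard library. The helper name `alternate_case_ascii` is hypothetical and is not part of `stringfish`; it takes a byte pointer plus a length, mirroring how `rstring_info` exposes `ptr` and `len`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of stringfish.
// Alternates the case of ASCII letters, restarting the pattern after each space.
// The input is a byte pointer plus length, so embedded NUL bytes are preserved.
std::string alternate_case_ascii(const char* ptr, std::size_t len) {
  std::string out(len, '\0');
  bool case_switch = false;
  for (std::size_t j = 0; j < len; ++j) {
    const char c = ptr[j];
    if (c >= 'A' && c <= 'Z') {
      // Toggle first; an upper case letter is lowered when the switch is on
      case_switch = !case_switch;
      out[j] = case_switch ? static_cast<char>(c + 32) : c;
    } else if (c >= 'a' && c <= 'z') {
      // Toggle first; a lower case letter is raised when the switch is off
      case_switch = !case_switch;
      out[j] = case_switch ? c : static_cast<char>(c - 32);
    } else {
      // Non-letters are copied unchanged; a space resets the alternation
      if (c == ' ') case_switch = false;
      out[j] = c;
    }
  }
  return out;
}
```

Calling `alternate_case_ascii("hello world", 11)` gives `"hElLo wOrLd"`, the same result as the example call above. Bytes outside the ASCII letter ranges, including multi-byte UTF-8 sequences, are copied through unchanged.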