Title: | Analyze Big Files Without Loading Them in Memory |
---|---|
Description: | A simple set of wrapper functions for data.table::fread() that allows subsetting or filtering rows and selecting columns of table-formatted files too large for the available RAM. 'b stands for 'big files'. bread makes heavy use of Unix commands like 'grep', 'sed', 'wc', 'awk' and 'cut'. They are available by default in all Unix environments. For Windows, you need to install those commands externally in order to simulate a Unix environment and make sure that the executables are in the Windows PATH variable. To my knowledge, the simplest ways are to install 'RTools', 'Git' or 'Cygwin'. If they have been correctly installed (with the expected registry entries), they should be detected on loading the package and the correct directories will be added automatically to the PATH. |
Authors: | Vincent Guegan [aut, cre] |
Maintainer: | Vincent Guegan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.4.1 |
Built: | 2025-02-15 05:09:42 UTC |
Source: | https://github.com/magichead99/bread |
Simply reads the first line of a file with data.table::fread and the head Unix command. This allows analyzing big files that would not fit in memory (and cause an error such as 'cannot allocate vector of size').
bcolnames(file = NULL, ...)
bcolnames(file = NULL, ...)
file |
String. Name or full path to a file compatible with data.table::fread() |
... |
Arguments that must be passed to data.table::fread() like 'sep'. |
A character vector
file <- system.file('extdata', 'test.csv', package = 'bread') ## Retrieving the column names bcolnames(file = file)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Retrieving the column names bcolnames(file = file)
This function helps splitting a big csv file in smaller csv files using one of those 3 methods:
by_nrows: Each new file will contain a number of rows defined by the user
by_nfiles: The user decide the number of files created with the rows equally distributed
by_columns: The file will be split by the combinations of unique values in the columns chosen by the user Like all other functions in the bread package, this is achieved using Unix commands that allow opening, reading and splitting big files that wouldn"t fit in memory (The goal being to help with the 'cannot allocate vector of size' error).
bfile_split( file = NULL, by_nfiles, by_nrows, by_columns, drop_empty_files = T, write_sep = NA, write_dir = NULL, meta_output = NULL, ... )
bfile_split( file = NULL, by_nfiles, by_nrows, by_columns, drop_empty_files = T, write_sep = NA, write_dir = NULL, meta_output = NULL, ... )
file |
String. Name or full path to a file compatible with data.table::fread() |
by_nfiles |
Numeric. Number of files with an equal number of rows to be created. Only the last one will be slightly larger, containing the remainder. |
by_nrows |
Numeric. Number of rows composing the new split files. The last one may be smaller, containing only the remainder. |
by_columns |
Vector of strings or numeric. Indicates either the names or index number of the columns whose combinations of unique values will be used to split the files. |
drop_empty_files |
Logical. Defaults to TRUE. Used only with the 'by_column' argument. If changed to FALSE, empty files may be created. |
write_sep |
One character-length string. Will be provided to data.table::fwrite() for writing the output. If not provided, the delimiter will be guessed from the input file with the bsep() function. It will override and "sep" argument passed to fread() through "..." |
write_dir |
String. Path to the output directory. By default, it will be the working directory. If the directory doesn"t exist, it will be created. |
meta_output |
List. Optional. Output of the bmeta() function on the same file. It indicates the names and numbers of columns and rows. If not provided, it will be calculated. It can take a while on file with several million rows. |
... |
Arguments that must be passed to data.table::fread() or fwrite() like 'sep=' and 'dec='. Checks for compatibility, but (except for write_sep) you cannot choose to pass an argument to only writing or reading. |
Creates a number of csv files from the original larger file
## Not run: file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. bfile_split(file = file, by_nrows = 5) bfile_split(file = file, by_nfiles = 3) bfile_split(file = file, by_columns = c('YEAR', 'COLOR')) ## For very big files with several million rows, the bmeta() function takes ##a long time to count the rows without loading the file in memory. ## Best practice is to save the result of bmeta() in a variable and provide it ## to bfile_split() meta <- bmeta(file = file) bfile_split(file = file, by_nrows = 5, meta_output = meta) ## write_sep can be used to write the output files with a different delimiters than the input file bfile_split(file = file, by_nrows = 5, write_sep = '*') ## End(Not run)
## Not run: file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. bfile_split(file = file, by_nrows = 5) bfile_split(file = file, by_nfiles = 3) bfile_split(file = file, by_columns = c('YEAR', 'COLOR')) ## For very big files with several million rows, the bmeta() function takes ##a long time to count the rows without loading the file in memory. ## Best practice is to save the result of bmeta() in a variable and provide it ## to bfile_split() meta <- bmeta(file = file) bfile_split(file = file, by_nrows = 5, meta_output = meta) ## write_sep can be used to write the output files with a different delimiters than the input file bfile_split(file = file, by_nrows = 5, write_sep = '*') ## End(Not run)
Simple wrapper for data.table::fread() allowing to filter data from a file with the Unix 'grep' command. This method is useful if you want to load a file too large for your available memory (and encounter the 'cannot allocate vector of size' error for example).
bfilter( file = NULL, patterns = NULL, filtered_columns = NULL, fixed = FALSE, ... )
bfilter( file = NULL, patterns = NULL, filtered_columns = NULL, fixed = FALSE, ... )
file |
String. Name or full path to a file compatible with data.table::fread() |
patterns |
Vector of strings. One or several patterns used to filter the data from the input file. Each element of the vector should correspond to the column to be filtered. Can use regular expressions. |
filtered_columns |
Vector of strings or numeric. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered. |
fixed |
Logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments. |
... |
Arguments that must be passed to data.table::fread() like 'sep' and 'dec'. |
A dataframe
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. bfilter(file = file, patterns = c('200[4-6]', "red"), filtered_columns = c('YEAR', 'COLOR'), sep = ';') bfilter(file = file, patterns = c('2004|2005', 'red'), filtered_columns = c('YEAR', 'COLOR'), sep = ';') ## You need to use fixed = T if some patterns contain special characters ## that mess with regex like '(' and ')' bfilter(file = file, patterns = 'orange (purple)', filtered_columns = 'COLOR', fixed = TRUE, sep = ';') ## If you do not provide the filtered_columns, you risk encountering ## false positives because the grep command filters on the whole file, ## not column by column. Here, the value 2002 will be found in the 'PRICE' ## column as well. The filtered_column argument will just make the script ## do a second pass with dplyr::filter() to remove false positives. bfilter(file = file, patterns = '2002', sep = ';')
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. bfilter(file = file, patterns = c('200[4-6]', "red"), filtered_columns = c('YEAR', 'COLOR'), sep = ';') bfilter(file = file, patterns = c('2004|2005', 'red'), filtered_columns = c('YEAR', 'COLOR'), sep = ';') ## You need to use fixed = T if some patterns contain special characters ## that mess with regex like '(' and ')' bfilter(file = file, patterns = 'orange (purple)', filtered_columns = 'COLOR', fixed = TRUE, sep = ';') ## If you do not provide the filtered_columns, you risk encountering ## false positives because the grep command filters on the whole file, ## not column by column. Here, the value 2002 will be found in the 'PRICE' ## column as well. The filtered_column argument will just make the script ## do a second pass with dplyr::filter() to remove false positives. bfilter(file = file, patterns = '2002', sep = ';')
Generates a list that can be passed to the bfile_split() function in order to indicate the number of rows and the name of columns in the file. The output can be saved in a variable and provided in the meta_output argument to save time. Counting rows in very large files can take some time.
bmeta(file = NULL, ...)
bmeta(file = NULL, ...)
file |
String. Name or full path to a file compatible with data.table::fread() |
... |
Arguments that must be passed to data.table::fread() like 'sep'. |
A list of 2
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. meta_output <- bmeta(file = file)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering on 2 columns, using regex. meta_output <- bmeta(file = file)
Counts the number of rows using data.table::fread() and the 'wc' Unix command. This allows analyzing big files that would not fit in memory (and cause an error such as 'cannot allocate vector of size').
bnrow(file = NULL)
bnrow(file = NULL)
file |
String. Name or full path to a file compatible with data.table::fread() |
A numeric
file <- system.file('extdata', 'test.csv', package = 'bread') ## Counting rows (almost like the band) bnrow(file = file)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Counting rows (almost like the band) bnrow(file = file)
Simple wrapper for data.table::fread() allowing to filter data by numerical value from a file with the Unix 'awk' command. This method is useful if you want to load a file too large for your available memory (and encounter the 'cannot allocate vector of size' error #' for example).
bnumrange( file = NULL, range_min = NULL, range_max = NULL, numrange_columns = NULL, ... )
bnumrange( file = NULL, range_min = NULL, range_max = NULL, numrange_columns = NULL, ... )
file |
String. Name or full path to a file compatible with data.table::fread() |
range_min |
Vector of numeric. One or several minimal values used to filter (inclusively, as in superior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered. |
range_max |
Vector of numeric. One or several maximal values used to filter (inclusively, as in inferior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered. |
numrange_columns |
Vector of strings or numeric. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the range_min and range_man values with which it will be filtered. |
... |
Arguments that must be passed to data.table::fread() like 'sep' and 'dec'. |
A dataframe
The value comparisons are inclusive, meaning inferior/superior OR EQUAL
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering with only min value ## Filtering on 2 columns bnumrange(file = file, range_min = c(2006, 1500), range_max = c(2010, 1990), numrange_columns = c(1,3)) bnumrange(file = file, range_min = c(2000, 1500), range_max = c(2005, 1990), numrange_columns = c('YEAR', 'PRICE'), sep = ';')
file <- system.file('extdata', 'test.csv', package = 'bread') ## Filtering with only min value ## Filtering on 2 columns bnumrange(file = file, range_min = c(2006, 1500), range_max = c(2010, 1990), numrange_columns = c(1,3)) bnumrange(file = file, range_min = c(2000, 1500), range_max = c(2005, 1990), numrange_columns = c('YEAR', 'PRICE'), sep = ';')
Wrapper for data.table::fread() simplifying the use of Unix commands like 'grep', 'cut', 'awk' and 'sed' on a data file before loading it in memory. The Unix commands are automatically generated from the arguments. This is useful if you want to load a big file too large for your available memory (and encounter the 'cannot allocate vector of size' error) and know you can work on a subsample. 'b' stands for 'big file'. This function allows to subset rows by their index number, select columns and filter with a pattern.
bread( file = NULL, first_row = NULL, last_row = NULL, head = NULL, tail = NULL, colnames = NULL, colnums = NULL, patterns = NULL, filtered_columns = NULL, fixed = FALSE, range_min = NULL, range_max = NULL, numrange_columns = NULL, ... )
bread( file = NULL, first_row = NULL, last_row = NULL, head = NULL, tail = NULL, colnames = NULL, colnums = NULL, patterns = NULL, filtered_columns = NULL, fixed = FALSE, range_min = NULL, range_max = NULL, numrange_columns = NULL, ... )
file |
String. Name or full path to a file compatible with data.table::fread() |
first_row |
Numeric. First row of the portion of the file to subset. |
last_row |
Numeric. Last row of the portion of the file to subset. |
head |
Numeric. How many rows starting from the first in the file. |
tail |
Numeric. How many rows starting from the last in the file. |
colnames |
Vector of strings. Exact names of columns to select. If both colnames and colnums are provided, colnums will be prefered. |
colnums |
Vector of numeric. Columns index numbers. |
patterns |
Vector of strings. One or several patterns used to filter the data from the input file. Each element of the vector should correspond to the column to be filtered. Can use regular expressions. |
filtered_columns |
Vector of strings or numeric. Optional. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered. |
fixed |
Logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments. |
range_min |
Vector of numeric. One or several minimal values used to filter (inclusively, as in superior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered. |
range_max |
Vector of numeric. One or several maximal values used to filter (inclusively, as in inferior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered. |
numrange_columns |
Vector of strings or numeric. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the range_min and range_man values with which it will be filtered. |
... |
Arguments that must be passed to data.table::fread() like 'sep' or 'dec'. |
You can mix and match the row subsetting, the filtering by value and the selecting of columns. In order, the function:
subsets the rows by their numbers (with 'sed' & 'awk'). You need to input the index number of the first and last rows you want to load in memory with fread(), or alternatively use either the head or tail arguments to subset the first or last rows of the file.
selects columns by index number or name (with 'cut'). If both colnames and colnums are provided, colnums will be prefered.
filters the data selected so far with a pattern by column (with 'grep'). The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered.
filters (inclusively, as in inferior/superior OR EQUAL) the data selected so far by numerical value on a different set of provided columns with the 'sed' command.
A data frame with the selected columns and the subsetted and filtered data
Best practice would probably be to load the big file in a SQL database or something. Or not working on huge CSV files in the first place. But if you have to, you hopefully won"t have to delve into the fascinating grammar of Unix commands.
file <- system.file('extdata', 'test.csv', package = 'bread') ## Select the columns numbered 1 and 3 bread(file = file, colnums = c(1,3)) ## Select the columns named 'YEAR' and 'PRICE', then filter to keep only the ## value '2022' in column 'YEAR' bread(file = file, colnames = c('YEAR', 'PRICE'), patterns = 2002, filtered_columns = 'YEAR') ## Select the columns names 'YEAR' and 'PRICE', then filter to keep only values ## superior or equal to 2004 in YEAR and to 2000 in PRICE bread(file = file, colnames = c("YEAR", "PRICE"), range_min = c(2004,2000), numrange_columns = c(1,3)) ## Subset to keep only the rows 10 to 18, select the columns named 'YEAR' ## and 'COLOR' then filter to keep only the value 'red' in column 'COLOR' bread(file = file, colnames = c('YEAR', 'COLOR'), patterns = 'red', filtered_columns = 'COLOR', first_row = 10, last_row = 18)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Select the columns numbered 1 and 3 bread(file = file, colnums = c(1,3)) ## Select the columns named 'YEAR' and 'PRICE', then filter to keep only the ## value '2022' in column 'YEAR' bread(file = file, colnames = c('YEAR', 'PRICE'), patterns = 2002, filtered_columns = 'YEAR') ## Select the columns names 'YEAR' and 'PRICE', then filter to keep only values ## superior or equal to 2004 in YEAR and to 2000 in PRICE bread(file = file, colnames = c("YEAR", "PRICE"), range_min = c(2004,2000), numrange_columns = c(1,3)) ## Subset to keep only the rows 10 to 18, select the columns named 'YEAR' ## and 'COLOR' then filter to keep only the value 'red' in column 'COLOR' bread(file = file, colnames = c('YEAR', 'COLOR'), patterns = 'red', filtered_columns = 'COLOR', first_row = 10, last_row = 18)
Simple wrapper for data.table::fread() allowing to select columns of data from a file with the Unix 'cut' command. This method is useful if you want to load a file too large for your available memory (and encounter the 'cannot allocate vector of size' error).
bselect(file = NULL, colnames = NULL, colnums = NULL, ...)
bselect(file = NULL, colnames = NULL, colnums = NULL, ...)
file |
String. Full path to a file |
colnames |
Vector of strings. Exact names of columns to select. If both colnames and colnums are provided, colnums will be prefered. |
colnums |
Vector of numeric. Columns index numbers. |
... |
Arguments that must be passed to data.table::fread() like 'sep' or 'dec'. |
A dataframe with the selected columns
file <- system.file('extdata', 'test.csv', package = 'bread') ## Select the columns numbered 1 and 3 bselect(file = file, colnums = c(1,3)) ## Select the columns named 'PRICE' and 'COLOR' bselect(file = file, colnames = c('PRICE', 'COLOR'))
file <- system.file('extdata', 'test.csv', package = 'bread') ## Select the columns numbered 1 and 3 bselect(file = file, colnums = c(1,3)) ## Select the columns named 'PRICE' and 'COLOR' bselect(file = file, colnames = c('PRICE', 'COLOR'))
The function reads the first row and tests the following common separators by default:
';' '\t' ' ' '|' ':' ','
bsep(file, ntries = 10, separators = c(";", "\t", " ", "|", ":", ","))
bsep(file, ntries = 10, separators = c(";", "\t", " ", "|", ":", ","))
file |
String. Name or full path to a file compatible with data.table::fread() |
ntries |
Numeric. Number of rows to check for |
separators |
Vector of strings. Additional uncommon delimiter to check for |
A string
file <- system.file('extdata', 'test.csv', package = 'bread') ## Checking the delimiter on the first 12 rows, including headers bsep(file = file, ntries = 12)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Checking the delimiter on the first 12 rows, including headers bsep(file = file, ntries = 12)
Simple wrapper for data.table::fread() allowing to subset rows of data from a file with the Unix 'sed' or 'awk' commands. This method is useful if you want to load a file too large for your available memory (and encounter the 'cannot allocate vector of size' error). You need to input the index number of the first and last rows you want to load in memory with fread(), or alternatively use either the head or tail arguments to subset the first or last rows of the file.
bsubset( file = NULL, head = NULL, tail = NULL, first_row = NULL, last_row = NULL, ... )
bsubset( file = NULL, head = NULL, tail = NULL, first_row = NULL, last_row = NULL, ... )
file |
String. Full path to a file |
head |
Numeric. How many rows starting from the first in the file. |
tail |
Numeric. How many rows starting from the last in the file. |
first_row |
Numeric. First row of the portion of the file to subset. |
last_row |
Numeric. Last row of the portion of the file to subset. |
... |
Arguments that must be passed to data.table::fread() like 'sep'. |
A dataframe containing the subsetted rows
file <- system.file('extdata', 'test.csv', package = 'bread') ## Head or Tail... for the first n or last n rows bsubset(file = file, head = 5) ## Subset from the middle of a file bsubset(file = file, first_row = 5, last_row = 10) ## first_row defaults as 1 and last_row as the last row of the file bsubset(file = file, first_row = 5) bsubset(file = file, last_row = 10)
file <- system.file('extdata', 'test.csv', package = 'bread') ## Head or Tail... for the first n or last n rows bsubset(file = file, head = 5) ## Subset from the middle of a file bsubset(file = file, first_row = 5, last_row = 10) ## first_row defaults as 1 and last_row as the last row of the file bsubset(file = file, first_row = 5) bsubset(file = file, last_row = 10)