Supported Pandas Operations

Below is the list of the Pandas operators that Bodo supports. Optional arguments are not supported unless specified. Since Numba doesn't support Pandas, only these operations can be used inside Bodo functions, for both large and small datasets.

In addition:

  • Accessing columns using both getitem (e.g. df['A']) and attribute (e.g. df.A) is supported.
  • Using columns similar to Numpy arrays and performing data-parallel operations listed previously is supported.
  • Filtering data frames using boolean arrays is supported (e.g. df[df.A > .5]).
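The three access patterns above can be sketched in plain pandas as follows. This is only an illustration of the pandas semantics, run outside any Bodo function; the data frame and variable names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [0.25, 0.75, 1.0], "B": [1, 2, 3]})

# getitem and attribute access return the same column.
col_getitem = df["A"]
col_attr = df.A

# A column can be used like a NumPy array in element-wise arithmetic.
scaled = df.A * 2 + df.B

# A boolean array keeps only the rows where the condition holds.
filtered = df[df.A > .5]
```

Here `filtered` keeps the last two rows, since only those have `A` greater than 0.5.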

Input/Output

  • pandas.read_csv
    • Arguments filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, skiprows and parse_dates are supported.
    • filepath_or_buffer should be a string and is required.
    • Either both the names and dtype arguments should be provided so that column types are known, or filepath_or_buffer should be a constant string so that Bodo can infer types by looking at the file at compile time.
    • names, usecols, parse_dates should be constant lists.
    • dtype should be a constant dictionary of strings and types.
  • pandas.read_parquet
    • Arguments path and columns are supported. columns should be a constant list of strings.
    • If path is constant, Bodo finds the schema from file at compilation time. Otherwise, schema should be provided. For example:

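      import bodo
      import pandas as pd

      # f is not a compile-time constant, so the schema of df is declared via locals
      @bodo.jit(locals={'df':{'A': bodo.float64[:],
                              'B': bodo.string_array_type}})
      def impl(f):
          df = pd.read_parquet(f)
          return df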
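As a rough illustration of the read_csv arguments listed above, the following plain-pandas sketch reads a small headerless file with literal names, dtype and parse_dates values. The file contents and column names are invented for the example, and the call is shown outside a Bodo function; it only demonstrates the shape of the arguments.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write a small headerless CSV file (hypothetical data).
    path = os.path.join(tmp, "data.csv")
    with open(path, "w") as f:
        f.write("1,2.5,2017-09-27\n2,3.5,2017-09-28\n")

    df = pd.read_csv(
        path,                                   # a string path, required
        names=["A", "B", "C"],                  # literal list of column names
        dtype={"A": "int64", "B": "float64"},   # literal dict of column types
        parse_dates=["C"],                      # literal list of date columns
    )
```

Writing the lists and the dictionary as literals at the call site is what keeps them constant in the sense used above.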
      

General functions

Data manipulations:

  • pandas.crosstab
    • Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output table pt will have columns called small and large.
  • pandas.merge
    • Arguments left, right should be dataframes.
    • how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.
  • pandas.merge_asof (similar arguments to merge)
  • pandas.concat Input list or tuple of dataframes or series is supported.
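A minimal plain-pandas sketch of merge and concat with literal (constant) arguments is shown below. The frames and column names are hypothetical, and the code runs outside a Bodo function.

```python
import pandas as pd

# Hypothetical input frames.
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [200, 300, 400]})

# how and on are given as literal values.
inner = pd.merge(left, right, how="inner", on="key")

# left_on / right_on cover the case where key columns have different names.
renamed = right.rename(columns={"key": "rkey"})
inner2 = pd.merge(left, renamed, how="inner", left_on="key", right_on="rkey")

# concat accepts a list (or tuple) of dataframes.
stacked = pd.concat([left, left])
```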

Top-level missing data:

  • pandas.isna
  • pandas.isnull
  • pandas.notna
  • pandas.notnull

Top-level conversions:

  • pandas.to_numeric Input can be a Series. Output requires type annotation. errors='coerce' required.
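For reference, this is what errors='coerce' does in plain pandas: values that cannot be parsed become missing instead of raising an error. The sample strings are made up, and the output type annotation that Bodo requires is not shown here.

```python
import pandas as pd

# Hypothetical string data with one unparseable entry.
s = pd.Series(["1", "2.5", "oops"])

# Unparseable entries are turned into missing values.
nums = pd.to_numeric(s, errors="coerce")
```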

Top-level dealing with datetimelike:

  • pandas.date_range
    • start, end, periods, freq, name and closed arguments are supported. This function is not parallelized yet.

Series

Bodo provides extensive Series support. However, operations between Series (+, -, /, *) do not implicitly align values based on their associated index values yet.

  • pandas.Series
    • Arguments data, index, and name are supported. data is required and can be a list, array, Series or Index. If data is Series and index is provided, implicit alignment is not performed yet.
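To make the alignment caveat concrete, the sketch below shows, in plain pandas, how index alignment changes the result of an addition compared with a purely positional addition. The Series are hypothetical; the positional variant only illustrates what "no implicit alignment" means and is not a statement about Bodo internals.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[2, 1, 0])

# Plain pandas matches values by index label before adding.
aligned = s1 + s2

# Adding the underlying arrays ignores labels and pairs values by position.
positional = pd.Series(s1.values + s2.values, index=s1.index)
```

With these inputs, label 0 yields 31 under alignment (1 + 30) but 11 when added by position (1 + 10).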

Attributes:

  • Series.index
  • Series.values
  • Series.dtype (object data types such as dtype of string series not supported yet)
  • Series.shape
  • Series.ndim
  • Series.size
  • Series.T
  • Series.hasnans
  • Series.empty
  • Series.dtypes
  • Series.name
  • Series.put (only numeric data types)

Methods:

Conversion:

  • Series.astype (only dtype argument, can be a Numpy numeric dtype or str)
  • Series.copy (including deep argument)
  • Series.to_list
  • Series.get_values

Indexing, iteration:

Location based indexing using [], iat, and iloc is supported. Changing values of existing string Series using these operators is not supported yet.

  • Series.iat
  • Series.iloc

Binary operator functions:

The fill_value optional argument for binary functions below is supported.

  • Series.add
  • Series.sub
  • Series.mul
  • Series.div
  • Series.truediv
  • Series.floordiv
  • Series.mod
  • Series.pow
  • Series.combine
  • Series.lt
  • Series.gt
  • Series.le
  • Series.ge
  • Series.ne
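The effect of fill_value can be sketched in plain pandas as follows: a value missing on one side is replaced by fill_value before the operation is applied. The input data is invented for the example.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

# Without fill_value, the result is missing wherever either input is missing.
plain = a + b

# With fill_value=0, the missing side is treated as 0.
filled = a.add(b, fill_value=0)
```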

Function application, GroupBy & Window:

  • Series.apply (only the func argument)
  • Series.map (only the arg argument, which should be a function)
  • Series.rolling (window and center arguments supported)
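A short plain-pandas sketch of the three calls above, using only the arguments listed as supported. The data and lambdas are hypothetical.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# apply and map with a function as the only argument.
squared = s.apply(lambda x: x ** 2)
doubled = s.map(lambda x: 2 * x)

# A trailing window of 3 versus a window centered on each element.
trailing = s.rolling(3).mean()
centered = s.rolling(3, center=True).mean()
```

The trailing window leaves the first two positions missing, while the centered window leaves the first and last positions missing.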

Computations / Descriptive Stats:

Statistical functions below are supported without optional arguments unless support is explicitly mentioned.

  • Series.abs
  • Series.corr
  • Series.count
  • Series.cov
  • Series.cumsum
  • Series.cumprod
  • Series.describe currently returns a string instead of Series object.
  • Series.max
  • Series.mean
  • Series.median
  • Series.min
  • Series.nlargest (non-numerics not supported yet)
  • Series.nsmallest (non-numerics not supported yet)
  • Series.pct_change (supports numeric types and only the periods argument supported)
  • Series.prod
  • Series.quantile
  • Series.std
  • Series.sum
  • Series.var
  • Series.unique
  • Series.nunique
  • Series.value_counts

Reindexing / Selection / Label manipulation:

  • Series.head (n argument is supported)
  • Series.idxmax
  • Series.idxmin
  • Series.rename (only set a new name using a string value)
  • Series.tail (n argument is supported)
  • Series.take

Missing data handling:

  • Series.isna
  • Series.notna
  • Series.dropna
  • Series.fillna

Reshaping, sorting:

  • Series.argsort
  • Series.sort_values (does not push NAs to first/last positions yet)
  • Series.append (ignore_index is supported; setting name for output Series not supported yet)

Time series-related:

  • Series.shift (supports numeric types and only the periods argument supported)

String handling:

  • Series.str.contains
  • Series.str.len

DataFrame

Bodo provides extensive DataFrame support documented below.

  • pandas.DataFrame

    data argument can be a constant dictionary or 2d Numpy array. Other arguments are also supported.

Attributes and underlying data:

  • DataFrame.index (can access but not set new index yet)
  • DataFrame.columns (can access but not set new columns yet)
  • DataFrame.values (only for numeric dataframes)
  • DataFrame.get_values (only for numeric dataframes)
  • DataFrame.ndim
  • DataFrame.size
  • DataFrame.shape
  • DataFrame.empty

Conversion:

  • DataFrame.astype (only accepts a single data type of Numpy dtypes or str)
  • DataFrame.copy (including deep flag)
  • DataFrame.isna
  • DataFrame.notna

Indexing, iteration:

  • DataFrame.head (including n argument)
  • DataFrame.iat
  • DataFrame.iloc
  • DataFrame.tail (including n argument)
  • DataFrame.isin (values can be a dataframe with matching index or a list or a set)

Function application, GroupBy & Window:

  • DataFrame.apply
  • DataFrame.groupby The by argument should be a constant column label or a constant list of column labels. sort=False is set by default. The as_index argument is supported, but MultiIndex is not supported yet (an output MultiIndex is simply dropped).
  • DataFrame.rolling window argument should be integer or a time offset as a constant string. center and on arguments are also supported.
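Because sort=False is the default here while plain pandas defaults to sort=True, the order of groups in the output can differ from what pandas users expect. The plain-pandas sketch below shows the difference between the two settings and the effect of as_index=False. The data is hypothetical, and the group order that Bodo itself produces is not specified in this document.

```python
import pandas as pd

df = pd.DataFrame({"A": ["b", "a", "b", "a"], "B": [1, 2, 3, 4]})

# pandas default (sort=True): groups are ordered by key.
sorted_groups = df.groupby("A")["B"].sum()

# sort=False in pandas: groups appear in order of first occurrence.
unsorted_groups = df.groupby("A", sort=False)["B"].sum()

# as_index=False keeps the key as a regular column instead of the index.
flat = df.groupby("A", as_index=False)["B"].sum()
```

Code that relies on group order can sort the result explicitly rather than depend on the default.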

Computations / Descriptive Stats:

  • DataFrame.abs
  • DataFrame.corr (min_periods argument supported)
  • DataFrame.count
  • DataFrame.cov (min_periods argument supported)
  • DataFrame.cumprod
  • DataFrame.cumsum
  • DataFrame.describe
  • DataFrame.max
  • DataFrame.mean
  • DataFrame.median
  • DataFrame.min
  • DataFrame.pct_change
  • DataFrame.prod
  • DataFrame.quantile
  • DataFrame.sum
  • DataFrame.std
  • DataFrame.var
  • DataFrame.nunique

Reindexing / Selection / Label manipulation:

  • DataFrame.drop (only dropping columns supported, either using columns argument or setting axis=1)
  • DataFrame.head (including n argument)
  • DataFrame.idxmax
  • DataFrame.idxmin
  • DataFrame.reset_index (only drop=True supported)
  • DataFrame.set_index keys can only be a column label (a constant string).
  • DataFrame.tail (including n argument)
  • DataFrame.take
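The restricted forms of drop, set_index and reset_index described above look like this in plain pandas. The frame is hypothetical, and the code is shown outside a Bodo function.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "z"], "B": [3, 1, 2], "C": [0.5, 1.5, 2.5]})

# Dropping columns, either with the columns argument or with axis=1.
no_b = df.drop(columns=["B"])
no_b_axis = df.drop(["B"], axis=1)

# set_index with a single column label given as a literal string.
by_a = df.set_index("A")

# reset_index(drop=True) discards the old index after filtering.
fresh = df[df.B > 1].reset_index(drop=True)
```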

Missing data handling:

  • DataFrame.dropna
  • DataFrame.fillna

Reshaping, sorting, transposing:

  • DataFrame.pivot_table
    • Arguments values, index, columns and aggfunc are supported.
    • Annotation of pivot values is required. For example, @bodo.jit(pivots={'pt': ['small', 'large']}) declares the output pivot table pt will have columns called small and large.
  • DataFrame.sort_values The by argument should be a constant string or a constant list of strings. The ascending argument is supported.
  • DataFrame.sort_index ascending argument is supported.
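The pivots annotation is needed because the output columns of a pivot table come from the data values themselves. The plain-pandas sketch below, on made-up data, shows how the distinct values of the columns argument ('small' and 'large') become the column names of the result.

```python
import pandas as pd

# Hypothetical data: the 'kind' column holds the future column names.
df = pd.DataFrame({
    "k": ["a", "a", "b", "b"],
    "kind": ["small", "large", "small", "large"],
    "v": [1, 2, 3, 4],
})

pt = df.pivot_table(values="v", index="k", columns="kind", aggfunc="sum")

# The result has one column per distinct value of 'kind'.
column_names = sorted(pt.columns)
```

Since these names are only known after looking at the data, a compiler that needs the output type ahead of time has to be told them, which is the role of pivots={'pt': ['small', 'large']} above.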

Combining / joining / merging:

  • DataFrame.append appending a dataframe or list of dataframes supported. ignore_index=True is necessary and set by default.
  • DataFrame.join only dataframes.
  • DataFrame.merge only dataframes. how, on, left_on, right_on, left_index, and right_index are supported but should be constant values.

Time series-related:

  • DataFrame.shift (supports numeric types and only the periods argument supported)

Numeric Index

Numeric index objects RangeIndex, Int64Index, UInt64Index and Float64Index are supported as index to dataframes and series. Constructing them in Bodo functions, passing them to Bodo functions (unboxing), and returning them from Bodo functions (boxing) are also supported.

  • pandas.RangeIndex
    • start, stop and step arguments are supported.
  • pandas.Int64Index
  • pandas.UInt64Index
  • pandas.Float64Index
    • data, copy and name arguments are supported. data can be a list or array.

DatetimeIndex

DatetimeIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.DatetimeIndex
    • Only data argument is supported, and can be array-like of datetime64['ns'], int64 or strings. Strings should be in ISO 8601 format, YYYY-MM-DDT[HH[:MM[:SS[.mmm[uuu]]]]][+HH:MM] (e.g. '2017-09-27').

Date fields of DatetimeIndex are supported:

  • DatetimeIndex.year
  • DatetimeIndex.month
  • DatetimeIndex.day
  • DatetimeIndex.hour
  • DatetimeIndex.minute
  • DatetimeIndex.second
  • DatetimeIndex.microsecond
  • DatetimeIndex.nanosecond
  • DatetimeIndex.date

The min/max methods are supported without optional arguments (NaT output for empty or all NaT input not supported yet):

  • DatetimeIndex.min
  • DatetimeIndex.max

Returning underlying data array:

  • DatetimeIndex.values

Subtraction of Timestamp from DatetimeIndex and vice versa is supported.

Comparison operators ==, !=, >=, >, <=, < between DatetimeIndex and a string containing datetime in ISO 8601 format are supported.
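The DatetimeIndex features above can be sketched in plain pandas as follows, using ISO 8601 date strings. The dates are arbitrary examples.

```python
import pandas as pd

# Construct from ISO 8601 strings.
idx = pd.DatetimeIndex(["2017-09-27", "2017-12-31", "2018-01-01"])

# Date fields.
years = idx.year
months = idx.month

# Comparison against an ISO 8601 string yields a boolean array.
after = idx > "2017-12-31"
on_or_after = idx >= "2017-12-31"

# Subtracting a Timestamp yields time differences.
diffs = idx - pd.Timestamp("2017-09-27")

# min and max without optional arguments.
first, last = idx.min(), idx.max()
```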

TimedeltaIndex

TimedeltaIndex objects are supported. They can be constructed, boxed/unboxed, and set as index to dataframes and series.

  • pandas.TimedeltaIndex
    • Only data argument is supported, and can be array-like of timedelta64['ns'] or int64.

Time fields of TimedeltaIndex are supported:

  • TimedeltaIndex.days
  • TimedeltaIndex.seconds
  • TimedeltaIndex.microseconds
  • TimedeltaIndex.nanoseconds

PeriodIndex

PeriodIndex objects can be boxed/unboxed and set as index to dataframes and series. Operations on them will be supported in upcoming releases.

Timestamp

  • Timestamp.day
  • Timestamp.hour
  • Timestamp.microsecond
  • Timestamp.month
  • Timestamp.nanosecond
  • Timestamp.second
  • Timestamp.year
  • Timestamp.date

Window

  • Rolling.count
  • Rolling.sum
  • Rolling.mean
  • Rolling.median
  • Rolling.var
  • Rolling.std
  • Rolling.min
  • Rolling.max
  • Rolling.corr
  • Rolling.cov
  • Rolling.apply

GroupBy

  • GroupBy.agg arg should be a function, and the compiler should be able to simplify it to a single parallel loop and analyze it. For example, arithmetic expressions on input Series are supported. A list of functions is also supported if one output column is selected (which avoids MultiIndex). For example:

    import bodo

    @bodo.jit
    def f(df):
        # each function reduces column B of a group to a single value
        def g1(x): return (x<=2).sum()
        def g2(x): return (x>2).sum()
        return df.groupby('A')['B'].agg((g1, g2))
    
  • GroupBy.aggregate same as agg.
  • GroupBy.count
  • GroupBy.max
  • GroupBy.mean
  • GroupBy.median
  • GroupBy.min
  • GroupBy.prod
  • GroupBy.std
  • GroupBy.sum
  • GroupBy.var

Integer NA issue in Pandas

DataFrame and Series objects with integer data need special care due to integer NA issues in Pandas. By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed (which can result in loss of precision). This is because Pandas uses the NaN floating point value as NA, and Numpy does not support NaN values for integers. Bodo does not perform this conversion unless enough information is available at compilation time.
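Both effects described above can be observed in plain pandas and plain Python. The sketch below uses made-up values: an integer column turns into float64 as soon as a missing value appears, and a large integer does not survive a round trip through a 64-bit float.

```python
import pandas as pd

# An int64 column becomes float64 once a missing value is introduced.
s = pd.Series([1, 2, 3])
shifted = s.shift(1)  # first element becomes NaN

# float64 has a 53-bit significand, so 2**53 + 1 cannot be represented.
big = 2 ** 53 + 1
round_trip = int(float(big))
```

Here `round_trip` is 9007199254740992, one less than the original value.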

Pandas introduced a new nullable integer data type that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital "I" denotes nullable integer type):

import bodo
import pandas as pd

@bodo.jit
def example(fname):
    dtype = {'A': 'Int64', 'B': 'float64'}
    df = pd.read_csv(fname,
        names=dtype.keys(),
        dtype=dtype,
    )
    ...
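For comparison, the behavior of the nullable integer type in plain pandas can be sketched as follows; the values are arbitrary. The column keeps its integer type even when missing values are present or introduced.

```python
import pandas as pd

# 'Int64' (capital "I") is the nullable integer type; None becomes a missing value.
s = pd.Series([1, None, 3], dtype="Int64")

# Introducing another missing value does not convert the column to float.
shifted = s.shift(1)

# Reductions skip missing values by default.
total = int(s.sum())
```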

Bodo can use nullable integer arrays when reading Parquet files if the bodo.io.parquet_pio.use_nullable_int_arr flag is set by the user. For example:

import bodo
import pandas as pd

bodo.io.parquet_pio.use_nullable_int_arr = True

@bodo.jit
def example(fname):
    df = pd.read_parquet(fname)
    ...