Pandas DataFrames are Excel are a great match as both deal with 2d tables of data. With PyXLL, you can pass data as pandas DataFrames between Python and Excel easily.
Note
You can also use polars
DataFrames as well as pandas.
To distinguish between polars and pandas, use the polars.dataframe
and
pandas.dataframe
types instead of just dataframe
in your function
signatures.
If you use just dataframe
PyXLL will default to using pandas.
See Polars DataFrames for more information about polars types in PyXLL.
By far one of the most common ways to work with pandas data in Excel is with a worksheet function that returns a DataFrame.
If you are not already familiar with writing Excel worksheet functions in Python using PyXLL, please see Worksheet Functions first.
This video demonstrates how to write worksheet functions in Python, using PyXLL. It also covers returning pandas DataFrames from functions towards the end of the video.
Consider the following function that returns a panda DataFrame:
import pandas as pd
from pyxll import xl_func
@xl_func
def return_dataframe():
"""A function that returns a DataFrame."""
df = pd.DataFrame({
"A": [1, 2, 3],
"B": [4, 5, 6]
})
return df
The xl_func
decorator exposes the return_dataframe
function to Excel as a worksheet function,
but when we call this function from Excel the result might not be exactly what you expected:
The DataFrame object here is being returned as an object handle. This object handle can be passed to other PyXLL functions and the Python function will be called with the actual pandas DataFrame object. This is really useful (and very fast) as it allows us to pass large, complex objects between Python functions easily.
However, it is not what we want in this case!
We need to tell the PyXLL add-in to convert the returned DataFrame object into a range of values so that all of the DataFrames values get returned to the Excel grid.
To do that, we need to specify the return type when declaring the function. That is done either by adding
a function signature to the xl_func
decorator; using the xl_return
decorator; or
by using a Python type annotation.
The simplest method is to add a function signature to our function as follows:
import pandas as pd
from pyxll import xl_func
@xl_func(": dataframe")
def return_dataframe():
"""A function that returns a DataFrame."""
df = pd.DataFrame({
"A": [1, 2, 3],
"B": [4, 5, 6]
})
return df
Note that we don’t have any arguments. The function signature is of the form argument types: return type
and so
in our signature above we don’t have anything before the :
as there are no arguments to the function.
With the return type now specified, when we call the function in Excel the PyXLL add-in will convert the pandas DataFrame object to an array of values:
The dataframe return type can be parameterized using various options. These options control how the pandas DataFrame object is converted to the array that you see in Excel.
For example, if you want the index to be included in the array shown in Excel, you can set the index
option
to True
(by default, the index is only included if it is a named index):
import pandas as pd
from pyxll import xl_func
@xl_func(": dataframe<index=True>")
def return_dataframe():
"""A function that returns a DataFrame."""
df = pd.DataFrame({
"A": [1, 2, 3],
"B": [4, 5, 6]
})
return df
For more details of how to specify return types and their type parameters, please see Argument and Return Types.
You can find the documentation for all of the available parameters for the dataframe
return type
on the Pandas Types page.
In the above section we saw how by specifying the function return type, PyXLL can convert the returned pandas DataFrame to an array of values on the Excel grid.
Similarly, PyXLL can also convert a range of values from the Excel grid passed to a function into a pandas DataFrame.
Let’s look at an example function that takes a DataFrame argument, as well as a string and returns a number:
import pandas as pd
from pyxll import xl_func
@xl_func("dataframe df, str col: float")
def sum_column(df, col):
"""Sum the values in a column."""
column_sum = df[col].sum()
return column_sum
The function signature passed to xl_func
specifies that the argument df
should be converted
from a range of Excel values to a pandas DataFrame.
To call this from Excel we enter it as a formula, passing in our data range including the column headers, and the column we want to sum.
PyXLL converts the input range argument to a pandas DataFrame for us and calls our sum_column function with that converted DataFrame.
We can also specify various type parameters to the dataframe
argument type, as with the return type. For
example, the index
option can be used to specify the number of columns to use as the index for the
DataFrame (if supplying an index).
Instead of using the function signature as shown above, another option is to use the xl_arg
decorator. This works in the same way, but lets us specify the argument type parameters as keyword
arguments to the xl_arg
decorator.
For example, the function above could be re-written as follows:
import pandas as pd
from pyxll import xl_func, xl_arg, xl_return
@xl_func
@xl_arg("df", "dataframe")
@xl_arg("col", "str")
@xl_return("float")
def sum_column(df, col):
"""Sum the values in a column."""
column_sum = df[col].sum()
return column_sum
And to specify that the first column is to be used as the DataFrame index, we would set the index
type
parameter like so:
import pandas as pd
from pyxll import xl_func, xl_arg, xl_return
@xl_func
@xl_arg("df", "dataframe", index=1) # equivalent to "dataframe<index=1>"
@xl_arg("col", "str")
@xl_return("float")
def sum_column(df, col):
"""Sum the values in a column."""
column_sum = df[col].sum()
return column_sum
This time, the DataFrame passed to our sum_column
function has its index constructed from the first
column in the Excel range. Setting index
to greater than one results in a DataFrame with a MultiIndex.
For more details of how to specify arguments types and their type parameters, please see Argument and Return Types.
You can find the documentation for all of the available parameters for the dataframe
argument type
on the Pandas Types page.
Sometimes you might have a function that returns more than one DataFrames.
When returning values to Excel, the result has to fit into a grid and so returning multiple DataFrames doesn’t fit as well as a single DataFrame.
If it makes sense to break your function into multiple functions, with each one returning a different DataFrame, that is a good solution. But, if that’s not feasible or results in unwanted recomputation and you need to keep it all in one function, you can return multiple DataFrames from a single function.
Rather than returning multiple DataFrames as multiple grids of data from a single function, instead we have to return them as a list of objects. By returning them as a list of objects this then fits into Excel’s grid.
import pandas as pd
from pyxll import xl_func
@xl_func(": object[]")
def multiple_dataframes():
"""A function that returns multiple DataFrames."""
df1 = pd.DataFrame({
"A": [1, 2, 3],
"B": [4, 5, 6]
})
df2 = pd.DataFrame({
"X": [-1, -2, -3],
"Y": [-4, -5, -6]
})
return [
df1,
df2
]
Above we’re using the object[]
return type. The list of DataFrames returned by the function
are returned to Excel as an array of object handles:
Our function returns multiple DataFrames, but as an array of object handles.
Next we need to convert those object handles into Excel arrays, which we can do with another
function. This next function doesn’t do anything, but by specifying dataframe
as the
return type PyXLL will do the conversion for us:
import pandas as pd
from pyxll import xl_func
@xl_func("object df: dataframe")
def expand_df(df):
if not isinstance(df, pd.DataFrame):
raise TypeError("Expected a DataFrame object")
# Will be converted to an Excel array because of the return type
return df
We can now call this new function on the objects returned previously to get the contents of the DataFrames:
PyXLL has a feature for applying cell formatting to the results of a worksheet function. This can be very useful for giving your spreadsheets a consistent look and feel, as well as saving time applying formatting manually.
For complete details about cell formatting, please see the Cell Formatting and Pandas DataFrame Formatting sections of the user guide.
The following video also explains how the cell formatting functionality can be used:
Pandas DataFrames can be plotted using the DataFrame.plot
method. This is a convenient way to quickly produce plots from DataFrames.
The DataFrame.plot
method uses the plotting package matplotlib
. The PyXLL add-in can display matplotlib figures directly in
Excel, and so it is also possible to display plots of pandas DataFrames in Excel.
The function plot
takes a matplotlib figure and displays it in Excel. To display a pandas plot we first create a matplotlib
Figure
and Axes
. Then we call DataFrame.plot
, passing in the Axex
object as we want it to plot to the matplotlib figure
we created. Finally, we call plot
to display the figure in Excel.
from pyxll import xl_func, plot
import matplotlib.pyplot as plt
import pandas as pd
@xl_func
def pandas_plot():
# Create a DataFrame to plot
df = pd.DataFrame({
'name':['john','mary','peter','jeff','bill','lisa','jose'],
'age':[23,78,22,19,45,33,20],
'gender':['M','F','M','M','M','F','M'],
'state':['california','dc','california','dc','california','texas','texas'],
'num_children':[2,0,0,3,2,1,4],
'num_pets':[5,1,0,5,2,2,3]
})
# Create the matplotlib Figure and Axes objects
fig, ax = plt.subplots()
# Plot a bar chart to the Axes we just created
df.plot(kind='bar',x='name',y='age', ax=ax)
# Show the matplotlib Figure created above
plot(fig)
return "OK!"
The above code creates a DataFrame. It then creates a matplotlib figure and plots the DataFrame onto that figure. Finally
it displays the plot in Excel using the plot
function.
For more details on plotting with PyXLL, and plotting pandas DataFrames in Excel, please see Plotting with Pandas.
Macro functions can be called from Excel buttons and other controls, as well as from VBA.
Excel macros can automate Excel in the same way as VBA. They can also read and write values in the Excel workbooks.
If you’re not already familiar with writing macro functions using PyXLL, please see Macro Functions.
With PyXLL, the entire Excel object model is exposed (see Python as a VBA Replacement), but for dealing with pandas DataFrames (and other types) the Python values need to converted to something that Excel understands.
This can be done using PyXLL’s XLCell
class.
In a macro function we can get the Excel Application
object using xl_app
. This is the
same as the VBA Application object. From that, we can get the Range
object for which we want to either
read or write the DataFrame.
Once we have the Range
object, we use XLCell.from_range
to get a PyXLL XLCell
object. The XLCell object is similar to the Range object. It’s simpler in most ways, and can only be used
for a contigous range, but it allows us to use PyXLL’s type conversion when getting or setting its value
property.
To read or write the cells’ value as a DataFrame we pass type="dataframe"
to the XLCell.options
method before accessing the value
property.
from pyxll import xl_macro, xl_app, XLCell
@xl_macro
def read_value_from_excel():
# Get the Excel.Application COM object
xl = xl_app()
# Get a Range in the current active sheet
xl_range = xl.ActiveSheet.Range("A1:D10")
# Get an XLCell object from the Range object
cell = XLCell.from_range(xl_range)
# Get the value as a DataFrame
df = cell.options(type="dataframe").value
The above macro reads cells A1:D10
of the currently active sheet as a pandas DataFrame.
The type
argument to XLCell.options
is the same as the type used in the function signature
in the previous sections. It can be parameterized in the same way to control exactly how the conversion
is done between the Excel values and the pandas DataFrame.
Tip
You can pass auto_resize=True
to XLCell.options
when reading and writing DataFrames
and the cell with automatically resize to fit the data, so you don’t need to specify the full range.
When passing large DataFrames between Python functions, it is not always necessary to return the full DataFrame
to Excel and it can be expensive reconstructing the DataFrame from the Excel range each time. In those cases you
can use the object
return type to return a handle to the Python object. Functions taking the dataframe
and series
types can accept object handles.
The following returns a random DataFrame as a Python object, so will appear in Excel as a single cell with a handle to that object:
from pyxll import xl_func
import pandas as pd
import numpy as np
@xl_func("int rows, int columns: object")
def random_dataframe(rows, columns):
data = np.random.rand(rows, columns)
column_names = [chr(ord('A') + x) for x in range(columns)]
return pd.DataFrame(data, columns=column_names)
The result of a function like this can be passed to another function that expects a DataFrame:
@xl_func("dataframe, int: dataframe<index=True>", auto_resize=True)
def dataframe_head(df, num_rows):
return df.head(num_rows)
This allows for large datasets to be used in Excel efficiently, especially where the data set would be cumbersome to deal with in Excel when unpacked.
Note
DataFrames returned from Excel to Python as cached objects are not copied. When the object handle is passed to another function, the object retrieved from the cache is the same object that was previously returned.
You should be careful not to modify DataFrames passed this way (i.e. don’t make changes inplace). Instead create a copy first and modify the copy, or use inplace=False if the pandas method you’re using supports that.
pandas Series objects can be used in a similar way to DataFrames.
Instead of using the dataframe
type as described in the sections below, the series
type is also available.
Please see Pandas Types for details of all of the pandas types available, including the series
type.
Sometimes it’s useful to be able to convert a range of data into a DataFrame, or a DataFrame
into a range of data for Excel, in a context other than function decorated with xl_func
.
You might have a function that takes the var
type, which could be a DataFrame depending
on other arguments.
Or, you might want to return a DataFrame but want to decide in your function what type parameters to use to control the conversion. For example, you might want to leave it to use the user to decide whether to include the index or not and give that as an argument to your function.
In these cases the function get_type_converter
can be used. For example:
from pyxll import get_type_converter
to_dataframe = get_type_converter("var", "dataframe<index=True>")
df = to_dataframe(data)
Or the other way:
to_array = get_type_converter("dataframe", "var")
data = to_array(df)
Combined with PyXLL’s var
type you can pass in values from Excel as a plain array and perform
the conversion in your function. Or, you can specify the return type of your function as var
and convert from the DataFrame before returning the final converted value.
For example:
import pandas as pd
from pyxll import xl_func, get_type_converter
import pandas as pd
from pyxll import xl_func
@xl_func("bool include_index: var")
def return_dataframe(include_index=False):
"""A function that returns a DataFrame."""
# Create the dataframe
df = pd.DataFrame({
"A": [1, 2, 3],
"B": [4, 5, 6]
})
# Get the function to convert from a DataFrame to the PyXLL 'var'
# return type, specifying the 'index' type parameter.
to_var = get_type_converter("dataframe", "var", src_kwargs={
"index": include_index
})
# Convert the DataFrame and return the result
converted_df = to_var(df)
return converted_df
The above function returns a DataFrame. But, rather than using the dataframe
return type to do the
conversion implicitly, it uses the var
return type and performs the conversion inside the function.