Home Product Integrations Customers Company Jobs Resources SQL School Playbook Sign In

“Group By” in SQL and Python: a Comparison

Analysts and data scientists with a deep understanding of multiple analytical programming languages find themselves at an advantage in today’s data landscape. The prevailing dialogue around this kind of multi-lingual approach, especially with SQL and Python, typically portrays the languages as complementary, yet functionally discrete. But there are actually many overlapping functions that can be accomplished by both SQL and Python.

Exploring the overlapping functionality of SQL and Python can help those of us familiar with one language become more adept with the other. And with a deep understanding of both, we can all make smarter decisions about how to combine and leverage each, making it easy to always choose the right tool for every task.

For instance; many functions are more concisely written in Python. But if the benefit of concision is outweighed by the value of having all analytical code exist in a single language, you’ll still want to use SQL. On the other hand, Python enables many complex functions that are not feasible in SQL, and the value of a SQL-only approach might be outweighed by the wider array of operations that are made available when using both languages together. Only by learning about the functional overlap of analytical languages can we make these decisions effectively.

We’ve written previously about the differences in how Python and SQL execute window functions. Today, we’ll focus on GroupBy operations, which are another great example of a task that can be executed by both SQL and Python, and for which the decision of language depends on your goals.

To see how all the examples mentioned in this post are implemented in practice, check out this example report. This post is structured so that you can skip around based on your current knowledge; already comfortable with grouping operations in SQL? Jump directly to the Python section, and vice versa.

SQL

Aggregation Functions

When we apply a grouping operation to a dataset in SQL, we split the dataset into distinct “groups.” In practice, the type of function most commonly applied to a group of data is an aggregation function. At a high level, the process of aggregating data can be described as applying a function to a number of rows to create a smaller subset of rows. In practice, this often looks like a calculation of the total count of the number of rows in a dataset, or a calculation of the sum of all of the rows in a particular column. For a more comprehensive explanation of the basics of SQL aggregate functions, check out the aggregate functions module in Mode’s SQL School.

By default, SQL aggregate functions like count() apply to all of the rows in a dataset and return a single value. For example, the following query returns the total number of rows in the sf_bike_share_trip table:

select count(*) as num_trips
from modeanalytics.sf_bike_share_trip

Interestingly, the count() aggregate function has another, often overlooked piece of functionality. If you specify a column within the count() function, it will return the number of rows with non-null instances of that column instead of the absolute total number of rows. For example, if we wanted to calculate the number of trips where the zip_code field is not null, we could use the following query:

select count(zip_code) as num_trips_with_zip_code
from modeanalytics.sf_bike_share_trip

Applying aggregation functions to entire datasets is useful functionality, but as previously mentioned, we often want to apply the function separately to distinct groups of data within a dataset. This is where the group by operation comes in.

Group By

At a high level, the SQL group by clause allows you to independently apply aggregation functions to distinct groups of data within a dataset. Our SQL School further explains the basics of the group by clause.

Going back to our previous query, let’s count bike trips (count(*)) by the station where they started (start_station_name) column. We do this by adding start_station_name to the list of columns we’re selecting, and adding a group by clause explicitly telling SQL to group the count(*) aggregation by the start_station_name column:

select
  bst.start_station_name,
  count(*) as num_trips
from modeanalytics.sf_bike_share_trip bst
group by bst.start_station_name

What if we also wanted the average, min, and max trip duration per start station? SQL allows us to add an arbitrary amount of aggregation functions to this query:

select
  bst.start_station_name,
  count(*) as num_trips,
  avg(bst.duration) as avg_duration_seconds,
  min(bst.duration) as min_duration_seconds,
  max(bst.duration) as max_duration_seconds
from modeanalytics.sf_bike_share_trip bst
group by bst.start_station_name

SQL also allows us to add an arbitrary amount of columns to our group by clause. For example, if we want to calculate the number of trips and the average, min, and max trip duration for every combination of start stations and end stations, we could do that using the following query:

select
  bst.start_station_name,
  bst.end_station_name,
  count(*) as num_trips,
  avg(bst.duration) as avg_duration_seconds,
  min(bst.duration) as min_duration_seconds,
  max(bst.duration) as max_duration_seconds
from modeanalytics.sf_bike_share_trip bst
group by bst.start_station_name,bst.end_station_name

As you can see, we added the end_station_name column to both our select list and our group by clause. Most SQL platforms offer some syntactic sugar for this operation, allowing you to use the ordinal position of a column in the select statement instead of explicitly referencing column names. In practice, this would look like:

select
  bst.start_station_name,
  bst.end_station_name,
  count(*) as num_trips,
  avg(duration) as avg_duration_seconds,
  min(duration) as min_duration_seconds,
  max(duration) as max_duration_seconds
from modeanalytics.sf_bike_share_trip bst
group by 1,2

Note: not all SQL platforms allow the referencing of columns by their ordinal position in the select statement. For example, on Oracle and SQL Server databases, you will need to explicitly reference the columns by name in the group by clause.

What if we want to filter the values returned from this query strictly to start station and end station combinations with more than 1,000 trips? Since the SQL where clause only supports filtering records and not results of aggregation functions, we’ll need to find another way. This is where the having clause comes in.

Having

The having clause allows users to filter the values returned from a grouped query based on the results of aggregation functions. Mode’s SQL School offers more detail about the basics of the having clause.

As previously mentioned, we want to filter the values returned from our query to start and end station combinations with greater than 1,000 trips. We can achieve this using the following query:

select
  bst.start_station_name,
  bst.end_station_name,
  count(1) as num_trips,
  avg(duration) as avg_duration_seconds,
  min(duration) as min_duration_seconds,
  max(duration) as max_duration_seconds
from modeanalytics.sf_bike_share_trip bst
group by 1,2
having count(1) > 1000

At this point, we’ve explored a lot of what SQL can do with group by functionality. Let’s look at what this exact same implementation would look like in Python, using the pandas library.

Python

As is typically the case, SQL and pandas differ quite dramatically in terms of syntax, but have a lot in common functionality-wise. SQL may be the more straightforward option with only simpler tasks at hand, but the limit of grouping functionality in SQL only scratches the surface of Python’s grouping functionality using pandas.

In pandas, “groups” of data are created with a python method called groupby(). The groupby() method can be called directly on a pandas Dataframe object.

As an example, we are going to use the output of the SQL query named Python as an input to our Dataframe (df) in our Python notebook. Note that this Dataframe does not have any of the aggregation functions being calculated via SQL. It’s simply using SQL to select the required fields for our analysis, and we’ll use pandas to do the rest. An added benefit of conducting this operation in Python is that the workload is moved out of the data warehouse.

In our Python notebook, we’re going to use the groupby() method to group our Dataframe by the start_station_name column and assign that result to a new variable called grouped_single:

grouped_single = df.groupby('start_station_name')

Applying the groupby() method to our Dataframe object returns a GroupBy object, which is then assigned to the grouped_single variable. An important thing to note about a pandas GroupBy object is that no splitting of the Dataframe has taken place at the point of creating the object. The GroupBy object simply has all of the information it needs about the nature of the grouping. No aggregation will take place until we explicitly call an aggregation function on the GroupBy object.

If we want to group our Dataframe by both the start_station_name and end_station_name column, as we did in our SQL query, we can simply add the end_station_name column to our list of grouping keys inside the groupby() method:

grouped_multiple = df.groupby(['start_station_name','end_station_name'])

Unlike SQL, the pandas groupby() method does not have a concept of ordinal position references. Thus, you will need to explicitly reference the grouping keys by name.

Now that we have our GroupBy object created with the appropriate groupings, we can apply aggregation methods to it. By default, pandas will apply any aggregation method you call on a GroupBy object to all non-nuisance columns. Since we are only interested in applying aggregation methods to a single column (trip_duration_seconds), we will select only that column from our new GroupBy object. The way you select individual columns from a GroupBy object is similar to how you would with a Dataframe object:

grouped_multiple_column = df.groupby(['start_station_name','end_station_name'])['trip_duration_seconds']

Pandas allows you select any number of columns using this operation. Pandas will return a grouped Series when you select a single column, and a grouped Dataframe when you select multiple columns. Now that we have our single column selected from our GroupBy object, we can apply the appropriate aggregation methods to it. There are multiple ways to do this. We can assign a variable to the GroupBy object and then call an aggregation method on that variable:

grouped_multiple = df.groupby(['start_station_name','end_station_name'])
avg_trip_duration = grouped_multiple['trip_duration_seconds'].mean()

Or we can use method chaining:

avg_duration_seconds = df.groupby(['start_station_name','end_station_name'])['trip_duration_seconds'].mean()

We can repeat this process for each of the aggregations we are looking to compute:

# Calculate aggregations separately
num_trips = df.groupby(['start_station_name','end_station_name']).size()
avg_duration_seconds = df.groupby(['start_station_name','end_station_name'])['trip_duration_seconds'].mean()
min_duration_seconds = df.groupby(['start_station_name','end_station_name'])['trip_duration_seconds'].min()
max_duration_seconds = df.groupby(['start_station_name','end_station_name'])['trip_duration_seconds'].max(

You can see how this has the potential to get messy if you are computing a large number of aggregations. Luckily, pandas offers a more pythonic way of calculating multiple aggregations on a single GroupBy object. The agg() method can take take a list of aggregation methods for individual columns:

# Calculate aggregations at once
all_together = (df.groupby(['start_station_name','end_station_name']
                  .agg({'trip_duration_seconds': [np.size, np.mean, np.min, np.max]}))

You can also chain the pandas rename() method to rename the new aggregation columns to your desired names:

# Calculate and rename aggregations
all_together = (df.groupby(['start_station_name','end_station_name'])
                  .agg({'trip_duration_seconds': [np.size, np.mean, np.min, np.max]})
                  .rename(columns={'size': 'num_trips', 'mean': 'avg_duration_seconds', 'amin': 'min_duration_seconds', 'amax': 'max_duration_seconds'}))

Now we’ve replicated our SQL query up to the point of filtering our groups based on an aggregation function result. In pandas, you can use standard indexing to return a subset of the original object based on the result of an aggregation function. For example, if we wanted to filter our result set to only start station and end station combinations with greater than 1,000 trips, we could use the following statement:

all_together[all_together['trip_duration_seconds']['num_trips'] > 1000]

At this point, we’ve fully replicated the output of our original SQL query while offloading the grouping and aggregation work to pandas. Again, this example only scratches the surface of what is possible using pandas grouping functionality. Many group-based operations that are complex (or even impossible) using SQL are optimized within the pandas framework. This includes things like dataset transformations, quantile and bucket analysis, group-wise linear regression, and application of user-defined functions, amongst others. Access to these types of operations significantly widens the spectrum of questions we’re capable of answering.

Blur The Line

A deep understanding of grouping functions in both SQL and Python can help you determine which language should be used for which function at which time. If you don’t have complex grouping requirements, you’ll likely want to work in SQL, so that you can keep all your work in one language. But if your analysis includes heavier group-based operations like those mentioned above, moving all grouping operations into the Python notebook will let you take advantage of the more sophisticated functionality available in pandas.

That’s the benefit of being able to use multiple data analysis languages; you can customize a hybrid approach to suit your needs as they evolve.