Sorting Data with pandas.DataFrame.sort_values

Let’s start with a basic premise. You have a chunk of data, a two-dimensional block of values sitting in memory. In the world of pandas, this is a DataFrame. Without any specific order, it’s a jumble of facts. It might have been read from a database query that didn’t specify an ORDER BY clause, or maybe it’s raw sensor data streamed in the sequence it was captured. It’s just… there. Information, yes, but not insight. Not yet.

Consider this collection of data about a few fictional programmers:

import pandas as pd
import numpy as np

data = {
    'Name': ['Ada', 'Grace', 'Hedy', 'Margaret', 'Katherine'],
    'Born': [1815, 1906, 1914, 1936, 1918],
    'LOC_per_day': [95, 45, 70, 150, 120]
}
df = pd.DataFrame(data)

# The initial, unsorted state
print(df)

Running this gives you a table, but the rows are in the arbitrary order they were defined in the dictionary. If you wanted to find the oldest programmer in this group, you’d have to scan the ‘Born’ column with your eyes. For five rows, that’s trivial. For five million, it’s impossible. The machine needs to do the work, and for that, it needs order. The most fundamental operation is to impose order based on the values in a single column. This is the job of the pandas.DataFrame.sort_values() method.

To use it in its simplest form, you just need to tell it which column to use as the key for the sort. You pass the column’s name as a string to the by parameter. Let’s sort our pioneers by their birth year.

# Sort by the 'Born' column
df_sorted_by_birth = df.sort_values(by='Born')

print(df_sorted_by_birth)

What you get back is a new DataFrame. The original df is untouched, which is a critical feature. This principle of immutability—of functions returning new objects rather than modifying existing ones—prevents a whole class of nasty side-effect bugs. The new DataFrame, df_sorted_by_birth, contains the exact same data, but the rows have been physically reordered in memory according to the values in the ‘Born’ column, from smallest to largest. This is the default behavior: ascending order. If you want the reverse, to see who contributed most recently, you simply flip a boolean flag. The ascending parameter controls the direction.

# Sort by birth year, descending
df_sorted_desc = df.sort_values(by='Born', ascending=False)

print(df_sorted_desc)

Now Margaret, born in 1936, is at the top. The logic is simple, but the implications are powerful. We’ve transformed the data into a view that answers a specific question. But what about that new DataFrame? Creating a copy of your data for every operation can be expensive, especially if your DataFrame is holding gigabytes of information. For these situations, pandas provides an escape hatch: the inplace parameter. By setting inplace=True, you are telling pandas to perform the sort directly on the original DataFrame, overwriting its previous state. It does not return a new DataFrame; in fact, it returns None. This is a direct trade-off: you sacrifice the safety of immutability for memory efficiency and, potentially, a slight performance gain by avoiding a memory copy.

# Create a copy to modify in-place
df_copy = df.copy()

print("Before inplace sort:")
print(df_copy)

# Perform the sort in-place
df_copy.sort_values(by='LOC_per_day', inplace=True)

print("nAfter inplace sort:")
print(df_copy)

Notice how we don’t assign the result of the sort_values call to anything. The operation modifies df_copy directly. This can be a useful optimization, but it’s a sharp tool. Chaining methods, a common and elegant pandas idiom like df.sort_values(by='X').head(), breaks when you use inplace=True, because the sort returns None, and None has no head() method. You must be conscious of the state of your objects at all times. For most day-to-day work, the default behavior of returning a new, sorted copy is safer, cleaner, and the performance cost is often negligible compared to the cost of debugging state-related errors.
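
To see the failure mode concretely, here is a minimal sketch contrasting the two styles (df is the DataFrame defined earlier):

# Chaining works because sort_values returns a new DataFrame by default
top_two = df.sort_values(by='Born').head(2)
print(top_two)

# With inplace=True the call returns None, so chaining breaks
result = df.copy().sort_values(by='Born', inplace=True)
print(result)  # None -- calling .head() on this would raise an AttributeError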

A Cascade of Keys

Sorting on a single axis of information is a good start, but reality is rarely so simple. What happens when your primary sorting key has duplicate values? If we sort our programmers by productivity, what about two coders who crank out the same number of lines per day? Which one comes first? The answer is: it depends. Without a secondary instruction, the relative order of those tied rows is not guaranteed. Whether ties keep their original order is a property of the sorting algorithm known as stability, and with the default algorithm you shouldn’t rely on it. You need to be explicit. You need to define a tie-breaker. This is where the true power of sort_values begins to surface. You aren’t limited to a single key; you can provide a whole hierarchy of them.

Let’s expand our dataset to include the programmers’ countries of origin. This introduces a new dimension and, critically, creates groups of rows with a shared value.

data_multi = {
    'Name': ['Ada', 'Grace', 'Hedy', 'Margaret', 'Katherine', 'Fran'],
    'Born': [1815, 1906, 1914, 1936, 1918, 1926],
    'Country': ['UK', 'USA', 'Austria', 'USA', 'USA', 'USA'],
    'LOC_per_day': [95, 45, 70, 150, 120, 135]
}
df_multi = pd.DataFrame(data_multi)

print(df_multi)

We now have four programmers from the ‘USA’. If we sort by ‘Country’ alone, pandas will group all the ‘USA’ rows together, but their internal order is not something we’ve controlled. To impose a more meaningful order, we can pass a list of column names to the by parameter. This list establishes a priority. Pandas will first sort the entire DataFrame by the first column in the list. Then, for each block of rows that have an identical value in that first column, it will sort that block using the second column in the list, and so on. It’s a cascade of sorting operations, each one refining the order within the groups established by the previous one.

Let’s sort by country, and then, for the programmers within each country, sort them by their birth year.

# Sort by Country, then by Born
df_sorted_multi = df_multi.sort_values(by=['Country', 'Born'])

print(df_sorted_multi)

The result is perfectly ordered. ‘Austria’ comes first, then ‘UK’, then the block of ‘USA’ rows. Within that ‘USA’ block, the rows are no longer in their original sequence; they are now sorted by the ‘Born’ column, starting with Grace (1906) and ending with Margaret (1936). We’ve created a compound key. The true identity of the key isn’t just one column, but the tuple (Country, Born). This is conceptually similar to how a multi-column index works in a SQL database.
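
If the cascade feels abstract, the same semantics can be sketched in plain Python, where tuples already compare element by element; this is an analogy for the behavior, not what pandas does internally:

# Tuples compare left to right, so (Country, Born) acts as one compound key
records = df_multi.to_dict('records')
for row in sorted(records, key=lambda r: (r['Country'], r['Born'])):
    print(row['Country'], row['Born'], row['Name'])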

The cascade can be even more nuanced. You don’t have to sort everything in the same direction. What if you want to group by country alphabetically (ascending), but within each country, you want to see the most prolific programmer first (descending)? The ascending parameter can also accept a list of booleans. This list must be the same length as the list of keys in the by parameter, with each boolean corresponding to a key at the same position.

# Sort by Country (ascending) and LOC_per_day (descending)
df_sorted_mixed = df_multi.sort_values(
    by=['Country', 'LOC_per_day'],
    ascending=[True, False]
)

print(df_sorted_mixed)

Look at the ‘USA’ block now. Margaret, with 150 LOC/day, is at the top of the group, followed by Fran, Katherine, and finally Grace. We’ve specified a different sort order for each level of the key hierarchy. This level of control is essential for crafting complex data views that reveal specific insights. Each additional key in the list adds another layer of logic to the sort operation. The machine partitions the data based on the first key, then recursively partitions each of those chunks by the next key, applying the specified direction at each stage. It’s an elegant abstraction over what would otherwise be a messy series of filtering and sorting operations. But this elegance isn’t without cost. Each key adds comparisons and complexity to the underlying algorithm. And we haven’t even considered what happens when the data isn’t clean, when there are holes in our neat grid of values.

Grappling with the Void

The real world is messy. Data acquisition is an imperfect process. Sensors fail, network packets are dropped, users skip optional fields in a form. The result is that our pristine grid of data often has holes in it. In pandas, these holes, these voids in the information matrix, are typically represented by a special floating-point value: NaN, or Not a Number. The presence of NaN raises a fundamental question for a sorting algorithm: how do you compare a value with a void? What is the ordinal relationship between 150 and NaN? There’s no mathematically pure answer, so the software must make a pragmatic choice.
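
You can see the dilemma directly in the interpreter: every ordering comparison involving NaN evaluates to False, so NaN has no natural position in a sorted sequence.

# NaN defeats ordinary comparison -- all of these print False
print(np.nan > 150)
print(np.nan < 150)
print(np.nan == np.nan)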

Let’s inject some of this real-world chaos into our dataset. Suppose the system that recorded Grace Hopper’s productivity had a glitch, leaving us with a missing value for her Lines of Code.

data_nan = {
    'Name': ['Ada', 'Grace', 'Hedy', 'Margaret', 'Katherine', 'Fran'],
    'Born': [1815, 1906, 1914, 1936, 1918, 1926],
    'Country': ['UK', 'USA', 'Austria', 'USA', 'USA', 'USA'],
    'LOC_per_day': [95, np.nan, 70, 150, 120, 135]
}
df_nan = pd.DataFrame(data_nan)

# Sort by the column containing a NaN
df_sorted_nan = df_nan.sort_values(by='LOC_per_day')

print(df_sorted_nan)

When you run this, you’ll see a clear decision has been made. Hedy, with 70 LOC/day, is at the top of the sorted list. At the very bottom, after Margaret’s 150, sits Grace, with her NaN value. What if we sort in descending order? The NaN row for Grace remains stubbornly at the bottom. The default behavior of sort_values is to sequester all rows with NaN values in the sorting key column and place them at the end of the resulting DataFrame, regardless of the sorting direction. You can confirm the descending case yourself:
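
# Descending sort: Grace's NaN row still lands at the bottom
print(df_nan.sort_values(by='LOC_per_day', ascending=False))

This is often a reasonable default; it keeps the “good” data together. But what if your goal is to find the “bad” data? What if you want to isolate all the rows with missing information for further processing?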

For this, pandas provides another parameter: na_position. This gives you explicit control over the fate of the NaNs. It accepts one of two string arguments: 'first' or 'last'. The meaning is exactly what you’d expect. By setting na_position='first', you instruct the sort to place all NaN-containing rows at the very beginning of the DataFrame.

# Force NaNs to the beginning of the sort
df_nan_first = df_nan.sort_values(by='LOC_per_day', na_position='first')

print(df_nan_first)

Now Grace’s row, still carrying its original index label of 1, appears at the top. The rest of the rows, from Hedy to Margaret, are sorted in ascending order below her. The na_position parameter acts as an override for the placement of NaNs, operating independently of the ascending parameter. The ascending flag still controls the order of the non-null values, but na_position determines whether the block of nulls comes before or after that sorted data. Using na_position='last' is simply the explicit way to invoke the default behavior.
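
Two quick sketches drive the point home. The first confirms that na_position operates independently of ascending; the second shows the cleaning pattern this enables, surfacing exactly the incomplete rows for inspection:

# Descending sort with NaNs still forced to the top
print(df_nan.sort_values(by='LOC_per_day', ascending=False, na_position='first'))

# One cleaning pattern: pull all incomplete rows to the front, then slice them off
n_missing = df_nan['LOC_per_day'].isna().sum()
suspect_rows = df_nan.sort_values(by='LOC_per_day', na_position='first').head(n_missing)
print(suspect_rows)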

This control is not a minor feature; it is essential for data cleaning and validation. It allows you to systematically bring all rows with data quality issues to the forefront, where they can be inspected, imputed, or removed. It transforms NaN from a simple nuisance into a value that you can actively grapple with and manage as part of your data manipulation pipeline. This explicit handling of the void is a step beyond simple ordering; it’s about imposing a structure that accounts for the imperfections in the data itself. But even with this level of control, the performance characteristics of the sort are still a factor. The underlying algorithm has to do more work to handle these special cases, and the cost of this work is not always obvious.

There’s No Such Thing as a Free Sort

Sorting is not a free lunch. Every time you ask the machine to impose order on chaos, you are spending a resource that is very much finite: CPU time. An unsorted array is a high-entropy state; a sorted array is a low-entropy one, and lowering entropy takes work. That work is governed by the fundamental laws of computation, specifically by the complexity of sorting algorithms. For a general-purpose comparison sort, the best you can hope for is performance on the order of O(n log n), where ‘n’ is the number of rows you’re sorting. This isn’t a limitation of pandas; it’s a theoretical lower bound. You can’t do better. What you can do is choose an algorithm whose trade-offs are best suited to your specific problem. Pandas, in its wisdom, doesn’t lock you into a single choice. It gives you a lever to pull: the kind parameter.

The kind parameter lets you specify the sorting algorithm to use. The options are 'quicksort' (the default), 'mergesort', 'heapsort', and 'stable'. Note that pandas applies kind only when you sort on a single column; sorts over multiple keys always use a stable algorithm under the hood regardless of what you request. Each of these names represents a different strategy for attacking the problem, with different performance characteristics. Quicksort is often the fastest on average, but classic implementations have a dark side: a worst-case performance of O(n²), which can be triggered by already-sorted or nearly-sorted data (NumPy’s 'quicksort' is actually an introsort that falls back to heapsort to dodge this pathology). It is also an “unstable” sort. A stable sort is one that preserves the original relative order of elements that are considered equal by the sort key. Mergesort, by contrast, is stable and guarantees O(n log n) performance, but it requires extra memory proportional to ‘n’ to do its work. Heapsort is O(n log n) and sorts in-place (like quicksort), but it’s not stable. The 'stable' option simply asks for whichever stable algorithm the backend provides, in practice a timsort or radix sort depending on dtype. The choice matters.

Let’s make this concrete. Stability is not an academic concern. Imagine you have data that is already sorted by date, and you want to apply a secondary sort by a ‘Category’ field. If you use an unstable sort, the original date ordering within each category might be destroyed. A stable sort guarantees it will be preserved. Consider this setup:

df_stability_test = pd.DataFrame({
    'Category': ['B', 'A', 'B', 'A', 'B'],
    'Value': [1, 2, 3, 4, 5]
})

# Note the initial order is sorted by 'Value'
print("Original DataFrame:")
print(df_stability_test)

# Sort by 'Category' using an unstable algorithm
df_unstable = df_stability_test.sort_values(by='Category', kind='quicksort')
print("nUnstable (quicksort):")
print(df_unstable)

# Sort by 'Category' using a stable algorithm
df_stable = df_stability_test.sort_values(by='Category', kind='mergesort')
print("nStable (mergesort):")
print(df_stable)

Look closely at the output. In the stable sort, the rows for Category ‘A’ appear in the order Value: 2, then Value: 4, which was their original relative order. The rows for ‘B’ appear as 1, 3, 5, also preserving their original sequence. The unstable quicksort offers no such guarantee: on a frame this small it may happen to return the same order, but on larger data equal-keyed rows can come back scrambled. This is why, whenever the relative order of ties matters, you should request stability explicitly with kind='mergesort' or kind='stable'. For data analysis, predictability and correctness are paramount, and the guarantee of stability is usually worth the potential, and often marginal, performance difference.

The cost of the sort isn’t just in the choice of algorithm. The very nature of the data you’re sorting plays a huge role. Sorting a column of 64-bit integers is blindingly fast; a CPU can compare two integers in a single clock cycle. Sorting a column of strings is a different beast entirely. Each comparison isn’t a single instruction but a call to a function like strcmp() that must iterate over the characters of both strings until a difference is found or the strings end. Longer, more similar strings take more cycles to compare. Furthermore, the memory layout matters. A DataFrame with a simple numeric type is a contiguous block of memory. The sorting algorithm can stride through it efficiently. A DataFrame containing Python objects, like strings, is actually a block of pointers, and each pointer refers to a different location in memory where the actual string data is stored. This causes cache misses and slows down the entire process. The performance of sort_values is a complex interplay between the number of rows, the chosen algorithm, the data types in the key columns, and even the characteristics of the values themselves.
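
If you want to feel this difference rather than take it on faith, a rough benchmark does the trick; exact timings are machine-dependent and illustrative only, but the integer sort reliably wins by a wide margin:

import timeit

n = 1_000_000
rng = np.random.default_rng(0)
df_ints = pd.DataFrame({'key': rng.integers(0, 10**9, size=n)})
df_strs = pd.DataFrame({'key': df_ints['key'].astype(str)})  # same values, object dtype

t_int = timeit.timeit(lambda: df_ints.sort_values(by='key'), number=3)
t_str = timeit.timeit(lambda: df_strs.sort_values(by='key'), number=3)
print(f"int sort:    {t_int:.2f}s")
print(f"string sort: {t_str:.2f}s")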

Source: https://www.pythonlore.com/sorting-data-with-pandas-dataframe-sort_values/

