Whatever your programs are doing, they often have to deal with vast amounts of data. This data is usually represented and manipulated in the form of strings. However, handling such a large quantity of input in strings can be very inefficient once you start manipulating them by copying, slicing, and modifying. Why?

Let's consider a small program that reads a large file of binary data and copies part of it into another file. To examine the memory usage of this program, we will use memory_profiler, an excellent Python package that allows us to see the memory usage of a program line by line.

@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = content[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()

Running the above program using memory_profiler produces the following:

$ python -m memory_profiler memoryview/copy.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.883 MB     0.000 MB   def read_random():
 9.887 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.656 MB     9.770 MB           content = source.read(1024 * 10000)
29.422 MB     9.766 MB           content_to_write = content[1024:]
29.422 MB     0.000 MB       print("Content length: %d, content to write length %d" %
29.434 MB     0.012 MB             (len(content), len(content_to_write)))
29.434 MB     0.000 MB       with open("/dev/null", "wb") as target:
29.434 MB     0.000 MB           target.write(content_to_write)

The call to source.read reads 10 MB from /dev/urandom. Python needs to allocate around 10 MB of memory to store this data as a bytes object. The instruction on the next line, content[1024:], copies the entire block of data minus the first kilobyte, allocating another 10 MB.

What's interesting here is that the memory usage of the program increased by about 10 MB when building the variable content_to_write. The slice operator copies the entirety of content, minus the first kilobyte, into a new bytes object.

When dealing with large amounts of data, performing this kind of operation on large byte arrays is going to be a disaster. If you have already written C code, you know that using memcpy() has a significant cost, both in terms of memory usage and general performance: copying memory is slow.
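To get a rough feel for that cost, here is a small sketch that times slicing a bytes object. The helper name best_time and the buffer size are choices made for this illustration, and the absolute numbers will vary from one machine to another. The only point is that taking a large slice costs far more than taking a small one, because every byte has to be copied.

```python
import timeit

# 10 MiB of zero bytes; the size is arbitrary, chosen for the illustration
content = bytes(10 * 1024 * 1024)


def best_time(statement, repeat=5, number=20):
    """Return the best observed time, in seconds, for one call of `statement`."""
    return min(timeit.repeat(statement, repeat=repeat, number=number)) / number


# Slicing a bytes object copies the selected bytes into a new object,
# so the cost grows with the size of the slice.
small = best_time(lambda: content[:1024])   # copies 1 KiB
large = best_time(lambda: content[1024:])   # copies almost 10 MiB

print("1 KiB slice:    %.9f s" % small)
print("~10 MiB slice:  %.9f s" % large)
```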

However, as a C programmer, you also know that strings are arrays of characters and that nothing stops you from looking at only part of this array without copying it, through the use of basic pointer arithmetic – assuming that the entire string is in a contiguous memory area.

This is possible in Python using objects that implement the buffer protocol. The buffer protocol is defined in PEP 3118, which explains the C API used to provide this protocol to various types, such as bytes.
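A quick way to find out whether an object implements the buffer protocol is to hand it to memoryview and see whether that raises TypeError. The helper below, supports_buffer_protocol, is a hypothetical name used only for this sketch. In Python 3, bytes, bytearray, and array.array expose a buffer, while str and list do not.

```python
import array


def supports_buffer_protocol(obj):
    """Return True if memoryview() accepts `obj` (hypothetical helper)."""
    try:
        # Release the view right away: we only care whether it can be built
        memoryview(obj).release()
    except TypeError:
        return False
    return True


candidates = [
    b"bytes",
    bytearray(b"bytearray"),
    array.array("i", [1, 2, 3]),
    "text",
    [1, 2, 3],
]
for candidate in candidates:
    print(type(candidate).__name__, supports_buffer_protocol(candidate))
```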

When an object implements this protocol, you can use the memoryview class constructor on it to build a new memoryview object that references the original object's memory.

>>> s = b"abcdefgh"
>>> view = memoryview(s)
>>> view[1]
98
>>> limited = view[1:3]
>>> limited
<memory at 0x7fca18b8d460>
>>> bytes(view[1:3])
b'bc'

Note: 98 is the ASCII code for the letter b.

In the example above, we use the fact that the memoryview object's slice operator itself returns a memoryview object. That means it does not copy any data but merely references a particular slice of it.
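One way to convince yourself that a slice of a memoryview shares memory with the original object is to use a mutable bytearray and change it from both sides. This is only a sketch, and the variable names are arbitrary.

```python
data = bytearray(b"abcdefgh")
view = memoryview(data)
window = view[1:3]  # references bytes 1 and 2 of `data`, no copy

# A change made to the underlying bytearray is visible through the slice...
data[1] = ord("X")
print(bytes(window))

# ...and a write through the slice changes the underlying bytearray.
window[1] = ord("Y")
print(data)

# The slice still knows which object it is looking at.
print(window.obj is data)
```

If the slice were a copy, neither change would be visible on the other side.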

The graph below illustrates what happens:

[Figure: a memoryview slice referencing part of the original object's memory]

Therefore, it is possible to rewrite the program above in a more efficient manner. We need to reference the data that we want to write using a memoryview object, rather than allocating a new bytes object.

@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = memoryview(content)[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()

Let's run the program above with the memory profiler:

$ python -m memory_profiler memoryview/copy-memoryview.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy-memoryview.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.887 MB     0.000 MB   def read_random():
 9.891 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.660 MB     9.770 MB           content = source.read(1024 * 10000)
19.660 MB     0.000 MB           content_to_write = memoryview(content)[1024:]
19.660 MB     0.000 MB       print("Content length: %d, content to write length %d" %
19.672 MB     0.012 MB             (len(content), len(content_to_write)))
19.672 MB     0.000 MB       with open("/dev/null", "wb") as target:
19.672 MB     0.000 MB           target.write(content_to_write)

In this case, the source.read call still allocates 10 MB of memory to read the content of the file. However, when using memoryview to refer to the offset content, no additional memory is allocated.

This version of the program allocates about half as much memory for the data as the original version: roughly 10 MB instead of 20 MB!
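If memory_profiler is not available, the standard library's tracemalloc module can give a similar picture. The sketch below is not the measurement used above. The helper peak_allocation is a hypothetical name, and the data is a block of zero bytes rather than the content of /dev/urandom, so that it runs anywhere.

```python
import tracemalloc


def peak_allocation(operation):
    """Return the peak number of bytes allocated while running `operation`."""
    tracemalloc.start()
    try:
        result = operation()  # keep the result alive while measuring
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak


content = bytes(1024 * 10000)

# Slicing the bytes object allocates a new object of almost the same size
copy_peak = peak_allocation(lambda: content[1024:])
# Slicing a memoryview only allocates the small view objects
view_peak = peak_allocation(lambda: memoryview(content)[1024:])

print("bytes slice peak:      %d bytes" % copy_peak)
print("memoryview slice peak: %d bytes" % view_peak)
```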

This kind of trick is especially useful when dealing with sockets. When sending data over a socket, a single call to send might not transmit all of the data.

import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b"a" * (1024 * 100000)
while data:
    sent = s.send(data)
    # Drop the first `sent` bytes; this slice copies all the remaining data
    data = data[sent:]

With the loop above, the program copies the remaining data after every call to send, over and over, until the socket has sent everything. By using memoryview, it is possible to achieve the same functionality without any copy, and therefore with higher performance:

import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b"a" * (1024 * 100000)
mv = memoryview(data)
while mv:
    sent = s.send(mv)
    # Build a new memoryview object pointing to the data which remains to be sent
    mv = mv[sent:]

As this won't copy anything, it won't use any more memory than the 100 MB initially needed for the data variable.
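The snippets above leave the socket setup elided. To try the loop without a network, here is a self-contained sketch that uses socket.socketpair and a reader thread. The function names send_all and transfer, the 1 MiB payload, and the 65536-byte read size are all assumptions made for this illustration.

```python
import socket
import threading


def send_all(sock, data):
    """Send all of `data`, advancing a memoryview instead of copying bytes."""
    view = memoryview(data)
    total = 0
    while view:
        sent = sock.send(view)
        total += sent
        # New view on the bytes that remain to be sent; nothing is copied
        view = view[sent:]
    return total


def transfer(payload):
    """Send `payload` through a local socket pair; return (sent, received)."""
    left, right = socket.socketpair()
    received = bytearray()

    def drain():
        # Read until the sending side closes its end of the pair
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            received.extend(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    sent = send_all(left, payload)
    left.close()
    reader.join()
    right.close()
    return sent, len(received)


sent, received = transfer(b"a" * (1024 * 1024))
print("sent %d bytes, received %d bytes" % (sent, received))
```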

So far we've used memoryview objects to write data efficiently, but the same method can also be used to read data. Most I/O operations in Python know how to deal with objects implementing the buffer protocol: they can read from them, but also write to them. In this case, we don't need memoryview objects, as we can ask an I/O function to write into our pre-allocated object:

>>> ba = bytearray(8)
>>> ba
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00')
>>> with open("/dev/urandom", "rb") as source:
...     source.readinto(ba)
... 
8
>>> ba
bytearray(b'`m.z\x8d\x0fp\xa1')

With such techniques, it's easy to pre-allocate a buffer (as you would do in C to reduce the number of calls to malloc()) and fill it at your convenience.
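As an illustration, a copy loop can allocate its buffer once and reuse it for every chunk. The sketch below uses io.BytesIO objects as stand-ins for real files so that it runs anywhere. The function name copy_stream and the chunk size are assumptions, not part of any library.

```python
import io


def copy_stream(source, target, chunk_size=4096):
    """Copy `source` to `target`, reusing one pre-allocated buffer."""
    buffer = bytearray(chunk_size)  # allocated once, reused for every chunk
    view = memoryview(buffer)
    total = 0
    while True:
        count = source.readinto(buffer)
        if not count:
            break
        # Write only the bytes that were actually read, without copying them
        target.write(view[:count])
        total += count
    return total


source = io.BytesIO(bytes(range(256)) * 100)
target = io.BytesIO()
copied = copy_stream(source, target, chunk_size=1000)
print("copied %d bytes" % copied)
```

The last chunk is usually shorter than the buffer, which is why the loop writes view[:count] rather than the whole buffer.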

Using memoryview, you can even place data at any point in the memory area:

>>> ba = bytearray(8)
>>> # Reference the _bytearray_ from offset 4 to its end
>>> ba_at_4 = memoryview(ba)[4:]
>>> with open("/dev/urandom", "rb") as source:
...     # Write the content of /dev/urandom from offset 4 to the end of the
...     # bytearray, effectively reading 4 bytes only
...     source.readinto(ba_at_4)
... 
4
>>> ba
bytearray(b'\x00\x00\x00\x00\x0b\x19\xae\xb2')

The buffer protocol is fundamental to achieving low memory overhead and high performance. As Python hides all the memory allocations, developers tend to forget what happens under the hood, at a high cost for the speed of their programs!

It's also good to know that both the objects in the array module and the functions in the struct module handle the buffer protocol correctly, and can therefore work efficiently when you are aiming for zero copy.
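Here is a short sketch of both modules working on existing buffers. The field layout ("<IH" at offset 4) and the values are made up for the example.

```python
import array
import struct

# struct can pack straight into a pre-allocated buffer at a given offset...
packet = bytearray(16)
struct.pack_into("<IH", packet, 4, 0xDEADBEEF, 42)

# ...and unpack from any object exposing a buffer, without slicing it first
magic, count = struct.unpack_from("<IH", memoryview(packet), 4)
print(hex(magic), count)

# array objects expose their memory too, so a memoryview can read and
# write their items in place
samples = array.array("i", [1, 2, 3, 4])
samples_view = memoryview(samples)
samples_view[0] = 10
middle = samples_view[1:3].tolist()
print(samples, middle)
```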