cpython整型溢出

##C语言中的整型溢出

如果说我比别人看得更远些,那是因为我站在了巨人的肩上. ——牛顿

在C语言中,整型溢出分为有符号和无符号两种:

1. signed integer overflow : undefined behavior
2. unsigned integer overflow : safely wraps around (UINT_MAX + 1 yeild 0)

对于unsigned整型溢出,C的规范是有定义的——“溢出后的数会以2^(8*sizeof(type))作模运算”;对于signed整型的溢出,C的规范定义是“undefined behavior”。详细介绍请参考Basics of Integer OverflowC语言的整型溢出问题

##CPython如何判断整型溢出
当两个整型进行加、减、乘、除时都有可能产生溢出。下面以Cpython两个整数相加的实现为例说说Cpython如何判断整型溢出。

1
2
3
4
5
6
7
8
9
10
11
12
static PyObject *
int_add(PyIntObject *v, PyIntObject *w)
{
register long a, b, x;
CONVERT_TO_LONG(v, a);
CONVERT_TO_LONG(w, b);
/* casts in the line below avoid undefined behaviour on overflow */
x = (long)((unsigned long)a + b); /*转换为无符号*/
if ((x^a) >= 0 || (x^b) >= 0) /*判断*/
return PyInt_FromLong(x);
return PyLong_Type.tp_as_number->nb_add((PyObject *)v, (PyObject *)w);
}

正如上篇博文讲的整型在内存中只是一串二进制数,看你怎样解释然后分为有符号数和无符号数,所以这里首先转换为无符号整型。如果条件判断语句((x\^a) >= 0 || (x\^b) >= 0)false,就发生了溢出。这个if条件判断的是符号位。对于有符号整型,符号位为1表示负数,符号位为0表示正数。这个表达式>=0判断的就是符号位是否为0。

异或运算^中,两个bit相同为0,不同为1:

1
2
3
4
0^0 == 0;
0^1 == 1;
1^0 == 1;
!^1 == 0;

一个long能表达的数的范围是有限制的,两个long相加的情况不外乎下面6种:

1
2
3
4
5
6
7
8
9
// 没有溢出的情况
非负数 + 非负数 = 非负数
非负数 + 负数 = 负/非负数
负数 + 非负数 = 负/非负数
负数 + 负数 = 负数
// 溢出的情况
非负数 + 非负数 = 负数
负数 + 负数 = 非负数

可以看到,没有溢出的情况刚好x和a、b其中一个的符号位相同,而溢出的情况x跟a、b的符号位都不同。所以((x\^a) >= 0 || (x\^b) >= 0)就刚好能判断出来a+b有没有溢出。
这部分的内容是我在浏览cpython源码时遇到的疑问,google后发现大牛的讨论

cpython源码中关于整型溢出的注释:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
/*
Integer overflow checking for * is painful: Python tried a couple ways, but
they didn't work on all platforms, or failed in endcases (a product of
-sys.maxint-1 has been a particular pain).
Here's another way:
The native long product x*y is either exactly right or *way* off, being
just the last n bits of the true product, where n is the number of bits
in a long (the delivered product is the true product plus i*2**n for
some integer i).
The native double product (double)x * (double)y is subject to three
rounding errors: on a sizeof(long)==8 box, each cast to double can lose
info, and even on a sizeof(long)==4 box, the multiplication can lose info.
But, unlike the native long product, it's not in *range* trouble: even
if sizeof(long)==32 (256-bit longs), the product easily fits in the
dynamic range of a double. So the leading 50 (or so) bits of the double
product are correct.
We check these two ways against each other, and declare victory if they're
approximately the same. Else, because the native long product is the only
one that can lose catastrophic amounts of information, it's the native long
product that must have overflowed.
*/