The problem with sep='\s+' in Pandas

Posted on 2018-12-04 in Python Tips

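Pandas' read_table (official documentation) and read_csv (official documentation) behave identically once a separator is specified. Both take a sep= keyword (read_table defaults to sep='\t', read_csv defaults to sep=',').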
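As a minimal sketch of that equivalence, the snippet below parses the same tab-separated text with both functions. The sample data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample data.
text = "a\tb\n1\t2\n3\t4\n"

# read_table uses sep='\t' by default.
df_table = pd.read_table(io.StringIO(text))

# read_csv needs the separator spelled out to parse the same text.
df_csv = pd.read_csv(io.StringIO(text), sep="\t")

# Both calls should yield the same DataFrame.
print(df_table.equals(df_csv))
```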

However, when the fields are separated by runs of several spaces, sep='\s+' may not be recognized correctly.

Use ' +' instead.
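A small sketch of the ' +' separator in use, assuming a made-up sample whose columns are padded with varying numbers of spaces. Passing engine='python' explicitly is a choice made here to avoid relying on pandas' automatic fallback.

```python
import io

import pandas as pd

# Hypothetical sample: columns separated by runs of spaces.
text = "name  score   grade\nalice 90      A\nbob   85      B\n"

# ' +' is longer than one character, so pandas treats it as a regular
# expression meaning "one or more spaces"; this needs the Python engine.
df = pd.read_csv(io.StringIO(text), sep=" +", engine="python")

print(df)
```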

The official documentation's description of the sep keyword includes the following:

In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

Since sep='\s+' may not be recognized correctly, a temporary workaround is:
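Add the argument engine='python', or use ' +' instead (a regex separator, which implies engine='python').
Setting engine='c' is faster, but the C engine does not handle regular expressions well. If speed matters, it can be combined with:
quoting=3 (that is, csv.QUOTE_NONE)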
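The two workarounds can be sketched side by side as follows. The sample data and variable names are assumptions for illustration, and the separator is written as a raw string (r'\s+') so that Python does not complain about the backslash escape. Behaviour may differ between pandas versions.

```python
import csv
import io

import pandas as pd

# Hypothetical whitespace-separated sample data.
text = "x   y  z\n1   2  3\n4   5  6\n"

# Workaround 1: regex separator with the Python engine.
df_py = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python")

# Workaround 2: the faster C engine with quoting disabled.
# csv.QUOTE_NONE is the named constant for quoting=3.
df_c = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    engine="c",
    quoting=csv.QUOTE_NONE,
)

print(df_py)
print(df_c)
```

Using the named constant csv.QUOTE_NONE rather than the bare 3 keeps the intent readable.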

For a brief introduction to Pandas' reading methods, see: 用 pandas 读 csv (Reading CSV with pandas)