mongodb: “invalid character ',' looking for beginning of value”
I had a huge JSON file (about 800 MB) that needed to be imported into MongoDB, and I ran into a problem:
$ mongoimport --db weibo --collection data --file test.json
2018-05-09T16:10:22.357+0800 connected to: localhost
2018-05-09T16:10:22.360+0800 Failed: error processing document #2: invalid character ',' looking for beginning of value
2018-05-09T16:10:22.360+0800 imported 0 documents
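That error string comes from the JSON decoder inside mongoimport: by default (without the `--jsonArray` flag) it expects one document after another, and a `,` is never a valid start of a value. A rough illustration of the same class of failure with Python's json module (this is not mongoimport's actual decoder, just an analogy):

```python
import json

# A comma where a new value should begin is invalid JSON; this is the
# same kind of failure mongoimport is reporting.
try:
    json.loads(',{"text": "hello"}')
    err = None
except json.JSONDecodeError as e:
    err = e.msg
print(err)
```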
first
First, I went to 菜鸟工具<sup id="fnref-1">1</sup> to check whether my JSON was well-formed, and the format turned out to be fine.
second
I then guessed it was an encoding problem, maybe something macOS-specific, because a Stack Overflow question<sup id="fnref-2">2</sup> discussed this error. The replies there blamed characters that UTF-8 doesn't support, but the problems they hit all involved \\. I even installed MongoDB on a Windows server and tried again, with the same result, so mine was probably not a character issue.
JP Lew ran into this as well, and one suggestion on that question is quite handy: use the -vvvvv flag to pin down where the failure happens.
$ mongoimport --db weibo --collection data --file test.json -vvvvv
2018-05-09T16:30:09.538+0800 using 4 decoding workers
2018-05-09T16:30:09.539+0800 using 1 insert workers
2018-05-09T16:30:09.539+0800 will listen for SIGTERM, SIGINT, and SIGKILL
2018-05-09T16:30:09.542+0800 filesize: 823127226 bytes
2018-05-09T16:30:09.542+0800 using fields:
2018-05-09T16:30:09.552+0800 connected to: localhost
2018-05-09T16:30:09.552+0800 ns: weibo.data
2018-05-09T16:30:09.552+0800 connected to node type: standalone
2018-05-09T16:30:09.553+0800 standalone server: setting write concern w to 1
2018-05-09T16:30:09.553+0800 using write concern: w='1', j=false, fsync=false, wtimeout=0
2018-05-09T16:30:09.553+0800 standalone server: setting write concern w to 1
2018-05-09T16:30:09.553+0800 using write concern: w='1', j=false, fsync=false, wtimeout=0
2018-05-09T16:30:09.555+0800 Failed: error processing document #2: invalid character ',' looking for beginning of value
2018-05-09T16:30:09.555+0800 imported 0 documents
Hmm, still the same error, so my problem must be different from JP's as well. And in my case, the very first JSON document was already failing!
interlude
Since most of the file was useless to me, I tried to pick out just the few lines I needed, but the results were laughable, so I had to find a proper fix instead.
For the record, extracting selected lines with cat + grep<sup id="fnref-3">3</sup>:
[root@localhost test]# cat test.txt
hnlinux
peida.cnblogs.com
ubuntu
ubuntu linux
redhat
Redhat
linuxmint
[root@localhost test]# cat test2.txt
linux
Redhat
[root@localhost test]# cat test.txt | grep -f test2.txt
hnlinux
ubuntu linux
Redhat
linuxmint
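The same filtering can be sketched in Python, treating each pattern from test2.txt as a plain substring (grep -f patterns are regexes, but these particular ones contain no metacharacters):

```python
# Rough Python equivalent of `cat test.txt | grep -f test2.txt`:
# keep every line that contains at least one of the patterns.
patterns = ["linux", "Redhat"]
lines = ["hnlinux", "peida.cnblogs.com", "ubuntu", "ubuntu linux",
         "redhat", "Redhat", "linuxmint"]
matches = [l for l in lines if any(p in l for p in patterns)]
print(matches)  # → ['hnlinux', 'ubuntu linux', 'Redhat', 'linuxmint']
```

Note that "redhat" is excluded, just as in the grep output, because the matching is case-sensitive.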
third
Finally, after one experiment after another, I tracked down the problem:
{
...
},
{
...
},
...
Damn it, there was an extra comma between every pair of JSON documents, so I wrote a script to strip those commas out...
import os
import re
import sys

args = sys.argv
if len(args) != 3 or args[1] == args[2]:
    raise Warning('usage: python script.py <input> <output>')

abs_path = os.path.abspath('.')
org_path = os.path.join(abs_path, args[1])
new_path = os.path.join(abs_path, args[2])

# A line that starts with "}," closes one document and carries the stray comma.
re_com = re.compile(r'^},')

# Initialize so the finally block is safe even if open() fails.
fr = fw = None
try:
    fr = open(org_path, 'r')
    fw = open(new_path, 'w')
    # Iterating the file object reads it lazily, line by line, so the
    # 800 MB file is never loaded into memory at once.
    for line in fr:
        if re_com.match(line):
            line = '}\n'
        fw.write(line)
except IOError as e:
    print(e)
finally:
    if fr:
        fr.close()
    if fw:
        fw.close()
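As a sanity check that the regex only rewrites the separator lines and leaves document bodies alone, here is the same pattern applied to a few made-up sample lines:

```python
import re

# Same pattern as in the script above.
re_com = re.compile(r'^},')

sample = ['{\n', '  "id": 1\n', '},\n', '{\n', '  "id": 2\n', '}\n']
fixed = ['}\n' if re_com.match(line) else line for line in sample]
print(''.join(fixed), end='')
```

Only the `},` line is rewritten to `}`; a plain `}` or any line inside a document is passed through untouched.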
This pythonic way of processing a large file line by line comes from https://www.cnblogs.com/wulaa... :
with open(filename, 'r') as file:
for line in file:
....
OK, time to import the new file~
$ mongoimport --db weibo --collection data --file new.json
2018-05-09T15:58:36.211+0800 connected to: localhost
2018-05-09T15:58:39.194+0800 [##......................] weibo.data  77.5MB/785MB (9.9%)
2018-05-09T15:58:42.195+0800 [####....................] weibo.data  160MB/785MB (20.4%)
2018-05-09T15:58:45.195+0800 [#######.................] weibo.data  243MB/785MB (31.0%)
2018-05-09T15:58:48.203+0800 [#########...............] weibo.data  323MB/785MB (41.1%)
2018-05-09T15:58:51.197+0800 [############............] weibo.data  402MB/785MB (51.2%)
2018-05-09T15:58:54.195+0800 [##############..........] weibo.data  478MB/785MB (60.9%)
2018-05-09T15:58:57.196+0800 [#################.......] weibo.data  560MB/785MB (71.4%)
2018-05-09T15:59:00.195+0800 [###################.....] weibo.data  642MB/785MB (81.8%)
2018-05-09T15:59:03.196+0800 [######################..] weibo.data  722MB/785MB (92.0%)
2018-05-09T15:59:05.521+0800 [########################] weibo.data  785MB/785MB (100.0%)
2018-05-09T15:59:05.522+0800 imported 95208 documents